[jira] [Commented] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist
[ https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965411#comment-15965411 ] Saisai Shao commented on SPARK-20286:

bq. Maybe the best approach is to change from cachedExecutorIdleTimeout to executorIdleTimeout on all cached executors when the last RDD has been unpersisted, and then restart the time counter (unpersist will then count as an action).

Yes, I think this is a feasible solution. I can help out if you're not familiar with that code.

> dynamicAllocation.executorIdleTimeout is ignored after unpersist
>
> Key: SPARK-20286
> URL: https://issues.apache.org/jira/browse/SPARK-20286
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.1
> Reporter: Miguel Pérez
>
> With dynamic allocation enabled, it seems that executors with cached data which are unpersisted are still being killed using the {{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of {{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration ({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor with unpersisted data won't be released until the job ends.
>
> *How to reproduce*
> - Set different values for {{dynamicAllocation.executorIdleTimeout}} and {{dynamicAllocation.cachedExecutorIdleTimeout}}
> - Load a file into an RDD and persist it
> - Execute an action on the RDD (like a count) so some executors are activated.
> - When the action has finished, unpersist the RDD
> - The application UI correctly removes the persisted data from the *Storage* tab, but if you look in the *Executors* tab, you will find that the executors remain *active* until {{dynamicAllocation.cachedExecutorIdleTimeout}} is reached.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist
[ https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965408#comment-15965408 ] Miguel Pérez commented on SPARK-20286:

Yes. I suppose this is only checked when an executor passes from active to idle status (maybe when the last action has finished), and it's not changed when unpersist is called. But this is problematic, for example, if the job is interactive. Imagine you have a Zeppelin notebook or a Spark shell. The same driver could have different purposes and the session could stay open a long time. Once you have called a single persist, all these executors will have to wait {{dynamicAllocation.cachedExecutorIdleTimeout}} (which is Infinity by default) or wait for the user to execute another action (executors will become active again and then idle, but this time without cached data). I don't know the solution. Maybe the best approach is to change from cachedExecutorIdleTimeout to executorIdleTimeout on all cached executors when the last RDD has been unpersisted, and then restart the time counter (unpersist will then count as an action).
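The timeout selection discussed in this thread can be sketched in a few lines of plain Scala. This is a simplified model, not Spark's actual ExecutorAllocationManager code: the point is that an executor's removal deadline must be re-derived when its cached blocks disappear, not only when the executor first turns idle.

```scala
// Toy model (assumption: simplified; not Spark's ExecutorAllocationManager).
// Times are in seconds.
case class ExecutorState(hasCachedBlocks: Boolean, idleSince: Long)

val executorIdleTimeout = 60L                  // dynamicAllocation.executorIdleTimeout
val cachedExecutorIdleTimeout = Long.MaxValue  // default: Infinity

// The deadline after which an idle executor may be removed.
def removalDeadline(e: ExecutorState): Long = {
  val timeout = if (e.hasCachedBlocks) cachedExecutorIdleTimeout else executorIdleTimeout
  if (timeout == Long.MaxValue) Long.MaxValue else e.idleSince + timeout
}

// An executor holding cached data is never reclaimed under the defaults...
val cached = ExecutorState(hasCachedBlocks = true, idleSince = 100L)
// ...but re-evaluating after unpersist (cached blocks gone, counter restarted)
// yields a finite deadline, which is the behaviour the reporters expect.
val unpersisted = ExecutorState(hasCachedBlocks = false, idleSince = 100L)
```

The bug report amounts to the second re-evaluation never happening: the deadline is computed once at the active-to-idle transition and not refreshed on unpersist.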
[jira] [Commented] (SPARK-20303) Rename createTempFunction to registerFunction
[ https://issues.apache.org/jira/browse/SPARK-20303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965405#comment-15965405 ] Apache Spark commented on SPARK-20303:

User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17615

> Rename createTempFunction to registerFunction
>
> Key: SPARK-20303
> URL: https://issues.apache.org/jira/browse/SPARK-20303
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Xiao Li
> Assignee: Xiao Li
>
> Session catalog API `createTempFunction` is being used by Hive built-in functions, persistent functions, and temporary functions. Thus, the name is confusing. This PR is to replace it by `registerFunction`. Also we can move construction of `FunctionBuilder` and `ExpressionInfo` into the new `registerFunction`, instead of duplicating the logic everywhere.
[jira] [Assigned] (SPARK-20303) Rename createTempFunction to registerFunction
[ https://issues.apache.org/jira/browse/SPARK-20303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20303: Assignee: Xiao Li (was: Apache Spark)
[jira] [Assigned] (SPARK-20303) Rename createTempFunction to registerFunction
[ https://issues.apache.org/jira/browse/SPARK-20303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20303: Assignee: Apache Spark (was: Xiao Li)
[jira] [Updated] (SPARK-20303) Rename createTempFunction to registerFunction
[ https://issues.apache.org/jira/browse/SPARK-20303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20303: Summary: Rename createTempFunction to registerFunction (was: Replace createTempFunction by registerFunction)
[jira] [Created] (SPARK-20303) Replace createTempFunction by registerFunction
Xiao Li created SPARK-20303: --- Summary: Replace createTempFunction by registerFunction Key: SPARK-20303 URL: https://issues.apache.org/jira/browse/SPARK-20303 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Xiao Li Assignee: Xiao Li
[jira] [Commented] (SPARK-20302) Short circuit cast when from and to types are structurally the same
[ https://issues.apache.org/jira/browse/SPARK-20302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965354#comment-15965354 ] Apache Spark commented on SPARK-20302:

User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/17614

> Short circuit cast when from and to types are structurally the same
>
> Key: SPARK-20302
> URL: https://issues.apache.org/jira/browse/SPARK-20302
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> When we perform a cast expression and the from and to types are structurally the same (having the same structure but different field names), we should be able to skip the actual cast.
[jira] [Assigned] (SPARK-20302) Short circuit cast when from and to types are structurally the same
[ https://issues.apache.org/jira/browse/SPARK-20302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20302: Assignee: Apache Spark (was: Reynold Xin)
[jira] [Assigned] (SPARK-20302) Short circuit cast when from and to types are structurally the same
[ https://issues.apache.org/jira/browse/SPARK-20302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20302: Assignee: Reynold Xin (was: Apache Spark)
[jira] [Updated] (SPARK-20302) Short circuit cast when from and to types are structurally the same
[ https://issues.apache.org/jira/browse/SPARK-20302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-20302: Summary: Short circuit cast when from and to types are structurally the same (was: Optimize cast when from and to types are structurally the same)
[jira] [Created] (SPARK-20302) Optimize cast when from and to types are structurally the same
Reynold Xin created SPARK-20302: --- Summary: Optimize cast when from and to types are structurally the same Key: SPARK-20302 URL: https://issues.apache.org/jira/browse/SPARK-20302 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin
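The "structurally the same" condition can be illustrated with a small type model in plain Scala. This is a sketch with hypothetical type names, not Spark's `org.apache.spark.sql.types` hierarchy: two struct types match when their field types line up positionally, regardless of field names.

```scala
// Minimal type model (hypothetical names, not Spark's DataType classes).
sealed trait DType
case object IntT extends DType
case object StringT extends DType
case class StructT(fields: List[(String, DType)]) extends DType

// Structural equality: field names are ignored; field types must match
// positionally, recursing into nested structs.
def sameStructure(a: DType, b: DType): Boolean = (a, b) match {
  case (StructT(fa), StructT(fb)) =>
    fa.length == fb.length &&
      fa.zip(fb).forall { case ((_, ta), (_, tb)) => sameStructure(ta, tb) }
  case _ => a == b
}

// struct<a:int,b:string> vs struct<x:int,y:string>: per the JIRA, a cast
// between these two types could be skipped entirely.
val from = StructT(List("a" -> IntT, "b" -> StringT))
val to   = StructT(List("x" -> IntT, "y" -> StringT))
```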
[jira] [Resolved] (SPARK-19993) Caching logical plans containing subquery expressions does not work.
[ https://issues.apache.org/jira/browse/SPARK-19993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19993. Resolution: Fixed. Fix Version/s: 2.2.0. Issue resolved by pull request 17330 [https://github.com/apache/spark/pull/17330]

> Caching logical plans containing subquery expressions does not work.
>
> Key: SPARK-19993
> URL: https://issues.apache.org/jira/browse/SPARK-19993
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Dilip Biswal
> Fix For: 2.2.0
>
> Here is a simple repro that depicts the problem. In this case the second invocation of the sql should have been served from the cache. However, the lookup currently fails.
>
> {code}
> scala> val ds = spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)")
> ds: org.apache.spark.sql.DataFrame = [c1: int]
> scala> ds.cache
> res13: ds.type = [c1: int]
> scala> spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)").explain(true)
> == Analyzed Logical Plan ==
> c1: int
> Project [c1#86]
> +- Filter c1#86 IN (list#78 [c1#86])
>    :  +- Project [c1#87]
>    :     +- Filter (outer(c1#86) = c1#87)
>    :        +- SubqueryAlias s2
>    :           +- Relation[c1#87] parquet
>    +- SubqueryAlias s1
>       +- Relation[c1#86] parquet
> == Optimized Logical Plan ==
> Join LeftSemi, ((c1#86 = c1#87) && (c1#86 = c1#87))
> :- Relation[c1#86] parquet
> +- Relation[c1#87] parquet
> {code}
[jira] [Assigned] (SPARK-19993) Caching logical plans containing subquery expressions does not work.
[ https://issues.apache.org/jira/browse/SPARK-19993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19993: --- Assignee: Dilip Biswal
[jira] [Resolved] (SPARK-20291) NaNvl(FloatType, NullType) should not be cast to NaNvl(DoubleType, DoubleType)
[ https://issues.apache.org/jira/browse/SPARK-20291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20291. Resolution: Fixed. Fix Version/s: 2.2.0. Issue resolved by pull request 17606 [https://github.com/apache/spark/pull/17606]

> NaNvl(FloatType, NullType) should not be cast to NaNvl(DoubleType, DoubleType)
>
> Key: SPARK-20291
> URL: https://issues.apache.org/jira/browse/SPARK-20291
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Reporter: DB Tsai
> Assignee: DB Tsai
> Fix For: 2.2.0
>
> `NaNvl(float value, null)` will be converted into `NaNvl(float value, Cast(null, DoubleType))` and finally `NaNvl(Cast(float value, DoubleType), Cast(null, DoubleType))`. This causes a mismatch in the output type when the input type is float. Adding an extra rule in TypeCoercion resolves this issue.
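The extra coercion rule described above can be sketched as a tiny type-resolution function in plain Scala. The names here are hypothetical simplifications, not Spark's actual TypeCoercion rules: when NaNvl's first argument is a float and the second is a null literal, cast the null to FloatType instead of widening both sides to DoubleType.

```scala
// Toy type model (assumption: simplified; not Spark's TypeCoercion code).
sealed trait T
case object FloatT extends T
case object DoubleT extends T
case object NullT extends T

// Resolve the argument types of NaNvl(left, right). The special case keeps
// FloatType on the left and casts the null literal to FloatType, avoiding the
// unwanted promotion to (DoubleType, DoubleType) described in the JIRA.
def nanvlTypes(left: T, right: T): (T, T) = (left, right) match {
  case (FloatT, NullT)  => (FloatT, FloatT)   // keep float; cast null to float
  case (DoubleT, NullT) => (DoubleT, DoubleT) // double input is unchanged
  case _                => (left, right)
}
```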
[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965308#comment-15965308 ] Fei Wang commented on SPARK-20184:

Tested with a smaller table of 100,000 rows. Codegen on: 2.6s. Codegen off: 1.5s.

> performance regression for complex/long sql when enable whole stage codegen
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.6.0, 2.1.0
> Reporter: Fei Wang
>
> The performance of the following SQL gets much worse in Spark 2.x than with codegen off.
>
> {code}
> SELECT
>   sum(COUNTER_57)
>   ,sum(COUNTER_71)
>   ,sum(COUNTER_3)
>   ,sum(COUNTER_70)
>   ,sum(COUNTER_66)
>   ,sum(COUNTER_75)
>   ,sum(COUNTER_69)
>   ,sum(COUNTER_55)
>   ,sum(COUNTER_63)
>   ,sum(COUNTER_68)
>   ,sum(COUNTER_56)
>   ,sum(COUNTER_37)
>   ,sum(COUNTER_51)
>   ,sum(COUNTER_42)
>   ,sum(COUNTER_43)
>   ,sum(COUNTER_1)
>   ,sum(COUNTER_76)
>   ,sum(COUNTER_54)
>   ,sum(COUNTER_44)
>   ,sum(COUNTER_46)
>   ,DIM_1
>   ,DIM_2
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> {code}
>
> The number of rows in aggtable is about 3500.
> Whole-stage codegen on (spark.sql.codegen.wholeStage = true): 40s
> Whole-stage codegen off (spark.sql.codegen.wholeStage = false): 6s
>
> After some analysis I think this is related to the huge Java method (a method of thousands of lines) generated by codegen. If I configure -XX:-DontCompileHugeMethods, performance gets much better (about 7s).
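The two knobs mentioned in this thread can be exercised together from the submit command. The submit line below is illustrative only (application jar and cluster flags are placeholders); the relevant point is that by default HotSpot refuses to JIT-compile methods over the huge-method bytecode limit, and `-XX:-DontCompileHugeMethods` lifts that restriction for the large generated method.

```shell
# Illustrative experiment for the regression above (jar/cluster flags are placeholders):
#  run 1: defaults                        -> slow (huge generated method stays interpreted)
#  run 2: spark.sql.codegen.wholeStage=false -> fast (no huge method generated)
#  run 3: as below                        -> fast-ish (JIT compiles the huge method)
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:-DontCompileHugeMethods" \
  --conf "spark.executor.extraJavaOptions=-XX:-DontCompileHugeMethods" \
  --conf "spark.sql.codegen.wholeStage=true" \
  your-app.jar
```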
[jira] [Commented] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist
[ https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965306#comment-15965306 ] Saisai Shao commented on SPARK-20286:

So the fix is: if no RDD remains persisted, the executors' idle timeout should be changed back to {{dynamicAllocation.executorIdleTimeout}}.
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965302#comment-15965302 ] Wenchen Fan commented on SPARK-19659:

It seems your model is: all shuffle blocks should be fetched to disk, but we have a buffer to keep some of the shuffle blocks in memory, and the buffer size is controlled by {{spark.reducer.maxBytesShuffleToMemory}}. Overall I think this is a good model, but instead of using {{spark.reducer.maxBytesShuffleToMemory}}, can we leverage the memory manager and allocate as much memory as possible for this buffer?

> Fetch big blocks to disk when shuffle-read
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.1.0
> Reporter: jin xing
> Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
>
> Currently the whole block is fetched into memory (off-heap by default) when shuffle-read. A block is identified by (shuffleId, mapId, reduceId), so it can be large in skew situations. If OOM happens during shuffle read, the job will be killed and users will be told to "Consider boosting spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more memory can resolve the OOM, but that approach is not well suited to production environments, especially data warehouses. Using Spark SQL as the data engine in a warehouse, users want a unified parameter (e.g. memory) with less resource wasted (allocated but not used). It's not always easy to predict skew situations; when they happen, it makes sense to fetch remote blocks to disk for shuffle-read rather than kill the job because of OOM. This approach was mentioned during the discussion in SPARK-3019, by [~sandyr] and [~mridulm80].
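The buffering policy under discussion can be sketched as a toy placement function in plain Scala. This is a simplified model of the proposal, not the actual shuffle-read change (and per Wenchen's comment, the real design would consult the memory manager rather than a fixed budget):

```scala
// Toy model (assumption: simplified; not Spark's ShuffleBlockFetcherIterator):
// fetch blocks into memory until a byte budget -- the proposed
// spark.reducer.maxBytesShuffleToMemory -- is exhausted, then send the rest to
// disk instead of risking OOM on a skewed block.
def placeBlocks(blockSizes: Seq[Long], memoryBudget: Long): Seq[(Long, String)] = {
  var used = 0L
  blockSizes.map { size =>
    if (used + size <= memoryBudget) { used += size; (size, "memory") }
    else (size, "disk")
  }
}

// A skewed 100 MB block spills to disk while the small blocks stay in memory.
val mb = 1024L * 1024L
val placement = placeBlocks(Seq(10 * mb, 20 * mb, 100 * mb), 64 * mb)
```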
[jira] [Commented] (SPARK-20301) Flakiness in StreamingAggregationSuite
[ https://issues.apache.org/jira/browse/SPARK-20301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965297#comment-15965297 ] Apache Spark commented on SPARK-20301:

User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/17613

> Flakiness in StreamingAggregationSuite
>
> Key: SPARK-20301
> URL: https://issues.apache.org/jira/browse/SPARK-20301
> Project: Spark
> Issue Type: Test
> Components: Structured Streaming
> Affects Versions: 2.1.0
> Reporter: Burak Yavuz
> Assignee: Burak Yavuz
> Labels: flaky-test
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.StreamingAggregationSuite
[jira] [Assigned] (SPARK-20301) Flakiness in StreamingAggregationSuite
[ https://issues.apache.org/jira/browse/SPARK-20301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20301: Assignee: Burak Yavuz (was: Apache Spark)
[jira] [Assigned] (SPARK-20301) Flakiness in StreamingAggregationSuite
[ https://issues.apache.org/jira/browse/SPARK-20301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20301: Assignee: Apache Spark (was: Burak Yavuz)
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965296#comment-15965296 ] Yan Facai (颜发才) commented on SPARK-20199:

Yes, as [~pralabhkumar] said, DecisionTree hardcodes featureSubsetStrategy. How about adding setFeatureSubsetStrategy for DecisionTree?

> GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 2.1.0
> Reporter: pralabhkumar
> Priority: Minor
>
> Spark's GradientBoostedTreesModel doesn't have a column sampling rate parameter. This parameter is available in H2O and XGBoost. Sample from H2O.ai: gbmParams._col_sample_rate. Please provide the parameter.
[jira] [Created] (SPARK-20301) Flakiness in StreamingAggregationSuite
Burak Yavuz created SPARK-20301: --- Summary: Flakiness in StreamingAggregationSuite Key: SPARK-20301 URL: https://issues.apache.org/jira/browse/SPARK-20301 Project: Spark Issue Type: Test Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Burak Yavuz Assignee: Burak Yavuz https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.StreamingAggregationSuite
[jira] [Updated] (SPARK-20301) Flakiness in StreamingAggregationSuite
[ https://issues.apache.org/jira/browse/SPARK-20301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-20301: Labels: flaky-test (was: )
[jira] [Comment Edited] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965273#comment-15965273 ] Hyukjin Kwon edited comment on SPARK-20297 at 4/12/17 2:24 AM:

Let me leave some pointers about related PRs - https://github.com/apache/spark/pull/8566 and https://github.com/apache/spark/pull/6617. cc [~lian cheng]]

was (Author: hyukjin.kwon): Let me leave some pointers about related PRs - https://github.com/apache/spark/pull/8566 and https://github.com/apache/spark/pull/6617. cc [~liancheng]

> Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
>
> Key: SPARK-20297
> URL: https://issues.apache.org/jira/browse/SPARK-20297
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Mostafa Mokhtar
> Labels: integration
>
> While trying to load some data using Spark 2.1 I realized that decimal(12,2) columns stored in Parquet written by Spark are not readable by Hive or Impala.
>
> Repro
> {code}
> CREATE TABLE customer_acctbal(
>   c_acctbal decimal(12,2))
> STORED AS Parquet;
> insert into customer_acctbal values (7539.95);
> {code}
>
> Error from Hive
> {code}
> Failed with exception java.io.IOException:parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet
> Time taken: 0.122 seconds
> {code}
>
> Error from Impala
> {code}
> File 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' has an incompatible Parquet schema for column 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: DECIMAL(12,2), Parquet schema: optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar)
> {code}
>
> Table info
> {code}
> hive> describe formatted customer_acctbal;
> OK
> # col_name              data_type       comment
> c_acctbal               decimal(12,2)
>
> # Detailed Table Information
> Database:               tpch_nested_3000_parquet
> Owner:                  mmokhtar
> CreateTime:             Mon Apr 10 17:47:24 PDT 2017
> LastAccessTime:         UNKNOWN
> Protect Mode:           None
> Retention:              0
> Location:               hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal
> Table Type:             MANAGED_TABLE
> Table Parameters:
>   COLUMN_STATS_ACCURATE true
>   numFiles              1
>   numRows               0
>   rawDataSize           0
>   totalSize             120
>   transient_lastDdlTime 1491871644
>
> # Storage Information
> SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
> Compressed:             No
> Num Buckets:            -1
> Bucket Columns:         []
> Sort Columns:           []
> Storage Desc Params:
>   serialization.format  1
> Time taken: 0.032 seconds, Fetched: 31 row(s)
> {code}
[jira] [Comment Edited] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965273#comment-15965273 ] Hyukjin Kwon edited comment on SPARK-20297 at 4/12/17 2:23 AM: --- Let me leave some pointers about related PRs - https://github.com/apache/spark/pull/8566 and https://github.com/apache/spark/pull/6617. cc [~liancheng] was (Author: hyukjin.kwon): Let me leave some pointers about related PRs - https://github.com/apache/spark/pull/8566 and https://github.com/apache/spark/pull/8566. cc [~liancheng]
[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965273#comment-15965273 ] Hyukjin Kwon commented on SPARK-20297: -- Let me leave some pointers about related PRs - https://github.com/apache/spark/pull/8566 and https://github.com/apache/spark/pull/8566. cc [~liancheng]
[jira] [Comment Edited] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965265#comment-15965265 ] Hyukjin Kwon edited comment on SPARK-20297 at 4/12/17 2:21 AM: --- Thank you so much for trying out [~mmokhtar]. Do you maybe think this JIRA is resolvable maybe? was (Author: hyukjin.kwon): Thank you so much for trying out [~mmokhtar]. Do you maybe think this JIRA is resolvable maybe? Up to my knowledge, this option means to follow Parquet's specification rather than the current way used by Spark. So, if other implementation follows Parquet's specification, I guess this is the correct option for compatibility.
[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965268#comment-15965268 ] Hyukjin Kwon commented on SPARK-20297: -- Oh wait, I am sorry. It does follow the newer standard - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal and I missed the documentation. Wouldn't it then be a bug in Impala or Hive?
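To make the schema mismatch above concrete, here is a small sketch (not Spark code; the helper name is made up) of how a DECIMAL value is stored as an unscaled integer under the Parquet logical-types rules, which is why the newer writer can emit decimal(12,2) as {{optional int64}}:

```python
from decimal import Decimal

def decimal_to_unscaled_int64(value: str, scale: int) -> int:
    # Per the Parquet logical-types spec, DECIMAL is stored as an unscaled
    # integer; any precision <= 18 fits in int64, so a decimal(12,2) column
    # can legally be written as "optional int64".
    unscaled = int(Decimal(value).scaleb(scale))
    assert -(2 ** 63) <= unscaled < 2 ** 63
    return unscaled

# The value from the repro: 7539.95 at scale 2 -> unscaled 753995
print(decimal_to_unscaled_int64("7539.95", 2))  # 753995
```

A reader that only understands the legacy fixed_len_byte_array layout will reject this int64 encoding even though both represent the same logical value.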
[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965265#comment-15965265 ] Hyukjin Kwon commented on SPARK-20297: -- Thank you so much for trying out [~mmokhtar]. Do you maybe think this JIRA is resolvable maybe? Up to my knowledge, this option means to follow Parquet's specification rather than the current way used by Spark. So, if other implementation follows Parquet's specification, I guess this is the correct option for compatibility.
[jira] [Commented] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965264#comment-15965264 ] Hyukjin Kwon commented on SPARK-20294: -- Also, it infers schema from the first row if sampling ratio is not given up to my knowledge. {code} >>> small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) >>> small_rdd.toDF().printSchema() >>> small_rdd = sc.parallelize([(1, 2), ('a', 'b')]) >>> small_rdd.toDF().printSchema() >>> small_rdd = sc.parallelize([(1, 1.1), ('a', 'b')]) >>> small_rdd.toDF().printSchema() {code} > _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. For example: > ```python > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > ``` > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. However, > this is not the desired result because we are able to sample at least 1% of > the RDD. > This is probably a problem with the other Spark APIs however I don't have the > knowledge to look at the source code for other languages.
[jira] [Comment Edited] (SPARK-3383) DecisionTree aggregate size could be smaller
[ https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963741#comment-15963741 ] Yan Facai (颜发才) edited comment on SPARK-3383 at 4/12/17 2:02 AM: - I think the task contains two subtasks: 1. Separate `split` from `bin`: currently, for each categorical feature, there is 1 bin per split. That is to say, for N categories, the communication cost is 2^(N-1) - 1 bins. However, if we only collect stats per category and construct the splits afterwards (namely, 1 bin per category), the communication cost is N bins. 2. As said in the Description, store all but the last bin, and also store the total statistics for each node. The communication cost will then be N-1 bins. I have a question: why are unordered features only allowed in multiclass classification? was (Author: facai): I think the task contains two subtask: 1. separate `split` with `bin`: Now for each categorical feature, there is 1 bin per split. That's said, for N categories, the communicate cost is 2^{N-1} - 1 bins. However, if we only get stats for each category, and construct splits finally. Namely, 1 bin per category. The communicate cost is N bins. 2. As said in Description, store all but the last bin, and also store the total statistics for each node. The communicate cost will be N-1 bins. I have a question: 1. why unordered features only are allowed in multiclass classification? > DecisionTree aggregate size could be smaller > > > Key: SPARK-3383 > URL: https://issues.apache.org/jira/browse/SPARK-3383 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Storage and communication optimization: > DecisionTree aggregate statistics could store less data (described below). > The savings would be significant for datasets with many low-arity categorical > features (binary features, or unordered categorical features). Savings would > be negligible for continuous features. 
> DecisionTree stores a vector of sufficient statistics for each (node, feature, > bin). We could store 1 fewer bin per (node, feature): For a given (node, > feature), if we store these vectors for all but the last bin, and also store > the total statistics for each node, then we could compute the statistics for > the last bin. For binary and unordered categorical features, this would cut > in half the number of bins to store and communicate.
[jira] [Commented] (SPARK-3383) DecisionTree aggregate size could be smaller
[ https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965263#comment-15965263 ] Yan Facai (颜发才) commented on SPARK-3383: How about this idea? 1. We use `bin` to represent a value, which is the quantized value for a continuous feature and the category for a discrete feature. In DTStatsAggregator, only collect stats of bins. At this stage, all operators are the same, regardless of whether the feature is continuous or discrete. 2. In `binsToBestSplit`: + A continuous / ordered discrete feature has N bins, from which we construct N - 1 splits. + An unordered discrete feature has N bins, from which we construct all possible combinations, namely 2^(N-1) - 1 splits. 3. In `binsToBestSplit`, collect all splits and calculate their impurity; order the splits and find the best one.
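The bin and split counts discussed in this thread can be sanity-checked with a small sketch (the helper names are illustrative, not Spark's API):

```python
def num_unordered_splits(n_categories: int) -> int:
    # An unordered categorical feature with N categories admits
    # 2^(N-1) - 1 distinct non-trivial binary partitions, which is what
    # the aggregator communicates today (1 bin per split).
    return 2 ** (n_categories - 1) - 1

def num_per_category_bins(n_categories: int) -> int:
    # The proposed scheme: collect stats per category (1 bin each) and
    # construct candidate splits from those stats afterwards.
    return n_categories

# e.g. N = 4 categories: 7 split bins to communicate today vs 4 per-category bins
print(num_unordered_splits(4), num_per_category_bins(4))
```

The gap grows exponentially with N, which is why the change matters most for high-arity unordered features.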
[jira] [Comment Edited] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965259#comment-15965259 ] Hyukjin Kwon edited comment on SPARK-20294 at 4/12/17 2:00 AM: --- I am resolving this because this does not look like a problem. I guess we are unable to infer the schema when the data is empty. I think the workaround can be ... {code} >>> small_rdd = sc.parallelize([(1, '2'), (2, 'foo')]) >>> small_rdd.toDF(sampleRatio=0.1).show() +---+---+ | _1| _2| +---+---+ | 1| 2| | 2|foo| +---+---+ {code} if it needs 1%. Please reopen this if anyone thinks differently or I misunderstood. was (Author: hyukjin.kwon): I am resolving this because this does not look a problem. I guess we are unable to infer the schema when the data is empty. I think the workaround can be ... {code} >>> small_rdd = sc.parallelize([(1, '2'), (2, 'foo')]) >>> schema = small_rdd.toDF().schema >>> small_rdd.toDF(schema).show() +---+---+ | _1| _2| +---+---+ | 1| 2| | 2|foo| +---+---+ {code} or maybe {code} >>> small_rdd = sc.parallelize([(1, '2'), (2, 'foo')]) >>> schema = small_rdd.toDF(sampleRatio=0.5).schema >>> small_rdd.toDF(schema).show() +---+---+ | _1| _2| +---+---+ | 1| 2| | 2|foo| +---+---+ {code} Please reopen this if anyone thinks differently or I misunderstood.
[jira] [Resolved] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20294. -- Resolution: Invalid I am resolving this because this does not look a problem. I guess we are unable to infer the schema when the data is empty. I think the workaround can be ... {code} >>> small_rdd = sc.parallelize([(1, '2'), (2, 'foo')]) >>> schema = small_rdd.toDF().schema >>> small_rdd.toDF(schema).show() +---+---+ | _1| _2| +---+---+ | 1| 2| | 2|foo| +---+---+ {code} or maybe {code} >>> small_rdd = sc.parallelize([(1, '2'), (2, 'foo')]) >>> schema = small_rdd.toDF(sampleRatio=0.5).schema >>> small_rdd.toDF(schema).show() +---+---+ | _1| _2| +---+---+ | 1| 2| | 2|foo| +---+---+ {code} Please reopen this if anyone thinks differently or I misunderstood.
[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965255#comment-15965255 ] Mostafa Mokhtar commented on SPARK-20297: - [~hyukjin.kwon] Data written by Spark is readable by Hive and Impala when spark.sql.parquet.writeLegacyFormat is enabled.
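For reference, a minimal sketch of that workaround. {{spark.sql.parquet.writeLegacyFormat}} is the actual Spark SQL conf key; the {{spark}} session variable and the table name are assumed from the repro above, so treat this as a config fragment rather than a tested recipe:

```python
# Assumes an existing SparkSession bound to `spark` and the repro table.
# With writeLegacyFormat enabled, Spark writes decimals in the legacy
# fixed_len_byte_array layout that older Hive/Impala readers expect,
# instead of the spec-compliant int32/int64 encoding.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
spark.sql("INSERT INTO customer_acctbal VALUES (7539.95)")
```

The setting only affects files written after it is set; existing Parquet files keep whatever encoding they were written with.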
[jira] [Commented] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965257#comment-15965257 ] Hyukjin Kwon commented on SPARK-20294: -- Just for other guys, {code} >>> small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) >>> small_rdd.toDF(sampleRatio=0.01).show() Traceback (most recent call last): File "", line 1, in File ".../spark/python/pyspark/sql/session.py", line 57, in toDF return sparkSession.createDataFrame(self, schema, sampleRatio) File ".../spark/python/pyspark/sql/session.py", line 524, in createDataFrame rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio) File ".../spark/python/pyspark/sql/session.py", line 364, in _createFromRDD struct = self._inferSchema(rdd, samplingRatio) File ".../spark/python/pyspark/sql/session.py", line 356, in _inferSchema schema = rdd.map(_infer_schema).reduce(_merge_type) File ".../spark/python/pyspark/rdd.py", line 838, in reduce raise ValueError("Can not reduce() empty RDD") ValueError: Can not reduce() empty RDD {code} > _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. For example: > ```python > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > ``` > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. 
However, > this is not the desired result, because we are able to sample at least 1% of > the RDD. > This is probably a problem in the other Spark APIs as well; however, I don't have the > knowledge to look at the source code for the other languages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
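The failure above boils down to a Bernoulli sample that comes back empty, after which {{_inferSchema}} has nothing to reduce. The direction the report implies is to fall back to the full data when that happens. A minimal plain-Python sketch of that fallback (the helper name `sample_with_fallback` is illustrative, not PySpark code):

```python
import random

def sample_with_fallback(rows, sampling_ratio, seed=42):
    """Bernoulli-sample `rows`; if the sample is empty (likely for tiny
    inputs and small ratios), fall back to the full input so downstream
    schema inference always sees at least one row."""
    rng = random.Random(seed)
    sampled = [r for r in rows if rng.random() < sampling_ratio]
    # As of this ticket, PySpark instead raises
    # "Can not reduce() empty RDD" at this point.
    return sampled if sampled else rows

rows = [(1, 2), (2, "foo")]  # the small_rdd from the report
assert sample_with_fallback(rows, 0.01) == rows  # 1% sample of 2 rows is empty, so fall back
```

With a two-element input and a 1% ratio, the sample is almost always empty, so the fallback path is what keeps inference alive.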
[jira] [Updated] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-20297: - Priority: Major (was: Critical) > Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala > --- > > Key: SPARK-20297 > URL: https://issues.apache.org/jira/browse/SPARK-20297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mostafa Mokhtar > Labels: integration > > While trying to load some data using Spark 2.1 I realized that decimal(12,2) > columns stored in Parquet written by Spark are not readable by Hive or Impala. > Repro > {code} > CREATE TABLE customer_acctbal( > c_acctbal decimal(12,2)) > STORED AS Parquet; > insert into customer_acctbal values (7539.95); > {code} > Error from Hive > {code} > Failed with exception > java.io.IOException:parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file > hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet > Time taken: 0.122 seconds > {code} > Error from Impala > {code} > File > 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' > has an incompatible Parquet schema for column > 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. 
Column type: > DECIMAL(12,2), Parquet schema: > optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) > {code} > Table info > {code} > hive> describe formatted customer_acctbal; > OK > # col_name data_type comment > c_acctbal decimal(12,2) > # Detailed Table Information > Database: tpch_nested_3000_parquet > Owner: mmokhtar > CreateTime: Mon Apr 10 17:47:24 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 0 > rawDataSize 0 > totalSize 120 > transient_lastDdlTime 1491871644 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:-1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.032 seconds, Fetched: 31 row(s) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-20297: - Component/s: (was: Spark Core) SQL > Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala > --- > > Key: SPARK-20297 > URL: https://issues.apache.org/jira/browse/SPARK-20297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mostafa Mokhtar >Priority: Critical > Labels: integration > > While trying to load some data using Spark 2.1 I realized that decimal(12,2) > columns stored in Parquet written by Spark are not readable by Hive or Impala. > Repro > {code} > CREATE TABLE customer_acctbal( > c_acctbal decimal(12,2)) > STORED AS Parquet; > insert into customer_acctbal values (7539.95); > {code} > Error from Hive > {code} > Failed with exception > java.io.IOException:parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file > hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet > Time taken: 0.122 seconds > {code} > Error from Impala > {code} > File > 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' > has an incompatible Parquet schema for column > 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. 
Column type: > DECIMAL(12,2), Parquet schema: > optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) > {code} > Table info > {code} > hive> describe formatted customer_acctbal; > OK > # col_name data_type comment > c_acctbal decimal(12,2) > # Detailed Table Information > Database: tpch_nested_3000_parquet > Owner: mmokhtar > CreateTime: Mon Apr 10 17:47:24 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 0 > rawDataSize 0 > totalSize 120 > transient_lastDdlTime 1491871644 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:-1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.032 seconds, Fetched: 31 row(s) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965245#comment-15965245 ] Hyukjin Kwon commented on SPARK-20297: -- For me, this sounds related to {{spark.sql.parquet.writeLegacyFormat}}. I haven't tested and double-checked it myself, but I assume Hive expects the decimal as fixed-length bytes, while Spark actually writes it out as INT32 for 1 <= precision <= 9 and INT64 for 10 <= precision <= 18. Would you mind trying it out with {{spark.sql.parquet.writeLegacyFormat}} enabled? > Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala > --- > > Key: SPARK-20297 > URL: https://issues.apache.org/jira/browse/SPARK-20297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mostafa Mokhtar >Priority: Critical > Labels: integration > > While trying to load some data using Spark 2.1 I realized that decimal(12,2) > columns stored in Parquet written by Spark are not readable by Hive or Impala. > Repro > {code} > CREATE TABLE customer_acctbal( > c_acctbal decimal(12,2)) > STORED AS Parquet; > insert into customer_acctbal values (7539.95); > {code} > Error from Hive > {code} > Failed with exception > java.io.IOException:parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file > hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet > Time taken: 0.122 seconds > {code} > Error from Impala > {code} > File > 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' > has an incompatible Parquet schema for column > 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. 
Column type: > DECIMAL(12,2), Parquet schema: > optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) > {code} > Table info > {code} > hive> describe formatted customer_acctbal; > OK > # col_name data_type comment > c_acctbal decimal(12,2) > # Detailed Table Information > Database: tpch_nested_3000_parquet > Owner: mmokhtar > CreateTime: Mon Apr 10 17:47:24 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 0 > rawDataSize 0 > totalSize 120 > transient_lastDdlTime 1491871644 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:-1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.032 seconds, Fetched: 31 row(s) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
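The hypothesis in the comment above can be sketched as a tiny model of the precision-to-physical-type mapping. The cutoffs (INT32 up to precision 9, INT64 up to 18) come from the comment itself; that legacy mode uses fixed-length byte arrays for every precision is an assumption here, so treat this as a sketch of the suspected behavior, not Spark's actual writer code:

```python
def parquet_physical_type(precision, write_legacy_format=False):
    """Model of how a DECIMAL(precision, scale) column may be laid out
    in Parquet: standard mode packs small decimals into plain ints,
    while legacy mode uses the fixed-length bytes older readers expect."""
    if write_legacy_format:
        return "FIXED_LEN_BYTE_ARRAY"   # assumed legacy layout
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"                  # DECIMAL(12,2) lands here
    return "FIXED_LEN_BYTE_ARRAY"

# DECIMAL(12,2) from the report: int64 without the flag, matching the
# "optional int64 c_acctbal" Impala complains about.
assert parquet_physical_type(12) == "INT64"
assert parquet_physical_type(12, write_legacy_format=True) == "FIXED_LEN_BYTE_ARRAY"
```

If the hypothesis holds, setting {{spark.sql.parquet.writeLegacyFormat}} to {{true}} before the insert should produce files that the Hive and Impala versions in the report can read.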
[jira] [Comment Edited] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965244#comment-15965244 ] Yan Facai (颜发才) edited comment on SPARK-10788 at 4/12/17 1:35 AM: -- [~josephkb] As categories A, B and C are independent, why not collect statistics only per category? I mean 1 bin per category, instead of 1 bin per split. Splits are calculated in the last step, in `binsToBestSplit`, so the communication cost is N bins. was (Author: facai): [~josephkb] As categories A, B and C are independent, why not collect statistics only per category? Splits are calculated in the last step in `binsToBestSplit`. So communication cost is N bins. > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.0.0 > > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965244#comment-15965244 ] Yan Facai (颜发才) commented on SPARK-10788: - [~josephkb] As categories A, B and C are independent, why not collect statistics only per category? Splits are calculated in the last step, in `binsToBestSplit`, so the communication cost is N bins. > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.0.0 > > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
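The pseudomath in the issue body, {{stats(B,C) = stats(A,B,C) - stats(A)}}, can be sketched with toy per-label counts. The dict structures here are illustrative, not the aggregation buffers RandomForest.scala actually uses:

```python
def right_side_stats(node_stats, left_stats):
    """Recover the right-hand side's per-label counts from the full
    node's counts minus the left-hand side's, so only left-side bins
    (plus one node total) ever need to be communicated."""
    return {label: node_stats[label] - left_stats.get(label, 0)
            for label in node_stats}

# Toy per-category label counts for categories A, B, C.
stats = {"A": {0: 4, 1: 1}, "B": {0: 2, 1: 3}, "C": {0: 1, 1: 5}}
node = {0: 7, 1: 9}  # stats(A, B, C): the whole node

# Split "A vs. B,C": the right side is derived instead of collected.
assert right_side_stats(node, stats["A"]) == {0: 3, 1: 8}
```

With this subtraction, only the 3 left-hand subsets need statistics, halving the 6 subsets the current implementation communicates.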
[jira] [Updated] (SPARK-20295) when spark.sql.adaptive.enabled is enabled, it conflicts with Exchange Reuse
[ https://issues.apache.org/jira/browse/SPARK-20295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruhui Wang updated SPARK-20295: --- Description: When running tpcds-q95 with spark.sql.adaptive.enabled = true, the physical plan is initially: Sort : +- Exchange(coordinator id: 1) : +- Project*** ::-Sort ** :: +- Exchange(coordinator id: 2) :: :- Project *** :+- Sort :: +- Exchange(coordinator id: 3) When spark.sql.exchange.reuse is enabled, the physical plan then becomes: Sort : +- Exchange(coordinator id: 1) : +- Project*** ::-Sort ** :: +- Exchange(coordinator id: 2) :: :- Project *** :+- Sort :: +- ReusedExchange Exchange(coordinator id: 2) With spark.sql.adaptive.enabled = true, the call stack is ShuffleExchange#doExecute --> postShuffleRDD --> doEstimationIfNecessary. In that function, assert(exchanges.length == numExchanges) fails, as the left side has only one element while the right side equals 2. Is this a bug in the interaction between spark.sql.adaptive.enabled and exchange reuse? was: When spark.sql.exchange.reuse is enabled and a query with a self join (such as tpcds-q95) is run, the physical plan will occasionally become: WholeStageCodegen : +- Project [id#0L] : +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None ::- Project [id#0L] :: +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None :: :- Range 0, 1, 4, 1024, [id#0L] :: +- INPUT :+- INPUT :- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L)) : +- WholeStageCodegen : : +- Range 0, 1, 4, 1024, [id#1L] +- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L)) With spark.sql.adaptive.enabled = true, the call stack is ShuffleExchange#doExecute --> postShuffleRDD --> doEstimationIfNecessary. In that function, assert(exchanges.length == numExchanges) fails, as the left side has only one element while the right side equals 2. Is this a bug in the interaction between spark.sql.adaptive.enabled and exchange reuse? 
> when spark.sql.adaptive.enabled is enabled, it conflicts with Exchange Reuse > -- > > Key: SPARK-20295 > URL: https://issues.apache.org/jira/browse/SPARK-20295 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL >Affects Versions: 2.1.0 >Reporter: Ruhui Wang > > When running tpcds-q95 with spark.sql.adaptive.enabled = true, the physical > plan is initially: > Sort > : +- Exchange(coordinator id: 1) > : +- Project*** > ::-Sort ** > :: +- Exchange(coordinator id: 2) > :: :- Project *** > :+- Sort > :: +- Exchange(coordinator id: 3) > When spark.sql.exchange.reuse is enabled, the physical plan then becomes: > Sort > : +- Exchange(coordinator id: 1) > : +- Project*** > ::-Sort ** > :: +- Exchange(coordinator id: 2) > :: :- Project *** > :+- Sort > :: +- ReusedExchange Exchange(coordinator id: 2) > With spark.sql.adaptive.enabled = true, the call stack is > ShuffleExchange#doExecute --> postShuffleRDD --> > doEstimationIfNecessary. In that function, > assert(exchanges.length == numExchanges) fails, as the left side has only > one element while the right side equals 2. > Is this a bug in the interaction between spark.sql.adaptive.enabled and exchange reuse? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965220#comment-15965220 ] Reynold Xin commented on SPARK-20202: - There are no currently targeted versions, are there? > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965218#comment-15965218 ] holdenk commented on SPARK-20202: - Would it possibly make sense to untarget this from the maintenance releases (1.6.X, 2.0.X, 2.1.X) and instead focus on the future versions? > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7128) Add generic bagging algorithm to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965121#comment-15965121 ] yuhao yang commented on SPARK-7128: --- I would vote for adding this now. This is quite helpful in practical applications like fraud detection, and feynmanliang has started with a solid prototype. I can help finish it if this is on the roadmap. > Add generic bagging algorithm to spark.ml > - > > Key: SPARK-7128 > URL: https://issues.apache.org/jira/browse/SPARK-7128 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > The Pipelines API will make it easier to create a generic Bagging algorithm > which can work with any Classifier or Regressor. Creating this feature will > require researching the possible variants and extensions of bagging which we > may want to support now and/or in the future, and planning an API which will > be properly extensible. > Note: This may interact some with the existing tree ensemble methods, but it > should be largely separate since the tree ensemble APIs and implementations > are specialized for trees. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20300) Python API for ALSModel.recommendForAllUsers,Items
Joseph K. Bradley created SPARK-20300: - Summary: Python API for ALSModel.recommendForAllUsers,Items Key: SPARK-20300 URL: https://issues.apache.org/jira/browse/SPARK-20300 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 2.2.0 Reporter: Joseph K. Bradley Python API for ALSModel methods recommendForAllUsers, recommendForAllItems -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
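On the Scala side, recommendForAllUsers and recommendForAllItems rank candidates for every user (or item) by the dot product of the learned user and item factors. A hedged pure-Python sketch of that ranking, to show what the requested Python API would expose (names and structures are illustrative, not the eventual PySpark signature):

```python
def recommend_for_all_users(user_factors, item_factors, k):
    """For each user, score every item by the factor dot product and
    keep the top-k (item_id, score) pairs."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    recs = {}
    for user, uf in user_factors.items():
        scored = [(item, dot(uf, vf)) for item, vf in item_factors.items()]
        scored.sort(key=lambda t: t[1], reverse=True)  # highest score first
        recs[user] = scored[:k]
    return recs

# Two toy users and two toy items with 2-dimensional factors.
users = {0: [1.0, 0.0], 1: [0.0, 1.0]}
items = {10: [0.9, 0.1], 20: [0.2, 0.8]}
assert recommend_for_all_users(users, items, 1) == {0: [(10, 0.9)], 1: [(20, 0.8)]}
```

The real implementation does this with a blocked cross join over factor matrices rather than a per-user loop, but the ranking semantics are the same.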
[jira] [Commented] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir
[ https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964969#comment-15964969 ] Xin Wu commented on SPARK-20256: Yes. I am working on it. My proposal is to revert the SPARK-18050 change, then add a try-catch over externalCatalog.createDatabase(...) and log the error of existing default database from Hive into DEBUG log. I am trying to create a unit-test case to simulate the permission issue, which I have some difficulty. > Fail to start SparkContext/SparkSession with Hive support enabled when user > does not have read/write privilege to Hive metastore warehouse dir > -- > > Key: SPARK-20256 > URL: https://issues.apache.org/jira/browse/SPARK-20256 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Xin Wu >Priority: Critical > > In a cluster setup with production Hive running, when the user wants to run > spark-shell using the production Hive metastore, hive-site.xml is copied to > SPARK_HOME/conf. So when spark-shell is being started, it tries to check > database existence of "default" database from Hive metastore. Yet, since this > user may not have READ/WRITE access to the configured Hive warehouse > directory done by Hive itself, such permission error will prevent spark-shell > or any spark application with Hive support enabled from starting at all. > Example error: > {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). 
> java.lang.IllegalArgumentException: Error while instantiating > 'org.apache.spark.sql.hive.HiveSessionState': > at > org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981) > at > org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110) > at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > ... 
47 elided > Caused by: java.lang.reflect.InvocationTargetException: > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:java.security.AccessControlException: Permission > denied: user=notebook, access=READ, > inode="/apps/hive/warehouse":hive:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at >
[jira] [Created] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset
Jacek Laskowski created SPARK-20299: --- Summary: NullPointerException when null and string are in a tuple while encoding Dataset Key: SPARK-20299 URL: https://issues.apache.org/jira/browse/SPARK-20299 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Minor When creating a Dataset from a tuple with {{null}} and a string, NPE is reported. When either is removed, it works fine. {code} scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int] scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS java.lang.RuntimeException: Error while encoding: java.lang.NullPointerException staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474 assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product input object), - root class: "scala.Tuple2")._2 AS _2#475 at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290) at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454) at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377) at org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246) ... 
48 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) ... 58 more {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir
[ https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964945#comment-15964945 ] Dongjoon Hyun commented on SPARK-20256: --- Hi, [~xwu0226]. Is there any progress? > Fail to start SparkContext/SparkSession with Hive support enabled when user > does not have read/write privilege to Hive metastore warehouse dir > -- > > Key: SPARK-20256 > URL: https://issues.apache.org/jira/browse/SPARK-20256 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Xin Wu >Priority: Critical > > In a cluster setup with production Hive running, when the user wants to run > spark-shell using the production Hive metastore, hive-site.xml is copied to > SPARK_HOME/conf. So when spark-shell is being started, it tries to check > database existence of "default" database from Hive metastore. Yet, since this > user may not have READ/WRITE access to the configured Hive warehouse > directory done by Hive itself, such permission error will prevent spark-shell > or any spark application with Hive support enabled from starting at all. > Example error: > {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). 
> java.lang.IllegalArgumentException: Error while instantiating > 'org.apache.spark.sql.hive.HiveSessionState': > at > org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981) > at > org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110) > at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > ... 
47 elided > Caused by: java.lang.reflect.InvocationTargetException: > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:java.security.AccessControlException: Permission > denied: user=notebook, access=READ, > inode="/apps/hive/warehouse":hive:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697) > at 
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045) > ); > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at >
[jira] [Assigned] (SPARK-20298) Spelling mistake: charactor
[ https://issues.apache.org/jira/browse/SPARK-20298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20298: Assignee: Apache Spark > Spelling mistake: charactor > --- > > Key: SPARK-20298 > URL: https://issues.apache.org/jira/browse/SPARK-20298 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Brendan Dwyer >Assignee: Apache Spark >Priority: Trivial > > "charactor" should be "character" > {code} > R/pkg/R/DataFrame.R:2821: stop("path should be charactor, NULL > or omitted.") > R/pkg/R/DataFrame.R:2828: stop("mode should be charactor or > omitted. It is 'error' by default.") > R/pkg/R/DataFrame.R:3043: stop("value should be an integer, > numeric, charactor or named list.") > R/pkg/R/DataFrame.R:3055: stop("Each item in value should be > an integer, numeric or charactor.") > R/pkg/R/DataFrame.R:3601: stop("outputMode should be charactor > or omitted.") > R/pkg/R/SQLContext.R:609:stop("path should be charactor, NULL or > omitted.") > R/pkg/inst/tests/testthat/test_sparkSQL.R:2929: "path should be > charactor, NULL or omitted.") > R/pkg/inst/tests/testthat/test_sparkSQL.R:2931: "mode should be > charactor or omitted. It is 'error' by default.") > R/pkg/inst/tests/testthat/test_sparkSQL.R:2950: "path should be > charactor, NULL or omitted.") > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20298) Spelling mistake: charactor
[ https://issues.apache.org/jira/browse/SPARK-20298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20298: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20298) Spelling mistake: charactor
[ https://issues.apache.org/jira/browse/SPARK-20298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964921#comment-15964921 ] Apache Spark commented on SPARK-20298: -- User 'bdwyer2' has created a pull request for this issue: https://github.com/apache/spark/pull/17611
[jira] [Commented] (SPARK-20298) Spelling mistake: charactor
[ https://issues.apache.org/jira/browse/SPARK-20298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964889#comment-15964889 ] Sean Owen commented on SPARK-20298: --- Go ahead, though this is not generally worth a JIRA
[jira] [Assigned] (SPARK-19505) AttributeError on Exception.message in Python3; hides true exceptions in cloudpickle.py and broadcast.py
[ https://issues.apache.org/jira/browse/SPARK-19505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-19505: --- Assignee: David Gingrich > AttributeError on Exception.message in Python3; hides true exceptions in > cloudpickle.py and broadcast.py > > > Key: SPARK-19505 > URL: https://issues.apache.org/jira/browse/SPARK-19505 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 > Environment: macOS Sierra 10.12.3 > Spark 2.1.0, installed via Homebrew >Reporter: David Gingrich >Assignee: David Gingrich >Priority: Minor > Fix For: 2.2.0 > > > cloudpickle.py and broadcast.py both catch Exceptions then look at the > 'message' field. The message field doesn't exist in Python 3 so it rethrows > and the original exception is lost. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19505) AttributeError on Exception.message in Python3; hides true exceptions in cloudpickle.py and broadcast.py
[ https://issues.apache.org/jira/browse/SPARK-19505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-19505. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16845 [https://github.com/apache/spark/pull/16845]
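For context, the incompatibility described in SPARK-19505 can be reproduced in a few lines of plain Python (an illustrative sketch, not code from the Spark tree). In Python 3, exception objects no longer carry a `message` attribute, so a handler that reads `e.message` raises `AttributeError` itself and the original exception is lost; `str(e)` works on both major versions:

```python
# Illustrative sketch (not Spark code): why reading e.message hides the
# original exception under Python 3.

def fragile_handler():
    # Mimics the old pattern described for cloudpickle.py / broadcast.py.
    try:
        raise ValueError("the real problem")
    except Exception as e:
        # Python 2 exceptions had a .message attribute; Python 3 ones do not,
        # so this access raises AttributeError and the ValueError is masked.
        return e.message

def robust_handler():
    try:
        raise ValueError("the real problem")
    except Exception as e:
        # str(e) works on both Python 2 and Python 3.
        return str(e)

if __name__ == "__main__":
    try:
        fragile_handler()
    except AttributeError as masked:
        print("original error masked by:", masked)
    print(robust_handler())
```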
[jira] [Commented] (SPARK-20131) Flaky Test: o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-20131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964811#comment-15964811 ] Apache Spark commented on SPARK-20131: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17610 > Flaky Test: o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data > should be deserialized properly > --- > > Key: SPARK-20131 > URL: https://issues.apache.org/jira/browse/SPARK-20131 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Takuya Ueshin >Priority: Minor > Labels: flaky-test > > This test failed recently here: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/2861/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/SPARK_18560_Receiver_data_should_be_deserialized_properly_/ > Dashboard > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.StreamingContextSuite_name=SPARK-18560+Receiver+data+should+be+deserialized+properly. 
> Error Message > {code} > latch.await(60L, SECONDS) was false > {code} > {code} > org.scalatest.exceptions.TestFailedException: latch.await(60L, SECONDS) was > false > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply$mcV$sp(StreamingContextSuite.scala:837) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:44) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) > at > org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:44) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) > at >
[jira] [Updated] (SPARK-20131) Flaky Test: o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-20131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20131: - Summary: Flaky Test: o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly (was: Flaky Test: org.apache.spark.streaming.StreamingContextSuite)
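The failing assertion above is a countdown-latch wait with a 60-second timeout. The same pattern in plain Python (illustrative only; `run_with_latch` is a hypothetical helper, not the Scala test itself) shows why slow receiver deserialization makes such a test flaky: the assertion fails whenever the signal does not arrive within the window.

```python
import threading

def run_with_latch(work, timeout):
    # Analogous to a CountDownLatch(1): the worker signals completion,
    # and the test thread waits up to `timeout` seconds for the signal.
    latch = threading.Event()
    t = threading.Thread(target=lambda: (work(), latch.set()))
    t.start()
    ok = latch.wait(timeout)  # like the asserted latch.await(60L, SECONDS)
    t.join()
    return ok
```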
[jira] [Commented] (SPARK-20296) UnsupportedOperationChecker text on distinct aggregations differs from docs
[ https://issues.apache.org/jira/browse/SPARK-20296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964756#comment-15964756 ] Apache Spark commented on SPARK-20296: -- User 'jtoka' has created a pull request for this issue: https://github.com/apache/spark/pull/17609 > UnsupportedOperationChecker text on distinct aggregations differs from docs > --- > > Key: SPARK-20296 > URL: https://issues.apache.org/jira/browse/SPARK-20296 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Jason Tokayer >Priority: Trivial > > In the unsupported operations section in the docs > https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html > it states that "Distinct operations on streaming Datasets are not > supported.". However, in > ```org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker.scala```, > the error message is ```Distinct aggregations are not supported on streaming > DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete > output mode. Consider using approximate distinct aggregation```. > It seems that the error message is incorrect. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20296) UnsupportedOperationChecker text on distinct aggregations differs from docs
[ https://issues.apache.org/jira/browse/SPARK-20296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20296: Assignee: Apache Spark
[jira] [Assigned] (SPARK-20296) UnsupportedOperationChecker text on distinct aggregations differs from docs
[ https://issues.apache.org/jira/browse/SPARK-20296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20296: Assignee: (was: Apache Spark)
[jira] [Resolved] (SPARK-20289) Use StaticInvoke rather than NewInstance for boxing primitive types
[ https://issues.apache.org/jira/browse/SPARK-20289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20289. - Resolution: Fixed Fix Version/s: 2.2.0 > Use StaticInvoke rather than NewInstance for boxing primitive types > --- > > Key: SPARK-20289 > URL: https://issues.apache.org/jira/browse/SPARK-20289 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.2.0 > > > Dataset typed API currently uses NewInstance to box primitive types (i.e. > calling the constructor). Instead, it'd be slightly more idiomatic in Java to > use PrimitiveType.valueOf, which can be invoked using StaticInvoke expression.
[jira] [Commented] (SPARK-20298) Spelling mistake: charactor
[ https://issues.apache.org/jira/browse/SPARK-20298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964741#comment-15964741 ] Brendan Dwyer commented on SPARK-20298: --- I can work on this
[jira] [Created] (SPARK-20298) Spelling mistake: charactor
Brendan Dwyer created SPARK-20298: - Summary: Spelling mistake: charactor Key: SPARK-20298 URL: https://issues.apache.org/jira/browse/SPARK-20298 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0 Reporter: Brendan Dwyer Priority: Trivial "charactor" should be "character" {code} R/pkg/R/DataFrame.R:2821: stop("path should be charactor, NULL or omitted.") R/pkg/R/DataFrame.R:2828: stop("mode should be charactor or omitted. It is 'error' by default.") R/pkg/R/DataFrame.R:3043: stop("value should be an integer, numeric, charactor or named list.") R/pkg/R/DataFrame.R:3055: stop("Each item in value should be an integer, numeric or charactor.") R/pkg/R/DataFrame.R:3601: stop("outputMode should be charactor or omitted.") R/pkg/R/SQLContext.R:609:stop("path should be charactor, NULL or omitted.") R/pkg/inst/tests/testthat/test_sparkSQL.R:2929: "path should be charactor, NULL or omitted.") R/pkg/inst/tests/testthat/test_sparkSQL.R:2931: "mode should be charactor or omitted. It is 'error' by default.") R/pkg/inst/tests/testthat/test_sparkSQL.R:2950: "path should be charactor, NULL or omitted.") {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
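The `path:line: text` listing in the description reads like the output of a recursive grep over the R sources. A minimal sketch of such a scan (a hypothetical helper, not part of the Spark build tooling) might look like:

```python
# Hypothetical helper: recursively list occurrences of a misspelling in
# .R files, in the same "path:line: text" shape as the report above.
import os

def find_typo(root, needle="charactor"):
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".R"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                for lineno, line in enumerate(f, start=1):
                    if needle in line:
                        hits.append(f"{path}:{lineno}: {line.strip()}")
    return hits
```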
[jira] [Created] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
Mostafa Mokhtar created SPARK-20297: --- Summary: Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala Key: SPARK-20297 URL: https://issues.apache.org/jira/browse/SPARK-20297 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Mostafa Mokhtar Priority: Critical While trying to load some data using Spark 2.1 I realized that decimal(12,2) columns stored in Parquet written by Spark are not readable by Hive or Impala. Repro {code} CREATE TABLE customer_acctbal( c_acctbal decimal(12,2)) STORED AS Parquet; insert into customer_acctbal values (7539.95); {code} Error from Hive {code} Failed with exception java.io.IOException:parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet Time taken: 0.122 seconds {code} Error from Impala {code} File 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' has an incompatible Parquet schema for column 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. 
Column type: DECIMAL(12,2), Parquet schema: optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) {code} Table info {code} hive> describe formatted customer_acctbal; OK # col_name data_type comment c_acctbal decimal(12,2) # Detailed Table Information Database: tpch_nested_3000_parquet Owner: mmokhtar CreateTime: Mon Apr 10 17:47:24 PDT 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal Table Type: MANAGED_TABLE Table Parameters: COLUMN_STATS_ACCURATE true numFiles1 numRows 0 rawDataSize 0 totalSize 120 transient_lastDdlTime 1491871644 # Storage Information SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat Compressed: No Num Buckets:-1 Bucket Columns: [] Sort Columns: [] Storage Desc Params: serialization.format1 Time taken: 0.032 seconds, Fetched: 31 row(s) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
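For background (an explanatory sketch, not output from the reporter's cluster): the Parquet format allows a `decimal(12,2)` to be stored either as a fixed-length byte array or as an unscaled 64-bit integer plus a scale recorded in the schema; the error above shows Spark wrote the int64 form while the Hive/Impala readers in use expected the byte-array form. The round trip between the logical value and the unscaled int64 works like this:

```python
# Illustrative only: how a decimal(12,2) value maps to the unscaled int64
# representation (per the Parquet DECIMAL logical-type rules) that the
# readers in this report did not accept.
from decimal import Decimal

SCALE = 2  # decimal(12,2)

def to_unscaled(value: Decimal, scale: int = SCALE) -> int:
    # 7539.95 with scale 2 becomes the integer 753995.
    return int(value.scaleb(scale))

def from_unscaled(unscaled: int, scale: int = SCALE) -> Decimal:
    return Decimal(unscaled).scaleb(-scale)

if __name__ == "__main__":
    v = Decimal("7539.95")
    u = to_unscaled(v)
    print(u, from_unscaled(u))
```

A commonly cited mitigation (worth verifying against your Spark version's docs) is setting `spark.sql.parquet.writeLegacyFormat=true` so Spark emits the older byte-array encoding that Hive and Impala of this era could read.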
[jira] [Updated] (SPARK-20296) UnsupportedOperationChecker text on distinct aggregations differs from docs
[ https://issues.apache.org/jira/browse/SPARK-20296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20296: -- Priority: Trivial (was: Minor) Sounds fine, you can submit a PR to make the messages consistent.
[jira] [Created] (SPARK-20296) UnsupportedOperationChecker text on distinct aggregations differs from docs
Jason Tokayer created SPARK-20296: - Summary: UnsupportedOperationChecker text on distinct aggregations differs from docs Key: SPARK-20296 URL: https://issues.apache.org/jira/browse/SPARK-20296 Project: Spark Issue Type: Documentation Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Jason Tokayer Priority: Minor In the unsupported operations section in the docs https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html it states that "Distinct operations on streaming Datasets are not supported.". However, in ```org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker.scala```, the error message is ```Distinct aggregations are not supported on streaming DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete output mode. Consider using approximate distinct aggregation```. It seems that the error message is incorrect. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964596#comment-15964596 ] jin xing commented on SPARK-19659: -- *bytesShuffleToMemory* is different from *bytesInFlight*. *bytesInFlight* covers only one *ShuffleBlockFetcherIterator* and is decreased when the remote blocks are received. *bytesShuffleToMemory* covers all ShuffleBlockFetcherIterators and is decreased only when the reference count of a ByteBuf reaches zero (though the memory may still sit inside a cache and not be truly destroyed). If *spark.reducer.maxReqsInFlight* is only for memory control, I think *spark.reducer.maxBytesShuffleToMemory* is an improvement. In the current PR, I want to simplify the logic: the memory is tracked by *bytesShuffleToMemory*, and memory usage is not tracked by the MemoryManager. > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory (offheap by default) during > shuffle-read. A block is defined by (shuffleId, mapId, reduceId), so it can > be large in skew situations. If OOM happens during shuffle read, the job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more > memory can resolve the OOM. However, this approach is not perfectly suitable > for production environments, especially data warehouses. > Using Spark SQL as the data engine in a warehouse, users hope for a unified > parameter (e.g. memory) with less resource wasted (allocated but not > used). > It's not always easy to predict skew situations; when they happen, it makes sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM.
This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
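The accounting jin xing describes can be sketched as a single global counter shared by all ShuffleBlockFetcherIterators. This is a hypothetical stand-alone illustration of the proposal, not Spark's actual implementation; the class and method names are invented here:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the proposed accounting: bytesShuffleToMemory is
// increased by the estimated size when a fetch request is sent (before the
// blocks actually arrive) and decreased when the backing buffer is released.
// When the counter would exceed spark.reducer.maxBytesShuffleToMemory, the
// blocks are fetched to disk instead.
class ShuffleToMemoryTracker {
    private final AtomicLong bytesShuffleToMemory = new AtomicLong(0);
    private final long maxBytesShuffleToMemory; // spark.reducer.maxBytesShuffleToMemory

    ShuffleToMemoryTracker(long maxBytes) {
        this.maxBytesShuffleToMemory = maxBytes;
    }

    /** Called before sending a fetch request. Returns true if the blocks may
     *  go to memory, false if they should be shuffled to disk instead. */
    boolean tryReserve(long estimatedBytes) {
        long current = bytesShuffleToMemory.addAndGet(estimatedBytes);
        if (current > maxBytesShuffleToMemory) {
            bytesShuffleToMemory.addAndGet(-estimatedBytes); // undo; fetch to disk
            return false;
        }
        return true;
    }

    /** Called when the ByteBuf backing the fetched blocks is released. */
    void release(long estimatedBytes) {
        bytesShuffleToMemory.addAndGet(-estimatedBytes);
    }

    long current() {
        return bytesShuffleToMemory.get();
    }
}
```

Because the reservation happens at request time using the estimated size, the counter bounds the worst-case memory a reducer can commit to in-memory fetches across concurrent shuffle reads.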
[jira] [Comment Edited] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964554#comment-15964554 ] Dan Dutrow edited comment on SPARK-6305 at 4/11/17 4:18 PM: Any update on the current status of this ticket? It would be particularly beneficial to our program, and presumably many others, if the KafkaAppender (available in log4j 2.x) could be provided to spark workers for real-time log aggregation. https://logging.apache.org/log4j/2.x/manual/appenders.html#KafkaAppender This is not an easy thing to override in a spark application due to native log4j loading with the CoarseGrainedExecutorBackend before the application. was (Author: dutrow): Please provide an update on the current status of this ticket? It would be particularly beneficial to our program, and presumably many others, if the KafkaAppender (available in log4j 2.x) could be provided to spark workers for real-time log aggregation. https://logging.apache.org/log4j/2.x/manual/appenders.html#KafkaAppender This is not an easy thing to override in a spark application due to native log4j loading with the CoarseGrainedExecutorBackend before the application. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964587#comment-15964587 ] jin xing edited comment on SPARK-19659 at 4/11/17 4:13 PM: --- [~cloud_fan] Thanks a lot for taking look into this and sorry for late reply. My proposal is as below: 1. Track average size and also the outliers(which are larger than 2*avgSize) in MapStatus; 2. Add *bytesShuffleToMemory*, which tracks the size of remote blocks shuffled to memory; 3. Add *spark.reducer.maxBytesShuffleToMemory*, when bytesShuffleToMemory is above this configuration, blocks will shuffle to disk; *bytesShuffleToMemory* is increased when send fetch request(note that at this point, remote blocks maybe not fetched into memory yet, but we add the max memory to be used in the fetch request) and get decreased when the ByteBuf is released. *spark.reducer.maxBytesShuffleToMemory* is the max memory to be used for fetching remote blocks across all the ShuffleBlockFetcherIterators(there maybe multiple shuffle-read happening at the same time). When memory usage(indicated by *bytesShuffleToMemory*) is above *spark.reducer.maxBytesShuffleToMemory*, shuffle remote blocks to disk instead of memory. was (Author: jinxing6...@126.com): [~cloud_fan] Thanks a lot for taking look into this and sorry for late reply. My proposal is as below: 1. Track average size and also the outliers(which are larger than 2*avgSize) in MapStatus; 2. Add *bytesShuffleToMemory*, which tracks the size of remote blocks shuffled to memory; 3. Add *spark.reducer.maxBytesShuffleToMemory*, when bytesShuffleToMemory is above this configuration, blocks will shuffle to disk; *bytesShuffleToMemory* is increased when send fetch request(note that at this point, remote blocks maybe not fetched into memory yet, but we add the max memory to be used in the fetch request) and get decreased when the ByteBuf is released. 
*spark.reducer.maxBytesShuffleToMemory* is the max memory to be used for fetching remote blocks across all the *ShuffleBlockFetcherIterator*s(there maybe multiple shuffle-read happening at the same time). When memory usage(indicated by *bytesShuffleToMemory*) is above *spark.reducer.maxBytesShuffleToMemory*, shuffle remote blocks to disk instead of memory. > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory(offheap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can > be large when skew situations. If OOM happens during shuffle read, job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more > memory can resolve the OOM. However the approach is not perfectly suitable > for production environment, especially for data warehouse. > Using Spark SQL as data engine in warehouse, users hope to have a unified > parameter(e.g. memory) but less resource wasted(resource is allocated but not > used), > It's not always easy to predict skew situations, when happen, it make sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964587#comment-15964587 ] jin xing commented on SPARK-19659: -- [~cloud_fan] Thanks a lot for taking look into this and sorry for late reply. My proposal is as below: 1. Track average size and also the outliers(which are larger than 2*avgSize) in MapStatus; 2. Add bytesShuffleToMemory, which tracks the size of remote blocks shuffled to memory; 3. Add spark.reducer.maxBytesShuffleToMemory, when bytesShuffleToMemory is above this configuration, blocks will shuffle to disk; bytesShuffleToMemory is increased when send fetch request(note that at this point, remote blocks maybe not fetched into memory yet, but we add the max memory to be used in the fetch request) and get decreased when the ByteBuf is released. spark.reducer.maxBytesShuffleToMemory is the max memory to be used for fetching remote blocks across all the ShuffleBlockFetcherIterators(there maybe multiple shuffle-read happening at the same time). When memory usage(indicated by bytesShuffleToMemory) is above spark.reducer.maxBytesShuffleToMemory, shuffle remote blocks to disk instead of memory. > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory(offheap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can > be large when skew situations. If OOM happens during shuffle read, job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more > memory can resolve the OOM. 
However the approach is not perfectly suitable > for production environment, especially for data warehouse. > Using Spark SQL as data engine in warehouse, users hope to have a unified > parameter(e.g. memory) but less resource wasted(resource is allocated but not > used), > It's not always easy to predict skew situations, when happen, it make sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80]
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964548#comment-15964548 ] Wenchen Fan commented on SPARK-19659: - [~jinxing6...@126.com] can you say more about fetching shuffle blocks into disk? How should users tune the {{spark.reducer.maxBytesShuffleToMemory}} config? Shall we have something similar to {{maxReqsInFlight}} to control how many request we can have to fetch blocks to disk at the same time? > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory(offheap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can > be large when skew situations. If OOM happens during shuffle read, job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more > memory can resolve the OOM. However the approach is not perfectly suitable > for production environment, especially for data warehouse. > Using Spark SQL as data engine in warehouse, users hope to have a unified > parameter(e.g. memory) but less resource wasted(resource is allocated but not > used), > It's not always easy to predict skew situations, when happen, it make sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
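The control Wenchen Fan suggests, something analogous to {{maxReqsInFlight}} for fetch-to-disk requests, could be as simple as a semaphore capping concurrency. This is an illustrative sketch under that assumption; the name {{maxReqsToDisk}} is invented here:

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: cap the number of concurrent fetch-to-disk requests
// with a semaphore, analogous to how maxReqsInFlight caps in-flight fetch
// requests. A fetch that cannot acquire a permit waits (or is deferred).
class DiskFetchLimiter {
    private final Semaphore permits;

    DiskFetchLimiter(int maxReqsToDisk) {
        this.permits = new Semaphore(maxReqsToDisk);
    }

    /** Try to start a fetch-to-disk request; false means the cap is reached. */
    boolean tryStartDiskFetch() {
        return permits.tryAcquire();
    }

    /** Called when a fetch-to-disk request completes (success or failure). */
    void finishDiskFetch() {
        permits.release();
    }
}
```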
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964508#comment-15964508 ] jin xing commented on SPARK-19659: -- [~irashid] Tracking memory used by Netty by swapping in our own PooledByteBufAllocator is really a good idea. Memory usage will be increased when allocate byte buffer in PoolByteBufAllocator and get decreased when ByteBuf's reference count is zero. Checking source code of Netty, I found that there is cache inside PoolByteBufAllocator. When memory is released, it will be returned to cache or chunklist, not destroyed necessarily. In my understanding, we can get the memory usage by tracking PooledByteBufAllocator, but the value is not the real footprint.(i.e. when memory is released, it maybe in cache other than destroyed.) > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory(offheap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can > be large when skew situations. If OOM happens during shuffle read, job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more > memory can resolve the OOM. However the approach is not perfectly suitable > for production environment, especially for data warehouse. > Using Spark SQL as data engine in warehouse, users hope to have a unified > parameter(e.g. memory) but less resource wasted(resource is allocated but not > used), > It's not always easy to predict skew situations, when happen, it make sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. 
This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80]
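The reference-counting behavior jin xing describes, where tracked usage only drops when a buffer's reference count reaches zero, can be illustrated with a minimal stand-in (this is not Netty's {{ByteBuf}} implementation, just a sketch of the semantics):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of reference-counted tracking: the shared counter is
// only decremented when a buffer's reference count drops to zero, so a
// buffer retained by multiple readers is not counted as freed early. As
// noted above, even then the pooled allocator may keep the memory cached
// rather than returning it to the OS.
class RefCountedBuffer {
    static final AtomicLong trackedBytes = new AtomicLong(0);

    private final AtomicInteger refCnt = new AtomicInteger(1);
    private final long size;

    RefCountedBuffer(long size) {
        this.size = size;
        trackedBytes.addAndGet(size);
    }

    void retain() {
        refCnt.incrementAndGet();
    }

    void release() {
        if (refCnt.decrementAndGet() == 0) {
            trackedBytes.addAndGet(-size); // only now does tracked usage drop
        }
    }
}
```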
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964476#comment-15964476 ] Imran Rashid commented on SPARK-19659: -- currently the smallest unit is an entire block. That certainly could be changed, but that is another larger change to the way spark uses netty. bq. If the unit is block, I think it's really hard to avoid OOM entirely, as if the estimated block size is wrong, fetching this block may cause OOM and we can do nothing about it. I don't totally agree with this. Yes, the estimated block size could be wrong. The main problem now, though, is that the error is totally unbounded. I really don't think it should be hard to change spark to provide reasonable bounds on that error. Even if that first step doesn't entirely eliminate the chance of OOM, it would certainly help. And that can continue to be improved upon. > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory(offheap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can > be large when skew situations. If OOM happens during shuffle read, job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more > memory can resolve the OOM. However the approach is not perfectly suitable > for production environment, especially for data warehouse. > Using Spark SQL as data engine in warehouse, users hope to have a unified > parameter(e.g. 
memory) but less resource wasted(resource is allocated but not > used), > It's not always easy to predict skew situations, when happen, it make sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80]
[jira] [Commented] (SPARK-16784) Configurable log4j settings
[ https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964471#comment-15964471 ] Josh Bacon commented on SPARK-16784: [~tscholak] I have not created a follow up issue for this. For your situation I'd suggest trying --driver-java-options='..' instead of --conf='spark.driver.extraJavaOptions=...' because the latter is applied after driver jvm actually starts (too late for log4j). My use-case I abandoned attempting to configured log4j for executors, but was able to work with driver/application logs in both cluster and client mode (standalone) via baking my log4j.properties files into my apps Uber jar resources. I think the need of this issue is a new feature for distributing files/log4j.properties in the cluster before the actual Spark Driver starts which I'd imagine is not a pressing enough to warrant actual development at the moment. > Configurable log4j settings > --- > > Key: SPARK-16784 > URL: https://issues.apache.org/jira/browse/SPARK-16784 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0, 2.1.0 >Reporter: Michael Gummelt > > I often want to change the logging configuration on a single spark job. This > is easy in client mode. I just modify log4j.properties. It's difficult in > cluster mode, because I need to modify the log4j.properties in the > distribution in which the driver runs. I'd like a way of setting this > dynamically, such as a java system property. Some brief searching showed > that log4j doesn't seem to accept such a property, but I'd like to open up > this idea for further comment. Maybe we can find a solution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
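The workaround Josh Bacon suggests can be sketched as a spark-submit invocation (paths, class name, and jar name below are placeholders): pass the JVM option directly so log4j is configured before the driver JVM starts, rather than via {{spark.driver.extraJavaOptions}}, which is applied too late.

```shell
# Hedged example; file paths and names are placeholders.
spark-submit \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties" \
  --class com.example.MyApp \
  my-app-uber.jar
```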
[jira] [Commented] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
[ https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964463#comment-15964463 ] Sergei Lebedev commented on SPARK-20284: Yes, Scala is a bad citizen in the JVM land and comes w/o any support for try-with-resources. IIUC scala-arm would manage just fine without Closeable because it uses structural types. However, I think there is no reason not to implement Closeable/AutoCloseable even if Spark/Scala code does not need this. > Make SerializationStream and DeserializationStream extend Closeable > --- > > Key: SPARK-20284 > URL: https://issues.apache.org/jira/browse/SPARK-20284 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.1.0 >Reporter: Sergei Lebedev >Priority: Trivial > > Both {{SerializationStream}} and {{DeserializationStream}} implement > {{close}} but do not extend {{Closeable}}. As a result, these streams cannot > be used in try-with-resources. > Was this intentional or rather nobody ever needed that? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
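The benefit of extending {{Closeable}} is easiest to see from Java: try-with-resources only accepts types that implement {{AutoCloseable}} (which {{Closeable}} extends). The stream type below is a stand-in for illustration, not Spark's actual {{SerializationStream}}:

```java
import java.io.Closeable;

// Stand-in for a serialization stream; once it implements Closeable,
// try-with-resources closes it automatically, even if the body throws.
class FakeSerializationStream implements Closeable {
    boolean closed = false;

    void writeObject(Object o) {
        // serialize o (omitted)
    }

    @Override
    public void close() {
        closed = true;
    }
}

class TryWithResourcesDemo {
    static FakeSerializationStream useAndClose() {
        FakeSerializationStream s = new FakeSerializationStream();
        try (FakeSerializationStream stream = s) {
            stream.writeObject("payload");
        } // close() runs here automatically
        return s;
    }
}
```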
[jira] [Comment Edited] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException
[ https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964438#comment-15964438 ] Björn-Elmar Macek edited comment on SPARK-11788 at 4/11/17 2:39 PM: I have an issue which maybe related and occurs in 2.1.0. I created a table which contains a Timestamp column. An describe results in: user_id int null userCreatedAt timestamp null country_iso string null When i query as follows ... select date_format(userCreatedAt, ""), userCreatedAt, date_format(userCreatedAt, "") = "2017" from result where date_format(userCreatedAt, "") = "2017" order by userCreatedAt ... the third column contains only true values as it should be due to the corresponding expression being in the where clause. When i execute the same query on the same table in which i replaced userCreatedAt by a java.sql.Timestamp created from the userCreatedAt column, the result contains lots of false values. This is correct, since the year of those timestamp is not 2017. But i would expect them to not be included in the result. Also, date_format(userCreatedAt, "") returns the correct year. Can anybody reproduce this issue? was (Author: macek): I have an issue which maybe related and occurs in 2.1.0. I created a table which contains a Timestamp column. An describe results in: user_id int null userCreatedAt timestamp null country_iso string null When i query as follows ... select date_format(userCreatedAt, ""), userCreatedAt, date_format(userCreatedAt, "") = "2017" from result where date_format(userCreatedAt, "") = "2017" order by userCreatedAt ... the third column contains only true values as it should be due to the corresponding expression being in the where clause. When i execute the same query on the same table in which i replaced userCreatedAt by a java.sql.Timestamp created from the userCreatedAt column, the result contains lots of false values. This is correct, since the year of those timestamp is not 2017. 
But i would expect them to not be included in the result. Also, date_format(userCreatedAt, "") returns the correct year. Can anybody reproduce this issue and will it be fixed? > Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC > dataframes causes SQLServerException > > > Key: SPARK-11788 > URL: https://issues.apache.org/jira/browse/SPARK-11788 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Martin Tapp >Assignee: Huaxin Gao > Fix For: 1.5.3, 1.6.0 > > > I have a MSSQL table that has a timestamp column and am reading it using > DataFrameReader.jdbc. Adding a where clause which compares a timestamp range > causes a SQLServerException. > The problem is in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264 > (compileValue) which should surround timestamps/dates with quotes (only does > it for strings). > Sample pseudo-code: > val beg = new java.sql.Timestamp(...) > val end = new java.sql.Timestamp(...) > val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" > < end) > Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0" > Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 > 00:00:00.0'" > Fallback is to filter client-side which is extremely inefficient as the whole > table needs to be downloaded to each Spark executor. > Thanks -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
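The quoting fix described in the issue above can be sketched stand-alone: when compiling a pushed-down filter to SQL text, timestamp and date literals need single quotes just like strings. This mirrors the idea behind {{JDBCRDD.compileValue}} but is a simplified hypothetical version, not Spark's actual code:

```java
import java.sql.Date;
import java.sql.Timestamp;

// Sketch of the proposed fix: quote temporal literals as well as strings
// when rendering a filter value into a SQL WHERE clause.
class FilterCompiler {
    static String compileValue(Object value) {
        if (value instanceof String) {
            // escape embedded quotes, then wrap
            return "'" + ((String) value).replace("'", "''") + "'";
        } else if (value instanceof Timestamp || value instanceof Date) {
            return "'" + value + "'"; // e.g. '2015-01-01 00:00:00.0'
        } else {
            return String.valueOf(value); // numbers etc. stay unquoted
        }
    }
}
```

With this change, the generated predicate becomes {{TIMESTAMP_COLUMN >= '2015-01-01 00:00:00.0'}} instead of the unquoted form that SQL Server rejects.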
[jira] [Comment Edited] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException
[ https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964438#comment-15964438 ] Björn-Elmar Macek edited comment on SPARK-11788 at 4/11/17 2:34 PM: I have an issue which maybe related and occurs in 2.1.0. I created a table which contains a Timestamp column. An describe results in: user_id int null userCreatedAt timestamp null country_iso string null When i query as follows ... select date_format(userCreatedAt, ""), userCreatedAt, date_format(userCreatedAt, "") = "2017" from result where date_format(userCreatedAt, "") = "2017" order by userCreatedAt ... the third column contains only true values as it should be due to the corresponding expression being in the where clause. When i execute the same query on the same table in which i replaced userCreatedAt by a java.sql.Timestamp created from the userCreatedAt column, the result contains lots of false values. This is correct, since the year of those timestamp is not 2017. But i would expect them to not be included in the result. Also, date_format(userCreatedAt, "") returns the correct year. Can anybody reproduce this issue and will it be fixed? was (Author: macek): I have an issue which maybe related and occurs in 2.1.0. I created a table which contains a Timestamp column. An describe results in: user_id int null userCreatedAt timestamp null country_iso string null When i query as follows ... select date_format(userCreatedAt, ""), userCreatedAt, date_format(userCreatedAt, "") = "2017" from result where date_format(userCreatedAt, "") = "2017" order by userCreatedAt ... the third column contains only true values as it should be due to the expression being in the where clause. When i execute the same query on the same table in which i replaced userCreatedAt by a java.sql.Timestamp created from the userCreatedAt column, the result contains lots of false values. This is correct, since the year of those timestamp is not 2017. 
But I would expect them not to be included in the result. Also, date_format(userCreatedAt, "") returns the correct year. Can anybody reproduce this issue, and will it be fixed? > Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC > dataframes causes SQLServerException > > > Key: SPARK-11788 > URL: https://issues.apache.org/jira/browse/SPARK-11788 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Martin Tapp >Assignee: Huaxin Gao > Fix For: 1.5.3, 1.6.0 > > > I have a MSSQL table that has a timestamp column and am reading it using > DataFrameReader.jdbc. Adding a where clause which compares a timestamp range > causes a SQLServerException. > The problem is in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264 > (compileValue) which should surround timestamps/dates with quotes (only does > it for strings). > Sample pseudo-code: > val beg = new java.sql.Timestamp(...) > val end = new java.sql.Timestamp(...) > val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" > < end) > Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0" > Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 > 00:00:00.0'" > Fallback is to filter client-side which is extremely inefficient as the whole > table needs to be downloaded to each Spark executor. > Thanks -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException
[ https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964438#comment-15964438 ] Björn-Elmar Macek edited comment on SPARK-11788 at 4/11/17 2:28 PM: I have an issue which may be related; it occurs in 2.1.0. I created a table which contains a Timestamp column. A describe results in: user_id int null userCreatedAt timestamp null country_iso string null When I query as follows ... select date_format(userCreatedAt, ""), userCreatedAt, date_format(userCreatedAt, "") = "2017" from result where date_format(userCreatedAt, "") = "2017" order by userCreatedAt ... the third column contains only true values, as it should, due to the expression being in the where clause. When I execute the same query on the same table, in which I replaced userCreatedAt by a java.sql.Timestamp created from the userCreatedAt column, the result contains lots of false values. This is correct, since the year of those timestamps is not 2017, but I would expect them not to be included in the result. Also, date_format(userCreatedAt, "") returns the correct year. Can anybody reproduce this issue, and will it be fixed? was (Author: macek): I have an issue which may be related; it occurs in 2.1.0. I created a table which contains a Timestamp column. A describe results in: user_id int null userCreatedAt timestamp null country_iso string null When I query as follows ... select date_format(userCreatedAt, ""), userCreatedAt, date_format(userCreatedAt, "") = "2017" from result where date_format(userCreatedAt, "") = "2017" order by userCreatedAt ... the third column contains only true values, as it should, due to the expression being in the where clause. When I execute the same query on the same table, in which I replaced userCreatedAt by a java.sql.Timestamp created from the userCreatedAt column, the result contains lots of false values. Also, date_format(userCreatedAt, "") returns the correct year.
Can anybody reproduce this issue and will it be fixed? > Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC > dataframes causes SQLServerException > > > Key: SPARK-11788 > URL: https://issues.apache.org/jira/browse/SPARK-11788 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Martin Tapp >Assignee: Huaxin Gao > Fix For: 1.5.3, 1.6.0 > > > I have a MSSQL table that has a timestamp column and am reading it using > DataFrameReader.jdbc. Adding a where clause which compares a timestamp range > causes a SQLServerException. > The problem is in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264 > (compileValue) which should surround timestamps/dates with quotes (only does > it for strings). > Sample pseudo-code: > val beg = new java.sql.Timestamp(...) > val end = new java.sql.Timestamp(...) > val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" > < end) > Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0" > Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 > 00:00:00.0'" > Fallback is to filter client-side which is extremely inefficient as the whole > table needs to be downloaded to each Spark executor. > Thanks -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException
[ https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964438#comment-15964438 ] Björn-Elmar Macek commented on SPARK-11788: --- I have an issue which may be related; it occurs in 2.1.0. I created a table which contains a Timestamp column. A describe results in: user_id int null userCreatedAt timestamp null country_iso string null When I query as follows ... select date_format(userCreatedAt, ""), userCreatedAt, date_format(userCreatedAt, "") = "2017" from result where date_format(userCreatedAt, "") = "2017" order by userCreatedAt ... the third column contains only true values, as it should, due to the expression being in the where clause. When I execute the same query on the same table, in which I replaced userCreatedAt by a java.sql.Timestamp created from the userCreatedAt column, the result contains lots of false values. Also, date_format(userCreatedAt, "") returns the correct year. Can anybody reproduce this issue, and will it be fixed? > Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC > dataframes causes SQLServerException > > > Key: SPARK-11788 > URL: https://issues.apache.org/jira/browse/SPARK-11788 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Martin Tapp >Assignee: Huaxin Gao > Fix For: 1.5.3, 1.6.0 > > > I have a MSSQL table that has a timestamp column and am reading it using > DataFrameReader.jdbc. Adding a where clause which compares a timestamp range > causes a SQLServerException. > The problem is in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264 > (compileValue) which should surround timestamps/dates with quotes (only does > it for strings). > Sample pseudo-code: > val beg = new java.sql.Timestamp(...) > val end = new java.sql.Timestamp(...)
> val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" > < end) > Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0" > Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 > 00:00:00.0'" > Fallback is to filter client-side which is extremely inefficient as the whole > table needs to be downloaded to each Spark executor. > Thanks -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
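The quoting defect in compileValue described above can be illustrated outside Spark. The following is a minimal, hypothetical Python sketch of a literal compiler for pushed-down filters (not Spark's actual Scala code): string literals are already quoted, and the fix is to quote timestamp/date literals the same way, since leaving them bare produces invalid SQL such as TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0.

```python
from datetime import datetime, date

def compile_value(value):
    """Compile a filter literal into a SQL fragment.

    Hypothetical analog of JDBCRDD.compileValue: strings are quoted
    (with embedded quotes doubled), and the fix discussed above is to
    quote datetime/date literals the same way instead of letting them
    fall through unquoted.
    """
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, (datetime, date)):
        # Without this branch the literal would be emitted bare,
        # reproducing the reported SQLServerException.
        return "'" + str(value) + "'"
    return str(value)

clause = "TIMESTAMP_COLUMN >= " + compile_value(datetime(2015, 1, 1))
print(clause)  # TIMESTAMP_COLUMN >= '2015-01-01 00:00:00'
```

With the datetime branch removed, the same call would yield the unquoted, invalid form shown in the report.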
[jira] [Commented] (SPARK-20244) Incorrect input size in UI with pyspark
[ https://issues.apache.org/jira/browse/SPARK-20244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964422#comment-15964422 ] Saisai Shao commented on SPARK-20244: - This is actually not a UI problem; it is a FileSystem thread-local statistics problem. PythonRDD creates another thread to read the data, so the bytes-read figure obtained from a different thread is wrong. There is no problem when using spark-shell, since everything is processed in one thread. This is a general problem whenever the child RDD's computation creates another thread to consume the parent RDD's (HadoopRDD's) iterator. I tried several different ways to handle this problem, but still have some small issues. The multi-threaded processing inside the RDD makes the fix quite complex. > Incorrect input size in UI with pyspark > --- > > Key: SPARK-20244 > URL: https://issues.apache.org/jira/browse/SPARK-20244 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.1.0 >Reporter: Artur Sukhenko >Priority: Minor > Attachments: pyspark_incorrect_inputsize.png, > sparkshell_correct_inputsize.png > > > In Spark UI (Details for Stage) Input Size is 64.0 KB when running in > PySparkShell. > Also it is incorrect in Tasks table: > 64.0 KB / 132120575 in pyspark > 252.0 MB / 132120575 in spark-shell > I will attach screenshots. > Reproduce steps: > Run this to generate big file (press Ctrl+C after 5-6 seconds) > $ yes > /tmp/yes.txt > $ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/ > $ ./bin/pyspark > {code} > Python 2.7.5 (default, Nov 6 2016, 00:28:07) > [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).
> Welcome to Spark [ASCII art banner] version 2.1.0 > Using Python version 2.7.5 (default, Nov 6 2016 00:28:07) > SparkSession available as 'spark'.{code} > >>> a = sc.textFile("/tmp/yes.txt") > >>> a.count() > Open Spark UI and check Stage 0. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
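The thread-local statistics failure mode described above can be sketched in plain Python (a hypothetical stand-in for Hadoop's FileSystem statistics, not Spark code): a counter kept in threading.local and updated by a separate reader thread, as PythonRDD does, is invisible to the task thread that reports the metrics.

```python
import threading

# Hypothetical stand-in for FileSystem's per-thread read statistics.
stats = threading.local()

def read_some_data(num_bytes):
    # Runs in the reader thread: updates *that thread's* copy of the counter.
    stats.bytes_read = getattr(stats, "bytes_read", 0) + num_bytes

def task_using_reader_thread():
    # The task spawns a separate thread to do the actual reading,
    # as PythonRDD does for the parent HadoopRDD's iterator.
    reader = threading.Thread(target=read_some_data, args=(252 * 1024 * 1024,))
    reader.start()
    reader.join()
    # The task thread's own thread-local counter was never touched,
    # so the UI would report (near) zero input size.
    return getattr(stats, "bytes_read", 0)

print(task_using_reader_thread())  # 0, even though 252 MB were "read"
```

In spark-shell both the read and the metrics report happen on the same thread, so the counter is correct there.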
[jira] [Updated] (SPARK-20295) when spark.sql.adaptive.enabled is enabled, it conflicts with Exchange Reuse
[ https://issues.apache.org/jira/browse/SPARK-20295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruhui Wang updated SPARK-20295: --- Description: When spark.sql.exchange.reuse is enabled and a query with a self join (such as tpcds-q95) is run, the physical plan may randomly become the following:

WholeStageCodegen
: +- Project [id#0L]
: +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None
::- Project [id#0L]
:: +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None
:: :- Range 0, 1, 4, 1024, [id#0L]
:: +- INPUT
:+- INPUT
:- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
: +- WholeStageCodegen
: : +- Range 0, 1, 4, 1024, [id#1L]
+- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))

If spark.sql.adaptive.enabled = true, the call stack is: ShuffleExchange#doExecute --> postShuffleRDD function --> doEstimationIfNecessary. In this function, assert(exchanges.length == numExchanges) fails, as the left side has only one element while the right side equals 2. Is this a bug in the interaction of spark.sql.adaptive.enabled and exchange reuse? was: When spark.sql.exchange.reuse is enabled and a query with a self join (such as tpcds-q95) is run, the physical plan may randomly become the following:

WholeStageCodegen
: +- Project [id#0L]
: +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None
::- Project [id#0L]
:: +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None
:: :- Range 0, 1, 4, 1024, [id#0L]
:: +- INPUT
:+- INPUT
:- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
: +- WholeStageCodegen
: : +- Range 0, 1, 4, 1024, [id#1L]
+- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))

If spark.sql.adaptive.enabled = true, the call stack is: ShuffleExchange#doExecute --> postShuffleRDD function --> doEstimationIfNecessary.
In this function, assert(exchanges.length == numExchanges) fails, as the left side has only one element while the right side equals 2. Is this a bug in the interaction of spark.sql.adaptive.enabled and exchange reuse? > when spark.sql.adaptive.enabled is enabled, have conflict with Exchange Resue > -- > > Key: SPARK-20295 > URL: https://issues.apache.org/jira/browse/SPARK-20295 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL >Affects Versions: 2.1.0 >Reporter: Ruhui Wang > > when spark.sql.exchange.reuse is opened, then run a query with self join(such > as tpcds-q95), the physical plan will become below randomly: > WholeStageCodegen > : +- Project [id#0L] > : +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None > ::- Project [id#0L] > :: +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None > :: :- Range 0, 1, 4, 1024, [id#0L] > :: +- INPUT > :+- INPUT > :- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L)) > : +- WholeStageCodegen > : : +- Range 0, 1, 4, 1024, [id#1L] > +- ReusedExchange [id#2L], BroadcastExchange > HashedRelationBroadcastMode(true,List(id#1L),List(id#1L)) > If spark.sql.adaptive.enabled = true, the code stack is : > ShuffleExchange#doExecute --> postShuffleRDD function --> > doEstimationIfNecessary . In this function, > assert(exchanges.length == numExchanges) will be error, as left side has only > one element, but right is equal to 2. > If this is a bug of spark.sql.adaptive.enabled and exchange resue? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20295) when spark.sql.adaptive.enabled is enabled, it conflicts with Exchange Reuse
Ruhui Wang created SPARK-20295: -- Summary: when spark.sql.adaptive.enabled is enabled, it conflicts with Exchange Reuse Key: SPARK-20295 URL: https://issues.apache.org/jira/browse/SPARK-20295 Project: Spark Issue Type: Bug Components: Shuffle, SQL Affects Versions: 2.1.0 Reporter: Ruhui Wang

When spark.sql.exchange.reuse is enabled and a query with a self join (such as tpcds-q95) is run, the physical plan may randomly become the following:

WholeStageCodegen
: +- Project [id#0L]
: +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None
::- Project [id#0L]
:: +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None
:: :- Range 0, 1, 4, 1024, [id#0L]
:: +- INPUT
:+- INPUT
:- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
: +- WholeStageCodegen
: : +- Range 0, 1, 4, 1024, [id#1L]
+- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))

If spark.sql.adaptive.enabled = true, the call stack is: ShuffleExchange#doExecute --> postShuffleRDD function --> doEstimationIfNecessary. In this function, assert(exchanges.length == numExchanges) fails, as the left side has only one element while the right side equals 2. Is this a bug in the interaction of spark.sql.adaptive.enabled and exchange reuse? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] João Pedro Jericó updated SPARK-20294: -- Description: Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible, for example, if one has a small RDD whose schema needs to be inferred from more than one Row. For example: ```python small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) small_rdd.toDF(samplingRatio=0.01).show() ``` This will fail with high probability, because sampling the small_rdd with the .sample method will return an empty RDD most of the time. However, this is not the desired result, because we are able to sample at least 1% of the RDD. This is probably a problem with the other Spark APIs as well, but I don't have the knowledge to look at the source code for the other languages. was: Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible, for example, if one has a small RDD whose schema needs to be inferred from more than one Row. For example: ```python small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) small_rdd.toDF(samplingRatio=0.01).show() ``` This will fail with high probability, because sampling the small_rdd with the .sample method will return an empty RDD most of the time. However, this is not the desired result, because we are able to sample at least 1% of the RDD.
> _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. For example: > ```python > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > ``` > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. However, > this is not the desired result because we are able to sample at least 1% of > the RDD. > This is probably a problem with the other Spark APIs however I don't have the > knowledge to look at the source code for other languages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] João Pedro Jericó updated SPARK-20294: -- Description: Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible for example if one has a small RDD but that needs the schema to be inferred by more than one Row. For example: ```python small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) small_rdd.toDF(samplingRatio=0.01).show() ``` This will fail with high probability because when sampling the small_rdd with the .sample method it will return an empty RDD most of the time. However, this is not the desired result because we are able to sample at least 1% of the RDD. was: Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible for example if one has a small RDD but that needs the schema to be inferred by more than one Row. For example: ```python small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) small_rdd.toDF(samplingRatio=0.01).show() ``` This will fail with high probability because when sampling the small_rdd with the .sample method will return an empty RDD most of the time. However, this is not the desired result because we are able to sample at least 1% of the RDD. 
> _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. For example: > ```python > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > ``` > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. However, > this is not the desired result because we are able to sample at least 1% of > the RDD. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
João Pedro Jericó created SPARK-20294: - Summary: _inferSchema for RDDs fails if sample returns empty RDD Key: SPARK-20294 URL: https://issues.apache.org/jira/browse/SPARK-20294 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.1.0 Reporter: João Pedro Jericó Priority: Minor Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible, for example, if one has a small RDD whose schema needs to be inferred from more than one Row. For example: ```python small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) small_rdd.toDF(samplingRatio=0.01).show() ``` This will fail with high probability, because sampling the small_rdd with the .sample method will return an empty RDD most of the time. However, this is not the desired result, because we are able to sample at least 1% of the RDD. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
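The "high probability" claim above is easy to quantify: RDD.sample with a fraction p keeps each row independently with probability p (Bernoulli sampling), so an n-row RDD yields an empty sample with probability (1 - p)^n. A plain-Python check for the two-row example in the report (no Spark needed):

```python
# Probability that Bernoulli sampling with the given fraction keeps
# none of the n rows, i.e. that .sample returns an empty RDD.
def prob_empty_sample(n_rows, fraction):
    return (1 - fraction) ** n_rows

# The 2-row RDD from the report, sampled at 1%:
p = prob_empty_sample(2, 0.01)
print(round(p, 4))  # 0.9801 -- _inferSchema sees an empty RDD ~98% of the time
```

This is why the reported toDF(samplingRatio=0.01) call fails almost every run: schema inference is handed an empty sample roughly 98% of the time.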
[jira] [Assigned] (SPARK-20175) Exists should not be evaluated in Join operator and can be converted to ScalarSubquery if no correlated reference
[ https://issues.apache.org/jira/browse/SPARK-20175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20175: --- Assignee: Liang-Chi Hsieh > Exists should not be evaluated in Join operator and can be converted to > ScalarSubquery if no correlated reference > - > > Key: SPARK-20175 > URL: https://issues.apache.org/jira/browse/SPARK-20175 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.2.0 > > > Similar to ListQuery, Exists should not be evaluated in Join operator too. > Otherwise, a query like following will fail: > sql("select * from l, r where l.a = r.c + 1 AND (exists (select * from r) OR > l.a = r.c)") > For the Exists subquery without correlated reference, this patch converts it > to scalar subquery with a count Aggregate operator. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20175) Exists should not be evaluated in Join operator and can be converted to ScalarSubquery if no correlated reference
[ https://issues.apache.org/jira/browse/SPARK-20175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20175. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17491 [https://github.com/apache/spark/pull/17491] > Exists should not be evaluated in Join operator and can be converted to > ScalarSubquery if no correlated reference > - > > Key: SPARK-20175 > URL: https://issues.apache.org/jira/browse/SPARK-20175 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Liang-Chi Hsieh > Fix For: 2.2.0 > > > Similar to ListQuery, Exists should not be evaluated in Join operator too. > Otherwise, a query like following will fail: > sql("select * from l, r where l.a = r.c + 1 AND (exists (select * from r) OR > l.a = r.c)") > For the Exists subquery without correlated reference, this patch converts it > to scalar subquery with a count Aggregate operator. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20274) support compatible array element type in encoder
[ https://issues.apache.org/jira/browse/SPARK-20274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20274. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17587 [https://github.com/apache/spark/pull/17587] > support compatible array element type in encoder > > > Key: SPARK-20274 > URL: https://issues.apache.org/jira/browse/SPARK-20274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
[ https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964263#comment-15964263 ] Sean Owen commented on SPARK-20284: --- It's a developer API, so I don't think callers will generally be using this. try-with-resources isn't a JVM feature; it's a Java language feature. For example, it doesn't do anything for Scala, AFAIK. > Make SerializationStream and DeserializationStream extend Closeable > --- > > Key: SPARK-20284 > URL: https://issues.apache.org/jira/browse/SPARK-20284 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.1.0 >Reporter: Sergei Lebedev >Priority: Trivial > > Both {{SerializationStream}} and {{DeserializationStream}} implement > {{close}} but do not extend {{Closeable}}. As a result, these streams cannot > be used in try-with-resources. > Was this intentional or rather nobody ever needed that? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
[ https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964257#comment-15964257 ] Sergei Lebedev edited comment on SPARK-20284 at 4/11/17 11:57 AM: -- It makes the stream well-behaved for any JVM language, e.g. pure-Java or [Kotlin|https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.io/use.html]. was (Author: lebedev): It makes the stream well-behaved for any JVM user, e.g. pure-Java or [Kotlin|https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.io/use.html]. > Make SerializationStream and DeserializationStream extend Closeable > --- > > Key: SPARK-20284 > URL: https://issues.apache.org/jira/browse/SPARK-20284 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.1.0 >Reporter: Sergei Lebedev >Priority: Trivial > > Both {{SerializationStream}} and {{DeserializationStream}} implement > {{close}} but do not extend {{Closeable}}. As a result, these streams cannot > be used in try-with-resources. > Was this intentional or rather nobody ever needed that? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
[ https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964257#comment-15964257 ] Sergei Lebedev commented on SPARK-20284: It makes the stream well-behaved for any JVM user, e.g. pure-Java or [Kotlin|https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.io/use.html]. > Make SerializationStream and DeserializationStream extend Closeable > --- > > Key: SPARK-20284 > URL: https://issues.apache.org/jira/browse/SPARK-20284 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.1.0 >Reporter: Sergei Lebedev >Priority: Trivial > > Both {{SerializationStream}} and {{DeserializationStream}} implement > {{close}} but do not extend {{Closeable}}. As a result, these streams cannot > be used in try-with-resources. > Was this intentional or rather nobody ever needed that? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20293) In the 'jobs' or 'stages' page of the history server web UI, clicking the 'Go' button to query paging data causes a page error
[ https://issues.apache.org/jira/browse/SPARK-20293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20293: Assignee: (was: Apache Spark) > In the page of 'jobs' or 'stages' of history server web ui,,click the 'Go' > button, query paging data, the page error > - > > Key: SPARK-20293 > URL: https://issues.apache.org/jira/browse/SPARK-20293 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte > Attachments: error1.png, error2.png, jobs.png, stages.png > > > In the page of 'jobs' or 'stages' of history server web ui, > Click on the 'Go' button, query paging data, the page error, function can not > be used. > The reasons are as follows: > '#' Was escaped by the browser as% 23. > & CompletedStage.desc = true% 23completed, the parameter value desc becomes = > true% 23, causing the page to report an error. The error is as follows: > HTTP ERROR 400 > Problem Access / history / app-20170411132432-0004 / stages /. Reason: > For input string: "true # completed" > Powered by Jetty: // > The amendments are as follows: > The URL of the accessed URL is escaped to ensure that the URL is not escaped > by the browser. > please see attachment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20293) In the 'jobs' or 'stages' page of the history server web UI, clicking the 'Go' button to query paging data causes a page error
[ https://issues.apache.org/jira/browse/SPARK-20293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964184#comment-15964184 ] Apache Spark commented on SPARK-20293: -- User 'guoxiaolongzte' has created a pull request for this issue: https://github.com/apache/spark/pull/17608 > In the page of 'jobs' or 'stages' of history server web ui,,click the 'Go' > button, query paging data, the page error > - > > Key: SPARK-20293 > URL: https://issues.apache.org/jira/browse/SPARK-20293 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte > Attachments: error1.png, error2.png, jobs.png, stages.png > > > In the page of 'jobs' or 'stages' of history server web ui, > Click on the 'Go' button, query paging data, the page error, function can not > be used. > The reasons are as follows: > '#' Was escaped by the browser as% 23. > & CompletedStage.desc = true% 23completed, the parameter value desc becomes = > true% 23, causing the page to report an error. The error is as follows: > HTTP ERROR 400 > Problem Access / history / app-20170411132432-0004 / stages /. Reason: > For input string: "true # completed" > Powered by Jetty: // > The amendments are as follows: > The URL of the accessed URL is escaped to ensure that the URL is not escaped > by the browser. > please see attachment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20293) In the 'Jobs' or 'Stages' page of the history server web UI, clicking the 'Go' button to query paging data causes a page error
[ https://issues.apache.org/jira/browse/SPARK-20293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20293: Assignee: Apache Spark
[jira] [Updated] (SPARK-20293) In the 'Jobs' or 'Stages' page of the history server web UI, clicking the 'Go' button to query paging data causes a page error
[ https://issues.apache.org/jira/browse/SPARK-20293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-20293: --- Attachment: stages.png, jobs.png, error2.png, error1.png
[jira] [Updated] (SPARK-20293) In the 'Jobs' or 'Stages' page of the history server web UI, clicking the 'Go' button to query paging data causes a page error
[ https://issues.apache.org/jira/browse/SPARK-20293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-20293: --- Description: (updated; the revised description is quoted in full above) Summary: In the 'Jobs' or 'Stages' page of the history server web UI, clicking the 'Go' button to query paging data causes a page error (was: In history web UI, click the 'Go' button, paging query, the page error)