[jira] [Assigned] (SPARK-16633) lag/lead does not return the default value when the offset row does not exist

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-16633:


Assignee: Yin Huai

> lag/lead does not return the default value when the offset row does not exist
> -
>
> Key: SPARK-16633
> URL: https://issues.apache.org/jira/browse/SPARK-16633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
> Attachments: window_function_bug.html
>
>
> Please see the attached notebook. It seems that lag/lead somehow fails to 
> recognize that an offset row does not exist and generates wrong results.
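For reference, a minimal sketch (not taken from the attached notebook; column names and data are made up) of the behavior the issue expects: when the offset row does not exist, {{lag}}/{{lead}} should fall back to the explicitly supplied default value.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

// Hypothetical standalone repro; the session and column names are assumptions.
val spark = SparkSession.builder().appName("lag-default-check").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v")
val w = Window.orderBy("id")

// The first row has no row at offset 1 behind it, so the expected value of
// prev_v there is the default "none", not an arbitrary value from the frame.
df.select($"id", lag($"v", 1, "none").over(w).as("prev_v")).show()
{code}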



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16642) ResolveWindowFrame should not be triggered on UnresolvedFunctions.

2016-07-20 Thread Yin Huai (JIRA)
Yin Huai created SPARK-16642:


 Summary: ResolveWindowFrame should not be triggered on 
UnresolvedFunctions.
 Key: SPARK-16642
 URL: https://issues.apache.org/jira/browse/SPARK-16642
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai


The case at 
https://github.com/apache/spark/blob/75146be6ba5e9f559f5f15430310bb476ee0812c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1790-L1792
 is shown below
{code}
case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, UnspecifiedFrame)) =>
  val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, acceptWindowFrame = true)
  we.copy(windowSpec = s.copy(frameSpecification = frame))
{code}
This case will be triggered even when the function is still unresolved. So, when 
functions like lead are used, we may see errors like {{Window Frame RANGE 
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame ROWS 
BETWEEN 1 FOLLOWING AND 1 FOLLOWING.}} because we wrongly set the frame 
specification.
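A minimal sketch of one possible guard (an assumption about the direction of the fix, not necessarily what the eventual patch does): only fill in the default frame once the window function itself has been resolved.

{code}
// Sketch only: same case as above, but skipped while the window function is
// still unresolved, so the frame is decided after the function (e.g. lead)
// has had a chance to declare its required frame.
case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, UnspecifiedFrame))
    if e.resolved =>
  val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, acceptWindowFrame = true)
  we.copy(windowSpec = s.copy(frameSpecification = frame))
{code}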



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16642) ResolveWindowFrame should not be triggered on UnresolvedFunctions.

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-16642:


Assignee: Yin Huai

> ResolveWindowFrame should not be triggered on UnresolvedFunctions.
> --
>
> Key: SPARK-16642
> URL: https://issues.apache.org/jira/browse/SPARK-16642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> The case at 
> https://github.com/apache/spark/blob/75146be6ba5e9f559f5f15430310bb476ee0812c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1790-L1792
>  is shown below
> {code}
> case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, UnspecifiedFrame)) =>
>   val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, acceptWindowFrame = true)
>   we.copy(windowSpec = s.copy(frameSpecification = frame))
> {code}
> This case will be triggered even when the function is still unresolved. So, when 
> functions like lead are used, we may see errors like {{Window Frame RANGE 
> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame 
> ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING.}} because we wrongly set the frame 
> specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16216) CSV data source does not write date and timestamp correctly

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385462#comment-15385462
 ] 

Apache Spark commented on SPARK-16216:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14279

> CSV data source does not write date and timestamp correctly
> ---
>
> Key: SPARK-16216
> URL: https://issues.apache.org/jira/browse/SPARK-16216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, the CSV data source writes {{DateType}} and {{TimestampType}} as below:
> {code}
> ++
> |date|
> ++
> |14406372|
> |14144598|
> |14540400|
> ++
> {code}
> It would be nicer if it wrote dates and timestamps as formatted strings, just 
> like the JSON data source.
> Also, the CSV data source currently supports a {{dateFormat}} option to read 
> dates and timestamps in a custom format. It might be better if this option 
> could be applied to writing as well.
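A minimal sketch of the proposed write-side behavior (the {{dateFormat}} option on the writer is the proposal here, not current behavior; the session, path and data are assumptions):

{code}
import java.sql.Date
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-date-write").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, Date.valueOf("2015-08-26")), (2, Date.valueOf("2014-10-27")))
  .toDF("id", "date")

// Proposed: honor the same option the CSV reader already understands, so the
// date column is written as a formatted string instead of a raw internal value.
df.write.option("dateFormat", "yyyy-MM-dd").csv("/tmp/dates_csv")
{code}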



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16643) When doing Shuffle, report "java.io.FileNotFoundException"

2016-07-20 Thread Deng Changchun (JIRA)
Deng Changchun created SPARK-16643:
--

 Summary: When doing Shuffle, report "java.io.FileNotFoundException"
 Key: SPARK-16643
 URL: https://issues.apache.org/jira/browse/SPARK-16643
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2
 Environment: LSB Version:  
:base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: CentOS
Description:CentOS release 6.6 (Final)
Release:6.6
Codename:   Final

java version "1.7.0_10"
Java(TM) SE Runtime Environment (build 1.7.0_10-b18)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)

Reporter: Deng Changchun


In our Spark cluster in standalone mode, we execute some SQL statements on Spark 
SQL, such as aggregate queries like "select count(rowKey) from HVRC_B_LOG where 
1=1 and RESULTTIME >= 146332800 and RESULTTIME <= 1463414399000".

At the beginning everything is fine, but after about 15 days, executing the 
aggregate queries starts to report errors; the log looks like the following.
(Note: strangely, the error is not reported every time an aggregate query is 
executed; it appears randomly, by chance, after some aggregate queries have run.)

2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] 
executor.Executor: Managed memory leak detected; size = 8388608 bytes, TID = 624
2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] 
executor.Executor: Exception in task 0.3 in stage 580.0 (TID 624)
java.io.FileNotFoundException: 
/tmp/spark-cb199fce-bb80-4e6f-853f-4d7984bf5f34/executor-fb7c2149-c6c4-4697-ba2f-3b53dcd7f34a/blockmgr-0a9003ad-23b3-4ff5-b76f-6fbc5d71e730/3e/temp_shuffle_ef68b340-85e4-483c-90e8-5e8c8d8ee4ee
 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
at 
org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:110)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16515) [SPARK][SQL] transformation script got failure for python script

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385484#comment-15385484
 ] 

Apache Spark commented on SPARK-16515:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14280

> [SPARK][SQL] transformation script got failure for python script
> 
>
> Key: SPARK-16515
> URL: https://issues.apache.org/jira/browse/SPARK-16515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Assignee: Adrian Wang
>Priority: Critical
> Fix For: 2.0.0
>
>
> Running the SQL below produces a transformation script failure for the Python 
> script, with the error message shown below.
> Query SQL:
> {code}
> CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS
> SELECT DISTINCT
>   sessionid,
>   wcs_item_sk
> FROM
> (
>   FROM
>   (
> SELECT
>   wcs_user_sk,
>   wcs_item_sk,
>   (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
> FROM web_clickstreams
> WHERE wcs_item_sk IS NOT NULL
> AND   wcs_user_sk IS NOT NULL
> DISTRIBUTE BY wcs_user_sk
> SORT BY
>   wcs_user_sk,
>   tstamp_inSec -- "sessionize" reducer script requires the cluster by uid 
> and sort by tstamp
>   ) clicksAnWebPageType
>   REDUCE
> wcs_user_sk,
> tstamp_inSec,
> wcs_item_sk
>   USING 'python q2-sessionize.py 3600'
>   AS (
> wcs_item_sk BIGINT,
> sessionid STRING)
> ) q02_tmp_sessionize
> CLUSTER BY sessionid
> {code}
> Error Message:
> {code}
> 16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 
> (TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with 
> status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. 
> Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>   at 
> org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181)
>   ... 14 more
> 16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 
> (TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess 
> exited with status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in 
> user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
> ) [duplicate 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16644) constraints propagation may fail the query

2016-07-20 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-16644:
---

 Summary: constraints propagation may fail the query
 Key: SPARK-16644
 URL: https://issues.apache.org/jira/browse/SPARK-16644
 Project: Spark
  Issue Type: Bug
Reporter: Wenchen Fan


{code}
create table tbl(a int, b int);
select
  a,
  max(b) as c1,
  b as c2
from tbl
where a = b
group by a, b
having c1 = 1
{code}

This query fails in 2.0 but works in 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385510#comment-15385510
 ] 

Lianhui Wang commented on SPARK-2666:
-

[~tgraves] Sorry for the late reply. In https://github.com/apache/spark/pull/1572, 
it kills all running tasks before we resubmit for a FetchFailed. But 
[~kayousterhout] said that it keeps the remaining tasks running because they may 
hit fetch failures from different map outputs than the original fetch failure.
I think the best way is to do what MapReduce does: just resubmit the map stage of 
the failed stage. If the reduce stage hits a FetchFailed, it just reports the 
FetchFailed to the DAGScheduler and keeps fetching other results. Then the reduce 
stage gets the output statuses of the FetchFailed map outputs on every heartbeat, 
like https://github.com/apache/spark/pull/3430.
[~tgraves] What do you think about this? Thanks.

> Always try to cancel running tasks when a stage is marked as zombie
> ---
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a 
> "zombie" before the task set has completed all of its tasks.  For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the 
> ShuffleMapOutput, though no attempt has completed all its tasks (at least, 
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (there may be others, I'm not sure if this list is exhaustive.)
> Marking a taskset as zombie prevents any *additional* tasks from getting 
> scheduled, however it does not cancel all currently running tasks.  We should 
> cancel all running tasks to avoid wasting resources (and also to make the 
> behavior a little more clear to the end user).  Rather than canceling tasks in 
> each case piecemeal, we should refactor the scheduler so that these two actions 
> are always taken together -- canceling tasks should go hand-in-hand with 
> marking the taskset as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it 
> behind a method like {{markZombie}} or something (see the sketch below).
> * Marking a stage as zombie before all the tasks have completed does *not* 
> necessarily mean the stage attempt has failed.  In case (a), the stage 
> attempt has failed, but in case (b) we are not canceling b/c of a failure, 
> rather just b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.  
> However, it also has some side-effects like logging that the stage has failed 
> and creating a {{TaskSetFailed}} event, which we don't want e.g. in case (b) 
> when nothing has failed.  So it may need some additional refactoring to go 
> along w/ {{markZombie}}.
> * {{SchedulerBackend}}s are free to not implement {{killTask}}, so we need 
> to be sure to catch the {{UnsupportedOperationException}}s.
> * Testing this *might* benefit from SPARK-10372
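A rough, hypothetical sketch of the {{markZombie}} idea mentioned above (all names here are illustrative; this is not the actual TaskSetManager code):

{code}
// Hypothetical sketch: marking a task set as zombie and cancelling its running
// tasks always happen together, tolerating backends that cannot kill tasks.
trait KillableBackend {
  // Mirrors SchedulerBackend.killTask, which backends are free to not implement.
  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit
}

class TaskSetState(backend: KillableBackend, runningTasks: Map[Long, String]) {
  private var zombie = false
  def isZombie: Boolean = zombie

  def markZombie(): Unit = synchronized {
    if (!zombie) {
      zombie = true
      runningTasks.foreach { case (taskId, executorId) =>
        try backend.killTask(taskId, executorId, interruptThread = false)
        catch {
          case _: UnsupportedOperationException => // backend cannot kill tasks; skip it
        }
      }
    }
  }
}
{code}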



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16644) constraints propagation may fail the query

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385538#comment-15385538
 ] 

Apache Spark commented on SPARK-16644:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/14281

> constraints propagation may fail the query
> --
>
> Key: SPARK-16644
> URL: https://issues.apache.org/jira/browse/SPARK-16644
> Project: Spark
>  Issue Type: Bug
>Reporter: Wenchen Fan
>
> {code}
> create table tbl(a int, b int);
> select
>   a,
>   max(b) as c1,
>   b as c2
> from tbl
> where a = b
> group by a, b
> having c1 = 1
> {code}
> This query fails in 2.0 but works in 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16644) constraints propagation may fail the query

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16644:


Assignee: Apache Spark

> constraints propagation may fail the query
> --
>
> Key: SPARK-16644
> URL: https://issues.apache.org/jira/browse/SPARK-16644
> Project: Spark
>  Issue Type: Bug
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> {code}
> create table tbl(a int, b int);
> select
>   a,
>   max(b) as c1,
>   b as c2
> from tbl
> where a = b
> group by a, b
> having c1 = 1
> {code}
> This query fails in 2.0 but works in 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16644) constraints propagation may fail the query

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16644:


Assignee: (was: Apache Spark)

> constraints propagation may fail the query
> --
>
> Key: SPARK-16644
> URL: https://issues.apache.org/jira/browse/SPARK-16644
> Project: Spark
>  Issue Type: Bug
>Reporter: Wenchen Fan
>
> {code}
> create table tbl(a int, b int);
> select
>   a,
>   max(b) as c1,
>   b as c2
> from tbl
> where a = b
> group by a, b
> having c1 = 1
> {code}
> This query fails in 2.0 but works in 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16643) When doing Shuffle, report "java.io.FileNotFoundException"

2016-07-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385548#comment-15385548
 ] 

Sean Owen commented on SPARK-16643:
---

This might have been resolved since 1.5.0; can you try with a more recent 
version?
There are similar old issues like 
https://issues.apache.org/jira/browse/SPARK-12240

> When doing Shuffle, report "java.io.FileNotFoundException"
> --
>
> Key: SPARK-16643
> URL: https://issues.apache.org/jira/browse/SPARK-16643
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: LSB Version: 
> :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
> Distributor ID:   CentOS
> Description:  CentOS release 6.6 (Final)
> Release:  6.6
> Codename: Final
> java version "1.7.0_10"
> Java(TM) SE Runtime Environment (build 1.7.0_10-b18)
> Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)
>Reporter: Deng Changchun
>
> In our Spark cluster in standalone mode, we execute some SQL statements on 
> Spark SQL, such as aggregate queries like "select count(rowKey) from HVRC_B_LOG 
> where 1=1 and RESULTTIME >= 146332800 and RESULTTIME <= 1463414399000".
> At the beginning everything is fine, but after about 15 days, executing the 
> aggregate queries starts to report errors; the log looks like the following.
> (Note: strangely, the error is not reported every time an aggregate query is 
> executed; it appears randomly, by chance, after some aggregate queries have run.)
> 2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] 
> executor.Executor: Managed memory leak detected; size = 8388608 bytes, TID = 
> 624
> 2016-07-20 13:48:50,250 ERROR [Executor task launch worker-75] 
> executor.Executor: Exception in task 0.3 in stage 580.0 (TID 624)
> java.io.FileNotFoundException: 
> /tmp/spark-cb199fce-bb80-4e6f-853f-4d7984bf5f34/executor-fb7c2149-c6c4-4697-ba2f-3b53dcd7f34a/blockmgr-0a9003ad-23b3-4ff5-b76f-6fbc5d71e730/3e/temp_shuffle_ef68b340-85e4-483c-90e8-5e8c8d8ee4ee
>  (No such file or directory)
>   at java.io.FileOutputStream.open(Native Method)
>   at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:110)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16628) OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16628:


Assignee: Apache Spark

> OrcConversions should not convert an ORC table represented by 
> MetastoreRelation to HadoopFsRelation if metastore schema does not match 
> schema stored in ORC files
> -
>
> Key: SPARK-16628
> URL: https://issues.apache.org/jira/browse/SPARK-16628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> When {{spark.sql.hive.convertMetastoreOrc}} is enabled, we convert an ORC 
> table represented by a MetastoreRelation to a HadoopFsRelation that uses 
> Spark's OrcFileFormat internally. This conversion aims to make table scanning 
> faster, since at runtime the code path that scans a HadoopFsRelation performs 
> better. However, OrcFileFormat's implementation assumes that ORC files store 
> their schema with correct column names, and before Hive 2.0 an ORC table 
> created by Hive does not store column names correctly in the ORC files 
> (HIVE-4243). So, for this kind of ORC dataset, we cannot really convert the 
> code path. 
> Right now, if ORC tables were created by Hive 1.x or 0.x, enabling 
> {{spark.sql.hive.convertMetastoreOrc}} will introduce a runtime exception for 
> non-partitioned ORC tables and drop the metastore schema for partitioned ORC 
> tables.
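As a point of reference, the conversion can be switched off per session while the proper fix is worked out (a workaround sketch assuming a {{SparkSession}} named {{spark}} and a hypothetical table name, not the change this issue proposes):

{code}
// Sketch: fall back to the Hive ORC read path for tables whose ORC files may
// not carry correct column names (created by Hive 0.x / 1.x).
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.sql("SELECT * FROM some_orc_table").show()
{code}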



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16628) OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385551#comment-15385551
 ] 

Apache Spark commented on SPARK-16628:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/14282

> OrcConversions should not convert an ORC table represented by 
> MetastoreRelation to HadoopFsRelation if metastore schema does not match 
> schema stored in ORC files
> -
>
> Key: SPARK-16628
> URL: https://issues.apache.org/jira/browse/SPARK-16628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> When {{spark.sql.hive.convertMetastoreOrc}} is enabled, we convert an ORC 
> table represented by a MetastoreRelation to a HadoopFsRelation that uses 
> Spark's OrcFileFormat internally. This conversion aims to make table scanning 
> faster, since at runtime the code path that scans a HadoopFsRelation performs 
> better. However, OrcFileFormat's implementation assumes that ORC files store 
> their schema with correct column names, and before Hive 2.0 an ORC table 
> created by Hive does not store column names correctly in the ORC files 
> (HIVE-4243). So, for this kind of ORC dataset, we cannot really convert the 
> code path. 
> Right now, if ORC tables were created by Hive 1.x or 0.x, enabling 
> {{spark.sql.hive.convertMetastoreOrc}} will introduce a runtime exception for 
> non-partitioned ORC tables and drop the metastore schema for partitioned ORC 
> tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16628) OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16628:


Assignee: (was: Apache Spark)

> OrcConversions should not convert an ORC table represented by 
> MetastoreRelation to HadoopFsRelation if metastore schema does not match 
> schema stored in ORC files
> -
>
> Key: SPARK-16628
> URL: https://issues.apache.org/jira/browse/SPARK-16628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> When {{spark.sql.hive.convertMetastoreOrc}} is enabled, we convert an ORC 
> table represented by a MetastoreRelation to a HadoopFsRelation that uses 
> Spark's OrcFileFormat internally. This conversion aims to make table scanning 
> faster, since at runtime the code path that scans a HadoopFsRelation performs 
> better. However, OrcFileFormat's implementation assumes that ORC files store 
> their schema with correct column names, and before Hive 2.0 an ORC table 
> created by Hive does not store column names correctly in the ORC files 
> (HIVE-4243). So, for this kind of ORC dataset, we cannot really convert the 
> code path. 
> Right now, if ORC tables were created by Hive 1.x or 0.x, enabling 
> {{spark.sql.hive.convertMetastoreOrc}} will introduce a runtime exception for 
> non-partitioned ORC tables and drop the metastore schema for partitioned ORC 
> tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16613) RDD.pipe returns values for empty partitions

2016-07-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385552#comment-15385552
 ] 

Sean Owen commented on SPARK-16613:
---

Yeah it's a tough call. The current behavior is at least consistent: entirely 
partition-oriented, one process per partition exactly, always. I agree it's not 
quite what I'd expect, but maybe the first thing we can do now is at least 
update the docs without changing the behavior.

> RDD.pipe returns values for empty partitions
> 
>
> Key: SPARK-16613
> URL: https://issues.apache.org/jira/browse/SPARK-16613
> Project: Spark
>  Issue Type: Bug
>Reporter: Alex Krasnyansky
>
> Suppose we have Spark code like the following:
> {code}
> object PipeExample {
>   def main(args: Array[String]) {
> val fstRdd = sc.parallelize(List("hi", "hello", "how", "are", "you"))
> val pipeRdd = 
> fstRdd.pipe("/Users/finkel/spark-pipe-example/src/main/resources/len.sh")
> pipeRdd.collect.foreach(println)
>   }
> }
> {code}
> It uses a bash script to convert a string to its length.
> {code}
> #!/bin/sh
> read input
> len=${#input}
> echo $len
> {code}
> So far so good, but when I run the code, it prints incorrect output. For 
> example:
> {code}
> 0
> 2
> 0
> 5
> 3
> 0
> 3
> 3
> {code}
> I expect to see
> {code}
> 2
> 5
> 3
> 3
> 3
> {code}
> which is the correct output for the app. I think this is a bug: the output 
> should contain only positive integers and no zeros.
> Environment:
> 1. Spark version is 1.6.2
> 2. Scala version is 2.11.6
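For what it's worth, a minimal sketch that reproduces the partition-oriented behavior (the explicit partition count is an assumption for illustration; the original code relies on the default parallelism): with more partitions than elements, every empty partition still launches the script once and contributes a "0" line.

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("pipe-empty-partitions").setMaster("local[*]"))

// 5 elements spread over 8 partitions leaves 3 partitions empty; `read` on an
// empty stdin yields an empty string, so len.sh echoes 0 for those partitions.
val words = sc.parallelize(List("hi", "hello", "how", "are", "you"), numSlices = 8)
val lengths = words.pipe("/Users/finkel/spark-pipe-example/src/main/resources/len.sh")
lengths.collect().foreach(println)
{code}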



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16645) rename CatalogStorageFormat.serdeProperties to properties

2016-07-20 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-16645:
---

 Summary: rename CatalogStorageFormat.serdeProperties to properties
 Key: SPARK-16645
 URL: https://issues.apache.org/jira/browse/SPARK-16645
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16645) rename CatalogStorageFormat.serdeProperties to properties

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16645:


Assignee: Wenchen Fan  (was: Apache Spark)

> rename CatalogStorageFormat.serdeProperties to properties
> -
>
> Key: SPARK-16645
> URL: https://issues.apache.org/jira/browse/SPARK-16645
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16645) rename CatalogStorageFormat.serdeProperties to properties

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385562#comment-15385562
 ] 

Apache Spark commented on SPARK-16645:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/14283

> rename CatalogStorageFormat.serdeProperties to properties
> -
>
> Key: SPARK-16645
> URL: https://issues.apache.org/jira/browse/SPARK-16645
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16645) rename CatalogStorageFormat.serdeProperties to properties

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16645:


Assignee: Apache Spark  (was: Wenchen Fan)

> rename CatalogStorageFormat.serdeProperties to properties
> -
>
> Key: SPARK-16645
> URL: https://issues.apache.org/jira/browse/SPARK-16645
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-20 Thread liancheng (JIRA)
liancheng created SPARK-16646:
-

 Summary: LEAST doesn't accept numeric arguments with different 
data types
 Key: SPARK-16646
 URL: https://issues.apache.org/jira/browse/SPARK-16646
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: liancheng


{code:sql}
SELECT LEAST(1, 1.5);
{code}

{noformat}
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
all have the same type, got LEAST (ArrayBuffer(IntegerType, 
DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
{noformat}
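A possible workaround sketch (an assumption on my part, not part of this report; assumes a {{SparkSession}} named {{spark}}): casting both arguments to a common type by hand sidesteps the missing implicit coercion.

{code}
// Both arguments now share the same type (DOUBLE), so LEAST resolves; the
// implicit numeric widening that 1.6 applied is what this issue asks to restore.
spark.sql("SELECT LEAST(CAST(1 AS DOUBLE), CAST(1.5 AS DOUBLE))").show()
{code}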



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16642) ResolveWindowFrame should not be triggered on UnresolvedFunctions.

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385569#comment-15385569
 ] 

Apache Spark commented on SPARK-16642:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/14284

> ResolveWindowFrame should not be triggered on UnresolvedFunctions.
> --
>
> Key: SPARK-16642
> URL: https://issues.apache.org/jira/browse/SPARK-16642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> The case at 
> https://github.com/apache/spark/blob/75146be6ba5e9f559f5f15430310bb476ee0812c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1790-L1792
>  is shown below
> {code}
> case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, UnspecifiedFrame)) =>
>   val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, acceptWindowFrame = true)
>   we.copy(windowSpec = s.copy(frameSpecification = frame))
> {code}
> This case will be triggered even when the function is still unresolved. So, when 
> functions like lead are used, we may see errors like {{Window Frame RANGE 
> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame 
> ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING.}} because we wrongly set the frame 
> specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16633) lag/lead does not return the default value when the offset row does not exist

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385570#comment-15385570
 ] 

Apache Spark commented on SPARK-16633:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/14284

> lag/lead does not return the default value when the offset row does not exist
> -
>
> Key: SPARK-16633
> URL: https://issues.apache.org/jira/browse/SPARK-16633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
> Attachments: window_function_bug.html
>
>
> Please see the attached notebook. It seems that lag/lead somehow fails to 
> recognize that an offset row does not exist and generates wrong results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-20 Thread liancheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liancheng updated SPARK-16646:
--
Description: 
{code:sql}
SELECT LEAST(1, 1.5);
{code}

{noformat}
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
all have the same type, got LEAST (ArrayBuffer(IntegerType, 
DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
{noformat}

This query works for 1.6.

  was:
{code:sql}
SELECT LEAST(1, 1.5);
{code}

{noformat}
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
all have the same type, got LEAST (ArrayBuffer(IntegerType, 
DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
{noformat}


> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works for 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16628) OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files

2016-07-20 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385586#comment-15385586
 ] 

Liang-Chi Hsieh commented on SPARK-16628:
-

I've tried to address this issue in the PR using the first option.

> OrcConversions should not convert an ORC table represented by 
> MetastoreRelation to HadoopFsRelation if metastore schema does not match 
> schema stored in ORC files
> -
>
> Key: SPARK-16628
> URL: https://issues.apache.org/jira/browse/SPARK-16628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> When {{spark.sql.hive.convertMetastoreOrc}} is enabled, we convert an ORC 
> table represented by a MetastoreRelation to a HadoopFsRelation that uses 
> Spark's OrcFileFormat internally. This conversion aims to make table scanning 
> faster, since at runtime the code path that scans a HadoopFsRelation performs 
> better. However, OrcFileFormat's implementation assumes that ORC files store 
> their schema with correct column names, and before Hive 2.0 an ORC table 
> created by Hive does not store column names correctly in the ORC files 
> (HIVE-4243). So, for this kind of ORC dataset, we cannot really convert the 
> code path. 
> Right now, if ORC tables were created by Hive 1.x or 0.x, enabling 
> {{spark.sql.hive.convertMetastoreOrc}} will introduce a runtime exception for 
> non-partitioned ORC tables and drop the metastore schema for partitioned ORC 
> tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16440) Undeleted broadcast variables in Word2Vec causing OoM for long runs

2016-07-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16440:
--
 Assignee: Anthony Truchet  (was: Sean Owen)
Fix Version/s: (was: 2.0.0)
   2.0.1

> Undeleted broadcast variables in Word2Vec causing OoM for long runs 
> 
>
> Key: SPARK-16440
> URL: https://issues.apache.org/jira/browse/SPARK-16440
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: Anthony Truchet
>Assignee: Anthony Truchet
> Fix For: 1.6.3, 2.0.1
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Three broadcast variables created at the beginning of {{Word2Vec.fit()}} are 
> never deleted nor unpersisted. This seems to cause excessive memory 
> consumption on the driver for a job running hundreds of successive trainings.
> They are 
> {code}
> val expTable = sc.broadcast(createExpTable())
> val bcVocab = sc.broadcast(vocab)
> val bcVocabHash = sc.broadcast(vocabHash)
> {code}
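A minimal sketch of the kind of cleanup the issue asks for, continuing the quoted snippet (where exactly inside {{Word2Vec.fit()}} these calls belong is an assumption; {{destroy()}} is the standard {{Broadcast}} API):

{code}
// Sketch: release the broadcasts once fit() no longer needs them, so a driver
// that runs hundreds of successive trainings does not accumulate broadcast blocks.
expTable.destroy()
bcVocab.destroy()
bcVocabHash.destroy()
{code}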



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16629) UDTs can not be compared to DataTypes in dataframes.

2016-07-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16629:
--
Target Version/s:   (was: 2.0.1)
   Fix Version/s: (was: 2.0.0)

> UDTs can not be compared to DataTypes in dataframes.
> 
>
> Key: SPARK-16629
> URL: https://issues.apache.org/jira/browse/SPARK-16629
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>
> Currently UDTs cannot be compared to DataTypes even if their sqlTypes match. 
> This leads to errors like this:
> {code}
> In [12]: filtered = df.filter(df['udt_time'] > threshold)
> ---
> AnalysisException Traceback (most recent call last)
> /Users/franklyndsouza/dev/starscream/bin/starscream in ()
> > 1 thresholded = df.filter(df['udt_time'] > threshold)
> AnalysisException: u"cannot resolve '(`udt_time` > TIMESTAMP('2015-10-20 
> 01:00:00.0'))' due to data typ mismatch: '(`udt_time` > TIMESTAMP('2015-10-20 
> 01:00:00.0'))' requires (boolean or tinyint or smallint or int or bigint or 
> float or double or decimal or timestamp or date or string or binary) type, 
> not pythonuserdefined"
> {code}
> I've proposed a fix for this here: https://github.com/apache/spark/pull/14164



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15624) 2.0 python coverage ml.recommendation module

2016-07-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15624.
---
Resolution: Done

> 2.0 python coverage ml.recommendation module
> -
>
> Key: SPARK-15624
> URL: https://issues.apache.org/jira/browse/SPARK-15624
> Project: Spark
>  Issue Type: Improvement
>Reporter: holdenk
>
> See parent task SPARK-14813.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15625) 2.0 python coverage ml.classification module

2016-07-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15625.
---
Resolution: Done

> 2.0 python coverage ml.classification module
> -
>
> Key: SPARK-15625
> URL: https://issues.apache.org/jira/browse/SPARK-15625
> Project: Spark
>  Issue Type: Improvement
>Reporter: holdenk
>
> See parent task SPARK-14813.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15629) 2.0 python coverage pyspark.ml.linalg

2016-07-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15629.
---
Resolution: Done

> 2.0 python coverage pyspark.ml.linalg
> --
>
> Key: SPARK-15629
> URL: https://issues.apache.org/jira/browse/SPARK-15629
> Project: Spark
>  Issue Type: Improvement
>Reporter: holdenk
>
> See parent task SPARK-14813.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16647) sparksql1.6.2 on yarn with hive metastore1.0.0 throws "alter_table_with_cascade" exception

2016-07-20 Thread zhangshuxin (JIRA)
zhangshuxin created SPARK-16647:
---

 Summary: sparksql1.6.2 on yarn with hive metastore1.0.0 throws 
"alter_table_with_cascade" exception
 Key: SPARK-16647
 URL: https://issues.apache.org/jira/browse/SPARK-16647
 Project: Spark
  Issue Type: Bug
Reporter: zhangshuxin


My Spark version is 1.6.2 (also 1.5.2 and 1.5.0) and my Hive version is 1.0.0.
When I execute SQL like 'create table tbl1 as select * from tbl2' or 
'insert overwrite table tbl1 select * from tbl2', I get the following exception:

16/07/20 10:14:13 WARN metastore.RetryingMetaStoreClient: MetaStoreClient lost 
connection. Attempting to reconnect.
org.apache.thrift.TApplicationException: Invalid method name: 
'alter_table_with_cascade'
at 
org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_alter_table_with_cascade(ThriftHiveMetastore.java:1374)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.alter_table_with_cascade(ThriftHiveMetastore.java:1358)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.alter_table(HiveMetaStoreClient.java:340)
at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.alter_table(SessionHiveMetaStoreClient.java:251)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
at com.sun.proxy.$Proxy27.alter_table(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:496)
at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1668)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:441)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply$mcV$sp(ClientWrapper.scala:489)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
at 
org.apache.spark.sql.hive.client.ClientWrapper.loadTable(ClientWrapper.scala:488)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:243)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:263)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at 
org.apache.spark.sql.hive.execution.CreateTableAsSelect.run(CreateTableAsSelect.scala:89)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPl

[jira] [Resolved] (SPARK-15627) 2.0 python coverage ml.tuning module

2016-07-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15627.
---
Resolution: Done

> 2.0 python coverage ml.tuning module
> -
>
> Key: SPARK-15627
> URL: https://issues.apache.org/jira/browse/SPARK-15627
> Project: Spark
>  Issue Type: Improvement
>Reporter: holdenk
>
> See parent task SPARK-14813.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15626) 2.0 python coverage ml.regression module

2016-07-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15626.
---
Resolution: Done

> 2.0 python coverage ml.regression module
> -
>
> Key: SPARK-15626
> URL: https://issues.apache.org/jira/browse/SPARK-15626
> Project: Spark
>  Issue Type: Improvement
>Reporter: holdenk
>
> See parent task SPARK-14813.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12170) Deprecate the JAVA-specific deserialized storage levels

2016-07-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12170:
--
Issue Type: Task  (was: Sub-task)
Parent: (was: SPARK-12169)

> Deprecate the JAVA-specific deserialized storage levels
> ---
>
> Key: SPARK-12170
> URL: https://issues.apache.org/jira/browse/SPARK-12170
> Project: Spark
>  Issue Type: Task
>  Components: SparkR
>Reporter: Sun Rui
>
> This is to be consistent with SPARK-12091, which is for PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12169) SparkR 2.0

2016-07-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12169.
---
Resolution: Done

> SparkR 2.0
> --
>
> Key: SPARK-12169
> URL: https://issues.apache.org/jira/browse/SPARK-12169
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Sun Rui
>
> This is an umbrella issue covering all SparkR-related issues planned for 
> Spark 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12172) Consider removing SparkR internal RDD APIs

2016-07-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12172:
--
Issue Type: Task  (was: Sub-task)
Parent: (was: SPARK-12169)

> Consider removing SparkR internal RDD APIs
> --
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
>  Issue Type: Task
>  Components: SparkR
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8443) GenerateMutableProjection Exceeds JVM Code Size Limits

2016-07-20 Thread Hayri Volkan Agun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385667#comment-15385667
 ] 

Hayri Volkan Agun commented on SPARK-8443:
--

We hit the same issue with large SQL statements containing a lot of unions and 
iterations over a DataFrame, on Spark 1.6.2.


Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
"(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
at org.codehaus.janino.CodeContext.writeShort(CodeContext.java:959)
at 
org.codehaus.janino.UnitCompiler.writeConstantFieldrefInfo(UnitCompiler.java:10279)
at org.codehaus.janino.UnitCompiler.getfield(UnitCompiler.java:9946)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3322)
at org.codehaus.janino.UnitCompiler.access$8200(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$10.visitFieldAccess(UnitCompiler.java:3282)
at org.codehaus.janino.Java$FieldAccess.accept(Java.java:3232)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
at 
org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
at 
org.codehaus.janino.UnitCompiler.compileContext2(UnitCompiler.java:3190)
at org.codehaus.janino.UnitCompiler.access$5600(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$9.visitFieldAccess(UnitCompiler.java:3152)
at org.codehaus.janino.Java$FieldAccess.accept(Java.java:3232)
at 
org.codehaus.janino.UnitCompiler.compileContext(UnitCompiler.java:3160)
at 
org.codehaus.janino.UnitCompiler.compileContext2(UnitCompiler.java:3172)
at org.codehaus.janino.UnitCompiler.access$5400(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$9.visitAmbiguousName(UnitCompiler.java:3150)
at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3138)
at 
org.codehaus.janino.UnitCompiler.compileContext(UnitCompiler.java:3160)
at 
org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4367)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3975)
at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263)
at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
at 
org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2662)
at org.codehaus.janino.UnitCompiler.access$4400(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$7.visitMethodInvocation(UnitCompiler.java:2627)
at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643)
at org.codehaus.janino.UnitCompiler.access$1100(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$4.visitExpressionStatement(UnitCompiler.java:936)
at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2097)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:958)
at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1007)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:993)
at org.codehaus.janino.UnitCompiler.access$1000(UnitCompiler.java:185)
at org.codehaus.janino.UnitCompiler$4.visitBlock(UnitCompiler.java:935)
at org.codehaus.janino.Java$Block.accept(Java.java:2012)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:958)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1742)
at org.codehaus.janino.UnitCompiler.access$1200(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$4.visitIfStatement(UnitCompiler.java:937)
at org.codehaus.janino.Java$IfStatement.accept(Java.java:2157)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:958)
at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1007)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:993)
at org.codehaus.janino.UnitCompiler.access$1000(UnitCompiler.java:185)
at org.codehaus.janino.UnitCompiler$4.visitBlock(UnitCompiler.java:935)
at org.codehaus.janino.Java$Block.accept(Java.java:20
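
For illustration, here is a hedged sketch (not the reporter's actual code; the 
column name and loop shape are assumptions) of the kind of iterative DataFrame 
build-up that can push the generated projection past Janino's 64 KB method limit 
on 1.6.x:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Each pass widens the projection; unioning the widened plan with itself
// inflates the generated SpecificUnsafeProjection code even further.
def growPlan(base: DataFrame, iterations: Int): DataFrame = {
  var result = base
  for (i <- 1 to iterations) {
    result = result.withColumn(s"extra_$i", lit(i))
  }
  result.unionAll(result)
}
{code}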

[jira] [Created] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-07-20 Thread liancheng (JIRA)
liancheng created SPARK-16648:
-

 Summary: LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
 Key: SPARK-16648
 URL: https://issues.apache.org/jira/browse/SPARK-16648
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: liancheng


{code:sql}
SELECT LAST_VALUE(FALSE) OVER ();
{code}

Exception thrown:

{noformat}
java.lang.IndexOutOfBoundsException: 0
  at 
scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
  at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
  at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
  at 
org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
  at 
org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveExpressions$1.applyOrElse(LogicalPlan.scala:79)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveExpressions$1.applyOrElse(LogicalPlan.scala:78)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonf

[jira] [Commented] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-20 Thread liancheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385753#comment-15385753
 ] 

liancheng commented on SPARK-16646:
---

{{GREATEST}} has a similar issue.

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works for 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-07-20 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385760#comment-15385760
 ] 

Hyukjin Kwon commented on SPARK-16646:
--

Just FYI, I tried to reproduce this.

In 1.6.x, with HiveContext, the query below:

{code}
SELECT LEAST(1, 1.5)
{code}

casts 1.5 as a double. So, here,
https://github.com/apache/spark/blob/162d04a30e38bb83d35865679145f8ea80b84c26/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L506-L508

the tightest common type of {{IntegerType}} and {{DoubleType}} is found to be 
{{DoubleType}}, and the query works.

But in the master branch, 1.5 is cast as decimal(2, 1), so no tightest common 
type can be found for {{IntegerType}} and {{DecimalType(2, 1)}}.
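
As a possible caller-side workaround (a sketch, not from the report; {{spark}} is 
assumed to be a 2.0 SparkSession), casting the arguments to a common type up 
front sidesteps the coercion failure:

{code}
// Cast both arguments to DOUBLE so LEAST does not have to reconcile
// IntegerType with DecimalType(2, 1).
val df = spark.sql("SELECT LEAST(CAST(1 AS DOUBLE), CAST(1.5 AS DOUBLE)) AS least_value")
df.show()  // expected to return 1.0 once the argument types agree
{code}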



> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works for 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16649) Push partition predicates down into the Hive metastore for OptimizeMetadataOnlyQuery

2016-07-20 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-16649:


 Summary: Push partition predicates down into the Hive metastore 
for OptimizeMetadataOnlyQuery
 Key: SPARK-16649
 URL: https://issues.apache.org/jira/browse/SPARK-16649
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Lianhui Wang


SPARK-6910 added support for pushing partition predicates down into the Hive 
metastore for table scans. OptimizeMetadataOnlyQuery should also push partition 
predicates down into the metastore.
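
For context, a hedged illustration of the query shape this targets (table, 
column, and predicate are made up, and Hive support is assumed to be enabled): a 
metadata-only query over partition columns, where the partition predicate could 
be evaluated by the metastore instead of listing every partition first.

{code}
// The WHERE clause only references the partition column `dt`, so the predicate
// could be pushed into the metastore call behind OptimizeMetadataOnlyQuery.
spark.sql("CREATE TABLE IF NOT EXISTS events (value INT) PARTITIONED BY (dt STRING)")
val latest = spark.sql("SELECT MAX(dt) AS latest_dt FROM events WHERE dt >= '2016-07-01'")
latest.show()
{code}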



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16649) Push partition predicates down into metastore for OptimizeMetadataOnlyQuery

2016-07-20 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-16649:
-
Summary: Push partition predicates down into metastore for 
OptimizeMetadataOnlyQuery  (was: Push partition predicates down into the Hive 
metastore for OptimizeMetadataOnlyQuery)

> Push partition predicates down into metastore for OptimizeMetadataOnlyQuery
> ---
>
> Key: SPARK-16649
> URL: https://issues.apache.org/jira/browse/SPARK-16649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> SPARK-6910 added support for pushing partition predicates down into the Hive 
> metastore for table scans. OptimizeMetadataOnlyQuery should also push partition 
> predicates down into the metastore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16649) Push partition predicates down into metastore for OptimizeMetadataOnlyQuery

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16649:


Assignee: Apache Spark

> Push partition predicates down into metastore for OptimizeMetadataOnlyQuery
> ---
>
> Key: SPARK-16649
> URL: https://issues.apache.org/jira/browse/SPARK-16649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>Assignee: Apache Spark
>
> SPARK-6910 added support for pushing partition predicates down into the Hive 
> metastore for table scans. OptimizeMetadataOnlyQuery should also push partition 
> predicates down into the metastore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16649) Push partition predicates down into metastore for OptimizeMetadataOnlyQuery

2016-07-20 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-16649:
-
Description: SPARK-6910 has supported for  pushing partition predicates 
down into the Hive metastore for table scan. So it also should push partition 
predicates down into metastore for OptimizeMetadataOnlyQuery.  (was: SPARK-6910 
has supported for  pushing partition predicates down into the Hive metastore 
for table scan. So it also should push partition predicates dow into metastore 
for OptimizeMetadataOnlyQuery.)

> Push partition predicates down into metastore for OptimizeMetadataOnlyQuery
> ---
>
> Key: SPARK-16649
> URL: https://issues.apache.org/jira/browse/SPARK-16649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> SPARK-6910 added support for pushing partition predicates down into the Hive 
> metastore for table scans. OptimizeMetadataOnlyQuery should also push partition 
> predicates down into the metastore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16649) Push partition predicates down into metastore for OptimizeMetadataOnlyQuery

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385848#comment-15385848
 ] 

Apache Spark commented on SPARK-16649:
--

User 'lianhuiwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/14285

> Push partition predicates down into metastore for OptimizeMetadataOnlyQuery
> ---
>
> Key: SPARK-16649
> URL: https://issues.apache.org/jira/browse/SPARK-16649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> SPARK-6910 added support for pushing partition predicates down into the Hive 
> metastore for table scans. OptimizeMetadataOnlyQuery should also push partition 
> predicates down into the metastore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16649) Push partition predicates down into metastore for OptimizeMetadataOnlyQuery

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16649:


Assignee: (was: Apache Spark)

> Push partition predicates down into metastore for OptimizeMetadataOnlyQuery
> ---
>
> Key: SPARK-16649
> URL: https://issues.apache.org/jira/browse/SPARK-16649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> SPARK-6910 added support for pushing partition predicates down into the Hive 
> metastore for table scans. OptimizeMetadataOnlyQuery should also push partition 
> predicates down into the metastore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15923) Spark Application rest api returns "no such app: "

2016-07-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15923.
---
   Resolution: Fixed
Fix Version/s: 2.0.1

Issue resolved by pull request 14163
[https://github.com/apache/spark/pull/14163]

> Spark Application rest api returns "no such app: "
> -
>
> Key: SPARK-15923
> URL: https://issues.apache.org/jira/browse/SPARK-15923
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
> Fix For: 2.0.1
>
>
> Env : secure cluster
> Scenario:
> * Run SparkPi application in yarn-client or yarn-cluster mode
> * After application finishes, check Spark HS rest api to get details like 
> jobs / executor etc. 
> {code}
> http://:18080/api/v1/applications/application_1465778870517_0001/1/executors{code}
>  
> Rest api return HTTP Code: 404 and prints "HTTP Data: no such app: 
> application_1465778870517_0001"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15923) Spark Application rest api returns "no such app: "

2016-07-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15923:
--
  Assignee: Weiqing Yang
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Spark Application rest api returns "no such app: "
> -
>
> Key: SPARK-15923
> URL: https://issues.apache.org/jira/browse/SPARK-15923
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>Assignee: Weiqing Yang
>Priority: Minor
> Fix For: 2.0.1
>
>
> Env : secure cluster
> Scenario:
> * Run SparkPi application in yarn-client or yarn-cluster mode
> * After application finishes, check Spark HS rest api to get details like 
> jobs / executor etc. 
> {code}
> http://:18080/api/v1/applications/application_1465778870517_0001/1/executors{code}
>  
> Rest api return HTTP Code: 404 and prints "HTTP Data: no such app: 
> application_1465778870517_0001"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16650) Improve documentation of spark.task.maxFailures

2016-07-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385868#comment-15385868
 ] 

Thomas Graves commented on SPARK-16650:
---

I'll try to put up a patch for this later today.

> Improve documentation of spark.task.maxFailures   
> 
>
> Key: SPARK-16650
> URL: https://issues.apache.org/jira/browse/SPARK-16650
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>
> The documentation for spark.task.maxFailures isn't clear about whether it 
> counts total task failures across the job or multiple failed attempts of a 
> single task.
> It turns out it's the latter: a single task has to fail spark.task.maxFailures 
> attempts before the job fails.
> We should make that clearer in the docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16629) UDTs can not be compared to DataTypes in dataframes.

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385876#comment-15385876
 ] 

Apache Spark commented on SPARK-16629:
--

User 'damnMeddlingKid' has created a pull request for this issue:
https://github.com/apache/spark/pull/14164

> UDTs can not be compared to DataTypes in dataframes.
> 
>
> Key: SPARK-16629
> URL: https://issues.apache.org/jira/browse/SPARK-16629
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>
> Currently UDTs cannot be compared to DataTypes even if their sqlTypes match. 
> This leads to errors like this:
> {code}
> In [12]: filtered = df.filter(df['udt_time'] > threshold)
> ---
> AnalysisException Traceback (most recent call last)
> /Users/franklyndsouza/dev/starscream/bin/starscream in ()
> > 1 thresholded = df.filter(df['udt_time'] > threshold)
> AnalysisException: u"cannot resolve '(`udt_time` > TIMESTAMP('2015-10-20 
> 01:00:00.0'))' due to data typ mismatch: '(`udt_time` > TIMESTAMP('2015-10-20 
> 01:00:00.0'))' requires (boolean or tinyint or smallint or int or bigint or 
> float or double or decimal or timestamp or date or string or binary) type, 
> not pythonuserdefined"
> {code}
> I've proposed a fix for this here: https://github.com/apache/spark/pull/14164
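
Until the linked fix lands, one possible caller-side workaround is sketched below 
(Scala API; the column name comes from the report, and whether the cast is 
accepted for a given UDT is an assumption): cast the UDT column to its underlying 
SQL type before comparing.

{code}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.TimestampType

// Hypothetical workaround: compare on the underlying TimestampType rather than
// the UDT itself, assuming the UDT's sqlType is a timestamp and the cast is allowed.
val threshold = java.sql.Timestamp.valueOf("2015-10-20 01:00:00")
val filtered = df.filter(col("udt_time").cast(TimestampType) > threshold)
{code}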



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16629) UDTs can not be compared to DataTypes in dataframes.

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16629:


Assignee: Apache Spark

> UDTs can not be compared to DataTypes in dataframes.
> 
>
> Key: SPARK-16629
> URL: https://issues.apache.org/jira/browse/SPARK-16629
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Assignee: Apache Spark
>
> Currently UDTs cannot be compared to DataTypes even if their sqlTypes match. 
> This leads to errors like this:
> {code}
> In [12]: filtered = df.filter(df['udt_time'] > threshold)
> ---
> AnalysisException Traceback (most recent call last)
> /Users/franklyndsouza/dev/starscream/bin/starscream in ()
> > 1 thresholded = df.filter(df['udt_time'] > threshold)
> AnalysisException: u"cannot resolve '(`udt_time` > TIMESTAMP('2015-10-20 
> 01:00:00.0'))' due to data typ mismatch: '(`udt_time` > TIMESTAMP('2015-10-20 
> 01:00:00.0'))' requires (boolean or tinyint or smallint or int or bigint or 
> float or double or decimal or timestamp or date or string or binary) type, 
> not pythonuserdefined"
> {code}
> I've proposed a fix for this here: https://github.com/apache/spark/pull/14164



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16629) UDTs can not be compared to DataTypes in dataframes.

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16629:


Assignee: (was: Apache Spark)

> UDTs can not be compared to DataTypes in dataframes.
> 
>
> Key: SPARK-16629
> URL: https://issues.apache.org/jira/browse/SPARK-16629
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>
> Currently UDTs cannot be compared to DataTypes even if their sqlTypes match. 
> This leads to errors like this:
> {code}
> In [12]: filtered = df.filter(df['udt_time'] > threshold)
> ---
> AnalysisException Traceback (most recent call last)
> /Users/franklyndsouza/dev/starscream/bin/starscream in ()
> > 1 thresholded = df.filter(df['udt_time'] > threshold)
> AnalysisException: u"cannot resolve '(`udt_time` > TIMESTAMP('2015-10-20 
> 01:00:00.0'))' due to data typ mismatch: '(`udt_time` > TIMESTAMP('2015-10-20 
> 01:00:00.0'))' requires (boolean or tinyint or smallint or int or bigint or 
> float or double or decimal or timestamp or date or string or binary) type, 
> not pythonuserdefined"
> {code}
> I've proposed a fix for this here: https://github.com/apache/spark/pull/14164



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-07-20 Thread liancheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385886#comment-15385886
 ] 

liancheng commented on SPARK-16648:
---

The problematic code is the newly introduced {{TreeNode.withNewChildren}}. 
{{Last}} is a unary expression with two {{Expression}} arguments:

{code}
case class Last(child: Expression, ignoreNullsExpr: Expression)
  extends DeclarativeAggregate {
  ...
  override def children: Seq[Expression] = child :: Nil
  ...
}
{code}

Argument {{ignoreNullsExpr}} defaults to {{Literal.FalseLiteral}}. Thus 
{{LAST_VALUE(FALSE)}} is equivalent to {{Last(Literal.FalseLiteral, 
Literal.FalseLiteral)}}. This breaks the following case branch in 
{{TreeNode.withNewChildren}}:

{code}
// Both `child` and `ignoreNullsExpr` hit this branch,
// but only `child` is a real child node of `Last`.
case arg: TreeNode[_] if containsChild(arg) =>
  val newChild = remainingNewChildren.remove(0)
  val oldChild = remainingOldChildren.remove(0)
  if (newChild fastEquals oldChild) {
    oldChild
  } else {
    changed = true
    newChild
  }
{code}
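
The failure mode can be reproduced outside Spark with a plain {{ArrayBuffer}} (a 
minimal sketch of the reasoning above, not Spark code): {{children}} supplies one 
new child, but the product iterator visits two constructor arguments that both 
compare equal to that child, so {{remove(0)}} is called twice on a one-element 
buffer.

{code}
import scala.collection.mutable.ArrayBuffer

// One new child is produced for `child`, but both `child` and `ignoreNullsExpr`
// match the containsChild check because they are equal literals.
val remainingNewChildren = ArrayBuffer("newChildForChild")
remainingNewChildren.remove(0)  // consumed when visiting `child`
remainingNewChildren.remove(0)  // visiting `ignoreNullsExpr` -> IndexOutOfBoundsException: 0
{code}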


> LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
> --
>
> Key: SPARK-16648
> URL: https://issues.apache.org/jira/browse/SPARK-16648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>
> {code:sql}
> SELECT LAST_VALUE(FALSE) OVER ();
> {code}
> Exception thrown:
> {noformat}
> java.lang.IndexOutOfBoundsException: 0
>   at 
> scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
>   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
>   at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
>   at 
> 

[jira] [Updated] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-07-20 Thread liancheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liancheng updated SPARK-16648:
--
Description: 
The following simple SQL query reproduces this issue:

{code:sql}
SELECT LAST_VALUE(FALSE) OVER ();
{code}

Exception thrown:

{noformat}
java.lang.IndexOutOfBoundsException: 0
  at 
scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
  at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
  at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
  at 
org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
  at 
org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveExpressions$1.applyOrElse(LogicalPlan.scala:79)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveExpressions$1.applyOrElse(LogicalPlan.scala:78)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scal

[jira] [Created] (SPARK-16650) Improve documentation of spark.task.maxFailures

2016-07-20 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-16650:
-

 Summary: Improve documentation of spark.task.maxFailures   
 Key: SPARK-16650
 URL: https://issues.apache.org/jira/browse/SPARK-16650
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.6.2
Reporter: Thomas Graves


The documentation for spark.task.maxFailures isn't clear about whether it counts 
total task failures across the job or multiple failed attempts of a single task.

It turns out it's the latter: a single task has to fail spark.task.maxFailures 
attempts before the job fails.

We should make that clearer in the docs.
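
For reference, a minimal sketch of where the setting lives (the app name and 
value are arbitrary examples, not recommendations):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// spark.task.maxFailures bounds the attempts of a *single* task; the job fails
// only once some one task has failed this many times.
val conf = new SparkConf()
  .setAppName("max-failures-example")
  .set("spark.task.maxFailures", "8")
val sc = new SparkContext(conf)
{code}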



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Tom Phillips (JIRA)
Tom Phillips created SPARK-16651:


 Summary: No exception using DataFrame.withColumnRenamed when 
existing column doesn't exist
 Key: SPARK-16651
 URL: https://issues.apache.org/jira/browse/SPARK-16651
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.0
Reporter: Tom Phillips


The {{withColumnRenamed}} method does not raise an exception when the existing 
column does not exist in the dataframe.

Example:

{code}
In [4]: df.show()
+---+-+
|age| name|
+---+-+
|  1|Alice|
+---+-+


In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')

In [6]: df.show()
+---+-+
|age| name|
+---+-+
|  1|Alice|
+---+-+
{code}
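
Until the behaviour is changed, a caller-side guard can fail fast instead of 
silently returning the unchanged DataFrame. A hedged Scala sketch (the helper 
name is made up):

{code}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: refuse to rename a column that is not present.
def renameOrFail(df: DataFrame, existing: String, renamed: String): DataFrame = {
  require(df.columns.contains(existing),
    s"Column '$existing' not found among: ${df.columns.mkString(", ")}")
  df.withColumnRenamed(existing, renamed)
}
{code}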



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]

2016-07-20 Thread Daniel Barclay (JIRA)
Daniel Barclay created SPARK-16652:
--

 Summary: JVM crash from unsafe memory access for Dataset of class 
with List[Long]
 Key: SPARK-16652
 URL: https://issues.apache.org/jira/browse/SPARK-16652
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.2, 1.6.1
 Environment: Scala 2.10.
JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)"
MacOs 10.11.2
Reporter: Daniel Barclay


Generating and writing out a {{Dataset}} of a class that has a {{List}} (or at 
least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
crash.

The crash seems to be related to unsafe memory access especially because 
earlier code (before I got it reduced to the current bug test case)  reported 
"{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
operation in compiled Java code}}".
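
A hedged sketch of the shape described (1.6-style API; the class, values, and 
output path are made up, not taken from the attached test):

{code}
import org.apache.spark.sql.SQLContext

case class Record(values: List[Long], label: String)  // hypothetical class

def writeRecords(sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._
  val ds = Seq(Record(List(1L, 2L, 3L), "a"), Record(Nil, "b")).toDS()
  ds.toDF().write.parquet("/tmp/records")  // writing out is where the crash is reported
}
{code}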





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]

2016-07-20 Thread Daniel Barclay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Barclay updated SPARK-16652:
---
Attachment: UnsafeAccessCrashBugTest.scala

> JVM crash from unsafe memory access for Dataset of class with List[Long]
> 
>
> Key: SPARK-16652
> URL: https://issues.apache.org/jira/browse/SPARK-16652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2
> Environment: Scala 2.10.
> JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)"
> MacOs 10.11.2
>Reporter: Daniel Barclay
> Attachments: UnsafeAccessCrashBugTest.scala
>
>
> Generating and writing out a {{Dataset}} of a class that has a {{List}} (or 
> at least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
> crash.
> The crash seems to be related to unsafe memory access especially because 
> earlier code (before I got it reduced to the current bug test case)  reported 
> "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code}}".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]

2016-07-20 Thread Daniel Barclay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385966#comment-15385966
 ] 

Daniel Barclay commented on SPARK-16652:


SPARK-3947 reports the same InternalError message that I saw in trimming down 
my test case for SPARK-16652, although the stack traces are much different.

> JVM crash from unsafe memory access for Dataset of class with List[Long]
> 
>
> Key: SPARK-16652
> URL: https://issues.apache.org/jira/browse/SPARK-16652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2
> Environment: Scala 2.10.
> JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)"
> MacOs 10.11.2
>Reporter: Daniel Barclay
> Attachments: UnsafeAccessCrashBugTest.scala
>
>
> Generating and writing out a {{Dataset}} of a class that has a {{List}} (or 
> at least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
> crash.
> The crash seems to be related to unsafe memory access especially because 
> earlier code (before I got it reduced to the current bug test case)  reported 
> "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code}}".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]

2016-07-20 Thread Daniel Barclay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Barclay updated SPARK-16652:
---
Description: 
Generating and writing out a {{Dataset}} of a class that has a {{List}} (at 
least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
crash.

The crash seems to be related to unsafe memory access especially because 
earlier code (before I got it reduced to the current bug test case)  reported 
"{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
operation in compiled Java code}}".



  was:
Generating and writing out a {{Dataset}} of a class that has a {{List}} (or at 
least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
crash.

The crash seems to be related to unsafe memory access especially because 
earlier code (before I got it reduced to the current bug test case)  reported 
"{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
operation in compiled Java code}}".




> JVM crash from unsafe memory access for Dataset of class with List[Long]
> 
>
> Key: SPARK-16652
> URL: https://issues.apache.org/jira/browse/SPARK-16652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2
> Environment: Scala 2.10.
> JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)"
> MacOs 10.11.2
>Reporter: Daniel Barclay
> Attachments: UnsafeAccessCrashBugTest.scala
>
>
> Generating and writing out a {{Dataset}} of a class that has a {{List}} (at 
> least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
> crash.
> The crash seems to be related to unsafe memory access especially because 
> earlier code (before I got it reduced to the current bug test case)  reported 
> "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code}}".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]

2016-07-20 Thread Daniel Barclay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385984#comment-15385984
 ] 

Daniel Barclay commented on SPARK-16652:


More info:

Having the {{String}} member isn't necessary to trigger the bug.

It seems any element type fails: {{List\[Long]}}, {{List\[Int]}}, 
{{List\[Double]}}, {{List\[Any]}}, {{List\[AnyRef]}} and {{List\[String]}} all 
fail.

> JVM crash from unsafe memory access for Dataset of class with List[Long]
> 
>
> Key: SPARK-16652
> URL: https://issues.apache.org/jira/browse/SPARK-16652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2
> Environment: Scala 2.10.
> JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)"
> MacOs 10.11.2
>Reporter: Daniel Barclay
> Attachments: UnsafeAccessCrashBugTest.scala
>
>
> Generating and writing out a {{Dataset}} of a class that has a {{List}} (at 
> least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM 
> crash.
> The crash seems to be related to unsafe memory access especially because 
> earlier code (before I got it reduced to the current bug test case)  reported 
> "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code}}".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3210) Flume Polling Receiver must be more tolerant to connection failures.

2016-07-20 Thread Ian Brooks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386002#comment-15386002
 ] 

Ian Brooks commented on SPARK-3210:
---

Hi,

I have also noticed a resiliency issue with the Flume polling receiver. The 
issue I see is as follows:

1. Start the Flume agent, then the Spark application.
2. The Spark application correctly connects to the Flume agent and receives the 
data that is sent to Flume.
3. Restart Flume.
4. The Spark application doesn't detect that Flume has been restarted, so it 
never reconnects and receives no more data until the application itself is 
restarted.

I've had a trawl through the documentation and source code for 
FlumeUtils.createPollingStream but couldn't see any way to detect this and 
reconnect if needed.

-Ian 
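
For reference, a minimal sketch of the polling setup described above (host, 
port, and batch interval are placeholders, not taken from the report):

{code}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val ssc = new StreamingContext(sc, Seconds(5))
// Polls the Flume agent's Spark sink; if the agent restarts, this stream
// currently keeps using the dead connection and receives no further data.
val flumeStream = FlumeUtils.createPollingStream(ssc, "flume-host", 9999)
flumeStream.count().print()
ssc.start()
{code}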

> Flume Polling Receiver must be more tolerant to connection failures.
> 
>
> Key: SPARK-3210
> URL: https://issues.apache.org/jira/browse/SPARK-3210
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Hari Shreedharan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS

2016-07-20 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-16653:
--

 Summary: Make convergence tolerance param in ANN default value 
consistent with other algorithm using LBFGS
 Key: SPARK-16653
 URL: https://issues.apache.org/jira/browse/SPARK-16653
 Project: Spark
  Issue Type: Improvement
Reporter: Weichen Xu


The default value of the convergence tolerance param in ANN is 1e-4, but other 
algorithms that use the LBFGS optimizer (such as LinearRegression, 
LogisticRegression, and so on) default this param to 1e-6.

I think making them the same would be better.
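
Until the defaults are aligned, the tolerance can be set explicitly. A hedged 
sketch (the layer sizes are an arbitrary example, not from the issue):

{code}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 3))  // example network, not from the issue
  .setTol(1e-6)               // match the LinearRegression/LogisticRegression default
{code}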



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS

2016-07-20 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-16653:
---
Component/s: Optimizer
 ML

> Make convergence tolerance param in ANN default value consistent with other 
> algorithm using LBFGS
> -
>
> Key: SPARK-16653
> URL: https://issues.apache.org/jira/browse/SPARK-16653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Optimizer
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The default value of the convergence tolerance param in ANN is 1e-4, but other 
> algorithms that use the LBFGS optimizer (such as LinearRegression, 
> LogisticRegression, and so on) default this param to 1e-6.
> I think making them the same would be better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386028#comment-15386028
 ] 

Apache Spark commented on SPARK-16653:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/14286

> Make convergence tolerance param in ANN default value consistent with other 
> algorithm using LBFGS
> -
>
> Key: SPARK-16653
> URL: https://issues.apache.org/jira/browse/SPARK-16653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Optimizer
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The default value of the convergence tolerance param in ANN is 1e-4, but other 
> algorithms that use the LBFGS optimizer (such as LinearRegression, 
> LogisticRegression, and so on) default this param to 1e-6.
> I think making them the same would be better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16653:


Assignee: Apache Spark

> Make convergence tolerance param in ANN default value consistent with other 
> algorithm using LBFGS
> -
>
> Key: SPARK-16653
> URL: https://issues.apache.org/jira/browse/SPARK-16653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Optimizer
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The default value of the convergence tolerance param in ANN is 1e-4, but other 
> algorithms that use the LBFGS optimizer (such as LinearRegression, 
> LogisticRegression, and so on) default this param to 1e-6.
> I think making them the same would be better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16653) Make convergence tolerance param in ANN default value consistent with other algorithm using LBFGS

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16653:


Assignee: (was: Apache Spark)

> Make convergence tolerance param in ANN default value consistent with other 
> algorithm using LBFGS
> -
>
> Key: SPARK-16653
> URL: https://issues.apache.org/jira/browse/SPARK-16653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Optimizer
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The default value of the convergence tolerance param in ANN is 1e-4, but other 
> algorithms that use the LBFGS optimizer (such as LinearRegression, 
> LogisticRegression, and so on) default this param to 1e-6.
> I think making them the same would be better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache

2016-07-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15968:

Fix Version/s: (was: 2.1.0)

> HiveMetastoreCatalog does not correctly validate partitioned metastore 
> relation when searching the internal table cache
> ---
>
> Key: SPARK-15968
> URL: https://issues.apache.org/jira/browse/SPARK-15968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>Assignee: Michael Allman
>  Labels: hive, metastore
> Fix For: 2.0.0
>
>
> The {{getCached}} method of {{HiveMetastoreCatalog}} computes 
> {{pathsInMetastore}} from the metastore relation's catalog table. This only 
> returns the table base path, which is not correct for non-empty partitioned 
> tables. As a result, cache lookups on non-empty partitioned tables always 
> miss and these relations are always recomputed.
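
A hypothetical illustration of the mismatch (plain Scala, not Spark source code): 
the cached relation tracks partition directories, while the lookup key built from 
the catalog table only holds the base path, so the comparison can never succeed 
for a non-empty partitioned table.

{code}
val pathsInMetastore = Seq("/warehouse/events")  // base path only
val pathsInCachedRelation = Seq(
  "/warehouse/events/dt=2016-07-19",
  "/warehouse/events/dt=2016-07-20")
val useCached = pathsInMetastore.toSet == pathsInCachedRelation.toSet  // always false -> cache miss
{code}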



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16613) RDD.pipe returns values for empty partitions

2016-07-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16613.
-
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.1.0
   2.0.1

> RDD.pipe returns values for empty partitions
> 
>
> Key: SPARK-16613
> URL: https://issues.apache.org/jira/browse/SPARK-16613
> Project: Spark
>  Issue Type: Bug
>Reporter: Alex Krasnyansky
>Assignee: Sean Owen
> Fix For: 2.0.1, 2.1.0
>
>
> Suppose we have such Spark code
> {code}
> object PipeExample {
>   def main(args: Array[String]) {
> val fstRdd = sc.parallelize(List("hi", "hello", "how", "are", "you"))
> val pipeRdd = 
> fstRdd.pipe("/Users/finkel/spark-pipe-example/src/main/resources/len.sh")
> pipeRdd.collect.foreach(println)
>   }
> }
> {code}
> It uses a bash script to convert a string to its length.
> {code}
> #!/bin/sh
> read input
> len=${#input}
> echo $len
> {code}
> So far so good, but when I run the code, it prints incorrect output. For 
> example:
> {code}
> 0
> 2
> 0
> 5
> 3
> 0
> 3
> 3
> {code}
> I expect to see
> {code}
> 2
> 5
> 3
> 3
> 3
> {code}
> which is correct output for the app. I think it's a bug. It's expected to see 
> only positive integers and avoid zeros.
> Environment:
> 1. Spark version is 1.6.2
> 2. Scala version is 2.11.6
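
Prior to the fix, a hedged caller-side sketch of one way to avoid the symptom: 
with the default parallelism some partitions are empty, so the script reads an 
empty line and echoes 0; matching the number of slices to the data avoids empty 
partitions (the script path and data come from the report above).

{code}
val data = List("hi", "hello", "how", "are", "you")
// One element per partition, so no partition feeds the script empty input.
val fstRdd = sc.parallelize(data, numSlices = data.length)
val pipeRdd = fstRdd.pipe("/Users/finkel/spark-pipe-example/src/main/resources/len.sh")
pipeRdd.collect.foreach(println)  // expected: 2 5 3 3 3
{code}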



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13767) py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server

2016-07-20 Thread Rahul Bhatia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386192#comment-15386192
 ] 

Rahul Bhatia commented on SPARK-13767:
--

I'm seeing the error that Venkata showed as well. If anyone has any thoughts on 
why that would occur, I'd really appreciate it.

Thanks, 

> py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to 
> the Java server
> 
>
> Key: SPARK-13767
> URL: https://issues.apache.org/jira/browse/SPARK-13767
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Poonam Agrawal
>
> I am trying to create spark context object with the following commands on 
> pyspark:
> from pyspark import SparkContext, SparkConf
> conf = 
> SparkConf().setAppName('App_name').setMaster("spark://local-or-remote-ip:7077").set('spark.cassandra.connection.host',
>  'cassandra-machine-ip').set('spark.storage.memoryFraction', 
> '0.2').set('spark.rdd.compress', 'true').set('spark.streaming.blockInterval', 
> 500).set('spark.serializer', 
> 'org.apache.spark.serializer.KryoSerializer').set('spark.scheduler.mode', 
> 'FAIR').set('spark.mesos.coarse', 'true')
> sc = SparkContext(conf=conf)
> but I am getting the following error:
> Traceback (most recent call last):
> File "", line 1, in 
> File "/usr/local/lib/spark-1.4.1/python/pyspark/conf.py", line 106, in 
> __init__
>   self._jconf = _jvm.SparkConf(loadDefaults)
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 766, in __getattr__
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 362, in send_command
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 318, in _get_connection
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 325, in _create_connection
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 432, in start
> py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to 
> the Java server
> I am getting the same error executing the command : conf = 
> SparkConf().setAppName("App_name").setMaster("spark://127.0.0.1:7077")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache

2016-07-20 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-15968:
---
Fix Version/s: 2.0.0

> HiveMetastoreCatalog does not correctly validate partitioned metastore 
> relation when searching the internal table cache
> ---
>
> Key: SPARK-15968
> URL: https://issues.apache.org/jira/browse/SPARK-15968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>Assignee: Michael Allman
>  Labels: hive, metastore
> Fix For: 2.0.0, 2.1.0
>
>
> The {{getCached}} method of {{HiveMetastoreCatalog}} computes 
> {{pathsInMetastore}} from the metastore relation's catalog table. This only 
> returns the table base path, which is not correct for non-empty partitioned 
> tables. As a result, cache lookups on non-empty partitioned tables always 
> miss and these relations are always recomputed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15918) unionAll returns wrong result when two dataframes has schema in different order

2016-07-20 Thread Wade Salazar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386225#comment-15386225
 ] 

Wade Salazar commented on SPARK-15918:
--

Is the stance on this issue that, since other SQL interpreters require both 
SELECT statements to list columns in the same order, Spark will follow? Though 
this seems relatively innocuous, it leads to a considerable amount of 
troubleshooting when this behavior is not expected. Can we request, as an 
improvement in Spark SQL, that the requirement to order each SELECT statement's 
columns identically be eliminated?

> unionAll returns wrong result when two dataframes has schema in different 
> order
> ---
>
> Key: SPARK-15918
> URL: https://issues.apache.org/jira/browse/SPARK-15918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: CentOS
>Reporter: Prabhu Joseph
>
> On applying unionAll operation between A and B dataframes, they both has same 
> schema but in different order and hence the result has column value mapping 
> changed.
> Repro:
> {code}
> A.show()
> +---+--------+-------+------+------+-----+----+-------+------+-------+-------+-----+
> |tag|year_day|tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|value|
> +---+--------+-------+------+------+-----+----+-------+------+-------+-------+-----+
> +---+--------+-------+------+------+-----+----+-------+------+-------+-------+-----+
> B.show()
> +-----+-------------------+----------+-------+-------+------+------+------+-------+-------+------+--------+
> |dtype|                tag|      time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day|
> +-----+-------------------+----------+-------+-------+------+------+------+-------+-------+------+--------+
> |    F|C_FNHXUT701Z.CNSTLO|1443790800|     13|      2|     0|    10|     0|    275|   2015|1.2345| 2015275|
> |    F|C_FNHXUDP713.CNSTHI|1443790800|     13|      2|     0|    10|     0|    275|   2015|1.2345| 2015275|
> |    F| C_FNHXUT718.CNSTHI|1443790800|     13|      2|     0|    10|     0|    275|   2015|1.2345| 2015275|
> |    F|C_FNHXUT703Z.CNSTLO|1443790800|     13|      2|     0|    10|     0|    275|   2015|1.2345| 2015275|
> |    F|C_FNHXUR716A.CNSTLO|1443790800|     13|      2|     0|    10|     0|    275|   2015|1.2345| 2015275|
> |    F|C_FNHXUT803Z.CNSTHI|1443790800|     13|      2|     0|    10|     0|    275|   2015|1.2345| 2015275|
> |    F| C_FNHXUT728.CNSTHI|1443790800|     13|      2|     0|    10|     0|    275|   2015|1.2345| 2015275|
> |    F| C_FNHXUR806.CNSTHI|1443790800|     13|      2|     0|    10|     0|    275|   2015|1.2345| 2015275|
> +-----+-------------------+----------+-------+-------+------+------+------+-------+-------+------+--------+
> A = A.unionAll(B)
> A.show()
> +---+-------------------+----------+------+------+-----+----+-------+------+-------+-------+---------+
> |tag|           year_day|   tm_hour|tm_min|tm_sec|dtype|time|tm_mday|tm_mon|tm_yday|tm_year|    value|
> +---+-------------------+----------+------+------+-----+----+-------+------+-------+-------+---------+
> |  F|C_FNHXUT701Z.CNSTLO|1443790800|    13|     2|    0|  10|      0|   275|   2015| 1.2345|2015275.0|
> |  F|C_FNHXUDP713.CNSTHI|1443790800|    13|     2|    0|  10|      0|   275|   2015| 1.2345|2015275.0|
> |  F| C_FNHXUT718.CNSTHI|1443790800|    13|     2|    0|  10|      0|   275|   2015| 1.2345|2015275.0|
> |  F|C_FNHXUT703Z.CNSTLO|1443790800|    13|     2|    0|  10|      0|   275|   2015| 1.2345|2015275.0|
> |  F|C_FNHXUR716A.CNSTLO|1443790800|    13|     2|    0|  10|      0|   275|   2015| 1.2345|2015275.0|
> |  F|C_FNHXUT803Z.CNSTHI|1443790800|    13|     2|    0|  10|      0|   275|   2015| 1.2345|2015275.0|
> |  F| C_FNHXUT728.CNSTHI|1443790800|    13|     2|    0|  10|      0|   275|   2015| 1.2345|2015275.0|
> |  F| C_FNHXUR806.CNSTHI|1443790800|    13|     2|    0|  10|      0|   275|   2015| 1.2345|2015275.0|
> +---+-------------------+----------+------+------+-----+----+-------+------+-------+-------+---------+
> {code}
> On changing the column order of A to match B and then doing unionAll, it works fine:
> {code}
> C = A.select("dtype","tag","time","tm_hour","tm_mday","tm_min","tm_mon","tm_sec","tm_yday","tm_year","value","year_day")
> A = C.unionAll(B)
> A.show()
> +-+---+--+---+---+--+--+--+---+---+--++
> |dtype|tag|  
> time|tm_hour|tm_mday|tm_min|tm_mon|tm_sec|tm_yday|tm_year| value|year_day|
> +-+---+--+---+---+--+--+--+---+---+--++
> |F|C_FNHXUT701Z.CNSTLO|1443790800| 13|  2| 0|10| 0|   
>  275|   2015|1.2345| 2015275
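The same workaround can be written without hard-coding the column list. A hedged PySpark sketch, assuming A and B share the same column names: reorder one dataframe's columns to match the other before the union.

{code}
# Sketch: align A's column order with B's before unionAll, so values map to the
# intended columns. Assumes A and B have identical column names.
aligned = A.select(B.columns)
result = aligned.unionAll(B)
{code}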

[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386235#comment-15386235
 ] 

Thomas Graves commented on SPARK-2666:
--

I think eventually adding prestart (a MapReduce slowstart-type setting) makes 
sense. This is actually why I didn't change the map output statuses to go along 
with task launch: I wanted to be able to do this or get incremental map output 
status results.

But as far as keeping the remaining tasks running, I think it depends on the 
behavior, and I haven't had time to look in more detail.

If the stage fails, what tasks does it rerun:

1) does it rerun all the ones not yet succeeded in the failed stage (including 
the ones that could still be running)?
2) does it only rerun the failed ones and wait for the ones still running in 
the failed stage? If those succeed, it uses their results.

From what I saw with this job, I thought it was acting like number 1 above. The 
only use of leaving the others running is to see if they get FetchFailures; that 
seems like a lot of overhead to find that out if a task takes a long time.

When a fetch failure happens, does the scheduler re-run all maps that had run on 
that node or just the ones specifically mentioned by the fetch failure? Again, 
I thought it was just the specific map output that the fetch failed to get, 
which is why it needs to know whether the other reducers get fetch failures.

I can kind of understand letting them run to see if they hit fetch failures as 
well, but on a large job, or with tasks that take a long time, if we aren't 
counting them as successes then it's more of a waste of resources: it extends 
the job time and confuses the user, since the UI doesn't represent those 
still running.

In the case I was seeing, my tasks took roughly an hour. One stage failed, so 
it restarted that stage, but since it didn't kill the tasks from the original 
stage it had very few executors open to run new ones, so the job took a lot 
longer than it should have. I don't remember the exact cause of the failures 
anymore.

Anyway, I think the results are going to vary a lot based on the type of job and 
the length of each stage (map vs reduce).

Personally, I think it would be better to fail all maps that ran on the host it 
failed to fetch from and kill the rest of the running reducers in that stage. 
But I would have to investigate the code more to fully understand.



> Always try to cancel running tasks when a stage is marked as zombie
> ---
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a 
> "zombie" before the task set has completed all of its tasks.  For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the 
> ShuffleMapOutput, though no attempt has completed all its tasks (at least, 
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (there may be others, I'm not sure if this list is exhaustive.)
> Marking a taskset as zombie prevents any *additional* tasks from getting 
> scheduled, however it does not cancel all currently running tasks.  We should 
> cancel all running to avoid wasting resources (and also to make the behavior 
> a little more clear to the end user).  Rather than canceling tasks in each 
> case piecemeal, we should refactor the scheduler so that these two actions 
> are always taken together -- canceling tasks should go hand-in-hand with 
> marking the taskset as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it 
> behind a method like {{markZombie}} or something.
> * marking a stage as zombie before the all tasks have completed does *not* 
> necessarily mean the stage attempt has failed.  In case (a), the stage 
> attempt has failed, but in stage (b) we are not canceling b/c of a failure, 
> rather just b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.  
> However, it also has some side-effects like logging that the stage has failed 
> and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) 
> when nothing has failed.  So it may need some additional refactoring to go 
> along w/ {{markZombie}}.
> * {{SchedulerBackend}}'s are free to not implement {{killTask}}, so we need 
> to be sure to catch the {{UnsupportedOperationException}} s
> * Testing this *might* benefit from SPARK-10372



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-15951) Change Executors Page to use datatables to support sorting columns and searching

2016-07-20 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-15951.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Change Executors Page to use datatables to support sorting columns and 
> searching
> 
>
> Key: SPARK-15951
> URL: https://issues.apache.org/jira/browse/SPARK-15951
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kishor Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> Support column sorting and search for the Executors page using jQuery DataTables 
> and the REST API. Before this change, the executors page was generated from 
> hard-coded HTML and could not support search; sorting was also disabled if any 
> application had more than one attempt. Supporting search and sort (over all 
> applications rather than the 20 entries on the current page) will greatly 
> improve the user experience.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15951) Change Executors Page to use datatables to support sorting columns and searching

2016-07-20 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-15951:
--
Assignee: Kishor Patil

> Change Executors Page to use datatables to support sorting columns and 
> searching
> 
>
> Key: SPARK-15951
> URL: https://issues.apache.org/jira/browse/SPARK-15951
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kishor Patil
>Assignee: Kishor Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> Support column sorting and search for the Executors page using jQuery DataTables 
> and the REST API. Before this change, the executors page was generated from 
> hard-coded HTML and could not support search; sorting was also disabled if any 
> application had more than one attempt. Supporting search and sort (over all 
> applications rather than the 20 entries on the current page) will greatly 
> improve the user experience.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16654) UI Should show blacklisted executors & nodes

2016-07-20 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-16654:


 Summary: UI Should show blacklisted executors & nodes
 Key: SPARK-16654
 URL: https://issues.apache.org/jira/browse/SPARK-16654
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, Web UI
Affects Versions: 2.0.0
Reporter: Imran Rashid


SPARK-8425 will add the ability to blacklist entire executors and nodes to deal 
w/ faulty hardware.  However, without displaying it on the UI, it can be hard 
to realize which executor is bad, and why tasks aren't getting scheduled on 
certain executors.

As a first step, we should just show nodes and executors that are blacklisted 
for the entire application (no need to show blacklisting for tasks & stages).

This should also ensure that blacklisting events get into the event logs for 
the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16654) UI Should show blacklisted executors & nodes

2016-07-20 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386242#comment-15386242
 ] 

Imran Rashid commented on SPARK-16654:
--

cc [~tgraves]

> UI Should show blacklisted executors & nodes
> 
>
> Key: SPARK-16654
> URL: https://issues.apache.org/jira/browse/SPARK-16654
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>
> SPARK-8425 will add the ability to blacklist entire executors and nodes to 
> deal w/ faulty hardware.  However, without displaying it on the UI, it can be 
> hard to realize which executor is bad, and why tasks aren't getting scheduled 
> on certain executors.
> As a first step, we should just show nodes and executors that are blacklisted 
> for the entire application (no need to show blacklisting for tasks & stages).
> This should also ensure that blacklisting events get into the event logs for 
> the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16634) GenericArrayData can't be loaded in certain JVMs

2016-07-20 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16634.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.1.0
   2.0.1

> GenericArrayData can't be loaded in certain JVMs
> 
>
> Key: SPARK-16634
> URL: https://issues.apache.org/jira/browse/SPARK-16634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> There's an annoying bug in some JVMs that causes certain scala-generated 
> bytecode to not load. The current code in GenericArrayData.scala triggers 
> that bug (at least with 1.7.0_67, maybe others).
> Since it's easy to work around the bug, I'd rather do that instead of asking 
> people who might be running that version to have to upgrade.
> Error:
> {noformat}
> 16/07/19 16:02:35 INFO scheduler.TaskSetManager: Lost task 0.2 in stage 0.0 
> (TID 2) on executor vanzin-st1-3.vpc.cloudera.com: java.lang.VerifyError (Bad 
> <init> method call from inside of a branch
> Exception Details:
>   Location:
> 
> org/apache/spark/sql/catalyst/util/GenericArrayData.<init>(Ljava/lang/Object;)V
>  @52: invokespecial
>   Reason:
> Error exists in the bytecode
>   Bytecode:
> 000: 2a2b 4d2c c100 dc99 000e 2cc0 00dc 4e2d
> 010: 3a04 a700 20b2 0129 2c04 b601 2d99 001b
> 020: 2c3a 05b2 007a 1905 b600 7eb9 00fe 0100
> 030: 3a04 1904 b700 f3b1 bb01 2f59 2cb7 0131
> 040: bf 
>   Stackmap Table:
> 
> full_frame(@21,{UninitializedThis,Object[#177],Object[#177]},{UninitializedThis})
> 
> full_frame(@50,{UninitializedThis,Object[#177],Object[#177],Top,Object[#220]},{UninitializedThis})
> 
> full_frame(@56,{UninitializedThis,Object[#177],Object[#177]},{UninitializedThis})
> ) [duplicate 2]
> {noformat}
> I didn't run into this with 2.0, not sure whether the issue exists there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386286#comment-15386286
 ] 

Dongjoon Hyun commented on SPARK-16651:
---

Hi, [~tomwphillips].
It's documented behavior on the Scala side since 1.3.

{code}
def withColumnRenamed(existingName: String, newName: String): DataFrame
Returns a new DataFrame with a column renamed. This is a no-op if schema 
doesn't contain existingName.
{code}

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame

Maybe we can update the Python API, but we cannot change the behavior.
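For callers who prefer a hard failure over the documented no-op, a guard can be added on the caller side. The helper below is a hypothetical sketch, not part of the Spark API:

{code}
# Hypothetical helper (not part of Spark): fail fast when the column to rename is missing.
def rename_column_strict(df, existing, new):
    if existing not in df.columns:
        raise ValueError("column %r not found; available: %s" % (existing, df.columns))
    return df.withColumnRenamed(existing, new)

df = rename_column_strict(df, 'dob', 'date_of_birth')  # raises instead of silently doing nothing
{code}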

> No exception using DataFrame.withColumnRenamed when existing column doesn't 
> exist
> -
>
> Key: SPARK-16651
> URL: https://issues.apache.org/jira/browse/SPARK-16651
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Tom Phillips
>
> The {{withColumnRenamed}} method does not raise an exception when the 
> existing column does not exist in the dataframe.
> Example:
> {code}
> In [4]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')
> In [6]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386295#comment-15386295
 ] 

Dongjoon Hyun commented on SPARK-16651:
---

Also, in Spark 2.0 RC, Dataframe is merged into Dataset and the documentation 
of Dataset still has that notice.

{code}
def withColumnRenamed(existingName: String, newName: String): DataFrame
Returns a new Dataset with a column renamed. This is a no-op if schema doesn't 
contain existingName.
Since 2.0.0
{code}

http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/api/scala/index.html#org.apache.spark.sql.Dataset

> No exception using DataFrame.withColumnRenamed when existing column doesn't 
> exist
> -
>
> Key: SPARK-16651
> URL: https://issues.apache.org/jira/browse/SPARK-16651
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Tom Phillips
>
> The {{withColumnRenamed}} method does not raise an exception when the 
> existing column does not exist in the dataframe.
> Example:
> {code}
> In [4]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')
> In [6]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386295#comment-15386295
 ] 

Dongjoon Hyun edited comment on SPARK-16651 at 7/20/16 5:55 PM:


Also, in Spark 2.0 RC5, DataFrame is merged into Dataset and the documentation 
of Dataset still has that notice.

{code}
def withColumnRenamed(existingName: String, newName: String): DataFrame
Returns a new Dataset with a column renamed. This is a no-op if schema doesn't 
contain existingName.
Since 2.0.0
{code}

http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/api/scala/index.html#org.apache.spark.sql.Dataset


was (Author: dongjoon):
Also, in Spark 2.0 RC, Dataframe is merged into Dataset and the documentation 
of Dataset still has that notice.

{code}
def withColumnRenamed(existingName: String, newName: String): DataFrame
Returns a new Dataset with a column renamed. This is a no-op if schema doesn't 
contain existingName.
Since 2.0.0
{code}

http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/api/scala/index.html#org.apache.spark.sql.Dataset

> No exception using DataFrame.withColumnRenamed when existing column doesn't 
> exist
> -
>
> Key: SPARK-16651
> URL: https://issues.apache.org/jira/browse/SPARK-16651
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Tom Phillips
>
> The {{withColumnRenamed}} method does not raise an exception when the 
> existing column does not exist in the dataframe.
> Example:
> {code}
> In [4]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')
> In [6]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16650) Improve documentation of spark.task.maxFailures

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386311#comment-15386311
 ] 

Apache Spark commented on SPARK-16650:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/14287

> Improve documentation of spark.task.maxFailures   
> 
>
> Key: SPARK-16650
> URL: https://issues.apache.org/jira/browse/SPARK-16650
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>
> The documentation for spark.task.maxFailures isn't clear as to whether this 
> is just the total number of task failures or if it's a single task failing 
> multiple attempts.
> It turns out it's the latter: a single task has to fail spark.task.maxFailures 
> attempts before it fails the job.
> We should try to make that clearer in the docs.
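For illustration, the property can be set on a SparkConf; the value below is an arbitrary example, not a recommendation. The behavior being documented is that some single task must fail this many attempts before the whole job fails.

{code}
# Illustrative only: 8 is an arbitrary example value, not a recommendation.
# The job fails only once some single task has failed this many attempts.
from pyspark import SparkConf
conf = SparkConf().set("spark.task.maxFailures", "8")
{code}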



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16650) Improve documentation of spark.task.maxFailures

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16650:


Assignee: (was: Apache Spark)

> Improve documentation of spark.task.maxFailures   
> 
>
> Key: SPARK-16650
> URL: https://issues.apache.org/jira/browse/SPARK-16650
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>
> The documentation for spark.task.maxFailures isn't clear as to whether this 
> is just the total number of task failures or if it's a single task failing 
> multiple attempts.
> It turns out it's the latter: a single task has to fail spark.task.maxFailures 
> attempts before it fails the job.
> We should try to make that clearer in the docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16650) Improve documentation of spark.task.maxFailures

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16650:


Assignee: Apache Spark

> Improve documentation of spark.task.maxFailures   
> 
>
> Key: SPARK-16650
> URL: https://issues.apache.org/jira/browse/SPARK-16650
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.2
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> The documentation for spark.task.maxFailures isn't clear as to whether this 
> is just the total number of task failures or if it's a single task failing 
> multiple attempts.
> It turns out it's the latter: a single task has to fail spark.task.maxFailures 
> attempts before it fails the job.
> We should try to make that clearer in the docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16651:


Assignee: Apache Spark

> No exception using DataFrame.withColumnRenamed when existing column doesn't 
> exist
> -
>
> Key: SPARK-16651
> URL: https://issues.apache.org/jira/browse/SPARK-16651
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Tom Phillips
>Assignee: Apache Spark
>
> The {{withColumnRenamed}} method does not raise an exception when the 
> existing column does not exist in the dataframe.
> Example:
> {code}
> In [4]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')
> In [6]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16651:


Assignee: (was: Apache Spark)

> No exception using DataFrame.withColumnRenamed when existing column doesn't 
> exist
> -
>
> Key: SPARK-16651
> URL: https://issues.apache.org/jira/browse/SPARK-16651
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Tom Phillips
>
> The {{withColumnRenamed}} method does not raise an exception when the 
> existing column does not exist in the dataframe.
> Example:
> {code}
> In [4]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')
> In [6]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386317#comment-15386317
 ] 

Apache Spark commented on SPARK-16651:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/14288

> No exception using DataFrame.withColumnRenamed when existing column doesn't 
> exist
> -
>
> Key: SPARK-16651
> URL: https://issues.apache.org/jira/browse/SPARK-16651
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Tom Phillips
>
> The {{withColumnRenamed}} method does not raise an exception when the 
> existing column does not exist in the dataframe.
> Example:
> {code}
> In [4]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')
> In [6]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16655) Spark thrift server application is not stopped if its in ACCEPTED stage

2016-07-20 Thread Yesha Vora (JIRA)
Yesha Vora created SPARK-16655:
--

 Summary: Spark thrift server application is not stopped if its in 
ACCEPTED stage
 Key: SPARK-16655
 URL: https://issues.apache.org/jira/browse/SPARK-16655
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Yesha Vora


When the Spark Thrift Server is started in yarn-client mode, it starts a YARN 
application. If the YARN application is still in the ACCEPTED state when a stop 
operation is performed on the Thrift Server, the YARN application does not get 
killed/stopped.

On stop, the Spark Thrift Server should stop the YARN application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16642) ResolveWindowFrame should not be triggered on UnresolvedFunctions.

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16642:
-
Target Version/s: 2.0.0

> ResolveWindowFrame should not be triggered on UnresolvedFunctions.
> --
>
> Key: SPARK-16642
> URL: https://issues.apache.org/jira/browse/SPARK-16642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> The case at 
> https://github.com/apache/spark/blob/75146be6ba5e9f559f5f15430310bb476ee0812c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1790-L1792
>  is shown below
> {code}
> case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, 
> UnspecifiedFrame)) =>
>   val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, 
> acceptWindowFrame = true)
>   we.copy(windowSpec = s.copy(frameSpecification = frame))
> {code}
> This case will be triggered even when the function is unresolved. So, when 
> functions like lead are used, we may see errors like {{Window Frame RANGE 
> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame 
> ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING.}} because we wrongly set the 
> frame specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-20 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386333#comment-15386333
 ] 

Michael Allman commented on SPARK-16320:


[~maver1ck] Would it be possible for you to share your parquet file on S3? I 
would like to test with it specifically. If not publicly, could you share it 
with me privately? Thanks.

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions), and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar (about 1 sec).
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16654) UI Should show blacklisted executors & nodes

2016-07-20 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386338#comment-15386338
 ] 

Alex Bozarth commented on SPARK-16654:
--

Perhaps we can change the status column to "Blacklisted" or "Alive 
(Blacklisted)" instead of Alive or Dead? I'm not very familiar with how the 
blacklisting works, but I would be willing to learn and add this once the other 
PR is merged.

> UI Should show blacklisted executors & nodes
> 
>
> Key: SPARK-16654
> URL: https://issues.apache.org/jira/browse/SPARK-16654
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>
> SPARK-8425 will add the ability to blacklist entire executors and nodes to 
> deal w/ faulty hardware.  However, without displaying it on the UI, it can be 
> hard to realize which executor is bad, and why tasks aren't getting scheduled 
> on certain executors.
> As a first step, we should just show nodes and executors that are blacklisted 
> for the entire application (no need to show blacklisting for tasks & stages).
> This should also ensure that blacklisting events get into the event logs for 
> the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16651) No exception using DataFrame.withColumnRenamed when existing column doesn't exist

2016-07-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386360#comment-15386360
 ] 

Dongjoon Hyun commented on SPARK-16651:
---

I made a PR for you and other people, but I'm not sure it'll be merged.

> No exception using DataFrame.withColumnRenamed when existing column doesn't 
> exist
> -
>
> Key: SPARK-16651
> URL: https://issues.apache.org/jira/browse/SPARK-16651
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Tom Phillips
>
> The {{withColumnRenamed}} method does not raise an exception when the 
> existing column does not exist in the dataframe.
> Example:
> {code}
> In [4]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> In [5]: df = df.withColumnRenamed('dob', 'date_of_birth')
> In [6]: df.show()
> +---+-+
> |age| name|
> +---+-+
> |  1|Alice|
> +---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16644) constraints propagation may fail the query

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16644:
-
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> constraints propagation may fail the query
> --
>
> Key: SPARK-16644
> URL: https://issues.apache.org/jira/browse/SPARK-16644
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>
> {code}
> create table tbl(a int, b int);
> select
>   a,
>   max(b) as c1,
>   b as c2
> from tbl
> where a = b
> group by a, b
> having c1 = 1
> {code}
> this query fails in 2.0, but works in 1.6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16648:
-
Target Version/s: 2.0.1  (was: 2.0.0)

> LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException
> --
>
> Key: SPARK-16648
> URL: https://issues.apache.org/jira/browse/SPARK-16648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: liancheng
>
> The following simple SQL query reproduces this issue:
> {code:sql}
> SELECT LAST_VALUE(FALSE) OVER ();
> {code}
> Exception thrown:
> {noformat}
> java.lang.IndexOutOfBoundsException: 0
>   at 
> scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
>   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
>   at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:169)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:637)
>   at 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts$$anonfun$apply$12.applyOrElse(TypeCoercion.scala:615)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala

[jira] [Updated] (SPARK-16642) ResolveWindowFrame should not be triggered on UnresolvedFunctions.

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16642:
-
Target Version/s: 2.0.1  (was: 2.0.0)

> ResolveWindowFrame should not be triggered on UnresolvedFunctions.
> --
>
> Key: SPARK-16642
> URL: https://issues.apache.org/jira/browse/SPARK-16642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> The case at 
> https://github.com/apache/spark/blob/75146be6ba5e9f559f5f15430310bb476ee0812c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1790-L1792
>  is shown below
> {code}
> case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, 
> UnspecifiedFrame)) =>
>   val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, 
> acceptWindowFrame = true)
>   we.copy(windowSpec = s.copy(frameSpecification = frame))
> {code}
> This case will be triggered even when the function is unresolved. So, when 
> functions like lead are used, we may see errors like {{Window Frame RANGE 
> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame 
> ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING.}} because we wrongly set the 
> frame specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16633) lag/lead does not return the default value when the offset row does not exist

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16633:
-
Target Version/s: 2.0.1  (was: 2.0.0)

> lag/lead does not return the default value when the offset row does not exist
> -
>
> Key: SPARK-16633
> URL: https://issues.apache.org/jira/browse/SPARK-16633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
> Attachments: window_function_bug.html
>
>
> Please see the attached notebook. It seems lag/lead somehow fails to recognize 
> that an offset row does not exist and generates wrong results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16644) constraints propagation may fail the query

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16644:
-
Component/s: SQL

> constraints propagation may fail the query
> --
>
> Key: SPARK-16644
> URL: https://issues.apache.org/jira/browse/SPARK-16644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>
> {code}
> create table tbl(a int, b int);
> select
>   a,
>   max(b) as c1,
>   b as c2
> from tbl
> where a = b
> group by a, b
> having c1 = 1
> {code}
> this query fails in 2.0, but works in 1.6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16644) constraints propagation may fail the query

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16644:
-
Target Version/s: 2.0.1  (was: 2.0.0)

> constraints propagation may fail the query
> --
>
> Key: SPARK-16644
> URL: https://issues.apache.org/jira/browse/SPARK-16644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>
> {code}
> create table tbl(a int, b int);
> select
>   a,
>   max(b) as c1,
>   b as c2
> from tbl
> where a = b
> group by a, b
> having c1 = 1
> {code}
> this query fails in 2.0, but works in 1.6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15425) Disallow cartesian joins by default

2016-07-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15425:
-
Labels: release_notes releasenotes  (was: )

> Disallow cartesian joins by default
> ---
>
> Key: SPARK-15425
> URL: https://issues.apache.org/jira/browse/SPARK-15425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sameer Agarwal
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> It is fairly easy for users to shoot themselves in the foot if they run 
> cartesian joins. Often they might not even be aware of the join methods 
> chosen. This happened to me a few times in the last few weeks.
> It would be a good idea to disable cartesian joins by default, and require 
> explicit enabling of it via "crossJoin" method or in SQL "cross join". This 
> however might be too large of a scope for 2.0 given the timing. As a small 
> and quick fix, we can just have a single config option 
> (spark.sql.join.enableCartesian) that controls this behavior. In the future 
> we can implement the fine-grained control.
> Note that the error message should be friendly and say "Set 
> spark.sql.join.enableCartesian to true to turn on cartesian joins."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14887) Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits

2016-07-20 Thread Hayri Volkan Agun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386393#comment-15386393
 ] 

Hayri Volkan Agun commented on SPARK-14887:
---

The same issue in 1.6.2 can be reproduced with around 300 iterative UDF 
transformations after 20 unionAll calls on a dataframe of approximately 4000 rows (~25 columns)...
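A commonly used mitigation for this class of failure, offered only as a hedged sketch (it truncates lineage rather than fixing code generation): periodically materialize the intermediate dataframe so the generated projection does not keep growing. The {{transformations}} list, checkpoint directory, and interval below are assumptions for illustration; {{sqlctx}} is assumed to be an existing SQLContext/HiveContext.

{code}
# Lineage-truncation sketch: write the intermediate result out and read it back,
# which resets the query plan and keeps generated code from growing without bound.
# `transformations` (a list of df -> df functions), the path, and the interval of 50
# are placeholders for illustration.
tmp_dir = "/tmp/df_checkpoints"
for i, transform in enumerate(transformations):
    df = transform(df)
    if (i + 1) % 50 == 0:
        path = "%s/step_%d" % (tmp_dir, i)
        df.write.mode("overwrite").parquet(path)
        df = sqlctx.read.parquet(path)
{code}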

> Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits
> ---
>
> Key: SPARK-14887
> URL: https://issues.apache.org/jira/browse/SPARK-14887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: fang fang chen
>
> Similiar issue with SPARK-14138 and SPARK-8443:
> With large sql syntax(673K), following error happened:
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


