[jira] [Commented] (SPARK-18941) Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the directory associated with the Hive table (not EXTERNAL table) from the HDFS file system

2016-12-22 Thread luat (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15772224#comment-15772224
 ] 

luat commented on SPARK-18941:
--

Hi [~dongjoon],

It is OK, but I think this difference should be documented.

For some historical reasons, we're using 'create table ... location' in all of 
our ETL jobs to create Hive tables (not EXTERNAL tables). It still works on 
Spark 1.6.3: the directory associated with the Hive table is deleted when we 
drop the table.

We'll consider changing our ETL jobs so they also work correctly on Spark 2.0.
Thanks for your support.
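
For reference, a minimal workaround sketch under the assumption that jobs keep 
using 'create table ... location' on Spark 2.x: drop the table and then remove 
the directory explicitly via the Hadoop FileSystem API. The database, table 
name, and warehouse path below are hypothetical placeholders.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object DropTableCleanup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DropTableCleanup")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical table and warehouse path, for illustration only.
    val tablePath = new Path("hdfs:///user/hive/warehouse/mydb.db/my_table")

    spark.sql("DROP TABLE IF EXISTS mydb.my_table")

    // On Spark 2.x the table directory may be left behind, so remove it explicitly.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    if (fs.exists(tablePath)) {
      fs.delete(tablePath, true) // recursive delete
    }
  }
}
{code}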


> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the 
> directory associated with the Hive table (not EXTERNAL table) from the HDFS 
> file system
> -
>
> Key: SPARK-18941
> URL: https://issues.apache.org/jira/browse/SPARK-18941
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: luat
>
> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the 
> directory associated with the Hive table (not EXTERNAL table) from the HDFS 
> file system.






[jira] [Created] (SPARK-18987) I am trying to disable Spark Stage progress logs on cluster.

2016-12-22 Thread vidit Singh (JIRA)
vidit Singh created SPARK-18987:
---

 Summary: I am trying to disable Spark Stage progress logs on 
cluster.
 Key: SPARK-18987
 URL: https://issues.apache.org/jira/browse/SPARK-18987
 Project: Spark
  Issue Type: Question
  Components: Input/Output
 Environment: Hadoop Cluster
Reporter: vidit Singh


I am trying to disable the Spark stage progress output on the cluster, as I 
don't want to see those stages in the logs. I have tried every possible way to 
disable it, but the problem persists. I have tried the following:
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);

I even modified log4j with log4j.logger.org.apache.spark=OFF.

I have tried passing spark.ui.showConsoleProgress=false in the spark-submit 
command. Even passing an external log4j.properties along with spark-submit 
didn't work for me: --files log4j.properties


Output : 
[Stage 8:>(0 + 2) / 2][Stage 9:>(0 + 2) / 2][Stage 11:>   (0 + 2) / 2]
[Stage 8:>(0 + 2) / 2][Stage 9:>(0 + 2) / 2][Stage 13:>   (0 + 2) / 2]
[Stage 8:>(0 + 2) / 2][Stage 13:>   (0 + 2) / 2][Stage 15:>   (0 + 2) / 2]
[Stage 8:>(0 + 2) / 2][Stage 15:>   (0 + 2) / 2][Stage 20:==> (1 + 1) / 2]
[Stage 8:>  (0 + 2) / 2][Stage 15:> (0 + 2) / 2]
[Stage 8:=> (1 + 1) / 2][Stage 15:=>(1 + 1) / 2]
[Stage 15:=>(1 + 1) / 2]
[Stage 10:> (0 + 180) / 200]
[Stage 10:> (0 + 180) / 200][Stage 17:>   (0 + 0) / 200]
[Stage 10:> (2 + 180) / 200][Stage 17:>   (0 + 0) / 200]
[Stage 10:> (5 + 180) / 200][Stage 17:>   (0 + 0) / 200]
[Stage 10:>(14 + 180) / 200][Stage 17:>   (0 + 0) / 200]


I don't want to see these stages in my logs. Please help.
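
For what it's worth, a minimal sketch of the usual way to turn the console 
progress bar off (the app name is a placeholder): set 
spark.ui.showConsoleProgress=false on the SparkConf before the SparkContext is 
created; the same key can also be passed with --conf to spark-submit. Whether 
it takes effect depends on the setting being in place before the context starts.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object QuietApp {
  def main(args: Array[String]): Unit = {
    // Disable the [Stage N:> ...] console progress bar; this must be set
    // before the SparkContext is created.
    val conf = new SparkConf()
      .setAppName("QuietApp") // placeholder name
      .set("spark.ui.showConsoleProgress", "false")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    // ... job code ...
    spark.stop()
  }
}
{code}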






[jira] [Updated] (SPARK-18974) FileInputDStream could not detect files which were moved to the directory

2016-12-22 Thread Adam Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Wang updated SPARK-18974:
--
Description: 
FileInputDStream uses the modification time to find new files, but if a file is 
moved into the directory its modification time is not changed, so 
FileInputDStream cannot detect these files.

I think one way to fix this bug is to check the access_time instead, but that 
needs a Set of all the old files, which would be very inefficient for a 
directory with many files.

  was:
FileInputDStream use mod time to find new files, but if a file was moved into 
the directories it's modification time would not be changed, so 
FileInputDStream could not detected these files.

I think a way to fix this bug is get access_time and do judgment, bug it need a 
Set of files to save all old files, it would very inefficient for lot of files 
directory.


> FileInputDStream could not detect files which were moved to the directory 
> ---
>
> Key: SPARK-18974
> URL: https://issues.apache.org/jira/browse/SPARK-18974
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Input/Output
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Adam Wang
>
> FileInputDStream uses the modification time to find new files, but if a file 
> is moved into the directory its modification time is not changed, so 
> FileInputDStream cannot detect these files.
> I think one way to fix this bug is to check the access_time instead, but that 
> needs a Set of all the old files, which would be very inefficient for a 
> directory with many files.
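
As a workaround sketch (not the fix proposed above, and the path is 
hypothetical): after moving a file into the monitored directory, bump its 
modification time so mod-time-based scanning picks it up.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object TouchMovedFile {
  def main(args: Array[String]): Unit = {
    // Hypothetical path of a file that was just moved into the monitored directory.
    val moved = new Path("hdfs:///monitored/dir/part-00000")

    val fs = FileSystem.get(new Configuration())
    // Set the modification time to "now"; -1 leaves the access time unchanged.
    fs.setTimes(moved, System.currentTimeMillis(), -1)
  }
}
{code}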






[jira] [Commented] (SPARK-18978) Spark streaming ClassCastException

2016-12-22 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15772185#comment-15772185
 ] 

Liang-Chi Hsieh commented on SPARK-18978:
-

Actually, I can't reproduce this on the master branch with spark-shell. Are you 
using a specific Spark version?

> Spark streaming ClassCastException
> --
>
> Key: SPARK-18978
> URL: https://issues.apache.org/jira/browse/SPARK-18978
> Project: Spark
>  Issue Type: Bug
>Reporter: Keltoum BOUQOUROU
>
> I use Spark Streaming to monitor a directory. When a new file is detected, 
> the program performs some processing on the file. The program is the following:
> val conf = new SparkConf().setAppName("DocumentRanking").setMaster("local[*]")
> val sparkStreamingContext = new StreamingContext(conf, Seconds(5))
> val directoryStream = sparkStreamingContext.textFileStream("nom_dossier")
> directoryStream.foreachRDD(rdd => if (rdd.count() != 0)
>   rdd.foreach(line => traitement()))
> Processing of the first file added works without problems, but when the 
> second file is added I get the following error: 
> 16/12/22 09:11:37 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration 
> cannot be cast to [B
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)






[jira] [Updated] (SPARK-18974) FileInputDStream could not detect files which were moved to the directory

2016-12-22 Thread Adam Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Wang updated SPARK-18974:
--
Description: 
FileInputDStream uses the modification time to find new files, but if a file is 
moved into the directory its modification time is not changed, so 
FileInputDStream cannot detect these files.

I think one way to fix this bug is to check the access_time instead, but that 
needs a Set of all the old files, which would be very inefficient for a 
directory with many files.

  was:
FileInputDStream use mod time to find new files, but if a file was moved into 
the directories it's modification time would not changed, so FileInputDStream 
could not detected these files.
I think a way to fix this bug is get access_time and do judgment, bug it need a 
Set of files to save all old files, it would very inefficient for lot of files 
directory.


> FileInputDStream could not detect files which were moved to the directory 
> ---
>
> Key: SPARK-18974
> URL: https://issues.apache.org/jira/browse/SPARK-18974
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Input/Output
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Adam Wang
>
> FileInputDStream uses the modification time to find new files, but if a file 
> is moved into the directory its modification time is not changed, so 
> FileInputDStream cannot detect these files.
> I think one way to fix this bug is to check the access_time instead, but that 
> needs a Set of all the old files, which would be very inefficient for a 
> directory with many files.






[jira] [Assigned] (SPARK-18986) ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator

2016-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18986:


Assignee: Apache Spark

> ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its 
> iterator
> -
>
> Key: SPARK-18986
> URL: https://issues.apache.org/jira/browse/SPARK-18986
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> {{ExternalAppendOnlyMap.forceSpill}} now uses an assert to check that the 
> iterator in the map is not null. However, the assertion only holds after the 
> map has been asked for its iterator. Before that, if another memory consumer 
> asks for more memory than is currently available, 
> {{ExternalAppendOnlyMap.forceSpill}} can also be called. In this case, we see 
> a failure like this:
> {code}
> [info]   java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:156)
> [info]   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
> [info]   at 
> org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
> [info]   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnly
> MapSuite.scala:294)
> {code}






[jira] [Commented] (SPARK-18986) ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator

2016-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15772013#comment-15772013
 ] 

Apache Spark commented on SPARK-18986:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/16387

> ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its 
> iterator
> -
>
> Key: SPARK-18986
> URL: https://issues.apache.org/jira/browse/SPARK-18986
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> {{ExternalAppendOnlyMap.forceSpill}} now uses an assert to check that the 
> iterator in the map is not null. However, the assertion only holds after the 
> map has been asked for its iterator. Before that, if another memory consumer 
> asks for more memory than is currently available, 
> {{ExternalAppendOnlyMap.forceSpill}} can also be called. In this case, we see 
> a failure like this:
> {code}
> [info]   java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:156)
> [info]   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
> [info]   at 
> org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
> [info]   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnly
> MapSuite.scala:294)
> {code}






[jira] [Commented] (SPARK-18199) Support appending to Parquet files

2016-12-22 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15772014#comment-15772014
 ] 

Takeshi Yamamuro commented on SPARK-18199:
--

Have you checked https://github.com/apache/spark/pull/16281?
The Parquet community will release v1.8.2 with backports, and Spark is planning 
to upgrade to v1.8.2.
So, IIUC, Spark has no plan to upgrade to v1.9.0 for now.

> Support appending to Parquet files
> --
>
> Key: SPARK-18199
> URL: https://issues.apache.org/jira/browse/SPARK-18199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jeremy Smith
>
> Currently, appending to a Parquet directory involves simply creating new 
> parquet files in the directory. With many small appends (for example, in a 
> streaming job with a short batch duration) this leads to an unbounded number 
> of small Parquet files accumulating. These must be cleaned up with some 
> frequency by removing them all and rewriting a new file containing all the 
> rows.
> It would be far better if Spark supported appending to the Parquet files 
> themselves. HDFS supports this, as does Parquet:
> * The Parquet footer can be read in order to obtain necessary metadata.
> * The new rows can then be appended to the Parquet file as a row group.
> * A new footer can then be appended containing the metadata and referencing 
> the new row groups as well as the previously existing row groups.
> This would result in a small amount of bloat in the file as new row groups 
> are added (since duplicate metadata would accumulate) but it's hugely 
> preferable to accumulating small files, which is bad for HDFS health and also 
> eventually leads to Spark being unable to read the Parquet directory at all.  
> Periodic rewriting of the file could still be performed in order to remove 
> the duplicate metadata.
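
For context, a minimal sketch of how appends behave today (data and path are 
hypothetical): each append writes new part-*.parquet files into the directory 
rather than extending existing files, which is why frequent small appends 
accumulate many small files.

{code}
import org.apache.spark.sql.SparkSession

object ParquetAppendToday {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParquetAppendToday").getOrCreate()
    import spark.implicits._

    // Hypothetical micro-batch; every such call creates new part files under
    // the directory instead of appending row groups to existing ones.
    val batch = Seq((1, "a"), (2, "b")).toDF("id", "value")
    batch.write.mode("append").parquet("hdfs:///data/events")
  }
}
{code}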






[jira] [Assigned] (SPARK-18986) ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator

2016-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18986:


Assignee: (was: Apache Spark)

> ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its 
> iterator
> -
>
> Key: SPARK-18986
> URL: https://issues.apache.org/jira/browse/SPARK-18986
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> {{ExternalAppendOnlyMap.forceSpill}} now uses an assert to check that the 
> iterator in the map is not null. However, the assertion only holds after the 
> map has been asked for its iterator. Before that, if another memory consumer 
> asks for more memory than is currently available, 
> {{ExternalAppendOnlyMap.forceSpill}} can also be called. In this case, we see 
> a failure like this:
> {code}
> [info]   java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:156)
> [info]   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
> [info]   at 
> org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
> [info]   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnly
> MapSuite.scala:294)
> {code}






[jira] [Updated] (SPARK-18986) ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator

2016-12-22 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-18986:

Component/s: (was: SQL)
 Spark Core

> ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its 
> iterator
> -
>
> Key: SPARK-18986
> URL: https://issues.apache.org/jira/browse/SPARK-18986
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> {{ExternalAppendOnlyMap.forceSpill}} now uses an assert to check that the 
> iterator in the map is not null. However, the assertion only holds after the 
> map has been asked for its iterator. Before that, if another memory consumer 
> asks for more memory than is currently available, 
> {{ExternalAppendOnlyMap.forceSpill}} can also be called. In this case, we see 
> a failure like this:
> {code}
> [info]   java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:156)
> [info]   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
> [info]   at 
> org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
> [info]   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnly
> MapSuite.scala:294)
> {code}






[jira] [Created] (SPARK-18986) ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator

2016-12-22 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-18986:
---

 Summary: ExternalAppendOnlyMap shouldn't fail when forced to spill 
before calling its iterator
 Key: SPARK-18986
 URL: https://issues.apache.org/jira/browse/SPARK-18986
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh


{{ExternalAppendOnlyMap.forceSpill}} now uses an assert to check that the iterator 
in the map is not null. However, the assertion only holds after the map has been 
asked for its iterator. Before that, if another memory consumer asks for more 
memory than is currently available, {{ExternalAppendOnlyMap.forceSpill}} can also 
be called. In this case, we see a failure like this:

{code}
[info]   java.lang.AssertionError: assertion failed
[info]   at scala.Predef$.assert(Predef.scala:156)
[info]   at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
[info]   at 
org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
[info]   at 
org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnly
MapSuite.scala:294)
{code}
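
A self-contained sketch of the guard being discussed (the names are illustrative, 
not Spark's actual internals, and this is not the patch in the linked pull 
request): forceSpill declines instead of asserting when the reading iterator has 
not been created yet, so the memory manager simply moves on to another consumer.

{code}
object ForceSpillSketch extends App {
  class SpillableMapSketch {
    private var readingIterator: Option[Iterator[(String, Int)]] = None
    private var currentMap: Map[String, Int] = Map("a" -> 1, "b" -> 2)

    def iterator: Iterator[(String, Int)] = {
      val it = currentMap.iterator
      readingIterator = Some(it)
      it
    }

    // Called when another consumer needs memory; must not fail before iterator().
    def forceSpill(): Boolean = readingIterator match {
      case Some(_) =>
        currentMap = Map.empty // pretend the in-memory map was spilled to disk
        true
      case None =>
        false // nothing to spill through yet; decline instead of asserting
    }
  }

  val m = new SpillableMapSketch
  println(m.forceSpill()) // false: no iterator yet, and no assertion failure
  m.iterator
  println(m.forceSpill()) // true: spilled via the reading iterator
}
{code}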






[jira] [Commented] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata

2016-12-22 Thread David Rosenstrauch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771773#comment-15771773
 ] 

David Rosenstrauch commented on SPARK-9686:
---

Apologies, this actually wasn't the issue I was having. Sorry for the noise.

> Spark Thrift server doesn't return correct JDBC metadata 
> -
>
> Key: SPARK-9686
> URL: https://issues.apache.org/jira/browse/SPARK-9686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2
>Reporter: pin_zhang
>Assignee: Cheng Lian
>Priority: Critical
> Attachments: SPARK-9686.1.patch.txt
>
>
> 1. Start start-thriftserver.sh
> 2. Connect with beeline
> 3. Create a table
> 4. Run "show tables"; the newly created table is returned
> 5. Run the following JDBC code:
>   Class.forName("org.apache.hive.jdbc.HiveDriver");
>   String URL = "jdbc:hive2://localhost:1/default";
>   Properties info = new Properties();
>   Connection conn = DriverManager.getConnection(URL, info);
>   ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
>       null, null, null);
> Problem:
>   No tables are returned by this API; this worked in Spark 1.3.






[jira] [Updated] (SPARK-18981) The last job hung when speculation is on

2016-12-22 Thread roncenzhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

roncenzhao updated SPARK-18981:
---
Description: 
CONF:
spark.speculation   true
spark.dynamicAllocation.minExecutors   0
spark.executor.cores   2

When I run the following app, the bug is triggered.
```
sc.runJob(job1)
sleep(100s)
sc.runJob(job2) // job2 will hang and never be scheduled
```

The triggering conditions are as follows:
condition 1: During the sleep, the executors are released and the number of 
executors drops to zero a few seconds later. The #numExecutorsTarget in 
'ExecutorAllocationManager' will be 0.
condition 2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks 
becomes negative while job1's tasks finish.
condition 3: job2 has only one task.

result:
In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', #maxNeeded is 
calculated by 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is 
negative, #maxNeeded will be 0 or negative, so 'ExecutorAllocationManager' will 
not request containers from YARN. The app hangs.


In the attachments, submitting the app with 'run_scala.sh' leads to the 'hang' 
problem shown in 'job_hang.png'.


  was:
CONF:
spark.speculation   true
spark.dynamicAllocation.minExecutors0
spark.executor.cores   4

When I run the follow app, the bug will trigger.
```
sc.runJob(job1)
sleep(100s)
sc.runJob(job2) // the job2 will hang and never be scheduled
```

The triggering condition is described as follows:
condition1: During the sleeping time, the executors will be released and the # 
of the executor will be zero some seconds later. The #numExecutorsTarget in 
'ExecutorAllocationManager' will be 0.
condition2: In 'ExecutorAllocationListener.onTaskEnd()', the numRunningTasks 
will be negative during the ending of job1's tasks. 
condition3: The job2 only hava one task.

result:
In the method 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', we 
will calculate #maxNeeded in 'maxNumExecutorsNeeded()'. Obviously, 
#numRunningOrPendingTasks will be negative and the #maxNeeded will be 0 or 
negative. So the 'ExecutorAllocationManager' will not request container from 
yarn. The app will hang.


In the attachments, submitting the app by 'run_scala.sh' will lead to the 
'hang' problem as the 'job_hang.png' shows.



> The last job hung when speculation is on
> 
>
> Key: SPARK-18981
> URL: https://issues.apache.org/jira/browse/SPARK-18981
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
> hadoop2.5.0
>Reporter: roncenzhao
>Priority: Critical
> Attachments: Test.scala, job_hang.png, run_scala.sh
>
>
> CONF:
> spark.speculation   true
> spark.dynamicAllocation.minExecutors   0
> spark.executor.cores   2
> When I run the following app, the bug is triggered.
> ```
> sc.runJob(job1)
> sleep(100s)
> sc.runJob(job2) // job2 will hang and never be scheduled
> ```
> The triggering conditions are as follows:
> condition 1: During the sleep, the executors are released and the number of 
> executors drops to zero a few seconds later. The #numExecutorsTarget in 
> 'ExecutorAllocationManager' will be 0.
> condition 2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks 
> becomes negative while job1's tasks finish.
> condition 3: job2 has only one task.
> result:
> In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', #maxNeeded 
> is calculated by 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is 
> negative, #maxNeeded will be 0 or negative, so 'ExecutorAllocationManager' 
> will not request containers from YARN. The app hangs.
> In the attachments, submitting the app with 'run_scala.sh' leads to the 'hang' 
> problem shown in 'job_hang.png'.
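
A self-contained sketch of the kind of guard that would avoid the hang (the 
names echo the description above and are not Spark's actual fields, and this is 
not a proposed patch): clamp the possibly-negative running-task counter so the 
executor target cannot round down to zero while tasks are pending.

{code}
object MaxExecutorsSketch extends App {
  val tasksPerExecutor = 2

  def maxNumExecutorsNeeded(totalPendingTasks: Int, totalRunningTasks: Int): Int = {
    // totalRunningTasks can transiently go negative (condition 2 above); treat
    // it as zero instead of letting it cancel out the pending tasks.
    val numRunningOrPendingTasks = totalPendingTasks + math.max(totalRunningTasks, 0)
    (numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
  }

  // job2 has one pending task, but the counter was driven to -1 by job1's end:
  println(maxNumExecutorsNeeded(totalPendingTasks = 1, totalRunningTasks = -1)) // 1, not 0
}
{code}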






[jira] [Updated] (SPARK-18981) The last job hung when speculation is on

2016-12-22 Thread roncenzhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

roncenzhao updated SPARK-18981:
---
Description: 
CONF:
spark.speculation   true
spark.dynamicAllocation.minExecutors   0
spark.executor.cores   4

When I run the following app, the bug is triggered.
```
sc.runJob(job1)
sleep(100s)
sc.runJob(job2) // job2 will hang and never be scheduled
```

The triggering conditions are as follows:
condition 1: During the sleep, the executors are released and the number of 
executors drops to zero a few seconds later. The #numExecutorsTarget in 
'ExecutorAllocationManager' will be 0.
condition 2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks 
becomes negative while job1's tasks finish.
condition 3: job2 has only one task.

result:
In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', #maxNeeded is 
calculated by 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is 
negative, #maxNeeded will be 0 or negative, so 'ExecutorAllocationManager' will 
not request containers from YARN. The app hangs.


In the attachments, submitting the app with 'run_scala.sh' leads to the 'hang' 
problem shown in 'job_hang.png'.


  was:
CONF:
spark.speculation   true
spark.dynamicAllocation.minExecutors0
spark.executor.cores   4

When I run the follow app, the bug will trigger.
```
sc.runJob(job1)
sleep(100s)
sc.runJob(job2) // the job2 will hang and never be scheduled
```

The triggering condition is described as follows:
condition1: During the sleeping time, the executors will be released and the # 
of the executor will be zero some seconds later. The #numExecutorsTarget in 
'ExecutorAllocationManager' will be 0.
condition2: In 'ExecutorAllocationListener.onTaskEnd()', the numRunningTasks 
will be negative during the ending of job1's tasks. 
condition3: The job2 only hava one task.

result:
In the method 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', we 
will calculate #maxNeeded in 'maxNumExecutorsNeeded()'. Obviously, 
#numRunningOrPendingTasks will be negative and the #maxNeeded will be 0 or 
negative. So the 'ExecutorAllocationManager' will not request container from 
yarn. The app will hang.





> The last job hung when speculation is on
> 
>
> Key: SPARK-18981
> URL: https://issues.apache.org/jira/browse/SPARK-18981
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
> hadoop2.5.0
>Reporter: roncenzhao
>Priority: Critical
> Attachments: Test.scala, job_hang.png, run_scala.sh
>
>
> CONF:
> spark.speculation   true
> spark.dynamicAllocation.minExecutors   0
> spark.executor.cores   4
> When I run the following app, the bug is triggered.
> ```
> sc.runJob(job1)
> sleep(100s)
> sc.runJob(job2) // job2 will hang and never be scheduled
> ```
> The triggering conditions are as follows:
> condition 1: During the sleep, the executors are released and the number of 
> executors drops to zero a few seconds later. The #numExecutorsTarget in 
> 'ExecutorAllocationManager' will be 0.
> condition 2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks 
> becomes negative while job1's tasks finish.
> condition 3: job2 has only one task.
> result:
> In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', #maxNeeded 
> is calculated by 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is 
> negative, #maxNeeded will be 0 or negative, so 'ExecutorAllocationManager' 
> will not request containers from YARN. The app hangs.
> In the attachments, submitting the app with 'run_scala.sh' leads to the 'hang' 
> problem shown in 'job_hang.png'.






[jira] [Updated] (SPARK-18981) The last job hung when speculation is on

2016-12-22 Thread roncenzhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

roncenzhao updated SPARK-18981:
---
Attachment: run_scala.sh

> The last job hung when speculation is on
> 
>
> Key: SPARK-18981
> URL: https://issues.apache.org/jira/browse/SPARK-18981
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
> hadoop2.5.0
>Reporter: roncenzhao
>Priority: Critical
> Attachments: Test.scala, job_hang.png, run_scala.sh
>
>
> CONF:
> spark.speculation   true
> spark.dynamicAllocation.minExecutors   0
> spark.executor.cores   4
> When I run the following app, the bug is triggered.
> ```
> sc.runJob(job1)
> sleep(100s)
> sc.runJob(job2) // job2 will hang and never be scheduled
> ```
> The triggering conditions are as follows:
> condition 1: During the sleep, the executors are released and the number of 
> executors drops to zero a few seconds later. The #numExecutorsTarget in 
> 'ExecutorAllocationManager' will be 0.
> condition 2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks 
> becomes negative while job1's tasks finish.
> condition 3: job2 has only one task.
> result:
> In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', #maxNeeded 
> is calculated by 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is 
> negative, #maxNeeded will be 0 or negative, so 'ExecutorAllocationManager' 
> will not request containers from YARN. The app hangs.






[jira] [Commented] (SPARK-18984) Concat with ds.write.text() throws an exception if a column contains null data

2016-12-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771706#comment-15771706
 ] 

Hyukjin Kwon commented on SPARK-18984:
--

[~tonythor], it seems you reproduced this in Spark 2.0.0, not 2.0.2, because the 
line number is a bit different from 2.0.2. It seems the problematic code in your 
case is this line in Spark 2.0.0: 
https://github.com/apache/spark/blob/13650fc58e1fcf2cf2a26ba11c819185ae1acc1f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L146

> Concat with ds.write.text() throws an exception if a column contains null data
> 
>
> Key: SPARK-18984
> URL: https://issues.apache.org/jira/browse/SPARK-18984
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: spark2.02 scala 2.11.8 
>Reporter: Tony Fraser
>
> val customOutputFormat = outbound.select(concat(
>   outbound.col("device_id"), lit("\t"),
>   lit("\"device\"=\""),   col("device"),   lit("\","),
>   lit("\"_fd_cast\"=\""), col("_fd_cast"), lit("\","))
> ).where("device_type='ios_app' and _fd_cast is not null")
> customOutputFormat
>   .limit(1000)
>   .write
>   .option("nullValue", "NULL")
>   .mode("overwrite")
>   .text("/filepath")
> There is no problem writing to JSON, CSV, or Parquet, and the code above 
> works. As soon as you take out "and _fd_cast is not null", though, it throws 
> the exception below. Using either nullValue or treatEmptyValuesAsNulls, on 
> either the read or the write side, doesn't seem to matter.
> Exception is:
> 16/12/22 14:16:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches 
> in 5 ms
> 16/12/22 14:16:18 INFO DefaultWriterContainer: Using output committer class 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
> 16/12/22 14:16:18 ERROR Utils: Aborting task
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:146)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/12/22 14:16:18 ERROR DefaultWriterContainer: Task attempt 
> attempt_201612221416_0002_m_00_0 aborted.
> 16/12/22 14:16:18 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> 

[jira] [Updated] (SPARK-18981) The last job hung when speculation is on

2016-12-22 Thread roncenzhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

roncenzhao updated SPARK-18981:
---
Attachment: Test.scala

> The last job hung when speculation is on
> 
>
> Key: SPARK-18981
> URL: https://issues.apache.org/jira/browse/SPARK-18981
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
> hadoop2.5.0
>Reporter: roncenzhao
>Priority: Critical
> Attachments: Test.scala, job_hang.png
>
>
> CONF:
> spark.speculation   true
> spark.dynamicAllocation.minExecutors   0
> spark.executor.cores   4
> When I run the following app, the bug is triggered.
> ```
> sc.runJob(job1)
> sleep(100s)
> sc.runJob(job2) // job2 will hang and never be scheduled
> ```
> The triggering conditions are as follows:
> condition 1: During the sleep, the executors are released and the number of 
> executors drops to zero a few seconds later. The #numExecutorsTarget in 
> 'ExecutorAllocationManager' will be 0.
> condition 2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks 
> becomes negative while job1's tasks finish.
> condition 3: job2 has only one task.
> result:
> In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', #maxNeeded 
> is calculated by 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is 
> negative, #maxNeeded will be 0 or negative, so 'ExecutorAllocationManager' 
> will not request containers from YARN. The app hangs.






[jira] [Comment Edited] (SPARK-18984) Concat with ds.write.text() throws an exception if a column contains null data

2016-12-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771695#comment-15771695
 ] 

Hyukjin Kwon edited comment on SPARK-18984 at 12/23/16 2:50 AM:


Yes, I can reproduce the same error in Spark 2.0.2 with the code below:

{code}
scala> Seq(Some("a"), None).toDF.write.text("/tmp/abc")
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:148)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
{code}

What we could do with this JIRA might be one (or some) of the following:

- Resolve it, as this seems to be fixed in master anyway.
- Add a test case, because there seems to be no test case for this.
- Introduce a {{nullValue}} option, because it seems we can't read {{null}} back 
via the text datasource, as shown below.

{code}
scala> Seq(Some("a"), None).toDF.show()
+-----+
|value|
+-----+
|    a|
| null|
+-----+


scala> spark.read.text("/tmp/abc").show()
+-----+
|value|
+-----+
|    a|
|     |
+-----+
{code}


was (Author: hyukjin.kwon):
Yes, I can reproduce the same error with the codes as below:

{code}
scala> Seq(Some("a"), None).toDF.write.text("/tmp/abc")
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:148)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
{code}

What we could do with this JIRA might be one (or some) of these as below:

- resolve it as this seems fixed in the master anyway.
- add a test case for this because It seems there is no test case for this.
- introduce {{nullValue}} option because It seems we can't read {{null}} back 
via Text datasource as below.

{code}
scala> Seq(Some("a"), None).toDF.show()
+-+
|value|
+-+
|a|
| null|
+-+


scala> spark.read.text("/tmp/abc").show()
+-+
|value|
+-+
|a|
| |
+-+
{code}

> Concat with ds.write.text() throws an exception if a column contains null data
> 
>
> Key: SPARK-18984
> URL: https://issues.apache.org/jira/browse/SPARK-18984
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: spark2.02 scala 2.11.8 
>Reporter: Tony Fraser
>
> val customOutputFormat = outbound.select(concat(
>   outbound.col("device_id"), lit ("\t"),
>   lit("\"device\"=\""),   col("device"),lit("\","),
>   lit("\"_fd_cast\"=\""), col("_fd_cast"),  lit("\",")
> 

[jira] [Commented] (SPARK-18984) Concat with ds.write.text() throws an exception if a column contains null data

2016-12-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771695#comment-15771695
 ] 

Hyukjin Kwon commented on SPARK-18984:
--

Yes, I can reproduce the same error with the code below:

{code}
scala> Seq(Some("a"), None).toDF.write.text("/tmp/abc")
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:148)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
{code}

What we could do with this JIRA might be one (or some) of the following:

- Resolve it, as this seems to be fixed in master anyway.
- Add a test case, because there seems to be no test case for this.
- Introduce a {{nullValue}} option, because it seems we can't read {{null}} back 
via the text datasource, as shown below.

{code}
scala> Seq(Some("a"), None).toDF.show()
+-----+
|value|
+-----+
|    a|
| null|
+-----+


scala> spark.read.text("/tmp/abc").show()
+-----+
|value|
+-----+
|    a|
|     |
+-----+
{code}

> Concat with ds.write.text() throws an exception if a column contains null data
> 
>
> Key: SPARK-18984
> URL: https://issues.apache.org/jira/browse/SPARK-18984
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: spark2.02 scala 2.11.8 
>Reporter: Tony Fraser
>
> val customOutputFormat = outbound.select(concat(
>   outbound.col("device_id"), lit("\t"),
>   lit("\"device\"=\""),   col("device"),   lit("\","),
>   lit("\"_fd_cast\"=\""), col("_fd_cast"), lit("\","))
> ).where("device_type='ios_app' and _fd_cast is not null")
> customOutputFormat
>   .limit(1000)
>   .write
>   .option("nullValue", "NULL")
>   .mode("overwrite")
>   .text("/filepath")
> There is no problem writing to JSON, CSV, or Parquet, and the code above 
> works. As soon as you take out "and _fd_cast is not null", though, it throws 
> the exception below. Using either nullValue or treatEmptyValuesAsNulls, on 
> either the read or the write side, doesn't seem to matter.
> Exception is:
> 16/12/22 14:16:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches 
> in 5 ms
> 16/12/22 14:16:18 INFO DefaultWriterContainer: Using output committer class 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
> 16/12/22 14:16:18 ERROR Utils: Aborting task
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:146)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at 

[jira] [Assigned] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18352:


Assignee: (was: Apache Spark)

> Parse normal, multi-line JSON files (not just JSON Lines)
> -
>
> Key: SPARK-18352
> URL: https://issues.apache.org/jira/browse/SPARK-18352
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
>
> Spark currently can only parse JSON files that are JSON Lines, i.e. each 
> record occupies a single line and records are separated by newlines. In 
> reality, a lot of users want to use Spark to parse actual JSON files, and are 
> surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the 
> files, but rather stream through them to parse the JSON files.
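
Until such a mode exists, a common workaround sketch (the input path is 
hypothetical) is to read each file whole and hand the documents to the JSON 
reader as single strings:

{code}
import org.apache.spark.sql.SparkSession

object WholeFileJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WholeFileJson").getOrCreate()

    // wholeTextFiles keeps each file intact as one (path, content) record, so
    // a pretty-printed multi-line JSON document stays a single string.
    val docs = spark.sparkContext
      .wholeTextFiles("hdfs:///data/json/*.json")
      .map { case (_, content) => content }

    val df = spark.read.json(docs)
    df.printSchema()
  }
}
{code}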






[jira] [Assigned] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18352:


Assignee: Apache Spark

> Parse normal, multi-line JSON files (not just JSON Lines)
> -
>
> Key: SPARK-18352
> URL: https://issues.apache.org/jira/browse/SPARK-18352
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>  Labels: releasenotes
>
> Spark currently can only parse JSON files that are JSON Lines, i.e. each 
> record occupies a single line and records are separated by newlines. In 
> reality, a lot of users want to use Spark to parse actual JSON files, and are 
> surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the 
> files, but rather stream through them to parse the JSON files.






[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771689#comment-15771689
 ] 

Apache Spark commented on SPARK-18352:
--

User 'NathanHowell' has created a pull request for this issue:
https://github.com/apache/spark/pull/16386

> Parse normal, multi-line JSON files (not just JSON Lines)
> -
>
> Key: SPARK-18352
> URL: https://issues.apache.org/jira/browse/SPARK-18352
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
>
> Spark currently can only parse JSON files that are JSON Lines, i.e. each 
> record occupies a single line and records are separated by newlines. In 
> reality, a lot of users want to use Spark to parse actual JSON files, and are 
> surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the 
> files, but rather stream through them to parse the JSON files.






[jira] [Comment Edited] (SPARK-18984) Concat with ds.write.text() throws an exception if a column contains null data

2016-12-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771675#comment-15771675
 ] 

Hyukjin Kwon edited comment on SPARK-18984 at 12/23/16 2:41 AM:


It seems this was fixed together with SPARK-18658 in master. Null-check logic 
was introduced there - {{if (!row.isNullAt(0)) ..}} - but it was only merged 
into master. 

The reproducible code I tested was as below: 

{code}
Seq(Some("a"), None).toDF.write.text("/tmp/abc")
{code}

Could you please confirm whether this is fine on master?

BTW, I think {{nullValue}} is a CSV-datasource-specific option.


was (Author: hyukjin.kwon):
It seems this is fixed together in SPARK-18658 in in master. Null check logic 
was introduced there - {{ if (!row.isNullAt(0)) ..}} but this was only merged 
into master. 

The reproducible codes I tested was as below: 

{code}
Seq(Some("a"), None).toDF.write.text("/tmp/abc")
{code}

Could you please confirm if this is fine in the master?

BTW, I think {{nullValue}} is CSV datasource specific option.

> Concat with ds.write.text() throws an exception if a column contains null data
> 
>
> Key: SPARK-18984
> URL: https://issues.apache.org/jira/browse/SPARK-18984
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: spark2.02 scala 2.11.8 
>Reporter: Tony Fraser
>
> val customOutputFormat = outbound.select(concat(
>   outbound.col("device_id"), lit("\t"),
>   lit("\"device\"=\""),   col("device"),   lit("\","),
>   lit("\"_fd_cast\"=\""), col("_fd_cast"), lit("\","))
> ).where("device_type='ios_app' and _fd_cast is not null")
> customOutputFormat
>   .limit(1000)
>   .write
>   .option("nullValue", "NULL")
>   .mode("overwrite")
>   .text("/filepath")
> There is no problem writing to JSON, CSV, or Parquet, and the code above 
> works. As soon as you take out "and _fd_cast is not null", though, it throws 
> the exception below. Using either nullValue or treatEmptyValuesAsNulls, on 
> either the read or the write side, doesn't seem to matter.
> Exception is:
> 16/12/22 14:16:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches 
> in 5 ms
> 16/12/22 14:16:18 INFO DefaultWriterContainer: Using output committer class 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
> 16/12/22 14:16:18 ERROR Utils: Aborting task
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:146)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/12/22 14:16:18 ERROR DefaultWriterContainer: Task attempt 
> attempt_201612221416_0002_m_00_0 aborted.
> 16/12/22 14:16:18 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 

[jira] [Commented] (SPARK-18984) Concat with ds.write.text() throws an exception if a column contains null data

2016-12-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771675#comment-15771675
 ] 

Hyukjin Kwon commented on SPARK-18984:
--

It seems this was fixed together with SPARK-18658 in master. Null-check logic 
was introduced there - {{if (!row.isNullAt(0)) ..}} - but it was only merged 
into master. 

The reproducible code I tested was as below: 

{code}
Seq(Some("a"), None).toDF.write.text("/tmp/abc")
{code}

Could you please confirm whether this is fine on master?

BTW, I think {{nullValue}} is a CSV-datasource-specific option.
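
As a user-side workaround sketch for 2.0.x (the data, marker string, and output 
path are placeholders): replace nulls in the string column before calling 
write.text(), which avoids the NullPointerException above.

{code}
import org.apache.spark.sql.SparkSession

object NullSafeTextWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("NullSafeTextWrite").getOrCreate()
    import spark.implicits._

    // Writing a null string row with text() raises the NPE above on 2.0.x,
    // so fill nulls with an explicit marker first.
    val df = Seq(Some("a"), None).toDF("value")
    df.na.fill("NULL").write.mode("overwrite").text("/tmp/abc_safe")
  }
}
{code}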

> Concat with ds.write.text() throws an exception if a column contains null data
> 
>
> Key: SPARK-18984
> URL: https://issues.apache.org/jira/browse/SPARK-18984
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: spark2.02 scala 2.11.8 
>Reporter: Tony Fraser
>
> val customOutputFormat = outbound.select(concat(
>   outbound.col("device_id"), lit("\t"),
>   lit("\"device\"=\""),   col("device"),   lit("\","),
>   lit("\"_fd_cast\"=\""), col("_fd_cast"), lit("\","))
> ).where("device_type='ios_app' and _fd_cast is not null")
> customOutputFormat
>   .limit(1000)
>   .write
>   .option("nullValue", "NULL")
>   .mode("overwrite")
>   .text("/filepath")
> There is no problem writing to JSON, CSV, or Parquet, and the code above 
> works. As soon as you take out "and _fd_cast is not null", though, it throws 
> the exception below. Using either nullValue or treatEmptyValuesAsNulls, on 
> either the read or the write side, doesn't seem to matter.
> Exception is:
> 16/12/22 14:16:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches 
> in 5 ms
> 16/12/22 14:16:18 INFO DefaultWriterContainer: Using output committer class 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
> 16/12/22 14:16:18 ERROR Utils: Aborting task
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:146)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/12/22 14:16:18 ERROR DefaultWriterContainer: Task attempt 
> attempt_201612221416_0002_m_00_0 aborted.
> 16/12/22 14:16:18 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> 

[jira] [Updated] (SPARK-18981) The last job hung when speculation is on

2016-12-22 Thread roncenzhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

roncenzhao updated SPARK-18981:
---
Attachment: job_hang.png

> The last job hung when speculation is on
> 
>
> Key: SPARK-18981
> URL: https://issues.apache.org/jira/browse/SPARK-18981
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
> hadoop2.5.0
>Reporter: roncenzhao
>Priority: Critical
> Attachments: job_hang.png
>
>
> CONF:
> spark.speculation   true
> spark.dynamicAllocation.minExecutors   0
> spark.executor.cores   4
> When I run the following app, the bug will trigger.
> ```
> sc.runJob(job1)
> sleep(100s)
> sc.runJob(job2) // job2 will hang and never be scheduled
> ```
> The triggering conditions are as follows:
> condition1: During the sleep, the executors are released and the number of 
> executors drops to zero a few seconds later. The #numExecutorsTarget in 
> 'ExecutorAllocationManager' will be 0.
> condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks 
> becomes negative while job1's tasks are finishing.
> condition3: Job2 has only one task.
> result:
> In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()' we calculate 
> #maxNeeded via 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is 
> negative, #maxNeeded will be 0 or negative, so 'ExecutorAllocationManager' 
> never requests containers from YARN and the app hangs.
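
To make the failure mode concrete, here is a simplified, self-contained sketch of the 
arithmetic described above. The names are illustrative and this is not the actual 
ExecutorAllocationManager code:

{code}
object ExecutorTargetSketch {
  // spark.executor.cores = 4 and (assumed) 1 CPU per task => 4 tasks per executor
  val tasksPerExecutor = 4

  // maxNeeded as described: ceil(numRunningOrPendingTasks / tasksPerExecutor)
  def maxNumExecutorsNeeded(numRunningOrPendingTasks: Int): Int =
    (numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor

  def main(args: Array[String]): Unit = {
    val numRunningTasks = -1  // driven negative by onTaskEnd() (condition 2)
    val numPendingTasks = 1   // job2 has a single task (condition 3)

    // 0 executors are requested, so job2's single task never runs.
    println(maxNumExecutorsNeeded(numRunningTasks + numPendingTasks))              // 0

    // Clamping the running-task count avoids the hang in this sketch.
    println(maxNumExecutorsNeeded(math.max(0, numRunningTasks) + numPendingTasks)) // 1
  }
}
{code}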



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18981) The last job hung when speculation is on

2016-12-22 Thread roncenzhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

roncenzhao updated SPARK-18981:
---
Description: 
CONF:
spark.speculation   true
spark.dynamicAllocation.minExecutors   0
spark.executor.cores   4

When I run the following app, the bug will trigger.
```
sc.runJob(job1)
sleep(100s)
sc.runJob(job2) // job2 will hang and never be scheduled
```

The triggering conditions are as follows:
condition1: During the sleep, the executors are released and the number of 
executors drops to zero a few seconds later. The #numExecutorsTarget in 
'ExecutorAllocationManager' will be 0.
condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks 
becomes negative while job1's tasks are finishing.
condition3: Job2 has only one task.

result:
In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()' we calculate 
#maxNeeded via 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is 
negative, #maxNeeded will be 0 or negative, so 'ExecutorAllocationManager' 
never requests containers from YARN and the app hangs.




  was:
related settings:
spark.speculation   true
spark.dynamicAllocation.minExecutors   0
spark.executor.cores   4

When I run the following app, the bug will trigger.
```
sc.runJob(job1)
sleep(100s)
sc.runJob(job2) // job2 will hang and never be scheduled
```

The triggering conditions are as follows:
condition1: During the sleep, the executors are released and the number of 
executors drops to zero a few seconds later. The #numExecutorsTarget in 
'ExecutorAllocationManager' will be 0.
condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks 
becomes negative while job1's tasks are finishing.
condition3: Job2 has only one task.

result:
In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()' we calculate 
#maxNeeded via 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is 
negative, #maxNeeded will be 0 or negative, so 'ExecutorAllocationManager' 
never requests containers from YARN and the app hangs.





> The last job hung when speculation is on
> 
>
> Key: SPARK-18981
> URL: https://issues.apache.org/jira/browse/SPARK-18981
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
> hadoop2.5.0
>Reporter: roncenzhao
>Priority: Critical
>
> CONF:
> spark.speculation   true
> spark.dynamicAllocation.minExecutors   0
> spark.executor.cores   4
> When I run the following app, the bug will trigger.
> ```
> sc.runJob(job1)
> sleep(100s)
> sc.runJob(job2) // job2 will hang and never be scheduled
> ```
> The triggering conditions are as follows:
> condition1: During the sleep, the executors are released and the number of 
> executors drops to zero a few seconds later. The #numExecutorsTarget in 
> 'ExecutorAllocationManager' will be 0.
> condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks 
> becomes negative while job1's tasks are finishing.
> condition3: Job2 has only one task.
> result:
> In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()' we calculate 
> #maxNeeded via 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is 
> negative, #maxNeeded will be 0 or negative, so 'ExecutorAllocationManager' 
> never requests containers from YARN and the app hangs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2016-12-22 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771602#comment-15771602
 ] 

Shivaram Venkataraman commented on SPARK-18924:
---

Yeah I think we could convert this JIRA into a few sub-tasks - the first one 
could be profiling some of the existing code to get a breakdown of how much 
time is spent where. The next one could be the JVM side changes like boxing / 
unboxing improvements etc. 

> Improve collect/createDataFrame performance in SparkR
> -
>
> Key: SPARK-18924
> URL: https://issues.apache.org/jira/browse/SPARK-18924
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR has its own SerDe for data serialization between JVM and R.
> The SerDe on the JVM side is implemented in:
> * 
> [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * 
> [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * 
> [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * 
> [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
> The serialization between JVM and R suffers from huge storage and computation 
> overhead. For example, a short round trip of 1 million doubles surprisingly 
> took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(1e6)))))
>user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame to local and continuing with a local R 
> workflow is a use case we should pay attention to. SparkR will never be able 
> to cover all existing features from CRAN packages. It is also unnecessary for 
> Spark to do so because not all features need scalability. 
> Several factors contribute to the serialization overhead:
> 1. The SerDe on the R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive type columns 
> in particular.
> 3. Some overhead in the serialization protocol/impl.
> 1) might have been discussed before, because R packages like rJava existed before 
> SparkR. I'm not sure whether we would have a license issue in depending on those 
> libraries. Another option is to switch to the low-level R C interface or Rcpp, 
> which again might have license issues. I'm not an expert here. If we have to 
> implement our own, there is still much room for improvement, discussed 
> below.
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, 
> which collects rows locally and then constructs columns. However,
> * it ignores column types, which results in boxing/unboxing overhead
> * it collects all objects to the driver, which results in high GC pressure
> A relatively simple change is to implement a specialized column builder based 
> on column types, primitive types in particular. We need to handle null/NA 
> values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read the entire vector 
> in a single method call. The speed seems reasonable (on the order of GB/s):
> {code}
> > x <- runif(1e7) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>user  system elapsed
>   0.036   0.021   0.059
> > system.time(y <- readBin(r, double(), 1e7))
>user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized. But in 
> general, it should be feasible to obtain a 20x or more performance gain.
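
A minimal Scala sketch of the specialized column layout proposed in item 2 above, for a 
single double column. This is illustrative only, not the actual SerDe code, and names 
such as {{DoubleColumn}} are made up for this example:

{code}
import java.io.{ByteArrayOutputStream, DataOutputStream}
import scala.collection.mutable.ArrayBuffer

// Nulls are tracked by index; non-null values stay in a primitive array,
// so the whole column can be written in one pass without boxing each cell.
case class DoubleColumn(size: Int, nullIndexes: Array[Int], notNullValues: Array[Double])

object DoubleColumnBuilder {
  def build(cells: Seq[java.lang.Double]): DoubleColumn = {
    val nulls  = ArrayBuffer.empty[Int]
    val values = ArrayBuffer.empty[Double]
    cells.zipWithIndex.foreach {
      case (null, i) => nulls += i
      case (v, _)    => values += v.doubleValue()
    }
    DoubleColumn(cells.length, nulls.toArray, values.toArray)
  }

  // Flatten the column into bytes that the R side could consume with readBin.
  def serialize(col: DoubleColumn): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(col.size)
    out.writeInt(col.nullIndexes.length)
    col.nullIndexes.foreach(out.writeInt)
    col.notNullValues.foreach(out.writeDouble)
    out.flush()
    bytes.toByteArray
  }
}
{code}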



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing

2016-12-22 Thread Lev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771583#comment-15771583
 ] 

Lev commented on SPARK-18970:
-

Actually this is exactly the behavior I want. My problem is that the application 
appeared to be alive, but was not processing any new files after this message 
in the log. I intentionally included the portion of the log at the bottom, 
showing that the file list was refreshed every couple of minutes before.
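
One way to surface a silently stalled query is to register a StreamingQueryListener and 
treat a long absence of progress events (or a terminated event) as a failure. This is 
only a sketch; the event class names below follow the Spark 2.1 listener API and may 
differ slightly on 2.0.x:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{
  QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Sketch (e.g. in spark-shell): log every listener event so that "alive but not
// processing" shows up as an absence of progress events in the application log.
val spark = SparkSession.builder().appName("stream-watchdog").getOrCreate()

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"query started: $event")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"query progress: $event")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"query terminated: $event") // inspect and fail fast here if desired
})
{code}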

> FileSource failure during file list refresh doesn't cause an application to 
> fail, but stops further processing
> --
>
> Key: SPARK-18970
> URL: https://issues.apache.org/jira/browse/SPARK-18970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.0, 2.0.2
>Reporter: Lev
> Attachments: sparkerror.log
>
>
> The Spark streaming application uses S3 files as streaming sources. After running 
> for several days, processing stopped even though the application continued to 
> run. 
> Stack trace:
> java.io.FileNotFoundException: No such file or directory 
> 's3n://X'
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> I believe 2 things should (or can) be fixed:
> 1. Application should fail in case of such an error.
> 2. Allow the application to ignore such a failure, since there is a chance that 
> during the next refresh the error will not resurface. (In my case I believe the 
> error was caused by S3 cleaning the bucket at exactly the moment when the 
> refresh was running.) 
> My code to create the streaming processing looks like the following:
>   val cq = sqlContext.readStream
> .format("json")
> .schema(struct)
> .load(s"input")
> .writeStream
> .option("checkpointLocation", s"checkpoints")
> .foreach(new ForeachWriter[Row] {...})
> .trigger(ProcessingTime("10 seconds")).start()
>   
> cq.awaitTermination() 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18805) InternalMapWithStateDStream make java.lang.StackOverflowError

2016-12-22 Thread Vladimir Pchelko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771504#comment-15771504
 ] 

Vladimir Pchelko commented on SPARK-18805:
--

I have faced a similar problem ... there are two 'problems' with mapWithState:

1. spark.streaming.concurrentJobs
2. lack of memory with high GC time

In both cases I noticed strange/magic errors.

It seems that in your case the application is unrecoverable due to lack of memory.

> InternalMapWithStateDStream make java.lang.StackOverflowError 
> --
>
> Key: SPARK-18805
> URL: https://issues.apache.org/jira/browse/SPARK-18805
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.3, 2.0.2
> Environment: mesos
>Reporter: etienne
>
> When loading InternalMapWithStateDStream from a checkpoint, if isValidTime is 
> true and there is no generatedRDD at the given time, there is an infinite loop:
> 1) compute is called on InternalMapWithStateDStream
> 2) InternalMapWithStateDStream tries to generate the previousRDD
> 3) The stream looks in generatedRDD to see whether the RDD has already been 
> generated for the given time
> 4) It does not find the RDD, so it checks whether the time is valid.
> 5) If the time is valid, compute is called on InternalMapWithStateDStream again
> 6) restart from 1)
> Here is the exception that illustrates this error:
> {code}
> Exception in thread "streaming-start" java.lang.StackOverflowError
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> {code}
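
The six-step loop above can be reduced to a small, self-contained illustration. This is 
deliberately simplified, not the actual DStream code, and the "RDDs" are just strings:

{code}
object MapWithStateLoopSketch {
  // time -> "RDD"; empty after restoring from a checkpoint, as in the report
  val generatedRDDs = scala.collection.mutable.Map.empty[Long, String]

  // Always true in this sketch, mirroring steps 4 and 5 of the description.
  def isValidTime(time: Long): Boolean = true

  def getOrCompute(time: Long): Option[String] =
    generatedRDDs.get(time).orElse {
      if (isValidTime(time)) Some(compute(time)) else None
    }

  def compute(time: Long): String = {
    // The previous batch's RDD is also missing from the cache, so we recurse
    // further and further back in time until the stack overflows.
    val previous = getOrCompute(time - 1)
    s"rdd(previous = $previous)"
  }

  def main(args: Array[String]): Unit =
    getOrCompute(0L) // throws java.lang.StackOverflowError
}
{code}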



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18972) Fix the netty thread names for RPC

2016-12-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18972.
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> Fix the netty thread names for RPC
> --
>
> Key: SPARK-18972
> URL: https://issues.apache.org/jira/browse/SPARK-18972
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Right now the names of the threads created by Netty for Spark RPC are 
> `shuffle-client-***` and `shuffle-server-***`. We should fix these names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17807) Scalatest listed as compile dependency in spark-tags

2016-12-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-17807:
---
Fix Version/s: 2.0.3

> Scalatest listed as compile dependency in spark-tags
> 
>
> Key: SPARK-17807
> URL: https://issues.apache.org/jira/browse/SPARK-17807
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Tom Standard
>Priority: Trivial
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> In spark-tags:2.0.0, Scalatest is listed as a compile time dependency - 
> shouldn't this be in test scope?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18985) Add missing @InterfaceStability.Evolving for Structured Streaming APIs

2016-12-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18985.
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> Add missing @InterfaceStability.Evolving for Structured Streaming APIs
> --
>
> Key: SPARK-18985
> URL: https://issues.apache.org/jira/browse/SPARK-18985
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17807) Scalatest listed as compile dependency in spark-tags

2016-12-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17807.

   Resolution: Fixed
Fix Version/s: 2.1.1

> Scalatest listed as compile dependency in spark-tags
> 
>
> Key: SPARK-17807
> URL: https://issues.apache.org/jira/browse/SPARK-17807
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Tom Standard
>Priority: Trivial
> Fix For: 2.1.1, 2.2.0
>
>
> In spark-tags:2.0.0, Scalatest is listed as a compile time dependency - 
> shouldn't this be in test scope?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18618) SparkR GLM model predict should support type as a argument

2016-12-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771417#comment-15771417
 ] 

Joseph K. Bradley commented on SPARK-18618:
---

Note that [~yanboliang]'s PR from [SPARK-18291] has a lot of the code required 
to fix this issue.

> SparkR GLM model predict should support type as a argument
> --
>
> Key: SPARK-18618
> URL: https://issues.apache.org/jira/browse/SPARK-18618
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>  Labels: 2.2.0
>
> SparkR GLM model {{predict}} should support {{type}} as an argument. This will 
> make it consistent with native R predict such as 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18291) SparkR glm predict should output original label when family = "binomial"

2016-12-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-18291.
-
  Resolution: Duplicate
Target Version/s:   (was: 2.2.0)

I'm closing this since [SPARK-18618] will fix the issue.

> SparkR glm predict should output original label when family = "binomial"
> 
>
> Key: SPARK-18291
> URL: https://issues.apache.org/jira/browse/SPARK-18291
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Attachments: SparkR2.1decisionoutputschemaforGLMs.pdf
>
>
> SparkR spark.glm predict should output original label when family = 
> "binomial".
> For example, we can run the following code in the SparkR shell:
> {code}
> training <- suppressWarnings(createDataFrame(iris))
> training <- training[training$Species %in% c("versicolor", "virginica"), ]
> model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width,family = 
> binomial(link = "logit"))
> showDF(predict(model, training))
> {code}
> The prediction column is a double value, which makes no sense to users.
> {code}
> ++---++---+--+-+---+
> |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label| 
> prediction|
> ++---++---+--+-+---+
> | 7.0|3.2| 4.7|1.4|versicolor|  0.0| 
> 0.8271421517601544|
> | 6.4|3.2| 4.5|1.5|versicolor|  0.0| 
> 0.6044595910413112|
> | 6.9|3.1| 4.9|1.5|versicolor|  0.0| 
> 0.7916340858281998|
> | 5.5|2.3| 4.0|1.3|versicolor|  
> 0.0|0.16080518180591158|
> | 6.5|2.8| 4.6|1.5|versicolor|  0.0| 
> 0.6112229217050189|
> | 5.7|2.8| 4.5|1.3|versicolor|  0.0| 
> 0.2555087295500885|
> | 6.3|3.3| 4.7|1.6|versicolor|  0.0| 
> 0.5681507664364834|
> | 4.9|2.4| 3.3|1.0|versicolor|  
> 0.0|0.05990570219972002|
> | 6.6|2.9| 4.6|1.3|versicolor|  0.0| 
> 0.6644434078306246|
> | 5.2|2.7| 3.9|1.4|versicolor|  
> 0.0|0.11293577405862379|
> | 5.0|2.0| 3.5|1.0|versicolor|  
> 0.0|0.06152372321585971|
> | 5.9|3.0| 4.2|1.5|versicolor|  
> 0.0|0.35250697207602555|
> | 6.0|2.2| 4.0|1.0|versicolor|  
> 0.0|0.32267018290814303|
> | 6.1|2.9| 4.7|1.4|versicolor|  0.0|  
> 0.433391153814592|
> | 5.6|2.9| 3.6|1.3|versicolor|  0.0| 
> 0.2280744262436993|
> | 6.7|3.1| 4.4|1.4|versicolor|  0.0| 
> 0.7219848389339459|
> | 5.6|3.0| 4.5|1.5|versicolor|  
> 0.0|0.23527698971404695|
> | 5.8|2.7| 4.1|1.0|versicolor|  0.0|  
> 0.285024533520016|
> | 6.2|2.2| 4.5|1.5|versicolor|  0.0| 
> 0.4107047877447493|
> | 5.6|2.5| 3.9|1.1|versicolor|  
> 0.0|0.20083561961645083|
> ++---++---+--+-+---+
> {code}
> The prediction value should be the original label like:
> {code}
> ++---++---+--+-+--+
> |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   
> Species|label|prediction|
> ++---++---+--+-+--+
> | 7.0|3.2| 4.7|1.4|versicolor|  0.0| 
> virginica|
> | 6.4|3.2| 4.5|1.5|versicolor|  0.0| 
> virginica|
> | 6.9|3.1| 4.9|1.5|versicolor|  0.0| 
> virginica|
> | 5.5|2.3| 4.0|1.3|versicolor|  
> 0.0|versicolor|
> | 6.5|2.8| 4.6|1.5|versicolor|  0.0| 
> virginica|
> | 5.7|2.8| 4.5|1.3|versicolor|  
> 0.0|versicolor|
> | 6.3|3.3| 4.7|1.6|versicolor|  0.0| 
> virginica|
> | 4.9|2.4| 3.3|1.0|versicolor|  
> 0.0|versicolor|
> | 6.6|2.9| 4.6|1.3|versicolor|  0.0| 
> virginica|
> | 5.2|2.7| 3.9|1.4|versicolor|  
> 0.0|versicolor|
> | 5.0|2.0| 3.5|1.0|versicolor|  
> 0.0|versicolor|
> | 5.9|3.0| 4.2|1.5|versicolor|  
> 0.0|versicolor|
> | 6.0|2.2| 4.0|1.0|versicolor|  
> 

[jira] [Updated] (SPARK-18618) SparkR GLM model predict should support type as a argument

2016-12-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18618:
--
Description: SparkR GLM model {{predict}} should support {{type}} as an 
argument. This will make it consistent with native R predict such as 
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html .  
(was: SparkR model {{predict}} should support {{type}} as an argument. This will 
make it consistent with native R predict such as 
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html .)

> SparkR GLM model predict should support type as a argument
> --
>
> Key: SPARK-18618
> URL: https://issues.apache.org/jira/browse/SPARK-18618
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>  Labels: 2.2.0
>
> SparkR GLM model {{predict}} should support {{type}} as an argument. This will 
> make it consistent with native R predict such as 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18618) SparkR GLM model predict should support type as a argument

2016-12-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18618:
--
Summary: SparkR GLM model predict should support type as a argument  (was: 
SparkR model predict should support type as a argument)

> SparkR GLM model predict should support type as a argument
> --
>
> Key: SPARK-18618
> URL: https://issues.apache.org/jira/browse/SPARK-18618
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>  Labels: 2.2.0
>
> SparkR model {{predict}} should support {{type}} as an argument. This will make 
> it consistent with native R predict such as 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata

2016-12-22 Thread David Rosenstrauch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771364#comment-15771364
 ] 

David Rosenstrauch commented on SPARK-9686:
---

Ditto.  Any closer to a fix?

> Spark Thrift server doesn't return correct JDBC metadata 
> -
>
> Key: SPARK-9686
> URL: https://issues.apache.org/jira/browse/SPARK-9686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2
>Reporter: pin_zhang
>Assignee: Cheng Lian
>Priority: Critical
> Attachments: SPARK-9686.1.patch.txt
>
>
> 1. Start start-thriftserver.sh
> 2. Connect with beeline
> 3. Create a table
> 4. show tables; the newly created table is returned
> 5. Run:
>   Class.forName("org.apache.hive.jdbc.HiveDriver");
>   String URL = "jdbc:hive2://localhost:1/default";
>   Properties info = new Properties();
>   Connection conn = DriverManager.getConnection(URL, info);
>   ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
>       null, null, null);
> Problem:
> No tables are returned by this API; this worked in Spark 1.3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing

2016-12-22 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771304#comment-15771304
 ] 

Shixiong Zhu commented on SPARK-18970:
--

I see. FileStreamSource ignores FileNotFoundException when trying to get a file 
status inside the input directory. This allows the user to clean up the old 
processed files without failing the query. Any reason you want to disable this 
behavior?

> FileSource failure during file list refresh doesn't cause an application to 
> fail, but stops further processing
> --
>
> Key: SPARK-18970
> URL: https://issues.apache.org/jira/browse/SPARK-18970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.0, 2.0.2
>Reporter: Lev
> Attachments: sparkerror.log
>
>
> The Spark streaming application uses S3 files as streaming sources. After running 
> for several days, processing stopped even though the application continued to 
> run. 
> Stack trace:
> java.io.FileNotFoundException: No such file or directory 
> 's3n://X'
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> I believe 2 things should (or can) be fixed:
> 1. Application should fail in case of such an error.
> 2. Allow the application to ignore such a failure, since there is a chance that 
> during the next refresh the error will not resurface. (In my case I believe the 
> error was caused by S3 cleaning the bucket at exactly the moment when the 
> refresh was running.) 
> My code to create the streaming processing looks like the following:
>   val cq = sqlContext.readStream
> .format("json")
> .schema(struct)
> .load(s"input")
> .writeStream
> .option("checkpointLocation", s"checkpoints")
> .foreach(new ForeachWriter[Row] {...})
> .trigger(ProcessingTime("10 seconds")).start()
>   
> cq.awaitTermination() 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing

2016-12-22 Thread Lev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lev updated SPARK-18970:

Attachment: sparkerror.log

Here is the log file

> FileSource failure during file list refresh doesn't cause an application to 
> fail, but stops further processing
> --
>
> Key: SPARK-18970
> URL: https://issues.apache.org/jira/browse/SPARK-18970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.0, 2.0.2
>Reporter: Lev
> Attachments: sparkerror.log
>
>
> The Spark streaming application uses S3 files as streaming sources. After running 
> for several days, processing stopped even though the application continued to 
> run. 
> Stack trace:
> java.io.FileNotFoundException: No such file or directory 
> 's3n://X'
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> I believe 2 things should (or can) be fixed:
> 1. Application should fail in case of such an error.
> 2. Allow the application to ignore such a failure, since there is a chance that 
> during the next refresh the error will not resurface. (In my case I believe the 
> error was caused by S3 cleaning the bucket at exactly the moment when the 
> refresh was running.) 
> My code to create the streaming processing looks like the following:
>   val cq = sqlContext.readStream
> .format("json")
> .schema(struct)
> .load(s"input")
> .writeStream
> .option("checkpointLocation", s"checkpoints")
> .foreach(new ForeachWriter[Row] {...})
> .trigger(ProcessingTime("10 seconds")).start()
>   
> cq.awaitTermination() 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing

2016-12-22 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771221#comment-15771221
 ] 

Shixiong Zhu commented on SPARK-18970:
--

Where did you find the exception? If it's in the driver, could you post the 
full stack trace?

2.1 RC5 passed, so it should be available very soon.

> FileSource failure during file list refresh doesn't cause an application to 
> fail, but stops further processing
> --
>
> Key: SPARK-18970
> URL: https://issues.apache.org/jira/browse/SPARK-18970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.0, 2.0.2
>Reporter: Lev
>
> The Spark streaming application uses S3 files as streaming sources. After running 
> for several days, processing stopped even though the application continued to 
> run. 
> Stack trace:
> java.io.FileNotFoundException: No such file or directory 
> 's3n://X'
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> I believe 2 things should (or can) be fixed:
> 1. Application should fail in case of such an error.
> 2. Allow the application to ignore such a failure, since there is a chance that 
> during the next refresh the error will not resurface. (In my case I believe the 
> error was caused by S3 cleaning the bucket at exactly the moment when the 
> refresh was running.) 
> My code to create the streaming processing looks like the following:
>   val cq = sqlContext.readStream
> .format("json")
> .schema(struct)
> .load(s"input")
> .writeStream
> .option("checkpointLocation", s"checkpoints")
> .foreach(new ForeachWriter[Row] {...})
> .trigger(ProcessingTime("10 seconds")).start()
>   
> cq.awaitTermination() 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing

2016-12-22 Thread Lev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771199#comment-15771199
 ] 

Lev commented on SPARK-18970:
-

I am not sure whether the task was retried or not, but the Spark application never 
processed any new files after this error occurred. It appears that this 
error is never passed to the application code itself, so the application doesn't 
have a chance to terminate itself. Perhaps I should mention that I am running 
several Spark structured streams in my application. 
We'll wait for the 2.1 release to test SPARK-18774. I couldn't find any release 
date for 2.1. Any idea when it may happen?

> FileSource failure during file list refresh doesn't cause an application to 
> fail, but stops further processing
> --
>
> Key: SPARK-18970
> URL: https://issues.apache.org/jira/browse/SPARK-18970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.0, 2.0.2
>Reporter: Lev
>
> The Spark streaming application uses S3 files as streaming sources. After running 
> for several days, processing stopped even though the application continued to 
> run. 
> Stack trace:
> java.io.FileNotFoundException: No such file or directory 
> 's3n://X'
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> I believe 2 things should (or can) be fixed:
> 1. Application should fail in case of such an error.
> 2. Allow the application to ignore such a failure, since there is a chance that 
> during the next refresh the error will not resurface. (In my case I believe the 
> error was caused by S3 cleaning the bucket at exactly the moment when the 
> refresh was running.) 
> My code to create the streaming processing looks like the following:
>   val cq = sqlContext.readStream
> .format("json")
> .schema(struct)
> .load(s"input")
> .writeStream
> .option("checkpointLocation", s"checkpoints")
> .foreach(new ForeachWriter[Row] {...})
> .trigger(ProcessingTime("10 seconds")).start()
>   
> cq.awaitTermination() 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-12-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17344:
-
Target Version/s:   (was: 2.1.1)

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18985) Add missing @InterfaceStability.Evolving for Structured Streaming APIs

2016-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771175#comment-15771175
 ] 

Apache Spark commented on SPARK-18985:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16385

> Add missing @InterfaceStability.Evolving for Structured Streaming APIs
> --
>
> Key: SPARK-18985
> URL: https://issues.apache.org/jira/browse/SPARK-18985
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18985) Add missing @InterfaceStability.Evolving for Structured Streaming APIs

2016-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18985:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Add missing @InterfaceStability.Evolving for Structured Streaming APIs
> --
>
> Key: SPARK-18985
> URL: https://issues.apache.org/jira/browse/SPARK-18985
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18941) Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the directory associated with the Hive table (not EXTERNAL table) from the HDFS file system

2016-12-22 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771176#comment-15771176
 ] 

Dongjoon Hyun commented on SPARK-18941:
---

After investigating, I found that it was reported by [~andrewor14] and 
committed by [~yhuai] as an official issue, SPARK-15276, in 2.0.0.

So, although this is not well documented except in code, this is definitely 
intended behavior, designed differently from Hive. Since the location 
is not under `spark-warehouse`, I agree with the current behavior.

If there is no reason other than Hive compatibility, it seems invalid 
to make a PR to fix this.

Is it okay, [~luatnc]?

cc [~smilegator] [~rxin]
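
For reference, a small illustration of the intended behavior. This is a sketch only, 
assuming a SparkSession with Hive support (e.g. in spark-shell) and an illustrative 
path outside {{spark-warehouse}}:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("drop-table-location-demo")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE demo_tbl (id INT) LOCATION '/tmp/demo_tbl'")
spark.sql("INSERT INTO demo_tbl VALUES (1)")
spark.sql("DROP TABLE demo_tbl")
// On Spark 2.0.x the table metadata is removed, but because the location was
// specified explicitly (outside spark-warehouse), /tmp/demo_tbl and its files
// are intentionally left in place.
{code}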

> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the 
> directory associated with the Hive table (not EXTERNAL table) from the HDFS 
> file system
> -
>
> Key: SPARK-18941
> URL: https://issues.apache.org/jira/browse/SPARK-18941
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: luat
>
> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the 
> directory associated with the Hive table (not EXTERNAL table) from the HDFS 
> file system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18985) Add missing @InterfaceStability.Evolving for Structured Streaming APIs

2016-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18985:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Add missing @InterfaceStability.Evolving for Structured Streaming APIs
> --
>
> Key: SPARK-18985
> URL: https://issues.apache.org/jira/browse/SPARK-18985
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18985) Add missing @InterfaceStability.Evolving for Structured Streaming APIs

2016-12-22 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18985:


 Summary: Add missing @InterfaceStability.Evolving for Structured 
Streaming APIs
 Key: SPARK-18985
 URL: https://issues.apache.org/jira/browse/SPARK-18985
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18537) Add a REST api to spark streaming

2016-12-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-18537:
---
Assignee: Xing Shi

> Add a REST api to spark streaming
> -
>
> Key: SPARK-18537
> URL: https://issues.apache.org/jira/browse/SPARK-18537
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Reporter: Peter Chan
>Assignee: Xing Shi
> Fix For: 2.2.0
>
>
> While trying to monitor our streaming application using the Spark REST interface,
> we found out that there is no API for streaming,
> which left us no choice but to implement one ourselves.
> This API covers exactly the same information as you can get
> from the web interface.
> The implementation is based on the current REST implementation of spark-core
> and is available for running applications only.
> Here is how you can use it:
> endpoint root: /streaming/api/v1
> || Endpoint || Meaning ||
> |/statistics|Statistics information of stream|
> |/receivers|A list of all receiver streams|
> |/receivers/\[stream-id\]|Details of the given receiver stream|
> |/batches|A list of all retained batches|
> |/batches/\[batch-id\]|Details of the given batch|
> |/batches/\[batch-id\]/operations|A list of all output operations of the 
> given batch|
> |/batches/\[batch-id\]/operations/\[operation-id\]|Details of the given 
> operation, given batch|
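
Once a streaming application is running, the endpoints can be polled over HTTP; 
a minimal sketch in Scala (the host, port, and response handling are assumptions, 
not part of the change itself):

{code}
// Sketch: reading two of the endpoints listed above.
// Assumes the endpoints are served by the application UI (default port 4040).
import scala.io.Source

val base = "http://localhost:4040/streaming/api/v1"
val statisticsJson = Source.fromURL(s"$base/statistics").mkString
val batchesJson    = Source.fromURL(s"$base/batches").mkString
println(statisticsJson)
println(batchesJson)
{code}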



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing

2016-12-22 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771166#comment-15771166
 ] 

Shixiong Zhu commented on SPARK-18970:
--

Did the Spark task fail or not? Looks like the Spark task was retried and it 
succeeded.

Ignoring such failures is addressed by SPARK-17850 and SPARK-18774.
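
If those changes expose this through the existing ignoreCorruptFiles flag (an 
assumption here, not confirmed in this thread), opting in would look roughly like:

{code}
// Sketch only: skip corrupt/missing files during file-source scans instead of
// failing the query. The exact option name is assumed from the tickets above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ignore-missing-files-sketch")
  .config("spark.sql.files.ignoreCorruptFiles", "true")
  .getOrCreate()
{code}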

> FileSource failure during file list refresh doesn't cause an application to 
> fail, but stops further processing
> --
>
> Key: SPARK-18970
> URL: https://issues.apache.org/jira/browse/SPARK-18970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.0, 2.0.2
>Reporter: Lev
>
> A Spark streaming application uses S3 files as streaming sources. After running 
> for several days, processing stopped even though the application continued to 
> run. 
> Stack trace:
> java.io.FileNotFoundException: No such file or directory 
> 's3n://X'
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> I believe 2 things should (or can) be fixed:
> 1. The application should fail in case of such an error.
> 2. Allow the application to ignore such a failure, since there is a chance that 
> during the next refresh the error will not resurface. (In my case I believe the 
> error was caused by S3 cleaning the bucket at exactly the same moment the 
> refresh was running.) 
> My code to create the streaming processing looks like the following:
>   val cq = sqlContext.readStream
> .format("json")
> .schema(struct)
> .load(s"input")
> .writeStream
> .option("checkpointLocation", s"checkpoints")
> .foreach(new ForeachWriter[Row] {...})
> .trigger(ProcessingTime("10 seconds")).start()
>   
> cq.awaitTermination() 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18537) Add a REST api to spark streaming

2016-12-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-18537.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Add a REST api to spark streaming
> -
>
> Key: SPARK-18537
> URL: https://issues.apache.org/jira/browse/SPARK-18537
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Reporter: Peter Chan
> Fix For: 2.2.0
>
>
> While trying to monitor our streaming application using the Spark REST interface,
> we found out that there is no API for streaming,
> which left us no choice but to implement one ourselves.
> This API covers exactly the same information as you can get
> from the web interface.
> The implementation is based on the current REST implementation of spark-core
> and is available for running applications only.
> Here is how you can use it:
> endpoint root: /streaming/api/v1
> || Endpoint || Meaning ||
> |/statistics|Statistics information of stream|
> |/receivers|A list of all receiver streams|
> |/receivers/\[stream-id\]|Details of the given receiver stream|
> |/batches|A list of all retained batches|
> |/batches/\[batch-id\]|Details of the given batch|
> |/batches/\[batch-id\]/operations|A list of all output operations of the 
> given batch|
> |/batches/\[batch-id\]/operations/\[operation-id\]|Details of the given 
> operation, given batch|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type

2016-12-22 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770991#comment-15770991
 ] 

Ilya Matiach commented on SPARK-18054:
--

It looks like I can still repro the error with this code:

val data = sc.parallelize(Seq((1.0,
  org.apache.spark.mllib.linalg.Vectors.dense(Array(1.0, 2.0, 3.0))))).toDF("label", "features")
val extractProbability = udf((vector: DenseVector) => vector(1))
val dfWithProbability = data.withColumn("foo",
  extractProbability(col("features")))

The error message is:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(features)' due to 
data type mismatch: argument 1 requires vector type, however, '`features`' is 
of vector type.;
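
A sketch of the working variant (assuming the root cause is the ml vs. mllib 
vector types, and a spark-shell session so that spark.implicits._ is in scope):

{code}
// Sketch: use the new org.apache.spark.ml.linalg vector types in the UDF,
// since Spark 2.x ML pipelines produce those rather than mllib vectors.
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

val data = Seq((1.0, Vectors.dense(1.0, 2.0, 3.0))).toDF("label", "features")
val extractProbability = udf((vector: Vector) => vector(1))
val dfWithProbability = data.withColumn("foo", extractProbability(col("features")))
{code}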

> Unexpected error from UDF that gets an element of a vector: argument 1 
> requires vector type, however, '`_column_`' is of vector type
> 
>
> Key: SPARK-18054
> URL: https://issues.apache.org/jira/browse/SPARK-18054
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 2.0.1
>Reporter: Barry Becker
>Priority: Minor
>
> Not sure if this is a bug in ML or a more core part of spark.
> It used to work in spark 1.6.2, but now gives me an error.
> I have a pipeline that contains a NaiveBayesModel which I created like this
> {code}
> val nbModel = new NaiveBayes()
>   .setLabelCol(target)
>   .setFeaturesCol(FEATURES_COL)
>   .setPredictionCol(PREDICTION_COLUMN)
>   .setProbabilityCol("_probability_column_")
>   .setModelType("multinomial")
> {code}
> When I apply that pipeline to some data there will be a 
> "_probability_column_" of type vector. I want to extract a probability for a 
> specific class label using the following, but it no longer works.
> {code}
> var newDf = pipeline.transform(df)
> val extractProbability = udf((vector: DenseVector) => vector(1))
> val dfWithProbability = newDf.withColumn("foo", 
> extractProbability(col("_probability_column_")))
> {code}
> The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. 
> I consider this a strange error because it's basically saying "argument 1 
> requires a vector, but we got a vector instead". That does not make any sense 
> to me. It wants a vector, and a vector was given. Why does it fail?
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'UDF(_class_probability_column__)' due to data type mismatch: argument 1 
> requires vector type, however, '`_class_probability_column__`' is of vector 
> type.;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 

[jira] [Commented] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type

2016-12-22 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770981#comment-15770981
 ] 

Ilya Matiach commented on SPARK-18054:
--

Actually, that error message above looks different.  Maybe the model 
transformed the dataset into something that had the old-style vector UDT?
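
If that is the case, converting the old-style columns before applying the UDF 
might be enough; a sketch, assuming the DataFrame and column name from the report:

{code}
// Sketch: convert old mllib vector columns to the new ml vectors so that a
// UDF written against org.apache.spark.ml.linalg types can consume them.
import org.apache.spark.mllib.util.MLUtils

val converted = MLUtils.convertVectorColumnsToML(newDf, "_probability_column_")
{code}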

> Unexpected error from UDF that gets an element of a vector: argument 1 
> requires vector type, however, '`_column_`' is of vector type
> 
>
> Key: SPARK-18054
> URL: https://issues.apache.org/jira/browse/SPARK-18054
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 2.0.1
>Reporter: Barry Becker
>Priority: Minor
>
> Not sure if this is a bug in ML or a more core part of spark.
> It used to work in spark 1.6.2, but now gives me an error.
> I have a pipeline that contains a NaiveBayesModel which I created like this
> {code}
> val nbModel = new NaiveBayes()
>   .setLabelCol(target)
>   .setFeaturesCol(FEATURES_COL)
>   .setPredictionCol(PREDICTION_COLUMN)
>   .setProbabilityCol("_probability_column_")
>   .setModelType("multinomial")
> {code}
> When I apply that pipeline to some data there will be a 
> "_probability_column_" of type vector. I want to extract a probability for a 
> specific class label using the following, but it no longer works.
> {code}
> var newDf = pipeline.transform(df)
> val extractProbability = udf((vector: DenseVector) => vector(1))
> val dfWithProbability = newDf.withColumn("foo", 
> extractProbability(col("_probability_column_")))
> {code}
> The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. 
> I consider this a strange error because it's basically saying "argument 1 
> requires a vector, but we got a vector instead". That does not make any sense 
> to me. It wants a vector, and a vector was given. Why does it fail?
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'UDF(_class_probability_column__)' due to data type mismatch: argument 1 
> requires vector type, however, '`_class_probability_column__`' is of vector 
> type.;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
>   at 
> 

[jira] [Resolved] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type

2016-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18054.
---
Resolution: Not A Problem

> Unexpected error from UDF that gets an element of a vector: argument 1 
> requires vector type, however, '`_column_`' is of vector type
> 
>
> Key: SPARK-18054
> URL: https://issues.apache.org/jira/browse/SPARK-18054
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 2.0.1
>Reporter: Barry Becker
>Priority: Minor
>
> Not sure if this is a bug in ML or a more core part of spark.
> It used to work in spark 1.6.2, but now gives me an error.
> I have a pipeline that contains a NaiveBayesModel which I created like this
> {code}
> val nbModel = new NaiveBayes()
>   .setLabelCol(target)
>   .setFeaturesCol(FEATURES_COL)
>   .setPredictionCol(PREDICTION_COLUMN)
>   .setProbabilityCol("_probability_column_")
>   .setModelType("multinomial")
> {code}
> When I apply that pipeline to some data there will be a 
> "_probability_column_" of type vector. I want to extract a probability for a 
> specific class label using the following, but it no longer works.
> {code}
> var newDf = pipeline.transform(df)
> val extractProbability = udf((vector: DenseVector) => vector(1))
> val dfWithProbability = newDf.withColumn("foo", 
> extractProbability(col("_probability_column_")))
> {code}
> The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. 
> I consider this a strange error because it's basically saying "argument 1 
> requires a vector, but we got a vector instead". That does not make any sense 
> to me. It wants a vector, and a vector was given. Why does it fail?
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'UDF(_class_probability_column__)' due to data type mismatch: argument 1 
> requires vector type, however, '`_class_probability_column__`' is of vector 
> type.;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:210)
>  

[jira] [Commented] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type

2016-12-22 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770972#comment-15770972
 ] 

Ilya Matiach commented on SPARK-18054:
--

It looks like this is already fixed in the latest version.  The error message I 
got from Spark was:

java.lang.IllegalArgumentException: requirement failed: Column features must be 
of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually 
org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.

My test code was in NaiveBayesSuite, the code was:

import org.apache.spark.sql.functions._

test("SPARK-18054: throw nice error message when using mllib vector instead of 
ml vector") {
// Example taken from bug submitter
// scalastyle:off println
dataset.columns.foreach(println(_))
// scalastyle:on println

val nbModel = new NaiveBayes()
  .setLabelCol(dataset.columns(0))
  .setFeaturesCol(dataset.columns(1))
  .setPredictionCol("_prediction_column_")
  .setProbabilityCol("_probability_column_")
  .setModelType("multinomial")
  .fit(dataset)
val data = sc.parallelize(Seq((1.0,
  org.apache.spark.mllib.linalg.Vectors.dense(Array.empty[Double])))).toDF("label", "features")
var newDf = nbModel.transform(data)
val extractProbability = udf((vector: DenseVector) => vector(1))
val dfWithProbability = newDf.withColumn("foo",
  extractProbability(col("_probability_column_")))
  }


> Unexpected error from UDF that gets an element of a vector: argument 1 
> requires vector type, however, '`_column_`' is of vector type
> 
>
> Key: SPARK-18054
> URL: https://issues.apache.org/jira/browse/SPARK-18054
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 2.0.1
>Reporter: Barry Becker
>Priority: Minor
>
> Not sure if this is a bug in ML or a more core part of spark.
> It used to work in spark 1.6.2, but now gives me an error.
> I have a pipeline that contains a NaiveBayesModel which I created like this
> {code}
> val nbModel = new NaiveBayes()
>   .setLabelCol(target)
>   .setFeaturesCol(FEATURES_COL)
>   .setPredictionCol(PREDICTION_COLUMN)
>   .setProbabilityCol("_probability_column_")
>   .setModelType("multinomial")
> {code}
> When I apply that pipeline to some data there will be a 
> "_probability_column_" of type vector. I want to extract a probability for a 
> specific class label using the following, but it no longer works.
> {code}
> var newDf = pipeline.transform(df)
> val extractProbability = udf((vector: DenseVector) => vector(1))
> val dfWithProbability = newDf.withColumn("foo", 
> extractProbability(col("_probability_column_")))
> {code}
> The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. 
> I consider this a strange error because it's basically saying "argument 1 
> requires a vector, but we got a vector instead". That does not make any sense 
> to me. It wants a vector, and a vector was given. Why does it fail?
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'UDF(_class_probability_column__)' due to data type mismatch: argument 1 
> requires vector type, however, '`_class_probability_column__`' is of vector 
> type.;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
>   at 
> 

[jira] [Commented] (SPARK-18941) Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the directory associated with the Hive table (not EXTERNAL table) from the HDFS file system

2016-12-22 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770967#comment-15770967
 ] 

Dongjoon Hyun commented on SPARK-18941:
---

Hi. First of all, your case is correct; I can reproduce your example. Thanks.

Currently, this seems to be intentional behavior, because Spark assumes a table 
is EXTERNAL when users give a location.

```
scala> sql("create table table_with_location(a int) stored as orc location 
'/tmp/table_with_location'")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("desc extended table_with_location").show(false)
...
|# Detailed Table Information|CatalogTable(
Table: `default`.`table_with_location`
Owner: dhyun
Created: Thu Dec 22 12:01:35 PST 2016
Last Access: Wed Dec 31 16:00:00 PST 1969
Type: EXTERNAL
Schema: [StructField(a,IntegerType,true)]
Provider: hive
Properties: [transient_lastDdlTime=1482436895]
...
```

Let me try to make a PR for this.

> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the 
> directory associated with the Hive table (not EXTERNAL table) from the HDFS 
> file system
> -
>
> Key: SPARK-18941
> URL: https://issues.apache.org/jira/browse/SPARK-18941
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: luat
>
> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the 
> directory associated with the Hive table (not EXTERNAL table) from the HDFS 
> file system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18941) Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the directory associated with the Hive table (not EXTERNAL table) from the HDFS file sys

2016-12-22 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770967#comment-15770967
 ] 

Dongjoon Hyun edited comment on SPARK-18941 at 12/22/16 8:05 PM:
-

Hi. First of all, your case is correct; I can reproduce your example. Thanks.

Currently, this seems to be intentional behavior, because Spark assumes a table 
is EXTERNAL when users give a location.

{code}
scala> sql("create table table_with_location(a int) stored as orc location 
'/tmp/table_with_location'")
scala> sql("desc extended table_with_location").show(false)
...
|# Detailed Table Information|CatalogTable(
Table: `default`.`table_with_location`
Owner: dhyun
Created: Thu Dec 22 12:01:35 PST 2016
Last Access: Wed Dec 31 16:00:00 PST 1969
Type: EXTERNAL
Schema: [StructField(a,IntegerType,true)]
Provider: hive
Properties: [transient_lastDdlTime=1482436895]
...
{code}

Let me try to make a PR for this.


was (Author: dongjoon):
Hi. First of all, your case is correct; I can reproduce your example. Thanks.

Currently, this seems to be intentional behavior, because Spark assumes a table 
is EXTERNAL when users give a location.

```
scala> sql("create table table_with_location(a int) stored as orc location 
'/tmp/table_with_location'")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("desc extended table_with_location").show(false)
...
|# Detailed Table Information|CatalogTable(
Table: `default`.`table_with_location`
Owner: dhyun
Created: Thu Dec 22 12:01:35 PST 2016
Last Access: Wed Dec 31 16:00:00 PST 1969
Type: EXTERNAL
Schema: [StructField(a,IntegerType,true)]
Provider: hive
Properties: [transient_lastDdlTime=1482436895]
...
```

Let me try to make a PR for this.

> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the 
> directory associated with the Hive table (not EXTERNAL table) from the HDFS 
> file system
> -
>
> Key: SPARK-18941
> URL: https://issues.apache.org/jira/browse/SPARK-18941
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: luat
>
> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the 
> directory associated with the Hive table (not EXTERNAL table) from the HDFS 
> file system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18984) Concat with ds.write.text() throw exception if column contains null data

2016-12-22 Thread Tony Fraser (JIRA)
Tony Fraser created SPARK-18984:
---

 Summary: Concat with ds.write.text() throw exception if column 
contains null data
 Key: SPARK-18984
 URL: https://issues.apache.org/jira/browse/SPARK-18984
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
 Environment: Spark 2.0.2, Scala 2.11.8
Reporter: Tony Fraser


val customOutputFormat = outbound.select(concat(
  outbound.col("device_id"), lit("\t"),
  lit("\"device\"=\""),   col("device"),   lit("\","),
  lit("\"_fd_cast\"=\""), col("_fd_cast"), lit("\",")
)).where("device_type='ios_app' and _fd_cast is not null")

customOutputFormat
  .limit(1000)
  .write
  .option("nullValue", "NULL")
  .mode("overwrite")
  .text("/filepath")


There is no problem writing to JSON, CSV, or Parquet, and the code above works. 
But as soon as you take out "and _fd_cast is not null", it throws the exception 
below. Using either nullValue or treatEmptyValuesAsNulls, whether reading in or 
writing out, doesn't seem to matter.


Exception is:
16/12/22 14:16:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 
5 ms
16/12/22 14:16:18 INFO DefaultWriterContainer: Using output committer class 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/12/22 14:16:18 ERROR Utils: Aborting task
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:146)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/12/22 14:16:18 ERROR DefaultWriterContainer: Task attempt 
attempt_201612221416_0002_m_00_0 aborted.
16/12/22 14:16:18 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
org.apache.spark.SparkException: Task failed while writing rows
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:146)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
... 8 more
16/12/22 14:16:18 
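
For reference, one interim workaround (a sketch only; the DataFrame and column 
names follow the report above, and this does not fix the writer itself) is to 
make the single output column non-null before calling write.text(), since concat 
returns null as soon as any argument is null:

{code}
// Sketch: substitute a literal for null inputs so the concatenated column is
// never null when handed to the text writer.
import org.apache.spark.sql.functions.{coalesce, col, concat, lit}

val safeOutput = outbound.select(concat(
    outbound.col("device_id"), lit("\t"),
    lit("\"device\"=\""),   coalesce(col("device"),   lit("NULL")), lit("\","),
    lit("\"_fd_cast\"=\""), coalesce(col("_fd_cast"), lit("NULL")), lit("\",")
  )).where("device_type='ios_app'")

safeOutput.limit(1000).write.mode("overwrite").text("/filepath")
{code}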

[jira] [Resolved] (SPARK-18975) Add an API to remove SparkListener from SparkContext

2016-12-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18975.
-
   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 2.2.0

> Add an API to remove SparkListener from SparkContext 
> -
>
> Key: SPARK-18975
> URL: https://issues.apache.org/jira/browse/SPARK-18975
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.2.0
>
>
> In current Spark we can add a customized {{SparkListener}} through the 
> {{SparkContext#addListener}} API, but there is no API to remove a registered 
> one. In our scenario a SparkListener is added repeatedly according to the 
> changing environment. Without the ability to remove listeners, we may 
> eventually end up with a large number of registered listeners, which is 
> unnecessary and can potentially affect performance. So here we propose to add 
> an API to remove a registered listener.
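
For illustration, a sketch of how the pair of calls would look from application 
code (the remove method name follows this ticket's fix and should be treated as 
an assumption until 2.2.0; `sc` is an existing SparkContext, e.g. the spark-shell one):

{code}
// Sketch: register a custom listener, then unregister it once it is no longer needed.
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

class JobEndLogger extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}

val listener = new JobEndLogger
sc.addSparkListener(listener)      // existing API
// ... run some jobs ...
sc.removeSparkListener(listener)   // removal API proposed by this ticket (2.2.0)
{code}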



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18983) Couldn't find leader offsets exception when the one of kafka cluster brokers is down

2016-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18983.
---
  Resolution: Invalid
Target Version/s:   (was: 1.6.1)

Please read http://spark.apache.org/contributing.html
This should go to u...@spark.apache.org
But if all your brokers are down, this is the behavior you expect.

> Couldn't find leader offsets exception when the one of kafka cluster brokers 
> is down
> 
>
> Key: SPARK-18983
> URL: https://issues.apache.org/jira/browse/SPARK-18983
> Project: Spark
>  Issue Type: Bug
>Reporter: kraken
>Priority: Critical
>
> Hello, I ran into a problem today:
> when the company's PE restarted one of the Kafka cluster brokers, my Spark job 
> failed.
> Exception in thread "main" org.apache.spark.SparkException: get earliest 
> leader offsets failed: ArrayBuffer(org.apache.spark.SparkException: Couldn't 
> find leaders for Set
> I use the low-level API to consume the Kafka log. I know the high-level consumer 
> can be aware of leader changes, but how can I do that with the low-level API?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18982) Couldn't find leader offsets exception when the one of kafka cluster brokers is down

2016-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18982.
---
  Resolution: Duplicate
Target Version/s:   (was: 1.6.1)

> Couldn't find leader offsets exception when the one of kafka cluster brokers 
> is down
> 
>
> Key: SPARK-18982
> URL: https://issues.apache.org/jira/browse/SPARK-18982
> Project: Spark
>  Issue Type: Bug
>Reporter: kraken
>Priority: Critical
>
> Hello, I ran into a problem today:
> when the company's PE restarted one of the Kafka cluster brokers, my Spark job 
> failed.
> Exception in thread "main" org.apache.spark.SparkException: get earliest 
> leader offsets failed: ArrayBuffer(org.apache.spark.SparkException: Couldn't 
> find leaders for Set
> I use the low-level API to consume the Kafka log. I know the high-level consumer 
> can be aware of leader changes, but how can I do that with the low-level API?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18941) Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the directory associated with the Hive table (not EXTERNAL table) from the HDFS file system

2016-12-22 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770823#comment-15770823
 ] 

Dongjoon Hyun commented on SPARK-18941:
---

Thank you for the detail! I'll try that.

> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the 
> directory associated with the Hive table (not EXTERNAL table) from the HDFS 
> file system
> -
>
> Key: SPARK-18941
> URL: https://issues.apache.org/jira/browse/SPARK-18941
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: luat
>
> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the 
> directory associated with the Hive table (not EXTERNAL table) from the HDFS 
> file system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18983) Couldn't find leader offsets exception when the one of kafka cluster brokers is down

2016-12-22 Thread kraken (JIRA)
kraken created SPARK-18983:
--

 Summary: Couldn't find leader offsets exception when the one of 
kafka cluster brokers is down
 Key: SPARK-18983
 URL: https://issues.apache.org/jira/browse/SPARK-18983
 Project: Spark
  Issue Type: Bug
Reporter: kraken
Priority: Critical


Hello, I ran into a problem today.

When the company's PE restarted one of the Kafka cluster brokers, my Spark job 
failed.

Exception in thread "main" org.apache.spark.SparkException: get earliest leader 
offsets failed: ArrayBuffer(org.apache.spark.SparkException: Couldn't find 
leaders for Set

I use the low-level API to consume the Kafka log. I know the high-level consumer 
can be aware of leader changes, but how can I do that with the low-level API?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18982) Couldn't find leader offsets exception when the one of kafka cluster brokers is down

2016-12-22 Thread kraken (JIRA)
kraken created SPARK-18982:
--

 Summary: Couldn't find leader offsets exception when the one of 
kafka cluster brokers is down
 Key: SPARK-18982
 URL: https://issues.apache.org/jira/browse/SPARK-18982
 Project: Spark
  Issue Type: Bug
Reporter: kraken
Priority: Critical


Hello, I ran into a problem today.

When the company's PE restarted one of the Kafka cluster brokers, my Spark job 
failed.

Exception in thread "main" org.apache.spark.SparkException: get earliest leader 
offsets failed: ArrayBuffer(org.apache.spark.SparkException: Couldn't find 
leaders for Set

I use the low-level API to consume the Kafka log. I know the high-level consumer 
can be aware of leader changes, but how can I do that with the low-level API?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing

2016-12-22 Thread Lev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lev updated SPARK-18970:

Affects Version/s: 2.0.2

> FileSource failure during file list refresh doesn't cause an application to 
> fail, but stops further processing
> --
>
> Key: SPARK-18970
> URL: https://issues.apache.org/jira/browse/SPARK-18970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.0, 2.0.2
>Reporter: Lev
>
> A Spark streaming application uses S3 files as streaming sources. After running 
> for several days, processing stopped even though the application continued to 
> run. 
> Stack trace:
> java.io.FileNotFoundException: No such file or directory 
> 's3n://X'
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> I believe 2 things should (or can) be fixed:
> 1. The application should fail in case of such an error.
> 2. Allow the application to ignore such a failure, since there is a chance that 
> during the next refresh the error will not resurface. (In my case I believe the 
> error was caused by S3 cleaning the bucket at exactly the same moment the 
> refresh was running.) 
> My code to create the streaming processing looks like the following:
>   val cq = sqlContext.readStream
> .format("json")
> .schema(struct)
> .load(s"input")
> .writeStream
> .option("checkpointLocation", s"checkpoints")
> .foreach(new ForeachWriter[Row] {...})
> .trigger(ProcessingTime("10 seconds")).start()
>   
> cq.awaitTermination() 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18973) Remove SortPartitions and RedistributeData

2016-12-22 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18973.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> Remove SortPartitions and RedistributeData
> --
>
> Key: SPARK-18973
> URL: https://issues.apache.org/jira/browse/SPARK-18973
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.1, 2.2.0
>
>
> SortPartitions and RedistributeData logical operators are not actually used 
> and can be removed. Note that we do have a Sort operator (with global flag 
> false) that subsumed SortPartitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18031) Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality

2016-12-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18031:
-
Fix Version/s: 2.0.3

> Flaky test: 
> org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic 
> functionality
> ---
>
> Key: SPARK-18031
> URL: https://issues.apache.org/jira/browse/SPARK-18031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Shixiong Zhu
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite_name=basic+functionality



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18031) Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality

2016-12-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18031:
-
Fix Version/s: 2.2.0

> Flaky test: 
> org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic 
> functionality
> ---
>
> Key: SPARK-18031
> URL: https://issues.apache.org/jira/browse/SPARK-18031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite_name=basic+functionality



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type

2016-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18054:
--
   Priority: Minor  (was: Major)
Component/s: Documentation
 Issue Type: Improvement  (was: Bug)

> Unexpected error from UDF that gets an element of a vector: argument 1 
> requires vector type, however, '`_column_`' is of vector type
> 
>
> Key: SPARK-18054
> URL: https://issues.apache.org/jira/browse/SPARK-18054
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 2.0.1
>Reporter: Barry Becker
>Priority: Minor
>
> Not sure if this is a bug in ML or a more core part of spark.
> It used to work in spark 1.6.2, but now gives me an error.
> I have a pipeline that contains a NaiveBayesModel which I created like this
> {code}
> val nbModel = new NaiveBayes()
>   .setLabelCol(target)
>   .setFeaturesCol(FEATURES_COL)
>   .setPredictionCol(PREDICTION_COLUMN)
>   .setProbabilityCol("_probability_column_")
>   .setModelType("multinomial")
> {code}
> When I apply that pipeline to some data there will be a 
> "_probability_column_" of type vector. I want to extract a probability for a 
> specific class label using the following, but it no longer works.
> {code}
> var newDf = pipeline.transform(df)
> val extractProbability = udf((vector: DenseVector) => vector(1))
> val dfWithProbability = newDf.withColumn("foo", 
> extractProbability(col("_probability_column_")))
> {code}
> The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. 
> I consider this a strange error because it's basically saying "argument 1 
> requires a vector, but we got a vector instead". That does not make any sense 
> to me. It wants a vector, and a vector was given. Why does it fail?
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'UDF(_class_probability_column__)' due to data type mismatch: argument 1 
> requires vector type, however, '`_class_probability_column__`' is of vector 
> type.;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> 

[jira] [Commented] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type

2016-12-22 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770690#comment-15770690
 ] 

Ilya Matiach commented on SPARK-18054:
--

I can try to repro this and add in a better error message.

> Unexpected error from UDF that gets an element of a vector: argument 1 
> requires vector type, however, '`_column_`' is of vector type
> 
>
> Key: SPARK-18054
> URL: https://issues.apache.org/jira/browse/SPARK-18054
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: Barry Becker
>
> Not sure if this is a bug in ML or a more core part of spark.
> It used to work in spark 1.6.2, but now gives me an error.
> I have a pipeline that contains a NaiveBayesModel which I created like this
> {code}
> val nbModel = new NaiveBayes()
>   .setLabelCol(target)
>   .setFeaturesCol(FEATURES_COL)
>   .setPredictionCol(PREDICTION_COLUMN)
>   .setProbabilityCol("_probability_column_")
>   .setModelType("multinomial")
> {code}
> When I apply that pipeline to some data there will be a 
> "_probability_column_" of type vector. I want to extract a probability for a 
> specific class label using the following, but it no longer works.
> {code}
> var newDf = pipeline.transform(df)
> val extractProbability = udf((vector: DenseVector) => vector(1))
> val dfWithProbability = newDf.withColumn("foo", 
> extractProbability(col("_probability_column_")))
> {code}
> The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. 
> I consider this a strange error because it's basically saying "argument 1 
> requires a vector, but we got a vector instead". That does not make any sense 
> to me. It wants a vector, and a vector was given. Why does it fail?
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'UDF(_class_probability_column__)' due to data type mismatch: argument 1 
> requires vector type, however, '`_class_probability_column__`' is of vector 
> type.;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> 
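
A likely explanation (an assumption here, not something confirmed in this thread) is a mismatch between two vector classes that both display as "vector": Spark 2.x ML pipelines emit {{org.apache.spark.ml.linalg}} vectors, while a UDF written against the old {{org.apache.spark.mllib.linalg.DenseVector}} expects the other type. A minimal sketch of the ml-package variant of the UDF:

{code}
// Sketch only: assumes the probability column was produced by a spark.ml pipeline,
// so the UDF must be typed against org.apache.spark.ml.linalg.DenseVector.
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions.{col, udf}

val extractProbability = udf((vector: DenseVector) => vector(1))
val dfWithProbability = newDf.withColumn("foo", extractProbability(col("_probability_column_")))
{code}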

[jira] [Updated] (SPARK-18738) Some Spark SQL queries has poor performance on HDFS Erasure Coding feature when enabling dynamic allocation.

2016-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18738:
--
Fix Version/s: (was: 2.2.0)

> Some Spark SQL queries has poor performance on HDFS Erasure Coding feature 
> when enabling dynamic allocation.
> 
>
> Key: SPARK-18738
> URL: https://issues.apache.org/jira/browse/SPARK-18738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Lifeng Wang
>
> We run TPCx-BB with the Spark SQL engine on a local cluster using Spark 2.0.3 trunk 
> and Hadoop 3.0 alpha 2 trunk. We run Spark SQL queries with the same data size on 
> both Erasure Coding and 3-replication. The test results show that some 
> queries have much worse performance on EC compared to 3-replication. After 
> initial investigation, we found Spark starts one third as many executors to execute 
> queries on EC compared to 3-replication. 
> Taking query 30 as an example, our cluster can launch at most 108 executors. 
> When we run the query against the 3-replication database, Spark will start all 108 
> executors to execute the query. When we run the query against the Erasure Coding 
> database, Spark will launch 108 executors and kill 72 of them because 
> they're idle, so in the end only 36 executors execute the query, which 
> leads to poor performance.
> This issue only happens when we enable the dynamic allocation mechanism. When we 
> disable dynamic allocation, Spark SQL queries on EC have similar 
> performance to 3-replication.
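
A minimal sketch of the workaround implied by the last paragraph, i.e. disabling dynamic allocation for the EC runs (the keys are standard Spark settings; the values are illustrative):

{code}
import org.apache.spark.SparkConf

// Illustrative: compare EC vs 3-replication without allocation churn.
// 108 matches the executor count mentioned above; adjust to the cluster.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.executor.instances", "108")
{code}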



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18234) Update mode in structured streaming

2016-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18234:
--
Assignee: Tathagata Das

> Update mode in structured streaming
> ---
>
> Key: SPARK-18234
> URL: https://issues.apache.org/jira/browse/SPARK-18234
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 2.1.1, 2.2.0
>
>
> We have this internal, but we should nail down the semantics and expose it to 
> users.  The idea of update mode is that any tuple that changes will be 
> emitted.  Open questions:
>  - do we need to reason about the {{keys}} for a given stream?  For things 
> like the {{foreach}} sink its up to the user.  However, for more end to end 
> use cases such as a JDBC sink, we need to know which row downstream is being 
> updated.
>  - okay to not support files?
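
For context, this work eventually surfaces through the streaming writer API; a minimal sketch of requesting update mode (assuming the {{OutputMode.Update}} form, with {{counts}} standing in for any streaming aggregation):

{code}
import org.apache.spark.sql.streaming.OutputMode

// Sketch: in update mode, only rows whose aggregate value changed since the
// last trigger are emitted to the sink.
val query = counts.writeStream
  .outputMode(OutputMode.Update())
  .format("console")
  .start()
{code}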



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18281:
--
Assignee: Liang-Chi Hsieh

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.3, 2.1.1
>
>
> I run the example straight out of the API docs for toLocalIterator and it 
> gives a timeout exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadatafalse
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17801) [ML]Random Forest Regression fails for large input

2016-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17801.
---
Resolution: Not A Problem

I think this is just attributable to extremely high maxBins, and not a bug.

> [ML]Random Forest Regression fails for large input
> --
>
> Key: SPARK-17801
> URL: https://issues.apache.org/jira/browse/SPARK-17801
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
> Environment: Ubuntu 14.04
>Reporter: samkit
>Priority: Minor
>
> Random Forest Regression
> Data:https://www.kaggle.com/c/grupo-bimbo-inventory-demand/download/train.csv.zip
> Parameters:
> NumTrees:500Maximum Bins:7477383 MaxDepth:27
> MinInstancesPerNode:8648  SamplingRate:1.0
> Java Options:
> "-Xms16384M" "-Xmx16384M" "-Dspark.locality.wait=0s" 
> "-Dspark.driver.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC 
> -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:-UseAdaptiveSizePolicy 
> -XX:ConcGCThreads=2 -XX:-UseGCOverheadLimit  
> -XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=8g -XX:MaxNewSize=8g 
> -XX:SurvivorRatio=3 -DnumPartitions=36" "-Dspark.submit.deployMode=cluster" 
> "-Dspark.speculation=true" " "-Dspark.speculation.multiplier=2" 
> "-Dspark.driver.memory=16g" "-Dspark.speculation.interval=300ms"  
> "-Dspark.speculation.quantile=0.5" "-Dspark.akka.frameSize=768" 
> "-Dspark.driver.supervise=false" "-Dspark.executor.cores=6" 
> "-Dspark.executor.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution 
> -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:+UseParallelOldGC 
> -XX:ParallelGCThreads=6 -XX:NewSize=22g -XX:MaxNewSize=22g 
> -XX:SurvivorRatio=2 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps" 
> "-Dspark.rpc.askTimeout=10" "-Dspark.executor.memory=40g" 
> "-Dspark.driver.maxResultSize=3g" "-Xss10240k" "-XX:+PrintGCDetails" 
> "-XX:+PrintGCTimeStamps" "-XX:+PrintTenuringDistribution" 
> "-XX:+UseConcMarkSweepGC" "-XX:+UseParNewGC" "-XX:ParallelGCThreads=2" 
> "-XX:-UseAdaptiveSizePolicy" "-XX:ConcGCThreads=2" "-XX:-UseGCOverheadLimit" 
> "-XX:CMSInitiatingOccupancyFraction=75" "-XX:NewSize=8g" "-XX:MaxNewSize=8g" 
> "-XX:SurvivorRatio=3" "-DnumPartitions=36"
> Partial Driver StackTrace:
> org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:740)
>   
> org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:525)
>   org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:160)
>   
> org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:209)
>   
> org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:197)
>   org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   org.apache.spark.ml.Estimator.fit(Estimator.scala:59)
>   org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
>   org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
> For complete Executor and Driver ErrorLog
> https://gist.github.com/anonymous/603ac7f8f17e43c51ba93b2934cd4cb6
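
For comparison, the report sets maxBins to 7,477,383; a sketch of a more conventional configuration (values are illustrative, not tuned for this dataset):

{code}
import org.apache.spark.ml.regression.RandomForestRegressor

// Sketch: keep maxBins small (the default is 32); millions of bins make the
// per-node split search and the aggregated statistics explode in size.
val rf = new RandomForestRegressor()
  .setNumTrees(500)
  .setMaxDepth(27)
  .setMinInstancesPerNode(8648)
  .setMaxBins(32)
{code}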



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17801) [ML]Random Forest Regression fails for large input

2016-12-22 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770463#comment-15770463
 ] 

Ilya Matiach commented on SPARK-17801:
--

Taking a look into the error

> [ML]Random Forest Regression fails for large input
> --
>
> Key: SPARK-17801
> URL: https://issues.apache.org/jira/browse/SPARK-17801
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
> Environment: Ubuntu 14.04
>Reporter: samkit
>Priority: Minor
>
> Random Forest Regression
> Data:https://www.kaggle.com/c/grupo-bimbo-inventory-demand/download/train.csv.zip
> Parameters:
> NumTrees:500Maximum Bins:7477383 MaxDepth:27
> MinInstancesPerNode:8648  SamplingRate:1.0
> Java Options:
> "-Xms16384M" "-Xmx16384M" "-Dspark.locality.wait=0s" 
> "-Dspark.driver.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC 
> -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:-UseAdaptiveSizePolicy 
> -XX:ConcGCThreads=2 -XX:-UseGCOverheadLimit  
> -XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=8g -XX:MaxNewSize=8g 
> -XX:SurvivorRatio=3 -DnumPartitions=36" "-Dspark.submit.deployMode=cluster" 
> "-Dspark.speculation=true" " "-Dspark.speculation.multiplier=2" 
> "-Dspark.driver.memory=16g" "-Dspark.speculation.interval=300ms"  
> "-Dspark.speculation.quantile=0.5" "-Dspark.akka.frameSize=768" 
> "-Dspark.driver.supervise=false" "-Dspark.executor.cores=6" 
> "-Dspark.executor.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution 
> -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:+UseParallelOldGC 
> -XX:ParallelGCThreads=6 -XX:NewSize=22g -XX:MaxNewSize=22g 
> -XX:SurvivorRatio=2 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps" 
> "-Dspark.rpc.askTimeout=10" "-Dspark.executor.memory=40g" 
> "-Dspark.driver.maxResultSize=3g" "-Xss10240k" "-XX:+PrintGCDetails" 
> "-XX:+PrintGCTimeStamps" "-XX:+PrintTenuringDistribution" 
> "-XX:+UseConcMarkSweepGC" "-XX:+UseParNewGC" "-XX:ParallelGCThreads=2" 
> "-XX:-UseAdaptiveSizePolicy" "-XX:ConcGCThreads=2" "-XX:-UseGCOverheadLimit" 
> "-XX:CMSInitiatingOccupancyFraction=75" "-XX:NewSize=8g" "-XX:MaxNewSize=8g" 
> "-XX:SurvivorRatio=3" "-DnumPartitions=36"
> Partial Driver StackTrace:
> org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:740)
>   
> org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:525)
>   org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:160)
>   
> org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:209)
>   
> org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:197)
>   org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   org.apache.spark.ml.Estimator.fit(Estimator.scala:59)
>   org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
>   org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
> For complete Executor and Driver ErrorLog
> https://gist.github.com/anonymous/603ac7f8f17e43c51ba93b2934cd4cb6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2016-12-22 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770446#comment-15770446
 ] 

Ilya Matiach commented on SPARK-17975:
--

Could you send a link to the repro dataset?  I could work on this issue but it 
looks like you have a fix already.  For any fixes we need tests to validate 
them.

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes 
> the RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an 
> RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}
> {noformat}
> 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 
> 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be 
> cast to org.apache.spark.graphx.Edge
>   at 
> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To 

[jira] [Resolved] (SPARK-18922) Fix more resource-closing-related and path-related test failures in identified ones on Windows

2016-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18922.
---
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.2.0

If we find more like this in the short term, let's just reopen rather than make 
more JIRAs.

Resolved by https://github.com/apache/spark/pull/16335

> Fix more resource-closing-related and path-related test failures in 
> identified ones on Windows
> --
>
> Key: SPARK-18922
> URL: https://issues.apache.org/jira/browse/SPARK-18922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> There are more instances that fail on Windows, as shown below:
> - {{LauncherBackendSuite}}:
> {code}
> - local: launcher handle *** FAILED *** (30 seconds, 120 milliseconds)
>   The code passed to eventually never returned normally. Attempted 283 times 
> over 30.0960053 seconds. Last failure message: The reference was null. 
> (LauncherBackendSuite.scala:56)
>   org.scalatest.exceptions.TestFailedDueToTimeoutException:
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
> - standalone/client: launcher handle *** FAILED *** (30 seconds, 47 
> milliseconds)
>   The code passed to eventually never returned normally. Attempted 282 times 
> over 30.03798710002 seconds. Last failure message: The reference was 
> null. (LauncherBackendSuite.scala:56)
>   org.scalatest.exceptions.TestFailedDueToTimeoutException:
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
> {code}
> - {{SQLQuerySuite}}:
> {code}
> - specifying database name for a temporary table is not allowed *** FAILED 
> *** (125 milliseconds)
>   org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/C:projectsspark  arget mpspark-1f4471ab-aac0-4239-ae35-833d54b37e52;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {code}
> - {{JsonSuite}}:
> {code}
> - Loading a JSON dataset from a text file with SQL *** FAILED *** (94 
> milliseconds)
>   org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/C:projectsspark  arget mpspark-c918a8b7-fc09-433c-b9d0-36c0f78ae918;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {code}
> - {{StateStoreSuite}}:
> {code}
> - SPARK-18342: commit fails when rename fails *** FAILED *** (16 milliseconds)
>   java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.(Path.java:116)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   ...
>   Cause: java.net.URISyntaxException: Relative path in absolute URI: 
> StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:203)
> {code}
> - {{HDFSMetadataLogSuite}}:
> {code}
> - FileManager: FileContextManager *** FAILED *** (94 milliseconds)
>   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-415bb0bd-396b-444d-be82-04599e025f21
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
> - FileManager: FileSystemManager *** FAILED *** (78 milliseconds)
>   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-ef8222cd-85aa-47c0-a396-bc7979e15088
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
> {code}
> 

[jira] [Resolved] (SPARK-18878) Fix/investigate the more identified test failures in Java/Scala on Windows

2016-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18878.
---
Resolution: Done

Let's look at reopening recent JIRAs and adding PRs if you find more changes of 
the same type as in previous changes.

> Fix/investigate the more identified test failures in Java/Scala on Windows
> --
>
> Key: SPARK-18878
> URL: https://issues.apache.org/jira/browse/SPARK-18878
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Hyukjin Kwon
>
> It seems many tests fail on Windows. Some failures are related only to the 
> tests themselves, whereas others are related to the functionality under test, which 
> causes actual failures for some APIs on Windows.
> The tests were hanging due to some issues in SPARK-17591 and SPARK-18785, and 
> now apparently we can proceed much further (it seems we might 
> reach the end).
> The tests proceeded via AppVeyor - 
> https://ci.appveyor.com/project/spark-test/spark/build/259-spark-test-windows



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18981) The last job hung when speculation is on

2016-12-22 Thread roncenzhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

roncenzhao updated SPARK-18981:
---
Description: 
related settings:
spark.speculation   true
spark.dynamicAllocation.minExecutors   0
spark.executor.cores   4

When I run the following app, the bug triggers.
```
sc.runJob(job1)
sleep(100s)
sc.runJob(job2) // the job2 will hang and never be scheduled
```

The triggering conditions are as follows:
condition1: During the sleep, the executors are released and the number 
of executors drops to zero a few seconds later. The #numExecutorsTarget in 
'ExecutorAllocationManager' will be 0.
condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks 
goes negative while job1's tasks are finishing. 
condition3: job2 has only one task.

result:
In the method 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', we 
calculate #maxNeeded via 'maxNumExecutorsNeeded()'. Because 
#numRunningOrPendingTasks is negative, #maxNeeded will be 0 or 
negative, so the 'ExecutorAllocationManager' will not request containers from 
YARN. The app hangs.




  was:
related settings:
spark.speculation   true
spark.dynamicAllocation.minExecutors0
spark.executor.cores   4

When I run the follow app, the bug will trigger.
```
sc.runJob(job1)
sleep(100s)
sc.runJob(job2) // the job2 will hang and never be scheduled
```

The triggering condition is described as follows:
condition1: During the sleeping time, the executors will be released and the # 
of the executor will be zero some seconds later. The #numExecutorsTarget in 
'ExecutorAllocationManager' will be 0.
condition2: In 'ExecutorAllocationListener.onTaskEnd()', the numRunningTasks 
will be negative during the ending of job1's tasks. 
condition3: The job2 only hava one task.

result:
In the method 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', we 
will calculate #maxNeeded in 'maxNumExecutorsNeeded()'. Obviously, 
#numRunningOrPendingTasks will be negative and the #maxNeeded will be 0 or 
negative. So the 'ExecutorAllocationManager' will not request container from 
yarn. The app will be hung.





> The last job hung when speculation is on
> 
>
> Key: SPARK-18981
> URL: https://issues.apache.org/jira/browse/SPARK-18981
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
> hadoop2.5.0
>Reporter: roncenzhao
>Priority: Critical
>
> related settings:
> spark.speculation   true
> spark.dynamicAllocation.minExecutors   0
> spark.executor.cores   4
> When I run the following app, the bug triggers.
> ```
> sc.runJob(job1)
> sleep(100s)
> sc.runJob(job2) // the job2 will hang and never be scheduled
> ```
> The triggering conditions are as follows:
> condition1: During the sleep, the executors are released and the 
> number of executors drops to zero a few seconds later. The #numExecutorsTarget in 
> 'ExecutorAllocationManager' will be 0.
> condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks 
> goes negative while job1's tasks are finishing. 
> condition3: job2 has only one task.
> result:
> In the method 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', 
> we calculate #maxNeeded via 'maxNumExecutorsNeeded()'. Because 
> #numRunningOrPendingTasks is negative, #maxNeeded will be 0 or 
> negative, so the 'ExecutorAllocationManager' will not request containers from 
> YARN. The app hangs.
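
A sketch of the kind of guard the description points at (illustrative only, not the actual patch): keep the listener's running-task counter from going negative, so that maxNumExecutorsNeeded() cannot be dragged to zero or below while work is still outstanding.

{code}
// Illustrative counter with the clamp described above; names are made up.
class TaskCounter {
  private var numRunningTasks = 0
  def onTaskStart(): Unit = synchronized { numRunningTasks += 1 }
  def onTaskEnd(): Unit = synchronized { numRunningTasks = math.max(numRunningTasks - 1, 0) }
  def running: Int = synchronized { numRunningTasks }
}
{code}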



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18981) The last job hung when speculation is on

2016-12-22 Thread roncenzhao (JIRA)
roncenzhao created SPARK-18981:
--

 Summary: The last job hung when speculation is on
 Key: SPARK-18981
 URL: https://issues.apache.org/jira/browse/SPARK-18981
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.2
 Environment: spark2.0.2
hadoop2.5.0
Reporter: roncenzhao
Priority: Critical


related settings:
spark.speculation   true
spark.dynamicAllocation.minExecutors   0
spark.executor.cores   4

When I run the following app, the bug triggers.
```
sc.runJob(job1)
sleep(100s)
sc.runJob(job2) // the job2 will hang and never be scheduled
```

The triggering conditions are as follows:
condition1: During the sleep, the executors are released and the number 
of executors drops to zero a few seconds later. The #numExecutorsTarget in 
'ExecutorAllocationManager' will be 0.
condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks 
goes negative while job1's tasks are finishing. 
condition3: job2 has only one task.

result:
In the method 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', we 
calculate #maxNeeded via 'maxNumExecutorsNeeded()'. Because 
#numRunningOrPendingTasks is negative, #maxNeeded will be 0 or 
negative, so the 'ExecutorAllocationManager' will not request containers from 
YARN. The app hangs.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18199) Support appending to Parquet files

2016-12-22 Thread Soubhik Chakraborty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770282#comment-15770282
 ] 

Soubhik Chakraborty commented on SPARK-18199:
-

Can't we use the PARQUET-382 feature that was added in 1.9.0? 
SPARK-13127 is already in progress and, assuming it won't hit any major hurdle, 
we should be able to use it right away. The only downside is that it doesn't allow 
appending two different schemas, which is understandable.

> Support appending to Parquet files
> --
>
> Key: SPARK-18199
> URL: https://issues.apache.org/jira/browse/SPARK-18199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jeremy Smith
>
> Currently, appending to a Parquet directory involves simply creating new 
> parquet files in the directory. With many small appends (for example, in a 
> streaming job with a short batch duration) this leads to an unbounded number 
> of small Parquet files accumulating. These must be cleaned up with some 
> frequency by removing them all and rewriting a new file containing all the 
> rows.
> It would be far better if Spark supported appending to the Parquet files 
> themselves. HDFS supports this, as does Parquet:
> * The Parquet footer can be read in order to obtain necessary metadata.
> * The new rows can then be appended to the Parquet file as a row group.
> * A new footer can then be appended containing the metadata and referencing 
> the new row groups as well as the previously existing row groups.
> This would result in a small amount of bloat in the file as new row groups 
> are added (since duplicate metadata would accumulate) but it's hugely 
> preferable to accumulating small files, which is bad for HDFS health and also 
> eventually leads to Spark being unable to read the Parquet directory at all.  
> Periodic rewriting of the file could still be performed in order to remove 
> the duplicate metadata.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18878) Fix/investigate the more identified test failures in Java/Scala on Windows

2016-12-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770030#comment-15770030
 ] 

Hyukjin Kwon commented on SPARK-18878:
--

Thank you for guiding me. Let me try to follow it.

> Fix/investigate the more identified test failures in Java/Scala on Windows
> --
>
> Key: SPARK-18878
> URL: https://issues.apache.org/jira/browse/SPARK-18878
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Hyukjin Kwon
>
> It seems many tests fail on Windows. Some failures are related only to the 
> tests themselves, whereas others are related to the functionality under test, which 
> causes actual failures for some APIs on Windows.
> The tests were hanging due to some issues in SPARK-17591 and SPARK-18785, and 
> now apparently we can proceed much further (it seems we might 
> reach the end).
> The tests proceeded via AppVeyor - 
> https://ci.appveyor.com/project/spark-test/spark/build/259-spark-test-windows



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18896) Suppress ScalaCheck warning -- Unknown ScalaCheck args provided when executing tests using sbt

2016-12-22 Thread PJ Fanning (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769991#comment-15769991
 ] 

PJ Fanning commented on SPARK-18896:


I noticed from the pull request that you are looking at possibly upgrading 
ScalaTest too. Getting to ScalaTest 3.0.1 would be useful for later Scala 2.12 
support. ScalaTest 2.x is not cross-compiled for Scala 2.12.

> Suppress ScalaCheck warning -- Unknown ScalaCheck args provided when 
> executing tests using sbt
> --
>
> Key: SPARK-18896
> URL: https://issues.apache.org/jira/browse/SPARK-18896
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> While executing tests for {{DAGScheduler}} I've noticed the following warning:
> {code}
> > core/testOnly org.apache.spark.scheduler.DAGSchedulerSuite
> ...
> [info] Warning: Unknown ScalaCheck args provided: -oDF
> {code}
> The reason is due to a bug in ScalaCheck as reported in 
> https://github.com/rickynils/scalacheck/issues/212 and fixed in 
> https://github.com/rickynils/scalacheck/commit/df435a5 that is available in 
> ScalaCheck 1.13.4.
> Spark uses [ScalaCheck 
> 1.12.5|https://github.com/apache/spark/blob/master/pom.xml#L717] which is 
> behind the latest 1.12.6 [released on Nov 
> 1|https://github.com/rickynils/scalacheck/releases] (not to mention 1.13.4).
> Let's get rid of ScalaCheck's warning (and perhaps upgrade ScalaCheck along 
> the way too!).
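
For reference, the upgrade under discussion amounts to bumping the test-only dependency; an illustrative sbt form of the change (Spark itself declares the version in its Maven pom):

{code}
// Illustrative: ScalaCheck 1.13.4 contains the fix for the "-oDF" argument warning.
libraryDependencies += "org.scalacheck" %% "scalacheck" % "1.13.4" % "test"
{code}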



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18977) Heavy udf is not stopped by cancelJobGroup

2016-12-22 Thread Vitaly Gerasimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitaly Gerasimov updated SPARK-18977:
-
Summary: Heavy udf is not stopped by cancelJobGroup  (was: Heavy udf in not 
stopped by cancelJobGroup)

> Heavy udf is not stopped by cancelJobGroup
> --
>
> Key: SPARK-18977
> URL: https://issues.apache.org/jira/browse/SPARK-18977
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2
>Reporter: Vitaly Gerasimov
>
> Let's say we have a heavy udf that takes a long time to process. When I try 
> to run a job in a job group that executes this udf and then call cancelJobGroup(), 
> the job still continues processing.
> {code}
> # ./spark-shell
> > import scala.concurrent.Future
> > import scala.concurrent.ExecutionContext.Implicits.global
> > sc.setJobGroup("test-group", "udf-test")
> > sqlContext.udf.register("sleep", (times: Int) => { (1 to 
> > times).toList.foreach{ _ =>  print("sleep..."); Thread.sleep(1) }; 1L })
> > Future { Thread.sleep(5); sc.cancelJobGroup("test-group") }
> > sqlContext.sql("SELECT sleep(10)").collect()
> {code}
> It returns:
> {code}
> sleep...sleep...sleep...sleep...sleep...org.apache.spark.SparkException: Job 
> 0 cancelled part of cancelled job group test-group
> > sleep...sleep...sleep...sleep...sleep...16/12/22 14:36:44 WARN 
> > TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): TaskKilled 
> > (killed intentionally)
> {code}
> This seems unexpected to me, but if I'm missing something and it works as 
> expected, feel free to close the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18980) implement Aggregator with TypedImperativeAggregate

2016-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18980:


Assignee: Wenchen Fan  (was: Apache Spark)

> implement Aggregator with TypedImperativeAggregate
> --
>
> Key: SPARK-18980
> URL: https://issues.apache.org/jira/browse/SPARK-18980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18980) implement Aggregator with TypedImperativeAggregate

2016-12-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18980:


Assignee: Apache Spark  (was: Wenchen Fan)

> implement Aggregator with TypedImperativeAggregate
> --
>
> Key: SPARK-18980
> URL: https://issues.apache.org/jira/browse/SPARK-18980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18980) implement Aggregator with TypedImperativeAggregate

2016-12-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769916#comment-15769916
 ] 

Apache Spark commented on SPARK-18980:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/16383

> implement Aggregator with TypedImperativeAggregate
> --
>
> Key: SPARK-18980
> URL: https://issues.apache.org/jira/browse/SPARK-18980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18980) implement Aggregator with TypedImperativeAggregate

2016-12-22 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-18980:
---

 Summary: implement Aggregator with TypedImperativeAggregate
 Key: SPARK-18980
 URL: https://issues.apache.org/jira/browse/SPARK-18980
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18979) ShutdownHookManager:Exception while deleting Spark temp dir

2016-12-22 Thread zuotingbing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769827#comment-15769827
 ] 

zuotingbing commented on SPARK-18979:
-

my SPARK_LOCAL_DIRS value is set like this:
SPARK_LOCAL_DIRS=/data2/zdh/spark/tmp,/data3/zdh/spark/tmp,/data4/zdh/spark/tmp,/data5/zdh/spark/tmp,/data6/zdh/spark/tmp,/data7/zdh/spark/tmp

From the worker log we find that only 3 of the 6 dirs failed to delete:

【
Search "Deleting directory" (6 hits in 1 files)
  
C:\Users\10159544\Desktop\logs\logs\spark-mr-worker-IDC-ZTECache-Logserver20-B12-U13.log
 (6 hits)
Line 909563: 2016-12-15 20:09:52,629 INFO ShutdownHookManager: Deleting 
directory /data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d
Line 910117: 2016-12-15 20:12:59,930 INFO ShutdownHookManager: Deleting 
directory /data6/zdh/spark/tmp/spark-688964cc-010d-44cd-b398-5f54fcee0b34
Line 910118: 2016-12-15 20:15:21,932 INFO ShutdownHookManager: Deleting 
directory /data3/zdh/spark/tmp/spark-f15e06d0-fedf-4618-8ff5-8c7b10a55cbe
Line 913560: 2016-12-15 20:17:49,990 INFO ShutdownHookManager: Deleting 
directory /data5/zdh/spark/tmp/spark-16647cd9-57b0-4893-a876-5a39ebc4078f
Line 915393: 2016-12-15 20:20:11,767 INFO ShutdownHookManager: Deleting 
directory /data7/zdh/spark/tmp/spark-18e24536-8739-4d1a-8381-c762b540addf
Line 915394: 2016-12-15 20:20:11,767 INFO ShutdownHookManager: Deleting 
directory /data4/zdh/spark/tmp/spark-e7f52941-3e2c-48d7-993c-1df891e24577
Search "Exception while deleting" (3 hits in 1 files)
  
C:\Users\10159544\Desktop\logs\logs\spark-mr-worker-IDC-ZTECache-Logserver20-B12-U13.log
 (3 hits)
Line 910098: 2016-12-15 20:12:59,930 ERROR ShutdownHookManager: 
Exception while deleting Spark temp dir: 
/data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d
Line 913541: 2016-12-15 20:17:49,990 ERROR ShutdownHookManager: 
Exception while deleting Spark temp dir: 
/data3/zdh/spark/tmp/spark-f15e06d0-fedf-4618-8ff5-8c7b10a55cbe
Line 1044408: 2016-12-15 20:22:40,501 ERROR ShutdownHookManager: 
Exception while deleting Spark temp dir: 
/data4/zdh/spark/tmp/spark-e7f52941-3e2c-48d7-993c-1df891e24577
】

> ShutdownHookManager:Exception while deleting Spark temp dir  
> -
>
> Key: SPARK-18979
> URL: https://issues.apache.org/jira/browse/SPARK-18979
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: zuotingbing
>
> when I stop the worker process, the SPARK_LOCAL_DIRS should be deleted 
> recursively, but deletion failed. Exception info:
> 2016-12-15 20:12:59,930 ERROR ShutdownHookManager: Exception while deleting 
> Spark temp dir: 
> /data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d
> java.io.IOException: Failed to delete: 
> /data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
>   at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
>   at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To 

[jira] [Resolved] (SPARK-18979) ShutdownHookManager:Exception while deleting Spark temp dir

2016-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18979.
---
Resolution: Duplicate

> ShutdownHookManager:Exception while deleting Spark temp dir  
> -
>
> Key: SPARK-18979
> URL: https://issues.apache.org/jira/browse/SPARK-18979
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: zuotingbing
>
> when I stop the worker process, the SPARK_LOCAL_DIRS should be deleted 
> recursively, but deletion failed. Exception info:
> 2016-12-15 20:12:59,930 ERROR ShutdownHookManager: Exception while deleting 
> Spark temp dir: 
> /data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d
> java.io.IOException: Failed to delete: 
> /data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
>   at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
>   at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18979) ShutdownHookManager:Exception while deleting Spark temp dir

2016-12-22 Thread zuotingbing (JIRA)
zuotingbing created SPARK-18979:
---

 Summary: ShutdownHookManager:Exception while deleting Spark temp 
dir  
 Key: SPARK-18979
 URL: https://issues.apache.org/jira/browse/SPARK-18979
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2
Reporter: zuotingbing


when I stop the worker process, the SPARK_LOCAL_DIRS should be deleted 
recursively, but deletion failed. Exception info:
2016-12-15 20:12:59,930 ERROR ShutdownHookManager: Exception while deleting 
Spark temp dir: /data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d
java.io.IOException: Failed to delete: 
/data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at 
org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
at 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
at scala.util.Try$.apply(Try.scala:161)
at 
org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
at 
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18977) Heavy udf in not stopped by cancelJobGroup

2016-12-22 Thread Vitaly Gerasimov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769741#comment-15769741
 ] 

Vitaly Gerasimov commented on SPARK-18977:
--

Yeah.. You are right. But how do jobs stop in Spark, by thread interruption?

> Heavy udf in not stopped by cancelJobGroup
> --
>
> Key: SPARK-18977
> URL: https://issues.apache.org/jira/browse/SPARK-18977
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2
>Reporter: Vitaly Gerasimov
>
> Let's say we have a heavy udf that takes a long time to process. When I try 
> to run a job in a job group that executes this udf and then call cancelJobGroup(), 
> the job still continues processing.
> {code}
> # ./spark-shell
> > import scala.concurrent.Future
> > import scala.concurrent.ExecutionContext.Implicits.global
> > sc.setJobGroup("test-group", "udf-test")
> > sqlContext.udf.register("sleep", (times: Int) => { (1 to 
> > times).toList.foreach{ _ =>  print("sleep..."); Thread.sleep(1) }; 1L })
> > Future { Thread.sleep(5); sc.cancelJobGroup("test-group") }
> > sqlContext.sql("SELECT sleep(10)").collect()
> {code}
> It returns:
> {code}
> sleep...sleep...sleep...sleep...sleep...org.apache.spark.SparkException: Job 
> 0 cancelled part of cancelled job group test-group
> > sleep...sleep...sleep...sleep...sleep...16/12/22 14:36:44 WARN 
> > TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): TaskKilled 
> > (killed intentionally)
> {code}
> This seems unexpected to me, but if I'm missing something and it works as 
> expected, feel free to close the issue.
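
On the interruption question: by default cancelJobGroup() only marks the group's running tasks as killed, and a tight loop inside a UDF never reaches a point where Spark checks that flag. A sketch of a UDF that cooperates with cancellation by polling TaskContext (illustrative; exact cancellation behaviour varies by version and by the interruptOnCancel setting):

{code}
import org.apache.spark.TaskContext

// Sketch: poll the task's kill flag inside long-running UDF work so that
// cancelJobGroup() can take effect between iterations.
sqlContext.udf.register("sleep", (times: Int) => {
  (1 to times).foreach { _ =>
    val ctx = TaskContext.get()
    if (ctx != null && ctx.isInterrupted()) throw new InterruptedException("task cancelled")
    Thread.sleep(1)
  }
  1L
})
{code}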



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2016-12-22 Thread zakaria hili (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769718#comment-15769718
 ] 

zakaria hili edited comment on SPARK-18608 at 12/22/16 10:42 AM:
-

[~srowen], now I understand what you mean: your goal is to optimize memory. However, 
I think all we need is an extra check:
if the DataFrame is cached, we call the mllib algorithm directly;
if not, we cache the internal RDD.

But the problem with this solution is that if we do not cache the internal RDD, we 
will get a lot of warnings from the mllib package (RDD is not cached), e.g.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216
because df.rdd.getStorageLevel == StorageLevel.NONE is true.

So maybe we will need to add a new public entry point in mllib that takes 
handlePersistence as a parameter: if it is true, there is no need to 
recheck the input RDD.
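
To illustrate the idea, here is a rough sketch of such an entry point (the helper below is hypothetical and does not exist in mllib; KMeans.run and the storage levels are real):

{code}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: the ml wrapper decides about persistence once and
// passes that decision down, so the input data is never cached twice.
def runKMeans(kmeans: KMeans, data: RDD[Vector], handlePersistence: Boolean): KMeansModel = {
  if (handlePersistence) data.persist(StorageLevel.MEMORY_AND_DISK)
  try {
    kmeans.run(data)   // existing mllib entry point
  } finally {
    if (handlePersistence) data.unpersist()
  }
}
{code}

Note that kmeans.run itself would still log the "input not cached" warning when handlePersistence is false, which is exactly why the flag (or the check) would have to move into mllib.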


was (Author: zahili):
[~srowen], I understand now what you mean, your purpose is to optimize the 
memory,however, I think that all we need is to add an extra check,
if the dataframe is cached, we call the mllib algo directly,
if not, we have to cache the internal rdd

But the problem of this solution is: if we do not cache the internal rdd, we 
will get a lot of warnings from mllib package (RDD is not cached)
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216
because df.rdd.getStorageLevel == StorageLevel.NONE is true.

So maybe we will need to create a new public class in mllib methods which can 
take the handlePersistence as parameter : if it has true there no need to 
recheck the input RDD

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.






[jira] [Comment Edited] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2016-12-22 Thread zakaria hili (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769718#comment-15769718
 ] 

zakaria hili edited comment on SPARK-18608 at 12/22/16 10:41 AM:
-

[~srowen], I understand now what you mean, your purpose is to optimize the 
memory,however, I think that all we need is to add an extra check,
if the dataframe is cached, we call the mllib algo directly,
if not, we have to cache the internal rdd

But the problem of this solution is: if we do not cache the internal rdd, we 
will get a lot of warnings from mllib package (RDD is not cached)
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216
because df.rdd.getStorageLevel == StorageLevel.NONE is true.

So maybe we will need to create a new public class in mllib methods which can 
take the handlePersistence as parameter : if it has true there no need to 
recheck the input RDD


was (Author: zahili):
I understand now what you mean, your purpose is to optimize the memory,however, 
I think that all we need is to add an extra check,
if the dataframe is cached, we call the mllib algo directly,
if not, we have to cache the internal rdd

But the problem of this solution: if we do not cache the internal rdd, we will 
get a lot of warnings from mllib package (RDD is not cached)
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216
because df.rdd.getStorageLevel == StorageLevel.NONE is true.

So maybe we will need to create a new public class in mllib methods which can 
take the handlePersistence as parameter : if it has true there no need to 
recheck the input RDD

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.






[jira] [Comment Edited] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2016-12-22 Thread zakaria hili (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769718#comment-15769718
 ] 

zakaria hili edited comment on SPARK-18608 at 12/22/16 10:39 AM:
-

I understand now what you mean, your purpose is to optimize the memory,however, 
I think that all we need is to add an extra check,
if the dataframe is cached, we call the mllib algo directly,
if not, we have to cache the internal rdd

But the problem of this solution: if we do not cache the internal rdd, we will 
get a lot of warnings from mllib package (RDD is not cached)
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216
because df.rdd.getStorageLevel == StorageLevel.NONE is true.

So maybe we will need to create a new public class in mllib methods which can 
take the handlePersistence as parameter : if it has true there no need to 
recheck the input RDD


was (Author: zahili):
I understand now what you mean, your purpose is to optimize the memory,however, 
I think that all we need is to add an extra check,
if the dataframe is cached, we call the mllib algo directly,
if not, we have to cache the internal rdd

But the problem of this solution: if we do not cache the internal rdd, we will 
get a lot of warnings from mllib package (RDD is not cached)
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216
because df.rdd.getStorageLevel == StorageLevel.NONE is true.

So maybe we will need to create a new public class in mllib methods which can 
take the handlePersistence as parameter : if it has true there no need to 
recheck the input RDD

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.






[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2016-12-22 Thread zakaria hili (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769718#comment-15769718
 ] 

zakaria hili commented on SPARK-18608:
--

I understand now what you mean, your purpose is to optimize the memory,however, 
I think that all we need is to add an extra check,
if the dataframe is cached, we will not cache the rdd, then call the mllib algo,
if not, we have to cache the internal rdd

But the problem of this solution: if we do not cache the internal rdd, we will 
get a lot of warnings from mllib package (RDD is not cached)
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216
because df.rdd.getStorageLevel == StorageLevel.NONE is true.

So maybe we will need to create a new public class in mllib methods which can 
take the handlePersistence as parameter : if it has true there no need to 
recheck the input RDD

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.






[jira] [Commented] (SPARK-18964) HiveContext does not support Time Interval Literals

2016-12-22 Thread Suhas Nalapure (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769691#comment-15769691
 ] 

Suhas Nalapure commented on SPARK-18964:


My understanding is that both features, namely Window functions 
(https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html)
 and Time Interval Literals, are Spark SQL features and hence should ideally 
both be supported by SQLContext. Spark 2.0 does support both there, but in the 
earlier versions Window functions are not supported by SQLContext.
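
As a side note, not a fix for the parser issue itself: on the 1.5/1.6 HiveContext one possible workaround is to avoid the interval literal and use date_add for day-based arithmetic. This is only a sketch, assuming Order_Date and Ship_Date are date/timestamp columns and that hiveContext stands for the HiveContext instance; the table and column names are taken from the example in the description.

{code}
// Same query expressed without an interval literal, so it parses under HiveContext.
// date_add only covers day granularity, unlike INTERVAL, which also supports other units.
val result = hiveContext.sql(
  """select *,
    |       case when date_add(`Order_Date`, 7) > `Ship_Date` then "On Time"
    |            else "Late"
    |       end as Shipment_On_Time
    |from sales""".stripMargin)
{code}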

> HiveContext does not support Time Interval Literals
> ---
>
> Key: SPARK-18964
> URL: https://issues.apache.org/jira/browse/SPARK-18964
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Suhas Nalapure
>
> HiveContext does not recognize the Time Interval Literals mentioned here 
> https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html.
> E.g. the following Spark SQL query runs just fine when a SQLContext is used but 
> fails when a HiveContext is used:
>  
> select *, case when `Order_Date` + INTERVAL 7 DAY > `Ship_Date` then "On 
> Time" else "Late" end as Shipment_On_Time from sales;
> Logs:
> --
> org.apache.spark.sql.AnalysisException: cannot recognize input near 
> 'INTERVAL' '7' 'DAY' in expression specification; line 2 pos 30
>   at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:318)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>   at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
>   at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295)
>   at 
> org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:66)
>   at 
> org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:66)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:279)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:226)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:225)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:268)
>   at org.apache.spark.sql.hive.HiveQLDialect.parse(HiveContext.scala:65)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
>   at 
> org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:113)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>   at 
> 

[jira] [Commented] (SPARK-18878) Fix/investigate the more identified test failures in Java/Scala on Windows

2016-12-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769663#comment-15769663
 ] 

Sean Owen commented on SPARK-18878:
---

I understand why you need to fix some of these things in batches, but I see 
https://issues.apache.org/jira/browse/SPARK-17591 and 
https://issues.apache.org/jira/browse/SPARK-18785 with virtually the same 
title. This has two child JIRAs that are also pretty much identical. This is no 
longer very meaningful. Let's close down all but one JIRA for resource-closing 
problems, and not resolve it until you're pretty sure they're done. Let's also 
close general "umbrella" JIRAs like this. If there is a new type of Windows 
problem, there can be a new JIRA targeted at that new type of fix, and again we 
can make many PRs for one JIRA if needed to fix all of it in batches.

> Fix/investigate the more identified test failures in Java/Scala on Windows
> --
>
> Key: SPARK-18878
> URL: https://issues.apache.org/jira/browse/SPARK-18878
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Hyukjin Kwon
>
> It seems many tests are failing on Windows. Some failures are only related to 
> the tests themselves, whereas others are related to the functionality itself, which 
> causes actual failures for some APIs on Windows.
> The tests were hanging due to some issues in SPARK-17591 and SPARK-18785, and 
> now apparently we can proceed much further (it seems we might 
> reach the end).
> The tests proceeded via AppVeyor - 
> https://ci.appveyor.com/project/spark-test/spark/build/259-spark-test-windows






[jira] [Updated] (SPARK-18976) in standalone mode, executor expired by HeartbeatReceiver still takes up cores but no tasks are assigned to it

2016-12-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18976:
--
Target Version/s:   (was: 1.6.1)
   Fix Version/s: (was: 1.6.1)

[~liujianhui] please read http://spark.apache.org/contributing.html before 
opening a JIRA. Don't set fix/target version. It doesn't even make sense to 
target a version released a year ago.

Can you test this vs 2.1 or master? 
Can you provide any reproduction or more detail?
I'm not sure this is enough info.

> in standalone mode, executor expired by HeartbeatReceiver still takes up 
> cores but no tasks are assigned to it 
> --
>
> Key: SPARK-18976
> URL: https://issues.apache.org/jira/browse/SPARK-18976
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.1
> Environment: jdk1.8.0_77 Red Hat 4.4.7-11
>Reporter: liujianhui
> Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> h2. scene
> When an executor is expired by HeartbeatReceiver in the driver, the driver marks that 
> executor as not alive and the task scheduler will not assign tasks to it, 
> but the executor's status stays "running" and it keeps holding its cores. Here 
> executor 18 was expired and has no tasks running, and its task time is far less than that of the 
> normal executor 142, yet on the application page the executor is still shown as running.
> !screenshot-1.png!
> !screenshot-2.png!
> !screenshot-3.png!
> h2. process
> # executor expired by HeartbeatReceiver because the last heartbeat exceeded the 
> executor timeout
> # executor is removed in CoarseGrainedSchedulerBackend.killExecutors, so 
> the executor is marked as dead; it will not be offered tasks from now on 
> because it is in executorsPendingToRemove
> # status of that executor remains running because the CoarseGrainedExecutorBackend 
> process still exists and it re-registers its block manager with the driver every 
> 10s; log:
> {code}
> 16/12/22 17:04:26 INFO Executor: Told to re-register on heartbeat
> 16/12/22 17:04:26 INFO BlockManager: BlockManager re-registering with master
> 16/12/22 17:04:26 INFO BlockManagerMaster: Trying to register BlockManager
> 16/12/22 17:04:26 INFO BlockManagerMaster: Registered BlockManager
> 16/12/22 17:04:26 INFO BlockManager: Reporting 0 blocks to the master.
> 16/12/22 17:04:36 INFO Executor: Told to re-register on heartbeat
> 16/12/22 17:04:36 INFO BlockManager: BlockManager re-registering with master
> 16/12/22 17:04:36 INFO BlockManagerMaster: Trying to register BlockManager
> 16/12/22 17:04:36 INFO BlockManagerMaster: Registered BlockManager
> 16/12/22 17:04:36 INFO BlockManager: Reporting 0 blocks to the master.
> 16/12/22 17:04:46 INFO Executor: Told to re-register on heartbeat
> 16/12/22 17:04:46 INFO BlockManager: BlockManager re-registering with master
> 16/12/22 17:04:46 INFO BlockManagerMaster: Trying to register BlockManager
> 16/12/22 17:04:46 INFO BlockManagerMaster: Registered BlockManager
> 16/12/22 17:04:46 INFO BlockManager: Reporting 0 blocks to the master.
> 16/12/22 17:04:56 INFO Executor: Told to re-register on heartbeat
> 16/12/22 17:04:56 INFO BlockManager: BlockManager re-registering with master
> 16/12/22 17:04:56 INFO BlockManagerMaster: Trying to register BlockManager
> 16/12/22 17:04:56 INFO BlockManagerMaster: Registered BlockManager
> 16/12/22 17:04:56 INFO BlockManager: Reporting 0 blocks to the master. 
> {code}
> h2. resolve
> when the number of re-registrations exceeds some threshold (e.g. 10), the executor should 
> exit with code zero
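
A rough sketch of that proposal (purely illustrative; the names below are not Spark internals, only the shape of the idea):

{code}
// Illustrative only: count how many times this executor has been told to
// re-register and shut it down once the count passes a threshold, instead of
// re-registering its BlockManager forever while no tasks are scheduled on it.
val maxReregistrations = 10
var reregistrations = 0

def onToldToReregister(): Unit = {
  reregistrations += 1
  if (reregistrations > maxReregistrations) {
    // the driver has treated this executor as dead for a while; stop holding cores
    System.exit(0)
  }
}
{code}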






[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2016-12-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769648#comment-15769648
 ] 

Sean Owen commented on SPARK-18608:
---

I don't think this has to do with Pyspark. The situation is different: input is 
cached, but the intermediate RDD created internally is not, and so is cached 
again.
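
For reference, a minimal sketch of the check the ticket proposes (assuming Spark 2.1+, where SPARK-16063's Dataset.storageLevel is available):

{code}
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.storage.StorageLevel

// Look at the Dataset's own storage level instead of dataset.rdd.getStorageLevel,
// which is always NONE because .rdd produces a new, uncached RDD.
def needsInternalCaching(dataset: Dataset[Row]): Boolean =
  dataset.storageLevel == StorageLevel.NONE

// e.g. with the example from the description:
// val df = spark.range(10).toDF("num").persist()
// needsInternalCaching(df)   // false, so the algorithm would not cache the data again
{code}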

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.





