[jira] [Comment Edited] (SPARK-18187) CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch

2016-11-07 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646800#comment-15646800
 ] 

Genmao Yu edited comment on SPARK-18187 at 11/8/16 7:53 AM:


[~marmbrus] +1 to your point:

+"I think the configuration should only be used when deciding if we should 
perform a new compaction. The identification of a compaction vs a delta should 
be done based on the file itself."+

h4. How to set "compactInterval"?

{{compactInterval}} is set by the user the first time. In case the user later 
changes it, we should detect the change, update our value, and use it. If 
{{isDeletingExpiredLog=false}}, we can recover the original {{compactInterval}} 
by computing the interval between the batch ids carrying the {{.compact}} 
suffix, and then check it against the user setting; if 
{{isDeletingExpiredLog=true}}, we can just use the user setting, because there 
are no expired metadata log files left to compute it from.



was (Author: unclegen):
[~marmbrus] +1 to your 

+think the configuration should only be used when deciding if we should perform 
a new compaction. The identification of a compaction vs a delta should be done 
based on the file itself.+

h4. how to set "compactInterval"?

{{compactInterval}} can be set by user in the first time. In case it is changed 
by user, we should check-update and use it. If {{isDeletingExpiredLog=false}}, 
we can get original {{compactInterval}} by computing the interval of 
{{.compact}} suffix, and then check it against user setting; if 
{{isDeletingExpiredLog=true}}, we can just use user setting , because this no 
expired meta log.


> CompactibleFileStreamLog should not rely on "compactInterval" to detect a 
> compaction batch
> --
>
> Key: SPARK-18187
> URL: https://issues.apache.org/jira/browse/SPARK-18187
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>Priority: Critical
>
> Right now CompactibleFileStreamLog uses compactInterval to check if a batch 
> is a compaction batch. However, since this conf is controlled by the user, 
> they may just change it, and CompactibleFileStreamLog will just silently 
> return the wrong answer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18187) CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch

2016-11-07 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646800#comment-15646800
 ] 

Genmao Yu commented on SPARK-18187:
---

[~marmbrus] +1 to your point:

+I think the configuration should only be used when deciding if we should perform 
a new compaction. The identification of a compaction vs a delta should be done 
based on the file itself.+

h4. How to set "compactInterval"?

{{compactInterval}} is set by the user the first time. In case the user later 
changes it, we should detect the change, update our value, and use it. If 
{{isDeletingExpiredLog=false}}, we can recover the original {{compactInterval}} 
by computing the interval between the batch ids carrying the {{.compact}} suffix, 
and then check it against the user setting; if {{isDeletingExpiredLog=true}}, we 
can just use the user setting, because there are no expired metadata log files 
left to compute it from.
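
For illustration only, a minimal sketch under the above assumptions (hypothetical 
helper names, not the actual {{CompactibleFileStreamLog}} code) of identifying a 
compaction batch from the file itself and recovering the interval actually on disk:

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: a batch is a compaction batch iff its log file carries the ".compact" suffix.
def isCompactionBatch(batchFile: Path): Boolean =
  batchFile.getName.endsWith(".compact")

// Sketch: recover the interval that was actually used by looking at the gap
// between two consecutive ".compact" batch ids already written to the log dir.
def compactIntervalOnDisk(fs: FileSystem, logDir: Path): Option[Int] = {
  val compactBatchIds = fs.listStatus(logDir)
    .map(_.getPath.getName)
    .filter(_.endsWith(".compact"))
    .map(_.stripSuffix(".compact").toLong)
    .sorted
  compactBatchIds.sliding(2).collectFirst { case Array(a, b) => (b - a).toInt }
}
{code}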


> CompactibleFileStreamLog should not rely on "compactInterval" to detect a 
> compaction batch
> --
>
> Key: SPARK-18187
> URL: https://issues.apache.org/jira/browse/SPARK-18187
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>Priority: Critical
>
> Right now CompactibleFileStreamLog uses compactInterval to check if a batch 
> is a compaction batch. However, since this conf is controlled by the user, 
> they may just change it, and CompactibleFileStreamLog will just silently 
> return the wrong answer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18353) spark.rpc.askTimeout default value is not 120s

2016-11-07 Thread Jason Pan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646795#comment-15646795
 ] 

Jason Pan commented on SPARK-18353:
---

In org.apache.spark.deploy.Client there is this line:
conf.set("spark.rpc.askTimeout", "10")

Should we remove it?

When submitting via REST, there is no such line in 
org.apache.spark.deploy.rest.RestSubmissionClient.
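
One possible direction (a sketch only, not a reviewed patch): keep the lower value 
as a fallback but let a user-provided setting win, e.g. via {{SparkConf.setIfMissing}}:

{code}
// Sketch: only apply the hard-coded 10s ask timeout when the user has not set
// spark.rpc.askTimeout, so an explicit user setting is no longer silently
// overwritten. (Removing the line entirely would restore the documented 120s default.)
conf.setIfMissing("spark.rpc.askTimeout", "10")
{code}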

> spark.rpc.askTimeout default value is not 120s
> --
>
> Key: SPARK-18353
> URL: https://issues.apache.org/jira/browse/SPARK-18353
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
> Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 
> 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jason Pan
>Priority: Critical
>
> In http://spark.apache.org/docs/latest/configuration.html, 
> spark.rpc.askTimeout is listed with a default of 120s ("Duration for an RPC 
> ask operation to wait before timing out"), i.e. the default value is 120s as 
> documented.
> However, when I run "spark-submit", the command is:
> Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" 
> "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" 
> "-Xmx1024M" "-Dspark.eventLog.enabled=true" 
> "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" 
> "-Dspark.app.name=org.apache.spark.examples.SparkPi" 
> "-Dspark.submit.deployMode=cluster" 
> "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" 
> "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" 
> "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" 
> "org.apache.spark.deploy.worker.DriverWrapper" 
> "spark://Worker@9.111.159.127:7103" 
> "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "org.apache.spark.examples.SparkPi" "1000"
> -Dspark.rpc.askTimeout=10
> The value is 10, which does not match the documentation.
> Note: when I submit via the REST URL, this issue does not occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18353) spark.rpc.askTimeout default value is not 120s

2016-11-07 Thread Jason Pan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Pan updated SPARK-18353:
--
Description: 
In http://spark.apache.org/docs/latest/configuration.html, 
spark.rpc.askTimeout is listed with a default of 120s ("Duration for an RPC 
ask operation to wait before timing out"), i.e. the default value is 120s as 
documented.

However, when I run "spark-submit", the command is:
Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" 
"/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" 
"-Xmx1024M" "-Dspark.eventLog.enabled=true" 
"-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" 
"-Dspark.app.name=org.apache.spark.examples.SparkPi" 
"-Dspark.submit.deployMode=cluster" 
"-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
 "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" 
"-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" 
"-Dspark.eventLog.dir=file:///opt/tmp/spark-event" 
"org.apache.spark.deploy.worker.DriverWrapper" 
"spark://Worker@9.111.159.127:7103" 
"/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar"
 "org.apache.spark.examples.SparkPi" "1000"

-Dspark.rpc.askTimeout=10

The value is 10, which does not match the documentation.

Note: when I submit via the REST URL, this issue does not occur.



  was:
for the doc:



> spark.rpc.askTimeout default value is not 120s
> --
>
> Key: SPARK-18353
> URL: https://issues.apache.org/jira/browse/SPARK-18353
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
> Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 
> 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jason Pan
>Priority: Critical
>
> In http://spark.apache.org/docs/latest/configuration.html, 
> spark.rpc.askTimeout is listed with a default of 120s ("Duration for an RPC 
> ask operation to wait before timing out"), i.e. the default value is 120s as 
> documented.
> However, when I run "spark-submit", the command is:
> Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" 
> "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" 
> "-Xmx1024M" "-Dspark.eventLog.enabled=true" 
> "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" 
> "-Dspark.app.name=org.apache.spark.examples.SparkPi" 
> "-Dspark.submit.deployMode=cluster" 
> "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" 
> "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" 
> "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" 
> "org.apache.spark.deploy.worker.DriverWrapper" 
> "spark://Worker@9.111.159.127:7103" 
> "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "org.apache.spark.examples.SparkPi" "1000"
> -Dspark.rpc.askTimeout=10
> The value is 10, which does not match the documentation.
> Note: when I submit via the REST URL, this issue does not occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18353) spark.rpc.askTimeout default value is not 120s

2016-11-07 Thread Jason Pan (JIRA)
Jason Pan created SPARK-18353:
-

 Summary: spark.rpc.askTimeout default value is not 120s
 Key: SPARK-18353
 URL: https://issues.apache.org/jira/browse/SPARK-18353
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1, 1.6.1
 Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 
17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
Reporter: Jason Pan
Priority: Critical


for the doc:




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16496) Add wholetext as option for reading text in SQL.

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16496:

Target Version/s: 2.2.0

> Add wholetext as option for reading text in SQL.
> 
>
> Key: SPARK-16496
> URL: https://issues.apache.org/jira/browse/SPARK-16496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prashant Sharma
>
> In many text analysis problems, it is often not desirable for the rows to 
> be split by "\n". A whole-text reader ({{wholeTextFiles}}) already exists for 
> the RDD API, and this JIRA just adds the same support to the Dataset API.
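
A hedged sketch of what the proposed usage might look like (the {{wholetext}} 
option name is taken from this ticket and is not claimed to be an existing API):

{code}
// Hypothetical usage sketch: read each file as a single row instead of one row per line.
val docs = spark.read
  .option("wholetext", "true")   // proposed option from this ticket; an assumption, not an existing API
  .text("/data/corpus/")
docs.show(truncate = false)
{code}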



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17969) I think it's user unfriendly to process standard json file with DataFrame

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reopened SPARK-17969:
-

> I think it's user unfriendly to process standard json file with DataFrame 
> --
>
> Key: SPARK-17969
> URL: https://issues.apache.org/jira/browse/SPARK-17969
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jianfei Wang
>Priority: Minor
>
> Currently, with the DataFrame API, we can't load a standard (multi-line) JSON 
> file directly. Maybe we can provide an overloaded method to handle this; the 
> logic is as below:
> ```
> val df = spark.sparkContext.wholeTextFiles("data/test.json") 
>  val json_rdd = df.map( x => x.toString.replaceAll("\\s+","")).map{ x => 
>   val index = x.indexOf(',') 
>   x.substring(index + 1, x.length - 1) 
> } 
> val json_df = spark.read.json(json_rdd) 
> ```
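
For reference, a slightly cleaner sketch of the same idea (a sketch only; 
{{wholeTextFiles}} already yields (path, content) pairs, so the tuple-to-string 
trick can be avoided):

{code}
// Sketch: treat each whole file as a single JSON document.
val files = spark.sparkContext.wholeTextFiles("data/test.json")
val jsonRdd = files.map { case (_, content) => content }  // keep only the file content
val jsonDf = spark.read.json(jsonRdd)  // each RDD element is parsed as one JSON record
jsonDf.printSchema()
{code}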



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10840) SparkSQL doesn't work well with JSON

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-10840.
---
Resolution: Duplicate

> SparkSQL doesn't work well with JSON
> 
>
> Key: SPARK-10840
> URL: https://issues.apache.org/jira/browse/SPARK-10840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jordan Sarraf
>Priority: Minor
>  Labels: JSON, Scala, SparkSQL
>
> Well formed JSON doesn't work with the 1.5.1 version while using 
> sqlContext.read.json(""):
> {
>   "employees": {
> "employee": [
>   {
> "name": "Mia",
> "surname": "Radison",
> "mobile": "7295913821",
> "email": "miaradi...@sparky.com"
>   },
>   {
> "name": "Thor",
> "surname": "Kovaskz",
> "mobile": "8829177193",
> "email": "tkova...@sparky.com"
>   },
>   {
> "name": "Bindy",
> "surname": "Kvuls",
> "mobile": "5033828845",
> "email": "bind...@sparky.com"
>   }
> ]
>   }
> }
> For the above, the following error is obtained:
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2)
> scala.MatchError: (VALUE_STRING,StructType()) (of class scala.Tuple2)
> Whereas this works fine because all components are on the same line:
> [
>   {"name": "Mia","surname": "Radison","mobile": "7295913821","email": 
> "miaradi...@sparky.com"},
>   {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": 
> "tkova...@sparky.com"},
>   {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": 
> "bind...@sparky.com"}
> ]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17969) I think it's user unfriendly to process standard json file with DataFrame

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-17969.
---
Resolution: Duplicate

> I think it's user unfriendly to process standard json file with DataFrame 
> --
>
> Key: SPARK-17969
> URL: https://issues.apache.org/jira/browse/SPARK-17969
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jianfei Wang
>Priority: Minor
>
> Currently, with the DataFrame API, we can't load a standard (multi-line) JSON 
> file directly. Maybe we can provide an overloaded method to handle this; the 
> logic is as below:
> ```
> val df = spark.sparkContext.wholeTextFiles("data/test.json") 
>  val json_rdd = df.map( x => x.toString.replaceAll("\\s+","")).map{ x => 
>   val index = x.indexOf(',') 
>   x.substring(index + 1, x.length - 1) 
> } 
> val json_df = spark.read.json(json_rdd) 
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-7366) Support multi-line JSON objects

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reopened SPARK-7366:


> Support multi-line JSON objects
> ---
>
> Key: SPARK-7366
> URL: https://issues.apache.org/jira/browse/SPARK-7366
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Joe Halliwell
>Priority: Minor
>
> h2. Background: why the existing formats aren't enough
> The present object-per-line format for ingesting JSON data has a couple of 
> deficiencies:
> 1. It's not itself JSON
> 2. It's often harder for humans to read
> The object-per-file format addresses these, but at a cost of producing many 
> files which can be unwieldy.
> Since it is feasible to read and write large JSON files via streaming (and 
> many systems do) it seems reasonable to support them directly as an input 
> format.
> h2. Suggested approach: use a depth hint
> The key challenge is to find record boundaries without parsing the file from 
> the start i.e. given an offset, locate a nearby boundary. In the general case 
> this is impossible: you can't be sure you've identified the start of a 
> top-level record without tracing back to the start of the file.
> However, if we know something more of the structure of the file i.e. maximum 
> object depth it seems plausible that we can do better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18352:

Summary: Parse normal, multi-line JSON files (not just JSON Lines)  (was: 
Parse normal JSON files (not just JSON Lines))

> Parse normal, multi-line JSON files (not just JSON Lines)
> -
>
> Key: SPARK-18352
> URL: https://issues.apache.org/jira/browse/SPARK-18352
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Spark currently can only parse JSON files that are JSON Lines, i.e. each 
> record occupies an entire line and records are separated by newlines. In 
> reality, a lot of users want to use Spark to parse actual JSON files, and are 
> surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the 
> files, and instead stream through them to parse the JSON.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7366) Support multi-line JSON objects

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-7366.
--
Resolution: Duplicate

> Support multi-line JSON objects
> ---
>
> Key: SPARK-7366
> URL: https://issues.apache.org/jira/browse/SPARK-7366
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Joe Halliwell
>Priority: Minor
>
> h2. Background: why the existing formats aren't enough
> The present object-per-line format for ingesting JSON data has a couple of 
> deficiencies:
> 1. It's not itself JSON
> 2. It's often harder for humans to read
> The object-per-file format addresses these, but at a cost of producing many 
> files which can be unwieldy.
> Since it is feasible to read and write large JSON files via streaming (and 
> many systems do) it seems reasonable to support them directly as an input 
> format.
> h2. Suggested approach: use a depth hint
> The key challenge is to find record boundaries without parsing the file from 
> the start i.e. given an offset, locate a nearby boundary. In the general case 
> this is impossible: you can't be sure you've identified the start of a 
> top-level record without tracing back to the start of the file.
> However, if we know something more of the structure of the file i.e. maximum 
> object depth it seems plausible that we can do better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7366) Support multi-line JSON objects

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-7366.
--
Resolution: Fixed

I'm closing this in favor of https://issues.apache.org/jira/browse/SPARK-18352

In reality, it's unlikely that each file is so enormous that we must split it. 
If we don't do file splits, then record-boundary detection is not really an 
issue, and a single Spark task can stream through an entire file to do the 
parsing.


> Support multi-line JSON objects
> ---
>
> Key: SPARK-7366
> URL: https://issues.apache.org/jira/browse/SPARK-7366
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Joe Halliwell
>Priority: Minor
>
> h2. Background: why the existing formats aren't enough
> The present object-per-line format for ingesting JSON data has a couple of 
> deficiencies:
> 1. It's not itself JSON
> 2. It's often harder for humans to read
> The object-per-file format addresses these, but at a cost of producing many 
> files which can be unwieldy.
> Since it is feasible to read and write large JSON files via streaming (and 
> many systems do) it seems reasonable to support them directly as an input 
> format.
> h2. Suggested approach: use a depth hint
> The key challenge is to find record boundaries without parsing the file from 
> the start i.e. given an offset, locate a nearby boundary. In the general case 
> this is impossible: you can't be sure you've identified the start of a 
> top-level record without tracing back to the start of the file.
> However, if we know something more of the structure of the file i.e. maximum 
> object depth it seems plausible that we can do better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18352) Parse normal JSON files (not just JSON Lines)

2016-11-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18352:
---

 Summary: Parse normal JSON files (not just JSON Lines)
 Key: SPARK-18352
 URL: https://issues.apache.org/jira/browse/SPARK-18352
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin


Spark currently can only parse JSON files that are JSON Lines, i.e. each record 
occupies an entire line and records are separated by newlines. In reality, a lot 
of users want to use Spark to parse actual JSON files, and are surprised to learn 
that it doesn't do that.

We can introduce a new mode (wholeJsonFile?) in which we don't split the files, 
and instead stream through them to parse the JSON.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18351) from_json and to_json for parsing JSON for string columns

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18351.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> from_json and to_json for parsing JSON for string columns
> -
>
> Key: SPARK-18351
> URL: https://issues.apache.org/jira/browse/SPARK-18351
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18295) Match up to_json to from_json in null safety

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18295:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-18351

> Match up to_json to from_json in null safety
> 
>
> Key: SPARK-18295
> URL: https://issues.apache.org/jira/browse/SPARK-18295
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> {code}
> scala> val df = Seq(Some(Tuple1(Tuple1(1))), None).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: struct<_1: int>]
> scala> df.show()
> ++
> |   a|
> ++
> | [1]|
> |null|
> ++
> scala> df.select(to_json($"a")).show()
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeFields(JacksonGenerator.scala:138)
>   at 
> org.apache.spark.sql.catalyst.json.JacksonGenerator$$anonfun$write$1.apply$mcV$sp(JacksonGenerator.scala:194)
>   at 
> org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeObject(JacksonGenerator.scala:131)
>   at 
> org.apache.spark.sql.catalyst.json.JacksonGenerator.write(JacksonGenerator.scala:193)
>   at 
> org.apache.spark.sql.catalyst.expressions.StructToJson.eval(jsonExpressions.scala:544)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> {code}
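
Until {{to_json}} is null-safe, a hedged workaround sketch (reusing the {{df}} 
from the repro above) is to guard the struct column explicitly:

{code}
import org.apache.spark.sql.functions.{col, to_json, when}

// Sketch: only serialize rows whose struct is non-null; null rows stay null
// instead of triggering the NullPointerException in JacksonGenerator.
df.select(when(col("a").isNotNull, to_json(col("a"))).alias("a_json")).show()
{code}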



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18295) Match up to_json to from_json in null safety

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18295:

Assignee: Hyukjin Kwon

> Match up to_json to from_json in null safety
> 
>
> Key: SPARK-18295
> URL: https://issues.apache.org/jira/browse/SPARK-18295
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> {code}
> scala> val df = Seq(Some(Tuple1(Tuple1(1))), None).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: struct<_1: int>]
> scala> df.show()
> ++
> |   a|
> ++
> | [1]|
> |null|
> ++
> scala> df.select(to_json($"a")).show()
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeFields(JacksonGenerator.scala:138)
>   at 
> org.apache.spark.sql.catalyst.json.JacksonGenerator$$anonfun$write$1.apply$mcV$sp(JacksonGenerator.scala:194)
>   at 
> org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeObject(JacksonGenerator.scala:131)
>   at 
> org.apache.spark.sql.catalyst.json.JacksonGenerator.write(JacksonGenerator.scala:193)
>   at 
> org.apache.spark.sql.catalyst.expressions.StructToJson.eval(jsonExpressions.scala:544)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17764) to_json function for parsing Structs to json Strings

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17764:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-18351

> to_json function for parsing Structs to json Strings
> 
>
> Key: SPARK-17764
> URL: https://issues.apache.org/jira/browse/SPARK-17764
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> After SPARK-17699, Spark now supports {{from_json}}. It might be nice if we 
> had {{to_json}} too, in particular for writing out DataFrames with data 
> sources that do not support nested structured types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18351) from_json and to_json for parsing JSON for string columns

2016-11-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18351:
---

 Summary: from_json and to_json for parsing JSON for string columns
 Key: SPARK-18351
 URL: https://issues.apache.org/jira/browse/SPARK-18351
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18260:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-18351

> from_json can throw a better exception when it can't find the column or be 
> nullSafe
> ---
>
> Key: SPARK-18260
> URL: https://issues.apache.org/jira/browse/SPARK-18260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Blocker
> Fix For: 2.1.0
>
>
> I got this exception:
> {code}
> SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 
> failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID 
> 74170, 10.0.138.84, executor 2): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
>   at 
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211)
>   at 
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
> {code}
> This was because the column that I called `from_json` on didn't exist for all 
> of my rows. Either from_json should be null-safe, or it should fail with a 
> better error message.
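
A hedged workaround sketch in the meantime (hypothetical column and schema names): 
drop or guard the rows where the JSON column is missing before calling {{from_json}}:

{code}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Sketch: filter out null inputs so from_json never evaluates on a missing column value.
val schema = new StructType().add("id", LongType).add("name", StringType)
val parsed = df
  .filter(col("payload").isNotNull)
  .select(from_json(col("payload"), schema).alias("parsed"))
{code}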



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17699) from_json function for parsing json Strings into Structs

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17699:

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-18351

> from_json function for parsing json Strings into Structs
> 
>
> Key: SPARK-17699
> URL: https://issues.apache.org/jira/browse/SPARK-17699
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 2.1.0
>
>
> Today, we have good support for reading standalone JSON data.  However, 
> sometimes (especially when reading from streaming sources such as Kafka) the 
> JSON is embedded in an envelope that has other information we'd like to 
> preserve.  It would be nice if we could also parse JSON string columns, while 
> preserving the original JSON schema. 
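
For illustration, a hedged sketch of the intended usage (hypothetical {{kafkaDf}} 
and field names, not taken from the ticket):

{code}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Sketch: parse a JSON string column embedded in a Kafka-style envelope,
// keeping the envelope fields alongside the parsed struct.
val payloadSchema = new StructType().add("user", StringType).add("event", StringType)
val parsed = kafkaDf.select(
  col("timestamp"),
  from_json(col("value").cast("string"), payloadSchema).alias("payload")
)
{code}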



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18350) Support session local timezone

2016-11-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646704#comment-15646704
 ] 

Reynold Xin commented on SPARK-18350:
-

I'm guessing the easiest way to do this is to change all the expressions that 
can be impacted by timezones to add an explicit timezone argument, and the 
analyzer automatically places the timezone argument in those expressions.

cc [~hvanhovell] [~cloud_fan] [~smilegator] [~vssrinath] for input.
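
For context, a small standalone sketch of why the timezone has to be an explicit 
input to such expressions (plain java.time, not Spark internals):

{code}
import java.time.{Instant, ZoneId}

// The same instant renders as different local datetimes depending on the
// timezone applied, so a session-level setting (rather than the JVM default)
// has to reach every timezone-sensitive expression.
val instant = Instant.parse("2016-11-08T07:53:00Z")
def render(zone: String): String =
  instant.atZone(ZoneId.of(zone)).toLocalDateTime.toString

render("UTC")                  // 2016-11-08T07:53
render("America/Los_Angeles")  // 2016-11-07T23:53
{code}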


> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18350) Support session local timezone

2016-11-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18350:
---

 Summary: Support session local timezone
 Key: SPARK-18350
 URL: https://issues.apache.org/jira/browse/SPARK-18350
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin


As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
manipulation, which is bad if users are not in the same timezones as the 
machines, or if different users have different timezones.

We should introduce a session local timezone setting that is used for execution.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16545) Structured Streaming : foreachSink creates the Physical Plan multiple times per TriggerInterval

2016-11-07 Thread Mario Briggs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646665#comment-15646665
 ] 

Mario Briggs commented on SPARK-16545:
--

[~lwlin] I agree with the PR discussion. I am not entirely sure what the value 
of the 'Resolution' field should be when closing... e.g. 'Later', to indicate 
this is being fixed elsewhere, etc.

> Structured Streaming : foreachSink creates the Physical Plan multiple times 
> per TriggerInterval 
> 
>
> Key: SPARK-16545
> URL: https://issues.apache.org/jira/browse/SPARK-16545
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: Mario Briggs
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18349) Update R API documentation on ml model summary

2016-11-07 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-18349:


 Summary: Update R API documentation on ml model summary
 Key: SPARK-18349
 URL: https://issues.apache.org/jira/browse/SPARK-18349
 Project: Spark
  Issue Type: Bug
  Components: ML, SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung


It has been discovered that there is a fair bit of inconsistency in the 
documentation of summary functions, e.g.

{code}
#' @return \code{summary} returns a summary object of the fitted model, a list 
of components
#' including formula, number of features, list of features, feature 
importances, number of
#' trees, and tree weights
setMethod("summary", signature(object = "GBTRegressionModel")
{code}

For instance, what should be listed for the return value? Should it be a name 
or a phrase, or should it be a list of items? And should there be a longer 
description of what they mean, or a reference link to the Scala doc?

We will need to review this for all model summary implementations in mllib.R



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18348) Improve tree ensemble model summary

2016-11-07 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-18348:


 Summary: Improve tree ensemble model summary
 Key: SPARK-18348
 URL: https://issues.apache.org/jira/browse/SPARK-18348
 Project: Spark
  Issue Type: Bug
  Components: ML, SparkR
Affects Versions: 2.0.0, 2.1.0
Reporter: Felix Cheung


During work on the R APIs for tree ensemble models (e.g. Random Forest, GBT) it 
was discovered and discussed that

- we don't have a good per-node or per-tree summary covering observations, loss, 
probability and so on

- we don't have a shared API with nicely formatted output

We believe this could be a shared API that benefits multiple language bindings, 
including R, when available.

For example, here is what R's {{rpart}} shows for a model summary:
{code}
Call:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
method = "class")
  n= 81

          CP nsplit rel error    xerror      xstd
1 0.17647059  0 1.000 1.000 0.2155872
2 0.01960784  1 0.8235294 0.9411765 0.2107780
3 0.0100  4 0.7647059 1.0588235 0.2200975

Variable importance
 Start    Age Number
    64     24     12

Node number 1: 81 observations,complexity param=0.1764706
  predicted class=absent   expected loss=0.2098765  P(node) =1
    class counts:    64    17
   probabilities: 0.790 0.210
  left son=2 (62 obs) right son=3 (19 obs)
  Primary splits:
  Start  < 8.5  to the right, improve=6.762330, (0 missing)
  Number < 5.5  to the left,  improve=2.866795, (0 missing)
  Age< 39.5 to the left,  improve=2.250212, (0 missing)
  Surrogate splits:
  Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)

Node number 2: 62 observations,complexity param=0.01960784
  predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
class counts:56 6
   probabilities: 0.903 0.097
  left son=4 (29 obs) right son=5 (33 obs)
  Primary splits:
  Start  < 14.5 to the right, improve=1.0205280, (0 missing)
  Age< 55   to the left,  improve=0.6848635, (0 missing)
  Number < 4.5  to the left,  improve=0.2975332, (0 missing)
  Surrogate splits:
  Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
  Age< 16   to the left,  agree=0.597, adj=0.138, (0 split)

Node number 3: 19 observations
  predicted class=present  expected loss=0.4210526  P(node) =0.2345679
    class counts:     8    11
   probabilities: 0.421 0.579

Node number 4: 29 observations
  predicted class=absent   expected loss=0  P(node) =0.3580247
class counts:29 0
   probabilities: 1.000 0.000

Node number 5: 33 observations,complexity param=0.01960784
  predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
class counts:27 6
   probabilities: 0.818 0.182
  left son=10 (12 obs) right son=11 (21 obs)
  Primary splits:
  Age< 55   to the left,  improve=1.2467530, (0 missing)
  Start  < 12.5 to the right, improve=0.2887701, (0 missing)
  Number < 3.5  to the right, improve=0.1753247, (0 missing)
  Surrogate splits:
  Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
  Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)

Node number 10: 12 observations
  predicted class=absent   expected loss=0  P(node) =0.1481481
class counts:12 0
   probabilities: 1.000 0.000

Node number 11: 21 observations,complexity param=0.01960784
  predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
class counts:15 6
   probabilities: 0.714 0.286
  left son=22 (14 obs) right son=23 (7 obs)
  Primary splits:
  Age< 111  to the right, improve=1.71428600, (0 missing)
  Start  < 12.5 to the right, improve=0.79365080, (0 missing)
  Number < 3.5  to the right, improve=0.07142857, (0 missing)

Node number 22: 14 observations
  predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
class counts:12 2
   probabilities: 0.857 0.143

Node number 23: 7 observations
  predicted class=present  expected loss=0.4285714  P(node) =0.08641975
class counts: 3 4
   probabilities: 0.429 0.571
{code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18347) Infra for R - need qpdf on Jenkins

2016-11-07 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18347:
-
Description: 
As a part of working on building the R package 
(https://issues.apache.org/jira/browse/SPARK-18264) we discovered that building 
the package and vignettes requires a tool called qpdf (for compressing PDFs).

In R, it looks for qpdf as follows:
Sys.which(Sys.getenv("R_QPDF", "qpdf"))

i.e. {{which qpdf}}, or whatever the exported R_QPDF environment variable points to.

Otherwise it raises a warning such as:

* checking for unstated dependencies in examples ... OK
 WARNING
‘qpdf’ is needed for checks on size reduction of PDFs

cc 
[~shaneknapp]

  was:
As a part of working on building R package 
(https://issues.apache.org/jira/browse/SPARK-18264) we discover that building 
the package and vignettes require a tool call qpdf (for compressing PDFs)

In R, it is looking for qpdf as such:
Sys.which(Sys.getenv("R_QPDF", "qpdf"))

ie. which qpdf or whatever the export R_QPDF is pointing to.

cc @shaneknapp



> Infra for R - need qpdf on Jenkins
> --
>
> Key: SPARK-18347
> URL: https://issues.apache.org/jira/browse/SPARK-18347
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> As a part of working on building the R package 
> (https://issues.apache.org/jira/browse/SPARK-18264) we discovered that building 
> the package and vignettes requires a tool called qpdf (for compressing PDFs).
> In R, it looks for qpdf as follows:
> Sys.which(Sys.getenv("R_QPDF", "qpdf"))
> i.e. {{which qpdf}}, or whatever the exported R_QPDF environment variable points to.
> Otherwise it raises a warning such as:
> * checking for unstated dependencies in examples ... OK
>  WARNING
> ‘qpdf’ is needed for checks on size reduction of PDFs
> cc 
> [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18347) Infra for R - need qpdf on Jenkins

2016-11-07 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18347:
-
Description: 
As a part of working on building the R package 
(https://issues.apache.org/jira/browse/SPARK-18264) we discovered that building 
the package and vignettes requires a tool called qpdf (for compressing PDFs).

In R, it looks for qpdf as follows:
{code}Sys.which(Sys.getenv("R_QPDF", "qpdf")){code}

i.e. {{which qpdf}}, or whatever the exported R_QPDF environment variable points to.

Otherwise it raises a warning such as:

{code}
* checking for unstated dependencies in examples ... OK
 WARNING
‘qpdf’ is needed for checks on size reduction of PDFs
{code}

cc 
[~shaneknapp]

  was:
As a part of working on building R package 
(https://issues.apache.org/jira/browse/SPARK-18264) we discover that building 
the package and vignettes require a tool call qpdf (for compressing PDFs)

In R, it is looking for qpdf as such:
Sys.which(Sys.getenv("R_QPDF", "qpdf"))

ie. which qpdf or whatever the export R_QPDF is pointing to.

Otherwise it raises a warning as such:

* checking for unstated dependencies in examples ... OK
 WARNING
‘qpdf’ is needed for checks on size reduction of PDFs

cc 
[~shaneknapp]


> Infra for R - need qpdf on Jenkins
> --
>
> Key: SPARK-18347
> URL: https://issues.apache.org/jira/browse/SPARK-18347
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> As a part of working on building the R package 
> (https://issues.apache.org/jira/browse/SPARK-18264) we discovered that building 
> the package and vignettes requires a tool called qpdf (for compressing PDFs).
> In R, it looks for qpdf as follows:
> {code}Sys.which(Sys.getenv("R_QPDF", "qpdf")){code}
> i.e. {{which qpdf}}, or whatever the exported R_QPDF environment variable points to.
> Otherwise it raises a warning such as:
> {code}
> * checking for unstated dependencies in examples ... OK
>  WARNING
> ‘qpdf’ is needed for checks on size reduction of PDFs
> {code}
> cc 
> [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18347) Infra for R - need qpdf on Jenkins

2016-11-07 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-18347:


 Summary: Infra for R - need qpdf on Jenkins
 Key: SPARK-18347
 URL: https://issues.apache.org/jira/browse/SPARK-18347
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung


As a part of working on building the R package 
(https://issues.apache.org/jira/browse/SPARK-18264) we discovered that building 
the package and vignettes requires a tool called qpdf (for compressing PDFs).

In R, it looks for qpdf as follows:
Sys.which(Sys.getenv("R_QPDF", "qpdf"))

i.e. {{which qpdf}}, or whatever the exported R_QPDF environment variable points to.

cc @shaneknapp




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18264) Build and package R vignettes

2016-11-07 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18264:
-
Summary: Build and package R vignettes  (was: Package R vignettes)

> Build and package R vignettes
> -
>
> Key: SPARK-18264
> URL: https://issues.apache.org/jira/browse/SPARK-18264
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Writing-package-vignettes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18332) SparkR 2.1 QA: Programming guide update and migration guide

2016-11-07 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646598#comment-15646598
 ] 

Felix Cheung commented on SPARK-18332:
--

The R vignettes are an R-specific document, separate from the Spark programming 
guide.


> SparkR 2.1 QA: Programming guide update and migration guide
> ---
>
> Key: SPARK-18332
> URL: https://issues.apache.org/jira/browse/SPARK-18332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide.  Updates 
> will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> Note: New features are handled in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18332) SparkR 2.1 QA: Programming guide update and migration guide

2016-11-07 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646598#comment-15646598
 ] 

Felix Cheung edited comment on SPARK-18332 at 11/8/16 5:46 AM:
---

The R vignettes are an R-specific document, separate from the Spark programming 
guide.
Perhaps it should be included in this task for future release QA clones.


was (Author: felixcheung):
The R vignettes is a R-specific thing that is also a separate document from the 
Spark programming guide.


> SparkR 2.1 QA: Programming guide update and migration guide
> ---
>
> Key: SPARK-18332
> URL: https://issues.apache.org/jira/browse/SPARK-18332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide.  Updates 
> will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> Note: New features are handled in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18345) Structured Streaming quick examples fails with default configuration

2016-11-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18345:


Assignee: Apache Spark

> Structured Streaming quick examples fails with default configuration
> 
>
> Key: SPARK-18345
> URL: https://issues.apache.org/jira/browse/SPARK-18345
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Tsuyoshi Ozawa
>Assignee: Apache Spark
>
> StructuredNetworkWordCount fails because it requires an HDFS 
> configuration. It should use the local filesystem instead of HDFS by 
> default. 
> {quote}
> Exception in thread "main" java.net.ConnectException: Call From 
> ozamac-2.local/192.168.33.1 to localhost:9000 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:225)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:260)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount$.main(StructuredNetworkWordCount.scala:71)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount.main(StructuredNetworkWordCount.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}
> .
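
Until the example defaults to the local filesystem, a hedged workaround sketch 
(assuming the {{wordCounts}} Dataset from the example) is to point the checkpoint 
location at the local filesystem explicitly:

{code}
// Sketch: an explicit file:// checkpoint location avoids resolving paths
// against a default filesystem of hdfs://localhost:9000.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "file:///tmp/structured-wordcount-checkpoint")
  .start()

query.awaitTermination()
{code}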



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18345) Structured Streaming quick examples fails with default configuration

2016-11-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646594#comment-15646594
 ] 

Apache Spark commented on SPARK-18345:
--

User 'oza' has created a pull request for this issue:
https://github.com/apache/spark/pull/15806

> Structured Streaming quick examples fails with default configuration
> 
>
> Key: SPARK-18345
> URL: https://issues.apache.org/jira/browse/SPARK-18345
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Tsuyoshi Ozawa
>
> StructuredNetworkWordCount fails because it requires an HDFS 
> configuration. It should use the local filesystem instead of HDFS by 
> default. 
> {quote}
> Exception in thread "main" java.net.ConnectException: Call From 
> ozamac-2.local/192.168.33.1 to localhost:9000 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:225)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:260)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount$.main(StructuredNetworkWordCount.scala:71)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount.main(StructuredNetworkWordCount.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}
> .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18345) Structured Streaming quick examples fails with default configuration

2016-11-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18345:


Assignee: (was: Apache Spark)

> Structured Streaming quick examples fails with default configuration
> 
>
> Key: SPARK-18345
> URL: https://issues.apache.org/jira/browse/SPARK-18345
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Tsuyoshi Ozawa
>
> StructuredNetworkWordCount results in failure because it needs HDFS 
> configuration. It should use local filesystem instead of using HDFS by 
> default. 
> {quote}
> Exception in thread "main" java.net.ConnectException: Call From 
> ozamac-2.local/192.168.33.1 to localhost:9000 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:225)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:260)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount$.main(StructuredNetworkWordCount.scala:71)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount.main(StructuredNetworkWordCount.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}
> .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18332) SparkR 2.1 QA: Programming guide update and migration guide

2016-11-07 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18332:
-
Comment: was deleted

(was: link
https://issues.apache.org/jira/browse/SPARK-18279
https://issues.apache.org/jira/browse/SPARK-18266
)

> SparkR 2.1 QA: Programming guide update and migration guide
> ---
>
> Key: SPARK-18332
> URL: https://issues.apache.org/jira/browse/SPARK-18332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide.  Updates 
> will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> Note: New features are handled in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18332) SparkR 2.1 QA: Programming guide update and migration guide

2016-11-07 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646592#comment-15646592
 ] 

Felix Cheung commented on SPARK-18332:
--

Linking related issues:
https://issues.apache.org/jira/browse/SPARK-18279
https://issues.apache.org/jira/browse/SPARK-18266


> SparkR 2.1 QA: Programming guide update and migration guide
> ---
>
> Key: SPARK-18332
> URL: https://issues.apache.org/jira/browse/SPARK-18332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide.  Updates 
> will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> Note: New features are handled in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16892) flatten function to get flat array (or map) column from array of array (or array of map) column

2016-11-07 Thread Kapil Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646584#comment-15646584
 ] 

Kapil Singh commented on SPARK-16892:
-

It's not for flattening Rows; it's for flattening columns. The columns 
themselves can be of array-of-array or array-of-map type. How would you 
flatten those to obtain columns of plain array and map type, respectively? 
Also, this is meant as a DataFrame expression/function.
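
For reference, a UDF-based workaround in spark-shell (a sketch only; 
{{flattenArrays}} is not an existing Spark function, just an illustration of 
the requested behaviour, and the column names are made up):

{code}
import org.apache.spark.sql.functions.udf
import spark.implicits._   // available by default in spark-shell

val df = Seq((1, Seq(Seq(1, 2, 3), Seq(4, 5), Seq(-1, -2, 0)))).toDF("id", "nested")

// flatten an array-of-array column into a single array column
val flattenArrays = udf((xs: Seq[Seq[Int]]) => xs.flatten)

df.select($"id", flattenArrays($"nested").as("flat")).show(false)
// flat = [1, 2, 3, 4, 5, -1, -2, 0]
{code}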

> flatten function to get flat array (or map) column from array of array (or 
> array of map) column
> ---
>
> Key: SPARK-16892
> URL: https://issues.apache.org/jira/browse/SPARK-16892
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Kapil Singh
>
> flatten(input)
> Converts input of array of array type into flat array type by inserting 
> elements of all element arrays into single array. Example:
> input: [[1, 2, 3], [4, 5], [-1, -2, 0]]
> output: [1, 2, 3, 4, 5, -1, -2, 0]
> Converts input of array of map type into flat map type by inserting key-value 
> pairs of all element maps into single map. Example:
> input: [(1 -> "one", 2 -> "two"), (0 -> "zero"), (4 -> "four")]
> output: (1 -> "one", 2 -> "two", 0 -> "zero", 4 -> "four")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18346) TRUNCATE TABLE should fail if no partition is matched for the given non-partial partition spec

2016-11-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646578#comment-15646578
 ] 

Apache Spark commented on SPARK-18346:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15805

> TRUNCATE TABLE should fail if no partition is matched for the given 
> non-partial partition spec
> --
>
> Key: SPARK-18346
> URL: https://issues.apache.org/jira/browse/SPARK-18346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18346) TRUNCATE TABLE should fail if no partition is matched for the given non-partial partition spec

2016-11-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18346:


Assignee: Wenchen Fan  (was: Apache Spark)

> TRUNCATE TABLE should fail if no partition is matched for the given 
> non-partial partition spec
> --
>
> Key: SPARK-18346
> URL: https://issues.apache.org/jira/browse/SPARK-18346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18346) TRUNCATE TABLE should fail if no partition is matched for the given non-partial partition spec

2016-11-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18346:


Assignee: Apache Spark  (was: Wenchen Fan)

> TRUNCATE TABLE should fail if no partition is matched for the given 
> non-partial partition spec
> --
>
> Key: SPARK-18346
> URL: https://issues.apache.org/jira/browse/SPARK-18346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16894) take function for returning the first n elements of array column

2016-11-07 Thread Kapil Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646576#comment-15646576
 ] 

Kapil Singh commented on SPARK-16894:
-

This is not about selecting the first n elements/columns from a Row; it's 
about selecting the first n elements of an array-type column. For every 
record/Row, the input column has some m elements and the result column should 
contain only the first n of them. The operation is similar to Scala 
collections' take; its scope is cell values, not the Row.
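
To illustrate the intended per-row semantics, here is a small UDF-based 
sketch in spark-shell (no such built-in exists as of 2.0.x; {{takeN}} and the 
column names are made up for illustration):

{code}
import org.apache.spark.sql.functions.udf
import spark.implicits._   // available by default in spark-shell

// stand-in for the requested take(inputArray, n): keep the first n cell
// elements of the array column, row by row
def takeN(n: Int) = udf((xs: Seq[Int]) => xs.take(n))

val df = Seq((1, Seq(10, 20, 30, 40)), (2, Seq(5))).toDF("id", "values")
df.select($"id", takeN(2)($"values").as("first2")).show()
// row 1 -> [10, 20], row 2 -> [5]
{code}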

> take function for returning the first n elements of array column
> 
>
> Key: SPARK-16894
> URL: https://issues.apache.org/jira/browse/SPARK-16894
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Kapil Singh
>
> take(inputArray, n)
> Returns array containing first n elements of inputArray



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18346) TRUNCATE TABLE should fail if no partition is matched for the given non-partial partition spec

2016-11-07 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-18346:
---

 Summary: TRUNCATE TABLE should fail if no partition is matched for 
the given non-partial partition spec
 Key: SPARK-18346
 URL: https://issues.apache.org/jira/browse/SPARK-18346
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16894) take function for returning the first n elements of array column

2016-11-07 Thread Kapil Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646563#comment-15646563
 ] 

Kapil Singh commented on SPARK-16894:
-

The use case is similar to Scala collections' take method. For example, one 
of the input columns is an array containing a product category hierarchy, 
e.g. [apparel, men, t-shirt, printed, ...], and I'm only interested in the 
first n (say 3) categories. I want a DataFrame function/expression that gives 
me an output column containing only the first 3 categories, e.g. [apparel, 
men, t-shirt].
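
Concretely, in spark-shell (a sketch only; the UDF stands in for the 
requested built-in, and the data/column names are made up):

{code}
import org.apache.spark.sql.functions.udf
import spark.implicits._   // available by default in spark-shell

val df = Seq(
  ("p1", Seq("apparel", "men", "t-shirt", "printed", "crew-neck"))
).toDF("product_id", "categories")

// stand-in for the requested take(inputArray, n)
def takeN(n: Int) = udf((xs: Seq[String]) => xs.take(n))

df.select($"product_id", takeN(3)($"categories").as("top_categories")).show(false)
// top_categories = [apparel, men, t-shirt]
{code}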

> take function for returning the first n elements of array column
> 
>
> Key: SPARK-16894
> URL: https://issues.apache.org/jira/browse/SPARK-16894
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Kapil Singh
>
> take(inputArray, n)
> Returns array containing first n elements of inputArray



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18345) Structured Streaming quick examples fails with default configuration

2016-11-07 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646551#comment-15646551
 ] 

Tsuyoshi Ozawa commented on SPARK-18345:


I would like to tackle this problem. I've fixed it locally and will send a PR soon.

> Structured Streaming quick examples fails with default configuration
> 
>
> Key: SPARK-18345
> URL: https://issues.apache.org/jira/browse/SPARK-18345
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Tsuyoshi Ozawa
>
> StructuredNetworkWordCount results in failure because it needs HDFS 
> configuration. It should use local filesystem instead of using HDFS by 
> default. 
> {quote}
> Exception in thread "main" java.net.ConnectException: Call From 
> ozamac-2.local/192.168.33.1 to localhost:9000 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:225)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:260)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount$.main(StructuredNetworkWordCount.scala:71)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount.main(StructuredNetworkWordCount.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}
> .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18345) Structured Streaming quick examples fails with default configuration

2016-11-07 Thread Tsuyoshi Ozawa (JIRA)
Tsuyoshi Ozawa created SPARK-18345:
--

 Summary: Structured Streaming quick examples fails with default 
configuration
 Key: SPARK-18345
 URL: https://issues.apache.org/jira/browse/SPARK-18345
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.0.1
Reporter: Tsuyoshi Ozawa


StructuredNetworkWordCount fails because it requires an HDFS configuration. 
It should use the local filesystem instead of HDFS by default.

{quote}
Exception in thread "main" java.net.ConnectException: Call From 
ozamac-2.local/192.168.33.1 to localhost:9000 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1351)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:225)
at 
org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:260)
at 
org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount$.main(StructuredNetworkWordCount.scala:71)
at 
org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount.main(StructuredNetworkWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{quote}

.
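
Until this is fixed, one possible workaround (a sketch only, assuming the 
quick example's {{wordCounts}} streaming DataFrame; this is not the actual 
fix in the PR) is to point the checkpoint location at the local filesystem 
explicitly:

{code}
// Workaround sketch: keep checkpoint data on the local filesystem so the
// example does not resolve paths against hdfs://localhost:9000.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "file:///tmp/structured-wordcount-checkpoint")
  .start()

query.awaitTermination()
{code}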



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18344) TRUNCATE TABLE should fail if no partition is matched for the given non-partial partition spec

2016-11-07 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-18344:
---

 Summary: TRUNCATE TABLE should fail if no partition is matched for 
the given non-partial partition spec
 Key: SPARK-18344
 URL: https://issues.apache.org/jira/browse/SPARK-18344
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2016-11-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646477#comment-15646477
 ] 

Song Jun edited comment on SPARK-18055 at 11/8/16 5:04 AM:
---

[~davies] I can't reproduce it on the master branch or on Databricks (Spark 
2.0.1-db1, Scala 2.11). My code:

{code}
import scala.collection.mutable.ArrayBuffer

case class MyData(id: String, arr: Seq[String])

val myarr = ArrayBuffer[MyData]()
for (i <- 20 to 30) {
  val arr = ArrayBuffer[String]()
  for (j <- 1 to 10) {
    arr += (i + j).toString
  }
  val mydata = new MyData(i.toString, arr)
  myarr += mydata
}

val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS   // toDS relies on spark.implicits._, imported by spark-shell

ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)
{code}

There is no exception. Has this been fixed, or is my code wrong?

I also tested with the attached test-jar_2.11-1.0.jar in spark-shell:

{code}
spark-shell --jars test-jar_2.11-1.0.jar
{code}

and there is no exception either.



was (Author: windpiger):
[~davies] I can't reproduce it on the master branch or on databricks(Spark 
2.0.1-db1 (Scala 2.11)),my code:

import scala.collection.mutable.ArrayBuffer
case class MyData(id: String,arr: Seq\[String])
val myarr = ArrayBuffer\[MyData]()
for(i <- 20 to 30){
val arr = ArrayBuffer\[String]()
for(j <- 1 to 10) {
arr += (i+j).toString
}

val mydata = new MyData(i.toString,arr)
myarr += mydata
}

val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS

ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)

there is no exception, it has fixed? or My code is wrong?
or I must test it with customed jar?

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
> Attachments: test-jar_2.11-1.0.jar
>
>
> Try to apply flatMap() on Dataset column which of of type
> com.A.B
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  ||-- element: string
> {code}
> flatMap works on RDD
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesnt work on dataset and gives the following error
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> Spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode()
> {code}
> ds.select(explode(col("outputs")))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18343) FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write

2016-11-07 Thread Luke Miner (JIRA)
Luke Miner created SPARK-18343:
--

 Summary: FileSystem$Statistics$StatisticsDataReferenceCleaner 
hangs on s3 write
 Key: SPARK-18343
 URL: https://issues.apache.org/jira/browse/SPARK-18343
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1
 Environment: Spark 2.0.1
Hadoop 2.7.1
Mesos 1.0.1
Ubuntu 14.04
Reporter: Luke Miner


I have a driver program where I read data in from Cassandra using Spark, 
perform some operations, and then write out to JSON on S3. The program runs 
fine when I use Spark 1.6.1 and the spark-cassandra-connector 1.6.0-M1.

However, if I try to upgrade to Spark 2.0.1 (hadoop 2.7.1) and 
spark-cassandra-connector 2.0.0-M3, the program completes in the sense that all 
the expected files are written to S3, but the program never terminates.

I do run `sc.stop()` at the end of the program. I am also using Mesos 1.0.1. In 
both cases I use the default output committer.

From the thread dump (included below) it seems like it could be waiting on: 
`org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner`

Code snippet:
{code}
// get MongoDB oplog operations
val operations = sc.cassandraTable[JsonOperation](keyspace, namespace)
  .where("ts >= ? AND ts < ?", minTimestamp, maxTimestamp)

// replay oplog operations into documents
val documents = operations
  .spanBy(op => op.id)
  .map { case (id: String, ops: Iterable[T]) => (id, apply(ops)) }
  .filter { case (id, result) => result.isInstanceOf[Document] }
  .map { case (id, document) =>
    MergedDocument(id = id, document = document.asInstanceOf[Document])
  }

// write documents to json on s3
documents
  .map(document => document.toJson)
  .coalesce(partitions)
  .saveAsTextFile(path, classOf[GzipCodec])
sc.stop()
{code}
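
As a blunt workaround while this is open (my own assumption, not a fix for 
the underlying hang), the driver JVM can be forced down once the job has 
finished:

{code}
// Workaround sketch only: exit explicitly after sc.stop() returns, even if
// housekeeping threads such as the statistics cleaner are still parked.
sc.stop()
sys.exit(0)
{code}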

Thread dump on the driver:

{code}
60  context-cleaner-periodic-gc TIMED_WAITING
46  dag-scheduler-event-loopWAITING
4389DestroyJavaVM   RUNNABLE
12  dispatcher-event-loop-0 WAITING
13  dispatcher-event-loop-1 WAITING
14  dispatcher-event-loop-2 WAITING
15  dispatcher-event-loop-3 WAITING
47  driver-revive-threadTIMED_WAITING
3   Finalizer   WAITING
82  ForkJoinPool-1-worker-17WAITING
43  heartbeat-receiver-event-loop-threadTIMED_WAITING
93  java-sdk-http-connection-reaper TIMED_WAITING
4387java-sdk-progress-listener-callback-thread  WAITING
25  map-output-dispatcher-0 WAITING
26  map-output-dispatcher-1 WAITING
27  map-output-dispatcher-2 WAITING
28  map-output-dispatcher-3 WAITING
29  map-output-dispatcher-4 WAITING
30  map-output-dispatcher-5 WAITING
31  map-output-dispatcher-6 WAITING
32  map-output-dispatcher-7 WAITING
48  MesosCoarseGrainedSchedulerBackend-mesos-driver RUNNABLE
44  netty-rpc-env-timeout   TIMED_WAITING
92  
org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner   
WAITING
62  pool-19-thread-1TIMED_WAITING
2   Reference Handler   WAITING
61  Scheduler-1112394071TIMED_WAITING
20  shuffle-server-0RUNNABLE
55  shuffle-server-0RUNNABLE
21  shuffle-server-1RUNNABLE
56  shuffle-server-1RUNNABLE
22  shuffle-server-2RUNNABLE
57  shuffle-server-2RUNNABLE
23  shuffle-server-3RUNNABLE
58  shuffle-server-3RUNNABLE
4   Signal Dispatcher   RUNNABLE
59  Spark Context Cleaner   TIMED_WAITING
9   SparkListenerBusWAITING
35  SparkUI-35-selector-ServerConnectorManager@651d3734/0   RUNNABLE
36  
SparkUI-36-acceptor-0@467924cb-ServerConnector@3b5eaf92{HTTP/1.1}{0.0.0.0:4040} 
RUNNABLE
37  SparkUI-37-selector-ServerConnectorManager@651d3734/1   RUNNABLE
38  SparkUI-38  TIMED_WAITING
39  SparkUI-39  TIMED_WAITING
40  SparkUI-40  TIMED_WAITING
41  SparkUI-41  RUNNABLE
42  SparkUI-42  TIMED_WAITING
438 task-result-getter-0WAITING
450 task-result-getter-1WAITING
489 task-result-getter-2WAITING
492 task-result-getter-3WAITING
75  threadDeathWatcher-2-1  TIMED_WAITING
45  Timer-0 WAITING
{code}

Thread dump on the executors. It's the same on all of them:

{code}
24  dispatcher-event-loop-0 WAITING
25  dispatcher-event-loop-1 WAITING
26  dispatcher-event-loop-2 RUNNABLE
27  dispatcher-event-loop-3 WAITING
39  driver-heartbeater  TIMED_WAITING
3   Finalizer   WAITING
58  java-sdk-http-connection-reaper TIMED_WAITING
75  java-sdk-progress-listener-callback-thread  WAITING
1   mainTIMED_WAITING
33  netty-rpc-env-timeout   TIMED_WAITING
55  
org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner   
WAITING
59  pool-17-thread-1TIMED_WAITING
2   Reference Handler   WAITING
28  

[jira] [Comment Edited] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2016-11-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646477#comment-15646477
 ] 

Song Jun edited comment on SPARK-18055 at 11/8/16 4:51 AM:
---

[~davies] I can't reproduce it on the master branch or on Databricks (Spark 
2.0.1-db1, Scala 2.11). My code:

{code}
import scala.collection.mutable.ArrayBuffer

case class MyData(id: String, arr: Seq[String])

val myarr = ArrayBuffer[MyData]()
for (i <- 20 to 30) {
  val arr = ArrayBuffer[String]()
  for (j <- 1 to 10) {
    arr += (i + j).toString
  }
  val mydata = new MyData(i.toString, arr)
  myarr += mydata
}

val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS   // toDS relies on spark.implicits._, imported by spark-shell

ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)
{code}

There is no exception. Has this been fixed, or is my code wrong?
Or must I test it with the attached custom jar?


was (Author: windpiger):
[~davies] I can't reproduce it on the master branch or on databricks(Spark 
2.0.1-db1 (Scala 2.11)),my code:

import scala.collection.mutable.ArrayBuffer
case class MyData(id: String,arr: Seq\[String])
val myarr = ArrayBuffer\[MyData]()
for(i <- 20 to 30){
val arr = ArrayBuffer\[String]()
for(j <- 1 to 10) {
arr += (i+j).toString
}

val mydata = new MyData(i.toString,arr)
myarr += mydata
}

val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS

ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)

there is no exception, it has fixed? or My code is wrong?

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
> Attachments: test-jar_2.11-1.0.jar
>
>
> Try to apply flatMap() on Dataset column which of of type
> com.A.B
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  ||-- element: string
> {code}
> flatMap works on RDD
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesnt work on dataset and gives the following error
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> Spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode()
> {code}
> ds.select(explode(col("outputs")))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18342) HDFSBackedStateStore can fail to rename files causing snapshotting and recovery to fail

2016-11-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646484#comment-15646484
 ] 

Apache Spark commented on SPARK-18342:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15804

> HDFSBackedStateStore can fail to rename files causing snapshotting and 
> recovery to fail
> ---
>
> Key: SPARK-18342
> URL: https://issues.apache.org/jira/browse/SPARK-18342
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Burak Yavuz
>Priority: Critical
>
> The HDFSBackedStateStore renames temporary files to delta files as it commits 
> new versions. It however doesn't check whether the rename succeeded. If the 
> rename fails, then recovery will not be possible. It should fail during the 
> rename stage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18342) HDFSBackedStateStore can fail to rename files causing snapshotting and recovery to fail

2016-11-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18342:


Assignee: (was: Apache Spark)

> HDFSBackedStateStore can fail to rename files causing snapshotting and 
> recovery to fail
> ---
>
> Key: SPARK-18342
> URL: https://issues.apache.org/jira/browse/SPARK-18342
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Burak Yavuz
>Priority: Critical
>
> The HDFSBackedStateStore renames temporary files to delta files as it commits 
> new versions. It however doesn't check whether the rename succeeded. If the 
> rename fails, then recovery will not be possible. It should fail during the 
> rename stage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18342) HDFSBackedStateStore can fail to rename files causing snapshotting and recovery to fail

2016-11-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18342:


Assignee: Apache Spark

> HDFSBackedStateStore can fail to rename files causing snapshotting and 
> recovery to fail
> ---
>
> Key: SPARK-18342
> URL: https://issues.apache.org/jira/browse/SPARK-18342
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>Priority: Critical
>
> The HDFSBackedStateStore renames temporary files to delta files as it commits 
> new versions. It however doesn't check whether the rename succeeded. If the 
> rename fails, then recovery will not be possible. It should fail during the 
> rename stage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2016-11-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646477#comment-15646477
 ] 

Song Jun commented on SPARK-18055:
--

[~davies] I can't reproduce it on the master branch or on Databricks (Spark 
2.0.1-db1, Scala 2.11). My code:

{code}
import scala.collection.mutable.ArrayBuffer

case class MyData(id: String, arr: Seq[String])

val myarr = ArrayBuffer[MyData]()
for (i <- 20 to 30) {
  val arr = ArrayBuffer[String]()
  for (j <- 1 to 10) {
    arr += (i + j).toString
  }
  val mydata = new MyData(i.toString, arr)
  myarr += mydata
}

val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS   // toDS relies on spark.implicits._, imported by spark-shell

ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)
{code}

There is no exception. Has this been fixed, or is my code wrong?

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
> Attachments: test-jar_2.11-1.0.jar
>
>
> Try to apply flatMap() on Dataset column which of of type
> com.A.B
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  ||-- element: string
> {code}
> flatMap works on RDD
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesnt work on dataset and gives the following error
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> Spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode()
> {code}
> ds.select(explode(col("outputs")))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-11-07 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646453#comment-15646453
 ] 

Harish commented on SPARK-17463:


I was able to figure out the issue; it's not related to this bug.

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 

[jira] [Comment Edited] (SPARK-16892) flatten function to get flat array (or map) column from array of array (or array of map) column

2016-11-07 Thread Jayadevan M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646443#comment-15646443
 ] 

Jayadevan M edited comment on SPARK-16892 at 11/8/16 4:21 AM:
--

I think you can use flatMap for this scenario; I don't think any new function 
is required. For example:

{code}
val array = Array(Array(1, 2, 3), Array(4, 5), Array(-1, -2, 0))
val rdd = sc.parallelize(array)
rdd.flatMap(x => x).collect()
// Array(1, 2, 3, 4, 5, -1, -2, 0)
{code}



was (Author: jayadevan.m):
I hope you can use flatMap for this scenario. For example

var array=Array(Array(1, 2, 3), Array(4, 5), Array(-1, -2, 0));
var rdd = sc.parallelize(array);
rdd.flatMap(x=>x).collect();


> flatten function to get flat array (or map) column from array of array (or 
> array of map) column
> ---
>
> Key: SPARK-16892
> URL: https://issues.apache.org/jira/browse/SPARK-16892
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Kapil Singh
>
> flatten(input)
> Converts input of array of array type into flat array type by inserting 
> elements of all element arrays into single array. Example:
> input: [[1, 2, 3], [4, 5], [-1, -2, 0]]
> output: [1, 2, 3, 4, 5, -1, -2, 0]
> Converts input of array of map type into flat map type by inserting key-value 
> pairs of all element maps into single map. Example:
> input: [(1 -> "one", 2 -> "two"), (0 -> "zero"), (4 -> "four")]
> output: (1 -> "one", 2 -> "two", 0 -> "zero", 4 -> "four")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16892) flatten function to get flat array (or map) column from array of array (or array of map) column

2016-11-07 Thread Jayadevan M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646443#comment-15646443
 ] 

Jayadevan M commented on SPARK-16892:
-

I think you can use flatMap for this scenario. For example:

{code}
val array = Array(Array(1, 2, 3), Array(4, 5), Array(-1, -2, 0))
val rdd = sc.parallelize(array)
rdd.flatMap(x => x).collect()
// Array(1, 2, 3, 4, 5, -1, -2, 0)
{code}


> flatten function to get flat array (or map) column from array of array (or 
> array of map) column
> ---
>
> Key: SPARK-16892
> URL: https://issues.apache.org/jira/browse/SPARK-16892
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Kapil Singh
>
> flatten(input)
> Converts input of array of array type into flat array type by inserting 
> elements of all element arrays into single array. Example:
> input: [[1, 2, 3], [4, 5], [-1, -2, 0]]
> output: [1, 2, 3, 4, 5, -1, -2, 0]
> Converts input of array of map type into flat map type by inserting key-value 
> pairs of all element maps into single map. Example:
> input: [(1 -> "one", 2 -> "two"), (0 -> "zero"), (4 -> "four")]
> output: (1 -> "one", 2 -> "two", 0 -> "zero", 4 -> "four")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18342) HDFSBackedStateStore can fail to rename files causing snapshotting and recovery to fail

2016-11-07 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18342:
---

 Summary: HDFSBackedStateStore can fail to rename files causing 
snapshotting and recovery to fail
 Key: SPARK-18342
 URL: https://issues.apache.org/jira/browse/SPARK-18342
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.0.1
Reporter: Burak Yavuz
Priority: Critical


The HDFSBackedStateStore renames temporary files to delta files as it commits 
new versions. However, it does not check whether the rename succeeded. If the 
rename fails, recovery will not be possible, so the commit should fail at the 
rename stage.
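
A minimal sketch of the intended check (illustrative only; the method and 
variable names below are not the actual HDFSBackedStateStore internals):

{code}
// Sketch: fail fast if the commit rename does not succeed, since
// FileSystem.rename returns false on failure instead of throwing.
import org.apache.hadoop.fs.{FileSystem, Path}

def commitDelta(fs: FileSystem, tempDeltaFile: Path, finalDeltaFile: Path): Unit = {
  if (!fs.rename(tempDeltaFile, finalDeltaFile)) {
    throw new java.io.IOException(
      s"Failed to rename $tempDeltaFile to $finalDeltaFile when committing state")
  }
}
{code}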



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18341) Eliminate use of SingularMatrixException in WeightedLeastSquares logic

2016-11-07 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-18341:
-

 Summary: Eliminate use of SingularMatrixException in 
WeightedLeastSquares logic
 Key: SPARK-18341
 URL: https://issues.apache.org/jira/browse/SPARK-18341
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor


WeightedLeastSquares uses an exception to implement the fallback logic that 
chooses which solver to use: 
[https://github.com/apache/spark/blob/6f3697136aa68dc39d3ce42f43a7af554d2a3bf9/mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala#L258]

We should use an error code instead of an exception (a rough sketch of the 
idea follows below).
* Note the error code should be internal, not a public API.
* We may be able to eliminate the SingularMatrixException class.
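
A minimal sketch of the error-code idea (purely illustrative names, not the 
actual WeightedLeastSquares internals):

{code}
// Sketch: report a singular normal equation through an internal result type
// instead of throwing SingularMatrixException and catching it for fallback.
sealed trait SolverResult
case class Solved(coefficients: Array[Double]) extends SolverResult
case object SingularMatrix extends SolverResult

def solveWithFallback(primary: => SolverResult, fallback: => SolverResult): SolverResult =
  primary match {
    case SingularMatrix => fallback // fall back to the other solver without exceptions
    case solved         => solved
  }
{code}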



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18298) HistoryServer use GMT time all time

2016-11-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646333#comment-15646333
 ] 

Song Jun commented on SPARK-18298:
--

[~WangTao] I tested it, and the UI should show the user's local time, as 
1.6.x did, so I think this is a bug. I have posted a pull request.

> HistoryServer use GMT time all time
> ---
>
> Key: SPARK-18298
> URL: https://issues.apache.org/jira/browse/SPARK-18298
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.0.1
> Environment: suse 11.3 with CST time
>Reporter: Tao Wang
>
> When I started HistoryServer for reading event logs, the timestamp readed 
> will be parsed using local timezone like "CST"(confirmed via debug).
> But the time related columns like "Started"/"Completed"/"Last Updated" in 
> History Server UI using "GMT" time, which is 8 hours earlier than "CST".
> {quote}
> App IDApp NameStarted Completed   DurationSpark 
> User  Last UpdatedEvent Log
> local-1478225166651   Spark shell 2016-11-04 02:06:06 2016-11-07 
> 01:33:30 71.5 h  root2016-11-07 01:33:30
> {quote}
> I've checked the REST api and found the result like:
> {color:red}
> [ {
>   "id" : "local-1478225166651",
>   "name" : "Spark shell",
>   "attempts" : [ {
> "startTime" : "2016-11-04T02:06:06.020GMT",  
> "endTime" : "2016-11-07T01:33:30.265GMT",  
> "lastUpdated" : "2016-11-07T01:33:30.000GMT",
> "duration" : 257244245,
> "sparkUser" : "root",
> "completed" : true,
> "lastUpdatedEpoch" : 147848241,
> "endTimeEpoch" : 1478482410265,
> "startTimeEpoch" : 1478225166020
>   } ]
> }, {
>   "id" : "local-1478224925869",
>   "name" : "Spark Pi",
>   "attempts" : [ {
> "startTime" : "2016-11-04T02:02:02.133GMT",
> "endTime" : "2016-11-04T02:02:07.468GMT",
> "lastUpdated" : "2016-11-04T02:02:07.000GMT",
> "duration" : 5335,
> "sparkUser" : "root",
> "completed" : true,
> ...
> {color}
> So maybe the change happened in transferring between server and browser? I 
> have no idea where to go from this point.
> Hope guys can offer some help, or just fix it if it's easy? :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18298) HistoryServer use GMT time all time

2016-11-07 Thread Tao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646270#comment-15646270
 ] 

Tao Wang commented on SPARK-18298:
--

[~ajbozarth] Thanks for your attention.

I think it's issue 1, where the time is always shown in the GMT timezone 
regardless of the server's timezone.

For example, if the stored timestamp is `1478573043680`, we expect the time 
shown in the HistoryServer to look like "2016-11-08 10:44:03" (CST, the same as 
the server timezone), not "2016-11-08 02:44:03" (GMT, ignoring the timezone the 
server uses).

In my opinion, the time should be shown in the server's timezone, since the 
people who run the server are more likely to use the local timezone than GMT.
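
For reference, a small Scala snippet that reproduces the two renderings of the same epoch value; the only difference is the time zone applied by the formatter:

{code}
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

val epochMillis = 1478573043680L
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

fmt.setTimeZone(TimeZone.getTimeZone("GMT"))
println(fmt.format(new Date(epochMillis)))              // 2016-11-08 02:44:03

fmt.setTimeZone(TimeZone.getTimeZone("Asia/Shanghai"))  // CST, GMT+8
println(fmt.format(new Date(epochMillis)))              // 2016-11-08 10:44:03
{code}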

> HistoryServer use GMT time all time
> ---
>
> Key: SPARK-18298
> URL: https://issues.apache.org/jira/browse/SPARK-18298
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.0.1
> Environment: suse 11.3 with CST time
>Reporter: Tao Wang
>
> When I started the HistoryServer to read event logs, the timestamps read 
> were parsed using the local timezone, e.g. "CST" (confirmed via debug).
> But the time-related columns like "Started"/"Completed"/"Last Updated" in 
> the History Server UI use "GMT" time, which is 8 hours earlier than "CST".
> {quote}
> App IDApp NameStarted Completed   DurationSpark 
> User  Last UpdatedEvent Log
> local-1478225166651   Spark shell 2016-11-04 02:06:06 2016-11-07 
> 01:33:30 71.5 h  root2016-11-07 01:33:30
> {quote}
> I've checked the REST api and found the result like:
> {color:red}
> [ {
>   "id" : "local-1478225166651",
>   "name" : "Spark shell",
>   "attempts" : [ {
> "startTime" : "2016-11-04T02:06:06.020GMT",  
> "endTime" : "2016-11-07T01:33:30.265GMT",  
> "lastUpdated" : "2016-11-07T01:33:30.000GMT",
> "duration" : 257244245,
> "sparkUser" : "root",
> "completed" : true,
> "lastUpdatedEpoch" : 147848241,
> "endTimeEpoch" : 1478482410265,
> "startTimeEpoch" : 1478225166020
>   } ]
> }, {
>   "id" : "local-1478224925869",
>   "name" : "Spark Pi",
>   "attempts" : [ {
> "startTime" : "2016-11-04T02:02:02.133GMT",
> "endTime" : "2016-11-04T02:02:07.468GMT",
> "lastUpdated" : "2016-11-04T02:02:07.000GMT",
> "duration" : 5335,
> "sparkUser" : "root",
> "completed" : true,
> ...
> {color}
> So maybe the change happened in transferring between server and browser? I 
> have no idea where to go from this point.
> Hope guys can offer some help, or just fix it if it's easy? :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16575) partition calculation mismatch with sc.binaryFiles

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16575.
-
   Resolution: Fixed
 Assignee: Tarun Kumar
Fix Version/s: 2.1.0

> partition calculation mismatch with sc.binaryFiles
> --
>
> Key: SPARK-16575
> URL: https://issues.apache.org/jira/browse/SPARK-16575
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Java API, Shuffle, Spark Core, Spark Shell
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Suhas
>Assignee: Tarun Kumar
>Priority: Critical
> Fix For: 2.1.0
>
>
> sc.binaryFiles always creates an RDD with only 2 partitions.
> Steps to reproduce: (Tested this bug on databricks community edition)
> 1. Try to create an RDD using sc.binaryFiles. In this example, the airlines 
> folder has 1922 files.
>  Ex: {noformat}val binaryRDD = 
> sc.binaryFiles("/databricks-datasets/airlines/*"){noformat}
> 2. Check the number of partitions of the above RDD
> - binaryRDD.partitions.size = 2. (expected value is more than 2)
> 3. If the RDD is created using sc.textFile, then the number of partitions is 
> 1921.
> 4. In Spark 1.5.1, the same sc.binaryFiles call creates 1921 partitions.
> For explanation with screenshot, please look at the link below,
> http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html
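
For reference, a spark-shell snippet showing the check from step 2 plus two things worth trying; neither fixes the underlying split calculation, and whether the {{minPartitions}} hint is honored on the affected versions is not covered by the report:

{code}
val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*")
binaryRDD.partitions.size   // 2 on 1.6.1/1.6.2, ~1921 expected

// binaryFiles accepts a minPartitions hint as a second argument.
val hinted = sc.binaryFiles("/databricks-datasets/airlines/*", sc.defaultParallelism)

// Repartitioning only restores parallelism for downstream stages;
// the initial read still happens with too few partitions.
val repartitioned = binaryRDD.repartition(sc.defaultParallelism)
{code}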



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18263) Configuring spark.kryo.registrator programmatically doesn't take effect

2016-11-07 Thread inred (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646224#comment-15646224
 ] 

inred commented on SPARK-18263:
---

Configuring it on the builder does take effect:
{code}
val spark = SparkSession
  .builder
  .config("spark.kryo.registrator",
    "org.bdgenomics.adam.serialization.ADAMKryoRegistrator")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
{code}
i.e. the Kryo settings have to be supplied before the session is created, rather 
than via spark.conf.set afterwards.

> Configuring spark.kryo.registrator programmatically doesn't take effect
> ---
>
> Key: SPARK-18263
> URL: https://issues.apache.org/jira/browse/SPARK-18263
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: spark-2.0.1-bin-hadoop2.6
> scala-2.11.8
>Reporter: inred
>
> it runs OK with spark-shell --conf 
> spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf 
> spark.kryo.registrator=org.bdgenomics.adam.serialization.ADAMKryoRegistrator \
> but in IDE
>  val spark = SparkSession.builder.master("local[*]").appName("Anno 
> BDG").getOrCreate()
> spark.conf.set("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
> spark.conf.set("spark.kryo.registrator", 
> "org.bdgenomics.adam.serialization.ADAMKryoRegistrator")
> it reports the following error:
> java.io.NotSerializableException: org.bdgenomics.formats.avro.AlignmentRecord
> Serialization stack:
> object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, 
> value: {"readInFragment": 0, "contigName": "chr10", "start": 61758687, 
> "oldPosition": null, "end": 61758727, "mapq": 25, "readName": 
> "NB501244AR:119:HJY3WBGXY:2:2:6137:19359", "sequence": 
> "TACTGAGACTTATCAGAATTTCAGGCTAAAGCAACC", "qual": 
> "AAEA", "cigar": "40M", "oldCigar": null, 
> "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": false, 
> "properPair": false, "readMapped": true, "mateMapped": false, 
> "failedVendorQualityChecks": false, "duplicateRead": false, 
> "readNegativeStrand": false, "mateNegativeStrand": false, "primaryAlignment": 
> true, "secondaryAlignment": false, "supplementaryAlignment": false, 
> "mismatchingPositions": "40", "origQual": null, "attributes": 
> "XT:A:U\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX1:i:0\tX0:i:1", "recordGroupName": 
> null, "recordGroupSample": null, "mateAlignmentStart": null, 
> "mateAlignmentEnd": null, "mateContigName": null, "inferredInsertSize": null})
> at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> at 
> org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:135)
> at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:185)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:150)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2016-11-04 10:30:56 ERROR TaskSetManager:70 - Task 0.0 in stage 2.0 (TID 9) 
> had a not serializable result: org.bdgenomics.formats.avro.AlignmentRecord
> Serialization stack:
> object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, 
> value: {"readInFragment": 0, "contigName": "chr1", "start": 10001, 
> "oldPosition": null, "end": 10041, "mapq": 0, "readName": 
> "NB501244AR:119:HJY3WBGXY:3:11508:7857:8792", "sequence": 
> "AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC", "qual": 
> "///E6EEEAEEE/AEAAA/A", "cigar": "40M", "oldCigar": null, 
> "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": false, 
> "properPair": false, "readMapped": true, "mateMapped": false, 
> "failedVendorQualityChecks": false, "duplicateRead": false, 
> "readNegativeStrand": true, "mateNegativeStrand": false, "primaryAlignment": 
> true, "secondaryAlignment": false, "supplementaryAlignment": false, 
> "mismatchingPositions": "40", "origQual": null, "attributes": 
> "XT:A:R\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX0:i:594", "recordGroupName": null, 
> "recordGroupSample": null, "mateAlignmentStart": null, "mateAlignmentEnd": 
> null, "mateContigName": null, "inferredInsertSize": null}); not retrying
> 2016-11-04 10:30:56 ERROR TaskSetManager:70 - Task 4.0 in stage 2.0 (TID 13) 
> had a not serializable 

[jira] [Resolved] (SPARK-18263) Configuring spark.kryo.registrator programmatically doesn't take effect

2016-11-07 Thread inred (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

inred resolved SPARK-18263.
---
Resolution: Fixed

> Configuring spark.kryo.registrator programmatically doesn't take effect
> ---
>
> Key: SPARK-18263
> URL: https://issues.apache.org/jira/browse/SPARK-18263
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: spark-2.0.1-bin-hadoop2.6
> scala-2.11.8
>Reporter: inred
>
> it runs OK with spark-shell --conf 
> spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf 
> spark.kryo.registrator=org.bdgenomics.adam.serialization.ADAMKryoRegistrator \
> but in IDE
>  val spark = SparkSession.builder.master("local[*]").appName("Anno 
> BDG").getOrCreate()
> spark.conf.set("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
> spark.conf.set("spark.kryo.registrator", 
> "org.bdgenomics.adam.serialization.ADAMKryoRegistrator")
> it reports the following error:
> java.io.NotSerializableException: org.bdgenomics.formats.avro.AlignmentRecord
> Serialization stack:
> object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, 
> value: {"readInFragment": 0, "contigName": "chr10", "start": 61758687, 
> "oldPosition": null, "end": 61758727, "mapq": 25, "readName": 
> "NB501244AR:119:HJY3WBGXY:2:2:6137:19359", "sequence": 
> "TACTGAGACTTATCAGAATTTCAGGCTAAAGCAACC", "qual": 
> "AAEA", "cigar": "40M", "oldCigar": null, 
> "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": false, 
> "properPair": false, "readMapped": true, "mateMapped": false, 
> "failedVendorQualityChecks": false, "duplicateRead": false, 
> "readNegativeStrand": false, "mateNegativeStrand": false, "primaryAlignment": 
> true, "secondaryAlignment": false, "supplementaryAlignment": false, 
> "mismatchingPositions": "40", "origQual": null, "attributes": 
> "XT:A:U\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX1:i:0\tX0:i:1", "recordGroupName": 
> null, "recordGroupSample": null, "mateAlignmentStart": null, 
> "mateAlignmentEnd": null, "mateContigName": null, "inferredInsertSize": null})
> at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> at 
> org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:135)
> at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:185)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:150)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2016-11-04 10:30:56 ERROR TaskSetManager:70 - Task 0.0 in stage 2.0 (TID 9) 
> had a not serializable result: org.bdgenomics.formats.avro.AlignmentRecord
> Serialization stack:
> object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, 
> value: {"readInFragment": 0, "contigName": "chr1", "start": 10001, 
> "oldPosition": null, "end": 10041, "mapq": 0, "readName": 
> "NB501244AR:119:HJY3WBGXY:3:11508:7857:8792", "sequence": 
> "AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC", "qual": 
> "///E6EEEAEEE/AEAAA/A", "cigar": "40M", "oldCigar": null, 
> "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": false, 
> "properPair": false, "readMapped": true, "mateMapped": false, 
> "failedVendorQualityChecks": false, "duplicateRead": false, 
> "readNegativeStrand": true, "mateNegativeStrand": false, "primaryAlignment": 
> true, "secondaryAlignment": false, "supplementaryAlignment": false, 
> "mismatchingPositions": "40", "origQual": null, "attributes": 
> "XT:A:R\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX0:i:594", "recordGroupName": null, 
> "recordGroupSample": null, "mateAlignmentStart": null, "mateAlignmentEnd": 
> null, "mateContigName": null, "inferredInsertSize": null}); not retrying
> 2016-11-04 10:30:56 ERROR TaskSetManager:70 - Task 4.0 in stage 2.0 (TID 13) 
> had a not serializable result: org.bdgenomics.formats.avro.AlignmentRecord
> Serialization stack:
> object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, 
> value: {"readInFragment": 0, "contigName": "chr10", "start": 61758687, 
> "oldPosition": null, "end": 61758727, 

[jira] [Closed] (SPARK-18263) Configuring spark.kryo.registrator programmatically doesn't take effect

2016-11-07 Thread inred (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

inred closed SPARK-18263.
-

> Configuring spark.kryo.registrator programmatically doesn't take effect
> ---
>
> Key: SPARK-18263
> URL: https://issues.apache.org/jira/browse/SPARK-18263
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: spark-2.0.1-bin-hadoop2.6
> scala-2.11.8
>Reporter: inred
>
> it runs OK with spark-shell --conf 
> spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf 
> spark.kryo.registrator=org.bdgenomics.adam.serialization.ADAMKryoRegistrator \
> but in IDE
>  val spark = SparkSession.builder.master("local[*]").appName("Anno 
> BDG").getOrCreate()
> spark.conf.set("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
> spark.conf.set("spark.kryo.registrator", 
> "org.bdgenomics.adam.serialization.ADAMKryoRegistrator")
> it reports the following error:
> java.io.NotSerializableException: org.bdgenomics.formats.avro.AlignmentRecord
> Serialization stack:
> object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, 
> value: {"readInFragment": 0, "contigName": "chr10", "start": 61758687, 
> "oldPosition": null, "end": 61758727, "mapq": 25, "readName": 
> "NB501244AR:119:HJY3WBGXY:2:2:6137:19359", "sequence": 
> "TACTGAGACTTATCAGAATTTCAGGCTAAAGCAACC", "qual": 
> "AAEA", "cigar": "40M", "oldCigar": null, 
> "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": false, 
> "properPair": false, "readMapped": true, "mateMapped": false, 
> "failedVendorQualityChecks": false, "duplicateRead": false, 
> "readNegativeStrand": false, "mateNegativeStrand": false, "primaryAlignment": 
> true, "secondaryAlignment": false, "supplementaryAlignment": false, 
> "mismatchingPositions": "40", "origQual": null, "attributes": 
> "XT:A:U\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX1:i:0\tX0:i:1", "recordGroupName": 
> null, "recordGroupSample": null, "mateAlignmentStart": null, 
> "mateAlignmentEnd": null, "mateContigName": null, "inferredInsertSize": null})
> at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> at 
> org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:135)
> at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:185)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:150)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2016-11-04 10:30:56 ERROR TaskSetManager:70 - Task 0.0 in stage 2.0 (TID 9) 
> had a not serializable result: org.bdgenomics.formats.avro.AlignmentRecord
> Serialization stack:
> object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, 
> value: {"readInFragment": 0, "contigName": "chr1", "start": 10001, 
> "oldPosition": null, "end": 10041, "mapq": 0, "readName": 
> "NB501244AR:119:HJY3WBGXY:3:11508:7857:8792", "sequence": 
> "AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC", "qual": 
> "///E6EEEAEEE/AEAAA/A", "cigar": "40M", "oldCigar": null, 
> "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": false, 
> "properPair": false, "readMapped": true, "mateMapped": false, 
> "failedVendorQualityChecks": false, "duplicateRead": false, 
> "readNegativeStrand": true, "mateNegativeStrand": false, "primaryAlignment": 
> true, "secondaryAlignment": false, "supplementaryAlignment": false, 
> "mismatchingPositions": "40", "origQual": null, "attributes": 
> "XT:A:R\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX0:i:594", "recordGroupName": null, 
> "recordGroupSample": null, "mateAlignmentStart": null, "mateAlignmentEnd": 
> null, "mateContigName": null, "inferredInsertSize": null}); not retrying
> 2016-11-04 10:30:56 ERROR TaskSetManager:70 - Task 4.0 in stage 2.0 (TID 13) 
> had a not serializable result: org.bdgenomics.formats.avro.AlignmentRecord
> Serialization stack:
> object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, 
> value: {"readInFragment": 0, "contigName": "chr10", "start": 61758687, 
> "oldPosition": null, "end": 61758727, "mapq": 25, "readName": 
> 

[jira] [Updated] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming

2016-11-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18339:
-
Target Version/s: 2.2.0

> Don't push down current_timestamp for filters in StructuredStreaming
> 
>
> Key: SPARK-18339
> URL: https://issues.apache.org/jira/browse/SPARK-18339
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Burak Yavuz
>  Labels: correctness
>
> For the following workflow:
> 1. I have a column called time which is at minute level precision in a 
> Streaming DataFrame
> 2. I want to perform groupBy time, count
> 3. Then I want my MemorySink to only have the last 30 minutes of counts and I 
> perform this by
> {code}
> .where('time >= current_timestamp().cast("long") - 30 * 60)
> {code}
> what happens is that the `filter` gets pushed down before the aggregation, 
> and the filter happens on the source data for the aggregation instead of the 
> result of the aggregation (where I actually want to filter).
> I guess the main issue here is that `current_timestamp` is non-deterministic 
> in the streaming context, so the filter shouldn't be pushed down.
> Whether this requires us to store the `current_timestamp` for each trigger of 
> the streaming job is something to discuss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18217) Disallow creating permanent views based on temporary views or UDFs

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18217.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Disallow creating permanent views based on temporary views or UDFs
> --
>
> Key: SPARK-18217
> URL: https://issues.apache.org/jira/browse/SPARK-18217
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> See the discussion in the parent ticket SPARK-18209. It doesn't really make 
> sense to create permanent views based on temporary views or UDFs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16609) Single function for parsing timestamps/dates

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16609:

Target Version/s: 2.2.0  (was: 2.1.0)

> Single function for parsing timestamps/dates
> 
>
> Key: SPARK-16609
> URL: https://issues.apache.org/jira/browse/SPARK-16609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Reynold Xin
>
> Today, if you want to parse a date or timestamp, you have to use the unix 
> time function and then cast to a timestamp.  It's a little odd that there isn't 
> a single function that does both.  I propose we add
> {code}
> to_date(<string>, <format>) / to_timestamp(<string>, <format>)
> {code}
> For reference, in other systems there are:
> MS SQL: {{convert(<type>, <expression>)}}. See: 
> https://technet.microsoft.com/en-us/library/ms174450(v=sql.110).aspx
> Netezza: {{to_timestamp(<value>, <format>)}}. See: 
> https://www.ibm.com/support/knowledgecenter/SSULQD_7.0.3/com.ibm.nz.dbu.doc/r_dbuser_ntz_sql_extns_conversion_funcs.html
> Teradata has special casting functionality: {{cast(<value> as timestamp 
> format '<format>')}}
> MySQL: {{STR_TO_DATE(<string>, <format>)}}. This returns a datetime when you 
> define both date and time parts. See: 
> https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html
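
For illustration, today's workaround versus the proposed call; the second statement is hypothetical usage of the proposal, not an existing function at the time of this ticket:

{code}
// Today: unix_timestamp + cast
spark.sql("SELECT CAST(unix_timestamp('2016-11-07 01:33:30', 'yyyy-MM-dd HH:mm:ss') AS timestamp)").show()

// Proposed (hypothetical): a single parsing function
spark.sql("SELECT to_timestamp('2016-11-07 01:33:30', 'yyyy-MM-dd HH:mm:ss')").show()
{code}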



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18298) HistoryServer use GMT time all time

2016-11-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18298:


Assignee: (was: Apache Spark)

> HistoryServer use GMT time all time
> ---
>
> Key: SPARK-18298
> URL: https://issues.apache.org/jira/browse/SPARK-18298
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.0.1
> Environment: suse 11.3 with CST time
>Reporter: Tao Wang
>
> When I started the HistoryServer to read event logs, the timestamps read 
> were parsed using the local timezone, e.g. "CST" (confirmed via debug).
> But the time-related columns like "Started"/"Completed"/"Last Updated" in 
> the History Server UI use "GMT" time, which is 8 hours earlier than "CST".
> {quote}
> App IDApp NameStarted Completed   DurationSpark 
> User  Last UpdatedEvent Log
> local-1478225166651   Spark shell 2016-11-04 02:06:06 2016-11-07 
> 01:33:30 71.5 h  root2016-11-07 01:33:30
> {quote}
> I've checked the REST api and found the result like:
> {color:red}
> [ {
>   "id" : "local-1478225166651",
>   "name" : "Spark shell",
>   "attempts" : [ {
> "startTime" : "2016-11-04T02:06:06.020GMT",  
> "endTime" : "2016-11-07T01:33:30.265GMT",  
> "lastUpdated" : "2016-11-07T01:33:30.000GMT",
> "duration" : 257244245,
> "sparkUser" : "root",
> "completed" : true,
> "lastUpdatedEpoch" : 147848241,
> "endTimeEpoch" : 1478482410265,
> "startTimeEpoch" : 1478225166020
>   } ]
> }, {
>   "id" : "local-1478224925869",
>   "name" : "Spark Pi",
>   "attempts" : [ {
> "startTime" : "2016-11-04T02:02:02.133GMT",
> "endTime" : "2016-11-04T02:02:07.468GMT",
> "lastUpdated" : "2016-11-04T02:02:07.000GMT",
> "duration" : 5335,
> "sparkUser" : "root",
> "completed" : true,
> ...
> {color}
> So maybe the change happened in transferring between server and browser? I 
> have no idea where to go from this point.
> Hope guys can offer some help, or just fix it if it's easy? :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18298) HistoryServer use GMT time all time

2016-11-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646209#comment-15646209
 ] 

Apache Spark commented on SPARK-18298:
--

User 'windpiger' has created a pull request for this issue:
https://github.com/apache/spark/pull/15803

> HistoryServer use GMT time all time
> ---
>
> Key: SPARK-18298
> URL: https://issues.apache.org/jira/browse/SPARK-18298
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.0.1
> Environment: suse 11.3 with CST time
>Reporter: Tao Wang
>
> When I started the HistoryServer to read event logs, the timestamps read 
> were parsed using the local timezone, e.g. "CST" (confirmed via debug).
> But the time-related columns like "Started"/"Completed"/"Last Updated" in 
> the History Server UI use "GMT" time, which is 8 hours earlier than "CST".
> {quote}
> App IDApp NameStarted Completed   DurationSpark 
> User  Last UpdatedEvent Log
> local-1478225166651   Spark shell 2016-11-04 02:06:06 2016-11-07 
> 01:33:30 71.5 h  root2016-11-07 01:33:30
> {quote}
> I've checked the REST api and found the result like:
> {color:red}
> [ {
>   "id" : "local-1478225166651",
>   "name" : "Spark shell",
>   "attempts" : [ {
> "startTime" : "2016-11-04T02:06:06.020GMT",  
> "endTime" : "2016-11-07T01:33:30.265GMT",  
> "lastUpdated" : "2016-11-07T01:33:30.000GMT",
> "duration" : 257244245,
> "sparkUser" : "root",
> "completed" : true,
> "lastUpdatedEpoch" : 147848241,
> "endTimeEpoch" : 1478482410265,
> "startTimeEpoch" : 1478225166020
>   } ]
> }, {
>   "id" : "local-1478224925869",
>   "name" : "Spark Pi",
>   "attempts" : [ {
> "startTime" : "2016-11-04T02:02:02.133GMT",
> "endTime" : "2016-11-04T02:02:07.468GMT",
> "lastUpdated" : "2016-11-04T02:02:07.000GMT",
> "duration" : 5335,
> "sparkUser" : "root",
> "completed" : true,
> ...
> {color}
> So maybe the change happened in transferring between server and browser? I 
> have no idea where to go from this point.
> Hope guys can offer some help, or just fix it if it's easy? :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18298) HistoryServer use GMT time all time

2016-11-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18298:


Assignee: Apache Spark

> HistoryServer use GMT time all time
> ---
>
> Key: SPARK-18298
> URL: https://issues.apache.org/jira/browse/SPARK-18298
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.0.1
> Environment: suse 11.3 with CST time
>Reporter: Tao Wang
>Assignee: Apache Spark
>
> When I started the HistoryServer to read event logs, the timestamps read 
> were parsed using the local timezone, e.g. "CST" (confirmed via debug).
> But the time-related columns like "Started"/"Completed"/"Last Updated" in 
> the History Server UI use "GMT" time, which is 8 hours earlier than "CST".
> {quote}
> App IDApp NameStarted Completed   DurationSpark 
> User  Last UpdatedEvent Log
> local-1478225166651   Spark shell 2016-11-04 02:06:06 2016-11-07 
> 01:33:30 71.5 h  root2016-11-07 01:33:30
> {quote}
> I've checked the REST api and found the result like:
> {color:red}
> [ {
>   "id" : "local-1478225166651",
>   "name" : "Spark shell",
>   "attempts" : [ {
> "startTime" : "2016-11-04T02:06:06.020GMT",  
> "endTime" : "2016-11-07T01:33:30.265GMT",  
> "lastUpdated" : "2016-11-07T01:33:30.000GMT",
> "duration" : 257244245,
> "sparkUser" : "root",
> "completed" : true,
> "lastUpdatedEpoch" : 147848241,
> "endTimeEpoch" : 1478482410265,
> "startTimeEpoch" : 1478225166020
>   } ]
> }, {
>   "id" : "local-1478224925869",
>   "name" : "Spark Pi",
>   "attempts" : [ {
> "startTime" : "2016-11-04T02:02:02.133GMT",
> "endTime" : "2016-11-04T02:02:07.468GMT",
> "lastUpdated" : "2016-11-04T02:02:07.000GMT",
> "duration" : 5335,
> "sparkUser" : "root",
> "completed" : true,
> ...
> {color}
> So maybe the change happened in transferring between server and browser? I 
> have no idea where to go from this point.
> Hope guys can offer some help, or just fix it if it's easy? :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18340) Inconsistent error messages in launching scripts and hanging in sparkr script for wrong options

2016-11-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-18340:
-
Description: 
It seems there are some problems with handling wrong options as below:

*{{spark-submit}} script - this one looks fine

{code}
spark-submit --aabbcc
Error: Unrecognized option: --aabbcc

Usage: spark-submit [options]  [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally 
("client") or
  on one of the worker machines inside the cluster 
("cluster")
  (Default: client).
  --class CLASS_NAME  Your application's main class (for Java / Scala 
apps).
  --name NAME A name of your application.
...
{code}


*{{spark-sql}} script - this one looks fine

{code}
spark-sql --aabbcc
Unrecognized option: --aabbcc
usage: hive
 -d,--define 

[jira] [Created] (SPARK-18340) Inconsistent error messages in launching scripts and hanging in sparkr script for wrong options

2016-11-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-18340:


 Summary: Inconsistent error messages in launching scripts and 
hanging in sparkr script for wrong options
 Key: SPARK-18340
 URL: https://issues.apache.org/jira/browse/SPARK-18340
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Spark Submit
Reporter: Hyukjin Kwon
Priority: Minor


It seems there are some problems with handling wrong options as below:

*{{spark-submit}} script - this one looks fine

{code}
spark-submit --aabbcc
Error: Unrecognized option: --aabbcc

Usage: spark-submit [options]  [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally 
("client") or
  on one of the worker machines inside the cluster 
("cluster")
  (Default: client).
  --class CLASS_NAME  Your application's main class (for Java / Scala 
apps).
  --name NAME A name of your application.
...
{code}


*{{spark-sql}} script - this one looks fine

{code}
spark-sql --aabbcc
Unrecognized option: --aabbcc
usage: hive
 -d,--define 

[jira] [Updated] (SPARK-17019) Expose off-heap memory usage in various places

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17019:

Target Version/s: 2.2.0  (was: 2.1.0)

> Expose off-heap memory usage in various places
> --
>
> Key: SPARK-17019
> URL: https://issues.apache.org/jira/browse/SPARK-17019
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>Priority: Minor
>
> With SPARK-13992, Spark supports persisting data into off-heap memory, but 
> off-heap usage is not currently exposed, which makes it inconvenient for users 
> to monitor and profile. This proposes exposing off-heap as well as on-heap 
> memory usage in various places:
> 1. Spark UI's executor page will display both on-heap and off-heap memory 
> usage.
> 2. REST request returns both on-heap and off-heap memory.
> 3. Also these two memory usage can be obtained programmatically from 
> SparkListener.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16317) Add file filtering interface for FileFormat

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16317:

Target Version/s: 2.2.0  (was: 2.1.0)

> Add file filtering interface for FileFormat
> ---
>
> Key: SPARK-16317
> URL: https://issues.apache.org/jira/browse/SPARK-16317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Priority: Minor
>
> {{FileFormat}} data sources like Parquet and Avro (provided by spark-avro) 
> have customized file filtering logic. For example, Parquet needs to filter 
> out summary files, while Avro provides a Hadoop configuration option to 
> filter out all files whose names don't end with ".avro".
> It would be nice to have a general file filtering interface in {{FileFormat}} 
> to handle similar requirements.
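
Purely as an illustration of the kind of hook being proposed (this is not an existing Spark API; the names are made up here), the interface could look something like:

{code}
import org.apache.hadoop.fs.Path

// Hypothetical sketch of the proposed hook -- not an existing Spark API.
trait FileFilteringFormat {
  /** Return false for files that should be skipped during file listing. */
  def isDataFile(path: Path): Boolean = true
}

// e.g. an Avro-style implementation that keeps only ".avro" files.
class AvroLikeFormat extends FileFilteringFormat {
  override def isDataFile(path: Path): Boolean = path.getName.endsWith(".avro")
}
{code}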



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18261) Add statistics to MemorySink for joining

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18261.
-
   Resolution: Fixed
 Assignee: Liwei Lin
Fix Version/s: 2.1.0

> Add statistics to MemorySink for joining 
> -
>
> Key: SPARK-18261
> URL: https://issues.apache.org/jira/browse/SPARK-18261
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Burak Yavuz
>Assignee: Liwei Lin
> Fix For: 2.1.0
>
>
> Right now, there is no way to join the output of a memory sink with any table:
> {code}
> UnsupportedOperationException: LeafNode MemoryPlan must implement statistics
> {code}
> Being able to join snapshots of memory streams with tables would be nice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18086) Regression: Hive variables no longer work in Spark 2.0

2016-11-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18086.
-
   Resolution: Fixed
 Assignee: Ryan Blue
Fix Version/s: 2.1.0

> Regression: Hive variables no longer work in Spark 2.0
> --
>
> Key: SPARK-18086
> URL: https://issues.apache.org/jira/browse/SPARK-18086
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 2.1.0
>
>
> The behavior of variables in the SQL shell has changed from 1.6 to 2.0. 
> Specifically, --hivevar name=value and {{SET hivevar:name=value}} no longer 
> work. Queries that worked correctly in 1.6 will either fail or produce 
> unexpected results in 2.0 so I think this is a regression that should be 
> addressed.
> Hive and Spark 1.6 work like this:
> 1. Command-line args --hiveconf and --hivevar can be used to set session 
> properties. --hiveconf properties are added to the Hadoop Configuration.
> 2. {{SET}} adds a Hive Configuration property, {{SET hivevar:=}} 
> adds a Hive var.
> 3. Hive vars can be substituted into queries by name, and Configuration 
> properties can be substituted using {{hiveconf:name}}.
> In 2.0, hiveconf, sparkconf, and conf variable prefixes are all removed, then 
> the value in SQLConf for the rest of the key is returned. SET adds properties 
> to the session config and (according to [a 
> comment|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala#L28])
>  the Hadoop configuration "during I/O".
> {code:title=Hive and Spark 1.6.1 behavior}
> [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2
> spark-sql> select "${hiveconf:test.conf}";
> 1
> spark-sql> select "${test.conf}";
> ${test.conf}
> spark-sql> select "${hivevar:test.var}";
> 2
> spark-sql> select "${test.var}";
> 2
> spark-sql> set test.set=3;
> SET test.set=3
> spark-sql> select "${test.set}"
> "${test.set}"
> spark-sql> select "${hivevar:test.set}"
> "${hivevar:test.set}"
> spark-sql> select "${hiveconf:test.set}"
> 3
> spark-sql> set hivevar:test.setvar=4;
> SET hivevar:test.setvar=4
> spark-sql> select "${hivevar:test.setvar}";
> 4
> spark-sql> select "${test.setvar}";
> 4
> {code}
> {code:title=Spark 2.0.0 behavior}
> [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2
> spark-sql> select "${hiveconf:test.conf}";
> 1
> spark-sql> select "${test.conf}";
> 1
> spark-sql> select "${hivevar:test.var}";
> ${hivevar:test.var}
> spark-sql> select "${test.var}";
> ${test.var}
> spark-sql> set test.set=3;
> test.set3
> spark-sql> select "${test.set}";
> 3
> spark-sql> set hivevar:test.setvar=4;
> hivevar:test.setvar  4
> spark-sql> select "${hivevar:test.setvar}";
> 4
> spark-sql> select "${test.setvar}";
> ${test.setvar}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming

2016-11-07 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18339:
---

 Summary: Don't push down current_timestamp for filters in 
StructuredStreaming
 Key: SPARK-18339
 URL: https://issues.apache.org/jira/browse/SPARK-18339
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.0.1
Reporter: Burak Yavuz


For the following workflow:
1. I have a column called time which is at minute level precision in a 
Streaming DataFrame
2. I want to perform groupBy time, count
3. Then I want my MemorySink to only have the last 30 minutes of counts and I 
perform this by
{code}
.where('time >= current_timestamp().cast("long") - 30 * 60)
{code}

what happens is that the `filter` gets pushed down before the aggregation, and 
the filter happens on the source data for the aggregation instead of the result 
of the aggregation (where I actually want to filter).
I guess the main issue here is that `current_timestamp` is non-deterministic in 
the streaming context, so the filter shouldn't be pushed down.

Whether this requires us to store the `current_timestamp` for each trigger of the 
streaming job is something to discuss.
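
To spell the workflow out (a sketch only: {{events}} stands for a hypothetical streaming DataFrame with a minute-precision {{time}} column; this restates the setup above, it is not a fix):

{code}
import org.apache.spark.sql.functions._

val counts = events.groupBy($"time").count()

// Intended: keep only the last 30 minutes of *aggregated* counts.
// Observed: the predicate is pushed below the aggregation, so it filters
// the aggregation's input instead of its result.
val lastHalfHour = counts.where($"time" >= current_timestamp().cast("long") - 30 * 60)
{code}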



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18295) Match up to_json to from_json in null safety

2016-11-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-18295.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15792
[https://github.com/apache/spark/pull/15792]

> Match up to_json to from_json in null safety
> 
>
> Key: SPARK-18295
> URL: https://issues.apache.org/jira/browse/SPARK-18295
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> {code}
> scala> val df = Seq(Some(Tuple1(Tuple1(1))), None).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: struct<_1: int>]
> scala> df.show()
> ++
> |   a|
> ++
> | [1]|
> |null|
> ++
> scala> df.select(to_json($"a")).show()
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeFields(JacksonGenerator.scala:138)
>   at 
> org.apache.spark.sql.catalyst.json.JacksonGenerator$$anonfun$write$1.apply$mcV$sp(JacksonGenerator.scala:194)
>   at 
> org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeObject(JacksonGenerator.scala:131)
>   at 
> org.apache.spark.sql.catalyst.json.JacksonGenerator.write(JacksonGenerator.scala:193)
>   at 
> org.apache.spark.sql.catalyst.expressions.StructToJson.eval(jsonExpressions.scala:544)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> {code}
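
A possible interim workaround (a sketch only, not part of the fix): drop the null structs before serializing.

{code}
// Workaround sketch: skip null structs until to_json matches from_json's
// null handling.
df.where($"a".isNotNull).select(to_json($"a")).show()
{code}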



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18318) ML, Graph 2.1 QA: API: New Scala APIs, docs

2016-11-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646011#comment-15646011
 ] 

Yanbo Liang commented on SPARK-18318:
-

I'm interested in contributing this task. Thanks.

> ML, Graph 2.1 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-18318
> URL: https://issues.apache.org/jira/browse/SPARK-18318
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18316) Spark MLlib, GraphX 2.1 QA umbrella

2016-11-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646006#comment-15646006
 ] 

Yanbo Liang commented on SPARK-18316:
-

Typo? Should this say 2.1 rather than 2.0?

> Spark MLlib, GraphX 2.1 QA umbrella
> ---
>
> Key: SPARK-18316
> URL: https://issues.apache.org/jira/browse/SPARK-18316
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX.   *SparkR is separate: [SPARK-18329].*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Check binary API compatibility for Scala/Java
> * Audit new public APIs (from the generated html doc)
> ** Scala
> ** Java compatibility
> ** Python coverage
> * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
> * Performance tests
> * Major new algorithms: MinHash, RandomProjection
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18236) Reduce memory usage of Spark UI and HistoryServer by reducing duplicate objects

2016-11-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-18236.

   Resolution: Fixed
Fix Version/s: 2.2.0

Merged into master (2.2.0).

> Reduce memory usage of Spark UI and HistoryServer by reducing duplicate 
> objects
> ---
>
> Key: SPARK-18236
> URL: https://issues.apache.org/jira/browse/SPARK-18236
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.2.0
>
>
> When profiling heap dumps from the Spark History Server and live Spark web 
> UIs, I found a tremendous amount of memory being wasted on duplicate objects 
> and strings. A few small changes can cut per-task UI memory by half or more.
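
A sketch of the general deduplication idea, with no claim about the mechanics of the actual patch: canonicalize equal strings so that per-task UI structures share a single instance instead of holding thousands of equal copies.

{code}
import java.util.concurrent.ConcurrentHashMap

// Minimal interning sketch: equal strings collapse to one canonical instance.
object StringDedup {
  private val pool = new ConcurrentHashMap[String, String]()
  def intern(s: String): String = {
    val prev = pool.putIfAbsent(s, s)
    if (prev == null) s else prev
  }
}

// e.g. host names or executor IDs repeated across tens of thousands of
// task UI entries would all point at one shared instance:
val host = StringDedup.intern("worker-17.example.com")
{code}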



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18338) ObjectHashAggregateSuite fails under Maven builds

2016-11-07 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18338:
---
Description: 
Test case initialization order under Maven and SBT is different: Maven always 
creates instances of all test cases and then runs them all together.

This fails {{ObjectHashAggregateSuite}} because the randomized test cases there 
register a temporary Hive function right before creating a test case, and that 
function can be cleared while other test cases are being initialized.

In SBT, this is fine since the created test case is executed immediately after 
creating the temporary function. 

To fix this issue, we should put initialization/destruction code into 
{{beforeAll()}} and {{afterAll()}}.


  was:
Test case initialization order under Maven and SBT are different. Maven always 
creates instances of all test cases and then run them altogether.

This fails {{ObjectHashAggregateSuite}} because the randomized test cases their 
registers a temporary Hive function right before creating a test case, and can 
be cleared while initializing other successive test cases.

In SBT, this is fine since the created test case is executed immediately after 
creating the temporary function. 

To fix this issue, we should put initialization/destruction code into 
{{beforeAll()}} and {{afterAll()}}.



> ObjectHashAggregateSuite fails under Maven builds
> -
>
> Key: SPARK-18338
> URL: https://issues.apache.org/jira/browse/SPARK-18338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: flaky-test
>
> Test case initialization order under Maven and SBT is different: Maven 
> always creates instances of all test cases and then runs them all together.
> This fails {{ObjectHashAggregateSuite}} because the randomized test cases 
> there register a temporary Hive function right before creating a test case, 
> and that function can be cleared while other test cases are being initialized.
> In SBT, this is fine since the created test case is executed immediately 
> after creating the temporary function. 
> To fix this issue, we should put initialization/destruction code into 
> {{beforeAll()}} and {{afterAll()}}.
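
A sketch of the suggested structure, assuming a ScalaTest suite; the suite name and the commented-out registration calls are placeholders, not the real test code:

{code}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Hypothetical sketch: register the temporary Hive function once for the
// whole suite instead of right before each randomized test case.
class ObjectHashAggregateSuiteSketch extends FunSuite with BeforeAndAfterAll {
  override def beforeAll(): Unit = {
    super.beforeAll()
    // spark.sql("CREATE TEMPORARY FUNCTION hive_max AS '...'")  // placeholder
  }

  override def afterAll(): Unit = {
    try {
      // spark.sql("DROP TEMPORARY FUNCTION IF EXISTS hive_max")  // placeholder
    } finally {
      super.afterAll()
    }
  }

  test("randomized aggregation test") {
    // test body that relies on the function registered in beforeAll()
  }
}
{code}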



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18338) ObjectHashAggregateSuite fails under Maven builds

2016-11-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645865#comment-15645865
 ] 

Apache Spark commented on SPARK-18338:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15802

> ObjectHashAggregateSuite fails under Maven builds
> -
>
> Key: SPARK-18338
> URL: https://issues.apache.org/jira/browse/SPARK-18338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: flaky-test
>
> Test case initialization order under Maven and SBT is different: Maven 
> always creates instances of all test cases and then runs them all together.
> This fails {{ObjectHashAggregateSuite}} because the randomized test cases 
> there register a temporary Hive function right before creating a test case, 
> and that function can be cleared while other test cases are being initialized.
> In SBT, this is fine since the created test case is executed immediately 
> after creating the temporary function. 
> To fix this issue, we should put initialization/destruction code into 
> {{beforeAll()}} and {{afterAll()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18338) ObjectHashAggregateSuite fails under Maven builds

2016-11-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18338:


Assignee: Apache Spark  (was: Cheng Lian)

> ObjectHashAggregateSuite fails under Maven builds
> -
>
> Key: SPARK-18338
> URL: https://issues.apache.org/jira/browse/SPARK-18338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>  Labels: flaky-test
>
> Test case initialization order under Maven and SBT is different: Maven 
> always creates instances of all test cases and then runs them all together.
> This fails {{ObjectHashAggregateSuite}} because the randomized test cases 
> there register a temporary Hive function right before creating a test case, 
> and that function can be cleared while other test cases are being initialized.
> In SBT, this is fine since the created test case is executed immediately 
> after creating the temporary function. 
> To fix this issue, we should put initialization/destruction code into 
> {{beforeAll()}} and {{afterAll()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18338) ObjectHashAggregateSuite fails under Maven builds

2016-11-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18338:


Assignee: Cheng Lian  (was: Apache Spark)

> ObjectHashAggregateSuite fails under Maven builds
> -
>
> Key: SPARK-18338
> URL: https://issues.apache.org/jira/browse/SPARK-18338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: flaky-test
>
> Test case initialization order under Maven and SBT is different: Maven 
> always creates instances of all test cases and then runs them all together.
> This fails {{ObjectHashAggregateSuite}} because the randomized test cases 
> there register a temporary Hive function right before creating a test case, 
> and that function can be cleared while other test cases are being initialized.
> In SBT, this is fine since the created test case is executed immediately 
> after creating the temporary function. 
> To fix this issue, we should put initialization/destruction code into 
> {{beforeAll()}} and {{afterAll()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18338) ObjectHashAggregateSuite fails under Maven builds

2016-11-07 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18338:
--

 Summary: ObjectHashAggregateSuite fails under Maven builds
 Key: SPARK-18338
 URL: https://issues.apache.org/jira/browse/SPARK-18338
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Cheng Lian
Assignee: Cheng Lian


Test case initialization order differs between Maven and SBT. Maven always 
creates instances of all test cases first and then runs them all together.

This breaks {{ObjectHashAggregateSuite}} because the randomized test cases each 
register a temporary Hive function right before a test case is created, and that 
function can be cleared while later test cases are being initialized.

In SBT this is fine, since each test case is executed immediately after the 
temporary function is created.

To fix this issue, we should put initialization/destruction code into 
{{beforeAll()}} and {{afterAll()}}.
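
A minimal sketch of the proposed fix, assuming a ScalaTest-style suite; {{registerTestUDF()}} and {{dropTestUDF()}} are hypothetical stand-ins for however the temporary Hive function is managed. The point is only that setup and teardown move into {{beforeAll()}}/{{afterAll()}} instead of running while individual test cases are constructed:

{code}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class ObjectHashAggregateSuiteSketch extends FunSuite with BeforeAndAfterAll {

  // Hypothetical helpers standing in for the temporary Hive function management.
  private def registerTestUDF(): Unit = { /* CREATE TEMPORARY FUNCTION ... */ }
  private def dropTestUDF(): Unit = { /* DROP TEMPORARY FUNCTION ... */ }

  override protected def beforeAll(): Unit = {
    super.beforeAll()
    registerTestUDF() // registered once, before Maven instantiates and runs every case
  }

  override protected def afterAll(): Unit = {
    try dropTestUDF()
    finally super.afterAll()
  }

  test("randomized aggregation case") {
    // the test body relies on the function registered in beforeAll()
  }
}
{code}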




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17490) Optimize SerializeFromObject for primitive array

2016-11-07 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17490.
---
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.1.0

> Optimize SerializeFromObject for primitive array
> 
>
> Key: SPARK-17490
> URL: https://issues.apache.org/jira/browse/SPARK-17490
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.1.0
>
>
> In the logical plan, {{SerializeFromObject}} for an array always uses 
> {{GenericArrayData}} as the destination. {{UnsafeArrayData}} could be used for 
> a primitive array. This is a simple approach to the issues addressed by 
> SPARK-16043.
> Here is a motivating example.
> {code}
> sparkContext.parallelize(Seq(Array(1)), 1).toDS.map(e => e).show
> {code}
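
For reference, a hedged way to see where this matters for the motivating example, assuming an existing {{SparkSession}} named {{spark}}: the optimized logical plan contains the {{SerializeFromObject}} node whose destination ({{GenericArrayData}} vs. {{UnsafeArrayData}}) this ticket is about.

{code}
import spark.implicits._  // assumes a running SparkSession named `spark`

val ds = spark.sparkContext.parallelize(Seq(Array(1)), 1).toDS()
val mapped = ds.map(e => e)

// Inspect the optimized logical plan; SerializeFromObject appears here, and its
// destination for the primitive Int array is what this ticket optimizes.
println(mapped.queryExecution.optimizedPlan.numberedTreeString)
{code}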



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix

2016-11-07 Thread Mark Tygert (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645652#comment-15645652
 ] 

Mark Tygert edited comment on SPARK-8614 at 11/7/16 10:36 PM:
--

This remains a big issue, rendering the results produced by MLlib incorrect for 
most matrix decompositions and matrix-matrix multiplications when using multiple 
executors or workers. [~hl475] of Yale is working to fix the problem, and 
eventually ML for DataFrames will need to incorporate his solutions.


was (Author: tygert):
This remains a big issue, rendering the results produced by MLlib to be 
incorrect for most matrix decompositions and matrix-matrix multiplications when 
using multiple executors or workers. Huamin Li of Yale is working to fix the 
problem, and eventually ML for DataFrames will need to incorporate his 
solutions.

> Row order preservation for operations on MLlib IndexedRowMatrix
> ---
>
> Key: SPARK-8614
> URL: https://issues.apache.org/jira/browse/SPARK-8614
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Jan Luts
>
> In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply indices are 
> dropped before calling the methods from RowMatrix. For example for 
> IndexedRowMatrix.computeSVD:
>val svd = toRowMatrix().computeSVD(k, computeU, rCond)
> and for IndexedRowMatrix.multiply:
>val mat = toRowMatrix().multiply(B).
> After computing these results, they are zipped with the original indices, 
> e.g. for IndexedRowMatrix.computeSVD
>val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
>   IndexedRow(i, v)
>}
> and for IndexedRowMatrix.multiply:
>
>val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) =>
>   IndexedRow(i, v)
>}
> I have observed that for IndexedRowMatrix.computeSVD().U and 
> IndexedRowMatrix.multiply() (which both depend on RowMatrix.multiply), row 
> indices can get mixed up when running Spark jobs with multiple 
> executors/machines: the vectors and indices of the result no longer 
> correspond.
> To me it looks like this is caused by zipping RDDs that have different 
> orderings.
> For IndexedRowMatrix.multiply, I have observed that ordering within 
> partitions is preserved, but that rows seem to get mixed up between 
> partitions. For example, for:
> part1Index1 part1Vector1
> part1Index2 part1Vector2
> part2Index1 part2Vector1
> part2Index2 part2Vector2
> I got:
> part2Index1 part1Vector1
> part2Index2 part1Vector2
> part1Index1 part2Vector1
> part1Index2 part2Vector2
> Another observation is that the mapPartitions call in RowMatrix.multiply:
> val AB = rows.mapPartitions { iter =>
> had a "preservesPartitioning = true" argument in version 1.0, but this is no 
> longer there.
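
A hedged sketch of one way to avoid the cross-RDD zip entirely: keep each row index attached to its vector through the computation, so correctness no longer depends on two RDDs sharing the same ordering. {{multiplyRowByB}} is a hypothetical per-row helper, not an existing API.

{code}
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.mllib.linalg.distributed.IndexedRow
import org.apache.spark.rdd.RDD

// Hypothetical per-row multiply, standing in for the per-row work done by RowMatrix.multiply.
def multiplyRowByB(v: Vector, B: Matrix): Vector = ???

def multiplyPreservingIndices(rows: RDD[IndexedRow], B: Matrix): RDD[IndexedRow] =
  rows.map(r => (r.index, r.vector))            // carry each index with its own vector
      .mapValues(v => multiplyRowByB(v, B))     // per-row computation, no cross-RDD zip
      .map { case (i, v) => IndexedRow(i, v) }  // reassemble with the original index
{code}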



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18337) Memory Sink should be able to recover from checkpoints in Complete OutputMode

2016-11-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645673#comment-15645673
 ] 

Apache Spark commented on SPARK-18337:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15801

> Memory Sink should be able to recover from checkpoints in Complete OutputMode
> -
>
> Key: SPARK-18337
> URL: https://issues.apache.org/jira/browse/SPARK-18337
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Burak Yavuz
>
> Memory sinks are not meant to be fault tolerant, but there are certain cases 
> where it would be nice if they could recover from checkpoints. For example, when 
> you use a scalable StateStore in Structured Streaming (i.e. when you have an 
> aggregation) and you add a filter based on a key or value in your state, 
> it's nice to be able to continue from where you left off after failures.
> The output will ONLY be correct in Complete mode, so we could support just 
> that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18337) Memory Sink should be able to recover from checkpoints in Complete OutputMode

2016-11-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18337:


Assignee: (was: Apache Spark)

> Memory Sink should be able to recover from checkpoints in Complete OutputMode
> -
>
> Key: SPARK-18337
> URL: https://issues.apache.org/jira/browse/SPARK-18337
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Burak Yavuz
>
> Memory sinks are not meant to be fault tolerant, but there are certain cases 
> where it would be nice if they could recover from checkpoints. For example, when 
> you use a scalable StateStore in Structured Streaming (i.e. when you have an 
> aggregation) and you add a filter based on a key or value in your state, 
> it's nice to be able to continue from where you left off after failures.
> The output will ONLY be correct in Complete mode, so we could support just 
> that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18337) Memory Sink should be able to recover from checkpoints in Complete OutputMode

2016-11-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18337:


Assignee: Apache Spark

> Memory Sink should be able to recover from checkpoints in Complete OutputMode
> -
>
> Key: SPARK-18337
> URL: https://issues.apache.org/jira/browse/SPARK-18337
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> Memory sinks are not meant to be fault tolerant, but there are certain cases 
> where it would be nice if they could recover from checkpoints. For example, when 
> you use a scalable StateStore in Structured Streaming (i.e. when you have an 
> aggregation) and you add a filter based on a key or value in your state, 
> it's nice to be able to continue from where you left off after failures.
> The output will ONLY be correct in Complete mode, so we could support just 
> that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18337) Memory Sink should be able to recover from checkpoints in Complete OutputMode

2016-11-07 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18337:
---

 Summary: Memory Sink should be able to recover from checkpoints in 
Complete OutputMode
 Key: SPARK-18337
 URL: https://issues.apache.org/jira/browse/SPARK-18337
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.0.1
Reporter: Burak Yavuz


Memory sinks are not meant to be fault tolerant, but there are certain cases 
where it would be nice if they could recover from checkpoints. For example, when 
you use a scalable StateStore in Structured Streaming (i.e. when you have an 
aggregation) and you add a filter based on a key or value in your state, it's 
nice to be able to continue from where you left off after failures.

The output will ONLY be correct in Complete mode, so we could support just that.
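
A hedged example of what the recoverable case would look like once supported, assuming a streaming aggregation {{aggDF}} and a hypothetical checkpoint path; only Complete mode would yield correct results after a restart:

{code}
// Assumes `aggDF` is a streaming DataFrame with an aggregation, e.g.
//   val aggDF = inputStream.groupBy("key").count()
val query = aggDF.writeStream
  .format("memory")                                     // in-memory table sink
  .queryName("agg_table")                               // readable via spark.table("agg_table")
  .outputMode("complete")                               // full result table on every trigger
  .option("checkpointLocation", "/tmp/agg-checkpoint")  // hypothetical path used for recovery
  .start()
{code}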



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix

2016-11-07 Thread Mark Tygert (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645652#comment-15645652
 ] 

Mark Tygert commented on SPARK-8614:


This remains a big issue, rendering the results produced by MLlib incorrect for 
most matrix decompositions and matrix-matrix multiplications when using multiple 
executors or workers. Huamin Li of Yale is working to fix the problem, and 
eventually ML for DataFrames will need to incorporate his solutions.

> Row order preservation for operations on MLlib IndexedRowMatrix
> ---
>
> Key: SPARK-8614
> URL: https://issues.apache.org/jira/browse/SPARK-8614
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Jan Luts
>
> In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply indices are 
> dropped before calling the methods from RowMatrix. For example for 
> IndexedRowMatrix.computeSVD:
>val svd = toRowMatrix().computeSVD(k, computeU, rCond)
> and for IndexedRowMatrix.multiply:
>val mat = toRowMatrix().multiply(B).
> After computing these results, they are zipped with the original indices, 
> e.g. for IndexedRowMatrix.computeSVD
>val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
>   IndexedRow(i, v)
>}
> and for IndexedRowMatrix.multiply:
>
>val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) =>
>   IndexedRow(i, v)
>}
> I have observed that for IndexedRowMatrix.computeSVD().U and 
> IndexedRowMatrix.multiply() (which both depend on RowMatrix.multiply), row 
> indices can get mixed up when running Spark jobs with multiple 
> executors/machines: the vectors and indices of the result no longer 
> correspond.
> To me it looks like this is caused by zipping RDDs that have different 
> orderings.
> For IndexedRowMatrix.multiply, I have observed that ordering within 
> partitions is preserved, but that rows seem to get mixed up between 
> partitions. For example, for:
> part1Index1 part1Vector1
> part1Index2 part1Vector2
> part2Index1 part2Vector1
> part2Index2 part2Vector2
> I got:
> part2Index1 part1Vector1
> part2Index2 part1Vector2
> part1Index1 part2Vector1
> part1Index2 part2Vector2
> Another observation is that the mapPartitions call in RowMatrix.multiply:
> val AB = rows.mapPartitions { iter =>
> had a "preservesPartitioning = true" argument in version 1.0, but this is no 
> longer there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17993) Spark prints an avalanche of warning messages from Parquet when reading parquet files written by older versions of Parquet-mr

2016-11-07 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17993:
---
Summary: Spark prints an avalanche of warning messages from Parquet when 
reading parquet files written by older versions of Parquet-mr  (was: Spark 
spews a slew of harmless but annoying warning messages from Parquet when 
reading parquet files written by older versions of Parquet-mr)

> Spark prints an avalanche of warning messages from Parquet when reading 
> parquet files written by older versions of Parquet-mr
> -
>
> Key: SPARK-17993
> URL: https://issues.apache.org/jira/browse/SPARK-17993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Allman
>
> It looks like https://github.com/apache/spark/pull/14690 broke Parquet log 
> output redirection. After that patch, when querying parquet files written by 
> Parquet-mr 1.6.0, Spark prints a torrent of (harmless) warning messages from 
> the Parquet reader:
> {code}
> Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: 
> Ignoring statistics because created_by could not be parsed (see PARQUET-251): 
> parquet-mr version 1.6.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
>   at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
>   at 
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
>   at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This only happens during execution, not planning, and it doesn't matter what 
> log level the {{SparkContext}} is set to.
> This is a regression I noted as something we needed to fix as a follow-up to 
> PR 14690. I feel responsible, so I'm going to expedite a fix for it. I 
> suspect that PR broke Spark's Parquet log output redirection. That's the 
> premise I'm going by.
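
While the redirection fix is pending, one hedged stopgap is to raise the java.util.logging level for the Parquet packages (the warning format above is JUL's, so log4j and {{SparkContext}} log levels don't reach it). This silences the messages rather than redirecting them, and it needs to run on the executors as well, since the warnings are emitted during execution:

{code}
import java.util.logging.{Level, Logger}

// Keep strong references so the JUL loggers are not garbage collected.
val parquetLoggers = Seq(
  Logger.getLogger("org.apache.parquet"),
  Logger.getLogger("org.apache.parquet.CorruptStatistics")
)
parquetLoggers.foreach(_.setLevel(Level.SEVERE))  // drop WARNING-level noise
{code}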



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-18334) MinHash should use binary hash distance

2016-11-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18334:


Assignee: (was: Apache Spark)

> MinHash should use binary hash distance
> ---
>
> Key: SPARK-18334
> URL: https://issues.apache.org/jira/browse/SPARK-18334
> Project: Spark
>  Issue Type: Bug
>Reporter: Yun Ni
>Priority: Trivial
>
> MinHash currently uses the same `hashDistance` function as 
> RandomProjection. This does not make sense for MinHash because the Jaccard 
> distance of two sets has nothing to do with the absolute distance between 
> their hash bucket indices.
> This bug could affect the accuracy of multi-probe NN search for MinHash.
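
A hedged sketch of the distinction, not the actual patch; how the final implementation aggregates over hash entries (min, mean, or otherwise) is left to the pull request:

{code}
// For MinHash, two inputs are "close" when hash values collide, so the distance
// should be an indicator of mismatch rather than |a - b|.
def binaryHashDistance(x: Seq[Double], y: Seq[Double]): Double =
  x.zip(y).map { case (a, b) => if (a == b) 0.0 else 1.0 }.min

// For contrast, an absolute-difference distance (reasonable for RandomProjection
// buckets) would treat bucket indices 3 and 4 as "closer" than 3 and 40, which is
// meaningless for MinHash bucket indices.
def absoluteHashDistance(x: Seq[Double], y: Seq[Double]): Double =
  x.zip(y).map { case (a, b) => math.abs(a - b) }.min
{code}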



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18334) MinHash should use binary hash distance

2016-11-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645540#comment-15645540
 ] 

Apache Spark commented on SPARK-18334:
--

User 'Yunni' has created a pull request for this issue:
https://github.com/apache/spark/pull/15800

> MinHash should use binary hash distance
> ---
>
> Key: SPARK-18334
> URL: https://issues.apache.org/jira/browse/SPARK-18334
> Project: Spark
>  Issue Type: Bug
>Reporter: Yun Ni
>Priority: Trivial
>
> MinHash currently uses the same `hashDistance` function as 
> RandomProjection. This does not make sense for MinHash because the Jaccard 
> distance of two sets has nothing to do with the absolute distance between 
> their hash bucket indices.
> This bug could affect the accuracy of multi-probe NN search for MinHash.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18334) MinHash should use binary hash distance

2016-11-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18334:


Assignee: Apache Spark

> MinHash should use binary hash distance
> ---
>
> Key: SPARK-18334
> URL: https://issues.apache.org/jira/browse/SPARK-18334
> Project: Spark
>  Issue Type: Bug
>Reporter: Yun Ni
>Assignee: Apache Spark
>Priority: Trivial
>
> MinHash currently uses the same `hashDistance` function as 
> RandomProjection. This does not make sense for MinHash because the Jaccard 
> distance of two sets has nothing to do with the absolute distance between 
> their hash bucket indices.
> This bug could affect the accuracy of multi-probe NN search for MinHash.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17993) Spark spews a slew of harmless but annoying warning messages from Parquet when reading parquet files written by older versions of Parquet-mr

2016-11-07 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645495#comment-15645495
 ] 

Michael Allman commented on SPARK-17993:


Thank you for your input, Keith. I agree this is a major issue, and I'm trying 
to get this resolved for 2.1.

> Spark spews a slew of harmless but annoying warning messages from Parquet 
> when reading parquet files written by older versions of Parquet-mr
> 
>
> Key: SPARK-17993
> URL: https://issues.apache.org/jira/browse/SPARK-17993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Allman
>
> It looks like https://github.com/apache/spark/pull/14690 broke Parquet log 
> output redirection. After that patch, when querying parquet files written by 
> Parquet-mr 1.6.0, Spark prints a torrent of (harmless) warning messages from 
> the Parquet reader:
> {code}
> Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: 
> Ignoring statistics because created_by could not be parsed (see PARQUET-251): 
> parquet-mr version 1.6.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
>   at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
>   at 
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
>   at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This only happens during execution, not planning, and it doesn't matter what 
> log level the {{SparkContext}} is set to.
> This is a regression I noted as something we needed to fix as a follow-up to 
> PR 14690. I feel responsible, so I'm going to expedite a fix for it. I 
> suspect that PR broke Spark's Parquet log output redirection. That's the 
> premise I'm going by.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
