[jira] [Updated] (SPARK-17386) Default polling and trigger intervals cause excessive RPC calls

2016-09-07 Thread Frederick Reiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederick Reiss updated SPARK-17386:

Description: 
The default trigger interval for a Structured Streaming query is 
{{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
When the trigger is set to this default value, the scheduler in 
{{StreamExecution}} will sit in a loop calling {{getOffset()}} every 10 msec 
(the default value of STREAMING_POLLING_DELAY) on every {{Source}} until new 
data arrives.

In test cases, where most of the sources are {{MemoryStream}} or 
{{TextSocketSource}}, this rapid polling leads to excessive CPU usage.

In a production environment, this overhead could disrupt critical 
infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} or 
the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
{{FileStreamSource}} performs a directory listing of an HDFS directory. If no 
data has arrived, Spark will list the directory's contents up to 100 times per 
second. This overhead could disrupt service to other systems using HDFS, 
including Spark itself. A similar situation will exist with the Kafka source, 
the {{getOffset()}} method of which will presumably call Kafka's 
{{Consumer.poll()}} method.
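
For comparison, a minimal sketch of a query that sets an explicit processing-time trigger instead of the default {{ProcessingTime(0)}}, so the scheduler checks each {{Source}} for new offsets roughly once per interval while the stream is idle rather than every 10 msec (Spark 2.0 structured streaming API; the socket source and console sink below are only illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.ProcessingTime

object TriggerIntervalExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("trigger-interval-example").getOrCreate()

    // Illustrative source; any streaming Source sees the same polling behavior.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // An explicit 10-second trigger replaces the default ProcessingTime(0), so
    // getOffset() is invoked roughly once per interval instead of every 10 msec.
    val query = lines.writeStream
      .format("console")
      .trigger(ProcessingTime("10 seconds"))
      .start()

    query.awaitTermination()
  }
}
{code}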

  was:
The default trigger interval for a Structured Streaming query is 
{{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
When the trigger is set to this default value, the scheduler in 
{{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 10 
msec (the default value of STREAMING_POLLING_DELAY) on every {{Source}} until 
new data arrives.

In test cases, where most of the sources are {{MemoryStream}} or 
{{TextSocketSource}}, this spinning leads to excessive CPU usage.

In a production environment, this spinning could take down critical 
infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} or 
the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
{{FileStreamSource}} performs a directory listing of an HDFS directory. If no 
data has arrived, Spark will list the directory's contents up to 100 times per 
second. This overhead could disrupt service to other systems using HDFS, 
including Spark itself. A similar situation will exist with the Kafka source, 
the {{getOffset()}} method of which will presumably call Kafka's 
{{Consumer.poll()}} method.


> Default polling and trigger intervals cause excessive RPC calls
> ---
>
> Key: SPARK-17386
> URL: https://issues.apache.org/jira/browse/SPARK-17386
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Frederick Reiss
>Priority: Minor
>
> The default trigger interval for a Structured Streaming query is 
> {{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
> When the trigger is set to this default value, the scheduler in 
> {{StreamExecution}} will sit in a loop calling {{getOffset()}} every 10 msec 
> (the default value of STREAMING_POLLING_DELAY) on every {{Source}} until new 
> data arrives.
> In test cases, where most of the sources are {{MemoryStream}} or 
> {{TextSocketSource}}, this rapid polling leads to excessive CPU usage.
> In a production environment, this overhead could disrupt critical 
> infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} 
> or the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
> {{FileStreamSource}} performs a directory listing of an HDFS directory. If no 
> data has arrived, Spark will list the directory's contents up to 100 times 
> per second. This overhead could disrupt service to other systems using HDFS, 
> including Spark itself. A similar situation will exist with the Kafka source, 
> the {{getOffset()}} method of which will presumably call Kafka's 
> {{Consumer.poll()}} method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17386) Default polling and trigger intervals cause excessive RPC calls

2016-09-07 Thread Frederick Reiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederick Reiss updated SPARK-17386:

Description: 
The default trigger interval for a Structured Streaming query is 
{{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
When the trigger is set to this default value, the scheduler in 
{{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 10 
msec (the default value of STREAMING_POLLING_DELAY) on every {{Source}} until 
new data arrives.

In test cases, where most of the sources are {{MemoryStream}} or 
{{TextSocketSource}}, this spinning leads to excessive CPU usage.

In a production environment, this spinning could take down critical 
infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} or 
the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
{{FileStreamSource}} performs a directory listing of an HDFS directory. If no 
data has arrived, Spark will list the directory's contents up to 100 times per 
second. This overhead could disrupt service to other systems using HDFS, 
including Spark itself. A similar situation will exist with the Kafka source, 
the {{getOffset()}} method of which will presumably call Kafka's 
{{Consumer.poll()}} method.

  was:
The default trigger interval for a Structured Streaming query is 
{{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
When the trigger is set to this default value, the scheduler in 
{{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 10 
msec (the default value of STREAMING_POLLING_DELAY) on every {{Source}} until 
new data arrives.

In test cases, where most of the sources are {{MemoryStream}} or 
{{TextSocketSource}}, this spinning leads to excessive CPU usage.

In a production environment, this spinning could take down critical 
infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} or 
the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
{{FileStreamSource}} performs a directory listing of an HDFS directory. If the 
scheduler calls {{FileStreamSource.getOffset()}} in a tight loop, Spark will 
list an HDFS directory's contents up to 100 times per second. This overhead 
could disrupt service to other systems using HDFS, including Spark itself. A 
similar situation will exist with the Kafka source, the {{getOffset()}} method 
of which will presumably call Kafka's {{Consumer.poll()}} method.


> Default polling and trigger intervals cause excessive RPC calls
> ---
>
> Key: SPARK-17386
> URL: https://issues.apache.org/jira/browse/SPARK-17386
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Frederick Reiss
>Priority: Minor
>
> The default trigger interval for a Structured Streaming query is 
> {{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
> When the trigger is set to this default value, the scheduler in 
> {{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 
> 10 msec (the default value of STREAMING_POLLING_DELAY) on every {{Source}} 
> until new data arrives.
> In test cases, where most of the sources are {{MemoryStream}} or 
> {{TextSocketSource}}, this spinning leads to excessive CPU usage.
> In a production environment, this spinning could take down critical 
> infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} 
> or the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
> {{FileStreamSource}} performs a directory listing of an HDFS directory. If no 
> data has arrived, Spark will list the directory's contents up to 100 times 
> per second. This overhead could disrupt service to other systems using HDFS, 
> including Spark itself. A similar situation will exist with the Kafka source, 
> the {{getOffset()}} method of which will presumably call Kafka's 
> {{Consumer.poll()}} method.






[jira] [Updated] (SPARK-17386) Default polling and trigger intervals cause excessive RPC calls

2016-09-07 Thread Frederick Reiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederick Reiss updated SPARK-17386:

Description: 
The default trigger interval for a Structured Streaming query is 
{{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
When the trigger is set to this default value, the scheduler in 
{{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 10 
msec (the default value of STREAMING_POLLING_DELAY) on every {{Source}} until 
new data arrives.

In test cases, where most of the sources are {{MemoryStream}} or 
{{TextSocketSource}}, this spinning leads to excessive CPU usage.

In a production environment, this spinning could take down critical 
infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} or 
the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
{{FileStreamSource}} performs a directory listing of an HDFS directory. If the 
scheduler calls {{FileStreamSource.getOffset()}} in a tight loop, Spark will 
list an HDFS directory's contents up to 100 times per second. This overhead 
could disrupt service to other systems using HDFS, including Spark itself. A 
similar situation will exist with the Kafka source, the {{getOffset()}} method 
of which will presumably call Kafka's {{Consumer.poll()}} method.

  was:
The default trigger interval for a Structured Streaming query is 
{{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
When the trigger is set to this default value, the scheduler in 
{{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 10 
msec (the default value of STREAMING_POLLING_DELAY) on every {{Source}} until new data arrives.

In test cases, where most of the sources are {{MemoryStream}} or 
{{TextSocketSource}}, this spinning leads to excessive CPU usage.

In a production environment, this spinning could take down critical 
infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} or 
the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
{{FileStreamSource}} performs a directory listing of an HDFS directory. If the 
scheduler calls {{FileStreamSource.getOffset()}} in a tight loop, Spark will 
make hundreds of RPC calls per second to the HDFS NameNode. This overhead could 
disrupt service to other systems using HDFS, including Spark itself. A similar 
situation will exist with the Kafka source, the {{getOffset()}} method of which 
will presumably call Kafka's {{Consumer.poll()}} method.


> Default polling and trigger intervals cause excessive RPC calls
> ---
>
> Key: SPARK-17386
> URL: https://issues.apache.org/jira/browse/SPARK-17386
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Frederick Reiss
>Priority: Minor
>
> The default trigger interval for a Structured Streaming query is 
> {{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
> When the trigger is set to this default value, the scheduler in 
> {{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 
> 10 msec (the default value of STREAMING_POLLING_DELAY) on every {{Source}} 
> until new data arrives.
> In test cases, where most of the sources are {{MemoryStream}} or 
> {{TextSocketSource}}, this spinning leads to excessive CPU usage.
> In a production environment, this spinning could take down critical 
> infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} 
> or the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
> {{FileStreamSource}} performs a directory listing of an HDFS directory. If 
> the scheduler calls {{FileStreamSource.getOffset()}} in a tight loop, Spark 
> will list an HDFS directory's contents up to 100 times per second. This 
> overhead could disrupt service to other systems using HDFS, including Spark 
> itself. A similar situation will exist with the Kafka source, the 
> {{getOffset()}} method of which will presumably call Kafka's 
> {{Consumer.poll()}} method.






[jira] [Updated] (SPARK-17386) Default polling and trigger intervals cause excessive RPC calls

2016-09-07 Thread Frederick Reiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederick Reiss updated SPARK-17386:

Description: 
The default trigger interval for a Structured Streaming query is 
{{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
When the trigger is set to this default value, the scheduler in 
{{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 10 
msec (the default value of STREAMING_POLLING_DELAY) on every {{Source}} until new data arrives.

In test cases, where most of the sources are {{MemoryStream}} or 
{{TextSocketSource}}, this spinning leads to excessive CPU usage.

In a production environment, this spinning could take down critical 
infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} or 
the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
{{FileStreamSource}} performs a directory listing of an HDFS directory. If the 
scheduler calls {{FileStreamSource.getOffset()}} in a tight loop, Spark will 
make hundreds of RPC calls per second to the HDFS NameNode. This overhead could 
disrupt service to other systems using HDFS, including Spark itself. A similar 
situation will exist with the Kafka source, the {{getOffset()}} method of which 
will presumably call Kafka's {{Consumer.poll()}} method.

  was:
The default trigger interval for a Structured Streaming query is 
{{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
When the trigger is set to this default value, the scheduler in 
{{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 10 
msec on every {{Source}} until new data arrives.

In test cases, where most of the sources are {{MemoryStream}} or 
{{TextSocketSource}}, this spinning leads to excessive CPU usage.

In a production environment, this spinning could take down critical 
infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} or 
the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
{{FileStreamSource}} performs a directory listing of an HDFS directory. If the 
scheduler calls {{FileStreamSource.getOffset()}} in a tight loop, Spark will 
make hundreds of RPC calls per second to the HDFS NameNode. This overhead could 
disrupt service to other systems using HDFS, including Spark itself. A similar 
situation will exist with the Kafka source, the {{getOffset()}} method of which 
will presumably call Kafka's {{Consumer.poll()}} method.


> Default polling and trigger intervals cause excessive RPC calls
> ---
>
> Key: SPARK-17386
> URL: https://issues.apache.org/jira/browse/SPARK-17386
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Frederick Reiss
>Priority: Minor
>
> The default trigger interval for a Structured Streaming query is 
> {{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
> When the trigger is set to this default value, the scheduler in 
> {{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 
> 10 msec (the default value of STREAMING_POLLING_DELAY) on every {{Source}} until new data arrives.
> In test cases, where most of the sources are {{MemoryStream}} or 
> {{TextSocketSource}}, this spinning leads to excessive CPU usage.
> In a production environment, this spinning could take down critical 
> infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} 
> or the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
> {{FileStreamSource}} performs a directory listing of an HDFS directory. If 
> the scheduler calls {{FileStreamSource.getOffset()}} in a tight loop, Spark 
> will make hundreds of RPC calls per second to the HDFS NameNode. This 
> overhead could disrupt service to other systems using HDFS, including Spark 
> itself. A similar situation will exist with the Kafka source, the 
> {{getOffset()}} method of which will presumably call Kafka's 
> {{Consumer.poll()}} method.






[jira] [Updated] (SPARK-17386) Default polling and trigger intervals cause excessive RPC calls

2016-09-07 Thread Frederick Reiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederick Reiss updated SPARK-17386:

Priority: Minor  (was: Major)

> Default polling and trigger intervals cause excessive RPC calls
> ---
>
> Key: SPARK-17386
> URL: https://issues.apache.org/jira/browse/SPARK-17386
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Frederick Reiss
>Priority: Minor
>
> The default trigger interval for a Structured Streaming query is 
> {{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
> When the trigger is set to this default value, the scheduler in 
> {{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 
> 10 msec on every {{Source}} until new data arrives.
> In test cases, where most of the sources are {{MemoryStream}} or 
> {{TextSocketSource}}, this spinning leads to excessive CPU usage.
> In a production environment, this spinning could take down critical 
> infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} 
> or the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
> {{FileStreamSource}} performs a directory listing of an HDFS directory. If 
> the scheduler calls {{FileStreamSource.getOffset()}} in a tight loop, Spark 
> will make hundreds of RPC calls per second to the HDFS NameNode. This 
> overhead could disrupt service to other systems using HDFS, including Spark 
> itself. A similar situation will exist with the Kafka source, the 
> {{getOffset()}} method of which will presumably call Kafka's 
> {{Consumer.poll()}} method.






[jira] [Created] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-07 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-17445:
-

 Summary: Reference an ASF page as the main place to find 
third-party packages
 Key: SPARK-17445
 URL: https://issues.apache.org/jira/browse/SPARK-17445
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia


Some comments and docs like 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
 say to go to spark-packages.org, but since this is a package index maintained 
by a third party, it would be better to reference an ASF page that we can keep 
updated and own the URL for.






[jira] [Commented] (SPARK-17429) spark sql length(1) return error

2016-09-07 Thread cen yuhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472801#comment-15472801
 ] 

cen yuhai commented on SPARK-17429:
---

ok, I will make a PR later

> spark sql length(1) return error
> 
>
> Key: SPARK-17429
> URL: https://issues.apache.org/jira/browse/SPARK-17429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>
> select length(11);
> select length(2.0);
> These SQL statements will return errors, but Hive accepts them.
> Error in query: cannot resolve 'length(11)' due to data type mismatch: 
> argument 1 requires (string or binary) type, however, '11' is of int type.; 
> line 1 pos 14
> Error in query: cannot resolve 'length(2.0)' due to data type mismatch: 
> argument 1 requires (string or binary) type, however, '2.0' is of double 
> type.; line 1 pos 14
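
Until such a change lands, one possible workaround is an explicit cast to string, which both engines accept; a minimal Scala sketch (the SparkSession setup and column aliases are only illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

object LengthWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("length-workaround").getOrCreate()

    // Explicit casts sidestep the type-mismatch error raised by length(11),
    // mimicking the implicit conversion Hive applies.
    spark.sql(
      "SELECT length(CAST(11 AS STRING)) AS len_int, " +
      "length(CAST(2.0 AS STRING)) AS len_double").show()
  }
}
{code}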






[jira] [Created] (SPARK-17444) spark memory allocation makes workers non responsive

2016-09-07 Thread Ofer Eliassaf (JIRA)
Ofer Eliassaf created SPARK-17444:
-

 Summary: spark memory allocation makes workers non responsive
 Key: SPARK-17444
 URL: https://issues.apache.org/jira/browse/SPARK-17444
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
 Environment: spark standalone
Reporter: Ofer Eliassaf
Priority: Critical


I am running a cluster of 3 slaves and 2 masters with Spark standalone, for a 
total of 12 cores (4 in each machine). The memory allocated to executors and 
workers is 4.5GB, and each machine has a total of 8GB.

Steps to reproduce:
Open pyspark and point it to the masters.

Run the following command multiple times:
sc.parallelize(range(1,5000), 12).count()
After a few runs the Python shell will stop responding.

Then exit the Python shell.

The critical issue is that after this happens the cluster is no longer usable: 
there is no way to submit an application or run any other commands on the 
cluster.


Hope this helps!






[jira] [Updated] (SPARK-17339) Fix SparkR tests on Windows

2016-09-07 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-17339:
--
Fix Version/s: 2.0.1

> Fix SparkR tests on Windows
> ---
>
> Key: SPARK-17339
> URL: https://issues.apache.org/jira/browse/SPARK-17339
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Reporter: Shivaram Venkataraman
>Assignee: Hyukjin Kwon
> Fix For: 2.0.1, 2.1.0
>
>
> A number of SparkR tests are currently failing when run on Windows as discussed 
> in https://github.com/apache/spark/pull/14743
> The list of tests that fail right now is at 
> https://gist.github.com/shivaram/7693df7bd54dc81e2e7d1ce296c41134
> A full log from a build and test on AppVeyor is at 
> https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123






[jira] [Commented] (SPARK-17442) Additional arguments in write.df are not passed to data source

2016-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472704#comment-15472704
 ] 

Apache Spark commented on SPARK-17442:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/15010

> Additional arguments in write.df are not passed to data source
> --
>
> Key: SPARK-17442
> URL: https://issues.apache.org/jira/browse/SPARK-17442
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Priority: Blocker
>
> {{write.df}} passes everything in its arguments to the underlying data source 
> in 1.x, but it does not pass header = "true" in Spark 2.0. For example, the 
> following code snippet produces a header line in older versions of Spark but 
> not in 2.0.
> {code}
> df <- createDataFrame(iris)
> write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header 
> = "true")
> {code}






[jira] [Commented] (SPARK-17339) Fix SparkR tests on Windows

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472681#comment-15472681
 ] 

Hadoop QA commented on SPARK-17339:
---


[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464236#comment-15464236
 ] 

Hyukjin Kwon commented on SPARK-17339:
--

[~sarutak] [~shivaram] Please cc me if any of you submit a PR so that I can run 
the build automation (as it is not merged yet). Otherwise, I can do this if you 
tell me which one is preferred.







> Fix SparkR tests on Windows
> ---
>
> Key: SPARK-17339
> URL: https://issues.apache.org/jira/browse/SPARK-17339
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Reporter: Shivaram Venkataraman
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> A number of SparkR tests are currently failing when run on Windows as discussed 
> in https://github.com/apache/spark/pull/14743
> The list of tests that fail right now is at 
> https://gist.github.com/shivaram/7693df7bd54dc81e2e7d1ce296c41134
> A full log from a build and test on AppVeyor is at 
> https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123






[jira] [Updated] (SPARK-17442) Additional arguments in write.df are not passed to data source

2016-09-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17442:
-
Target Version/s: 2.0.1, 2.1.0  (was: 2.0.1)

> Additional arguments in write.df are not passed to data source
> --
>
> Key: SPARK-17442
> URL: https://issues.apache.org/jira/browse/SPARK-17442
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Priority: Blocker
>
> {{write.df}} passes everything in its arguments to the underlying data source 
> in 1.x, but it does not pass header = "true" in Spark 2.0. For example, the 
> following code snippet produces a header line in older versions of Spark but 
> not in 2.0.
> {code}
> df <- createDataFrame(iris)
> write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header 
> = "true")
> {code}






[jira] [Commented] (SPARK-17443) SparkLauncher should allow stoppingApplication and need not rely on SparkSubmit binary

2016-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472616#comment-15472616
 ] 

Apache Spark commented on SPARK-17443:
--

User 'kishorvpatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/15009

> SparkLauncher should allow stoppingApplication and need not rely on 
> SparkSubmit binary
> --
>
> Key: SPARK-17443
> URL: https://issues.apache.org/jira/browse/SPARK-17443
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Kishor Patil
>
> Oozie wants SparkLauncher to support the following things:
> - When oozie launcher is killed, the launched Spark application also gets 
> killed
> - Spark Launcher to not have to rely on spark-submit bash script
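
For context, a minimal sketch of how the existing launcher API is used today (the master, jar path, and main class below are placeholders, and SPARK_HOME is assumed to be set); {{startApplication()}} currently still forks the spark-submit script, which is what this issue asks to change, and {{SparkAppHandle.stop()}} / {{kill()}} are the natural hooks for the Oozie launcher's shutdown path:

{code:scala}
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LauncherSketch {
  def main(args: Array[String]): Unit = {
    // Launch the application and keep a handle to it.
    // Note: this still goes through the spark-submit script under the hood.
    val handle: SparkAppHandle = new SparkLauncher()
      .setMaster("yarn")
      .setAppResource("/path/to/app.jar")   // placeholder
      .setMainClass("com.example.Main")     // placeholder
      .startApplication()

    // If this JVM (e.g. the Oozie launcher) goes down, ask the launched Spark
    // application to stop as well; kill() is the harder fallback.
    sys.addShutdownHook {
      handle.stop()
    }
  }
}
{code}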






[jira] [Assigned] (SPARK-17443) SparkLauncher should allow stoppingApplication and need not rely on SparkSubmit binary

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17443:


Assignee: (was: Apache Spark)

> SparkLauncher should allow stoppingApplication and need not rely on 
> SparkSubmit binary
> --
>
> Key: SPARK-17443
> URL: https://issues.apache.org/jira/browse/SPARK-17443
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Kishor Patil
>
> Oozie wants SparkLauncher to support the following things:
> - When oozie launcher is killed, the launched Spark application also gets 
> killed
> - Spark Launcher to not have to rely on spark-submit bash script






[jira] [Assigned] (SPARK-17443) SparkLauncher should allow stoppingApplication and need not rely on SparkSubmit binary

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17443:


Assignee: Apache Spark

> SparkLauncher should allow stoppingApplication and need not rely on 
> SparkSubmit binary
> --
>
> Key: SPARK-17443
> URL: https://issues.apache.org/jira/browse/SPARK-17443
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Kishor Patil
>Assignee: Apache Spark
>
> Oozie wants SparkLauncher to support the following things:
> - When oozie launcher is killed, the launched Spark application also gets 
> killed
> - Spark Launcher to not have to rely on spark-submit bash script






[jira] [Commented] (SPARK-17339) Fix SparkR tests on Windows

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472604#comment-15472604
 ] 

Hadoop QA commented on SPARK-17339:
---


[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464244#comment-15464244
 ] 

Shivaram Venkataraman commented on SPARK-17339:
---

Thanks [~hyukjin.kwon] -- it would be great if you could try the 
`Utils.resolveURI` change as a PR and run that through the build automation 
tool.

Also, the reason I was trying to debug this today is that I feel it would be 
better to make the build green before merging the automation -- otherwise it 
might confuse other contributors, etc.







> Fix SparkR tests on Windows
> ---
>
> Key: SPARK-17339
> URL: https://issues.apache.org/jira/browse/SPARK-17339
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Reporter: Shivaram Venkataraman
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> A number of SparkR tests are currently failing when run on Windows as discussed 
> in https://github.com/apache/spark/pull/14743
> The list of tests that fail right now is at 
> https://gist.github.com/shivaram/7693df7bd54dc81e2e7d1ce296c41134
> A full log from a build and test on AppVeyor is at 
> https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123






[jira] [Created] (SPARK-17443) SparkLauncher should allow stoppingApplication and need not rely on SparkSubmit binary

2016-09-07 Thread Kishor Patil (JIRA)
Kishor Patil created SPARK-17443:


 Summary: SparkLauncher should allow stoppingApplication and need 
not rely on SparkSubmit binary
 Key: SPARK-17443
 URL: https://issues.apache.org/jira/browse/SPARK-17443
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 2.0.0
Reporter: Kishor Patil


Oozie wants SparkLauncher to support the following things:

- When oozie launcher is killed, the launched Spark application also gets killed
- Spark Launcher to not have to rely on spark-submit bash script







[jira] [Commented] (SPARK-17339) Fix SparkR tests on Windows

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472549#comment-15472549
 ] 

Hadoop QA commented on SPARK-17339:
---


[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464247#comment-15464247
 ] 

Kousuke Saruta commented on SPARK-17339:


[~hyukjin.kwon] Go ahead and submit a PR. Thanks!







> Fix SparkR tests on Windows
> ---
>
> Key: SPARK-17339
> URL: https://issues.apache.org/jira/browse/SPARK-17339
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Reporter: Shivaram Venkataraman
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> A number of SparkR tests are currently failing when run on Windows as discussed 
> in https://github.com/apache/spark/pull/14743
> The list of tests that fail right now is at 
> https://gist.github.com/shivaram/7693df7bd54dc81e2e7d1ce296c41134
> A full log from a build and test on AppVeyor is at 
> https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123






[jira] [Commented] (SPARK-17339) Fix SparkR tests on Windows

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472523#comment-15472523
 ] 

Hadoop QA commented on SPARK-17339:
---


[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464250#comment-15464250
 ] 

Hyukjin Kwon commented on SPARK-17339:
--

Yeap, I totally agree. Thank you both! Will submit a PR within today.







> Fix SparkR tests on Windows
> ---
>
> Key: SPARK-17339
> URL: https://issues.apache.org/jira/browse/SPARK-17339
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Reporter: Shivaram Venkataraman
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> A number of SparkR tests are currently failing when run on Windows as discussed 
> in https://github.com/apache/spark/pull/14743
> The list of tests that fail right now is at 
> https://gist.github.com/shivaram/7693df7bd54dc81e2e7d1ce296c41134
> A full log from a build and test on AppVeyor is at 
> https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123






[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472413#comment-15472413
 ] 

Hadoop QA commented on SPARK-17400:
---

Frank Dai created SPARK-17400:
-

 Summary: MinMaxScaler.transform() outputs DenseVector by default, 
which causes poor performance
 Key: SPARK-17400
 URL: https://issues.apache.org/jira/browse/SPARK-17400
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.0.0, 1.6.2, 1.6.1
Reporter: Frank Dai


MinMaxScaler.transform() outputs DenseVector by default, which will cause poor 
performance and consume a lot of memory.

The most important line of code is the following:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195

I suggest that the code should calculate the number of non-zero elements in 
advance; if the number of non-zero elements is less than half of the total 
elements in the matrix, use SparseVector, otherwise use DenseVector.
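
A minimal sketch of that heuristic against the Spark 2.0 {{ml.linalg}} API (the helper here is illustrative, not the actual MinMaxScaler code):

{code:scala}
import org.apache.spark.ml.linalg.{Vector, Vectors}

object CompactVector {
  // Count the non-zero entries first and only materialize a DenseVector
  // when the result is at least half dense.
  def compact(values: Array[Double]): Vector = {
    val nnz = values.count(_ != 0.0)
    if (nnz * 2 < values.length) {
      // Mostly zeros: build a SparseVector from the non-zero entries.
      val indices = values.indices.filter(i => values(i) != 0.0).toArray
      Vectors.sparse(values.length, indices, indices.map(values(_)))
    } else {
      Vectors.dense(values)
    }
  }

  def main(args: Array[String]): Unit = {
    println(compact(Array(0.0, 0.0, 0.0, 5.0)))   // (4,[3],[5.0])
    println(compact(Array(1.0, 2.0, 0.0, 5.0)))   // [1.0,2.0,0.0,5.0]
  }
}
{code}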






> MinMaxScaler.transform() outputs DenseVector by default, which causes poor 
> performance
> --
>
> Key: SPARK-17400
> URL: https://issues.apache.org/jira/browse/SPARK-17400
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Frank Dai
>
> MinMaxScaler.transform() outputs DenseVector by default, which will cause 
> poor performance and consume a lot of memory.
> The most important line of code is the following:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195
> I suggest that the code should calculate the number of non-zero elements in 
> advance; if the number of non-zero elements is less than half of the total 
> elements in the matrix, use SparseVector, otherwise use DenseVector.
> Or we can make it configurable by adding a parameter to 
> MinMaxScaler.transform(), for example MinMaxScaler.transform(isDense: 
> Boolean), so that users can decide whether their output result is dense or 
> sparse.






[jira] [Commented] (SPARK-17423) Support IGNORE NULLS option in Window functions

2016-09-07 Thread Tim Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472394#comment-15472394
 ] 

Tim Chan commented on SPARK-17423:
--

[~hvanhovell]

I was able to rewrite this Redshift fragment: 

{code:sql}
DATEDIFF(day,
 LAG(CASE WHEN SUM(activities.activity_one, activities.activity_two) > 
0 THEN activities.date END)
   IGNORE NULLS OVER (PARTITION BY activities.user_id ORDER BY 
activities.date),
 activities.date
) AS days_since_last_activity
{code}

as this Spark SQL fragment: 

{code:sql}
DATEDIFF(activities.date,
 LAST(CASE WHEN SUM(activities.activity_one, activities.activity_two) > 
0 THEN activities.date END, true) OVER (PARTITION BY activities.user_id ORDER 
BY activities.date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) 
AS days_since_last_activity
{code}

Thanks for pointing me in the right direction. 



> Support IGNORE NULLS option in Window functions
> ---
>
> Key: SPARK-17423
> URL: https://issues.apache.org/jira/browse/SPARK-17423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tim Chan
>Priority: Minor
>
> http://stackoverflow.com/questions/24338119/is-it-possible-to-ignore-null-values-when-using-lag-and-lead-functions-in-sq






[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472381#comment-15472381
 ] 

Hadoop QA commented on SPARK-17400:
---


 [ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Dai updated SPARK-17400:
--
Description: 
MinMaxScaler.transform() outputs DenseVector by default, which will cause poor 
performance and consume a lot of memory.

The most important line of code is the following:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195

I suggest that the code should calculate the number of non-zero elements in 
advance; if the number of non-zero elements is less than half of the total 
elements in the matrix, use SparseVector, otherwise use DenseVector.

Or we can make it configurable by adding a parameter to 

  was:
MinMaxScaler.transform() outputs DenseVector by default, which will cause poor 
performance and consume a lot of memory.

The most important line of code is the following:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195

I suggest that the code should calculate the number of non-zero elements in 
advance, if the number of non-zero elements is less than half of the total 
elements in the matrix, use SparseVector, otherwise use DenseVector








> MinMaxScaler.transform() outputs DenseVector by default, which causes poor 
> performance
> --
>
> Key: SPARK-17400
> URL: https://issues.apache.org/jira/browse/SPARK-17400
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Frank Dai
>
> MinMaxScaler.transform() outputs DenseVector by default, which will cause 
> poor performance and consume a lot of memory.
> The most important line of code is the following:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195
> I suggest that the code should calculate the number of non-zero elements in 
> advance; if the number of non-zero elements is less than half of the total 
> elements in the matrix, use SparseVector, otherwise use DenseVector.
> Or we can make it configurable by adding a parameter to 
> MinMaxScaler.transform(), for example MinMaxScaler.transform(isDense: 
> Boolean), so that users can decide whether their output result is dense or 
> sparse.






[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472382#comment-15472382
 ] 

Hadoop QA commented on SPARK-17400:
---


 [ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Dai updated SPARK-17400:
--
Description: 
MinMaxScaler.transform() outputs DenseVector by default, which will cause poor 
performance and consume a lot of memory.

The most important line of code is the following:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195

I suggest that the code should calculate the number of non-zero elements in 
advance, if the number of non-zero elements is less than half of the total 
elements in the matrix, use SparseVector, otherwise use DenseVector

Or we can make it configurable by adding  a parameter to 

  was:
MinMaxScaler.transform() outputs DenseVector by default, which will cause poor 
performance and consume a lot of memory.

The most important line of code is the following:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195

I suggest that the code should calculate the number of non-zero elements in 
advance, if the number of non-zero elements is less than half of the total 
elements in the matrix, use SparseVector, otherwise use DenseVector





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



> MinMaxScaler.transform() outputs DenseVector by default, which causes poor 
> performance
> --
>
> Key: SPARK-17400
> URL: https://issues.apache.org/jira/browse/SPARK-17400
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Frank Dai
>
> MinMaxScaler.transform() outputs DenseVector by default, which causes poor 
> performance and consumes a lot of memory.
> The most important line of code is the following:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195
> I suggest that the code calculate the number of non-zero elements in advance: 
> if the number of non-zero elements is less than half of the total elements in 
> the vector, use SparseVector; otherwise use DenseVector.
> Alternatively, we could make it configurable by adding a parameter to 
> MinMaxScaler.transform(), for example MinMaxScaler.transform(isDense: 
> Boolean), so that users can decide whether their output is dense or sparse.






[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472372#comment-15472372
 ] 

Hadoop QA commented on SPARK-17400:
---


 [ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Dai updated SPARK-17400:
--
Description: 
MinMaxScaler.transform() outputs DenseVector by default, which causes poor 
performance and consumes a lot of memory.

The most important line of code is the following:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195

I suggest that the code calculate the number of non-zero elements in advance: 
if the number of non-zero elements is less than half of the total elements in 
the vector, use SparseVector; otherwise use DenseVector.

Alternatively, we could make it configurable by adding a parameter to 
MinMaxScaler.transform(), for example MinMaxScaler.transform(isDense: Boolean), 
so that users can decide whether their output is dense or sparse.

  was:
MinMaxScaler.transform() outputs DenseVector by default, which will cause poor 
performance and consume a lot of memory.

The most important line of code is the following:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195

I suggest that the code should calculate the number of non-zero elements in 
advance, if the number of non-zero elements is less than half of the total 
elements in the matrix, use SparseVector, otherwise use DenseVector

Or we can make it configurable by adding  a parameter to 








> MinMaxScaler.transform() outputs DenseVector by default, which causes poor 
> performance
> --
>
> Key: SPARK-17400
> URL: https://issues.apache.org/jira/browse/SPARK-17400
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Frank Dai
>
> MinMaxScaler.transform() outputs DenseVector by default, which causes poor 
> performance and consumes a lot of memory.
> The most important line of code is the following:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195
> I suggest that the code calculate the number of non-zero elements in advance: 
> if the number of non-zero elements is less than half of the total elements in 
> the vector, use SparseVector; otherwise use DenseVector.
> Alternatively, we could make it configurable by adding a parameter to 
> MinMaxScaler.transform(), for example MinMaxScaler.transform(isDense: 
> Boolean), so that users can decide whether their output is dense or sparse.






[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance

2016-09-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472371#comment-15472371
 ] 

Hadoop QA commented on SPARK-17400:
---


 [ 
https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Dai updated SPARK-17400:
--
Description: 
MinMaxScaler.transform() outputs DenseVector by default, which causes poor 
performance and consumes a lot of memory.

The most important line of code is the following:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195

I suggest that the code calculate the number of non-zero elements in advance: 
if the number of non-zero elements is less than half of the total elements in 
the vector, use SparseVector; otherwise use DenseVector.

Alternatively, we could make it configurable by adding a parameter to 
MinMaxScaler.transform(), for example MinMaxScaler.transform(isDense: Boolean), 
so that users can decide whether their output is dense or sparse.

  was:
MinMaxScaler.transform() outputs DenseVector by default, which will cause poor 
performance and consume a lot of memory.

The most important line of code is the following:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195

I suggest that the code should calculate the number of non-zero elements in 
advance, if the number of non-zero elements is less than half of the total 
elements in the matrix, use SparseVector, otherwise use DenseVector

Or we can make it configurable by adding  a parameter to 








> MinMaxScaler.transform() outputs DenseVector by default, which causes poor 
> performance
> --
>
> Key: SPARK-17400
> URL: https://issues.apache.org/jira/browse/SPARK-17400
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Frank Dai
>
> MinMaxScaler.transform() outputs DenseVector by default, which causes poor 
> performance and consumes a lot of memory.
> The most important line of code is the following:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L195
> I suggest that the code calculate the number of non-zero elements in advance: 
> if the number of non-zero elements is less than half of the total elements in 
> the vector, use SparseVector; otherwise use DenseVector.
> Alternatively, we could make it configurable by adding a parameter to 
> MinMaxScaler.transform(), for example MinMaxScaler.transform(isDense: 
> Boolean), so that users can decide whether their output is dense or sparse.






[jira] [Comment Edited] (SPARK-16026) Cost-based Optimizer framework

2016-09-07 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472323#comment-15472323
 ] 

Zhenhua Wang edited comment on SPARK-16026 at 9/8/16 1:06 AM:
--

1. It's a good point. I think it's worth a try, even without phase 2.
But we should note that the search space becomes larger for this adapted 
algorithm, which makes join reordering more time-consuming.
Actually, I'm thinking that we could provide several algorithms for join 
reordering and leave it to users/DBAs to choose which one should be used. Or we 
can simply switch the algorithm internally based on the number of enumerable 
tables, e.g. when the number of tables is small (below some threshold), we use 
the one with the larger search space.
2. Agree with [~ron8hu]
3. For a condition C1 (on attribute a) && C2 (on attribute b), when estimating 
C2, we assume b is uniformly distributed with respect to a. Its cardinality 
estimate would be ridiculous only when that assumption is ridiculously wrong 
(i.e. b and a are strongly correlated). In that case, we need multi-dimensional 
histograms to improve accuracy.
On the other hand, defining a reasonable minimum selectivity is difficult; do 
you have any suggestions?


was (Author: zenwzh):
1. It's a good point. I think it's worth a try, even without the phase 2.
But we need to notice that the search space will become larger for this adapted 
algo, which makes join reordering more time-consuming.
Actually, I'm thinking that maybe we can provide several algorithms for join 
reordering, and leave the decision to users/DBAs to choose which one should be 
used. Or we can simply switch the algo internally based on number of enumerable 
tables, e.g. when the number of tables is small (some threshold), we use the 
one with larger search space.
2. Agree with [~ron8hu]
3. For condition C1(on attribute a) && C2(on attribute b), when estimating C2, 
we assume b is in uniform distribution on a. Its cardinality would be 
ridiculous only when that assumption is ridiculously wrong. In this case, we 
need multi-dimensional histograms to improve accuracy. 
On the other hand, defining a reasonable minimum selectivity is difficult, do 
you have any suggestions?

> Cost-based Optimizer framework
> --
>
> Key: SPARK-16026
> URL: https://issues.apache.org/jira/browse/SPARK-16026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.






[jira] [Commented] (SPARK-16026) Cost-based Optimizer framework

2016-09-07 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472323#comment-15472323
 ] 

Zhenhua Wang commented on SPARK-16026:
--

1. It's a good point. I think it's worth a try, even without phase 2.
But we should note that the search space becomes larger for this adapted 
algorithm, which makes join reordering more time-consuming.
Actually, I'm thinking that we could provide several algorithms for join 
reordering and leave it to users/DBAs to choose which one should be used. Or we 
can simply switch the algorithm internally based on the number of enumerable 
tables, e.g. when the number of tables is small (below some threshold), we use 
the one with the larger search space.
2. Agree with [~ron8hu]
3. For a condition C1 (on attribute a) && C2 (on attribute b), when estimating 
C2, we assume b is uniformly distributed with respect to a. Its cardinality 
estimate would be ridiculous only when that assumption is ridiculously wrong. 
In that case, we need multi-dimensional histograms to improve accuracy.
On the other hand, defining a reasonable minimum selectivity is difficult; do 
you have any suggestions?
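
A minimal sketch (hypothetical names, not the actual CBO code) of how the 
uniformity/independence assumption and a minimum-selectivity floor would 
interact: the combined selectivity of a conjunction is the product of the 
per-condition selectivities, and the floor only caps how small that estimate 
can get.
{code}
// Hypothetical sketch: selectivity of C1 AND ... AND Cn under the
// independence/uniformity assumption, with a configurable floor that guards
// against absurdly small estimates when the assumption is badly wrong
// (e.g. strongly correlated attributes).
def conjunctionSelectivity(selectivities: Seq[Double],
                           minSelectivity: Double = 1e-4): Double = {
  val independent = selectivities.product
  math.max(independent, minSelectivity)
}

// Example: sel(C1) = 0.01, sel(C2) = 0.001 gives 1e-5, so the floor kicks in.
val estimated = conjunctionSelectivity(Seq(0.01, 0.001))  // = 1e-4
{code}
Choosing minSelectivity is exactly the open question above: too low and it 
never triggers, too high and it inflates estimates for genuinely selective 
conjunctions.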

> Cost-based Optimizer framework
> --
>
> Key: SPARK-16026
> URL: https://issues.apache.org/jira/browse/SPARK-16026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.






[jira] [Commented] (SPARK-17414) Set type is not supported for creating data frames

2016-09-07 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472302#comment-15472302
 ] 

Shuai Lin commented on SPARK-17414:
---

So what type should {{Set}} be mapped to? {{ArrayType}}? That sounds sort of 
counter-intuitive.

> Set type is not supported for creating data frames
> --
>
> Key: SPARK-17414
> URL: https://issues.apache.org/jira/browse/SPARK-17414
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Emre Colak
>Priority: Minor
>
> For a case class that has a field of type Set, the createDataFrame() method 
> throws an exception saying "Schema for type Set is not supported". The 
> exception is raised by the org.apache.spark.sql.catalyst.ScalaReflection 
> class, where Array, Seq and Map types are supported but Set is not. It would 
> be nice to support Set here by default instead of having to write a custom 
> Encoder.
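
A minimal spark-shell style repro sketch of the limitation described above 
(it assumes a local SparkSession named spark; the exact exception message may 
vary by version):
{code}
import org.apache.spark.sql.SparkSession

case class Record(id: Int, tags: Set[String])

val spark = SparkSession.builder().master("local[*]").appName("set-repro").getOrCreate()

// A Seq-typed field works because Seq maps to ArrayType...
spark.createDataFrame(Seq((1, Seq("a", "b")))).printSchema()

// ...but a Set-typed field fails in ScalaReflection with an error like
// "Schema for type Set is not supported".
spark.createDataFrame(Seq(Record(1, Set("a", "b"))))
{code}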






[jira] [Updated] (SPARK-17442) Additional arguments in write.df are not passed to data source

2016-09-07 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17442:
---
Target Version/s: 2.0.1

> Additional arguments in write.df are not passed to data source
> --
>
> Key: SPARK-17442
> URL: https://issues.apache.org/jira/browse/SPARK-17442
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Priority: Blocker
>
> {{write.df}} passes everything in its arguments to underlying data source in 
> 1.x, but it is not passing header = "true" in Spark 2.0. For example the 
> following code snippet produces a header line in older versions of Spark but 
> not in 2.0.
> {code}
> df <- createDataFrame(iris)
> write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header 
> = "true")
> {code}






[jira] [Commented] (SPARK-17339) Fix SparkR tests on Windows

2016-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472245#comment-15472245
 ] 

Apache Spark commented on SPARK-17339:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15008

> Fix SparkR tests on Windows
> ---
>
> Key: SPARK-17339
> URL: https://issues.apache.org/jira/browse/SPARK-17339
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Reporter: Shivaram Venkataraman
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> A number of SparkR tests are current failing when run on Windows as discussed 
> in https://github.com/apache/spark/pull/14743
> The list of tests that fail right now is at 
> https://gist.github.com/shivaram/7693df7bd54dc81e2e7d1ce296c41134
> A full log from a build and test on AppVeyor is at 
> https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123






[jira] [Commented] (SPARK-17442) Additional arguments in write.df are not passed to data source

2016-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472219#comment-15472219
 ] 

Apache Spark commented on SPARK-17442:
--

User 'falaki' has created a pull request for this issue:
https://github.com/apache/spark/pull/15007

> Additional arguments in write.df are not passed to data source
> --
>
> Key: SPARK-17442
> URL: https://issues.apache.org/jira/browse/SPARK-17442
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Priority: Blocker
>
> {{write.df}} passes everything in its arguments to underlying data source in 
> 1.x, but it is not passing header = "true" in Spark 2.0. For example the 
> following code snippet produces a header line in older versions of Spark but 
> not in 2.0.
> {code}
> df <- createDataFrame(iris)
> write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header 
> = "true")
> {code}






[jira] [Assigned] (SPARK-17442) Additional arguments in write.df are not passed to data source

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17442:


Assignee: Apache Spark

> Additional arguments in write.df are not passed to data source
> --
>
> Key: SPARK-17442
> URL: https://issues.apache.org/jira/browse/SPARK-17442
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Assignee: Apache Spark
>Priority: Blocker
>
> {{write.df}} passes everything in its arguments to underlying data source in 
> 1.x, but it is not passing header = "true" in Spark 2.0. For example the 
> following code snippet produces a header line in older versions of Spark but 
> not in 2.0.
> {code}
> df <- createDataFrame(iris)
> write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header 
> = "true")
> {code}






[jira] [Assigned] (SPARK-17442) Additional arguments in write.df are not passed to data source

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17442:


Assignee: (was: Apache Spark)

> Additional arguments in write.df are not passed to data source
> --
>
> Key: SPARK-17442
> URL: https://issues.apache.org/jira/browse/SPARK-17442
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Priority: Blocker
>
> {{write.df}} passes everything in its arguments to underlying data source in 
> 1.x, but it is not passing header = "true" in Spark 2.0. For example the 
> following code snippet produces a header line in older versions of Spark but 
> not in 2.0.
> {code}
> df <- createDataFrame(iris)
> write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header 
> = "true")
> {code}






[jira] [Commented] (SPARK-17364) Can not query hive table starting with number

2016-09-07 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472189#comment-15472189
 ] 

Sean Zhong commented on SPARK-17364:


I have a trial fix at https://github.com/apache/spark/pull/15006, which will 
tokenize temp.20160826_ip_list as:
{code}
temp // Matches the IDENTIFIER lexer rule
. // Matches single dot
20160826_ip_list // Matches the IDENTIFIER lexer rule.
{code}



> Can not query hive table starting with number
> -
>
> Key: SPARK-17364
> URL: https://issues.apache.org/jira/browse/SPARK-17364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Egor Pahomov
>
> I can do it with spark-1.6.2
> {code}
> SELECT * from  temp.20160826_ip_list limit 100
> {code}
> {code}
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 
> 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
> 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19)
> == SQL ==
> SELECT * from  temp.20160826_ip_list limit 100
> ---^^^
> SQLState:  null
> ErrorCode: 0
> {code}






[jira] [Comment Edited] (SPARK-17364) Can not query hive table starting with number

2016-09-07 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472182#comment-15472182
 ] 

Sean Zhong edited comment on SPARK-17364 at 9/7/16 11:56 PM:
-

[~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list 
into tokens like
{code}
// temp.20160826_ip_list breaks down to:
temp  // Matches the IDENTIFIER lexer rule
.20160826 // Matches the DECIMAL_VALUE lexer rule.
_ip_list  // Matches the IDENTIFIER lexer rule.
{code}




was (Author: clockfly):
[~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list to 
tokens like
{code}
// temp.20160826_ip_list is break downs to:
temp
.20160826
_ip_list
{code}



> Can not query hive table starting with number
> -
>
> Key: SPARK-17364
> URL: https://issues.apache.org/jira/browse/SPARK-17364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Egor Pahomov
>
> I can do it with spark-1.6.2
> {code}
> SELECT * from  temp.20160826_ip_list limit 100
> {code}
> {code}
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 
> 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
> 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19)
> == SQL ==
> SELECT * from  temp.20160826_ip_list limit 100
> ---^^^
> SQLState:  null
> ErrorCode: 0
> {code}






[jira] [Comment Edited] (SPARK-17364) Can not query hive table starting with number

2016-09-07 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472182#comment-15472182
 ] 

Sean Zhong edited comment on SPARK-17364 at 9/7/16 11:55 PM:
-

[~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list 
into tokens like
{code}
// temp.20160826_ip_list breaks down to:
temp
.20160826
_ip_list
{code}




was (Author: clockfly):
[~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list to 
token like
{code}
// temp.20160826_ip_list is break downs to:
temp
.20160826
_ip_list
{code}



> Can not query hive table starting with number
> -
>
> Key: SPARK-17364
> URL: https://issues.apache.org/jira/browse/SPARK-17364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Egor Pahomov
>
> I can do it with spark-1.6.2
> {code}
> SELECT * from  temp.20160826_ip_list limit 100
> {code}
> {code}
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 
> 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
> 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19)
> == SQL ==
> SELECT * from  temp.20160826_ip_list limit 100
> ---^^^
> SQLState:  null
> ErrorCode: 0
> {code}






[jira] [Commented] (SPARK-17364) Can not query hive table starting with number

2016-09-07 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472182#comment-15472182
 ] 

Sean Zhong commented on SPARK-17364:


[~hvanhovell] That is because the antlr4 lexer breaks temp.20160826_ip_list 
into tokens like
{code}
// temp.20160826_ip_list breaks down to:
temp
.20160826
_ip_list
{code}



> Can not query hive table starting with number
> -
>
> Key: SPARK-17364
> URL: https://issues.apache.org/jira/browse/SPARK-17364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Egor Pahomov
>
> I can do it with spark-1.6.2
> {code}
> SELECT * from  temp.20160826_ip_list limit 100
> {code}
> {code}
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 
> 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
> 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19)
> == SQL ==
> SELECT * from  temp.20160826_ip_list limit 100
> ---^^^
> SQLState:  null
> ErrorCode: 0
> {code}






[jira] [Commented] (SPARK-17442) Additional arguments in write.df are not passed to data source

2016-09-07 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472174#comment-15472174
 ] 

Shivaram Venkataraman commented on SPARK-17442:
---

I think we missed passing the options through when we converted to the new 
writer API in 
https://github.com/apache/spark/commit/cc4d5229c98a589da76a4d5e5fdc5ea92385183b

The fix is probably just to add a line `write <- callJMethod(write, "options", 
options)`. Feel free to send a PR if you get a chance, or I'll send one later 
tonight.

cc [~felixcheung]
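
For reference, a Scala sketch of what the fixed R path should end up invoking 
on the JVM side, using the public DataFrameWriter API (the wrapper name and 
parameters here are assumptions for illustration, not the actual SparkR 
internals):
{code}
// Scala equivalent of what write.df should trigger once the extra arguments
// are forwarded: every additional argument becomes a DataFrameWriter option.
import org.apache.spark.sql.DataFrame

def writeWithOptions(df: DataFrame,
                     source: String,
                     path: String,
                     options: Map[String, String]): Unit = {
  df.write
    .format(source)
    .options(options)  // the step the R wrapper is currently skipping
    .save(path)
}

// e.g. writeWithOptions(df, "com.databricks.spark.csv", "/tmp/iris",
//                       Map("header" -> "true"))
{code}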

> Additional arguments in write.df are not passed to data source
> --
>
> Key: SPARK-17442
> URL: https://issues.apache.org/jira/browse/SPARK-17442
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Priority: Blocker
>
> {{write.df}} passes everything in its arguments to underlying data source in 
> 1.x, but it is not passing header = "true" in Spark 2.0. For example the 
> following code snippet produces a header line in older versions of Spark but 
> not in 2.0.
> {code}
> df <- createDataFrame(iris)
> write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header 
> = "true")
> {code}






[jira] [Assigned] (SPARK-17364) Can not query hive table starting with number

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17364:


Assignee: (was: Apache Spark)

> Can not query hive table starting with number
> -
>
> Key: SPARK-17364
> URL: https://issues.apache.org/jira/browse/SPARK-17364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Egor Pahomov
>
> I can do it with spark-1.6.2
> {code}
> SELECT * from  temp.20160826_ip_list limit 100
> {code}
> {code}
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 
> 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
> 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19)
> == SQL ==
> SELECT * from  temp.20160826_ip_list limit 100
> ---^^^
> SQLState:  null
> ErrorCode: 0
> {code}






[jira] [Commented] (SPARK-17364) Can not query hive table starting with number

2016-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472175#comment-15472175
 ] 

Apache Spark commented on SPARK-17364:
--

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/15006

> Can not query hive table starting with number
> -
>
> Key: SPARK-17364
> URL: https://issues.apache.org/jira/browse/SPARK-17364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Egor Pahomov
>
> I can do it with spark-1.6.2
> {code}
> SELECT * from  temp.20160826_ip_list limit 100
> {code}
> {code}
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 
> 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
> 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19)
> == SQL ==
> SELECT * from  temp.20160826_ip_list limit 100
> ---^^^
> SQLState:  null
> ErrorCode: 0
> {code}






[jira] [Assigned] (SPARK-17364) Can not query hive table starting with number

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17364:


Assignee: Apache Spark

> Can not query hive table starting with number
> -
>
> Key: SPARK-17364
> URL: https://issues.apache.org/jira/browse/SPARK-17364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Egor Pahomov
>Assignee: Apache Spark
>
> I can do it with spark-1.6.2
> {code}
> SELECT * from  temp.20160826_ip_list limit 100
> {code}
> {code}
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '.20160826' expecting {, ',', 'SELECT', 'FROM', 'ADD', 
> 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
> 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
> 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 19)
> == SQL ==
> SELECT * from  temp.20160826_ip_list limit 100
> ---^^^
> SQLState:  null
> ErrorCode: 0
> {code}






[jira] [Updated] (SPARK-16533) Spark application not handling preemption messages

2016-09-07 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-16533:
---
Fix Version/s: 2.0.1

> Spark application not handling preemption messages
> --
>
> Key: SPARK-16533
> URL: https://issues.apache.org/jira/browse/SPARK-16533
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Input/Output, Optimizer, Scheduler, Spark Submit, 
> YARN
>Affects Versions: 1.6.0
> Environment: Yarn version: Hadoop 2.7.1-amzn-0
> AWS EMR Cluster running:
> 1 x r3.8xlarge (Master)
> 52 x r3.8xlarge (Core)
> Spark version : 1.6.0
> Scala version: 2.10.5
> Java version: 1.8.0_51
> Input size: ~10 tb
> Input coming from S3
> Queue Configuration:
> Dynamic allocation: enabled
> Preemption: enabled
> Q1: 70% capacity with max of 100%
> Q2: 30% capacity with max of 100%
> Job Configuration:
> Driver memory = 10g
> Executor cores = 6
> Executor memory = 10g
> Deploy mode = cluster
> Master = yarn
> maxResultSize = 4g
> Shuffle manager = hash
>Reporter: Lucas Winkelmann
>Assignee: Angus Gerry
> Fix For: 2.0.1, 2.1.0
>
>
> Here is the scenario:
> I launch job 1 into Q1 and allow it to grow to 100% cluster utilization.
> I wait between 15-30 mins (for this job to complete with 100% of the cluster 
> available takes about 1 hr, so job 1 is between 25-50% complete). Note that if 
> I wait less time, the issue sometimes does not occur; it appears to happen 
> only after job 1 is at least 25% complete.
> I launch job 2 into Q2 and preemption occurs on Q1, shrinking job 1 to 70% of 
> cluster utilization.
> At this point job 1 basically halts progress while job 2 continues to execute 
> as normal and finishes. Job 1 either:
> - Fails its attempt and restarts. By the time this attempt fails, the other 
> job is already complete, meaning the second attempt has full cluster 
> availability and finishes.
> - Remains at its current progress and simply does not finish (I have waited 
> ~6 hrs before finally killing the application).
>  
> Looking into the error log there is this constant error message:
> WARN NettyRpcEndpointRef: Error sending message [message = 
> RemoveExecutor(454,Container container_1468422920649_0001_01_000594 on host: 
> ip-NUMBERS.ec2.internal was preempted.)] in X attempts
>  
> My observations have led me to believe that the application master does not 
> know about this container being killed and continuously asks for the executor 
> to be removed, until it either eventually fails the attempt or simply keeps 
> trying to remove the executor.
>  
> I have done much digging online for anyone else experiencing this issue but 
> have come up with nothing.






[jira] [Updated] (SPARK-17386) Default polling and trigger intervals cause excessive RPC calls

2016-09-07 Thread Frederick Reiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederick Reiss updated SPARK-17386:

Summary: Default polling and trigger intervals cause excessive RPC calls  
(was: Default trigger interval causes excessive RPC calls)

> Default polling and trigger intervals cause excessive RPC calls
> ---
>
> Key: SPARK-17386
> URL: https://issues.apache.org/jira/browse/SPARK-17386
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Frederick Reiss
>
> The default trigger interval for a Structured Streaming query is 
> {{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
> When the trigger is set to this default value, the scheduler in 
> {{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 
> 10 msec on every {{Source}} until new data arrives.
> In test cases, where most of the sources are {{MemoryStream}} or 
> {{TextSocketSource}}, this spinning leads to excessive CPU usage.
> In a production environment, this spinning could take down critical 
> infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} 
> or the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
> {{FileStreamSource}} performs a directory listing of an HDFS directory. If 
> the scheduler calls {{FileStreamSource.getOffset()}} in a tight loop, Spark 
> will make hundreds of RPC calls per second to the HDFS NameNode. This 
> overhead could disrupt service to other systems using HDFS, including Spark 
> itself. A similar situation will exist with the Kafka source, the 
> {{getOffset()}} method of which will presumably call Kafka's 
> {{Consumer.poll()}} method.






[jira] [Updated] (SPARK-17386) Default trigger interval causes excessive RPC calls

2016-09-07 Thread Frederick Reiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederick Reiss updated SPARK-17386:

Description: 
The default trigger interval for a Structured Streaming query is 
{{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
When the trigger is set to this default value, the scheduler in 
{{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 10 
msec on every {{Source}} until new data arrives.

In test cases, where most of the sources are {{MemoryStream}} or 
{{TextSocketSource}}, this spinning leads to excessive CPU usage.

In a production environment, this spinning could take down critical 
infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} or 
the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
{{FileStreamSource}} performs a directory listing of an HDFS directory. If the 
scheduler calls {{FileStreamSource.getOffset()}} in a tight loop, Spark will 
make hundreds of RPC calls per second to the HDFS NameNode. This overhead could 
disrupt service to other systems using HDFS, including Spark itself. A similar 
situation will exist with the Kafka source, the {{getOffset()}} method of which 
will presumably call Kafka's {{Consumer.poll()}} method.

  was:
The default trigger interval for a Structured Streaming query is 
{{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
When the trigger is set to this default value, the scheduler in 
{{StreamExecution}} will spin in a tight loop calling {{getOffset()}} on every 
{{Source}} until new data arrives.

In test cases, where most of the sources are {{MemoryStream}} or 
{{TextSocketSource}}, this spinning leads to excessive CPU usage.

In a production environment, this spinning could take down critical 
infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} or 
the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
{{FileStreamSource}} performs a directory listing of an HDFS directory. If the 
scheduler calls {{FileStreamSource.getOffset()}} in a tight loop, Spark will 
make hundreds of RPC calls per second to the HDFS NameNode. This overhead could 
disrupt service to other systems using HDFS, including Spark itself. A similar 
situation will exist with the Kafka source, the {{getOffset()}} method of which 
will presumably call Kafka's {{Consumer.poll()}} method.


> Default trigger interval causes excessive RPC calls
> ---
>
> Key: SPARK-17386
> URL: https://issues.apache.org/jira/browse/SPARK-17386
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Frederick Reiss
>
> The default trigger interval for a Structured Streaming query is 
> {{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
> When the trigger is set to this default value, the scheduler in 
> {{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 
> 10 msec on every {{Source}} until new data arrives.
> In test cases, where most of the sources are {{MemoryStream}} or 
> {{TextSocketSource}}, this spinning leads to excessive CPU usage.
> In a production environment, this spinning could take down critical 
> infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} 
> or the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
> {{FileStreamSource}} performs a directory listing of an HDFS directory. If 
> the scheduler calls {{FileStreamSource.getOffset()}} in a tight loop, Spark 
> will make hundreds of RPC calls per second to the HDFS NameNode. This 
> overhead could disrupt service to other systems using HDFS, including Spark 
> itself. A similar situation will exist with the Kafka source, the 
> {{getOffset()}} method of which will presumably call Kafka's 
> {{Consumer.poll()}} method.






[jira] [Updated] (SPARK-17442) Additional arguments in write.df are not passed to data source

2016-09-07 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17442:
---
Priority: Blocker  (was: Critical)

> Additional arguments in write.df are not passed to data source
> --
>
> Key: SPARK-17442
> URL: https://issues.apache.org/jira/browse/SPARK-17442
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Priority: Blocker
>
> {{write.df}} passes everything in its arguments to underlying data source in 
> 1.x, but it is not passing header = "true" in Spark 2.0. For example the 
> following code snippet produces a header line in older versions of Spark but 
> not in 2.0.
> {code}
> df <- createDataFrame(iris)
> write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header 
> = "true")
> {code}






[jira] [Updated] (SPARK-17442) Additional arguments in write.df are not passed to data source

2016-09-07 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-17442:
---
Priority: Critical  (was: Major)

> Additional arguments in write.df are not passed to data source
> --
>
> Key: SPARK-17442
> URL: https://issues.apache.org/jira/browse/SPARK-17442
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Priority: Critical
>
> {{write.df}} passes all of its extra arguments to the underlying data source in 
> 1.x, but it does not pass header = "true" in Spark 2.0. For example, the 
> following code snippet produces a header line in older versions of Spark but 
> not in 2.0.
> {code}
> df <- createDataFrame(iris)
> write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header = "true")
> {code}






[jira] [Created] (SPARK-17442) Additional arguments in write.df are not passed to data source

2016-09-07 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17442:
--

 Summary: Additional arguments in write.df are not passed to data 
source
 Key: SPARK-17442
 URL: https://issues.apache.org/jira/browse/SPARK-17442
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.0
Reporter: Hossein Falaki


{{write.df}} passes all of its extra arguments to the underlying data source in 
1.x, but it does not pass header = "true" in Spark 2.0. For example, the 
following code snippet produces a header line in older versions of Spark but 
not in 2.0.

{code}
df <- createDataFrame(iris)
write.df(df, source = "com.databricks.spark.csv", path = "/tmp/iris", header = "true")
{code}
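
For comparison, a sketch (not from the report) of the Scala path that the SparkR
call is expected to mirror: extra arguments such as header = "true" should reach
the data source as writer options. The built-in csv source and the output path
are stand-ins.

{code}
// Sketch: the Scala DataFrameWriter equivalent, where header = "true" travels
// through .option(...) to the data source. The DataFrame and path are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("write-options-sketch").getOrCreate()
val df = spark.range(5).toDF("id")   // stand-in for createDataFrame(iris)

df.write
  .format("csv")                     // built-in csv source in Spark 2.0
  .option("header", "true")          // the option that write.df fails to forward
  .save("/tmp/iris-with-header")
{code}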






[jira] [Commented] (SPARK-17421) Warnings about "MaxPermSize" parameter when building with Maven and Java 8

2016-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472045#comment-15472045
 ] 

Apache Spark commented on SPARK-17421:
--

User 'frreiss' has created a pull request for this issue:
https://github.com/apache/spark/pull/15005

> Warnings about "MaxPermSize" parameter when building with Maven and Java 8
> --
>
> Key: SPARK-17421
> URL: https://issues.apache.org/jira/browse/SPARK-17421
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Frederick Reiss
>Priority: Minor
>
> When building Spark with {{build/mvn}} or {{dev/run-tests}}, a Java warning 
> appears repeatedly on STDERR:
> {{OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support 
> was removed in 8.0}}
> This warning is due to {{build/mvn}} adding the {{-XX:MaxPermSize=512M}} 
> option to {{MAVEN_OPTS}}. When compiling with Java 7, this parameter is 
> essential. With Java 8, the parameter leads to the warning above.
> Because {{build/mvn}} appends {{MaxPermSize}} to {{MAVEN_OPTS}} even when that 
> environment variable does not already contain the option, setting 
> {{MAVEN_OPTS}} yourself to a string without {{MaxPermSize}} has no effect.






[jira] [Commented] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-07 Thread Qifan Pu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472037#comment-15472037
 ] 

Qifan Pu commented on SPARK-17405:
--

[~joshrosen] Yes, running local[32] will reproduce the exception. 

> Simple aggregation query OOMing after SPARK-16525
> -
>
> Key: SPARK-17405
> URL: https://issues.apache.org/jira/browse/SPARK-17405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the 
> following query ran fine via Beeline / Thrift Server and the Spark shell, but 
> after that patch it is consistently OOMING:
> {code}
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS (
>   SELECT * FROM (VALUES
> (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
> (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 
> 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS 
> INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
> (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 
> 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 
> 00:00:00.0'), '211', -959, CAST(NULL AS STRING)),
> (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
> (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 
> 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, 
> TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'),
> (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 
> 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 
> 00:00:00.0'), '-345', 566, '-574'),
> (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 
> 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), 
> TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'),
> (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142')
>   ) as t);
> CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, 
> boolean_col_4, timestamp_col_5, decimal3317_col_6) AS (
>   SELECT * FROM (VALUES
> ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), 
> false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))),
> ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), 
> true, CAST(NULL AS TIMESTAMP), -653.51000BD),
> ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), 
> false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD),
> ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), 
> false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD),
> ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), 
> false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD),
> ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), 
> false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD)
>   ) as t);
> SELECT
> CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col,
> LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS 
> STRING)) AS decimal_col,
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col
> FROM table_3 t1
> INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND 
> ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = 
> (t1.string_col_1))
> WHERE
> (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9)
> GROUP BY
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9);
> {code}
> Here's the OOM:
> {code}
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 
> (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 
> bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
> at 
> org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:783)
> at 
> 

[jira] [Commented] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-07 Thread Qifan Pu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472009#comment-15472009
 ] 

Qifan Pu commented on SPARK-17405:
--

One quick fix is to set the memory capacity in the configuration so that 
memory_capacity > x * cores, where x is some per-task budget greater than 64MB.

> Simple aggregation query OOMing after SPARK-16525
> -
>
> Key: SPARK-17405
> URL: https://issues.apache.org/jira/browse/SPARK-17405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the 
> following query ran fine via Beeline / Thrift Server and the Spark shell, but 
> after that patch it is consistently OOMING:
> {code}
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS (
>   SELECT * FROM (VALUES
> (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
> (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 
> 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS 
> INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
> (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 
> 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 
> 00:00:00.0'), '211', -959, CAST(NULL AS STRING)),
> (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
> (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 
> 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, 
> TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'),
> (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 
> 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 
> 00:00:00.0'), '-345', 566, '-574'),
> (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 
> 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), 
> TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'),
> (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142')
>   ) as t);
> CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, 
> boolean_col_4, timestamp_col_5, decimal3317_col_6) AS (
>   SELECT * FROM (VALUES
> ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), 
> false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))),
> ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), 
> true, CAST(NULL AS TIMESTAMP), -653.51000BD),
> ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), 
> false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD),
> ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), 
> false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD),
> ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), 
> false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD),
> ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), 
> false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD)
>   ) as t);
> SELECT
> CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col,
> LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS 
> STRING)) AS decimal_col,
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col
> FROM table_3 t1
> INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND 
> ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = 
> (t1.string_col_1))
> WHERE
> (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9)
> GROUP BY
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9);
> {code}
> Here's the OOM:
> {code}
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 
> (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 
> bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
> at 
> 

[jira] [Commented] (SPARK-17441) Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data source table

2016-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472005#comment-15472005
 ] 

Apache Spark commented on SPARK-17441:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15004

> Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data 
> source table
> -
>
> Key: SPARK-17441
> URL: https://issues.apache.org/jira/browse/SPARK-17441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> `ALTER TABLE RENAME PARTITION` is unable to handle data source tables, just 
> like the other `ALTER PARTITION` commands. We should issue an exception 
> instead. 






[jira] [Assigned] (SPARK-17441) Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data source table

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17441:


Assignee: (was: Apache Spark)

> Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data 
> source table
> -
>
> Key: SPARK-17441
> URL: https://issues.apache.org/jira/browse/SPARK-17441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> `ALTER TABLE RENAME PARTITION` is unable to handle data source tables, just 
> like the other `ALTER PARTITION` commands. We should issue an exception 
> instead. 






[jira] [Assigned] (SPARK-17440) Issue Exception when ALTER TABLE commands try to alter a VIEW

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17440:


Assignee: (was: Apache Spark)

> Issue Exception when ALTER TABLE commands try to alter a VIEW
> -
>
> Key: SPARK-17440
> URL: https://issues.apache.org/jira/browse/SPARK-17440
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> For the following `ALTER TABLE` DDL, we should issue an exception when the 
> target table is a `VIEW`:
> {code}
>  ALTER TABLE viewName SET LOCATION '/path/to/your/lovely/heart'
>  ALTER TABLE viewName SET SERDE 'whatever'
>  ALTER TABLE viewName SET SERDEPROPERTIES ('x' = 'y')
>  ALTER TABLE viewName PARTITION (a=1, b=2) SET SERDEPROPERTIES ('x' = 'y')
>  ALTER TABLE viewName ADD IF NOT EXISTS PARTITION (a='4', b='8')
>  ALTER TABLE viewName DROP IF EXISTS PARTITION (a='2')
>  ALTER TABLE viewName RECOVER PARTITIONS
>  ALTER TABLE viewName PARTITION (a='1', b='q') RENAME TO PARTITION (a='100', 
> b='p')
> {code}






[jira] [Commented] (SPARK-17440) Issue Exception when ALTER TABLE commands try to alter a VIEW

2016-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472004#comment-15472004
 ] 

Apache Spark commented on SPARK-17440:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15004

> Issue Exception when ALTER TABLE commands try to alter a VIEW
> -
>
> Key: SPARK-17440
> URL: https://issues.apache.org/jira/browse/SPARK-17440
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> For the following `ALTER TABLE` DDL, we should issue an exception when the 
> target table is a `VIEW`:
> {code}
>  ALTER TABLE viewName SET LOCATION '/path/to/your/lovely/heart'
>  ALTER TABLE viewName SET SERDE 'whatever'
>  ALTER TABLE viewName SET SERDEPROPERTIES ('x' = 'y')
>  ALTER TABLE viewName PARTITION (a=1, b=2) SET SERDEPROPERTIES ('x' = 'y')
>  ALTER TABLE viewName ADD IF NOT EXISTS PARTITION (a='4', b='8')
>  ALTER TABLE viewName DROP IF EXISTS PARTITION (a='2')
>  ALTER TABLE viewName RECOVER PARTITIONS
>  ALTER TABLE viewName PARTITION (a='1', b='q') RENAME TO PARTITION (a='100', 
> b='p')
> {code}






[jira] [Assigned] (SPARK-17441) Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data source table

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17441:


Assignee: Apache Spark

> Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data 
> source table
> -
>
> Key: SPARK-17441
> URL: https://issues.apache.org/jira/browse/SPARK-17441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> `ALTER TABLE RENAME PARTITION` is unable to handle data source tables, just 
> like the other `ALTER PARTITION` commands. We should issue an exception 
> instead. 






[jira] [Assigned] (SPARK-17440) Issue Exception when ALTER TABLE commands try to alter a VIEW

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17440:


Assignee: Apache Spark

> Issue Exception when ALTER TABLE commands try to alter a VIEW
> -
>
> Key: SPARK-17440
> URL: https://issues.apache.org/jira/browse/SPARK-17440
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> For the following `ALTER TABLE` DDL, we should issue an exception when the 
> target table is a `VIEW`:
> {code}
>  ALTER TABLE viewName SET LOCATION '/path/to/your/lovely/heart'
>  ALTER TABLE viewName SET SERDE 'whatever'
>  ALTER TABLE viewName SET SERDEPROPERTIES ('x' = 'y')
>  ALTER TABLE viewName PARTITION (a=1, b=2) SET SERDEPROPERTIES ('x' = 'y')
>  ALTER TABLE viewName ADD IF NOT EXISTS PARTITION (a='4', b='8')
>  ALTER TABLE viewName DROP IF EXISTS PARTITION (a='2')
>  ALTER TABLE viewName RECOVER PARTITIONS
>  ALTER TABLE viewName PARTITION (a='1', b='q') RENAME TO PARTITION (a='100', 
> b='p')
> {code}






[jira] [Commented] (SPARK-16026) Cost-based Optimizer framework

2016-09-07 Thread Ron Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472000#comment-15472000
 ] 

Ron Hu commented on SPARK-16026:


Hi Srinath, thank you for your comments. Let me answer them one by one.
First, on whether we should consider the data shuffle cost: yes, this is part of 
the phase 2 cost functions in our plan. Since we have already implemented the 
phase 1 cost function, we want to contribute our existing development work to 
the Spark community as soon as possible, and we will expand to the phase 2 CBO 
work soon. In phase 2 we will develop a cost function for each execution 
operator; the EXCHANGE operator is one for which we need to define a cost 
function. Your suggestion is quite reasonable.

Second, we define two statements: (1) ANALYZE TABLE table_name COMPUTE 
STATISTICS; (2) ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS 
column-name1, column-name2, …. As you know, the ANALYZE TABLE command collects 
the auxiliary statistics information. A good DBA needs to monitor the state of 
that information, since there is always a question of whether the statistics are 
stale. Hence, we do not want to apply transactional criteria when viewing 
statistics data. On the other hand, we can do a little better to keep them 
consistent: one way is to refresh the table-level statistics whenever we execute 
"ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column-name1, 
column-name2, …" to collect column-level statistics.

Third, we do not assume a default selectivity. The design spec defines how to 
estimate the cardinality of the logical AND operator in section 6. In the 
future, we may use a 2-dimensional histogram and/or a SQL hint to handle the 
correlation among multiple correlated columns.
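
For reference, a sketch (not from the design spec) of the two statement shapes
above issued through {{spark.sql}}. The table and column names are placeholders,
and the FOR COLUMNS form is the proposed column-level syntax, so it is an
assumption here rather than something guaranteed to exist in a given release.

{code}
// Sketch: table-level and (proposed) column-level statistics collection.
// The table name "sales" and its columns are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("analyze-sketch")
  .enableHiveSupport()
  .getOrCreate()

// (1) table-level statistics
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

// (2) column-level statistics -- the proposed FOR COLUMNS variant
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")
{code}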


> Cost-based Optimizer framework
> --
>
> Key: SPARK-16026
> URL: https://issues.apache.org/jira/browse/SPARK-16026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.






[jira] [Issue Comment Deleted] (SPARK-16026) Cost-based Optimizer framework

2016-09-07 Thread Ron Hu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Hu updated SPARK-16026:
---
Comment: was deleted

(was: Hi Srinath,  Thank you for your comments.  Let me answer them one by one. 
 
First, we should consider the data shuffle cost.  Yes, this is part of phase 2 
cost functions in our plan.  As we already implemented the phase 1 cost 
function, we want to contribute our existing development work to Spark 
community ASAP.  We will expand to phase 2 CBO work soon.  In phase 2, we will 
develop cost function for each execution operator.  The EXCHANGE operator is 
one we need to define its cost function.  Your suggestion is quite reasonable.

Second, we define two statements: (1)  ANALYZE TABLE table_name COMPUTE 
STATISTICS; (2) ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS 
column-name1, column-name2, ….  As you know, the ANALYZE TABLE command collects 
the auxiliary statistics information.  A good DBA needs to monitor the status 
of the statistics information.  I mean there always exists an issue whether or 
not the statistics data is stale.  Hence, we do not want to use the transaction 
criteria to view statistics data.   On the other hand, we may do a little 
better to make them consistent.  One way is to refresh table level statistics 
when we execute the command "ANALYZE TABLE table_name COMPUTE STATISTICS FOR 
COLUMNS column-name1, column-name2, …. " to collect column level statistics.  

Third, we do not have default selectivity assumed.  In the design spec, we 
defined how to estimate the cardinality for logical AND operator in section 6.  
In the future, we may use either 2-dimensional histogram and/or SQL hint to 
handle the correlation among multiple correlated columns.  )

> Cost-based Optimizer framework
> --
>
> Key: SPARK-16026
> URL: https://issues.apache.org/jira/browse/SPARK-16026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.






[jira] [Created] (SPARK-17441) Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data source table

2016-09-07 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17441:
---

 Summary: Issue Exceptions when ALTER TABLE RENAME PARTITION tries 
to alter a data source table
 Key: SPARK-17441
 URL: https://issues.apache.org/jira/browse/SPARK-17441
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li


`ALTER TABLE RENAME PARTITION` is unable to handle data source tables, just 
like the other `ALTER PARTITION` commands. We should issue an exception 
instead. 







[jira] [Created] (SPARK-17440) Issue Exception when ALTER TABLE commands try to alter a VIEW

2016-09-07 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17440:
---

 Summary: Issue Exception when ALTER TABLE commands try to alter a 
VIEW
 Key: SPARK-17440
 URL: https://issues.apache.org/jira/browse/SPARK-17440
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li


For the following `ALTER TABLE` DDL, we should issue an exception when the 
target table is a `VIEW`:
{code}
 ALTER TABLE viewName SET LOCATION '/path/to/your/lovely/heart'

 ALTER TABLE viewName SET SERDE 'whatever'

 ALTER TABLE viewName SET SERDEPROPERTIES ('x' = 'y')

 ALTER TABLE viewName PARTITION (a=1, b=2) SET SERDEPROPERTIES ('x' = 'y')

 ALTER TABLE viewName ADD IF NOT EXISTS PARTITION (a='4', b='8')

 ALTER TABLE viewName DROP IF EXISTS PARTITION (a='2')

 ALTER TABLE viewName RECOVER PARTITIONS

 ALTER TABLE viewName PARTITION (a='1', b='q') RENAME TO PARTITION (a='100', 
b='p')
{code}
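
A sketch of how the proposed behaviour would surface to users (assumed
behaviour, since the check is exactly what this ticket asks for): once the fix
is in place, running any of the statements above against a view should fail
with an {{AnalysisException}}. The view definition and paths are placeholders.

{code}
// Sketch of the expected post-fix behaviour; the view name and path are placeholders.
import org.apache.spark.sql.{AnalysisException, SparkSession}

val spark = SparkSession.builder()
  .appName("alter-view-sketch")
  .enableHiveSupport()               // permanent views need a persistent catalog
  .getOrCreate()

spark.sql("CREATE VIEW viewName AS SELECT 1 AS a")
try {
  spark.sql("ALTER TABLE viewName SET LOCATION '/tmp/somewhere'")
} catch {
  case e: AnalysisException =>
    // expected once the fix lands: the command is rejected because viewName is a VIEW
    println(s"Rejected as expected: ${e.getMessage}")
}
{code}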






[jira] [Commented] (SPARK-16026) Cost-based Optimizer framework

2016-09-07 Thread Ron Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471986#comment-15471986
 ] 

Ron Hu commented on SPARK-16026:


Hi Srinath, thank you for your comments. Let me answer them one by one.
First, on whether we should consider the data shuffle cost: yes, this is part of 
the phase 2 cost functions in our plan. Since we have already implemented the 
phase 1 cost function, we want to contribute our existing development work to 
the Spark community as soon as possible, and we will expand to the phase 2 CBO 
work soon. In phase 2 we will develop a cost function for each execution 
operator; the EXCHANGE operator is one for which we need to define a cost 
function. Your suggestion is quite reasonable.

Second, we define two statements: (1) ANALYZE TABLE table_name COMPUTE 
STATISTICS; (2) ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS 
column-name1, column-name2, …. As you know, the ANALYZE TABLE command collects 
the auxiliary statistics information. A good DBA needs to monitor the state of 
that information, since there is always a question of whether the statistics are 
stale. Hence, we do not want to apply transactional criteria when viewing 
statistics data. On the other hand, we can do a little better to keep them 
consistent: one way is to refresh the table-level statistics whenever we execute 
"ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column-name1, 
column-name2, …" to collect column-level statistics.

Third, we do not assume a default selectivity. The design spec defines how to 
estimate the cardinality of the logical AND operator in section 6. In the 
future, we may use a 2-dimensional histogram and/or a SQL hint to handle the 
correlation among multiple correlated columns.

> Cost-based Optimizer framework
> --
>
> Key: SPARK-16026
> URL: https://issues.apache.org/jira/browse/SPARK-16026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.






[jira] [Updated] (SPARK-17439) QuantilesSummaries returns the wrong result after compression

2016-09-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17439:
---
Labels: correctness  (was: )

> QuantilesSummaries returns the wrong result after compression
> -
>
> Key: SPARK-17439
> URL: https://issues.apache.org/jira/browse/SPARK-17439
> Project: Spark
>  Issue Type: Bug
>Reporter: Tim Hunter
>  Labels: correctness
>
> [~clockfly] found the following corner case that returns the wrong quantile 
> (off by 1):
> {code}
> test("test QuantileSummaries compression") {
> var left = new QuantileSummaries(1, 0.0001)
> System.out.println("LEFT  RIGHT")
> System.out.println("")
> (0 to 10).foreach { index =>
>   left = left.insert(index)
>   left = left.compress()
>   var right = new QuantileSummaries(1, 0.0001)
>   (0 to index).foreach(right.insert(_))
>   right = right.compress()
>   System.out.println(s"${left.query(0.5)}   ${right.query(0.5)}")
> }
>   }
> {code}
> The result is:
> {code}
> LEFT  RIGHT
> 
> 0.0   0.0
> 0.0   1.0
> 0.0   1.0
> 0.0   1.0
> 1.0   2.0
> 1.0   2.0
> 2.0   3.0
> 2.0   3.0
> 3.0   4.0
> 3.0   4.0
> 4.0   5.0
> {code}
> The value in the "LEFT" column represents the output when using 
> QuantileSummaries in a Window function, while the value in the "RIGHT" column 
> represents the expected result. The difference between the "LEFT" and "RIGHT" 
> columns is that the "LEFT" column does intermediate compression on the storage 
> of QuantileSummaries.






[jira] [Commented] (SPARK-17439) QuantilesSummaries returns the wrong result after compression

2016-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471829#comment-15471829
 ] 

Apache Spark commented on SPARK-17439:
--

User 'thunterdb' has created a pull request for this issue:
https://github.com/apache/spark/pull/15002

> QuantilesSummaries returns the wrong result after compression
> -
>
> Key: SPARK-17439
> URL: https://issues.apache.org/jira/browse/SPARK-17439
> Project: Spark
>  Issue Type: Bug
>Reporter: Tim Hunter
>
> [~clockfly] found the following corner case that returns the wrong quantile 
> (off by 1):
> {code}
> test("test QuantileSummaries compression") {
> var left = new QuantileSummaries(1, 0.0001)
> System.out.println("LEFT  RIGHT")
> System.out.println("")
> (0 to 10).foreach { index =>
>   left = left.insert(index)
>   left = left.compress()
>   var right = new QuantileSummaries(1, 0.0001)
>   (0 to index).foreach(right.insert(_))
>   right = right.compress()
>   System.out.println(s"${left.query(0.5)}   ${right.query(0.5)}")
> }
>   }
> {code}
> The result is:
> {code}
> LEFT  RIGHT
> 
> 0.0   0.0
> 0.0   1.0
> 0.0   1.0
> 0.0   1.0
> 1.0   2.0
> 1.0   2.0
> 2.0   3.0
> 2.0   3.0
> 3.0   4.0
> 3.0   4.0
> 4.0   5.0
> {code}
> The value in the "LEFT" column represents the output when using 
> QuantileSummaries in a Window function, while the value in the "RIGHT" column 
> represents the expected result. The difference between the "LEFT" and "RIGHT" 
> columns is that the "LEFT" column does intermediate compression on the storage 
> of QuantileSummaries.






[jira] [Assigned] (SPARK-17439) QuantilesSummaries returns the wrong result after compression

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17439:


Assignee: (was: Apache Spark)

> QuantilesSummaries returns the wrong result after compression
> -
>
> Key: SPARK-17439
> URL: https://issues.apache.org/jira/browse/SPARK-17439
> Project: Spark
>  Issue Type: Bug
>Reporter: Tim Hunter
>
> [~clockfly] found the following corner case that returns the wrong quantile 
> (off by 1):
> {code}
> test("test QuantileSummaries compression") {
> var left = new QuantileSummaries(1, 0.0001)
> System.out.println("LEFT  RIGHT")
> System.out.println("")
> (0 to 10).foreach { index =>
>   left = left.insert(index)
>   left = left.compress()
>   var right = new QuantileSummaries(1, 0.0001)
>   (0 to index).foreach(right.insert(_))
>   right = right.compress()
>   System.out.println(s"${left.query(0.5)}   ${right.query(0.5)}")
> }
>   }
> {code}
> The result is:
> {code}
> LEFT  RIGHT
> 
> 0.0   0.0
> 0.0   1.0
> 0.0   1.0
> 0.0   1.0
> 1.0   2.0
> 1.0   2.0
> 2.0   3.0
> 2.0   3.0
> 3.0   4.0
> 3.0   4.0
> 4.0   5.0
> {code}
> The value in the "LEFT" column represents the output when using 
> QuantileSummaries in a Window function, while the value in the "RIGHT" column 
> represents the expected result. The difference between the "LEFT" and "RIGHT" 
> columns is that the "LEFT" column does intermediate compression on the storage 
> of QuantileSummaries.






[jira] [Assigned] (SPARK-17439) QuantilesSummaries returns the wrong result after compression

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17439:


Assignee: Apache Spark

> QuantilesSummaries returns the wrong result after compression
> -
>
> Key: SPARK-17439
> URL: https://issues.apache.org/jira/browse/SPARK-17439
> Project: Spark
>  Issue Type: Bug
>Reporter: Tim Hunter
>Assignee: Apache Spark
>
> [~clockfly] found the following corner case that returns the wrong quantile 
> (off by 1):
> {code}
> test("test QuantileSummaries compression") {
> var left = new QuantileSummaries(1, 0.0001)
> System.out.println("LEFT  RIGHT")
> System.out.println("")
> (0 to 10).foreach { index =>
>   left = left.insert(index)
>   left = left.compress()
>   var right = new QuantileSummaries(1, 0.0001)
>   (0 to index).foreach(right.insert(_))
>   right = right.compress()
>   System.out.println(s"${left.query(0.5)}   ${right.query(0.5)}")
> }
>   }
> {code}
> The result is:
> {code}
> LEFT  RIGHT
> 
> 0.0   0.0
> 0.0   1.0
> 0.0   1.0
> 0.0   1.0
> 1.0   2.0
> 1.0   2.0
> 2.0   3.0
> 2.0   3.0
> 3.0   4.0
> 3.0   4.0
> 4.0   5.0
> {code}
> The value in the "LEFT" column represents the output when using 
> QuantileSummaries in a Window function, while the value in the "RIGHT" column 
> represents the expected result. The difference between the "LEFT" and "RIGHT" 
> columns is that the "LEFT" column does intermediate compression on the storage 
> of QuantileSummaries.






[jira] [Created] (SPARK-17439) QuantilesSummaries returns the wrong result after compression

2016-09-07 Thread Tim Hunter (JIRA)
Tim Hunter created SPARK-17439:
--

 Summary: QuantilesSummaries returns the wrong result after 
compression
 Key: SPARK-17439
 URL: https://issues.apache.org/jira/browse/SPARK-17439
 Project: Spark
  Issue Type: Bug
Reporter: Tim Hunter


[~clockfly] found the following corner case that returns the wrong quantile 
(off by 1):

{code}
test("test QuantileSummaries compression") {
var left = new QuantileSummaries(1, 0.0001)
System.out.println("LEFT  RIGHT")
System.out.println("")
(0 to 10).foreach { index =>
  left = left.insert(index)
  left = left.compress()

  var right = new QuantileSummaries(1, 0.0001)
  (0 to index).foreach(right.insert(_))
  right = right.compress()
  System.out.println(s"${left.query(0.5)}   ${right.query(0.5)}")
}
  }
{code}

The result is:
{code}
LEFT  RIGHT

0.0   0.0
0.0   1.0
0.0   1.0
0.0   1.0
1.0   2.0
1.0   2.0
2.0   3.0
2.0   3.0
3.0   4.0
3.0   4.0
4.0   5.0
{code}


The value in the "LEFT" column represents the output when using 
QuantileSummaries in a Window function, while the value in the "RIGHT" column 
represents the expected result. The difference between the "LEFT" and "RIGHT" 
columns is that the "LEFT" column does intermediate compression on the storage 
of QuantileSummaries.







[jira] [Commented] (SPARK-17439) QuantilesSummaries returns the wrong result after compression

2016-09-07 Thread Tim Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471801#comment-15471801
 ] 

Tim Hunter commented on SPARK-17439:


I have a patch for that. It should be merged after SPARK-17306

> QuantilesSummaries returns the wrong result after compression
> -
>
> Key: SPARK-17439
> URL: https://issues.apache.org/jira/browse/SPARK-17439
> Project: Spark
>  Issue Type: Bug
>Reporter: Tim Hunter
>
> [~clockfly] found the following corner case that returns the wrong quantile 
> (off by 1):
> {code}
> test("test QuantileSummaries compression") {
> var left = new QuantileSummaries(1, 0.0001)
> System.out.println("LEFT  RIGHT")
> System.out.println("")
> (0 to 10).foreach { index =>
>   left = left.insert(index)
>   left = left.compress()
>   var right = new QuantileSummaries(1, 0.0001)
>   (0 to index).foreach(right.insert(_))
>   right = right.compress()
>   System.out.println(s"${left.query(0.5)}   ${right.query(0.5)}")
> }
>   }
> {code}
> The result is:
> {code}
> LEFT  RIGHT
> 
> 0.0   0.0
> 0.0   1.0
> 0.0   1.0
> 0.0   1.0
> 1.0   2.0
> 1.0   2.0
> 2.0   3.0
> 2.0   3.0
> 3.0   4.0
> 3.0   4.0
> 4.0   5.0
> {code}
> The value in the "LEFT" column represents the output when using 
> QuantileSummaries in a Window function, while the value in the "RIGHT" column 
> represents the expected result. The difference between the "LEFT" and "RIGHT" 
> columns is that the "LEFT" column does intermediate compression on the storage 
> of QuantileSummaries.






[jira] [Resolved] (SPARK-17052) Remove Duplicate Test Cases auto_join from HiveCompatibilitySuite.scala

2016-09-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17052.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14635
[https://github.com/apache/spark/pull/14635]

> Remove Duplicate Test Cases auto_join from HiveCompatibilitySuite.scala
> ---
>
> Key: SPARK-17052
> URL: https://issues.apache.org/jira/browse/SPARK-17052
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Priority: Minor
> Fix For: 2.1.0
>
>
> The original [JIRA 
> Hive-1642](https://issues.apache.org/jira/browse/HIVE-1642) delivered the 
> test cases `auto_joinXYZ` for verifying the results when the joins are 
> automatically converted to map-join. Basically, most of them are just copied 
> from the corresponding `joinXYZ`. 
> After comparison between `auto_joinXYZ` and `joinXYZ`, below is a list of 
> duplicate cases:
> {noformat}
> "auto_join0",
> "auto_join1",
> "auto_join10",
> "auto_join11",
> "auto_join12",
> "auto_join13",
> "auto_join14",
> "auto_join14_hadoop20",
> "auto_join15",
> "auto_join17",
> "auto_join18",
> "auto_join2",
> "auto_join20",
> "auto_join21",
> "auto_join23",
> "auto_join24",
> "auto_join3",
> "auto_join4",
> "auto_join5",
> "auto_join6",
> "auto_join7",
> "auto_join8",
> "auto_join9"
> {noformat}
> We can remove all of them without affecting the test coverage. 






[jira] [Updated] (SPARK-17052) Remove Duplicate Test Cases auto_join from HiveCompatibilitySuite.scala

2016-09-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17052:
---
Assignee: Xiao Li

> Remove Duplicate Test Cases auto_join from HiveCompatibilitySuite.scala
> ---
>
> Key: SPARK-17052
> URL: https://issues.apache.org/jira/browse/SPARK-17052
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
> Fix For: 2.1.0
>
>
> The original [JIRA 
> Hive-1642](https://issues.apache.org/jira/browse/HIVE-1642) delivered the 
> test cases `auto_joinXYZ` for verifying the results when the joins are 
> automatically converted to map-join. Basically, most of them are just copied 
> from the corresponding `joinXYZ`. 
> After comparison between `auto_joinXYZ` and `joinXYZ`, below is a list of 
> duplicate cases:
> {noformat}
> "auto_join0",
> "auto_join1",
> "auto_join10",
> "auto_join11",
> "auto_join12",
> "auto_join13",
> "auto_join14",
> "auto_join14_hadoop20",
> "auto_join15",
> "auto_join17",
> "auto_join18",
> "auto_join2",
> "auto_join20",
> "auto_join21",
> "auto_join23",
> "auto_join24",
> "auto_join3",
> "auto_join4",
> "auto_join5",
> "auto_join6",
> "auto_join7",
> "auto_join8",
> "auto_join9"
> {noformat}
> We can remove all of them without affecting the test coverage. 






[jira] [Commented] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-07 Thread Qifan Pu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471747#comment-15471747
 ] 

Qifan Pu commented on SPARK-17405:
--

[~joshrosen]
Yes, likely. The new hash map asks for 64MB per task, while the default 
single-node setting has only a few hundred MB of memory in total.
We settled on 64MB because of the single-memory-page design (for simplicity and 
performance), and because in production 64MB * cores << memory_capacity should 
hold.
Maybe we should increase the default memory a bit? Or is it bad in general to 
have such an upfront cost of 64MB per task?

> Simple aggregation query OOMing after SPARK-16525
> -
>
> Key: SPARK-17405
> URL: https://issues.apache.org/jira/browse/SPARK-17405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the 
> following query ran fine via Beeline / Thrift Server and the Spark shell, but 
> after that patch it is consistently OOMING:
> {code}
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS (
>   SELECT * FROM (VALUES
> (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
> (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 
> 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS 
> INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
> (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 
> 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 
> 00:00:00.0'), '211', -959, CAST(NULL AS STRING)),
> (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
> (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 
> 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, 
> TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'),
> (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 
> 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 
> 00:00:00.0'), '-345', 566, '-574'),
> (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 
> 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), 
> TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'),
> (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142')
>   ) as t);
> CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, 
> boolean_col_4, timestamp_col_5, decimal3317_col_6) AS (
>   SELECT * FROM (VALUES
> ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), 
> false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))),
> ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), 
> true, CAST(NULL AS TIMESTAMP), -653.51000BD),
> ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), 
> false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD),
> ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), 
> false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD),
> ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), 
> false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD),
> ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), 
> false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD)
>   ) as t);
> SELECT
> CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col,
> LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS 
> STRING)) AS decimal_col,
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col
> FROM table_3 t1
> INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND 
> ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = 
> (t1.string_col_1))
> WHERE
> (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9)
> GROUP BY
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9);
> {code}
> Here's the OOM:
> {code}
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 1.0 failed 1 times, most 

[jira] [Commented] (SPARK-17438) Master UI should show the correct core limit when `ApplicationInfo.executorLimit` is set

2016-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471741#comment-15471741
 ] 

Apache Spark commented on SPARK-17438:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15001

> Master UI should show the correct core limit when 
> `ApplicationInfo.executorLimit` is set
> 
>
> Key: SPARK-17438
> URL: https://issues.apache.org/jira/browse/SPARK-17438
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> The core info of an application in the Master UI doesn't consider 
> `ApplicationInfo.executorLimit`. It's pretty confusing that the UI says 
> "Unlimited" when `executorLimit` is set.






[jira] [Assigned] (SPARK-17438) Master UI should show the correct core limit when `ApplicationInfo.executorLimit` is set

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17438:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Master UI should show the correct core limit when 
> `ApplicationInfo.executorLimit` is set
> 
>
> Key: SPARK-17438
> URL: https://issues.apache.org/jira/browse/SPARK-17438
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> The core info of an application in the Master UI doesn't consider 
> `ApplicationInfo.executorLimit`. It's pretty confusing that the UI says 
> "Unlimited" when `executorLimit` is set.






[jira] [Assigned] (SPARK-17438) Master UI should show the correct core limit when `ApplicationInfo.executorLimit` is set

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17438:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Master UI should show the correct core limit when 
> `ApplicationInfo.executorLimit` is set
> 
>
> Key: SPARK-17438
> URL: https://issues.apache.org/jira/browse/SPARK-17438
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> The core info of an application in the Master UI doesn't consider 
> `ApplicationInfo.executorLimit`. It's pretty confusing that the UI says 
> "Unlimited" when `executorLimit` is set.






[jira] [Created] (SPARK-17438) Master UI should show the correct core limit when `ApplicationInfo.executorLimit` is set

2016-09-07 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-17438:


 Summary: Master UI should show the correct core limit when 
`ApplicationInfo.executorLimit` is set
 Key: SPARK-17438
 URL: https://issues.apache.org/jira/browse/SPARK-17438
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


The core info of an application in the Master UI doesn't consider 
`ApplicationInfo.executorLimit`. It's pretty confusing that the UI says 
"Unlimited" when `executorLimit` is set.






[jira] [Updated] (SPARK-17370) Shuffle service files not invalidated when a slave is lost

2016-09-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17370:
---
Fix Version/s: 2.0.1

> Shuffle service files not invalidated when a slave is lost
> --
>
> Key: SPARK-17370
> URL: https://issues.apache.org/jira/browse/SPARK-17370
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.0.1, 2.1.0
>
>
> DAGScheduler invalidates shuffle files when an executor loss event occurs, 
> but not when the external shuffle service is enabled. This is because when 
> shuffle service is on, the shuffle file lifetime can exceed the executor 
> lifetime.
> However, it doesn't invalidate shuffle files when the shuffle service itself 
> is lost (due to whole slave loss). This can cause long hangs when slaves are 
> lost since the file loss is not detected until a subsequent stage attempts to 
> read the shuffle files.






[jira] [Commented] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-07 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471690#comment-15471690
 ] 

Josh Rosen commented on SPARK-17405:


My hunch is that this is affected by the default number of cores in local mode: 
I think my MBP uses 16 tasks by default, while the default parallelism is 
probably lower in Jenkins (and perhaps on your machine). If you have trouble 
reproducing this issue, I'd try explicitly running {{local\[16]}} or 
{{local\[32]}} to see if that reproduces it.
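
For anyone trying this, a sketch of the suggestion above (the app name is
arbitrary): start the session with a high local core count so that many tasks
each request their 64MB page up front.

{code}
// Sketch: force 32 concurrent local tasks to try to reproduce the allocation
// failure described in this ticket. The app name is a placeholder.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[32]")           // 32 tasks, each requesting 64MB up front
  .appName("spark-17405-repro")
  .getOrCreate()

// ...then run the aggregation query from the description via spark.sql(...)
{code}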

> Simple aggregation query OOMing after SPARK-16525
> -
>
> Key: SPARK-17405
> URL: https://issues.apache.org/jira/browse/SPARK-17405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the 
> following query ran fine via Beeline / Thrift Server and the Spark shell, but 
> after that patch it is consistently OOMING:
> {code}
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS (
>   SELECT * FROM (VALUES
> (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
> (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 
> 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS 
> INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
> (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 
> 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 
> 00:00:00.0'), '211', -959, CAST(NULL AS STRING)),
> (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
> (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 
> 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, 
> TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'),
> (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 
> 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 
> 00:00:00.0'), '-345', 566, '-574'),
> (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 
> 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), 
> TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'),
> (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142')
>   ) as t);
> CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, 
> boolean_col_4, timestamp_col_5, decimal3317_col_6) AS (
>   SELECT * FROM (VALUES
> ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), 
> false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))),
> ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), 
> true, CAST(NULL AS TIMESTAMP), -653.51000BD),
> ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), 
> false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD),
> ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), 
> false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD),
> ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), 
> false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD),
> ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), 
> false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD)
>   ) as t);
> SELECT
> CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col,
> LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS 
> STRING)) AS decimal_col,
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col
> FROM table_3 t1
> INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND 
> ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = 
> (t1.string_col_1))
> WHERE
> (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9)
> GROUP BY
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9);
> {code}
> Here's the OOM:
> {code}
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in 

[jira] [Assigned] (SPARK-17437) uiWebUrl is not accessible to JavaSparkContext or pyspark.SparkContext

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17437:


Assignee: Apache Spark

> uiWebUrl is not accessible to JavaSparkContext or pyspark.SparkContext
> --
>
> Key: SPARK-17437
> URL: https://issues.apache.org/jira/browse/SPARK-17437
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, PySpark, Web UI
>Affects Versions: 2.0.0
>Reporter: Adrian Petrescu
>Assignee: Apache Spark
>
> The Scala version of {{SparkContext}} has a handy field called {{uiWebUrl}} 
> that tells you which URL the SparkUI spawned by that instance lives at. This 
> is often very useful because the value for {{spark.ui.port}} in the config is 
> only a suggestion; if that port number is taken by another Spark instance on 
> the same machine, Spark will just keep incrementing the port until it finds a 
> free one. So, on a machine with a lot of running PySpark instances, you often 
> have to start trying all of them one-by-one until you find your application 
> name.
> Scala users have a way around this with {{uiWebUrl}} but Java and Python 
> users do not. This ticket (and the attached PR) fix this in the most 
> straightforward way possible, simply propagating this field through the 
> {{JavaSparkContext}} and into pyspark through the Java gateway.
> Please let me know if any additional documentation/testing is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17437) uiWebUrl is not accessible to JavaSparkContext or pyspark.SparkContext

2016-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471667#comment-15471667
 ] 

Apache Spark commented on SPARK-17437:
--

User 'apetresc' has created a pull request for this issue:
https://github.com/apache/spark/pull/15000

> uiWebUrl is not accessible to JavaSparkContext or pyspark.SparkContext
> --
>
> Key: SPARK-17437
> URL: https://issues.apache.org/jira/browse/SPARK-17437
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, PySpark, Web UI
>Affects Versions: 2.0.0
>Reporter: Adrian Petrescu
>
> The Scala version of {{SparkContext}} has a handy field called {{uiWebUrl}} 
> that tells you which URL the SparkUI spawned by that instance lives at. This 
> is often very useful because the value for {{spark.ui.port}} in the config is 
> only a suggestion; if that port number is taken by another Spark instance on 
> the same machine, Spark will just keep incrementing the port until it finds a 
> free one. So, on a machine with a lot of running PySpark instances, you often 
> have to start trying all of them one-by-one until you find your application 
> name.
> Scala users have a way around this with {{uiWebUrl}} but Java and Python 
> users do not. This ticket (and the attached PR) fix this in the most 
> straightforward way possible, simply propagating this field through the 
> {{JavaSparkContext}} and into pyspark through the Java gateway.
> Please let me know if any additional documentation/testing is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17437) uiWebUrl is not accessible to JavaSparkContext or pyspark.SparkContext

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17437:


Assignee: (was: Apache Spark)

> uiWebUrl is not accessible to JavaSparkContext or pyspark.SparkContext
> --
>
> Key: SPARK-17437
> URL: https://issues.apache.org/jira/browse/SPARK-17437
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, PySpark, Web UI
>Affects Versions: 2.0.0
>Reporter: Adrian Petrescu
>
> The Scala version of {{SparkContext}} has a handy field called {{uiWebUrl}} 
> that tells you which URL the SparkUI spawned by that instance lives at. This 
> is often very useful because the value for {{spark.ui.port}} in the config is 
> only a suggestion; if that port number is taken by another Spark instance on 
> the same machine, Spark will just keep incrementing the port until it finds a 
> free one. So, on a machine with a lot of running PySpark instances, you often 
> have to start trying all of them one-by-one until you find your application 
> name.
> Scala users have a way around this with {{uiWebUrl}} but Java and Python 
> users do not. This ticket (and the attached PR) fix this in the most 
> straightforward way possible, simply propagating this field through the 
> {{JavaSparkContext}} and into pyspark through the Java gateway.
> Please let me know if any additional documentation/testing is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-07 Thread Qifan Pu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471640#comment-15471640
 ] 

Qifan Pu commented on SPARK-17405:
--

[~joshrosen] [~jlaskowski] Thanks for the comments and suggestions. I ran both 
of your queries on commit 03d77af9ec4ce9a42affd6ab4381ae5bd3c79a5a and both 
finished without any exceptions.
I'll do some static code analysis based on the log from [~jlaskowski].

> Simple aggregation query OOMing after SPARK-16525
> -
>
> Key: SPARK-17405
> URL: https://issues.apache.org/jira/browse/SPARK-17405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the 
> following query ran fine via Beeline / Thrift Server and the Spark shell, but 
> after that patch it is consistently OOMING:
> {code}
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS (
>   SELECT * FROM (VALUES
> (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
> (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 
> 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS 
> INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
> (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 
> 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 
> 00:00:00.0'), '211', -959, CAST(NULL AS STRING)),
> (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
> (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 
> 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, 
> TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'),
> (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 
> 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 
> 00:00:00.0'), '-345', 566, '-574'),
> (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 
> 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), 
> TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'),
> (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142')
>   ) as t);
> CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, 
> boolean_col_4, timestamp_col_5, decimal3317_col_6) AS (
>   SELECT * FROM (VALUES
> ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), 
> false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))),
> ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), 
> true, CAST(NULL AS TIMESTAMP), -653.51000BD),
> ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), 
> false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD),
> ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), 
> false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD),
> ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), 
> false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD),
> ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), 
> false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD)
>   ) as t);
> SELECT
> CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col,
> LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS 
> STRING)) AS decimal_col,
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col
> FROM table_3 t1
> INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND 
> ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = 
> (t1.string_col_1))
> WHERE
> (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9)
> GROUP BY
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9);
> {code}
> Here's the OOM:
> {code}
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 
> (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 
> bytes of 

[jira] [Created] (SPARK-17437) uiWebUrl is not accessible to JavaSparkContext or pyspark.SparkContext

2016-09-07 Thread Adrian Petrescu (JIRA)
Adrian Petrescu created SPARK-17437:
---

 Summary: uiWebUrl is not accessible to JavaSparkContext or 
pyspark.SparkContext
 Key: SPARK-17437
 URL: https://issues.apache.org/jira/browse/SPARK-17437
 Project: Spark
  Issue Type: Improvement
  Components: Java API, PySpark, Web UI
Affects Versions: 2.0.0
Reporter: Adrian Petrescu


The Scala version of {{SparkContext}} has a handy field called {{uiWebUrl}} 
that tells you which URL the SparkUI spawned by that instance lives at. This is 
often very useful because the value for {{spark.ui.port}} in the config is only 
a suggestion; if that port number is taken by another Spark instance on the 
same machine, Spark will just keep incrementing the port until it finds a free 
one. So, on a machine with a lot of running PySpark instances, you often have 
to try them one by one until you find the one running your application.

Scala users have a way around this with {{uiWebUrl}}, but Java and Python users 
do not. This ticket (and the attached PR) fixes this in the most straightforward 
way possible, simply propagating the field through {{JavaSparkContext}} 
and into pyspark via the Java gateway.

Please let me know if any additional documentation/testing is needed.
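
For reference, a minimal sketch (not part of this ticket or its PR) of reading 
the existing Scala-side field described above:
{code}
import org.apache.spark.sql.SparkSession

object UiWebUrlDemo {   // hypothetical demo object
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ui-web-url-demo").getOrCreate()
    // uiWebUrl is an Option[String]; it is empty when the UI is disabled.
    spark.sparkContext.uiWebUrl.foreach(url => println(s"Spark UI bound at: $url"))
    spark.stop()
  }
}
{code}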



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17370) Shuffle service files not invalidated when a slave is lost

2016-09-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17370.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14931
[https://github.com/apache/spark/pull/14931]

> Shuffle service files not invalidated when a slave is lost
> --
>
> Key: SPARK-17370
> URL: https://issues.apache.org/jira/browse/SPARK-17370
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.1.0
>
>
> DAGScheduler invalidates shuffle files when an executor loss event occurs, 
> but not when the external shuffle service is enabled. This is because when 
> shuffle service is on, the shuffle file lifetime can exceed the executor 
> lifetime.
> However, it doesn't invalidate shuffle files when the shuffle service itself 
> is lost (due to whole slave loss). This can cause long hangs when slaves are 
> lost since the file loss is not detected until a subsequent stage attempts to 
> read the shuffle files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17094) provide simplified API for ML pipeline

2016-09-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471563#comment-15471563
 ] 

Nick Pentreath commented on SPARK-17094:


It's true, that constructor doesn't exist. It could be {{new 
Pipeline().setStages(Array(new Tokenizer(), new CountVectorizer(), ...))}} instead.
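
Spelled out, a minimal sketch of that existing API (the column names and the 
LDA settings are illustrative, not taken from the comment):
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}

// Today each stage's input/output columns have to be wired up by hand,
// which is part of what this ticket proposes to avoid.
val tokenizer  = new Tokenizer().setInputCol("text").setOutputCol("tokens")
val vectorizer = new CountVectorizer().setInputCol("tokens").setOutputCol("features")
val lda        = new LDA().setFeaturesCol("features").setK(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, vectorizer, lda))
// val model = pipeline.fit(data)   // `data` is a DataFrame with a "text" column
{code}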

> provide simplified API for ML pipeline
> --
>
> Key: SPARK-17094
> URL: https://issues.apache.org/jira/browse/SPARK-17094
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>
> Many machine learning pipelines have an API for easily assembling transformers.
> One example would be:
> {code}
> val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data).
> {code}
> Overall, the feature would 
> 1. Allow people (especially starters) to create a ML application in one 
> simple line of code. 
> 2. And can be handy for users as they don't have to set the input, output 
> columns.
> 3. Thinking further, we may not need code any longer to build a Spark ML 
> application as it can be done by configuration:
> {code}
> "ml.pipeline.input": "hdfs://path.svm"
> "ml.pipeline": "tokenizer", "hashingTF", "lda"
> "ml.tokenizer.toLowercase": "false"
> ...
> {code}, which can be quite efficient for tuning on cluster.
> Appreciate feedback and suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17433) YarnShuffleService doesn't handle moving credentials levelDb

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17433:


Assignee: Apache Spark

> YarnShuffleService doesn't handle moving credentials levelDb
> 
>
> Key: SPARK-17433
> URL: https://issues.apache.org/jira/browse/SPARK-17433
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> In SPARK-16711, I added a leveldb to store credentials to fix an issue with 
> NM restart.  I missed that getRecoveryPath also handles moving the DB from 
> the old local dirs to the yarn recoverypath. This routine is hardcoded for 
> the registeredexecutors db and needs to be updated to handle the new 
> credentials db I added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17433) YarnShuffleService doesn't handle moving credentials levelDb

2016-09-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17433:


Assignee: (was: Apache Spark)

> YarnShuffleService doesn't handle moving credentials levelDb
> 
>
> Key: SPARK-17433
> URL: https://issues.apache.org/jira/browse/SPARK-17433
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> In SPARK-16711, I added a leveldb to store credentials to fix an issue with 
> NM restart.  I missed that getRecoveryPath also handles moving the DB from 
> the old local dirs to the yarn recoverypath. This routine is hardcoded for 
> the registeredexecutors db and needs to be updated to handle the new 
> credentials db I added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17433) YarnShuffleService doesn't handle moving credentials levelDb

2016-09-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471503#comment-15471503
 ] 

Apache Spark commented on SPARK-17433:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/14999

> YarnShuffleService doesn't handle moving credentials levelDb
> 
>
> Key: SPARK-17433
> URL: https://issues.apache.org/jira/browse/SPARK-17433
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> In SPARK-16711, I added a leveldb to store credentials to fix an issue with 
> NM restart.  I missed that getRecoveryPath also handles moving the DB from 
> the old local dirs to the yarn recoverypath. This routine is hardcoded for 
> the registeredexecutors db and needs to be updated to handle the new 
> credentials db I added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17431) How to get 2 years prior date from currentdate using Spark Sql

2016-09-07 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-17431.
-
Resolution: Not A Problem

> How to get 2 years prior date from currentdate using Spark Sql
> --
>
> Key: SPARK-17431
> URL: https://issues.apache.org/jira/browse/SPARK-17431
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Farman Ali
>
> I need to derive the date 2 years prior to the current date using a Spark SQL 
> query. For example: today's date is 2016-09-07; I need the date exactly 2 
> years before this, in the same format (yyyy-MM-dd).
> Please let me know if there are multiple approaches and, if so, which one 
> would be better.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17368) Scala value classes create encoder problems and break at runtime

2016-09-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471450#comment-15471450
 ] 

Sean Owen commented on SPARK-17368:
---

This is beyond my knowledge, I'm afraid. I'd help take a look if I can, but I'm 
not sure I'd know where to start on it myself!

> Scala value classes create encoder problems and break at runtime
> 
>
> Key: SPARK-17368
> URL: https://issues.apache.org/jira/browse/SPARK-17368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: JDK 8 on MacOS
> Scala 2.11.8
> Spark 2.0.0
>Reporter: Aris Vlasakakis
>
> Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 
> and 1.6.X.
> This simple Spark 2 application demonstrates that the code compiles but 
> breaks at runtime with the error below. The value class is, of course, 
> *FeatureId*, as it extends AnyVal.
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: Error while encoding: 
> java.lang.RuntimeException: Couldn't find v on int
> assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
> +- assertnotnull(input[0, int, true], top level non-flat input object).v
>+- assertnotnull(input[0, int, true], top level non-flat input object)
>   +- input[0, int, true]".
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> {noformat}
> Test code for Spark 2.0.0:
> {noformat}
> import org.apache.spark.sql.{Dataset, SparkSession}
> object BreakSpark {
>   case class FeatureId(v: Int) extends AnyVal
>   def main(args: Array[String]): Unit = {
> val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
> val spark = SparkSession.builder.getOrCreate()
> import spark.implicits._
> spark.sparkContext.setLogLevel("warn")
> val ds: Dataset[FeatureId] = spark.createDataset(seq)
> println(s"BREAK HERE: ${ds.count}")
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17429) spark sql length(1) return error

2016-09-07 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471443#comment-15471443
 ] 

Dongjoon Hyun commented on SPARK-17429:
---

Yes, I agree with [~hvanhovell]. It seems debatable: PostgreSQL raises an error 
like Spark does, while MySQL behaves like Hive.
{code}
ERROR:  function length(integer) does not exist
LINE 1: select length(1)
{code}
[~cenyuhai], could you make a PR for this? At the least, I think this should be 
discussed so we can reach a conclusion one way or the other.
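
In the meantime, a workaround sketch (not from the ticket; assumes a 2.x 
spark-shell session where {{spark}} is in scope): an explicit cast satisfies 
the current (string or binary) signature.
{code}
// Casting the literal to a string makes length() accept it today.
spark.sql("SELECT length(CAST(11 AS STRING)) AS len").show()
spark.sql("SELECT length(CAST(2.0 AS STRING)) AS len").show()
{code}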

> spark sql length(1) return error
> 
>
> Key: SPARK-17429
> URL: https://issues.apache.org/jira/browse/SPARK-17429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>
> select length(11);
> select length(2.0);
> These SQL statements return errors, but Hive accepts them.
> Error in query: cannot resolve 'length(11)' due to data type mismatch: 
> argument 1 requires (string or binary) type, however, '11' is of int type.; 
> line 1 pos 14
> Error in query: cannot resolve 'length(2.0)' due to data type mismatch: 
> argument 1 requires (string or binary) type, however, '2.0' is of double 
> type.; line 1 pos 14



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17368) Scala value classes create encoder problems and break at runtime

2016-09-07 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471428#comment-15471428
 ] 

Jakob Odersky commented on SPARK-17368:
---

Hmm, you're right; my assumption of using value classes only at the beginning 
and at the end was too naive.

[~srowen], how likely do you think it is that we can include a meta-encoder in 
Spark? It could be included in the form of an optional import. Since the 
existing encoders/ScalaReflection framework already uses runtime reflection, my 
guess is that adding compile-time reflection would not be too difficult.

> Scala value classes create encoder problems and break at runtime
> 
>
> Key: SPARK-17368
> URL: https://issues.apache.org/jira/browse/SPARK-17368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: JDK 8 on MacOS
> Scala 2.11.8
> Spark 2.0.0
>Reporter: Aris Vlasakakis
>
> Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 
> and 1.6.X.
> This simple Spark 2 application demonstrates that the code compiles but 
> breaks at runtime with the error below. The value class is, of course, 
> *FeatureId*, as it extends AnyVal.
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: Error while encoding: 
> java.lang.RuntimeException: Couldn't find v on int
> assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
> +- assertnotnull(input[0, int, true], top level non-flat input object).v
>+- assertnotnull(input[0, int, true], top level non-flat input object)
>   +- input[0, int, true]".
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> {noformat}
> Test code for Spark 2.0.0:
> {noformat}
> import org.apache.spark.sql.{Dataset, SparkSession}
> object BreakSpark {
>   case class FeatureId(v: Int) extends AnyVal
>   def main(args: Array[String]): Unit = {
> val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
> val spark = SparkSession.builder.getOrCreate()
> import spark.implicits._
> spark.sparkContext.setLogLevel("warn")
> val ds: Dataset[FeatureId] = spark.createDataset(seq)
> println(s"BREAK HERE: ${ds.count}")
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17431) How to get 2 years prior date from currentdate using Spark Sql

2016-09-07 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471414#comment-15471414
 ] 

Dongjoon Hyun commented on SPARK-17431:
---

Could you close this issue, [~farman.bsse1855]? I think you already got the 
answer to this from the mailing list.
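
For reference, one common approach (a sketch, not necessarily the answer from 
that thread; assumes a spark-shell session where {{spark}} is in scope):
{code}
// add_months with a negative offset shifts current_date() back 24 months;
// date_format keeps the yyyy-MM-dd form asked for in the ticket.
spark.sql(
  "SELECT date_format(add_months(current_date(), -24), 'yyyy-MM-dd') AS two_years_ago"
).show()
{code}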

> How to get 2 years prior date from currentdate using Spark Sql
> --
>
> Key: SPARK-17431
> URL: https://issues.apache.org/jira/browse/SPARK-17431
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Farman Ali
>
> I need to derive the date 2 years prior to the current date using a Spark SQL 
> query. For example: today's date is 2016-09-07; I need the date exactly 2 
> years before this, in the same format (yyyy-MM-dd).
> Please let me know if there are multiple approaches and, if so, which one 
> would be better.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17094) provide simplified API for ML pipeline

2016-09-07 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-17094:
---
Description: 
Many machine learning pipelines have an API for easily assembling transformers.

One example would be:
{code}
val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data).
{code}
Overall, the feature would 
1. Allow people (especially starters) to create a ML application in one simple 
line of code. 
2. And can be handy for users as they don't have to set the input, output 
columns.
3. Thinking further, we may not need code any longer to build a Spark ML 
application as it can be done by configuration:
{code}
"ml.pipeline.input": "hdfs://path.svm"
"ml.pipeline": "tokenizer", "hashingTF", "lda"
"ml.tokenizer.toLowercase": "false"
...
{code}, which can be quite efficient for tuning on cluster.

Appreciate feedback and suggestions.

  was:
Many machine learning pipeline has the API for easily assembling transformers.

One example would be:
{code}
val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data).
{code}
Overall, the feature would 
1. Allow people (especially starters) to create a ML application in one simple 
line of code. 
2. And can be handy for users as they don't have to set the input, output 
columns.
3. Thinking further, we may not need code any longer to build a Spark ML 
application as it can be done by configuration:
{code}
"ml.pipeline": "tokenizer", "hashingTF", "lda"
"ml.tokenizer.toLowercase": "false"
...
{code}, which can be quite efficient for tuning on cluster.

Appreciate feedback and suggestions.


> provide simplified API for ML pipeline
> --
>
> Key: SPARK-17094
> URL: https://issues.apache.org/jira/browse/SPARK-17094
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>
> Many machine learning pipelines have an API for easily assembling transformers.
> One example would be:
> {code}
> val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data).
> {code}
> Overall, the feature would 
> 1. Allow people (especially starters) to create a ML application in one 
> simple line of code. 
> 2. And can be handy for users as they don't have to set the input, output 
> columns.
> 3. Thinking further, we may not need code any longer to build a Spark ML 
> application as it can be done by configuration:
> {code}
> "ml.pipeline.input": "hdfs://path.svm"
> "ml.pipeline": "tokenizer", "hashingTF", "lda"
> "ml.tokenizer.toLowercase": "false"
> ...
> {code}, which can be quite efficient for tuning on cluster.
> Appreciate feedback and suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17094) provide simplified API for ML pipeline

2016-09-07 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-17094:
---
Description: 
Many machine learning pipelines have an API for easily assembling transformers.

One example would be:
{code}
val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data).
{code}
Overall, the feature would 
1. Allow people (especially starters) to create a ML application in one simple 
line of code. 
2. And can be handy for users as they don't have to set the input, output 
columns.
3. Thinking further, we may not need code any longer to build a Spark ML 
application as it can be done by configuration:
{code}
"ml.pipeline": "tokenizer", "hashingTF", "lda"
"ml.tokenizer.toLowercase": "false"
...
{code}, which can be quite efficient for tuning on cluster.

Appreciate feedback and suggestions.

  was:
Many machine learning pipeline has the API for easily assembling transformers.

One example would be:
val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data).

Overall, the feature would 
1. Allow people (especially starters) to create a ML application in one simple 
line of code. 
2. And can be handy for users as they don't have to set the input, output 
columns.
3. Thinking further, we may not need code any longer to build a Spark ML 
application as it can be done by configuration:
{code}
"ml.pipeline": "tokenizer", "hashingTF", "lda"
"ml.tokenizer.toLowercase": "false"
...
{code}, which can be quite efficient for tuning on cluster.

Appreciate feedback and suggestions.


> provide simplified API for ML pipeline
> --
>
> Key: SPARK-17094
> URL: https://issues.apache.org/jira/browse/SPARK-17094
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>
> Many machine learning pipelines have an API for easily assembling transformers.
> One example would be:
> {code}
> val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data).
> {code}
> Overall, the feature would 
> 1. Allow people (especially starters) to create a ML application in one 
> simple line of code. 
> 2. And can be handy for users as they don't have to set the input, output 
> columns.
> 3. Thinking further, we may not need code any longer to build a Spark ML 
> application as it can be done by configuration:
> {code}
> "ml.pipeline": "tokenizer", "hashingTF", "lda"
> "ml.tokenizer.toLowercase": "false"
> ...
> {code}, which can be quite efficient for tuning on cluster.
> Appreciate feedback and suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17094) provide simplified API for ML pipeline

2016-09-07 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-17094:
---
Description: 
Many machine learning pipelines have an API for easily assembling transformers.

One example would be:
val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data).

Overall, the feature would 
1. Allow people (especially starters) to create a ML application in one simple 
line of code. 
2. And can be handy for users as they don't have to set the input, output 
columns.
3. Thinking further, we may not need code any longer to build a Spark ML 
application as it can be done by configuration:
{code}
"ml.pipeline": "tokenizer", "hashingTF", "lda"
"ml.tokenizer.toLowercase": "false"
...
{code}, which can be quite efficient for tuning on cluster.

Appreciate feedback and suggestions.

  was:
Many machine learning pipeline has the API for easily assembling transformers.

One example would be:
val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data).

Appreciate feedback and suggestions.


> provide simplified API for ML pipeline
> --
>
> Key: SPARK-17094
> URL: https://issues.apache.org/jira/browse/SPARK-17094
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>
> Many machine learning pipelines have an API for easily assembling transformers.
> One example would be:
> val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data).
> Overall, the feature would 
> 1. Allow people (especially starters) to create a ML application in one 
> simple line of code. 
> 2. And can be handy for users as they don't have to set the input, output 
> columns.
> 3. Thinking further, we may not need code any longer to build a Spark ML 
> application as it can be done by configuration:
> {code}
> "ml.pipeline": "tokenizer", "hashingTF", "lda"
> "ml.tokenizer.toLowercase": "false"
> ...
> {code}, which can be quite efficient for tuning on cluster.
> Appreciate feedback and suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17094) provide simplified API for ML pipeline

2016-09-07 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471366#comment-15471366
 ] 

yuhao yang commented on SPARK-17094:


Thanks for the comment, Sean. The two questions were great. 
1. For the configuration, it might be something like 
{code}
pipeline("tokenizer").asInstanceOf[Tokenizer].set...
pipeline(2).asInstanceOf[Tokenizer].set...
{code}
It would be great if there's a way to avoid the cast.
Eventually, I think it would be great to have configuration support for ML 
transformers, so we could do:
{code}
sc.set("ml.tokenizer.toLowercase", "false")
{code}
and configuration-file support, which would avoid hard coding and be a great 
help for tuning on a cluster. (Anyone like the idea? cc [~josephkb] [~mengxr])

2. I'm thinking most users would only use linear pipelines. Could you please 
provide an example of a non-linear pipeline, so we can have a concrete 
discussion?

I tried your code, but I cannot find a constructor for Pipeline like that. Is it 
something under development? And do we still need to set the input and output 
columns for each stage?

Overall, the feature would:
1. Allow people (especially starters) to create an ML application in one simple 
line of code.
2. Be handy for users, since they don't have to set the input and output 
columns.
3. Thinking further, we may not need code at all to build a Spark ML 
application, as it can be done by configuration:
{code}
"ml.pipeline": "tokenizer", "hashingTF", "lda"
"ml.tokenizer.toLowercase": "false"
...
{code}




> provide simplified API for ML pipeline
> --
>
> Key: SPARK-17094
> URL: https://issues.apache.org/jira/browse/SPARK-17094
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>
> Many machine learning pipelines have an API for easily assembling transformers.
> One example would be:
> val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data).
> Appreciate feedback and suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


