[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081744#comment-16081744 ] Felix Cheung commented on SPARK-21367: -- I think I found the first error, it's one build before the build failures listed above, 79470 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79470/console {code} Updating roxygen version in /home/jenkins/workspace/SparkPullRequestBuilder/R/pkg/DESCRIPTION Deleting AFTSurvivalRegressionModel-class.Rd Deleting ALSModel-class.Rd ... There were 50 or more warnings (use warnings() to see the first 50) {code} > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20331) Broaden support for Hive partition pruning predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-20331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20331: --- Assignee: Michael Allman > Broaden support for Hive partition pruning predicate pushdown > - > > Key: SPARK-20331 > URL: https://issues.apache.org/jira/browse/SPARK-20331 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michael Allman >Assignee: Michael Allman > Fix For: 2.3.0 > > > Spark 2.1 introduced scalable support for Hive tables with huge numbers of > partitions. Key to leveraging this support is the ability to prune > unnecessary table partitions to answer queries. Spark supports a subset of > the class of partition pruning predicates that the Hive metastore supports. > If a user writes a query with a partition pruning predicate that is *not* > supported by Spark, Spark falls back to loading all partitions and pruning > client-side. We want to broaden Spark's current partition pruning predicate > pushdown capabilities. > One of the key missing capabilities is support for disjunctions. For example, > for a table partitioned by date, specifying with a predicate like > {code}date = 20161011 or date = 20161014{code} > will result in Spark fetching all partitions. For a table partitioned by date > and hour, querying a range of hours across dates can be quite difficult to > accomplish without fetching all partition metadata. > The current partition pruning support supports only comparisons against > literals. We can expand that to foldable expressions by evaluating them at > planning time. > We can also implement support for the "IN" comparison by expanding it to a > sequence of "OR"s. > This ticket covers those enhancements. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20331) Broaden support for Hive partition pruning predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-20331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20331. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 17633 [https://github.com/apache/spark/pull/17633] > Broaden support for Hive partition pruning predicate pushdown > - > > Key: SPARK-20331 > URL: https://issues.apache.org/jira/browse/SPARK-20331 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michael Allman > Fix For: 2.3.0 > > > Spark 2.1 introduced scalable support for Hive tables with huge numbers of > partitions. Key to leveraging this support is the ability to prune > unnecessary table partitions to answer queries. Spark supports a subset of > the class of partition pruning predicates that the Hive metastore supports. > If a user writes a query with a partition pruning predicate that is *not* > supported by Spark, Spark falls back to loading all partitions and pruning > client-side. We want to broaden Spark's current partition pruning predicate > pushdown capabilities. > One of the key missing capabilities is support for disjunctions. For example, > for a table partitioned by date, specifying with a predicate like > {code}date = 20161011 or date = 20161014{code} > will result in Spark fetching all partitions. For a table partitioned by date > and hour, querying a range of hours across dates can be quite difficult to > accomplish without fetching all partition metadata. > The current partition pruning support supports only comparisons against > literals. We can expand that to foldable expressions by evaluating them at > planning time. > We can also implement support for the "IN" comparison by expanding it to a > sequence of "OR"s. > This ticket covers those enhancements. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
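As a concrete illustration of the predicate shapes described in SPARK-20331 above, the following sketch uses the public DataFrame API; the table name {{events}} and the assumption that it is a Hive table partitioned by {{date}} are illustrative only.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// A disjunction over a partition column; with this change it can be pushed to
// the Hive metastore instead of forcing Spark to fetch every partition.
spark.table("events").where("date = 20161011 OR date = 20161014").count()

// An "IN" predicate is logically a chain of "OR"s, so it can be expanded and
// pushed down the same way.
spark.table("events").where(col("date").isin(20161011, 20161014)).count()
{code}

Foldable expressions such as {{date = 20161011 + 3}} can likewise be handled by evaluating the constant side at planning time before pushing the comparison down.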
[jira] [Comment Edited] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081720#comment-16081720 ] Felix Cheung edited comment on SPARK-21367 at 7/11/17 6:33 AM: --- I'm not sure exactly why yet, but comparing the working and non-working build working: {code} First time using roxygen2 4.0. Upgrading automatically... Writing SparkDataFrame.Rd Writing printSchema.Rd Writing schema.Rd Writing explain.Rd ... Warning messages: 1: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 2: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 * installing *source* package 'SparkR' ... {code} not working: {code} First time using roxygen2 4.0. Upgrading automatically... There were 50 or more warnings (use warnings() to see the first 50) * installing *source* package 'SparkR' ... {code} Bascially, the .Rd files are not getting created (because of warnings that are not captured) That cause the CRAN check to fail with "checking for missing documentation entries ... WARNING Undocumented code objects: '%<=>%' 'add_months' 'agg' 'approxCountDistinc" (which would be completely expected - without Rd files it will not have the documentation hence the check will fail) was (Author: felixcheung): I'm not sure exactly why yet, but comparing the working and non-working build working: {code} First time using roxygen2 4.0. Upgrading automatically... Writing SparkDataFrame.Rd Writing printSchema.Rd Writing schema.Rd Writing explain.Rd ... Warning messages: 1: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 2: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 * installing *source* package 'SparkR' ... {code} not working: {code} First time using roxygen2 4.0. Upgrading automatically... There were 50 or more warnings (use warnings() to see the first 50) * installing *source* package 'SparkR' ... {code} Bascially, the .Rd files are not getting created (because of warnings that are not captured) That cause the CRAN check to fail with "checking for missing documentation entries ... WARNING Undocumented code objects: '%<=>%' 'add_months' 'agg' 'approxCountDistinc" (which would be to be expected) > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081744#comment-16081744 ] Felix Cheung edited comment on SPARK-21367 at 7/11/17 6:26 AM: --- I think I found the first error, it's one build before the build failures listed above, 79470 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79470/console {code} Updating roxygen version in /home/jenkins/workspace/SparkPullRequestBuilder/R/pkg/DESCRIPTION Deleting AFTSurvivalRegressionModel-class.Rd Deleting ALSModel-class.Rd ... There were 50 or more warnings (use warnings() to see the first 50) {code} Whereas this build from mid June https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/SparkPullRequestBuilder/78020/console was (Author: felixcheung): I think I found the first error, it's one build before the build failures listed above, 79470 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79470/console {code} Updating roxygen version in /home/jenkins/workspace/SparkPullRequestBuilder/R/pkg/DESCRIPTION Deleting AFTSurvivalRegressionModel-class.Rd Deleting ALSModel-class.Rd ... There were 50 or more warnings (use warnings() to see the first 50) {code} > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081744#comment-16081744 ] Felix Cheung edited comment on SPARK-21367 at 7/11/17 6:26 AM: --- I think I found the first error, it's one build before the build failures listed above, 79470 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79470/console {code} Updating roxygen version in /home/jenkins/workspace/SparkPullRequestBuilder/R/pkg/DESCRIPTION Deleting AFTSurvivalRegressionModel-class.Rd Deleting ALSModel-class.Rd ... There were 50 or more warnings (use warnings() to see the first 50) {code} Whereas this build from mid June https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/SparkPullRequestBuilder/78020/console Does NOT have this "Need roxygen2 >= 5.0.0 but loaded version is 4.1.1" message in the console output was (Author: felixcheung): I think I found the first error, it's one build before the build failures listed above, 79470 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79470/console {code} Updating roxygen version in /home/jenkins/workspace/SparkPullRequestBuilder/R/pkg/DESCRIPTION Deleting AFTSurvivalRegressionModel-class.Rd Deleting ALSModel-class.Rd ... There were 50 or more warnings (use warnings() to see the first 50) {code} Whereas this build from mid June https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/SparkPullRequestBuilder/78020/console > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-21367: - Comment: was deleted (was: it looks like instead of 5.x, the older 4.0 is being loaded? First time using roxygen2 4.0. Upgrading automatically... ) > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081720#comment-16081720 ] Felix Cheung edited comment on SPARK-21367 at 7/11/17 6:06 AM: --- I'm not sure exactly why yet, but comparing the working and non-working build working: {code} First time using roxygen2 4.0. Upgrading automatically... Writing SparkDataFrame.Rd Writing printSchema.Rd Writing schema.Rd Writing explain.Rd ... Warning messages: 1: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 2: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 * installing *source* package 'SparkR' ... {code} not working: {code} First time using roxygen2 4.0. Upgrading automatically... There were 50 or more warnings (use warnings() to see the first 50) * installing *source* package 'SparkR' ... {code} Bascially, the .Rd files are not getting created (because of warnings that are not captured) That cause the CRAN check to fail with "checking for missing documentation entries ... WARNING Undocumented code objects: '%<=>%' 'add_months' 'agg' 'approxCountDistinc" (which would be to be expected) As explained in the description above, I"m pretty sure these are not in the build a while ago " Warning messages: 1: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 " was (Author: felixcheung): I'm not sure exactly why yet, but comparing the working and non-working buid working: {code} First time using roxygen2 4.0. Upgrading automatically... Writing SparkDataFrame.Rd Writing printSchema.Rd Writing schema.Rd Writing explain.Rd ... Warning messages: 1: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 2: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 * installing *source* package 'SparkR' ... {code} not working: {code} First time using roxygen2 4.0. Upgrading automatically... There were 50 or more warnings (use warnings() to see the first 50) * installing *source* package 'SparkR' ... {code} Bascially, the .Rd files are not getting created (because of warnings that are not captured) That cause the CRAN check to fail with "checking for missing documentation entries ... WARNING Undocumented code objects: '%<=>%' 'add_months' 'agg' 'approxCountDistinc" (which would be to be expected) As explained in the description above, I"m pretty sure these are not in the build a while ago " Warning messages: 1: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 " > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081723#comment-16081723 ] Felix Cheung commented on SPARK-21367: -- And I'm pretty sure we should build with Roxygen2 5.0.1 https://github.com/apache/spark/blob/master/R/pkg/DESCRIPTION#L60 RoxygenNote: 5.0.1 > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19285) Java - Provide user-defined function of 0 arguments (UDF0)
[ https://issues.apache.org/jira/browse/SPARK-19285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19285: Component/s: (was: Java API) SQL > Java - Provide user-defined function of 0 arguments (UDF0) > -- > > Key: SPARK-19285 > URL: https://issues.apache.org/jira/browse/SPARK-19285 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Amit Baghel >Priority: Minor > > I need to implement a zero-argument UDF, but the Spark Java API doesn't provide > UDF0. > https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/api/java > As a workaround, I am creating a UDF1 with one argument and not using that > argument. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
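The workaround mentioned in the SPARK-19285 description looks roughly like the following; it is shown with the Scala API (the ticket itself asks for a Java {{UDF0}} interface), and the function name {{app_version}} is illustrative.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Register a one-argument UDF and ignore the argument, since there is no
// zero-argument UDF interface yet.
spark.udf.register("app_version", (_: String) => "1.0.0")

// Callers must pass a dummy argument just to satisfy the arity.
spark.sql("SELECT app_version('ignored')").show()
{code}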
[jira] [Comment Edited] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081720#comment-16081720 ] Felix Cheung edited comment on SPARK-21367 at 7/11/17 6:02 AM: --- I'm not sure exactly why yet, but comparing the working and non-working buid working: {code} First time using roxygen2 4.0. Upgrading automatically... Writing SparkDataFrame.Rd Writing printSchema.Rd Writing schema.Rd Writing explain.Rd ... Warning messages: 1: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 2: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 * installing *source* package 'SparkR' ... {code} not working: {code} First time using roxygen2 4.0. Upgrading automatically... There were 50 or more warnings (use warnings() to see the first 50) * installing *source* package 'SparkR' ... {code} Bascially, the .Rd files are not getting created (because of warnings that are not captured) That cause the CRAN check to fail with "checking for missing documentation entries ... WARNING Undocumented code objects: '%<=>%' 'add_months' 'agg' 'approxCountDistinc" (which would be to be expected) As explained in the description above, I"m pretty sure these are not in the build a while ago " Warning messages: 1: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 " was (Author: felixcheung): I'm not sure exactly why yet, but comparing the working and non-working buid working: First time using roxygen2 4.0. Upgrading automatically... Writing SparkDataFrame.Rd Writing printSchema.Rd Writing schema.Rd Writing explain.Rd ... Warning messages: 1: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 2: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 * installing *source* package 'SparkR' ... not working: First time using roxygen2 4.0. Upgrading automatically... There were 50 or more warnings (use warnings() to see the first 50) * installing *source* package 'SparkR' ... Bascially, the .Rd files are not getting created (because of warnings that are not captured) That cause the CRAN check to fail with "checking for missing documentation entries ... WARNING Undocumented code objects: '%<=>%' 'add_months' 'agg' 'approxCountDistinc" (which would be to be expected) As explained in the description above, I"m pretty sure these are not in the build a while ago " Warning messages: 1: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 " > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081720#comment-16081720 ] Felix Cheung commented on SPARK-21367: -- I'm not sure exactly why yet, but comparing the working and non-working build working: First time using roxygen2 4.0. Upgrading automatically... Writing SparkDataFrame.Rd Writing printSchema.Rd Writing schema.Rd Writing explain.Rd ... Warning messages: 1: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 2: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 * installing *source* package 'SparkR' ... not working: First time using roxygen2 4.0. Upgrading automatically... There were 50 or more warnings (use warnings() to see the first 50) * installing *source* package 'SparkR' ... Basically, the .Rd files are not getting created (because of warnings that are not captured). That causes the CRAN check to fail with "checking for missing documentation entries ... WARNING Undocumented code objects: '%<=>%' 'add_months' 'agg' 'approxCountDistinc" (which is to be expected). As explained in the description above, I'm pretty sure these were not in the build a while ago " Warning messages: 1: In check_dep_version(pkg, version, compare) : Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 " > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081712#comment-16081712 ] Felix Cheung commented on SPARK-21367: -- it looks like instead of 5.x, the older 4.0 is being loaded? First time using roxygen2 4.0. Upgrading automatically... > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21360) Spark failing to query SQL Server. Query contains a column having space in where clause
[ https://issues.apache.org/jira/browse/SPARK-21360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081685#comment-16081685 ] feroz khan commented on SPARK-21360: Thanks for inputs. I will go through the link. > Spark failing to query SQL Server. Query contains a column having space in > where clause > - > > Key: SPARK-21360 > URL: https://issues.apache.org/jira/browse/SPARK-21360 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: feroz khan > > I have a table on table on Microsoft SQL server > === > CREATE TABLE [dbo].[aircraftdata]( > [ID] [float] NULL, > [SN] [float] NULL, > [F1] [float] NULL, > [F 2] [float] NULL, > > ) ON [PRIMARY] > GO > = > I have a scala component that take data integration request in form of xml > and create an sql query to fetch data. Suppose i want to read column "ID" and > "F 2" , generated query is - > SELECT `id` AS `p_id` , `F 2` AS `p_F2` FROM Maqplex_IrisDataset_aircraftdata > WHERE Maqplex_IrisDataset_aircraftdata.`F 2` = '.001' > this fails with error - > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): com.microsoft.sqlserver.jdbc.SQLServerException: > Incorrect syntax near '2'. > at > com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1515) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:404) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:350) > at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:5696) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1715) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:180) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:155) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:285) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:408) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:379) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) > at > org.apach
[jira] [Updated] (SPARK-21360) Spark failing to query SQL Server. Query contains a column having space in where clause
[ https://issues.apache.org/jira/browse/SPARK-21360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feroz khan updated SPARK-21360: --- Description: I have a table on table on Microsoft SQL server === CREATE TABLE [dbo].[aircraftdata]( [ID] [float] NULL, [SN] [float] NULL, [F1] [float] NULL, [F 2] [float] NULL, ) ON [PRIMARY] GO = I have a scala component that take data integration request in form of xml and create an sql query to fetch data. Suppose i want to read column "ID" and "F 2" , generated query is - SELECT `id` AS `p_id` , `F 2` AS `p_F2` FROM Maqplex_IrisDataset_aircraftdata WHERE Maqplex_IrisDataset_aircraftdata.`F 2` = '.001' this fails with error - org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near '2'. at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216) at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1515) at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:404) at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:350) at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:5696) at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1715) at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:180) at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:155) at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:285) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:408) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:379) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903) at org
[jira] [Closed] (SPARK-21361) Spark failing to query SQL Server. Query contains a column having space in where clause
[ https://issues.apache.org/jira/browse/SPARK-21361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feroz khan closed SPARK-21361. -- Created a duplicate issue. SPARK-21360 is open for resolution. > Spark failing to query SQL Server. Query contains a column having space in > where clause > - > > Key: SPARK-21361 > URL: https://issues.apache.org/jira/browse/SPARK-21361 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: feroz khan >Priority: Blocker > > I have a table on table on SQL server > === > CREATE TABLE [dbo].[aircraftdata]( > [ID] [float] NULL, > [SN] [float] NULL, > [F1] [float] NULL, > [F 2] [float] NULL, > > ) ON [PRIMARY] > GO > = > I have a scala component that take data integration request in form of xml > and create an sql query on the dataframe to fetch data. Suppose i want to > read column "ID" and "F 2" and generate query as - > SELECT `id` AS `p_id` , `F 2` AS `p_F2` FROM Maqplex_IrisDataset_aircraftdata > WHERE Maqplex_IrisDataset_aircraftdata.`F 2` = '.001' > this fails with error - > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): com.microsoft.sqlserver.jdbc.SQLServerException: > Incorrect syntax near '2'. > at > com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1515) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:404) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:350) > at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:5696) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1715) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:180) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:155) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:285) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:408) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:379) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) > at > org.apache.sp
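One possible user-side workaround for the quoting failure described in SPARK-21360/SPARK-21361 is to register a custom JDBC dialect that quotes pushed-down identifiers with square brackets, which SQL Server accepts for column names containing spaces such as {{F 2}}. This is a sketch only, not the resolution of the tickets.

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Quote identifiers with [brackets] so that column names containing spaces
// survive pushdown to SQL Server.
object SqlServerBracketDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def quoteIdentifier(colName: String): String = s"[$colName]"
}

// Register the dialect before reading the table through the JDBC data source.
JdbcDialects.registerDialect(SqlServerBracketDialect)
{code}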
[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081518#comment-16081518 ] Jiang Xingbo commented on SPARK-21349: -- [~dongjoon] Are you running the test for Spark SQL? Or running some user-defined RDD directly? This information should help us narrow down the scope of the problem. Thanks! > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits a warning when the task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. > According to the Jenkins log, we also have 123 warnings even in our unit tests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
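For illustration of what SPARK-21349 proposes, usage might look like the sketch below once the threshold is exposed as a normal configuration; the key name {{spark.task.sizeToWarnKb}} is a placeholder, not a name decided by the ticket.

{code}
import org.apache.spark.sql.SparkSession

// Hypothetical usage of the proposed setting; the configuration key is a
// placeholder and does not exist in Spark today.
val spark = SparkSession.builder()
  .appName("task-size-warning-demo")
  .config("spark.task.sizeToWarnKb", "500")  // warn only for tasks larger than 500 KB
  .getOrCreate()
{code}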
[jira] [Assigned] (SPARK-19285) Java - Provide user-defined function of 0 arguments (UDF0)
[ https://issues.apache.org/jira/browse/SPARK-19285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-19285: --- Assignee: Xiao Li > Java - Provide user-defined function of 0 arguments (UDF0) > -- > > Key: SPARK-19285 > URL: https://issues.apache.org/jira/browse/SPARK-19285 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Amit Baghel >Assignee: Xiao Li >Priority: Minor > > I need to implement zero argument UDF but Spark java api doesn't provide > UDF0. > https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/api/java > For workaround I am creating UDF1 with one argument and not using this > argument. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19285) Java - Provide user-defined function of 0 arguments (UDF0)
[ https://issues.apache.org/jira/browse/SPARK-19285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081662#comment-16081662 ] Apache Spark commented on SPARK-19285: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/18598 > Java - Provide user-defined function of 0 arguments (UDF0) > -- > > Key: SPARK-19285 > URL: https://issues.apache.org/jira/browse/SPARK-19285 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Amit Baghel >Assignee: Xiao Li >Priority: Minor > > I need to implement zero argument UDF but Spark java api doesn't provide > UDF0. > https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/api/java > For workaround I am creating UDF1 with one argument and not using this > argument. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19285) Java - Provide user-defined function of 0 arguments (UDF0)
[ https://issues.apache.org/jira/browse/SPARK-19285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19285: Assignee: Apache Spark (was: Xiao Li) > Java - Provide user-defined function of 0 arguments (UDF0) > -- > > Key: SPARK-19285 > URL: https://issues.apache.org/jira/browse/SPARK-19285 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Amit Baghel >Assignee: Apache Spark >Priority: Minor > > I need to implement zero argument UDF but Spark java api doesn't provide > UDF0. > https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/api/java > For workaround I am creating UDF1 with one argument and not using this > argument. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19285) Java - Provide user-defined function of 0 arguments (UDF0)
[ https://issues.apache.org/jira/browse/SPARK-19285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19285: Assignee: Xiao Li (was: Apache Spark) > Java - Provide user-defined function of 0 arguments (UDF0) > -- > > Key: SPARK-19285 > URL: https://issues.apache.org/jira/browse/SPARK-19285 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Amit Baghel >Assignee: Xiao Li >Priority: Minor > > I need to implement zero argument UDF but Spark java api doesn't provide > UDF0. > https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/api/java > For workaround I am creating UDF1 with one argument and not using this > argument. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21371) dev/make-distribution.sh scripts use of $@ without ""
liuzhaokun created SPARK-21371: -- Summary: dev/make-distribution.sh scripts use of $@ without "" Key: SPARK-21371 URL: https://issues.apache.org/jira/browse/SPARK-21371 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.1.1 Reporter: liuzhaokun Priority: Trivial dev/make-distribution.sh uses $@ without quotes, which changes how the arguments are split: if a parameter contains a space, it will be treated as two parameters. Meanwhile, other scripts in Spark use "$@" with quotes, which is correct. dev/make-distribution.sh should be consistent with them, because the quoted form is safer. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20456) Add examples for functions collection for pyspark
[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081628#comment-16081628 ] Apache Spark commented on SPARK-20456: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/18597 > Add examples for functions collection for pyspark > - > > Key: SPARK-20456 > URL: https://issues.apache.org/jira/browse/SPARK-20456 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 2.1.0 >Reporter: Michael Patterson >Assignee: Michael Patterson >Priority: Minor > Fix For: 2.3.0 > > > Document sql.functions.py: > 1. Add examples for the common string functions (upper, lower, and reverse) > 2. Rename columns in datetime examples to be more informative (e.g. from 'd' > to 'date') > 3. Add examples for unix_timestamp, from_unixtime, rand, randn, collect_list, > collect_set, lit, > 4. Add note to all trigonometry functions that units are radians. > 5. Add links between functions, (e.g. add link to radians from toRadians) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21315) Skip some spill files when generateIterator(startIndex) in ExternalAppendOnlyUnsafeRowArray.
[ https://issues.apache.org/jira/browse/SPARK-21315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21315. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18541 [https://github.com/apache/spark/pull/18541] > Skip some spill files when generateIterator(startIndex) in > ExternalAppendOnlyUnsafeRowArray. > > > Key: SPARK-21315 > URL: https://issues.apache.org/jira/browse/SPARK-21315 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: jin xing > Fix For: 2.3.0 > > > In the current code, it is expensive to use > {{UnboundedFollowingWindowFunctionFrame}}, because it iterates from the > start to the lower bound on every call to the {{write}} method. When traversing the > iterator, it's possible to skip some spilled files and thus save some time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21315) Skip some spill files when generateIterator(startIndex) in ExternalAppendOnlyUnsafeRowArray.
[ https://issues.apache.org/jira/browse/SPARK-21315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21315: --- Assignee: jin xing > Skip some spill files when generateIterator(startIndex) in > ExternalAppendOnlyUnsafeRowArray. > > > Key: SPARK-21315 > URL: https://issues.apache.org/jira/browse/SPARK-21315 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: jin xing >Assignee: jin xing > Fix For: 2.3.0 > > > In current code, it is expensive to use > {{UnboundedFollowingWindowFunctionFrame}}, because it is iterating from the > start to lower bound every time calling {{write}} method. When traverse the > iterator, it's possible to skip some spilled files thus to save some time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21371) dev/make-distribution.sh scripts use of $@ without ""
[ https://issues.apache.org/jira/browse/SPARK-21371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21371: Assignee: Apache Spark > dev/make-distribution.sh scripts use of $@ without "" > - > > Key: SPARK-21371 > URL: https://issues.apache.org/jira/browse/SPARK-21371 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Assignee: Apache Spark >Priority: Trivial > > dev/make-distribution.sh scripts use of $@ without " ",this will affect the > length of args.For example, if there is a space in the parameter,it will be > identified as two parameter.Mean while,other modules in spark have used $@ > with " ",it's right,I think dev/make-distribution.sh should be consistent > with others,because it's safety. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21350) Fix the error message when the number of arguments is wrong when invoking a UDF
[ https://issues.apache.org/jira/browse/SPARK-21350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21350. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18574 [https://github.com/apache/spark/pull/18574] > Fix the error message when the number of arguments is wrong when invoking a > UDF > --- > > Key: SPARK-21350 > URL: https://issues.apache.org/jira/browse/SPARK-21350 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.3.0 > > > Got a confusing error message when the number of arguments is wrong when > invoking a UDF. > {noformat} > val df = spark.emptyDataFrame > spark.udf.register("foo", (_: String).length) > df.selectExpr("foo(2, 3, 4)") > {noformat} > {noformat} > org.apache.spark.sql.UDFSuite$$anonfun$9$$anonfun$apply$mcV$sp$12 cannot be > cast to scala.Function3 > java.lang.ClassCastException: > org.apache.spark.sql.UDFSuite$$anonfun$9$$anonfun$apply$mcV$sp$12 cannot be > cast to scala.Function3 > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF.(ScalaUDF.scala:109) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21369) Don't use Scala classes in external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21369. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 Issue resolved by pull request 18593 [https://github.com/apache/spark/pull/18593] > Don't use Scala classes in external shuffle service > --- > > Key: SPARK-21369 > URL: https://issues.apache.org/jira/browse/SPARK-21369 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.2.1, 2.3.0 > > > Right now the external shuffle service uses Scala Tuple2. However, the Scala > library won't be shaded into the yarn shuffle assembly jar. Then, when that code > is called, it will throw a ClassNotFoundException. > Right now it's safe because we disabled spark.reducer.maxReqSizeShuffleToMem > by default. However, to allow Yarn users to use spark.reducer.maxReqSizeShuffleToMem, > we should remove all usages of Tuples. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21371) dev/make-distribution.sh scripts use of $@ without ""
[ https://issues.apache.org/jira/browse/SPARK-21371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21371: Assignee: (was: Apache Spark) > dev/make-distribution.sh scripts use of $@ without "" > - > > Key: SPARK-21371 > URL: https://issues.apache.org/jira/browse/SPARK-21371 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Priority: Trivial > > dev/make-distribution.sh scripts use of $@ without " ",this will affect the > length of args.For example, if there is a space in the parameter,it will be > identified as two parameter.Mean while,other modules in spark have used $@ > with " ",it's right,I think dev/make-distribution.sh should be consistent > with others,because it's safety. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21371) dev/make-distribution.sh scripts use of $@ without ""
[ https://issues.apache.org/jira/browse/SPARK-21371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081585#comment-16081585 ] Apache Spark commented on SPARK-21371: -- User 'liu-zhaokun' has created a pull request for this issue: https://github.com/apache/spark/pull/18596 > dev/make-distribution.sh scripts use of $@ without "" > - > > Key: SPARK-21371 > URL: https://issues.apache.org/jira/browse/SPARK-21371 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Priority: Trivial > > dev/make-distribution.sh uses $@ without surrounding quotes, which changes how the > arguments are split. For example, an argument that contains a space is > interpreted as two separate arguments. Other scripts in Spark already quote "$@", > which is correct; dev/make-distribution.sh should be made consistent with them, > since quoting is safer. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21043) Add unionByName API to Dataset
[ https://issues.apache.org/jira/browse/SPARK-21043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21043. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.3.0 > Add unionByName API to Dataset > -- > > Key: SPARK-21043 > URL: https://issues.apache.org/jira/browse/SPARK-21043 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Takeshi Yamamuro > Fix For: 2.3.0 > > > It would be useful to add unionByName which resolves columns by name, in > addition to the existing union (which resolves by position). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
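A brief usage illustration of the unionByName API added by SPARK-21043, assuming a Spark version that includes the feature (2.3.0 per the fix version above):
{code}
// union resolves columns by position; unionByName resolves them by name.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("unionByName-example").getOrCreate()
import spark.implicits._

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((6, 4, 5)).toDF("col2", "col0", "col1")   // same columns, different order

df1.union(df2).show()        // positional: the second row comes out as (6, 4, 5)
df1.unionByName(df2).show()  // by name: the second row comes out as (4, 5, 6)
{code}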
[jira] [Closed] (SPARK-21370) Avoid doing anything on HDFSBackedStateStore.abort() when there are no updates to commit
[ https://issues.apache.org/jira/browse/SPARK-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz closed SPARK-21370. --- Resolution: Not A Problem > Avoid doing anything on HDFSBackedStateStore.abort() when there are no > updates to commit > > > Key: SPARK-21370 > URL: https://issues.apache.org/jira/browse/SPARK-21370 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Minor > > Currently the HDFSBackedStateStore sets it's state as UPDATING as it is > initialized. > For every trigger, we create two state stores, one used by > "StateStoreRestore" operator to only read data and one by "StateStoreSave" > operator to write updates. So, the "Restore" StateStore is read-only. This > state store gets "aborted" after a task is completed, and this abort attempts > to delete files > This can be avoided if there is an INITIALIZED state and abort deletes files > only when there is an update to the state store using "put" or "remove". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
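A minimal sketch of the idea in SPARK-21370, with invented names (SketchStateStore, deltaFile, State): abort() only needs to touch files when the store has actually received a put or remove. This is not the HDFSBackedStateStore implementation itself.
{code}
// Hypothetical state machine illustrating the proposal; names are invented.
import java.nio.file.{Files, Path}

class SketchStateStore(deltaFile: Path) {
  private object State extends Enumeration { val Initialized, Updating, Committed, Aborted = Value }
  private var state = State.Initialized

  def put(key: String, value: Array[Byte]): Unit = { state = State.Updating /* ... stage the update ... */ }
  def remove(key: String): Unit = { state = State.Updating /* ... stage a tombstone ... */ }

  def abort(): Unit = {
    // Only clean up on-disk state if this attempt actually wrote something.
    if (state == State.Updating) Files.deleteIfExists(deltaFile)
    state = State.Aborted
  }
}
{code}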
[jira] [Updated] (SPARK-20604) Allow Imputer to handle all numeric types
[ https://issues.apache.org/jira/browse/SPARK-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20604: -- Issue Type: Improvement (was: Bug) > Allow Imputer to handle all numeric types > - > > Key: SPARK-20604 > URL: https://issues.apache.org/jira/browse/SPARK-20604 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Wayne Zhang > > Imputer currently requires input column to be Double or Float, but the logic > should work on any numeric data types. Many practical problems have integer > data types, and it could get very tedious to manually cast them into Double > before calling imputer. This transformer could be extended to handle all > numeric types. > The example below shows failure of Imputer on integer data. > {code} > val df = spark.createDataFrame( Seq( > (0, 1.0, 1.0, 1.0), > (1, 11.0, 11.0, 11.0), > (2, 1.5, 1.5, 1.5), > (3, Double.NaN, 4.5, 1.5) > )).toDF("id", "value1", "expected_mean_value1", "expected_median_value1") > val imputer = new Imputer() > .setInputCols(Array("value1")) > .setOutputCols(Array("out1")) > imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType))) > java.lang.IllegalArgumentException: requirement failed: Column value1 must be > of type equal to one of the following types: [DoubleType, FloatType] but was > actually of type IntegerType. > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
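Until the transformer accepts all numeric types, the manual cast the description calls tedious looks like this; intDf stands for any DataFrame whose "value1" column is integer-typed:
{code}
// Workaround sketch: cast the integer column to Double so the current
// Imputer schema check ([DoubleType, FloatType]) accepts it.
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val castDf = intDf.withColumn("value1", col("value1").cast(DoubleType))

val imputer = new Imputer()
  .setInputCols(Array("value1"))
  .setOutputCols(Array("out1"))

imputer.fit(castDf).transform(castDf).show()
{code}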
[jira] [Updated] (SPARK-21241) Add intercept to StreamingLinearRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-21241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21241: -- Issue Type: New Feature (was: Bug) > Add intercept to StreamingLinearRegressionWithSGD > - > > Key: SPARK-21241 > URL: https://issues.apache.org/jira/browse/SPARK-21241 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Affects Versions: 2.3.0 >Reporter: Soulaimane GUEDRIA > > StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept > Method which offers the possibility to turn on/off the intercept value. API > parity is not respected between Python and Scala. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20133) User guide for spark.ml.stat.ChiSquareTest
[ https://issues.apache.org/jira/browse/SPARK-20133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081503#comment-16081503 ] Joseph K. Bradley commented on SPARK-20133: --- Sorry for the slow response; please feel free to! > User guide for spark.ml.stat.ChiSquareTest > -- > > Key: SPARK-20133 > URL: https://issues.apache.org/jira/browse/SPARK-20133 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Add new user guide section for spark.ml.stat, and document ChiSquareTest. > This may involve adding new example scripts. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
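A minimal example of the API the requested guide section would document (spark is assumed to be an existing SparkSession; the data is illustrative):
{code}
// Basic spark.ml.stat.ChiSquareTest usage for a user guide section.
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest
import spark.implicits._   // spark: an existing SparkSession

val df = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (1.0, Vectors.dense(3.5, 40.0))
).toDF("label", "features")

val chi = ChiSquareTest.test(df, "features", "label").head
println(s"pValues = ${chi.getAs[Vector](0)}")
println(s"degreesOfFreedom = ${chi.getSeq[Int](1).mkString("[", ",", "]")}")
println(s"statistics = ${chi.getAs[Vector](2)}")
{code}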
[jira] [Updated] (SPARK-21370) Avoid doing anything on HDFSBackedStateStore.abort() when there are no updates to commit
[ https://issues.apache.org/jira/browse/SPARK-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-21370: -- Priority: Minor (was: Major) > Avoid doing anything on HDFSBackedStateStore.abort() when there are no > updates to commit > > > Key: SPARK-21370 > URL: https://issues.apache.org/jira/browse/SPARK-21370 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Minor > > Currently the HDFSBackedStateStore sets it's state as UPDATING as it is > initialized. > For every trigger, we create two state stores, one used by > "StateStoreRestore" operator to only read data and one by "StateStoreSave" > operator to write updates. So, the "Restore" StateStore is read-only. This > state store gets "aborted" after a task is completed, and this abort attempts > to delete files > This can be avoided if there is an INITIALIZED state and abort deletes files > only when there is an update to the state store using "put" or "remove". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21359) frequency discretizer
[ https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081467#comment-16081467 ] Fu Shanshan commented on SPARK-21359: - But why, given the example input Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 1.0), (6, 9.1), (7, 10.1), (8, 1.1), (9, 16.0), (10, 20.0), (11, 20.0)), is the QuantileDiscretizer result the following?
+---+----+------+
| id|hour|result|
+---+----+------+
|  0|18.0|   3.0|
|  1|19.0|   3.0|
|  2| 8.0|   1.0|
|  3| 5.0|   1.0|
|  4| 2.2|   1.0|
|  5| 1.0|   0.0|
|  6| 9.1|   2.0|
|  7|10.1|   2.0|
|  8| 1.1|   0.0|
|  9|16.0|   2.0|
| 10|20.0|   3.0|
| 11|20.0|   3.0|
+---+----+------+
The value 18.0 falls into bin 3. I thought that was because the discretizer builds equal-width bins, giving boundaries (0, 5, 10, 15, 20), which would put 18 in the last bin. Under the equal-frequency definition, though, the boundaries should be roughly (-inf, 5.0, 10.1, 19, +inf or 20), which would put 18 in bin 2 rather than the last bin. I am not sure whether I have misunderstood the question. Thank you for your patience. > frequency discretizer > - > > Key: SPARK-21359 > URL: https://issues.apache.org/jira/browse/SPARK-21359 > Project: Spark > Issue Type: New JIRA Project > Components: ML >Affects Versions: 2.1.1 >Reporter: Fu Shanshan > > Typically data is discretized into partitions of K equal lengths/width (equal > intervals) or K% of the total data (equal frequencies) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
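To make the comparison above concrete, this is how the result in the comment is typically produced; QuantileDiscretizer derives its split points from approximate quantiles of the data, not from equal-width intervals (spark is assumed to be an existing SparkSession):
{code}
// Reproduce the discussed output: 4 buckets chosen from approximate quantiles.
import org.apache.spark.ml.feature.QuantileDiscretizer

val data = Seq((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 1.0),
               (6, 9.1), (7, 10.1), (8, 1.1), (9, 16.0), (10, 20.0), (11, 20.0))
val df = spark.createDataFrame(data).toDF("id", "hour")

val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("result")
  .setNumBuckets(4)

val bucketizer = discretizer.fit(df)
println(bucketizer.getSplits.mkString("[", ", ", "]"))  // the learned, quantile-based split points
bucketizer.transform(df).show()
{code}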
[jira] [Commented] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting
[ https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081500#comment-16081500 ] Joseph K. Bradley commented on SPARK-21086: --- I like the idea for that path, but it could become really long in some cases, so I'd prefer to use indices instead for robustness. Driver memory shouldn't be a big problem since all models are already collected to the driver. > CrossValidator, TrainValidationSplit should preserve all models after fitting > - > > Key: SPARK-21086 > URL: https://issues.apache.org/jira/browse/SPARK-21086 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > I've heard multiple requests for having CrossValidatorModel and > TrainValidationSplitModel preserve the full list of fitted models. This > sounds very valuable. > One decision should be made before we do this: Should we save and load the > models in ML persistence? That could blow up the size of a saved Pipeline if > the models are large. > * I suggest *not* saving the models by default but allowing saving if > specified. We could specify whether to save the model as an extra Param for > CrossValidatorModelWriter, but we would have to make sure to expose > CrossValidatorModelWriter as a public API and modify the return type of > CrossValidatorModel.write to be CrossValidatorModelWriter (but this will not > be a breaking change). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
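A hypothetical shape for the API being discussed; the names collectSubModels, subModels and persistSubModels are invented here to illustrate the proposal and are not an existing API at the time of this comment (pipeline, paramGrid, evaluator and training are assumed to be defined):
{code}
// Hypothetical sketch of keeping and optionally persisting all fitted models.
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel}

val cv = new CrossValidator()
  .setEstimator(pipeline)            // assumed: an existing Pipeline
  .setEstimatorParamMaps(paramGrid)  // assumed: an existing Array[ParamMap]
  .setEvaluator(evaluator)           // assumed: an existing Evaluator
  .setNumFolds(3)
  // .setCollectSubModels(true)      // hypothetical switch to keep every fitted model

val cvModel: CrossValidatorModel = cv.fit(training)   // training: assumed DataFrame
// cvModel.subModels                                  // hypothetical: models indexed by fold and param map
// cvModel.write.option("persistSubModels", "true").save(path)  // hypothetical writer option
{code}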
[jira] [Resolved] (SPARK-21358) Argument of repartitionandsortwithinpartitions at pyspark
[ https://issues.apache.org/jira/browse/SPARK-21358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-21358. - Resolution: Fixed Assignee: chie hayashida Fix Version/s: 2.3.0 > Argument of repartitionandsortwithinpartitions at pyspark > - > > Key: SPARK-21358 > URL: https://issues.apache.org/jira/browse/SPARK-21358 > Project: Spark > Issue Type: Improvement > Components: Documentation, Examples >Affects Versions: 2.1.1 >Reporter: chie hayashida >Assignee: chie hayashida >Priority: Minor > Fix For: 2.3.0 > > > In rdd.py, implementation of repartitionandsortwithinpartitions is below. > {code} > def repartitionAndSortWithinPartitions(self, numPartitions=None, > partitionFunc=portable_hash, >ascending=True, keyfunc=lambda x: > x): > {code} > And at document, there is following sample script. > {code} > >>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, > 3)]) > >>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, > 2) > {code} > The third argument (ascending) expected to be boolean, so following script is > better, I think. > {code} > >>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, > 3)]) > >>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, > True) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21341) Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel
[ https://issues.apache.org/jira/browse/SPARK-21341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081483#comment-16081483 ] Joseph K. Bradley commented on SPARK-21341: --- +1 for the built-in save/load. Saving as an object file is not something MLlib is meant to support. > Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel > - > > Key: SPARK-21341 > URL: https://issues.apache.org/jira/browse/SPARK-21341 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: Zied Sellami > > I am using sparkContext.saveAsObjectFile to save a complex object containing a > PipelineModel with a Word2Vec ML Transformer. When I load the object and call > myPipelineModel.transform, Word2VecModel raises a NullPointerException at line > 292 of Word2Vec.scala ("wordVectors.getVectors"). I worked around the problem by > removing the @transient annotation on val wordVectors and the @transient lazy val on > the getVectors function. > - Why are these two vals transient? > - Is there a way to add a boolean option on the Word2Vec Transformer to force > the serialization of wordVectors? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
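For reference, the built-in persistence the comment recommends instead of saveAsObjectFile (the path is illustrative; myPipelineModel and inputDf are assumed to exist):
{code}
// Use ML persistence rather than saveAsObjectFile for pipelines containing Word2Vec.
import org.apache.spark.ml.PipelineModel

val path = "/tmp/my-pipeline-model"           // illustrative path
myPipelineModel.write.overwrite().save(path)  // myPipelineModel: a fitted PipelineModel
val reloaded = PipelineModel.load(path)
reloaded.transform(inputDf)                   // inputDf: the DataFrame to score
{code}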
[jira] [Updated] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR
[ https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21208: -- Issue Type: New Feature (was: Bug) > Ability to "setLocalProperty" from sc, in sparkR > > > Key: SPARK-21208 > URL: https://issues.apache.org/jira/browse/SPARK-21208 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 2.1.1 >Reporter: Karuppayya > > Checked the API > [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for > sparkR. > Was not able to find a way to *setLocalProperty* on sc. > Need ability to *setLocalProperty* on sparkContext(similar to available for > pyspark, scala) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
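For comparison, the Scala capability the request asks to expose in SparkR (the pool name is illustrative; spark is assumed to be an existing SparkSession):
{code}
// Scala equivalent of the requested SparkR API.
val sc = spark.sparkContext
sc.setLocalProperty("spark.scheduler.pool", "reportingPool")  // applies to jobs submitted from this thread
// ... submit jobs ...
sc.setLocalProperty("spark.scheduler.pool", null)             // clear the property afterwards
{code}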
[jira] [Resolved] (SPARK-21341) Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel
[ https://issues.apache.org/jira/browse/SPARK-21341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-21341. --- Resolution: Not A Problem > Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel > - > > Key: SPARK-21341 > URL: https://issues.apache.org/jira/browse/SPARK-21341 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: Zied Sellami > > I am using sparkContext.saveAsObjectFile to save a complex object containing a > PipelineModel with a Word2Vec ML Transformer. When I load the object and call > myPipelineModel.transform, Word2VecModel raises a NullPointerException at line > 292 of Word2Vec.scala ("wordVectors.getVectors"). I worked around the problem by > removing the @transient annotation on val wordVectors and the @transient lazy val on > the getVectors function. > - Why are these two vals transient? > - Is there a way to add a boolean option on the Word2Vec Transformer to force > the serialization of wordVectors? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21364) IndexOutOfBoundsException on equality check of two complex array elements
[ https://issues.apache.org/jira/browse/SPARK-21364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21364. -- Resolution: Cannot Reproduce I can't reproduce this against the current master by the same reproducer in this JIRA description. I guess it is properly backported per [~kiszk]'s comment above. I don't know which JIRA fixes it so resolving this as a Cannot Reproduce. Please fix my resolution if anyone knows. > IndexOutOfBoundsException on equality check of two complex array elements > - > > Key: SPARK-21364 > URL: https://issues.apache.org/jira/browse/SPARK-21364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Vivek Patangiwar >Priority: Minor > > Getting an IndexOutOfBoundsException with the following code: > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.SparkSession > object ArrayEqualityTest { > def main(s:Array[String]) { > val sparkSession = > SparkSession.builder().master("local[*]").appName("app").getOrCreate() > val sqlContext = sparkSession.sqlContext > val sc = sparkSession.sqlContext.sparkContext > import sparkSession.implicits._ > val df = > sqlContext.read.json(sc.parallelize(Seq("{\"menu\":{\"id\":\"file\",\"value\":\"File\",\"popup\":{\"menuitem\":[{\"value\":\"New\",\"onclick\":\"CreateNewDoc()\"},{\"value\":\"Open\",\"onclick\":\"OpenDoc()\"},{\"value\":\"Close\",\"onclick\":\"CloseDoc()\"}]}}}"))) > > df.select($"menu.popup.menuitem"(lit(0)).===($"menu.popup.menuitem"(lit(1.show > } > } > Here's the complete stack-trace: > Exception in thread "main" java.lang.IndexOutOfBoundsException: 1 > at > scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65) > at scala.collection.immutable.List.apply(List.scala:84) > at > org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:64) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$3.apply(GenerateOrdering.scala:76) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$3.apply(GenerateOrdering.scala:75) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:75) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:68) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.genComp(CodeGenerator.scala:559) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.genEqual(CodeGenerator.scala:486) > at > org.apache.spark.sql.catalyst.expressions.EqualTo$$anonfun$doGenCode$4.apply(predicates.scala:437) > at > 
org.apache.spark.sql.catalyst.expressions.EqualTo$$anonfun$doGenCode$4.apply(predicates.scala:437) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression$$anonfun$defineCodeGen$2.apply(Expression.scala:442) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression$$anonfun$defineCodeGen$2.apply(Expression.scala:441) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:460) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:441) > at > org.apache.spark.sql.catalyst.expressions.EqualTo.doGenCode(predicates.scala:437) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101) > at > org.apache.spark.sql.execution.Pr
[jira] [Updated] (SPARK-21370) Avoid doing anything on HDFSBackedStateStore.abort() when there are no updates to commit
[ https://issues.apache.org/jira/browse/SPARK-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-21370: -- Description: Currently the HDFSBackedStateStore sets it's state as UPDATING as it is initialized. For every trigger, we create two state stores, one used by "StateStoreRestore" operator to only read data and one by "StateStoreSave" operator to write updates. So, the "Restore" StateStore is read-only. This state store gets "aborted" after a task is completed, and this abort attempts to delete files This can be avoided if there is an INITIALIZED state and abort deletes files only when there is an update to the state store using "put" or "remove". was: Currently the HDFSBackedStateStore sets it's state as UPDATING as it is initialized. For every trigger, we create two state stores, one used during "Restore" and one during "Save". The "Restore" StateStore is read-only. This state store gets "aborted" after a task is completed, which results in a file being created and immediately deleted. This can be avoided if there is an INITIALIZED state and abort deletes files only when there is an update to the state store using "put" or "remove". > Avoid doing anything on HDFSBackedStateStore.abort() when there are no > updates to commit > > > Key: SPARK-21370 > URL: https://issues.apache.org/jira/browse/SPARK-21370 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > Currently the HDFSBackedStateStore sets it's state as UPDATING as it is > initialized. > For every trigger, we create two state stores, one used by > "StateStoreRestore" operator to only read data and one by "StateStoreSave" > operator to write updates. So, the "Restore" StateStore is read-only. This > state store gets "aborted" after a task is completed, and this abort attempts > to delete files > This can be avoided if there is an INITIALIZED state and abort deletes files > only when there is an update to the state store using "put" or "remove". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21370) Avoid doing anything on HDFSBackedStateStore.abort() when there are no updates to commit
[ https://issues.apache.org/jira/browse/SPARK-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-21370: -- Summary: Avoid doing anything on HDFSBackedStateStore.abort() when there are no updates to commit (was: Clarify In-Memory State Store purpose (read-only, read-write) with an additional state) > Avoid doing anything on HDFSBackedStateStore.abort() when there are no > updates to commit > > > Key: SPARK-21370 > URL: https://issues.apache.org/jira/browse/SPARK-21370 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > Currently the HDFSBackedStateStore sets it's state as UPDATING as it is > initialized. > For every trigger, we create two state stores, one used during "Restore" and > one during "Save". The "Restore" StateStore is read-only. This state store > gets "aborted" after a task is completed, which results in a file being > created and immediately deleted. > This can be avoided if there is an INITIALIZED state and abort deletes files > only when there is an update to the state store using "put" or "remove". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21370) Clarify In-Memory State Store purpose (read-only, read-write) with an additional state
Burak Yavuz created SPARK-21370: --- Summary: Clarify In-Memory State Store purpose (read-only, read-write) with an additional state Key: SPARK-21370 URL: https://issues.apache.org/jira/browse/SPARK-21370 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.1.1 Reporter: Burak Yavuz Assignee: Burak Yavuz Currently the HDFSBackedStateStore sets it's state as UPDATING as it is initialized. For every trigger, we create two state stores, one used during "Restore" and one during "Save". The "Restore" StateStore is read-only. This state store gets "aborted" after a task is completed, which results in a file being created and immediately deleted. This can be avoided if there is an INITIALIZED state and abort deletes files only when there is an update to the state store using "put" or "remove". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081431#comment-16081431 ] Wenchen Fan commented on SPARK-21349: - [~rxin] that only helps with internal accumulators, but it seems the problem here is that we have too many SQL metrics. Maybe we should prioritize the SQL metrics accumulators and add some special optimization for them to reduce the size. > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits a warning when the task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. > According to the Jenkins log, we also have 123 such warnings even in our unit tests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21370) Clarify In-Memory State Store purpose (read-only, read-write) with an additional state
[ https://issues.apache.org/jira/browse/SPARK-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081436#comment-16081436 ] Apache Spark commented on SPARK-21370: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/18595 > Clarify In-Memory State Store purpose (read-only, read-write) with an > additional state > -- > > Key: SPARK-21370 > URL: https://issues.apache.org/jira/browse/SPARK-21370 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > Currently the HDFSBackedStateStore sets it's state as UPDATING as it is > initialized. > For every trigger, we create two state stores, one used during "Restore" and > one during "Save". The "Restore" StateStore is read-only. This state store > gets "aborted" after a task is completed, which results in a file being > created and immediately deleted. > This can be avoided if there is an INITIALIZED state and abort deletes files > only when there is an update to the state store using "put" or "remove". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21370) Clarify In-Memory State Store purpose (read-only, read-write) with an additional state
[ https://issues.apache.org/jira/browse/SPARK-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21370: Assignee: Apache Spark (was: Burak Yavuz) > Clarify In-Memory State Store purpose (read-only, read-write) with an > additional state > -- > > Key: SPARK-21370 > URL: https://issues.apache.org/jira/browse/SPARK-21370 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Burak Yavuz >Assignee: Apache Spark > > Currently the HDFSBackedStateStore sets it's state as UPDATING as it is > initialized. > For every trigger, we create two state stores, one used during "Restore" and > one during "Save". The "Restore" StateStore is read-only. This state store > gets "aborted" after a task is completed, which results in a file being > created and immediately deleted. > This can be avoided if there is an INITIALIZED state and abort deletes files > only when there is an update to the state store using "put" or "remove". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21370) Clarify In-Memory State Store purpose (read-only, read-write) with an additional state
[ https://issues.apache.org/jira/browse/SPARK-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21370: Assignee: Burak Yavuz (was: Apache Spark) > Clarify In-Memory State Store purpose (read-only, read-write) with an > additional state > -- > > Key: SPARK-21370 > URL: https://issues.apache.org/jira/browse/SPARK-21370 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > Currently the HDFSBackedStateStore sets it's state as UPDATING as it is > initialized. > For every trigger, we create two state stores, one used during "Restore" and > one during "Save". The "Restore" StateStore is read-only. This state store > gets "aborted" after a task is completed, which results in a file being > created and immediately deleted. > This can be avoided if there is an INITIALIZED state and abort deletes files > only when there is an update to the state store using "put" or "remove". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20904) Task failures during shutdown cause problems with preempted executors
[ https://issues.apache.org/jira/browse/SPARK-20904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20904: Assignee: Apache Spark > Task failures during shutdown cause problems with preempted executors > - > > Key: SPARK-20904 > URL: https://issues.apache.org/jira/browse/SPARK-20904 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > Spark runs tasks in a thread pool that uses daemon threads in each executor. > That means that when the JVM gets a signal to shut down, those tasks keep > running. > Now when YARN preempts an executor, it sends a SIGTERM to the process, > triggering the JVM shutdown. That causes shutdown hooks to run which may > cause user code running in those tasks to fail, and report task failures to > the driver. Those failures are then counted towards the maximum number of > allowed failures, even though in this case we don't want that because the > executor was preempted. > So we need a better way to handle that situation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20904) Task failures during shutdown cause problems with preempted executors
[ https://issues.apache.org/jira/browse/SPARK-20904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081404#comment-16081404 ] Apache Spark commented on SPARK-20904: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/18594 > Task failures during shutdown cause problems with preempted executors > - > > Key: SPARK-20904 > URL: https://issues.apache.org/jira/browse/SPARK-20904 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin > > Spark runs tasks in a thread pool that uses daemon threads in each executor. > That means that when the JVM gets a signal to shut down, those tasks keep > running. > Now when YARN preempts an executor, it sends a SIGTERM to the process, > triggering the JVM shutdown. That causes shutdown hooks to run which may > cause user code running in those tasks to fail, and report task failures to > the driver. Those failures are then counted towards the maximum number of > allowed failures, even though in this case we don't want that because the > executor was preempted. > So we need a better way to handle that situation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20904) Task failures during shutdown cause problems with preempted executors
[ https://issues.apache.org/jira/browse/SPARK-20904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20904: Assignee: (was: Apache Spark) > Task failures during shutdown cause problems with preempted executors > - > > Key: SPARK-20904 > URL: https://issues.apache.org/jira/browse/SPARK-20904 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin > > Spark runs tasks in a thread pool that uses daemon threads in each executor. > That means that when the JVM gets a signal to shut down, those tasks keep > running. > Now when YARN preempts an executor, it sends a SIGTERM to the process, > triggering the JVM shutdown. That causes shutdown hooks to run which may > cause user code running in those tasks to fail, and report task failures to > the driver. Those failures are then counted towards the maximum number of > allowed failures, even though in this case we don't want that because the > executor was preempted. > So we need a better way to handle that situation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21362) Add JDBCDialect for Apache Drill
[ https://issues.apache.org/jira/browse/SPARK-21362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081377#comment-16081377 ] Dongjoon Hyun commented on SPARK-21362: --- Hi, [~radford1] Usually, in the Spark community, the *Assignee* field is filled in by committers after the PR is actually merged. You can just proceed and make a PR. Since you have noted your intention here, probably no one else will start on this. > Add JDBCDialect for Apache Drill > > > Key: SPARK-21362 > URL: https://issues.apache.org/jira/browse/SPARK-21362 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: David Radford >Priority: Minor > > Apache Drill does not allow quotation marks ("), so a custom JDBC dialect is > needed that returns field names surrounded by backticks (`), similar to how the > MySQL dialect works. This requires an override of the method quoteIdentifier. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
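A sketch of the dialect the description asks for, modeled on how the MySQL dialect quotes identifiers; the jdbc:drill URL prefix and the registration call shown here are assumptions rather than a merged implementation:
{code}
// Sketch of a Drill JDBC dialect that quotes identifiers with backticks.
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

object DrillDialect extends JdbcDialect {
  // Assumed URL prefix for Drill's JDBC driver.
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:drill")

  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

// Register the dialect before reading from Drill through the JDBC data source.
JdbcDialects.registerDialect(DrillDialect)
{code}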
[jira] [Updated] (SPARK-20920) ForkJoinPool pools are leaked when writing hive tables with many partitions
[ https://issues.apache.org/jira/browse/SPARK-20920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20920: Fix Version/s: (was: 2.3.0) > ForkJoinPool pools are leaked when writing hive tables with many partitions > --- > > Key: SPARK-20920 > URL: https://issues.apache.org/jira/browse/SPARK-20920 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Rares Mirica >Assignee: Sean Owen > Fix For: 2.1.2, 2.2.0 > > > This bug is loosely related to SPARK-17396 > In this case it happens when writing to a hive table with many, many, > partitions (my table is partitioned by hour and stores data it gets from > kafka in a spark streaming application): > df.repartition() > .write > .format("orc") > .option("path", s"$tablesStoragePath/$tableName") > .mode(SaveMode.Append) > .partitionBy("dt", "hh") > .saveAsTable(tableName) > As this table grows beyond a certain size, ForkJoinPool pools start leaking. > Upon examination (with a debugger) I found that the caller is > AlterTableRecoverPartitionsCommand and the problem happens when > `evalTaskSupport` is used (line 555). I have tried setting a very large > threshold via `spark.rdd.parallelListingThreshold` and the problem went away. > My assumption is that the problem happens in this case and not in the one in > SPARK-17396 due to the fact that AlterTableRecoverPartitionsCommand is a case > class while UnionRDD is an object so multiple instances are not possible, > therefore no leak. > Regards, > Rares -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21059) LikeSimplification can NPE on null pattern
[ https://issues.apache.org/jira/browse/SPARK-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21059: Fix Version/s: (was: 2.3.0) > LikeSimplification can NPE on null pattern > -- > > Key: SPARK-21059 > URL: https://issues.apache.org/jira/browse/SPARK-21059 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21079) ANALYZE TABLE fails to calculate totalSize for a partitioned table
[ https://issues.apache.org/jira/browse/SPARK-21079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21079: Labels: (was: easyfix) > ANALYZE TABLE fails to calculate totalSize for a partitioned table > -- > > Key: SPARK-21079 > URL: https://issues.apache.org/jira/browse/SPARK-21079 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 >Reporter: Maria >Assignee: Maria > Fix For: 2.2.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > ANALYZE TABLE table COMPUTE STATISTICS invoked for a partition table produces > totalSize = 0. > AnalyzeTableCommand fetches table-level storage URI and calculated total size > of files in the corresponding directory recursively. However, for partitioned > tables, each partition has its own storage URI which may not be a > subdirectory of the table-level storage URI. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17196) Can not initializeing SparkConent plus Kerberos env
[ https://issues.apache.org/jira/browse/SPARK-17196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-17196. Resolution: Invalid There's not a whole lot of information here, but from the fact that this works on every version that I remember, this might be something with the HDP distro you're using, so I'd check with them first. > Can not initializeing SparkConent plus Kerberos env > --- > > Key: SPARK-17196 > URL: https://issues.apache.org/jira/browse/SPARK-17196 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 > Environment: HDP 2.3.4(Spark 1.5.2)+Kerberos >Reporter: sangshenghong > > When we submit a application and get the following exception : > java.lang.ClassNotFoundException: org.spark_project.protobuf.GeneratedMessage > at java.net.URLClassLoader$1.run(URLClassLoader.java:366) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at > com.spss.utilities.classloading.dynamicclassloader.ChildFirstDynamicClassLoader.loadClass(ChildFirstDynamicClassLoader.java:108) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:274) > at > akka.actor.ReflectiveDynamicAccess$$anonfun$getClassFor$1.apply(DynamicAccess.scala:67) > at > akka.actor.ReflectiveDynamicAccess$$anonfun$getClassFor$1.apply(DynamicAccess.scala:66) > at scala.util.Try$.apply(Try.scala:161) > at > akka.actor.ReflectiveDynamicAccess.getClassFor(DynamicAccess.scala:66) > at > akka.serialization.Serialization$$anonfun$6.apply(Serialization.scala:181) > at > akka.serialization.Serialization$$anonfun$6.apply(Serialization.scala:181) > at > scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722) > at > scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224) > at > scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) > at > scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721) > at akka.serialization.Serialization.(Serialization.scala:181) > at > akka.serialization.SerializationExtension$.createExtension(SerializationExtension.scala:15) > at > akka.serialization.SerializationExtension$.createExtension(SerializationExtension.scala:12) > at akka.actor.ActorSystemImpl.registerExtension(ActorSystem.scala:713) > at akka.actor.ExtensionId$class.apply(Extension.scala:79) > at > akka.serialization.SerializationExtension$.apply(SerializationExtension.scala:12) > at > akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:175) > at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:620) > at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:617) > at akka.actor.ActorSystemImpl._start(ActorSystem.scala:617) > at akka.actor.ActorSystemImpl.start(ActorSystem.scala:634) > at akka.actor.ActorSystem$.apply(ActorSystem.scala:142) > at akka.actor.ActorSystem$.apply(ActorSystem.scala:119) > at > org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121) > at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53) > at 
org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52) > at > org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1920) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1911) > at > org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:55) > at > org.apache.spark.rpc.akka.AkkaRpcEnvFactory.create(AkkaRpcEnv.scala:253) > at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:53) > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:254) > at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194) > at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277) > at org.apache.spark.SparkContext.(SparkContext.scala:450) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:75) > Also I checked spark assembly jar file and do not find the packa
[jira] [Commented] (SPARK-20394) Replication factor value Not changing properly
[ https://issues.apache.org/jira/browse/SPARK-20394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081318#comment-16081318 ] Marcelo Vanzin commented on SPARK-20394: Have you tried setting the replication to 1 in your {{hdfs-site.xml}}? IIRC Spark 1.6 doesn't propagate the HiveContext configuration to the Hive library in some cases. > Replication factor value Not changing properly > -- > > Key: SPARK-20394 > URL: https://issues.apache.org/jira/browse/SPARK-20394 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 1.6.0 >Reporter: Kannan Subramanian > > I am saving a SparkSQL DataFrame to a persistent Hive table using the steps below: > a) registerTempTable on the DataFrame as a tempTable > b) create table (cols) partitioned by (col1, col2) stored as > parquet > c) insert into partition(col1, col2) select * from tempTable > I set dfs.replication to "1" on the HiveContext object, but it did not take > effect consistently: the replication factor is 1 for about 80% of the generated Parquet > files on HDFS, while the default of 3 applies to the remaining 20%. I am not sure why > the replication setting is not applied to all of the generated parquet files. > Please let me know if you have any suggestions or solutions -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
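One way to try the suggestion above without editing hdfs-site.xml is to set the replication factor on the Hadoop configuration used by the SparkContext; whether this reaches the Hive write path on Spark 1.6 is exactly what is in question here, so treat it as an experiment rather than a confirmed fix (sc is the active SparkContext):
{code}
// Experiment: set the desired replication directly on the Hadoop configuration.
sc.hadoopConfiguration.set("dfs.replication", "1")
// ... then run the insert into the partitioned Hive table as described above ...
{code}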
[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081206#comment-16081206 ] shane knapp commented on SPARK-21367: - reverting now. > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081274#comment-16081274 ] Dongjoon Hyun commented on SPARK-21367: --- Thank you, [~shaneknapp]. It seems to be `Had CRAN check errors; see logs.`. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/SparkPullRequestBuilder/79474/console - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/SparkPullRequestBuilder/79473/console - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/SparkPullRequestBuilder/79472/console - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/SparkPullRequestBuilder/79471/console > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081230#comment-16081230 ] shane knapp commented on SPARK-21367: - shiv: some of the builds that launched post-upgrade were showing a v4 version of roxygen2 being imported (from https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79476/console) ```First time using roxygen2 4.0. Upgrading automatically...``` this build launched after the upgrade, so i'd rather roll back and un-break things and take a closer look in a little bit (i have an appt in ~30 mins). i'll check back later this afternoon and see if anything has turned up. > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081226#comment-16081226 ] shane knapp commented on SPARK-21367: - ill take a closer look at this tomorrow. > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081223#comment-16081223 ] Shivaram Venkataraman commented on SPARK-21367: --- [~dongjoon] Do you have a particular output file that we can use to debug ? > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081220#comment-16081220 ] shane knapp commented on SPARK-21367: - reverted > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp reopened SPARK-21367: - > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081173#comment-16081173 ] Dongjoon Hyun commented on SPARK-21367: --- Hi, All. After this, many builds seem to fail consecutively. {code} This patch fails SparkR unit tests. {code} > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-17044) Add window function test in SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-17044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-17044. - Resolution: Duplicate > Add window function test in SQLQueryTestSuite > - > > Key: SPARK-17044 > URL: https://issues.apache.org/jira/browse/SPARK-17044 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Priority: Minor > > This issue adds a SQL query test for Window functions for new > `SQLQueryTestSuite`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17044) Add window function test in SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-17044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-17044: --- > Add window function test in SQLQueryTestSuite > - > > Key: SPARK-17044 > URL: https://issues.apache.org/jira/browse/SPARK-17044 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Priority: Minor > > This issue adds a SQL query test for Window functions for new > `SQLQueryTestSuite`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081173#comment-16081173 ] Dongjoon Hyun edited comment on SPARK-21367 at 7/10/17 9:21 PM: Hi, All. After this, many builds seem to fail consecutively. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/SparkPullRequestBuilder/ {code} This patch fails SparkR unit tests. {code} was (Author: dongjoon): Hi, All. After this, many builds seem to fail consecutively. {code} This patch fails SparkR unit tests. {code} > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21289) Text and CSV formats do not support custom end-of-line delimiters
[ https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081136#comment-16081136 ] Andrew Ash commented on SPARK-21289: Looks like this will fix SPARK-17227 also > Text and CSV formats do not support custom end-of-line delimiters > - > > Key: SPARK-21289 > URL: https://issues.apache.org/jira/browse/SPARK-21289 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Yevgen Galchenko >Priority: Minor > > Spark csv and text readers always use default CR, LF or CRLF line terminators > without an option to configure a custom delimiter. > Option "textinputformat.record.delimiter" is not being used to set delimiter > in HadoopFileLinesReader and can only be set for Hadoop RDD when textFile() > is used to read file. > Possible solution would be to change HadoopFileLinesReader and create > LineRecordReader with delimiters specified in configuration. LineRecordReader > already supports passing recordDelimiter in its constructor. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
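For reference, a minimal Scala sketch of the RDD-level workaround the description alludes to (setting "textinputformat.record.delimiter" on the Hadoop input path, since the DataFrame text/CSV readers do not honor it, which is the point of this ticket); it assumes a spark-shell {{sc}}, and the input path and the "\u0001" delimiter are placeholders rather than values from the ticket:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Hypothetical input path and record delimiter.
val path = "/data/records.txt"
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\u0001")

// Each RDD element is one record, split on the custom delimiter
// instead of the default CR / LF / CRLF handling.
val records = sc
  .newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }

records.take(5).foreach(println)
{code}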
[jira] [Commented] (SPARK-21369) Don't use Scala classes in external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081113#comment-16081113 ] Apache Spark commented on SPARK-21369: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/18593 > Don't use Scala classes in external shuffle service > --- > > Key: SPARK-21369 > URL: https://issues.apache.org/jira/browse/SPARK-21369 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now the external shuffle service uses Scala Tuple2. However, the Scala > library won't be shaded into the yarn shuffle assembly jar. Then when the > codes are called, it will throw ClassNotFoundException. > Right now it's safe by default because we disabled > spark.reducer.maxReqSizeShuffleToMem. However, to allow using > spark.reducer.maxReqSizeShuffleToMem, we should remove all usages of Tuples. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080503#comment-16080503 ] Dominic Ricard edited comment on SPARK-21067 at 7/10/17 9:04 PM: - Interesting findings today. - Open *{color:#59afe1}Beeline Session 1{color}* -- Create Table 1 (Success) - Open *{color:#14892c}Beeline Session 2{color}* -- Create Table 2 (Success) - Close *{color:#59afe1}Beeline Session 1{color}* - Create Table 3 in *{color:#14892c}Beeline Session 2{color}* ({color:#d04437}FAIL{color}) So, it seems like the problem occurs after the 1st session is closed. What does the Thrift server do when a session is closed that could cause this issue? Looking at the Hive Metastore logs, I notice that the same SQL query (CREATE TABLE ...) translate to different actions between the 1st and later sessions: Session 1: {noformat} PERFLOG method=alter_table_with_cascade from=org.apache.hadoop.hive.metastore.RetryingHMSHandler {noformat} Session 2: {noformat} PERFLOG method=drop_table_with_environment_context from=org.apache.hadoop.hive.metastore.RetryingHMSHandler {noformat} was (Author: dricard): Interesting findings today. - Open *{color:#59afe1}Beeline Session 1{color}* -- Create Table 1 (Success) - Open *{color:#14892c}Beeline Session 2{color}* -- Create Table 2 (Success) - Close *{color:#59afe1}Beeline Session 1{color}* - Create Table 3 in *{color:#14892c}Beeline Session 2{color}* ({color:#d04437}FAIL{color}) So, it seems like the problem occurs after the 1st session is closed. What does the Thrift server do when a session is closed that could cause this issue? > Thrift Server - CTAS fail with Unable to move source > > > Key: SPARK-21067 > URL: https://issues.apache.org/jira/browse/SPARK-21067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Yarn > Hive MetaStore > HDFS (HA) >Reporter: Dominic Ricard > > After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS > would fail, sometimes... > Most of the time, the CTAS would work only once, after starting the thrift > server. After that, dropping the table and re-issuing the same CTAS would > fail with the following message (Sometime, it fails right away, sometime it > work for a long period of time): > {noformat} > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > We have already found the following Jira > (https://issues.apache.org/jira/browse/SPARK-11021) which state that the > {{hive.exec.stagingdir}} had to be added in order for Spark to be able to > handle CREATE TABLE properly as of 2.0. 
As you can see in the error, we have > ours set to "/tmp/hive-staging/\{user.name\}" > Same issue with INSERT statements: > {noformat} > CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE > dricard.test SELECT 1; > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > This worked fine in 1.6.2, which we currently run in our Production > Environment but since 2.0+, we haven't been able to CREATE TABLE consistently > on the cluster. > SQL to reproduce issue: > {noformat} > DROP SCHEMA IF EXISTS dricard CASCADE; > CREATE SCHEMA dricard; > CREATE TABLE dricard.test (col1 int); > INSERT INTO TABLE dricard.test SELECT 1; > SELECT * from dricard.test; > DROP TABLE dricard.test; > CREATE TABLE dricard.test AS select 1 as `col1`; > SELECT * from dricard.test > {noformat} > Thrift server usually fails at INSERT... > Tried the same procedure in a spark context using spark.sql() and didn't > encounter the same issue. > Full stack Trace: > {noformat} > 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query, currentState RUNNING, > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 > to desti > nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) >
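As a side note, here is a rough Scala sketch of the spark.sql() variant the reporter mentions (which did not reproduce the failure for them), reusing the statements from the "SQL to reproduce issue" block above; it assumes a SparkSession built with enableHiveSupport(), which the ticket does not show:
{code}
// Runs the reproduction statements through spark.sql() instead of the
// Thrift server / Beeline path. Assumes `spark` was created with
// .enableHiveSupport() so CREATE TABLE / INSERT go through the Hive catalog.
val stmts = Seq(
  "DROP SCHEMA IF EXISTS dricard CASCADE",
  "CREATE SCHEMA dricard",
  "CREATE TABLE dricard.test (col1 int)",
  "INSERT INTO TABLE dricard.test SELECT 1",
  "SELECT * from dricard.test",
  "DROP TABLE dricard.test",
  "CREATE TABLE dricard.test AS select 1 as `col1`",
  "SELECT * from dricard.test")

stmts.foreach { s =>
  println(s)
  spark.sql(s).show()
}
{code}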
[jira] [Closed] (SPARK-15226) CSV file data-line with newline at first line load error
[ https://issues.apache.org/jira/browse/SPARK-15226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-15226. -- Resolution: Fixed Fix Version/s: 2.2.0 Fixed by Fixed by https://issues.apache.org/jira/browse/SPARK-19610 > CSV file data-line with newline at first line load error > > > Key: SPARK-15226 > URL: https://issues.apache.org/jira/browse/SPARK-15226 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Weichen Xu > Fix For: 2.2.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > CSV file such as: > --- > v1,v2,"v > 3",v4,v5 > a,b,c,d,e > --- > it contains two row,first row : > v1, v2, v\n3, v4, v5 (in value v\n3 it contains a newline character,it is > legal) > second row: > a,b,c,d,e > then in spark-shell run commands like: > val sqlContext = new org.apache.spark.sql.SQLContext(sc); > var reader = sqlContext.read > var df = reader.csv("path/to/csvfile") > df.collect > then we find the load data is wrong, > the load data has only 3 columns, but in fact it has 5 columns. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15226) CSV file data-line with newline at first line load error
[ https://issues.apache.org/jira/browse/SPARK-15226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081154#comment-16081154 ] Andrew Ash edited comment on SPARK-15226 at 7/10/17 9:07 PM: - Fixed by https://issues.apache.org/jira/browse/SPARK-19610 was (Author: aash): Fixed by Fixed by https://issues.apache.org/jira/browse/SPARK-19610 > CSV file data-line with newline at first line load error > > > Key: SPARK-15226 > URL: https://issues.apache.org/jira/browse/SPARK-15226 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Weichen Xu > Fix For: 2.2.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > CSV file such as: > --- > v1,v2,"v > 3",v4,v5 > a,b,c,d,e > --- > it contains two row,first row : > v1, v2, v\n3, v4, v5 (in value v\n3 it contains a newline character,it is > legal) > second row: > a,b,c,d,e > then in spark-shell run commands like: > val sqlContext = new org.apache.spark.sql.SQLContext(sc); > var reader = sqlContext.read > var df = reader.csv("path/to/csvfile") > df.collect > then we find the load data is wrong, > the load data has only 3 columns, but in fact it has 5 columns. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
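For readers hitting this, the fix referenced above (SPARK-19610) added multi-line record support to the CSV reader in 2.2.0; a short Scala sketch follows, assuming the option name shipped in 2.2.0 is multiLine and using the hypothetical path from the description:
{code}
// Read the example file from the description; quoted values may contain
// embedded newlines when multi-line parsing is enabled.
val df = spark.read
  .option("multiLine", true)
  .csv("path/to/csvfile")

df.show()   // expected: 2 rows and 5 columns, with "v\n3" kept in a single cell
{code}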
[jira] [Commented] (SPARK-20263) create empty dataframes in sparkR
[ https://issues.apache.org/jira/browse/SPARK-20263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081126#comment-16081126 ] Grishma Jena commented on SPARK-20263: -- [~otoomet] Have you tried creating a Spark dataframe with a dummy record and then filtering it out? > create empty dataframes in sparkR > - > > Key: SPARK-20263 > URL: https://issues.apache.org/jira/browse/SPARK-20263 > Project: Spark > Issue Type: Wish > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Ott Toomet >Priority: Minor > > SparkR 2.1 does not support creating empty dataframes, nor conversion of > empty R dataframes to spark ones: > createDataFrame(data.frame(a=integer())) > gives > Error in takeRDD(x, 1)[[1]] : subscript out of bounds -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081128#comment-16081128 ] Reynold Xin commented on SPARK-21349: - cc [~cloud_fan] Shouldn't task metric just be a single accumulator, rather than a list of them? That would substantially cut down the serialization size. > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. > According to the Jenkins log, we also have 123 warnings even in our unit test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21369) Don't use Scala classes in external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-21369: - Description: Right now the external shuffle service uses Scala Tuple2. However, the Scala library won't be shaded into the yarn shuffle assembly jar. Then when the codes are called, it will throw ClassNotFoundException. Right now it's safe because we disabled spark.reducer.maxReqSizeShuffleToMem by default. However, to allow using spark.reducer.maxReqSizeShuffleToMem for Yarn users, we should remove all usages of Tuples. was: Right now the external shuffle service uses Scala Tuple2. However, the Scala library won't be shaded into the yarn shuffle assembly jar. Then when the codes are called, it will throw ClassNotFoundException. Right now it's safe because we disabled spark.reducer.maxReqSizeShuffleToMem by default. However, to allow using spark.reducer.maxReqSizeShuffleToMem, we should remove all usages of Tuples. > Don't use Scala classes in external shuffle service > --- > > Key: SPARK-21369 > URL: https://issues.apache.org/jira/browse/SPARK-21369 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now the external shuffle service uses Scala Tuple2. However, the Scala > library won't be shaded into the yarn shuffle assembly jar. Then when the > codes are called, it will throw ClassNotFoundException. > Right now it's safe because we disabled spark.reducer.maxReqSizeShuffleToMem > by default. However, to allow using spark.reducer.maxReqSizeShuffleToMem for > Yarn users, we should remove all usages of Tuples. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21369) Don't use Scala classes in external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-21369: - Description: Right now the external shuffle service uses Scala Tuple2. However, the Scala library won't be shaded into the yarn shuffle assembly jar. Then when the codes are called, it will throw ClassNotFoundException. Right now it's safe because we disabled spark.reducer.maxReqSizeShuffleToMem by default. However, to allow using spark.reducer.maxReqSizeShuffleToMem, we should remove all usages of Tuples. was: Right now the external shuffle service uses Scala Tuple2. However, the Scala library won't be shaded into the yarn shuffle assembly jar. Then when the codes are called, it will throw ClassNotFoundException. Right now it's safe by default because we disabled spark.reducer.maxReqSizeShuffleToMem. However, to allow using spark.reducer.maxReqSizeShuffleToMem, we should remove all usages of Tuples. > Don't use Scala classes in external shuffle service > --- > > Key: SPARK-21369 > URL: https://issues.apache.org/jira/browse/SPARK-21369 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now the external shuffle service uses Scala Tuple2. However, the Scala > library won't be shaded into the yarn shuffle assembly jar. Then when the > codes are called, it will throw ClassNotFoundException. > Right now it's safe because we disabled spark.reducer.maxReqSizeShuffleToMem > by default. However, to allow using spark.reducer.maxReqSizeShuffleToMem, we > should remove all usages of Tuples. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21369) Don't use Scala classes in external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21369: Assignee: Shixiong Zhu (was: Apache Spark) > Don't use Scala classes in external shuffle service > --- > > Key: SPARK-21369 > URL: https://issues.apache.org/jira/browse/SPARK-21369 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Right now the external shuffle service uses Scala Tuple2. However, the Scala > library won't be shaded into the yarn shuffle assembly jar. Then when the > codes are called, it will throw ClassNotFoundException. > Right now it's safe by default because we disabled > spark.reducer.maxReqSizeShuffleToMem. However, to allow using > spark.reducer.maxReqSizeShuffleToMem, we should remove all usages of Tuples. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21369) Don't use Scala classes in external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21369: Assignee: Apache Spark (was: Shixiong Zhu) > Don't use Scala classes in external shuffle service > --- > > Key: SPARK-21369 > URL: https://issues.apache.org/jira/browse/SPARK-21369 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Right now the external shuffle service uses Scala Tuple2. However, the Scala > library won't be shaded into the yarn shuffle assembly jar. Then when the > codes are called, it will throw ClassNotFoundException. > Right now it's safe by default because we disabled > spark.reducer.maxReqSizeShuffleToMem. However, to allow using > spark.reducer.maxReqSizeShuffleToMem, we should remove all usages of Tuples. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21369) Don't use Scala classes in external shuffle service
Shixiong Zhu created SPARK-21369: Summary: Don't use Scala classes in external shuffle service Key: SPARK-21369 URL: https://issues.apache.org/jira/browse/SPARK-21369 Project: Spark Issue Type: Bug Components: Shuffle, YARN Affects Versions: 2.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Right now the external shuffle service uses Scala Tuple2. However, the Scala library won't be shaded into the yarn shuffle assembly jar. Then when the codes are called, it will throw ClassNotFoundException. Right now it's safe by default because we disabled spark.reducer.maxReqSizeShuffleToMem. However, to allow using spark.reducer.maxReqSizeShuffleToMem, we should remove all usages of Tuples. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21364) IndexOutOfBoundsException on equality check of two complex array elements
[ https://issues.apache.org/jira/browse/SPARK-21364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080963#comment-16080963 ] Kazuaki Ishizaki commented on SPARK-21364: -- When I ran the following test case that is derived from the repro, I succeeded to get the result without any exception on the master or 2.1.1. Do I make some mistakes? {code} test("SPARK-21364") { val data = Seq( "{\"menu\":{\"id\":\"file\",\"value\":\"File\",\"popup\":{\"menuitem\":[" + "{\"value\":\"New\",\"onclick\":\"CreateNewDoc()\"}," + "{\"value\":\"Open\",\"onclick\":\"OpenDoc()\"}, " + "{\"value\":\"Close\",\"onclick\":\"CloseDoc()\"}" + "]}}}") val df = sqlContext.read.json(sparkContext.parallelize(data)) df.select($"menu.popup.menuitem"(lit(0)). === ($"menu.popup.menuitem"(lit(1.show } {code} {code} +-+ |(menu.popup.menuitem[0] = menu.popup.menuitem[1])| +-+ |false| +-+ {code} > IndexOutOfBoundsException on equality check of two complex array elements > - > > Key: SPARK-21364 > URL: https://issues.apache.org/jira/browse/SPARK-21364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Vivek Patangiwar >Priority: Minor > > Getting an IndexOutOfBoundsException with the following code: > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.SparkSession > object ArrayEqualityTest { > def main(s:Array[String]) { > val sparkSession = > SparkSession.builder().master("local[*]").appName("app").getOrCreate() > val sqlContext = sparkSession.sqlContext > val sc = sparkSession.sqlContext.sparkContext > import sparkSession.implicits._ > val df = > sqlContext.read.json(sc.parallelize(Seq("{\"menu\":{\"id\":\"file\",\"value\":\"File\",\"popup\":{\"menuitem\":[{\"value\":\"New\",\"onclick\":\"CreateNewDoc()\"},{\"value\":\"Open\",\"onclick\":\"OpenDoc()\"},{\"value\":\"Close\",\"onclick\":\"CloseDoc()\"}]}}}"))) > > df.select($"menu.popup.menuitem"(lit(0)).===($"menu.popup.menuitem"(lit(1.show > } > } > Here's the complete stack-trace: > Exception in thread "main" java.lang.IndexOutOfBoundsException: 1 > at > scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65) > at scala.collection.immutable.List.apply(List.scala:84) > at > org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:64) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$3.apply(GenerateOrdering.scala:76) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$3.apply(GenerateOrdering.scala:75) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:75) > at > 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:68) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.genComp(CodeGenerator.scala:559) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.genEqual(CodeGenerator.scala:486) > at > org.apache.spark.sql.catalyst.expressions.EqualTo$$anonfun$doGenCode$4.apply(predicates.scala:437) > at > org.apache.spark.sql.catalyst.expressions.EqualTo$$anonfun$doGenCode$4.apply(predicates.scala:437) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression$$anonfun$defineCodeGen$2.apply(Expression.scala:442) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression$$anonfun$defineCodeGen$2.apply(Expression.scala:441) > at > org.apache.spark.sql.catalyst.ex
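The repro code above appears to have lost some closing parentheses in the mail archive; a reconstructed sketch is below, assuming a spark-shell session where spark and sc are defined:
{code}
import org.apache.spark.sql.functions._
import spark.implicits._

val json = """{"menu":{"id":"file","value":"File","popup":{"menuitem":[{"value":"New","onclick":"CreateNewDoc()"},{"value":"Open","onclick":"OpenDoc()"},{"value":"Close","onclick":"CloseDoc()"}]}}}"""
val df = spark.read.json(sc.parallelize(Seq(json)))

// Compare two elements of the nested array; on affected versions this is
// where the IndexOutOfBoundsException is reported during code generation.
df.select($"menu.popup.menuitem"(lit(0)) === $"menu.popup.menuitem"(lit(1))).show()
{code}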
[jira] [Resolved] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp resolved SPARK-21367. - Resolution: Fixed > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080890#comment-16080890 ] Dongjoon Hyun commented on SPARK-21349: --- Thank you, [~shivaram] and @Kay Ousterhout. Okay. It looks like a consensus. Then, in order to make it final, let me ping SQL committers here. Hi, [~rxin], [~cloud_fan], [~smilegator]. Could you give us your opinion here, too? > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. > According to the Jenkins log, we also have 123 warnings even in our unit test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080890#comment-16080890 ] Dongjoon Hyun edited comment on SPARK-21349 at 7/10/17 7:04 PM: Thank you, [~shivaram] and [~kayousterhout]. Okay. It looks like a consensus. Then, in order to make it final, let me ping SQL committers here. Hi, [~rxin], [~cloud_fan], [~smilegator]. Could you give us your opinion here, too? was (Author: dongjoon): Thank you, [~shivaram] and @Kay Ousterhout. Okay. It looks like a consensus. Then, in order to make it final, let me ping SQL committers here. Hi, [~rxin], [~cloud_fan], [~smilegator]. Could you give us your opinion here, too? > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. > According to the Jenkins log, we also have 123 warnings even in our unit test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp reassigned SPARK-21367: --- Assignee: shane knapp > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: shane knapp > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080875#comment-16080875 ] Shivaram Venkataraman commented on SPARK-21349: --- Well 100K is already too large IMHO and I'm not sure adding another config property is really helping things just to silence some log messages. Looking at the code it seems that the larger task sizes mostly stem from the TaskMetrics objects getting bigger -- especially with a number of new SQL metrics being added. I think the right fix here is to improve the serialization of TaskMetrics (especially if the structure is empty, why bother sending anything at all to the worker ?) > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. > According to the Jenkins log, we also have 123 warnings even in our unit test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21368) TPCDSQueryBenchmark can't refer query files.
[ https://issues.apache.org/jira/browse/SPARK-21368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080857#comment-16080857 ] Apache Spark commented on SPARK-21368: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/18592 > TPCDSQueryBenchmark can't refer query files. > > > Key: SPARK-21368 > URL: https://issues.apache.org/jira/browse/SPARK-21368 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kousuke Saruta >Priority: Minor > > TPCDSQueryBenchmark packaged into a jar doesn't work with spark-submit. > It's because of the failure of reference query files in the jar file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
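The usual cause of this class of failure is loading the query files by filesystem path, which only works from the source tree; below is a generic Scala sketch of reading them from the classpath instead. The resource path and helper name are hypothetical, and the actual change in the linked pull request may differ:
{code}
import scala.io.Source

// Resolve a bundled query file via the classpath so it works both from the
// source tree and from inside an assembled jar.
def loadQuery(name: String): String = {
  val in = Thread.currentThread().getContextClassLoader
    .getResourceAsStream(s"tpcds/$name.sql")   // hypothetical resource path
  require(in != null, s"query file $name.sql not found on the classpath")
  try Source.fromInputStream(in, "UTF-8").mkString finally in.close()
}

val q4 = loadQuery("q4")
{code}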
[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080817#comment-16080817 ] Dongjoon Hyun commented on SPARK-21349: --- I usually saw 200K~300K. And, the following is our Apache Spark unit test logs. {code} $ curl -LO "https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/3170/consoleFull" $ grep 'contains a task of very large size' consoleFull | awk -F"(" '{print $2}' | awk '{print $1}' | sort -n | uniq -c 6 104 4 234 4 235 4 251 4 255 4 264 4 272 4 275 4 278 4 568 4 658 4 677 4 684 4 687 4 692 4 736 4 761 4 764 4 778 4 795 4 817 4 874 4 1009 1 1370 1 2065 1 2760 1 2763 1 3007 1 3012 1 3015 1 3016 1 3021 1 3022 2 3917 12 4051 3 4792 1 15050 1 15056 {code} > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. > According to the Jenkins log, we also have 123 warnings even in our unit test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21368) TPCDSQueryBenchmark can't refer query files.
[ https://issues.apache.org/jira/browse/SPARK-21368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21368: Assignee: Apache Spark > TPCDSQueryBenchmark can't refer query files. > > > Key: SPARK-21368 > URL: https://issues.apache.org/jira/browse/SPARK-21368 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > TPCDSQueryBenchmark packaged into a jar doesn't work with spark-submit. > It's because of the failure of reference query files in the jar file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21368) TPCDSQueryBenchmark can't refer query files.
[ https://issues.apache.org/jira/browse/SPARK-21368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21368: Assignee: (was: Apache Spark) > TPCDSQueryBenchmark can't refer query files. > > > Key: SPARK-21368 > URL: https://issues.apache.org/jira/browse/SPARK-21368 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kousuke Saruta >Priority: Minor > > TPCDSQueryBenchmark packaged into a jar doesn't work with spark-submit. > It's because of the failure of reference query files in the jar file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080827#comment-16080827 ] shane knapp commented on SPARK-21367: - copy that. installing this version now... > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21368) TPCDSQueryBenchmark can't refer query files.
Kousuke Saruta created SPARK-21368: -- Summary: TPCDSQueryBenchmark can't refer query files. Key: SPARK-21368 URL: https://issues.apache.org/jira/browse/SPARK-21368 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Kousuke Saruta Priority: Minor TPCDSQueryBenchmark packaged into a jar doesn't work with spark-submit. It's because of the failure of reference query files in the jar file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080830#comment-16080830 ] shane knapp commented on SPARK-21367: - ...and this is done. > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080826#comment-16080826 ] Shivaram Venkataraman commented on SPARK-21367: --- I just checked the transitive dependencies and I think it should be fine to manually install roxygen 5.0.0 > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080801#comment-16080801 ] shane knapp commented on SPARK-21367: - we haven't ever explicitly installed any version of Roxygen2, so whatever is there was installed via deps on other packages. that being said, i should be able to upgrade this to 5.0.0 pretty easily. let me check w/[~shivaram] before proceeding, however. > R older version of Roxygen2 on Jenkins > -- > > Key: SPARK-21367 > URL: https://issues.apache.org/jira/browse/SPARK-21367 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > Getting this message from a recent build. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console > Warning messages: > 1: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > 2: In check_dep_version(pkg, version, compare) : > Need roxygen2 >= 5.0.0 but loaded version is 4.1.1 > * installing *source* package 'SparkR' ... > ** R > We have been running with 5.0.1 and haven't changed for a year. > NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21362) Add JDBCDialect for Apache Drill
[ https://issues.apache.org/jira/browse/SPARK-21362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080810#comment-16080810 ] David Radford commented on SPARK-21362: --- I'm planning to work on this but cannot assign myself the task > Add JDBCDialect for Apache Drill > > > Key: SPARK-21362 > URL: https://issues.apache.org/jira/browse/SPARK-21362 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: David Radford >Priority: Minor > > Apache Drill does not allow quotation marks (") so a custom jdbc dialect is > needed to return the field names surround in tick marks (`) similar to how > MySQL dialect works. This requires an override to the method: quoteIdentifier -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080789#comment-16080789 ] Kay Ousterhout commented on SPARK-21349: Out of curiosity, what are the task sizes that you're seeing? +[~shivaram] -- I know you've looked at task size a lot. Are these getting bigger / do you think we should just raise the warning size for everyone? > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. > According to the Jenkins log, we also have 123 warnings even in our unit test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21366) Add sql test for window functions
[ https://issues.apache.org/jira/browse/SPARK-21366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21366: Assignee: (was: Apache Spark) > Add sql test for window functions > - > > Key: SPARK-21366 > URL: https://issues.apache.org/jira/browse/SPARK-21366 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org