[jira] [Commented] (SPARK-19984) ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2017-10-01 Thread John Steidley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187644#comment-16187644
 ] 

John Steidley commented on SPARK-19984:
---

[~kiszk] I just hit this issue. I am able to reproduce it consistently on Spark 
2.1.1 from inside my codebase, but not in the spark shell.

Here's a rough equivalent of my code (again, this does not reproduce the bug in 
the Spark shell):
```
val dfA = sc.parallelize(Seq("John", "Kazuaki")).toDF("id")

val dfB = dfA.select(col("id").alias("A"), col("id").alias("B"))

val dfC = sc.parallelize(Seq("John", "Kazuaki")).toDF("A")
dfC.join(dfB, "A")
```

I have thought about nullability: making dfA's "id" column non-nullable makes 
the issue go away (following the directions here: 
https://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe).
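For reference, a minimal sketch of that workaround, assuming a SparkSession named spark (this follows the linked Stack Overflow approach; it is not from my real codebase):
```
// Sketch only: rebuild dfA's schema with nullable = false on every field,
// then recreate the DataFrame from the underlying RDD[Row].
import org.apache.spark.sql.types.{StructField, StructType}

val nonNullableSchema = StructType(dfA.schema.map {
  case StructField(name, dataType, _, metadata) =>
    StructField(name, dataType, nullable = false, metadata)
})
val dfANonNullable = spark.createDataFrame(dfA.rdd, nonNullableSchema)
```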

In my real code, dfA and dfC are related (both are derived from the same 
DataFrame). Could that matter here?

What could I be missing? Is there any more information you would like from me? 
I can try harder to track down the issue, but I can't share the whole codebase.

> ERROR codegen.CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java'
> -
>
> Key: SPARK-19984
> URL: https://issues.apache.org/jira/browse/SPARK-19984
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Andrey Yakovenko
>
> I have had this error a few times on my local Hadoop 2.7.3 + Spark 2.1.0 environment. 
> This is not a permanent error; the next time I run it, it may disappear. 
> Unfortunately I don't know how to reproduce the issue. As you can see from 
> the log, my logic is pretty complicated.
> Here is a part of the log I've got (container_1489514660953_0015_01_01):
> {code}
> 17/03/16 11:07:04 ERROR codegen.CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 151, Column 29: A method named "compare" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private scala.collection.Iterator[] inputs;
> /* 008 */   private boolean agg_initAgg;
> /* 009 */   private boolean agg_bufIsNull;
> /* 010 */   private long agg_bufValue;
> /* 011 */   private boolean agg_initAgg1;
> /* 012 */   private boolean agg_bufIsNull1;
> /* 013 */   private long agg_bufValue1;
> /* 014 */   private scala.collection.Iterator smj_leftInput;
> /* 015 */   private scala.collection.Iterator smj_rightInput;
> /* 016 */   private InternalRow smj_leftRow;
> /* 017 */   private InternalRow smj_rightRow;
> /* 018 */   private UTF8String smj_value2;
> /* 019 */   private java.util.ArrayList smj_matches;
> /* 020 */   private UTF8String smj_value3;
> /* 021 */   private UTF8String smj_value4;
> /* 022 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> smj_numOutputRows;
> /* 023 */   private UnsafeRow smj_result;
> /* 024 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder smj_holder;
> /* 025 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> smj_rowWriter;
> /* 026 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_numOutputRows;
> /* 027 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_aggTime;
> /* 028 */   private UnsafeRow agg_result;
> /* 029 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
> /* 030 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> agg_rowWriter;
> /* 031 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_numOutputRows1;
> /* 032 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_aggTime1;
> /* 033 */   private UnsafeRow agg_result1;
> /* 034 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder1;
> /* 035 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> agg_rowWriter1;
> /* 036 */
> /* 037 */   public GeneratedIterator(Object[] references) {
> /* 038 */ this.references = references;
> /* 039 */   }
> /* 040 */
> /* 041 */   public void init(int index, scala.collection.Iterator[] inputs) {
> /* 042 */ partitionIndex = index;
> /* 043 */ this.inputs = inputs;
> /* 044 */ wholestagecodegen_init_0();
> /* 045 */ wholestagecodegen_init_1();
> /* 046 */
> /* 047 */   }
> 

[jira] [Updated] (SPARK-22179) percentile_approx should choose the first element if it already reaches the percentage

2017-10-01 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-22179:
-
Description: 
percentile_approx should choose the first element if it already reaches the 
percentage.
For example, given input data 1 to 10, if a user queries 10% (or even less) 
percentile, it should return 1 (instead of 2), because the first value 1 
already reaches 10%. Currently it returns a wrong answer: 2.
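A minimal Scala snippet to illustrate the reported behavior (the view and column names below are only for illustration, not from the original report):
{code}
// With values 1..10, the 10% percentile should already be satisfied by the
// first element (1), but 2 is currently returned.
val df = spark.range(1, 11).toDF("value")
df.createOrReplaceTempView("t")
spark.sql("SELECT percentile_approx(value, 0.1) FROM t").show()
{code}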

  was:
percentile_approx should choose the first element if it already reaches the 
percentage.
For example, given input data 1 to 10, if a user queries 10% (or even less) 
percentile, it should return 1 (instead of 2), because the first value 1 
already reaches 10% percentage. Currently it returns a wrong answer: 2.


> percentile_approx should choose the first element if it already reaches the 
> percentage
> --
>
> Key: SPARK-22179
> URL: https://issues.apache.org/jira/browse/SPARK-22179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> percentile_approx should choose the first element if it already reaches the 
> percentage.
> For example, given input data 1 to 10, if a user queries 10% (or even less) 
> percentile, it should return 1 (instead of 2), because the first value 1 
> already reaches 10%. Currently it returns a wrong answer: 2.






[jira] [Updated] (SPARK-22179) percentile_approx should choose the first element if it already reaches the percentage

2017-10-01 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-22179:
-
Description: 
percentile_approx should choose the first element if it already reaches the 
percentage.
For example, given input data 1 to 10, if a user queries 10% (or even less) 
percentile, it should return 1 (instead of 2), because the first value 1 
already reaches 10% percentage. Currently it returns a wrong answer: 2.

  was:
percentile_approx should choose the first element if it already reaches the 
percentage.
For example, given input data 1 to 10, if a user queries 10% (or even less) 
percentile, it should return 1 (instead of 2), because the first value 1 
already reaches 10% percentage.


> percentile_approx should choose the first element if it already reaches the 
> percentage
> --
>
> Key: SPARK-22179
> URL: https://issues.apache.org/jira/browse/SPARK-22179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> percentile_approx should choose the first element if it already reaches the 
> percentage.
> For example, given input data 1 to 10, if a user queries 10% (or even less) 
> percentile, it should return 1 (instead of 2), because the first value 1 
> already reaches 10% percentage. Currently it returns a wrong answer: 2.






[jira] [Updated] (SPARK-22179) percentile_approx should choose the first element if it already reaches the percentage

2017-10-01 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-22179:
-
Description: 
percentile_approx should choose the first element if it already reaches the 
percentage.
For example, given input data 1 to 10, if a user queries 10% (or even less) 
percentile, it should return 1 (instead of 2), because the first value 1 
already reaches 10% percentage.

  was:percentile_approx should choose the first element if it already reaches 
the percentage


> percentile_approx should choose the first element if it already reaches the 
> percentage
> --
>
> Key: SPARK-22179
> URL: https://issues.apache.org/jira/browse/SPARK-22179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> percentile_approx should choose the first element if it already reaches the 
> percentage.
> For example, given input data 1 to 10, if a user queries 10% (or even less) 
> percentile, it should return 1 (instead of 2), because the first value 1 
> already reaches 10% percentage.






[jira] [Commented] (SPARK-22063) Upgrade lintr to latest commit sha1 ID

2017-10-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187574#comment-16187574
 ] 

Felix Cheung commented on SPARK-22063:
--

Surely, I think we could even start with something simple: 
install.packages(..., lib = ...) (or install_github(..., lib = ...)) and then 
library(..., lib.loc = ...).


> Upgrade lintr to latest commit sha1 ID
> --
>
> Key: SPARK-22063
> URL: https://issues.apache.org/jira/browse/SPARK-22063
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this 
> pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026])
>  and SPARK-14074.
> Today, I tried to upgrade to the latest, 
> https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72
> This fixes many bugs and now finds many instances that I have observed and 
> thought should be caught from time to time:
> {code}
> inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis 
> in a function call.
>   return (output)
>  ^
> R/column.R:241:1: style: Lines should not be more than 100 characters.
> #'
> \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{
> ^~~~
> R/context.R:332:1: style: Variable and function names should not be longer 
> than 30 characters.
> spark.getSparkFilesRootDirectory <- function() {
> ^~~~
> R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters.
> #' @param j,select expression for the single Column or a list of columns to 
> select from the SparkDataFrame.
> ^~~
> R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters.
> #' @return A new SparkDataFrame containing only the rows that meet the 
> condition with selected columns.
> ^~~
> R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a 
> function call.
>   return (joinRes)
>  ^
> R/DataFrame.R:2652:1: style: Variable and function names should not be longer 
> than 30 characters.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
> ^
> R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a 
> function call.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
>   ^
> R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a 
> function call.
> stop ("The following column name: ", newJoin, " occurs more than once 
> in the 'DataFrame'.",
>  ^
> R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters.
> #' @note The statistics provided by \code{summary} were change in 2.3.0 use 
> \link{describe} for previous defaults.
> ^~
> R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{cube} creates a single global 
> aggregate and is equivalent to
> ^~~
> R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{rollup} creates a single global 
> aggregate and is equivalent to
> ^
> R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a 
> function call.
>   switch (type,
>  ^
> R/functions.R:41:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{window}, it must be a time Column 
> of \code{TimestampType}.
> ^
> R/functions.R:93:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{shiftLeft}, \code{shiftRight} and 
> \code{shiftRightUnsigned},
> ^~~
> R/functions.R:483:52: style: Remove spaces before the left parenthesis in a 
> function call.
> jcols <- lapply(list(x, ...), function 

[jira] [Commented] (SPARK-22063) Upgrade lintr to latest commit sha1 ID

2017-10-01 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187551#comment-16187551
 ] 

Shivaram Venkataraman commented on SPARK-22063:
---

[~shaneknapp] [~felixcheung] Let's move the discussion to the JIRA?

I think there are a couple of ways to address this issue -- first, as 
[~hyukjin.kwon] pointed out, we can make the lint-r script do the installation. 
I am not much in favor of that, as it would result in the script affecting 
installed packages at runtime.

Instead, I was thinking we could create R environments for each Spark version 
-- https://stackoverflow.com/questions/24283171/virtual-environment-in-r has a 
bunch of ideas on how to do this. Any thoughts on the approaches listed there?

> Upgrade lintr to latest commit sha1 ID
> --
>
> Key: SPARK-22063
> URL: https://issues.apache.org/jira/browse/SPARK-22063
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this 
> pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026])
>  and SPARK-14074.
> Today, I tried to upgrade to the latest, 
> https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72
> This fixes many bugs and now finds many instances that I have observed and 
> thought should be caught from time to time:
> {code}
> inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis 
> in a function call.
>   return (output)
>  ^
> R/column.R:241:1: style: Lines should not be more than 100 characters.
> #'
> \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{
> ^~~~
> R/context.R:332:1: style: Variable and function names should not be longer 
> than 30 characters.
> spark.getSparkFilesRootDirectory <- function() {
> ^~~~
> R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters.
> #' @param j,select expression for the single Column or a list of columns to 
> select from the SparkDataFrame.
> ^~~
> R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters.
> #' @return A new SparkDataFrame containing only the rows that meet the 
> condition with selected columns.
> ^~~
> R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a 
> function call.
>   return (joinRes)
>  ^
> R/DataFrame.R:2652:1: style: Variable and function names should not be longer 
> than 30 characters.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
> ^
> R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a 
> function call.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
>   ^
> R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a 
> function call.
> stop ("The following column name: ", newJoin, " occurs more than once 
> in the 'DataFrame'.",
>  ^
> R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters.
> #' @note The statistics provided by \code{summary} were change in 2.3.0 use 
> \link{describe} for previous defaults.
> ^~
> R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{cube} creates a single global 
> aggregate and is equivalent to
> ^~~
> R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{rollup} creates a single global 
> aggregate and is equivalent to
> ^
> R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a 
> function call.
>   switch (type,
>  ^
> R/functions.R:41:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{window}, it must be a time Column 
> of \code{TimestampType}.
> ^
> 

[jira] [Commented] (SPARK-22181) ReplaceExceptWithNotFilter if one or both of the datasets are fully derived out of Filters from a same parent

2017-10-01 Thread Sathiya Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187526#comment-16187526
 ] 

Sathiya Kumar commented on SPARK-22181:
---

I will create the PR if you agree.

> ReplaceExceptWithNotFilter if one or both of the datasets are fully derived 
> out of Filters from a same parent
> -
>
> Key: SPARK-22181
> URL: https://issues.apache.org/jira/browse/SPARK-22181
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> While applying the Except operator between two datasets, if one or both of the 
> datasets are purely transformed using filter operations, then instead of 
> rewriting the Except operator using an expensive join operation, we can rewrite 
> it using a filter operation by flipping the filter condition of the right node.
> ![Case-1](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case1.png)
> ![Case-2](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case2.png)






[jira] [Commented] (SPARK-22181) ReplaceExceptWithNotFilter if one or both of the datasets are fully derived out of Filters from a same parent

2017-10-01 Thread Sathiya Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187525#comment-16187525
 ] 

Sathiya Kumar commented on SPARK-22181:
---

Here is the rule implementation, which could be scheduled before the 
`ReplaceExceptWithAntiJoin` rule:

{code:java}
object ReplaceExceptWithNotFilter extends Rule[LogicalPlan] {

  implicit def nodeToFilter(node: LogicalPlan): Filter = node.asInstanceOf[Filter]

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Except(left, right) if isEligible(left, right) =>
      // Rewrite Except as a Filter on the left child using the negation of the
      // right child's (combined) filter condition, followed by Distinct.
      Distinct(
        Filter(Not(replaceAttributesIn(combineFilters(right).condition, left)), left)
      )
  }

  // Eligible when both sides are Filters over the same parent, or the right
  // side is a Filter whose parent is the left side itself.
  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
    case (left: Filter, right: Filter) => parent(left).sameResult(parent(right))
    case (left, right: Filter) => left.sameResult(parent(right))
    case _ => false
  }

  def parent(plan: LogicalPlan): LogicalPlan = plan match {
    case Filter(_, child) => parent(child)
    case other => other
  }

  // Repeatedly apply CombineFilters until the plan no longer changes.
  def combineFilters(plan: LogicalPlan): LogicalPlan = CombineFilters(plan) match {
    case result if !result.fastEquals(plan) => combineFilters(result)
    case result => result
  }

  // Rebind the right child's condition to the left child's output attributes.
  def replaceAttributesIn(condition: Expression, leftChild: LogicalPlan): Expression = {
    condition transform {
      case AttributeReference(name, _, _, _) =>
        leftChild.output.find(_.name == name).get
    }
  }
}
{code}
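For intuition, a hedged example of the plan shape this rule targets (the dataset and column names below are made up for illustration):

{code}
// Both sides of the Except are Filters over the same parent, so the rule could
// conceptually rewrite left.except(right) as
// Distinct(Filter(n % 2 = 0 AND NOT(n > 50), parent)).
val parent = spark.range(0, 100).toDF("n")
val left   = parent.filter("n % 2 = 0")
val right  = parent.filter("n > 50")
left.except(right).explain(true)
{code}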


> ReplaceExceptWithNotFilter if one or both of the datasets are fully derived 
> out of Filters from a same parent
> -
>
> Key: SPARK-22181
> URL: https://issues.apache.org/jira/browse/SPARK-22181
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> While applying the Except operator between two datasets, if one or both of the 
> datasets are purely transformed using filter operations, then instead of 
> rewriting the Except operator using an expensive join operation, we can rewrite 
> it using a filter operation by flipping the filter condition of the right node.
> ![Case-1](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case1.png)
> ![Case-2](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case2.png)






[jira] [Commented] (SPARK-22181) ReplaceExceptWithNotFilter if one or both of the datasets are fully derived out of Filters from a same parent

2017-10-01 Thread Sathiya Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187519#comment-16187519
 ] 

Sathiya Kumar commented on SPARK-22181:
---

Please give me assign right to assign this task to myself, thank you.

> ReplaceExceptWithNotFilter if one or both of the datasets are fully derived 
> out of Filters from a same parent
> -
>
> Key: SPARK-22181
> URL: https://issues.apache.org/jira/browse/SPARK-22181
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> While applying the Except operator between two datasets, if one or both of the 
> datasets are purely transformed using filter operations, then instead of 
> rewriting the Except operator using an expensive join operation, we can rewrite 
> it using a filter operation by flipping the filter condition of the right node.
> ![Case-1](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case1.png)
> ![Case-2](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case2.png)






[jira] [Commented] (SPARK-22180) Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort

2017-10-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187508#comment-16187508
 ] 

Sean Owen commented on SPARK-22180:
---

To be clear, I don't think this makes IPv6 fully work for Spark, but it may let 
some more things work.

> Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort
> ---
>
> Key: SPARK-22180
> URL: https://issues.apache.org/jira/browse/SPARK-22180
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Stefan Obermeier
>Priority: Minor
>
> External applications like Apache Cassandra are able to deal with IPv6 
> addresses. Libraries like spark-cassandra-connector combine Apache Cassandra 
> with Apache Spark.
> This combination is very useful  IMHO. 
> One problem is that {code:java} 
> org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the 
> last colon to separate the port from the host part. This conflicts with literal 
> IPv6 addresses.
> I think we can treat {code}hostPort{code} as a literal IPv6 address if it 
> contains two or more colons. If IPv6 addresses are enclosed in square 
> brackets, port definition is still possible.






[jira] [Updated] (SPARK-22180) Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort

2017-10-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22180:
--
Priority: Minor  (was: Critical)

> Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort
> ---
>
> Key: SPARK-22180
> URL: https://issues.apache.org/jira/browse/SPARK-22180
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Stefan Obermeier
>Priority: Minor
>
> External applications like Apache Cassandra are able to deal with IPv6 
> addresses. Libraries like spark-cassandra-connector combine Apache Cassandra 
> with Apache Spark.
> This combination is very useful  IMHO. 
> One problem is that {code:java} 
> org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the 
> last colon to separate the port from the host part. This conflicts with literal 
> IPv6 addresses.
> I think we can treat {code}hostPort{code} as a literal IPv6 address if it 
> contains two or more colons. If IPv6 addresses are enclosed in square 
> brackets, port definition is still possible.






[jira] [Updated] (SPARK-22180) Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort

2017-10-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22180:
--
Labels:   (was: features patch)

> Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort
> ---
>
> Key: SPARK-22180
> URL: https://issues.apache.org/jira/browse/SPARK-22180
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Stefan Obermeier
>Priority: Minor
>
> External applications like Apache Cassandra are able to deal with IPv6 
> addresses. Libraries like spark-cassandra-connector combine Apache Cassandra 
> with Apache Spark.
> This combination is very useful  IMHO. 
> One problem is that {code:java} 
> org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the 
> last colon to separate the port from the host part. This conflicts with literal 
> IPv6 addresses.
> I think we can treat {code}hostPort{code} as a literal IPv6 address if it 
> contains two or more colons. If IPv6 addresses are enclosed in square 
> brackets, port definition is still possible.






[jira] [Updated] (SPARK-22179) percentile_approx should choose the first element if it already reaches the percentage

2017-10-01 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22179:

Description: percentile_approx should choose the first element if it 
already reaches the percentage

> percentile_approx should choose the first element if it already reaches the 
> percentage
> --
>
> Key: SPARK-22179
> URL: https://issues.apache.org/jira/browse/SPARK-22179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> percentile_approx should choose the first element if it already reaches the 
> percentage






[jira] [Assigned] (SPARK-22001) ImputerModel can do withColumn for all input columns at one pass

2017-10-01 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-22001:
---

Assignee: Liang-Chi Hsieh

> ImputerModel can do withColumn for all input columns at one pass
> 
>
> Key: SPARK-22001
> URL: https://issues.apache.org/jira/browse/SPARK-22001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> SPARK-21690 makes {{Imputer}} one-pass by parallelizing the computation of 
> all input columns. When we transform a dataset with {{ImputerModel}}, we do 
> {{withColumn}} on all input columns sequentially. We can also do this on all 
> input columns at once.
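A hedged illustration of the difference (the DataFrame, column names, and surrogate values below are made up; this is not the actual {{ImputerModel}} code):

{code}
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Assume df has numeric columns "a", "b", "c" and these are the imputation values.
val inputCols  = Seq("a", "b", "c")
val surrogates = Map("a" -> 1.0, "b" -> 2.0, "c" -> 3.0)

// Sequential: one withColumn (and thus one extra projection) per input column.
val sequential = inputCols.foldLeft(df) { (cur, c) =>
  cur.withColumn(c + "_imputed", coalesce(col(c), lit(surrogates(c))))
}

// One pass: a single select producing all imputed columns at once.
val onePass = df.select(
  df.columns.map(col) ++
    inputCols.map(c => coalesce(col(c), lit(surrogates(c))).as(c + "_imputed")): _*
)
{code}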






[jira] [Resolved] (SPARK-22001) ImputerModel can do withColumn for all input columns at one pass

2017-10-01 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22001.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> ImputerModel can do withColumn for all input columns at one pass
> 
>
> Key: SPARK-22001
> URL: https://issues.apache.org/jira/browse/SPARK-22001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> SPARK-21690 makes {{Imputer}} one-pass by parallelizing the computation of 
> all input columns. When we transform a dataset with {{ImputerModel}}, we do 
> {{withColumn}} on all input columns sequentially. We can also do this on all 
> input columns at once.






[jira] [Updated] (SPARK-22181) ReplaceExceptWithNotFilter if one or both of the datasets are fully derived out of Filters from a same parent

2017-10-01 Thread Sathiya Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sathiya Kumar updated SPARK-22181:
--
Description: 
While applying the Except operator between two datasets, if one or both of the 
datasets are purely transformed using filter operations, then instead of 
rewriting the Except operator using an expensive join operation, we can rewrite it 
using a filter operation by flipping the filter condition of the right node.

![Case-1](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case1.png)

![Case-2](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case2.png)


  was:
While applying the Except operator between two datasets, if one or both of the 
datasets are purely transformed using filter operations, then instead of 
rewriting the Except operator using an expensive join operation, we can rewrite it 
using a filter operation by flipping the filter condition of the right node.

!https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case1.png!

!https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case2.png!


> ReplaceExceptWithNotFilter if one or both of the datasets are fully derived 
> out of Filters from a same parent
> -
>
> Key: SPARK-22181
> URL: https://issues.apache.org/jira/browse/SPARK-22181
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> While applying the Except operator between two datasets, if one or both of the 
> datasets are purely transformed using filter operations, then instead of 
> rewriting the Except operator using an expensive join operation, we can rewrite 
> it using a filter operation by flipping the filter condition of the right node.
> ![Case-1](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case1.png)
> ![Case-2](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case2.png)






[jira] [Created] (SPARK-22181) ReplaceExceptWithNotFilter if one or both of the datasets are fully derived out of Filters from a same parent

2017-10-01 Thread Sathiya Kumar (JIRA)
Sathiya Kumar created SPARK-22181:
-

 Summary: ReplaceExceptWithNotFilter if one or both of the datasets 
are fully derived out of Filters from a same parent
 Key: SPARK-22181
 URL: https://issues.apache.org/jira/browse/SPARK-22181
 Project: Spark
  Issue Type: New Feature
  Components: Optimizer, SQL
Affects Versions: 2.2.0, 2.1.1
Reporter: Sathiya Kumar
Priority: Minor


While applying the Except operator between two datasets, if one or both of the 
datasets are purely transformed using filter operations, then instead of 
rewriting the Except operator using an expensive join operation, we can rewrite it 
using a filter operation by flipping the filter condition of the right node.

!https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case1.png!

!https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case2.png!






[jira] [Updated] (SPARK-22180) Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort

2017-10-01 Thread Stefan Obermeier (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Obermeier updated SPARK-22180:
-
Summary: Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort  
(was: Allow IPv6 address)

> Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort
> ---
>
> Key: SPARK-22180
> URL: https://issues.apache.org/jira/browse/SPARK-22180
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Stefan Obermeier
>Priority: Critical
>  Labels: features, patch
>
> External applications like Apache Cassandra are able to deal with IPv6 
> addresses. Libraries like spark-cassandra-connector combine Apache Cassandra 
> with Apache Spark.
> This combination is very useful  IMHO. 
> One problem is that {code:java} 
> org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the 
> last colon to separate the port from the host part. This conflicts with literal 
> IPv6 addresses.
> I think we can treat {code}hostPort{code} as a literal IPv6 address if it 
> contains two or more colons. If IPv6 addresses are enclosed in square 
> brackets, port definition is still possible.






[jira] [Assigned] (SPARK-22180) Allow IPv6 address

2017-10-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22180:


Assignee: (was: Apache Spark)

> Allow IPv6 address
> --
>
> Key: SPARK-22180
> URL: https://issues.apache.org/jira/browse/SPARK-22180
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Stefan Obermeier
>Priority: Critical
>  Labels: features, patch
>
> External applications like Apache Cassandra are able to deal with IPv6 
> addresses. Libraries like spark-cassandra-connector combine Apache Cassandra 
> with Apache Spark.
> This combination is very useful  IMHO. 
> One problem is that {code:java} 
> org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the 
> last colon to separate the port from the host part. This conflicts with literal 
> IPv6 addresses.
> I think we can treat {code}hostPort{code} as a literal IPv6 address if it 
> contains two or more colons. If IPv6 addresses are enclosed in square 
> brackets, port definition is still possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22180) Allow IPv6 address

2017-10-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187408#comment-16187408
 ] 

Apache Spark commented on SPARK-22180:
--

User 'obermeier' has created a pull request for this issue:
https://github.com/apache/spark/pull/19408

> Allow IPv6 address
> --
>
> Key: SPARK-22180
> URL: https://issues.apache.org/jira/browse/SPARK-22180
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Stefan Obermeier
>Priority: Critical
>  Labels: features, patch
>
> External applications like Apache Cassandra are able to deal with IPv6 
> addresses. Libraries like spark-cassandra-connector combine Apache Cassandra 
> with Apache Spark.
> This combination is very useful  IMHO. 
> One problem is that {code:java} 
> org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the 
> last colon to separate the port from the host part. This conflicts with literal 
> IPv6 addresses.
> I think we can treat {code}hostPort{code} as a literal IPv6 address if it 
> contains two or more colons. If IPv6 addresses are enclosed in square 
> brackets, port definition is still possible.






[jira] [Assigned] (SPARK-22180) Allow IPv6 address

2017-10-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22180:


Assignee: Apache Spark

> Allow IPv6 address
> --
>
> Key: SPARK-22180
> URL: https://issues.apache.org/jira/browse/SPARK-22180
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Stefan Obermeier
>Assignee: Apache Spark
>Priority: Critical
>  Labels: features, patch
>
> External applications like Apache Cassandra are able to deal with IPv6 
> addresses. Libraries like spark-cassandra-connector combine Apache Cassandra 
> with Apache Spark.
> This combination is very useful  IMHO. 
> One problem is that {code:java} 
> org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the 
> last colon to separate the port from the host part. This conflicts with literal 
> IPv6 addresses.
> I think we can treat {code}hostPort{code} as a literal IPv6 address if it 
> contains two or more colons. If IPv6 addresses are enclosed in square 
> brackets, port definition is still possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22180) Allow IPv6 address

2017-10-01 Thread Stefan Obermeier (JIRA)
Stefan Obermeier created SPARK-22180:


 Summary: Allow IPv6 address
 Key: SPARK-22180
 URL: https://issues.apache.org/jira/browse/SPARK-22180
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Stefan Obermeier
Priority: Critical


External applications like Apache Cassandra are able to deal with IPv6 
addresses. Libraries like spark-cassandra-connector combine Apache Cassandra 
with Apache Spark.
This combination is very useful  IMHO. 

One problem is that {code:java} 
org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the 
last colon to separate the port from the host part. This conflicts with literal 
IPv6 addresses.

I think we can treat {code}hostPort{code} as a literal IPv6 address if it 
contains two or more colons. If IPv6 addresses are enclosed in square brackets, 
port definition is still possible.
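A hedged sketch (not the actual Spark code; the function name is illustrative) of how {code}parseHostPort{code} could tolerate IPv6 literals under this proposal:

{code}
// "[::1]:8080" -> ("[::1]", 8080)   bracketed IPv6 literal with a port
// "::1"        -> ("::1", 0)        two or more colons: bare IPv6 literal, no port
// "host:8080"  -> ("host", 8080)    unchanged behaviour for host:port
def parseHostPortIPv6(hostPort: String): (String, Int) = {
  if (hostPort.startsWith("[")) {
    val end = hostPort.indexOf(']')
    val host = hostPort.substring(0, end + 1)
    val rest = hostPort.substring(end + 1)
    val port = if (rest.startsWith(":")) rest.drop(1).toInt else 0
    (host, port)
  } else if (hostPort.count(_ == ':') >= 2) {
    (hostPort, 0)
  } else {
    val idx = hostPort.lastIndexOf(':')
    if (idx < 0) (hostPort, 0)
    else (hostPort.substring(0, idx), hostPort.substring(idx + 1).toInt)
  }
}
{code}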










[jira] [Commented] (SPARK-22063) Upgrade lintr to latest commit sha1 ID

2017-10-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187328#comment-16187328
 ] 

Hyukjin Kwon commented on SPARK-22063:
--

The lint failures were fixed first in 
https://github.com/apache/spark/pull/19290; however, lintr was not actually 
upgraded to {{jimhester/lintr@5431140}} due to concerns about breaking other 
builds. Please see the discussion in the PR if anyone is interested. 
This is not yet fully solved.

> Upgrade lintr to latest commit sha1 ID
> --
>
> Key: SPARK-22063
> URL: https://issues.apache.org/jira/browse/SPARK-22063
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this 
> pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026])
>  and SPARK-14074.
> Today, I tried to upgrade to the latest, 
> https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72
> This fixes many bugs and now finds many instances that I have observed and 
> thought should be caught from time to time:
> {code}
> inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis 
> in a function call.
>   return (output)
>  ^
> R/column.R:241:1: style: Lines should not be more than 100 characters.
> #'
> \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{
> ^~~~
> R/context.R:332:1: style: Variable and function names should not be longer 
> than 30 characters.
> spark.getSparkFilesRootDirectory <- function() {
> ^~~~
> R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters.
> #' @param j,select expression for the single Column or a list of columns to 
> select from the SparkDataFrame.
> ^~~
> R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters.
> #' @return A new SparkDataFrame containing only the rows that meet the 
> condition with selected columns.
> ^~~
> R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a 
> function call.
>   return (joinRes)
>  ^
> R/DataFrame.R:2652:1: style: Variable and function names should not be longer 
> than 30 characters.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
> ^
> R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a 
> function call.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
>   ^
> R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a 
> function call.
> stop ("The following column name: ", newJoin, " occurs more than once 
> in the 'DataFrame'.",
>  ^
> R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters.
> #' @note The statistics provided by \code{summary} were change in 2.3.0 use 
> \link{describe} for previous defaults.
> ^~
> R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{cube} creates a single global 
> aggregate and is equivalent to
> ^~~
> R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{rollup} creates a single global 
> aggregate and is equivalent to
> ^
> R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a 
> function call.
>   switch (type,
>  ^
> R/functions.R:41:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{window}, it must be a time Column 
> of \code{TimestampType}.
> ^
> R/functions.R:93:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{shiftLeft}, \code{shiftRight} and 
> \code{shiftRightUnsigned},
> ^~~
> 

[jira] [Resolved] (SPARK-22177) Error running ml_ops.sh(SPOT): Can not create a Path from an empty string

2017-10-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22177.
---
Resolution: Invalid

This is not a Spark issue, but something to do with Spot. Somewhere you're 
feeding an empty path to some argument.

> Error running ml_ops.sh(SPOT): Can not create a Path from an empty string
> -
>
> Key: SPARK-22177
> URL: https://issues.apache.org/jira/browse/SPARK-22177
> Project: Spark
>  Issue Type: Question
>  Components: ML, Spark Submit, YARN
>Affects Versions: 2.2.0
> Environment: CentOS 7 1708
> Hadoop 2.6.0
> Scala 2.11.8
> SPOT 1.0
>Reporter: Jorge Pizarro
>Priority: Minor
>  Labels: newbie
>
> Error message running "./ml_ops.sh 20170922 dns 1e-4".
> Complete error message:
> [soluser@master spot-ml]$ bash -x ./ml_ops.sh 20170922 dns 1e-4
> + FDATE=20170922
> + DSOURCE=dns
> + YR=2017
> + MH=09
> + DY=22
> + [[ 8 != \8 ]]
> + [[ -z dns ]]
> + source /etc/spot.conf
> ++ UINODE=master
> ++ MLNODE=master
> ++ GWNODE=master
> ++ DBNAME=spotdb
> ++ HUSER=/user/soluser
> ++ NAME_NODE=master
> ++ WEB_PORT=50070
> ++ DNS_PATH=/user/soluser/dns/hive/y=2017/m=09/d=22/
> ++ PROXY_PATH=/user/soluser/dns/hive/y=2017/m=09/d=22/
> ++ FLOW_PATH=/user/soluser/dns/hive/y=2017/m=09/d=22/
> ++ HPATH=/user/soluser/dns/scored_results/20170922
> ++ IMPALA_DEM=master
> ++ IMPALA_PORT=21050
> ++ LUSER=/home/soluser
> ++ LPATH=/home/soluser/ml/dns/20170922
> ++ RPATH=/home/soluser/ipython/user/20170922
> ++ LIPATH=/home/soluser/ingest
> ++ USER_DOMAIN=neosecure
> ++ SPK_EXEC=1
> ++ SPK_EXEC_MEM=1g
> ++ SPK_DRIVER_MEM=1g
> ++ SPK_DRIVER_MAX_RESULTS=200m
> ++ SPK_EXEC_CORES=2
> ++ SPK_DRIVER_MEM_OVERHEAD=100m
> ++ SPK_EXEC_MEM_OVERHEAD=100m
> ++ SPK_AUTO_BRDCST_JOIN_THR=10485760
> ++ LDA_OPTIMIZER=em
> ++ LDA_ALPHA=1.02
> ++ LDA_BETA=1.001
> ++ PRECISION=64
> ++ TOL=1e-6
> ++ TOPIC_COUNT=20
> ++ DUPFACTOR=1000
> + '[' -n 1e-4 ']'
> + TOL=1e-4
> + '[' -n '' ']'
> + MAXRESULTS=-1
> + '[' dns == flow ']'
> + '[' dns == dns ']'
> + RAWDATA_PATH=/user/soluser/dns/hive/y=2017/m=09/d=22/
> + '[' '!' -z neosecure ']'
> + USER_DOMAIN_CMD='--userdomain neosecure'
> + 
> FEEDBACK_PATH=/user/soluser/dns/scored_results/20170922/feedback/ml_feedback.csv
> + HDFS_SCORED_CONNECTS=/user/soluser/dns/scored_results/20170922/scores
> + hdfs dfs -rm -R -f /user/soluser/dns/scored_results/20170922/scores
> + spark-submit --class org.apache.spot.SuspiciousConnects --master yarn 
> --deploy-mode cluster --driver-memory 1g --conf 
> spark.driver.maxResultSize=200m --conf spark.driver.maxPermSize=512m --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.maxExecutors=1 --conf spark.executor.cores=2 --conf 
> spark.executor.memory=1g --conf spark.sql.autoBroadcastJoinThreshold=10485760 
> --conf 'spark.executor.extraJavaOptions=-XX:MaxPermSize=512M 
> -XX:PermSize=512M' --conf spark.kryoserializer.buffer.max=512m --conf 
> spark.yarn.am.waitTime=100s --conf spark.yarn.am.memoryOverhead=100m --conf 
> spark.yarn.executor.memoryOverhead=100m 
> target/scala-2.11/spot-ml-assembly-1.1.jar --analysis dns --input 
> /user/soluser/dns/hive/y=2017/m=09/d=22/ --dupfactor 1000 --feedback 
> /user/soluser/dns/scored_results/20170922/feedback/ml_feedback.csv 
> --ldatopiccount 20 --scored /user/soluser/dns/scored_results/20170922/scores 
> --threshold 1e-4 --maxresults -1 --ldamaxiterations 20 --ldaalpha 1.02 
> --ldabeta 1.001 --ldaoptimizer em --precision 64 --userdomain neosecure
> 17/09/29 13:51:56 INFO client.RMProxy: Connecting to ResourceManager at 
> /0.0.0.0:8032
> 17/09/29 13:51:56 INFO yarn.Client: Requesting a new application from cluster 
> with 0 NodeManagers
> 17/09/29 13:51:56 INFO yarn.Client: Verifying our application has not 
> requested more than the maximum memory capability of the cluster (8192 MB per 
> container)
> 17/09/29 13:51:56 INFO yarn.Client: Will allocate AM container, with 1408 MB 
> memory including 384 MB overhead
> 17/09/29 13:51:56 INFO yarn.Client: Setting up container launch context for 
> our AM
> 17/09/29 13:51:56 INFO yarn.Client: Setting up the launch environment for our 
> AM container
> 17/09/29 13:51:56 INFO yarn.Client: Preparing resources for our AM container
> 17/09/29 13:51:57 INFO yarn.Client: Deleted staging directory 
> hdfs://master:9000/user/soluser/.sparkStaging/application_1506636890912_0058
> Exception in thread "main" java.lang.IllegalArgumentException: Can not create 
> a Path from an empty string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
>   at org.apache.hadoop.fs.Path.(Path.java:134)
>   at org.apache.hadoop.fs.Path.(Path.java:93)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:337)
>   at 
> 

[jira] [Assigned] (SPARK-21667) ConsoleSink should not fail streaming query with checkpointLocation option

2017-10-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21667:


Assignee: Apache Spark

> ConsoleSink should not fail streaming query with checkpointLocation option
> --
>
> Key: SPARK-21667
> URL: https://issues.apache.org/jira/browse/SPARK-21667
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Minor
>
> As agreed on the Spark users mailing list in the thread "\[SS] Console sink 
> not supporting recovering from checkpoint location? Why?" in which 
> [~marmbrus] said:
> {quote}
> I think there is really no good reason for this limitation.
> {quote}
> Using {{ConsoleSink}} should therefore not fail a streaming query when used 
> with {{checkpointLocation}} option.
> {code}
> // today's build from the master
> scala> spark.version
> res8: String = 2.3.0-SNAPSHOT
> scala> val q = records.
>  |   writeStream.
>  |   format("console").
>  |   option("truncate", false).
>  |   option("checkpointLocation", "/tmp/checkpoint"). // <--
> checkpoint directory
>  |   trigger(Trigger.ProcessingTime(10.seconds)).
>  |   outputMode(OutputMode.Update).
>  |   start
> org.apache.spark.sql.AnalysisException: This query does not support 
> recovering from checkpoint location. Delete /tmp/checkpoint/offsets to start 
> over.;
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:284)
>   ... 61 elided
> {code}
> The "trigger" is SPARK-16116 and [this 
> line|https://github.com/apache/spark/pull/13817/files#diff-d35e8fce09686073f81de598ed657de7R277]
>  in particular.
> This also relates to SPARK-19768 that was resolved as not a bug.






[jira] [Commented] (SPARK-21667) ConsoleSink should not fail streaming query with checkpointLocation option

2017-10-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187303#comment-16187303
 ] 

Apache Spark commented on SPARK-21667:
--

User 'rekhajoshm' has created a pull request for this issue:
https://github.com/apache/spark/pull/19407

> ConsoleSink should not fail streaming query with checkpointLocation option
> --
>
> Key: SPARK-21667
> URL: https://issues.apache.org/jira/browse/SPARK-21667
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> As agreed on the Spark users mailing list in the thread "\[SS] Console sink 
> not supporting recovering from checkpoint location? Why?" in which 
> [~marmbrus] said:
> {quote}
> I think there is really no good reason for this limitation.
> {quote}
> Using {{ConsoleSink}} should therefore not fail a streaming query when used 
> with {{checkpointLocation}} option.
> {code}
> // today's build from the master
> scala> spark.version
> res8: String = 2.3.0-SNAPSHOT
> scala> val q = records.
>  |   writeStream.
>  |   format("console").
>  |   option("truncate", false).
>  |   option("checkpointLocation", "/tmp/checkpoint"). // <--
> checkpoint directory
>  |   trigger(Trigger.ProcessingTime(10.seconds)).
>  |   outputMode(OutputMode.Update).
>  |   start
> org.apache.spark.sql.AnalysisException: This query does not support 
> recovering from checkpoint location. Delete /tmp/checkpoint/offsets to start 
> over.;
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:284)
>   ... 61 elided
> {code}
> The "trigger" is SPARK-16116 and [this 
> line|https://github.com/apache/spark/pull/13817/files#diff-d35e8fce09686073f81de598ed657de7R277]
>  in particular.
> This also relates to SPARK-19768 that was resolved as not a bug.






[jira] [Assigned] (SPARK-21667) ConsoleSink should not fail streaming query with checkpointLocation option

2017-10-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21667:


Assignee: (was: Apache Spark)

> ConsoleSink should not fail streaming query with checkpointLocation option
> --
>
> Key: SPARK-21667
> URL: https://issues.apache.org/jira/browse/SPARK-21667
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> As agreed on the Spark users mailing list in the thread "\[SS] Console sink 
> not supporting recovering from checkpoint location? Why?" in which 
> [~marmbrus] said:
> {quote}
> I think there is really no good reason for this limitation.
> {quote}
> Using {{ConsoleSink}} should therefore not fail a streaming query when used 
> with {{checkpointLocation}} option.
> {code}
> // today's build from the master
> scala> spark.version
> res8: String = 2.3.0-SNAPSHOT
> scala> val q = records.
>  |   writeStream.
>  |   format("console").
>  |   option("truncate", false).
>  |   option("checkpointLocation", "/tmp/checkpoint"). // <--
> checkpoint directory
>  |   trigger(Trigger.ProcessingTime(10.seconds)).
>  |   outputMode(OutputMode.Update).
>  |   start
> org.apache.spark.sql.AnalysisException: This query does not support 
> recovering from checkpoint location. Delete /tmp/checkpoint/offsets to start 
> over.;
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:284)
>   ... 61 elided
> {code}
> The "trigger" is SPARK-16116 and [this 
> line|https://github.com/apache/spark/pull/13817/files#diff-d35e8fce09686073f81de598ed657de7R277]
>  in particular.
> This also relates to SPARK-19768 that was resolved as not a bug.


