[jira] [Commented] (SPARK-19984) ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'
[ https://issues.apache.org/jira/browse/SPARK-19984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187644#comment-16187644 ] John Steidley commented on SPARK-19984: --- [~kiszk] I just hit this issue. I am able to reproduce it consistently on Spark 2.1.1 from inside my codebase, but not in the spark shell. Here's a rough equivalent of my code (again, this does not reproduce the bug in the spark shell):

```
val dfA = sc.parallelize(Seq("John", "Kazuaki")).toDF("id")
val dfB = dfA.select(col("id").alias("A"), col("id").alias("B"))
val dfC = sc.parallelize(Seq("John", "Kazuaki")).toDF("A")
dfC.join(dfB, "A")
```

I have thought about nullability: making dfA's "id" column non-nullable makes the issue go away (following the directions here: https://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe). In my real code dfA and dfC are related (both are derived from the same dataframe); could that matter here? What could I be missing? Is there any more information you would like from me? I can try harder to track down the issue, but I can't share the whole codebase.

> ERROR codegen.CodeGenerator: failed to compile:
> org.codehaus.commons.compiler.CompileException: File 'generated.java'
> -
>
> Key: SPARK-19984
> URL: https://issues.apache.org/jira/browse/SPARK-19984
> Project: Spark
> Issue Type: Bug
> Components: Optimizer
> Affects Versions: 2.1.0
> Reporter: Andrey Yakovenko
>
> I have had this error a few times on my local Hadoop 2.7.3 + Spark 2.1.0 environment.
> This is not a permanent error; the next time I run it, it could disappear.
> Unfortunately I don't know how to reproduce the issue. As you can see from
> the log, my logic is pretty complicated.
> Here is a part of the log I've got (container_1489514660953_0015_01_01):
> {code}
> 17/03/16 11:07:04 ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 151, Column 29: A method named "compare" is not declared in any enclosing class nor any supertype, nor through a static import
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private scala.collection.Iterator[] inputs;
> /* 008 */   private boolean agg_initAgg;
> /* 009 */   private boolean agg_bufIsNull;
> /* 010 */   private long agg_bufValue;
> /* 011 */   private boolean agg_initAgg1;
> /* 012 */   private boolean agg_bufIsNull1;
> /* 013 */   private long agg_bufValue1;
> /* 014 */   private scala.collection.Iterator smj_leftInput;
> /* 015 */   private scala.collection.Iterator smj_rightInput;
> /* 016 */   private InternalRow smj_leftRow;
> /* 017 */   private InternalRow smj_rightRow;
> /* 018 */   private UTF8String smj_value2;
> /* 019 */   private java.util.ArrayList smj_matches;
> /* 020 */   private UTF8String smj_value3;
> /* 021 */   private UTF8String smj_value4;
> /* 022 */   private org.apache.spark.sql.execution.metric.SQLMetric smj_numOutputRows;
> /* 023 */   private UnsafeRow smj_result;
> /* 024 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder smj_holder;
> /* 025 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter smj_rowWriter;
> /* 026 */   private org.apache.spark.sql.execution.metric.SQLMetric agg_numOutputRows;
> /* 027 */   private org.apache.spark.sql.execution.metric.SQLMetric agg_aggTime;
> /* 028 */   private UnsafeRow agg_result;
> /* 029 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
> /* 030 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter;
> /* 031 */   private org.apache.spark.sql.execution.metric.SQLMetric agg_numOutputRows1;
> /* 032 */   private org.apache.spark.sql.execution.metric.SQLMetric agg_aggTime1;
> /* 033 */   private UnsafeRow agg_result1;
> /* 034 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder1;
> /* 035 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter1;
> /* 036 */
> /* 037 */   public GeneratedIterator(Object[] references) {
> /* 038 */     this.references = references;
> /* 039 */   }
> /* 040 */
> /* 041 */   public void init(int index, scala.collection.Iterator[] inputs) {
> /* 042 */     partitionIndex = index;
> /* 043 */     this.inputs = inputs;
> /* 044 */     wholestagecodegen_init_0();
> /* 045 */     wholestagecodegen_init_1();
> /* 046 */
[jira] [Updated] (SPARK-22179) percentile_approx should choose the first element if it already reaches the percentage
[ https://issues.apache.org/jira/browse/SPARK-22179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-22179: - Description: percentile_approx should choose the first element if it already reaches the percentage. For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1 (instead of 2), because the first value 1 already reaches 10%. Currently it returns a wrong answer: 2. was: percentile_approx should choose the first element if it already reaches the percentage. For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1 (instead of 2), because the first value 1 already reaches 10% percentage. Currently it returns a wrong answer: 2. > percentile_approx should choose the first element if it already reaches the > percentage > -- > > Key: SPARK-22179 > URL: https://issues.apache.org/jira/browse/SPARK-22179 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > percentile_approx should choose the first element if it already reaches the > percentage. > For example, given input data 1 to 10, if a user queries 10% (or even less) > percentile, it should return 1 (instead of 2), because the first value 1 > already reaches 10%. Currently it returns a wrong answer: 2. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
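[Editorial note] The expected semantics can be sketched outside Spark with a minimal exact-percentile reference in Python. The helper `percentile` below is hypothetical (it is not Spark's `percentile_approx` implementation); it just encodes the rule this ticket asks for: choose the first sorted element whose cumulative share of the data reaches the requested percentage.

```python
def percentile(values, fraction):
    """Return the first sorted element whose cumulative share of the
    data reaches `fraction` (0 < fraction <= 1)."""
    ordered = sorted(values)
    n = len(ordered)
    for i, v in enumerate(ordered, start=1):
        if i / n >= fraction:  # element i covers the first i/n of the data
            return v
    return ordered[-1]

# For input data 1 to 10, the first value already covers 10% of the data,
# so a 10% (or smaller) percentile should return 1, not 2.
print(percentile(range(1, 11), 0.10))  # → 1
```

Under these semantics, any fraction at or below 1/n maps to the minimum element, which is the behavior the ticket argues `percentile_approx` should match.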
[jira] [Updated] (SPARK-22179) percentile_approx should choose the first element if it already reaches the percentage
[ https://issues.apache.org/jira/browse/SPARK-22179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-22179: - Description: percentile_approx should choose the first element if it already reaches the percentage. For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1 (instead of 2), because the first value 1 already reaches 10% percentage. Currently it returns a wrong answer: 2. was: percentile_approx should choose the first element if it already reaches the percentage. For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1 (instead of 2), because the first value 1 already reaches 10% percentage. > percentile_approx should choose the first element if it already reaches the > percentage > -- > > Key: SPARK-22179 > URL: https://issues.apache.org/jira/browse/SPARK-22179 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > percentile_approx should choose the first element if it already reaches the > percentage. > For example, given input data 1 to 10, if a user queries 10% (or even less) > percentile, it should return 1 (instead of 2), because the first value 1 > already reaches 10% percentage. Currently it returns a wrong answer: 2. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22179) percentile_approx should choose the first element if it already reaches the percentage
[ https://issues.apache.org/jira/browse/SPARK-22179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-22179: - Description: percentile_approx should choose the first element if it already reaches the percentage. For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1 (instead of 2), because the first value 1 already reaches 10% percentage. was:percentile_approx should choose the first element if it already reaches the percentage > percentile_approx should choose the first element if it already reaches the > percentage > -- > > Key: SPARK-22179 > URL: https://issues.apache.org/jira/browse/SPARK-22179 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > percentile_approx should choose the first element if it already reaches the > percentage. > For example, given input data 1 to 10, if a user queries 10% (or even less) > percentile, it should return 1 (instead of 2), because the first value 1 > already reaches 10% percentage. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22063) Upgrade lintr to latest commit sha1 ID
[ https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187574#comment-16187574 ] Felix Cheung commented on SPARK-22063: -- surely, I think we could even start with something simple with install.package(..., lib =) (or install_github(..., lib=)) and then library(... lib.loc) > Upgrade lintr to latest commit sha1 ID > -- > > Key: SPARK-22063 > URL: https://issues.apache.org/jira/browse/SPARK-22063 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this > pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026]) > and SPARK-14074. > Today, I tried to upgrade the latest, > https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72 > This fixes many bugs and now finds many instances that I have observed and > thought should be caught time to time: > {code} > inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis > in a function call. > return (output) > ^ > R/column.R:241:1: style: Lines should not be more than 100 characters. > #' > \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{ > ^~~~ > R/context.R:332:1: style: Variable and function names should not be longer > than 30 characters. > spark.getSparkFilesRootDirectory <- function() { > ^~~~ > R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters. > #' @param j,select expression for the single Column or a list of columns to > select from the SparkDataFrame. > ^~~ > R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters. > #' @return A new SparkDataFrame containing only the rows that meet the > condition with selected columns. > ^~~ > R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a > function call. 
> return (joinRes) > ^ > R/DataFrame.R:2652:1: style: Variable and function names should not be longer > than 30 characters. > generateAliasesForIntersectedCols <- function (x, intersectedColNames, > suffix) { > ^ > R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a > function call. > generateAliasesForIntersectedCols <- function (x, intersectedColNames, > suffix) { > ^ > R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a > function call. > stop ("The following column name: ", newJoin, " occurs more than once > in the 'DataFrame'.", > ^ > R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters. > #' @note The statistics provided by \code{summary} were change in 2.3.0 use > \link{describe} for previous defaults. > ^~ > R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters. > #' If grouping expression is missing \code{cube} creates a single global > aggregate and is equivalent to > ^~~ > R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters. > #' If grouping expression is missing \code{rollup} creates a single global > aggregate and is equivalent to > ^ > R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a > function call. > switch (type, > ^ > R/functions.R:41:1: style: Lines should not be more than 100 characters. > #' @param x Column to compute on. In \code{window}, it must be a time Column > of \code{TimestampType}. > ^ > R/functions.R:93:1: style: Lines should not be more than 100 characters. > #' @param x Column to compute on. In \code{shiftLeft}, \code{shiftRight} and > \code{shiftRightUnsigned}, > ^~~ > R/functions.R:483:52: style: Remove spaces before the left parenthesis in a > function call. > jcols <- lapply(list(x, .
[jira] [Commented] (SPARK-22063) Upgrade lintr to latest commit sha1 ID
[ https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187551#comment-16187551 ] Shivaram Venkataraman commented on SPARK-22063: --- [~shaneknapp] [~felixcheung] Lets move the discussion to the JIRA ? I think there are a couple of ways to address this issue -- the first as [~hyukjin.kwon] pointed out we can make the lint-r script do the installation. I am not too much in favor of that as it will result in the script affecting packages at runtime. Instead I was thinking if we could create R environments for each Spark version -- https://stackoverflow.com/questions/24283171/virtual-environment-in-r has a bunch of ideas on how to do this. Any thoughts on the approaches listed there ? > Upgrade lintr to latest commit sha1 ID > -- > > Key: SPARK-22063 > URL: https://issues.apache.org/jira/browse/SPARK-22063 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this > pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026]) > and SPARK-14074. > Today, I tried to upgrade the latest, > https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72 > This fixes many bugs and now finds many instances that I have observed and > thought should be caught time to time: > {code} > inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis > in a function call. > return (output) > ^ > R/column.R:241:1: style: Lines should not be more than 100 characters. > #' > \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{ > ^~~~ > R/context.R:332:1: style: Variable and function names should not be longer > than 30 characters. > spark.getSparkFilesRootDirectory <- function() { > ^~~~ > R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters. 
> #' @param j,select expression for the single Column or a list of columns to > select from the SparkDataFrame. > ^~~ > R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters. > #' @return A new SparkDataFrame containing only the rows that meet the > condition with selected columns. > ^~~ > R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a > function call. > return (joinRes) > ^ > R/DataFrame.R:2652:1: style: Variable and function names should not be longer > than 30 characters. > generateAliasesForIntersectedCols <- function (x, intersectedColNames, > suffix) { > ^ > R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a > function call. > generateAliasesForIntersectedCols <- function (x, intersectedColNames, > suffix) { > ^ > R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a > function call. > stop ("The following column name: ", newJoin, " occurs more than once > in the 'DataFrame'.", > ^ > R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters. > #' @note The statistics provided by \code{summary} were change in 2.3.0 use > \link{describe} for previous defaults. > ^~ > R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters. > #' If grouping expression is missing \code{cube} creates a single global > aggregate and is equivalent to > ^~~ > R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters. > #' If grouping expression is missing \code{rollup} creates a single global > aggregate and is equivalent to > ^ > R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a > function call. > switch (type, > ^ > R/functions.R:41:1: style: Lines should not be more than 100 characters. > #' @param x Column to compute on. In \code{window}, it must be a time Column > of \code{TimestampType}. > ^~~
[jira] [Commented] (SPARK-22181) ReplaceExceptWithNotFilter if one or both of the datasets are fully derived out of Filters from a same parent
[ https://issues.apache.org/jira/browse/SPARK-22181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187526#comment-16187526 ] Sathiya Kumar commented on SPARK-22181: --- I will create the PR if you agree.. > ReplaceExceptWithNotFilter if one or both of the datasets are fully derived > out of Filters from a same parent > - > > Key: SPARK-22181 > URL: https://issues.apache.org/jira/browse/SPARK-22181 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Sathiya Kumar >Priority: Minor > > While applying Except operator between two datasets, if one or both of the > datasets are purely transformed using filter operations, then instead of > rewriting the Except operator using expensive join operation, we can rewrite > it using filter operation by flipping the filter condition of the right node. > ![Case-1](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case1.png) > ![Case-2](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case2.png) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22181) ReplaceExceptWithNotFilter if one or both of the datasets are fully derived out of Filters from a same parent
[ https://issues.apache.org/jira/browse/SPARK-22181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187525#comment-16187525 ] Sathiya Kumar commented on SPARK-22181: --- Here is the rule implementation, which could be scheduled before the `ReplaceExceptWithAntiJoin` rule:

{code:java}
object ReplaceExceptWithNotFilter extends Rule[LogicalPlan] {

  implicit def nodeToFilter(node: LogicalPlan): Filter = node.asInstanceOf[Filter]

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Except(left, right) if isEligible(left, right) =>
      Distinct(
        Filter(Not(replaceAttributesIn(combineFilters(right).condition, left)), left)
      )
  }

  def isEligible(left: LogicalPlan, right: LogicalPlan): Boolean = (left, right) match {
    case (left: Filter, right: Filter) => parent(left).sameResult(parent(right))
    case (left, right: Filter) => left.sameResult(parent(right))
    case _ => false
  }

  def parent(plan: LogicalPlan): LogicalPlan = plan match {
    case x @ Filter(_, child) => parent(child)
    case x => x
  }

  def combineFilters(plan: LogicalPlan): LogicalPlan = CombineFilters(plan) match {
    case result if !result.fastEquals(plan) => combineFilters(result)
    case result => result
  }

  def replaceAttributesIn(condition: Expression, leftChild: LogicalPlan): Expression = {
    condition transform {
      case AttributeReference(name, _, _, _) => leftChild.output.find(_.name == name).get
    }
  }
}
{code}

> ReplaceExceptWithNotFilter if one or both of the datasets are fully derived > out of Filters from a same parent > - > > Key: SPARK-22181 > URL: https://issues.apache.org/jira/browse/SPARK-22181 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Sathiya Kumar >Priority: Minor > > While applying Except operator between two datasets, if one or both of the > datasets are purely transformed using filter operations, then instead of > rewriting the Except operator using expensive join operation, we can rewrite > it using filter operation by flipping the filter condition of the right node. > ![Case-1](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case1.png) > ![Case-2](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case2.png) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
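[Editorial note] The equivalence this rewrite relies on can be illustrated with plain Python collections (a sketch only, not Spark code, and ignoring SQL three-valued null semantics): when the right side is `left.filter(cond)`, `left.except(right)` gives the same result as filtering `left` with the flipped condition and taking distinct rows.

```python
def except_via_join(left, right):
    # Baseline Except semantics: distinct rows of `left` not present in `right`.
    return sorted(set(left) - set(right))

def except_via_not_filter(left, cond):
    # The rewrite: when right == [x for x in left if cond(x)], flip the
    # filter condition instead of performing an (anti-)join.
    return sorted(set(x for x in left if not cond(x)))

left = [1, 2, 3, 4, 5, 4]
cond = lambda x: x % 2 == 0              # right dataset: even values of left
right = [x for x in left if cond(x)]

assert except_via_join(left, right) == except_via_not_filter(left, cond)
print(except_via_not_filter(left, cond))  # → [1, 3, 5]
```

The avoided join is what makes the rewrite attractive: the flipped filter is a single scan of the left child.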
[jira] [Commented] (SPARK-22181) ReplaceExceptWithNotFilter if one or both of the datasets are fully derived out of Filters from a same parent
[ https://issues.apache.org/jira/browse/SPARK-22181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187519#comment-16187519 ] Sathiya Kumar commented on SPARK-22181: --- Please grant me assign rights so that I can assign this task to myself, thank you. > ReplaceExceptWithNotFilter if one or both of the datasets are fully derived > out of Filters from a same parent > - > > Key: SPARK-22181 > URL: https://issues.apache.org/jira/browse/SPARK-22181 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Sathiya Kumar >Priority: Minor > > While applying Except operator between two datasets, if one or both of the > datasets are purely transformed using filter operations, then instead of > rewriting the Except operator using expensive join operation, we can rewrite > it using filter operation by flipping the filter condition of the right node. > ![Case-1](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case1.png) > ![Case-2](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case2.png) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22180) Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort
[ https://issues.apache.org/jira/browse/SPARK-22180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187508#comment-16187508 ] Sean Owen commented on SPARK-22180: --- To be clear, I don't think this makes IPv6 fully work for Spark, but it may let some more things happen to work. > Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort > --- > > Key: SPARK-22180 > URL: https://issues.apache.org/jira/browse/SPARK-22180 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Stefan Obermeier >Priority: Minor > > External applications like Apache Cassandra are able to deal with IPv6 > addresses. Libraries like spark-cassandra-connector combine Apache Cassandra > with Apache Spark. > This combination is very useful IMHO. > One problem is that {code:java} > org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the > last colon to separate the port from the host part. This conflicts with literal > IPv6 addresses. > I think we can treat {code}hostPort{code} as a literal IPv6 address if it > contains two or more colons. If IPv6 addresses are enclosed in square > brackets, port definition is still possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
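[Editorial note] The rule proposed in the description can be sketched as follows. This is an illustrative Python sketch of the suggested behavior, not Spark's actual `Utils.parseHostPort` (which simply splits on the last colon): treat the input as a bare literal IPv6 address when it contains two or more colons, and read a port only after a bracketed address or after the single colon of a host:port pair.

```python
def parse_host_port(host_port):
    """Split 'host', 'host:port', '[v6addr]', or '[v6addr]:port'.

    Returns (host, port) with port == -1 when absent. Illustrative only.
    """
    if host_port.startswith("["):
        # Bracketed IPv6 literal: the port, if any, follows the ']'.
        close = host_port.index("]")
        host = host_port[1:close]
        rest = host_port[close + 1:]
        port = int(rest[1:]) if rest.startswith(":") else -1
        return host, port
    if host_port.count(":") >= 2:
        # Two or more colons: a bare literal IPv6 address, no port.
        return host_port, -1
    host, sep, port = host_port.partition(":")
    return host, int(port) if sep else -1

print(parse_host_port("example.com:7077"))  # → ('example.com', 7077)
print(parse_host_port("::1"))               # → ('::1', -1)
print(parse_host_port("[::1]:7077"))        # → ('::1', 7077)
```

As the comment above notes, this on its own would not make IPv6 fully work for Spark; it only disambiguates the host/port split.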
[jira] [Updated] (SPARK-22180) Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort
[ https://issues.apache.org/jira/browse/SPARK-22180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22180: -- Priority: Minor (was: Critical) > Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort > --- > > Key: SPARK-22180 > URL: https://issues.apache.org/jira/browse/SPARK-22180 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Stefan Obermeier >Priority: Minor > > External applications like Apache Cassandra are able to deal with IPv6 > addresses. Libraries like spark-cassandra-connector combine Apache Cassandra > with Apache Spark. > This combination is very useful IMHO. > One problem is that {code:java} > org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the > last colon to separate the port from the host part. This conflicts with literal > IPv6 addresses. > I think we can treat {code}hostPort{code} as a literal IPv6 address if it > contains two or more colons. If IPv6 addresses are enclosed in square > brackets, port definition is still possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22180) Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort
[ https://issues.apache.org/jira/browse/SPARK-22180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22180: -- Labels: (was: features patch) > Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort > --- > > Key: SPARK-22180 > URL: https://issues.apache.org/jira/browse/SPARK-22180 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Stefan Obermeier >Priority: Minor > > External applications like Apache Cassandra are able to deal with IPv6 > addresses. Libraries like spark-cassandra-connector combine Apache Cassandra > with Apache Spark. > This combination is very useful IMHO. > One problem is that {code:java} > org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the > last colon to separate the port from the host part. This conflicts with literal > IPv6 addresses. > I think we can treat {code}hostPort{code} as a literal IPv6 address if it > contains two or more colons. If IPv6 addresses are enclosed in square > brackets, port definition is still possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22179) percentile_approx should choose the first element if it already reaches the percentage
[ https://issues.apache.org/jira/browse/SPARK-22179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22179: Description: percentile_approx should choose the first element if it already reaches the percentage > percentile_approx should choose the first element if it already reaches the > percentage > -- > > Key: SPARK-22179 > URL: https://issues.apache.org/jira/browse/SPARK-22179 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > percentile_approx should choose the first element if it already reaches the > percentage -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22001) ImputerModel can do withColumn for all input columns at one pass
[ https://issues.apache.org/jira/browse/SPARK-22001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-22001: --- Assignee: Liang-Chi Hsieh > ImputerModel can do withColumn for all input columns at one pass > > > Key: SPARK-22001 > URL: https://issues.apache.org/jira/browse/SPARK-22001 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > SPARK-21690 makes one-pass {{Imputer}} by parallelizing the computation of > all input columns. When we transform dataset with {{ImputerModel}}, we do > {{withColumn}} on all input columns sequentially. We can also do this on all > input columns at once. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22001) ImputerModel can do withColumn for all input columns at one pass
[ https://issues.apache.org/jira/browse/SPARK-22001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22001. - Resolution: Fixed Fix Version/s: 2.3.0 > ImputerModel can do withColumn for all input columns at one pass > > > Key: SPARK-22001 > URL: https://issues.apache.org/jira/browse/SPARK-22001 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > SPARK-21690 makes one-pass {{Imputer}} by parallelizing the computation of > all input columns. When we transform dataset with {{ImputerModel}}, we do > {{withColumn}} on all input columns sequentially. We can also do this on all > input columns at once. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
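[Editorial note] The change described above can be sketched with plain Python (hypothetical column-map helpers, not the MLlib API): chaining one `withColumn` per input column rebuilds the projection once per column, whereas a single `select` applies every replacement in one pass over the schema, producing the same result.

```python
def with_column(df, name, values):
    # One projection pass per call, mirroring sequential withColumn.
    out = dict(df)
    out[name] = values
    return out

def select_all(df, replacements):
    # Single pass: apply every replacement while copying the schema once.
    return {name: replacements.get(name, values) for name, values in df.items()}

df = {"a": [1, None, 3], "b": [None, 2, 2]}
imputed = {"a": [1, 2, 3], "b": [2, 2, 2]}   # surrogate values already computed

sequential = df
for name, values in imputed.items():
    sequential = with_column(sequential, name, values)

# Both strategies yield the same columns; the one-pass form avoids
# building an intermediate projection per input column.
assert sequential == select_all(df, imputed)
```

In Spark terms, the win is avoiding one analyzed plan per `withColumn` call when the imputer has many input columns.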
[jira] [Updated] (SPARK-22181) ReplaceExceptWithNotFilter if one or both of the datasets are fully derived out of Filters from a same parent
[ https://issues.apache.org/jira/browse/SPARK-22181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sathiya Kumar updated SPARK-22181: -- Description: While applying Except operator between two datasets, if one or both of the datasets are purely transformed using filter operations, then instead of rewriting the Except operator using expensive join operation, we can rewrite it using filter operation by flipping the filter condition of the right node. ![Case-1](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case1.png) ![Case-2](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case2.png) was: While applying Except operator between two datasets, if one or both of the datasets are purely transformed using filter operations, then instead of rewriting the Except operator using expensive join operation, we can rewrite it using filter operation by flipping the filter condition of the right node. !https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case1.png! !https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case2.png! > ReplaceExceptWithNotFilter if one or both of the datasets are fully derived > out of Filters from a same parent > - > > Key: SPARK-22181 > URL: https://issues.apache.org/jira/browse/SPARK-22181 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Sathiya Kumar >Priority: Minor > > While applying Except operator between two datasets, if one or both of the > datasets are purely transformed using filter operations, then instead of > rewriting the Except operator using expensive join operation, we can rewrite > it using filter operation by flipping the filter condition of the right node. 
> ![Case-1](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case1.png) > ![Case-2](https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case2.png) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22181) ReplaceExceptWithNotFilter if one or both of the datasets are fully derived out of Filters from a same parent
Sathiya Kumar created SPARK-22181: - Summary: ReplaceExceptWithNotFilter if one or both of the datasets are fully derived out of Filters from a same parent Key: SPARK-22181 URL: https://issues.apache.org/jira/browse/SPARK-22181 Project: Spark Issue Type: New Feature Components: Optimizer, SQL Affects Versions: 2.2.0, 2.1.1 Reporter: Sathiya Kumar Priority: Minor While applying the Except operator between two datasets, if one or both of the datasets are purely transformed using filter operations, then instead of rewriting the Except operator as an expensive join operation, we can rewrite it as a filter operation by flipping the filter condition of the right node. !https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case1.png! !https://github.com/sathiyapk/Blog-Posts/blob/master/images/spark-optimizer/ReplaceExceptWithNotFilter-case2.png!
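The proposed rewrite can be illustrated outside Spark with plain predicates: since SQL's EXCEPT is a distinct set difference, except(parent.filter(p), parent.filter(q)) over a common parent is equivalent to a single pass parent.filter(p and not q) plus deduplication, which is what makes the join unnecessary. A minimal Python sketch of this equivalence (all names are illustrative; the real rule operates on Catalyst logical plans, and the actual negation must also account for SQL null semantics):

```python
# Sketch of the ReplaceExceptWithNotFilter idea using plain Python
# predicates over a shared "parent" collection. Hypothetical helpers,
# not Spark code.

def except_via_join(parent, p, q):
    """Baseline semantics: EXCEPT is a distinct set difference
    between the two filtered sides (the expensive, join-like form)."""
    left = [r for r in parent if p(r)]
    right = [r for r in parent if q(r)]
    out, seen = [], set()
    for r in left:
        if r not in right and r not in seen:
            seen.add(r)
            out.append(r)
    return out

def except_via_not_filter(parent, p, q):
    """Rewritten form: one filter over the parent with the right-side
    condition flipped, then distinct."""
    out, seen = [], set()
    for r in parent:
        if p(r) and not q(r) and r not in seen:
            seen.add(r)
            out.append(r)
    return out

parent = [1, 2, 2, 3, 4, 5]
p = lambda r: r > 1        # left-side filter
q = lambda r: r % 2 == 0   # right-side filter
print(except_via_join(parent, p, q))        # [3, 5]
print(except_via_not_filter(parent, p, q))  # [3, 5]
```

Both forms agree here; the rewritten form scans the parent once instead of materializing and differencing two filtered children.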
[jira] [Updated] (SPARK-22180) Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort
[ https://issues.apache.org/jira/browse/SPARK-22180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Obermeier updated SPARK-22180: - Summary: Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort (was: Allow IPv6 address) > Allow IPv6 address in org.apache.spark.util.Utils.parseHostPort > --- > > Key: SPARK-22180 > URL: https://issues.apache.org/jira/browse/SPARK-22180 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Stefan Obermeier >Priority: Critical > Labels: features, patch > > External applications like Apache Cassandra are able to deal with IPv6 > addresses. Libraries like spark-cassandra-connector combine Apache Cassandra > with Apache Spark. > This combination is very useful IMHO. > One problem is that {code:java} > org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the > last colon to separate the port from the host part. This conflicts with literal > IPv6 addresses. > I think we can take {code}hostPort{code} as a literal IPv6 address if it > contains two or more colons. If IPv6 addresses are enclosed in square > brackets, specifying a port is still possible.
[jira] [Assigned] (SPARK-22180) Allow IPv6 address
[ https://issues.apache.org/jira/browse/SPARK-22180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22180: Assignee: (was: Apache Spark) > Allow IPv6 address > -- > > Key: SPARK-22180 > URL: https://issues.apache.org/jira/browse/SPARK-22180 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Stefan Obermeier >Priority: Critical > Labels: features, patch > > External applications like Apache Cassandra are able to deal with IPv6 > addresses. Libraries like spark-cassandra-connector combine Apache Cassandra > with Apache Spark. > This combination is very useful IMHO. > One problem is that {code:java} > org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the > last colon to separate the port from the host part. This conflicts with literal > IPv6 addresses. > I think we can take {code}hostPort{code} as a literal IPv6 address if it > contains two or more colons. If IPv6 addresses are enclosed in square > brackets, specifying a port is still possible.
[jira] [Commented] (SPARK-22180) Allow IPv6 address
[ https://issues.apache.org/jira/browse/SPARK-22180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187408#comment-16187408 ] Apache Spark commented on SPARK-22180: -- User 'obermeier' has created a pull request for this issue: https://github.com/apache/spark/pull/19408 > Allow IPv6 address > -- > > Key: SPARK-22180 > URL: https://issues.apache.org/jira/browse/SPARK-22180 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Stefan Obermeier >Priority: Critical > Labels: features, patch > > External applications like Apache Cassandra are able to deal with IPv6 > addresses. Libraries like spark-cassandra-connector combine Apache Cassandra > with Apache Spark. > This combination is very useful IMHO. > One problem is that {code:java} > org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the > last colon to separate the port from the host part. This conflicts with literal > IPv6 addresses. > I think we can take {code}hostPort{code} as a literal IPv6 address if it > contains two or more colons. If IPv6 addresses are enclosed in square > brackets, specifying a port is still possible.
[jira] [Assigned] (SPARK-22180) Allow IPv6 address
[ https://issues.apache.org/jira/browse/SPARK-22180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22180: Assignee: Apache Spark > Allow IPv6 address > -- > > Key: SPARK-22180 > URL: https://issues.apache.org/jira/browse/SPARK-22180 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Stefan Obermeier >Assignee: Apache Spark >Priority: Critical > Labels: features, patch > > External applications like Apache Cassandra are able to deal with IPv6 > addresses. Libraries like spark-cassandra-connector combine Apache Cassandra > with Apache Spark. > This combination is very useful IMHO. > One problem is that {code:java} > org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the > last colon to separate the port from the host part. This conflicts with literal > IPv6 addresses. > I think we can take {code}hostPort{code} as a literal IPv6 address if it > contains two or more colons. If IPv6 addresses are enclosed in square > brackets, specifying a port is still possible.
[jira] [Created] (SPARK-22180) Allow IPv6 address
Stefan Obermeier created SPARK-22180: Summary: Allow IPv6 address Key: SPARK-22180 URL: https://issues.apache.org/jira/browse/SPARK-22180 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Reporter: Stefan Obermeier Priority: Critical External applications like Apache Cassandra are able to deal with IPv6 addresses. Libraries like spark-cassandra-connector combine Apache Cassandra with Apache Spark. This combination is very useful IMHO. One problem is that {code:java} org.apache.spark.util.Utils.parseHostPort(hostPort: String) {code} takes the last colon to separate the port from the host part. This conflicts with literal IPv6 addresses. I think we can take {code}hostPort{code} as a literal IPv6 address if it contains two or more colons. If IPv6 addresses are enclosed in square brackets, specifying a port is still possible.
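The heuristic proposed in the report can be sketched in pure Python (a hypothetical helper for illustration, not Spark's actual parseHostPort implementation): a string with two or more colons and no brackets is treated as a bare IPv6 literal with no port, while a bracketed [addr]:port form still allows a port.

```python
# Hypothetical sketch of an IPv6-aware host:port parser, illustrating the
# heuristic from the report. Not Spark's code; port 0 stands in for
# "no port given".

def parse_host_port(host_port):
    if host_port.startswith("["):
        # Bracketed IPv6 literal, optionally followed by :port.
        close = host_port.index("]")
        host = host_port[1:close]
        rest = host_port[close + 1:]
        port = int(rest[1:]) if rest.startswith(":") else 0
        return host, port
    if host_port.count(":") >= 2:
        # Two or more colons without brackets: treat the whole string
        # as a bare IPv6 literal; the colons belong to the address.
        return host_port, 0
    # Plain host or host:port; the single colon separates the port.
    host, sep, port = host_port.partition(":")
    return host, (int(port) if port else 0)

print(parse_host_port("example.com:7077"))  # ('example.com', 7077)
print(parse_host_port("::1"))               # ('::1', 0)
print(parse_host_port("[::1]:7077"))        # ('::1', 7077)
```

The bracket convention matches how URIs carry IPv6 hosts, so existing host:port inputs keep working while bare IPv6 literals stop being mis-split at their last colon.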
[jira] [Commented] (SPARK-22063) Upgrade lintr to latest commit sha1 ID
[ https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187328#comment-16187328 ] Hyukjin Kwon commented on SPARK-22063: -- The lint failures were fixed first in https://github.com/apache/spark/pull/19290; however, it is not actually upgraded to {{jimhester/lintr@5431140}} due to the concern of breaking other builds. Please see the discussion in the PR if anyone is interested in this. This is not yet fully solved. > Upgrade lintr to latest commit sha1 ID > -- > > Key: SPARK-22063 > URL: https://issues.apache.org/jira/browse/SPARK-22063 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this > pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026]) > and SPARK-14074. > Today, I tried to upgrade the latest, > https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72 > This fixes many bugs and now finds many instances that I have observed and > thought should be caught time to time: > {code} > inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis > in a function call. > return (output) > ^ > R/column.R:241:1: style: Lines should not be more than 100 characters. > #' > \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{ > ^~~~ > R/context.R:332:1: style: Variable and function names should not be longer > than 30 characters. > spark.getSparkFilesRootDirectory <- function() { > ^~~~ > R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters. > #' @param j,select expression for the single Column or a list of columns to > select from the SparkDataFrame. > ^~~ > R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters. 
> #' @return A new SparkDataFrame containing only the rows that meet the > condition with selected columns. > ^~~ > R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a > function call. > return (joinRes) > ^ > R/DataFrame.R:2652:1: style: Variable and function names should not be longer > than 30 characters. > generateAliasesForIntersectedCols <- function (x, intersectedColNames, > suffix) { > ^ > R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a > function call. > generateAliasesForIntersectedCols <- function (x, intersectedColNames, > suffix) { > ^ > R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a > function call. > stop ("The following column name: ", newJoin, " occurs more than once > in the 'DataFrame'.", > ^ > R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters. > #' @note The statistics provided by \code{summary} were change in 2.3.0 use > \link{describe} for previous defaults. > ^~ > R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters. > #' If grouping expression is missing \code{cube} creates a single global > aggregate and is equivalent to > ^~~ > R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters. > #' If grouping expression is missing \code{rollup} creates a single global > aggregate and is equivalent to > ^ > R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a > function call. > switch (type, > ^ > R/functions.R:41:1: style: Lines should not be more than 100 characters. > #' @param x Column to compute on. In \code{window}, it must be a time Column > of \code{TimestampType}. > ^ > R/functions.R:93:1: style: Lines should not be more than 100 characters. > #' @param x Column to compute on. In \code{shiftLeft}, \code{shiftRight} and > \code{shiftRightUnsigned}, > ^~~~
[jira] [Resolved] (SPARK-22177) Error running ml_ops.sh(SPOT): Can not create a Path from an empty string
[ https://issues.apache.org/jira/browse/SPARK-22177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22177. --- Resolution: Invalid This is not a Spark issue, but something to do with Spot. Somewhere you're feeding an empty path to some argument > Error running ml_ops.sh(SPOT): Can not create a Path from an empty string > - > > Key: SPARK-22177 > URL: https://issues.apache.org/jira/browse/SPARK-22177 > Project: Spark > Issue Type: Question > Components: ML, Spark Submit, YARN >Affects Versions: 2.2.0 > Environment: CentOS 7 1708 > Hadoop 2.6.0 > Scala 2.11.8 > SPOT 1.0 >Reporter: Jorge Pizarro >Priority: Minor > Labels: newbie > > Error message running "./ml_ops.sh 20170922 dns 1e-4". > Complete error message: > [soluser@master spot-ml]$ bash -x ./ml_ops.sh 20170922 dns 1e-4 > + FDATE=20170922 > + DSOURCE=dns > + YR=2017 > + MH=09 > + DY=22 > + [[ 8 != \8 ]] > + [[ -z dns ]] > + source /etc/spot.conf > ++ UINODE=master > ++ MLNODE=master > ++ GWNODE=master > ++ DBNAME=spotdb > ++ HUSER=/user/soluser > ++ NAME_NODE=master > ++ WEB_PORT=50070 > ++ DNS_PATH=/user/soluser/dns/hive/y=2017/m=09/d=22/ > ++ PROXY_PATH=/user/soluser/dns/hive/y=2017/m=09/d=22/ > ++ FLOW_PATH=/user/soluser/dns/hive/y=2017/m=09/d=22/ > ++ HPATH=/user/soluser/dns/scored_results/20170922 > ++ IMPALA_DEM=master > ++ IMPALA_PORT=21050 > ++ LUSER=/home/soluser > ++ LPATH=/home/soluser/ml/dns/20170922 > ++ RPATH=/home/soluser/ipython/user/20170922 > ++ LIPATH=/home/soluser/ingest > ++ USER_DOMAIN=neosecure > ++ SPK_EXEC=1 > ++ SPK_EXEC_MEM=1g > ++ SPK_DRIVER_MEM=1g > ++ SPK_DRIVER_MAX_RESULTS=200m > ++ SPK_EXEC_CORES=2 > ++ SPK_DRIVER_MEM_OVERHEAD=100m > ++ SPK_EXEC_MEM_OVERHEAD=100m > ++ SPK_AUTO_BRDCST_JOIN_THR=10485760 > ++ LDA_OPTIMIZER=em > ++ LDA_ALPHA=1.02 > ++ LDA_BETA=1.001 > ++ PRECISION=64 > ++ TOL=1e-6 > ++ TOPIC_COUNT=20 > ++ DUPFACTOR=1000 > + '[' -n 1e-4 ']' > + TOL=1e-4 > + '[' -n '' ']' > + MAXRESULTS=-1 > + '[' dns == flow ']' > + '[' dns 
== dns ']' > + RAWDATA_PATH=/user/soluser/dns/hive/y=2017/m=09/d=22/ > + '[' '!' -z neosecure ']' > + USER_DOMAIN_CMD='--userdomain neosecure' > + > FEEDBACK_PATH=/user/soluser/dns/scored_results/20170922/feedback/ml_feedback.csv > + HDFS_SCORED_CONNECTS=/user/soluser/dns/scored_results/20170922/scores > + hdfs dfs -rm -R -f /user/soluser/dns/scored_results/20170922/scores > + spark-submit --class org.apache.spot.SuspiciousConnects --master yarn > --deploy-mode cluster --driver-memory 1g --conf > spark.driver.maxResultSize=200m --conf spark.driver.maxPermSize=512m --conf > spark.dynamicAllocation.enabled=true --conf > spark.dynamicAllocation.maxExecutors=1 --conf spark.executor.cores=2 --conf > spark.executor.memory=1g --conf spark.sql.autoBroadcastJoinThreshold=10485760 > --conf 'spark.executor.extraJavaOptions=-XX:MaxPermSize=512M > -XX:PermSize=512M' --conf spark.kryoserializer.buffer.max=512m --conf > spark.yarn.am.waitTime=100s --conf spark.yarn.am.memoryOverhead=100m --conf > spark.yarn.executor.memoryOverhead=100m > target/scala-2.11/spot-ml-assembly-1.1.jar --analysis dns --input > /user/soluser/dns/hive/y=2017/m=09/d=22/ --dupfactor 1000 --feedback > /user/soluser/dns/scored_results/20170922/feedback/ml_feedback.csv > --ldatopiccount 20 --scored /user/soluser/dns/scored_results/20170922/scores > --threshold 1e-4 --maxresults -1 --ldamaxiterations 20 --ldaalpha 1.02 > --ldabeta 1.001 --ldaoptimizer em --precision 64 --userdomain neosecure > 17/09/29 13:51:56 INFO client.RMProxy: Connecting to ResourceManager at > /0.0.0.0:8032 > 17/09/29 13:51:56 INFO yarn.Client: Requesting a new application from cluster > with 0 NodeManagers > 17/09/29 13:51:56 INFO yarn.Client: Verifying our application has not > requested more than the maximum memory capability of the cluster (8192 MB per > container) > 17/09/29 13:51:56 INFO yarn.Client: Will allocate AM container, with 1408 MB > memory including 384 MB overhead > 17/09/29 13:51:56 INFO yarn.Client: Setting up 
container launch context for > our AM > 17/09/29 13:51:56 INFO yarn.Client: Setting up the launch environment for our > AM container > 17/09/29 13:51:56 INFO yarn.Client: Preparing resources for our AM container > 17/09/29 13:51:57 INFO yarn.Client: Deleted staging directory > hdfs://master:9000/user/soluser/.sparkStaging/application_1506636890912_0058 > Exception in thread "main" java.lang.IllegalArgumentException: Can not create > a Path from an empty string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126) > at org.apache.hadoop.fs.Path.(Path.java:134) > at org.apache.hadoop.fs.Path.(Path.java:93) > at > org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:337) > at > org.apache.spark.deplo
[jira] [Assigned] (SPARK-21667) ConsoleSink should not fail streaming query with checkpointLocation option
[ https://issues.apache.org/jira/browse/SPARK-21667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21667: Assignee: Apache Spark > ConsoleSink should not fail streaming query with checkpointLocation option > -- > > Key: SPARK-21667 > URL: https://issues.apache.org/jira/browse/SPARK-21667 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Assignee: Apache Spark >Priority: Minor > > As agreed on the Spark users mailing list in the thread "\[SS] Console sink > not supporting recovering from checkpoint location? Why?" in which > [~marmbrus] said: > {quote} > I think there is really no good reason for this limitation. > {quote} > Using {{ConsoleSink}} should therefore not fail a streaming query when used > with {{checkpointLocation}} option. > {code} > // today's build from the master > scala> spark.version > res8: String = 2.3.0-SNAPSHOT > scala> val q = records. > | writeStream. > | format("console"). > | option("truncate", false). > | option("checkpointLocation", "/tmp/checkpoint"). // <-- > checkpoint directory > | trigger(Trigger.ProcessingTime(10.seconds)). > | outputMode(OutputMode.Update). > | start > org.apache.spark.sql.AnalysisException: This query does not support > recovering from checkpoint location. Delete /tmp/checkpoint/offsets to start > over.; > at > org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:284) > ... 61 elided > {code} > The "trigger" is SPARK-16116 and [this > line|https://github.com/apache/spark/pull/13817/files#diff-d35e8fce09686073f81de598ed657de7R277] > in particular. > This also relates to SPARK-19768 that was resolved as not a bug. 
[jira] [Commented] (SPARK-21667) ConsoleSink should not fail streaming query with checkpointLocation option
[ https://issues.apache.org/jira/browse/SPARK-21667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187303#comment-16187303 ] Apache Spark commented on SPARK-21667: -- User 'rekhajoshm' has created a pull request for this issue: https://github.com/apache/spark/pull/19407 > ConsoleSink should not fail streaming query with checkpointLocation option > -- > > Key: SPARK-21667 > URL: https://issues.apache.org/jira/browse/SPARK-21667 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Minor > > As agreed on the Spark users mailing list in the thread "\[SS] Console sink > not supporting recovering from checkpoint location? Why?" in which > [~marmbrus] said: > {quote} > I think there is really no good reason for this limitation. > {quote} > Using {{ConsoleSink}} should therefore not fail a streaming query when used > with {{checkpointLocation}} option. > {code} > // today's build from the master > scala> spark.version > res8: String = 2.3.0-SNAPSHOT > scala> val q = records. > | writeStream. > | format("console"). > | option("truncate", false). > | option("checkpointLocation", "/tmp/checkpoint"). // <-- > checkpoint directory > | trigger(Trigger.ProcessingTime(10.seconds)). > | outputMode(OutputMode.Update). > | start > org.apache.spark.sql.AnalysisException: This query does not support > recovering from checkpoint location. Delete /tmp/checkpoint/offsets to start > over.; > at > org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:284) > ... 61 elided > {code} > The "trigger" is SPARK-16116 and [this > line|https://github.com/apache/spark/pull/13817/files#diff-d35e8fce09686073f81de598ed657de7R277] > in particular. 
> This also relates to SPARK-19768 that was resolved as not a bug.
[jira] [Assigned] (SPARK-21667) ConsoleSink should not fail streaming query with checkpointLocation option
[ https://issues.apache.org/jira/browse/SPARK-21667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21667: Assignee: (was: Apache Spark) > ConsoleSink should not fail streaming query with checkpointLocation option > -- > > Key: SPARK-21667 > URL: https://issues.apache.org/jira/browse/SPARK-21667 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Minor > > As agreed on the Spark users mailing list in the thread "\[SS] Console sink > not supporting recovering from checkpoint location? Why?" in which > [~marmbrus] said: > {quote} > I think there is really no good reason for this limitation. > {quote} > Using {{ConsoleSink}} should therefore not fail a streaming query when used > with {{checkpointLocation}} option. > {code} > // today's build from the master > scala> spark.version > res8: String = 2.3.0-SNAPSHOT > scala> val q = records. > | writeStream. > | format("console"). > | option("truncate", false). > | option("checkpointLocation", "/tmp/checkpoint"). // <-- > checkpoint directory > | trigger(Trigger.ProcessingTime(10.seconds)). > | outputMode(OutputMode.Update). > | start > org.apache.spark.sql.AnalysisException: This query does not support > recovering from checkpoint location. Delete /tmp/checkpoint/offsets to start > over.; > at > org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:284) > ... 61 elided > {code} > The "trigger" is SPARK-16116 and [this > line|https://github.com/apache/spark/pull/13817/files#diff-d35e8fce09686073f81de598ed657de7R277] > in particular. > This also relates to SPARK-19768 that was resolved as not a bug. 