[jira] [Comment Edited] (SPARK-22967) VersionSuite failed on Windows caused by unescapeSQLString()
[ https://issues.apache.org/jira/browse/SPARK-22967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312579#comment-16312579 ] wuyi edited comment on SPARK-22967 at 1/5/18 6:51 AM: -- @[~srowen] was (Author: ngone51): @[~srowen] > VersionSuite failed on Windows caused by unescapeSQLString() > > > Key: SPARK-22967 > URL: https://issues.apache.org/jira/browse/SPARK-22967 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Windos7 >Reporter: wuyi >Priority: Minor > Labels: build, test, windows > > On Windows system, two unit test case would fail while running VersionSuite > ("A simple set of tests that call the methods of a `HiveClient`, loading > different version of hive from maven central.") > Failed A : test(s"$version: read avro file containing decimal") > {code:java} > org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:java.lang.IllegalArgumentException: Can not create a > Path from an empty string); > {code} > Failed B: test(s"$version: SPARK-17920: Insert into/overwrite avro table") > {code:java} > Unable to infer the schema. The schema specification is required to create > the table `default`.`tab2`.; > org.apache.spark.sql.AnalysisException: Unable to infer the schema. The > schema specification is required to create the table `default`.`tab2`.; > {code} > As I deep into this problem, I found it is related to > ParserUtils#unescapeSQLString(). > These are two lines at the beginning of Failed A: > {code:java} > val url = > Thread.currentThread().getContextClassLoader.getResource("avroDecimal") > val location = new File(url.getFile) > {code} > And in my environment,`location` (path value) is > {code:java} > D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal > {code} > And then, in SparkSqlParser#visitCreateHiveTable()#L1128: > {code:java} > val location = Option(ctx.locationSpec).map(visitLocationSpec) > {code} > This line want to get LocationSepcContext's content first, which is equal to > `location` above. > Then, the content is passed to visitLocationSpec(), and passed to > unescapeSQLString() > finally. > Lets' have a look at unescapeSQLString(): > {code:java} > /** Unescape baskslash-escaped string enclosed by quotes. */ > def unescapeSQLString(b: String): String = { > var enclosure: Character = null > val sb = new StringBuilder(b.length()) > def appendEscapedChar(n: Char) { > n match { > case '0' => sb.append('\u') > case '\'' => sb.append('\'') > case '"' => sb.append('\"') > case 'b' => sb.append('\b') > case 'n' => sb.append('\n') > case 'r' => sb.append('\r') > case 't' => sb.append('\t') > case 'Z' => sb.append('\u001A') > case '\\' => sb.append('\\') > // The following 2 lines are exactly what MySQL does TODO: why do we > do this? > case '%' => sb.append("\\%") > case '_' => sb.append("\\_") > case _ => sb.append(n) > } > } > var i = 0 > val strLength = b.length > while (i < strLength) { > val currentChar = b.charAt(i) > if (enclosure == null) { > if (currentChar == '\'' || currentChar == '\"') { > enclosure = currentChar > } > } else if (enclosure == currentChar) { > enclosure = null > } else if (currentChar == '\\') { > if ((i + 6 < strLength) && b.charAt(i + 1) == 'u') { > // \u style character literals. 
> val base = i + 2 > val code = (0 until 4).foldLeft(0) { (mid, j) => > val digit = Character.digit(b.charAt(j + base), 16) > (mid << 4) + digit > } > sb.append(code.asInstanceOf[Char]) > i += 5 > } else if (i + 4 < strLength) { > // \000 style character literals. > val i1 = b.charAt(i + 1) > val i2 = b.charAt(i + 2) > val i3 = b.charAt(i + 3) > if ((i1 >= '0' && i1 <= '1') && (i2 >= '0' && i2 <= '7') && (i3 >= > '0' && i3 <= '7')) { > val tmp = ((i3 - '0') + ((i2 - '0') << 3) + ((i1 - '0') << > 6)).asInstanceOf[Char] > sb.append(tmp) > i += 3 > } else { > appendEscapedChar(i1) > i += 1 > } > } else if (i + 2 < strLength) { > // escaped character literals. > val n = b.charAt(i + 1) > appendEscapedChar(n) > i += 1 > } > } else { > // non-escaped character literals. > sb.append(currentChar) > } > i += 1 > } > sb.toString() > } > {code} > Again, here, variable `b` is equal to
[jira] [Commented] (SPARK-22967) VersionSuite failed on Windows caused by unescapeSQLString()
[ https://issues.apache.org/jira/browse/SPARK-22967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312579#comment-16312579 ] wuyi commented on SPARK-22967: -- [~srowen] > VersionSuite failed on Windows caused by unescapeSQLString() > > > Key: SPARK-22967 > URL: https://issues.apache.org/jira/browse/SPARK-22967 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Windos7 >Reporter: wuyi >Priority: Minor > Labels: build, test, windows > > On Windows system, two unit test case would fail while running VersionSuite > ("A simple set of tests that call the methods of a `HiveClient`, loading > different version of hive from maven central.") > Failed A : test(s"$version: read avro file containing decimal") > {code:java} > org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:java.lang.IllegalArgumentException: Can not create a > Path from an empty string); > {code} > Failed B: test(s"$version: SPARK-17920: Insert into/overwrite avro table") > {code:java} > Unable to infer the schema. The schema specification is required to create > the table `default`.`tab2`.; > org.apache.spark.sql.AnalysisException: Unable to infer the schema. The > schema specification is required to create the table `default`.`tab2`.; > {code} > As I deep into this problem, I found it is related to > ParserUtils#unescapeSQLString(). > These are two lines at the beginning of Failed A: > {code:java} > val url = > Thread.currentThread().getContextClassLoader.getResource("avroDecimal") > val location = new File(url.getFile) > {code} > And in my environment,`location` (path value) is > {code:java} > D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal > {code} > And then, in SparkSqlParser#visitCreateHiveTable()#L1128: > {code:java} > val location = Option(ctx.locationSpec).map(visitLocationSpec) > {code} > This line want to get LocationSepcContext's content first, which is equal to > `location` above. > Then, the content is passed to visitLocationSpec(), and passed to > unescapeSQLString() > finally. > Lets' have a look at unescapeSQLString(): > {code:java} > /** Unescape baskslash-escaped string enclosed by quotes. */ > def unescapeSQLString(b: String): String = { > var enclosure: Character = null > val sb = new StringBuilder(b.length()) > def appendEscapedChar(n: Char) { > n match { > case '0' => sb.append('\u') > case '\'' => sb.append('\'') > case '"' => sb.append('\"') > case 'b' => sb.append('\b') > case 'n' => sb.append('\n') > case 'r' => sb.append('\r') > case 't' => sb.append('\t') > case 'Z' => sb.append('\u001A') > case '\\' => sb.append('\\') > // The following 2 lines are exactly what MySQL does TODO: why do we > do this? > case '%' => sb.append("\\%") > case '_' => sb.append("\\_") > case _ => sb.append(n) > } > } > var i = 0 > val strLength = b.length > while (i < strLength) { > val currentChar = b.charAt(i) > if (enclosure == null) { > if (currentChar == '\'' || currentChar == '\"') { > enclosure = currentChar > } > } else if (enclosure == currentChar) { > enclosure = null > } else if (currentChar == '\\') { > if ((i + 6 < strLength) && b.charAt(i + 1) == 'u') { > // \u style character literals. > val base = i + 2 > val code = (0 until 4).foldLeft(0) { (mid, j) => > val digit = Character.digit(b.charAt(j + base), 16) > (mid << 4) + digit > } > sb.append(code.asInstanceOf[Char]) > i += 5 > } else if (i + 4 < strLength) { > // \000 style character literals. 
> val i1 = b.charAt(i + 1) > val i2 = b.charAt(i + 2) > val i3 = b.charAt(i + 3) > if ((i1 >= '0' && i1 <= '1') && (i2 >= '0' && i2 <= '7') && (i3 >= > '0' && i3 <= '7')) { > val tmp = ((i3 - '0') + ((i2 - '0') << 3) + ((i1 - '0') << > 6)).asInstanceOf[Char] > sb.append(tmp) > i += 3 > } else { > appendEscapedChar(i1) > i += 1 > } > } else if (i + 2 < strLength) { > // escaped character literals. > val n = b.charAt(i + 1) > appendEscapedChar(n) > i += 1 > } > } else { > // non-escaped character literals. > sb.append(currentChar) > } > i += 1 > } > sb.toString() > } > {code} > Again, here, variable `b` is equal to content and `location`, is valued of > {code:java} > D:\workspace\IdeaProjects\spar
[jira] [Comment Edited] (SPARK-22967) VersionSuite failed on Windows caused by unescapeSQLString()
[ https://issues.apache.org/jira/browse/SPARK-22967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312579#comment-16312579 ] wuyi edited comment on SPARK-22967 at 1/5/18 6:51 AM: -- @[~srowen] was (Author: ngone51): [~srowen] > VersionSuite failed on Windows caused by unescapeSQLString() > > > Key: SPARK-22967 > URL: https://issues.apache.org/jira/browse/SPARK-22967 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Windos7 >Reporter: wuyi >Priority: Minor > Labels: build, test, windows > > On Windows system, two unit test case would fail while running VersionSuite > ("A simple set of tests that call the methods of a `HiveClient`, loading > different version of hive from maven central.") > Failed A : test(s"$version: read avro file containing decimal") > {code:java} > org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:java.lang.IllegalArgumentException: Can not create a > Path from an empty string); > {code} > Failed B: test(s"$version: SPARK-17920: Insert into/overwrite avro table") > {code:java} > Unable to infer the schema. The schema specification is required to create > the table `default`.`tab2`.; > org.apache.spark.sql.AnalysisException: Unable to infer the schema. The > schema specification is required to create the table `default`.`tab2`.; > {code} > As I deep into this problem, I found it is related to > ParserUtils#unescapeSQLString(). > These are two lines at the beginning of Failed A: > {code:java} > val url = > Thread.currentThread().getContextClassLoader.getResource("avroDecimal") > val location = new File(url.getFile) > {code} > And in my environment,`location` (path value) is > {code:java} > D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal > {code} > And then, in SparkSqlParser#visitCreateHiveTable()#L1128: > {code:java} > val location = Option(ctx.locationSpec).map(visitLocationSpec) > {code} > This line want to get LocationSepcContext's content first, which is equal to > `location` above. > Then, the content is passed to visitLocationSpec(), and passed to > unescapeSQLString() > finally. > Lets' have a look at unescapeSQLString(): > {code:java} > /** Unescape baskslash-escaped string enclosed by quotes. */ > def unescapeSQLString(b: String): String = { > var enclosure: Character = null > val sb = new StringBuilder(b.length()) > def appendEscapedChar(n: Char) { > n match { > case '0' => sb.append('\u') > case '\'' => sb.append('\'') > case '"' => sb.append('\"') > case 'b' => sb.append('\b') > case 'n' => sb.append('\n') > case 'r' => sb.append('\r') > case 't' => sb.append('\t') > case 'Z' => sb.append('\u001A') > case '\\' => sb.append('\\') > // The following 2 lines are exactly what MySQL does TODO: why do we > do this? > case '%' => sb.append("\\%") > case '_' => sb.append("\\_") > case _ => sb.append(n) > } > } > var i = 0 > val strLength = b.length > while (i < strLength) { > val currentChar = b.charAt(i) > if (enclosure == null) { > if (currentChar == '\'' || currentChar == '\"') { > enclosure = currentChar > } > } else if (enclosure == currentChar) { > enclosure = null > } else if (currentChar == '\\') { > if ((i + 6 < strLength) && b.charAt(i + 1) == 'u') { > // \u style character literals. 
> val base = i + 2 > val code = (0 until 4).foldLeft(0) { (mid, j) => > val digit = Character.digit(b.charAt(j + base), 16) > (mid << 4) + digit > } > sb.append(code.asInstanceOf[Char]) > i += 5 > } else if (i + 4 < strLength) { > // \000 style character literals. > val i1 = b.charAt(i + 1) > val i2 = b.charAt(i + 2) > val i3 = b.charAt(i + 3) > if ((i1 >= '0' && i1 <= '1') && (i2 >= '0' && i2 <= '7') && (i3 >= > '0' && i3 <= '7')) { > val tmp = ((i3 - '0') + ((i2 - '0') << 3) + ((i1 - '0') << > 6)).asInstanceOf[Char] > sb.append(tmp) > i += 3 > } else { > appendEscapedChar(i1) > i += 1 > } > } else if (i + 2 < strLength) { > // escaped character literals. > val n = b.charAt(i + 1) > appendEscapedChar(n) > i += 1 > } > } else { > // non-escaped character literals. > sb.append(currentChar) > } > i += 1 > } > sb.toString() > } > {code} > Again, here, variable `b` is equal to c
[jira] [Updated] (SPARK-22967) VersionSuite failed on Windows caused by unescapeSQLString()
[ https://issues.apache.org/jira/browse/SPARK-22967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-22967: - Shepherd: Sean Owen > VersionSuite failed on Windows caused by unescapeSQLString() > > > Key: SPARK-22967 > URL: https://issues.apache.org/jira/browse/SPARK-22967 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Windos7 >Reporter: wuyi >Priority: Minor > Labels: build, test, windows > > On Windows system, two unit test case would fail while running VersionSuite > ("A simple set of tests that call the methods of a `HiveClient`, loading > different version of hive from maven central.") > Failed A : test(s"$version: read avro file containing decimal") > {code:java} > org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:java.lang.IllegalArgumentException: Can not create a > Path from an empty string); > {code} > Failed B: test(s"$version: SPARK-17920: Insert into/overwrite avro table") > {code:java} > Unable to infer the schema. The schema specification is required to create > the table `default`.`tab2`.; > org.apache.spark.sql.AnalysisException: Unable to infer the schema. The > schema specification is required to create the table `default`.`tab2`.; > {code} > As I deep into this problem, I found it is related to > ParserUtils#unescapeSQLString(). > These are two lines at the beginning of Failed A: > {code:java} > val url = > Thread.currentThread().getContextClassLoader.getResource("avroDecimal") > val location = new File(url.getFile) > {code} > And in my environment,`location` (path value) is > {code:java} > D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal > {code} > And then, in SparkSqlParser#visitCreateHiveTable()#L1128: > {code:java} > val location = Option(ctx.locationSpec).map(visitLocationSpec) > {code} > This line want to get LocationSepcContext's content first, which is equal to > `location` above. > Then, the content is passed to visitLocationSpec(), and passed to > unescapeSQLString() > finally. > Lets' have a look at unescapeSQLString(): > {code:java} > /** Unescape baskslash-escaped string enclosed by quotes. */ > def unescapeSQLString(b: String): String = { > var enclosure: Character = null > val sb = new StringBuilder(b.length()) > def appendEscapedChar(n: Char) { > n match { > case '0' => sb.append('\u') > case '\'' => sb.append('\'') > case '"' => sb.append('\"') > case 'b' => sb.append('\b') > case 'n' => sb.append('\n') > case 'r' => sb.append('\r') > case 't' => sb.append('\t') > case 'Z' => sb.append('\u001A') > case '\\' => sb.append('\\') > // The following 2 lines are exactly what MySQL does TODO: why do we > do this? > case '%' => sb.append("\\%") > case '_' => sb.append("\\_") > case _ => sb.append(n) > } > } > var i = 0 > val strLength = b.length > while (i < strLength) { > val currentChar = b.charAt(i) > if (enclosure == null) { > if (currentChar == '\'' || currentChar == '\"') { > enclosure = currentChar > } > } else if (enclosure == currentChar) { > enclosure = null > } else if (currentChar == '\\') { > if ((i + 6 < strLength) && b.charAt(i + 1) == 'u') { > // \u style character literals. > val base = i + 2 > val code = (0 until 4).foldLeft(0) { (mid, j) => > val digit = Character.digit(b.charAt(j + base), 16) > (mid << 4) + digit > } > sb.append(code.asInstanceOf[Char]) > i += 5 > } else if (i + 4 < strLength) { > // \000 style character literals. 
> val i1 = b.charAt(i + 1) > val i2 = b.charAt(i + 2) > val i3 = b.charAt(i + 3) > if ((i1 >= '0' && i1 <= '1') && (i2 >= '0' && i2 <= '7') && (i3 >= > '0' && i3 <= '7')) { > val tmp = ((i3 - '0') + ((i2 - '0') << 3) + ((i1 - '0') << > 6)).asInstanceOf[Char] > sb.append(tmp) > i += 3 > } else { > appendEscapedChar(i1) > i += 1 > } > } else if (i + 2 < strLength) { > // escaped character literals. > val n = b.charAt(i + 1) > appendEscapedChar(n) > i += 1 > } > } else { > // non-escaped character literals. > sb.append(currentChar) > } > i += 1 > } > sb.toString() > } > {code} > Again, here, variable `b` is equal to content and `location`, is valued of > {code:java} > D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\av
[jira] [Resolved] (SPARK-22949) Reduce memory requirement for TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-22949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-22949. --- Resolution: Fixed Fix Version/s: 2.4.0 2.3.0 Target Version/s: 2.3.0, 2.4.0 Resolved by https://github.com/apache/spark/pull/20143 > Reduce memory requirement for TrainValidationSplit > -- > > Key: SPARK-22949 > URL: https://issues.apache.org/jira/browse/SPARK-22949 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Assignee: Bago Amirbekian >Priority: Critical > Fix For: 2.3.0, 2.4.0 > > > There was a fix in {{ CrossValidator }} to reduce memory usage on the driver, > the same patch to be applied to {{ TrainValidationSplit }}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22949) Reduce memory requirement for TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-22949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-22949: - Assignee: Bago Amirbekian > Reduce memory requirement for TrainValidationSplit > -- > > Key: SPARK-22949 > URL: https://issues.apache.org/jira/browse/SPARK-22949 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Assignee: Bago Amirbekian >Priority: Critical > > There was a fix in {{ CrossValidator }} to reduce memory usage on the driver, > the same patch to be applied to {{ TrainValidationSplit }}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22825) Incorrect results of Casting Array to String
[ https://issues.apache.org/jira/browse/SPARK-22825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22825. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20024 [https://github.com/apache/spark/pull/20024] > Incorrect results of Casting Array to String > > > Key: SPARK-22825 > URL: https://issues.apache.org/jira/browse/SPARK-22825 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Xiao Li >Assignee: Takeshi Yamamuro > Fix For: 2.3.0 > > > {code} > val df = > spark.range(10).select('id.cast("integer")).agg(collect_list('id).as('ids)) > df.write.saveAsTable("t") > sql("SELECT cast(ids as String) FROM t").show(false) > {code} > The output is like > {code} > +--+ > |ids | > +--+ > |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@8bc285df| > +--+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
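For context, a hedged workaround sketch for versions that do not carry the fix (it assumes a SparkSession named `spark` and the table `t` created in the snippet above, and is not part of the ticket's resolution): build the readable string explicitly instead of relying on CAST(array AS string).
{code:java}
import org.apache.spark.sql.functions.{col, concat_ws}

// Render the array column as text manually; concat_ws accepts an array<string>
// column, so the integer array is cast element-wise to string first.
val readable = spark.table("t")
  .select(concat_ws(", ", col("ids").cast("array<string>")).as("ids"))
readable.show(false)
{code}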
[jira] [Assigned] (SPARK-22825) Incorrect results of Casting Array to String
[ https://issues.apache.org/jira/browse/SPARK-22825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22825: --- Assignee: Takeshi Yamamuro > Incorrect results of Casting Array to String > > > Key: SPARK-22825 > URL: https://issues.apache.org/jira/browse/SPARK-22825 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Xiao Li >Assignee: Takeshi Yamamuro > Fix For: 2.3.0 > > > {code} > val df = > spark.range(10).select('id.cast("integer")).agg(collect_list('id).as('ids)) > df.write.saveAsTable("t") > sql("SELECT cast(ids as String) FROM t").show(false) > {code} > The output is like > {code} > +--+ > |ids | > +--+ > |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@8bc285df| > +--+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19547) KafkaUtil throw 'No current assignment for partition' Exception
[ https://issues.apache.org/jira/browse/SPARK-19547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312505#comment-16312505 ] Jepson commented on SPARK-19547: I have encounter this problem too. https://issues.apache.org/jira/browse/SPARK-22968 > KafkaUtil throw 'No current assignment for partition' Exception > --- > > Key: SPARK-19547 > URL: https://issues.apache.org/jira/browse/SPARK-19547 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 1.6.1 >Reporter: wuchang > > Below is my scala code to create spark kafka stream: > val kafkaParams = Map[String, Object]( > "bootstrap.servers" -> "server110:2181,server110:9092", > "zookeeper" -> "server110:2181", > "key.deserializer" -> classOf[StringDeserializer], > "value.deserializer" -> classOf[StringDeserializer], > "group.id" -> "example", > "auto.offset.reset" -> "latest", > "enable.auto.commit" -> (false: java.lang.Boolean) > ) > val topics = Array("ABTest") > val stream = KafkaUtils.createDirectStream[String, String]( > ssc, > PreferConsistent, > Subscribe[String, String](topics, kafkaParams) > ) > But after run for 10 hours, it throws exceptions: > 2017-02-10 10:56:20,000 INFO [JobGenerator] internals.ConsumerCoordinator: > Revoking previously assigned partitions [ABTest-0, ABTest-1] for group example > 2017-02-10 10:56:20,000 INFO [JobGenerator] internals.AbstractCoordinator: > (Re-)joining group example > 2017-02-10 10:56:20,011 INFO [JobGenerator] internals.AbstractCoordinator: > (Re-)joining group example > 2017-02-10 10:56:40,057 INFO [JobGenerator] internals.AbstractCoordinator: > Successfully joined group example with generation 5 > 2017-02-10 10:56:40,058 INFO [JobGenerator] internals.ConsumerCoordinator: > Setting newly assigned partitions [ABTest-1] for group example > 2017-02-10 10:56:40,080 ERROR [JobScheduler] scheduler.JobScheduler: Error > generating jobs for time 148669538 ms > java.lang.IllegalStateException: No current assignment for partition ABTest-0 > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:179) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:196) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248) > at > org.
[jira] [Commented] (SPARK-22805) Use aliases for StorageLevel in event logs
[ https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312499#comment-16312499 ] Imran Rashid commented on SPARK-22805: -- I'm leaning slightly against this, though could go either way. For 2.3+, the gains are pretty small, and it means an old history server can't read new logs (I know we don't guarantee that anyway, but might as well keep it if we can). For < 2.3, there would be notable improvements in log sizes, but I don't like the compatibility story. I don't think there are any explicit guarantees but seems pretty annoying to have a 2.2.1 SHS not read logs from spark 2.2.2. sorry [~lebedev], I appreciate the work you've put into this anyhow. > Use aliases for StorageLevel in event logs > -- > > Key: SPARK-22805 > URL: https://issues.apache.org/jira/browse/SPARK-22805 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.2, 2.2.1 >Reporter: Sergei Lebedev >Priority: Minor > > Fact 1: {{StorageLevel}} has a private constructor, therefore a list of > predefined levels is not extendable (by the users). > Fact 2: The format of event logs uses redundant representation for storage > levels > {code} > >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, > >>> "Replication": 1}') > 79 > >>> len('DISK_ONLY') > 9 > {code} > Fact 3: This leads to excessive log sizes for workloads with lots of > partitions, because every partition would have the storage level field which > is 60-70 bytes more than it should be. > Suggested quick win: use the names of the predefined levels to identify them > in the event log. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
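To make the suggestion concrete, here is a minimal sketch (an illustration only, not the proposed patch; `StorageLevelAlias` and its alias table are made up for this example) of how the writer side could emit the short names for the predefined levels and keep the verbose JSON form only for custom levels; the reverse mapping could reuse the existing StorageLevel.fromString.
{code:java}
import org.apache.spark.storage.StorageLevel

object StorageLevelAlias {
  // Only the predefined levels get a short alias.
  private val aliases: Map[StorageLevel, String] = Map(
    StorageLevel.NONE -> "NONE",
    StorageLevel.DISK_ONLY -> "DISK_ONLY",
    StorageLevel.MEMORY_ONLY -> "MEMORY_ONLY",
    StorageLevel.MEMORY_ONLY_SER -> "MEMORY_ONLY_SER",
    StorageLevel.MEMORY_AND_DISK -> "MEMORY_AND_DISK",
    StorageLevel.MEMORY_AND_DISK_SER -> "MEMORY_AND_DISK_SER")

  // Short alias when available; otherwise fall back to the verbose form used today.
  def toLogValue(level: StorageLevel): String =
    aliases.getOrElse(level,
      s"""{"Use Disk": ${level.useDisk}, "Use Memory": ${level.useMemory}, """ +
        s""""Deserialized": ${level.deserialized}, "Replication": ${level.replication}}""")
}
{code}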
[jira] [Updated] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jepson updated SPARK-22968: --- Description: *Kafka Broker:* {code:java} message.max.bytes : 2621440 {code} *Spark Streaming+Kafka Code:* {code:java} , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 1048576 , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6 , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3 , "heartbeat.interval.ms" -> (5000: java.lang.Integer) , "receive.buffer.bytes" -> (10485760: java.lang.Integer) {code} *Error message:* {code:java} 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, kssh-1] for group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined group use_a_separate_group_id_for_each_stream with generation 4 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for time 1515116907000 ms java.lang.IllegalStateException: No current assignment for partition kssh-2 at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) at org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) at org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197) at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) at scala.Option.orElse(Option.scala:289) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at 
scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183) at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 18/01/05 09:48:27 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0 18/01/05 09:48:27 INFO streaming.StreamingContext: Invoking stop(stopGracefully=false) from shutdown hook 18/01/05 09:48:27 INFO scheduler.ReceiverT
[jira] [Updated] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jepson updated SPARK-22968: --- Description: *Kafka Broker:* message.max.bytes : 2621440 *Spark Streaming+Kafka Code:* *, "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 1048576* , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6 , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3 , "heartbeat.interval.ms" -> (5000: java.lang.Integer) *, "receive.buffer.bytes" -> (10485760: java.lang.Integer)* *Error message:* {code:java} 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, kssh-1] for group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined group use_a_separate_group_id_for_each_stream with generation 4 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for time 1515116907000 ms java.lang.IllegalStateException: No current assignment for partition kssh-2 at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) at org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) at org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197) at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) at scala.Option.orElse(Option.scala:289) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at 
scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183) at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 18/01/05 09:48:27 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0 18/01/05 09:48:27 INFO streaming.StreamingContext: Invoking stop(stopGracefully=false) from shutdown hook 18/01/05 09:48:27 INFO scheduler.ReceiverTracker: ReceiverTracker stopped
[jira] [Updated] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jepson updated SPARK-22968: --- Description: CDH-KAFKA: message.max.bytes : 2621440 , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 1048576 , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6 , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3 , "heartbeat.interval.ms" -> (5000: java.lang.Integer) , "receive.buffer.bytes" -> (10485760: java.lang.Integer) *Error message:* {code:java} 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, kssh-1] for group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined group use_a_separate_group_id_for_each_stream with generation 4 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for time 1515116907000 ms java.lang.IllegalStateException: No current assignment for partition kssh-2 at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) at org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) at org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197) at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) at scala.Option.orElse(Option.scala:289) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183) at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 18/01/05 09:48:27 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0 18/01/05 09:48:27 INFO streaming.StreamingContext: Invoking stop(stopGracefully=false) from shutdown hook 18/01/05 09:48:27 INFO scheduler.ReceiverTracker: ReceiverTracker stopped 18/01/05 09:48:27 INFO scheduler.JobGen
[jira] [Closed] (SPARK-17282) Implement ALTER TABLE UPDATE STATISTICS SET
[ https://issues.apache.org/jira/browse/SPARK-17282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang closed SPARK-17282. > Implement ALTER TABLE UPDATE STATISTICS SET > --- > > Key: SPARK-17282 > URL: https://issues.apache.org/jira/browse/SPARK-17282 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > Users can change the statistics by the DDL statement: > {noformat} > ALTER TABLE UPDATE STATISTICS SET > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17282) Implement ALTER TABLE UPDATE STATISTICS SET
[ https://issues.apache.org/jira/browse/SPARK-17282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang resolved SPARK-17282. -- Resolution: Won't Fix > Implement ALTER TABLE UPDATE STATISTICS SET > --- > > Key: SPARK-17282 > URL: https://issues.apache.org/jira/browse/SPARK-17282 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > Users can change the statistics by the DDL statement: > {noformat} > ALTER TABLE UPDATE STATISTICS SET > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
Jepson created SPARK-22968: -- Summary: java.lang.IllegalStateException: No current assignment for partition kssh-2 Key: SPARK-22968 URL: https://issues.apache.org/jira/browse/SPARK-22968 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.1 Environment: Kafka: 0.10.0 (CDH5.12.0) Apache Spark 2.1.1 Spark streaming+Kafka Reporter: Jepson *Error message:* {code:java} 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, kssh-1] for group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined group use_a_separate_group_id_for_each_stream with generation 4 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for time 1515116907000 ms java.lang.IllegalStateException: No current assignment for partition kssh-2 at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) at org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) at org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197) at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) at scala.Option.orElse(Option.scala:289) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183) at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 18/01/05 09:48:27 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0 18/01/05 09:48:27 INFO streaming.StreamingContext: Invoking stop(stopGracefully=false) from shutdown hook 18/01/05 09:48:27 INFO scheduler.ReceiverTracker: ReceiverTracker stopped 18/01/05 09:48:27 INFO scheduler.JobGenerator: Stopping JobGenerator immediately 18/01/05 09:48:27 INFO util.RecurringTim
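As a side note (a hedged observation, not a confirmed diagnosis for this ticket): the group id in the log above is the placeholder from the Spark/Kafka integration guide, and DirectKafkaInputDStream expects to keep every partition it subscribed to, so another consumer joining the same group triggers a rebalance and the subsequent seekToEnd() hits partitions this stream no longer owns. One routine check is to give each stream and each application instance its own group.id; the values below are illustrative only and mirror the kafkaParams pattern quoted in the SPARK-19547 entry above.
{code:java}
import java.util.UUID
import org.apache.kafka.common.serialization.StringDeserializer

// Illustrative parameters; the only intentional change is the unique group.id.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092,broker2:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> s"kssh-stream-${UUID.randomUUID()}",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
{code}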
[jira] [Commented] (SPARK-17282) Implement ALTER TABLE UPDATE STATISTICS SET
[ https://issues.apache.org/jira/browse/SPARK-17282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312487#comment-16312487 ] Zhenhua Wang commented on SPARK-17282: -- I'm closing this for now. [~smilegator], [~hvanhovell], [~cloud_fan] If you think this functionality is necessary, we can reopen it later. > Implement ALTER TABLE UPDATE STATISTICS SET > --- > > Key: SPARK-17282 > URL: https://issues.apache.org/jira/browse/SPARK-17282 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > Users can change the statistics by the DDL statement: > {noformat} > ALTER TABLE UPDATE STATISTICS SET > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21209) Implement Incremental PCA algorithm for ML
[ https://issues.apache.org/jira/browse/SPARK-21209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312477#comment-16312477 ] Sandeep Kumar Choudhary commented on SPARK-21209: - I would like to work on it. I have used the IPCA implementation in Python. I am reading a few papers to figure out the best possible solution. I will get back to you on this in a few days. > Implement Incremental PCA algorithm for ML > -- > > Key: SPARK-21209 > URL: https://issues.apache.org/jira/browse/SPARK-21209 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.1 >Reporter: Ben St. Clair > Labels: features > > Incremental Principal Component Analysis is a method for calculating PCAs in > an incremental fashion, allowing one to update an existing PCA model as new > evidence arrives. Furthermore, an alpha parameter can be used to enable > task-specific weighting of new and old evidence. > This algorithm would be useful for streaming applications, where a fast and > adaptive feature subspace calculation could be applied. Furthermore, it can > be applied to combine PCAs from subcomponents of large datasets. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
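To make the discussion concrete, here is an assumption-laden sketch (using Breeze, which Spark ML already depends on; the update rule is a generic exponentially weighted mean/covariance estimate, not necessarily the algorithm the ticket or the papers above have in mind) of how an alpha-weighted incremental subspace estimate could be maintained and its principal components re-extracted on demand:
{code:java}
import breeze.linalg.{svd, DenseMatrix, DenseVector}

// Running estimate of the data mean and covariance; alpha in (0, 1] controls
// how quickly old evidence is forgotten.
case class IpcaState(mean: DenseVector[Double], cov: DenseMatrix[Double])

def update(state: IpcaState, x: DenseVector[Double], alpha: Double): IpcaState = {
  val delta = x - state.mean
  val newMean = state.mean + (delta * alpha)
  // Exponentially weighted covariance update around the running mean.
  val newCov = (state.cov + (delta * delta.t) * alpha) * (1.0 - alpha)
  IpcaState(newMean, newCov)
}

// Top-k principal components of the current covariance estimate.
def components(state: IpcaState, k: Int): DenseMatrix[Double] = {
  val svd.SVD(u, _, _) = svd(state.cov)
  u(::, 0 until k)
}
{code}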
[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312466#comment-16312466 ] Neil Alexander McQuarrie commented on SPARK-21727: -- I was able to get a version working by checking for length > 1, as well as whether the type is integer, character, logical, double, numeric, Date, POSIXlt, or POSIXct. [https://github.com/neilalex/spark/blob/neilalex-sparkr-arraytype/R/pkg/R/serialize.R#L39] The reason I'm checking the specific type is that it seems quite a few objects flow through getSerdeType() besides just the object we're converting -- for example, after calling 'as.DataFrame(myDf)' per above, I saw a jobj "Java ref type org.apache.spark.sql.SparkSession id 1" of length 2, as well as a raw object of length ~700, both pass through getSerdeType(). So, it seems we can't just check length? Also, in general this makes me a little nervous -- will there be scenarios when a 2+ length vector of one of the above types should also not be converted to array? I'll familiarize myself further with how getSerdeType() is called, to see if I can think this through. > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil Alexander McQuarrie >Assignee: Neil Alexander McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20))}} > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... 
> > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22967) VersionSuite failed on Windows caused by unescapeSQLString()
[ https://issues.apache.org/jira/browse/SPARK-22967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-22967: - Description: On Windows systems, two unit test cases fail while running VersionSuite ("A simple set of tests that call the methods of a `HiveClient`, loading different version of hive from maven central.")

Failed A: test(s"$version: read avro file containing decimal")
{code:java}
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
{code}
Failed B: test(s"$version: SPARK-17920: Insert into/overwrite avro table")
{code:java}
Unable to infer the schema. The schema specification is required to create the table `default`.`tab2`.;
org.apache.spark.sql.AnalysisException: Unable to infer the schema. The schema specification is required to create the table `default`.`tab2`.;
{code}
As I dug into this problem, I found it is related to ParserUtils#unescapeSQLString(). These are the two lines at the beginning of Failed A:
{code:java}
val url = Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
val location = new File(url.getFile)
{code}
And in my environment, `location` (the path value) is
{code:java}
D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal
{code}
And then, in SparkSqlParser#visitCreateHiveTable()#L1128:
{code:java}
val location = Option(ctx.locationSpec).map(visitLocationSpec)
{code}
This line first gets the LocationSpecContext's content, which is equal to `location` above. Then, the content is passed to visitLocationSpec(), and finally to unescapeSQLString(). Let's have a look at unescapeSQLString():
{code:java}
/** Unescape baskslash-escaped string enclosed by quotes. */
def unescapeSQLString(b: String): String = {
  var enclosure: Character = null
  val sb = new StringBuilder(b.length())

  def appendEscapedChar(n: Char) {
    n match {
      case '0' => sb.append('\u0000')
      case '\'' => sb.append('\'')
      case '"' => sb.append('\"')
      case 'b' => sb.append('\b')
      case 'n' => sb.append('\n')
      case 'r' => sb.append('\r')
      case 't' => sb.append('\t')
      case 'Z' => sb.append('\u001A')
      case '\\' => sb.append('\\')
      // The following 2 lines are exactly what MySQL does TODO: why do we do this?
      case '%' => sb.append("\\%")
      case '_' => sb.append("\\_")
      case _ => sb.append(n)
    }
  }

  var i = 0
  val strLength = b.length
  while (i < strLength) {
    val currentChar = b.charAt(i)
    if (enclosure == null) {
      if (currentChar == '\'' || currentChar == '\"') {
        enclosure = currentChar
      }
    } else if (enclosure == currentChar) {
      enclosure = null
    } else if (currentChar == '\\') {
      if ((i + 6 < strLength) && b.charAt(i + 1) == 'u') {
        // \u0000 style character literals.
        val base = i + 2
        val code = (0 until 4).foldLeft(0) { (mid, j) =>
          val digit = Character.digit(b.charAt(j + base), 16)
          (mid << 4) + digit
        }
        sb.append(code.asInstanceOf[Char])
        i += 5
      } else if (i + 4 < strLength) {
        // \000 style character literals.
        val i1 = b.charAt(i + 1)
        val i2 = b.charAt(i + 2)
        val i3 = b.charAt(i + 3)
        if ((i1 >= '0' && i1 <= '1') && (i2 >= '0' && i2 <= '7') && (i3 >= '0' && i3 <= '7')) {
          val tmp = ((i3 - '0') + ((i2 - '0') << 3) + ((i1 - '0') << 6)).asInstanceOf[Char]
          sb.append(tmp)
          i += 3
        } else {
          appendEscapedChar(i1)
          i += 1
        }
      } else if (i + 2 < strLength) {
        // escaped character literals.
        val n = b.charAt(i + 1)
        appendEscapedChar(n)
        i += 1
      }
    } else {
      // non-escaped character literals.
      sb.append(currentChar)
    }
    i += 1
  }
  sb.toString()
}
{code}
Again, here, variable `b` is equal to that content, i.e. `location`, whose value is
{code:java}
D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal
{code}
From unescapeSQLString()'s strategy we can see that it transforms the string "\t" into the escape character '\t' and removes all the other backslashes. So, our originally correct location turns into:
{code:java}
D:workspaceIdeaProjectssparksqlhive\targetscala-2.11\test-classesavroDecimal
{code}
after unescapeSQLString() completes. Note that, here, [ \t ] is no longer a string, but an escape character. Then, returning to SparkSqlParser#visitCreateHiveTable(), move to L1134:
{code:java}
val locUri = location.map(CatalogUtils.stringToURI(_))
{code}
`location` is passed to stringToURI(), and results in:
{code:java}
file:/D:workspaceIdeaProjectssparksqlhive%09arg
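A minimal, self-contained sketch of the mangling described above (illustrative only, not part of the original report): it feeds the quoted location literal through ParserUtils.unescapeSQLString() the same way the parser does, assuming spark-catalyst 2.2.x is on the classpath.
{code:java}
// Sketch: reproduce the path mangling outside VersionSuite.
// The parser hands unescapeSQLString() the still-quoted string literal from the
// LOCATION clause, so the quotes are included here on purpose.
import org.apache.spark.sql.catalyst.parser.ParserUtils

object WindowsPathManglingDemo {
  def main(args: Array[String]): Unit = {
    val quotedLocation =
      "'D:\\workspace\\IdeaProjects\\spark\\sql\\hive\\target\\scala-2.11\\test-classes\\avroDecimal'"
    // "\t" turns into a TAB and the other backslashes are dropped, so this prints roughly:
    // D:workspaceIdeaProjectssparksqlhive<TAB>argetscala-2.11<TAB>est-classesavroDecimal
    println(ParserUtils.unescapeSQLString(quotedLocation))
  }
}
{code}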
[jira] [Created] (SPARK-22967) VersionSuite failed on Windows caused by unescapeSQLString()
wuyi created SPARK-22967: Summary: VersionSuite failed on Windows caused by unescapeSQLString() Key: SPARK-22967 URL: https://issues.apache.org/jira/browse/SPARK-22967 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Environment: Windows7 Reporter: wuyi Priority: Minor

On Windows systems, two unit test cases fail while running VersionSuite ("A simple set of tests that call the methods of a `HiveClient`, loading different version of hive from maven central.")

Failed A: test(s"$version: read avro file containing decimal")
{code:java}
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string);
{code}
Failed B: test(s"$version: SPARK-17920: Insert into/overwrite avro table")
{code:java}
Unable to infer the schema. The schema specification is required to create the table `default`.`tab2`.;
org.apache.spark.sql.AnalysisException: Unable to infer the schema. The schema specification is required to create the table `default`.`tab2`.;
{code}
As I dug into this problem, I found it is related to ParserUtils#unescapeSQLString(). These are the two lines at the beginning of Failed A:
{code:java}
val url = Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
val location = new File(url.getFile)
{code}
And in my environment, `location` (the path value) is
{code:java}
D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal
{code}
And then, in SparkSqlParser#visitCreateHiveTable()#L1128:
{code:java}
val location = Option(ctx.locationSpec).map(visitLocationSpec)
{code}
This line first gets the LocationSpecContext's content, which is equal to `location` above. Then, the content is passed to visitLocationSpec(), and finally to unescapeSQLString(). Let's have a look at unescapeSQLString():
{code:java}
/** Unescape baskslash-escaped string enclosed by quotes. */
def unescapeSQLString(b: String): String = {
  var enclosure: Character = null
  val sb = new StringBuilder(b.length())

  def appendEscapedChar(n: Char) {
    n match {
      case '0' => sb.append('\u0000')
      case '\'' => sb.append('\'')
      case '"' => sb.append('\"')
      case 'b' => sb.append('\b')
      case 'n' => sb.append('\n')
      case 'r' => sb.append('\r')
      case 't' => sb.append('\t')
      case 'Z' => sb.append('\u001A')
      case '\\' => sb.append('\\')
      // The following 2 lines are exactly what MySQL does TODO: why do we do this?
      case '%' => sb.append("\\%")
      case '_' => sb.append("\\_")
      case _ => sb.append(n)
    }
  }

  var i = 0
  val strLength = b.length
  while (i < strLength) {
    val currentChar = b.charAt(i)
    if (enclosure == null) {
      if (currentChar == '\'' || currentChar == '\"') {
        enclosure = currentChar
      }
    } else if (enclosure == currentChar) {
      enclosure = null
    } else if (currentChar == '\\') {
      if ((i + 6 < strLength) && b.charAt(i + 1) == 'u') {
        // \u0000 style character literals.
        val base = i + 2
        val code = (0 until 4).foldLeft(0) { (mid, j) =>
          val digit = Character.digit(b.charAt(j + base), 16)
          (mid << 4) + digit
        }
        sb.append(code.asInstanceOf[Char])
        i += 5
      } else if (i + 4 < strLength) {
        // \000 style character literals.
        val i1 = b.charAt(i + 1)
        val i2 = b.charAt(i + 2)
        val i3 = b.charAt(i + 3)
        if ((i1 >= '0' && i1 <= '1') && (i2 >= '0' && i2 <= '7') && (i3 >= '0' && i3 <= '7')) {
          val tmp = ((i3 - '0') + ((i2 - '0') << 3) + ((i1 - '0') << 6)).asInstanceOf[Char]
          sb.append(tmp)
          i += 3
        } else {
          appendEscapedChar(i1)
          i += 1
        }
      } else if (i + 2 < strLength) {
        // escaped character literals.
        val n = b.charAt(i + 1)
        appendEscapedChar(n)
        i += 1
      }
    } else {
      // non-escaped character literals.
      sb.append(currentChar)
    }
    i += 1
  }
  sb.toString()
}
{code}
Again, here, variable `b` is equal to that content, i.e. `location`, whose value is
{code:java}
D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal
{code}
From unescapeSQLString()'s strategy we can see that it transforms the string "\t" into the escape character '\t' and removes all the other backslashes. So, our originally correct location turns into:
{code:java}
D:workspaceIdeaProjectssparksqlhive\targetscala-2.11\test-classesavroDecimal
{code}
after unescapeSQLString() completes. Then, returning to SparkSqlParser#visitCreateHiveTable(), move to L1134:
{code:java}
val locUr
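One possible way for the test to sidestep the mangling (a sketch only, not necessarily the fix that was eventually merged) is to hand Spark a file: URI rather than a raw Windows path, since the URI form contains no backslashes for unescapeSQLString() to eat. The snippet below assumes the avroDecimal resource is on the classpath, as in the suite.
{code:java}
import java.io.File

object LocationAsUriSketch {
  def main(args: Array[String]): Unit = {
    val url = Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
    val location = new File(url.getFile)
    // On Windows this prints something like file:/D:/workspace/.../avroDecimal,
    // which uses forward slashes and therefore survives unescaping unchanged.
    println(location.toURI.toString)
  }
}
{code}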
[jira] [Commented] (SPARK-7721) Generate test coverage report from Python
[ https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312398#comment-16312398 ] Reynold Xin commented on SPARK-7721: I think it's fine even if you don't preserve the history forever ... > Generate test coverage report from Python > - > > Key: SPARK-7721 > URL: https://issues.apache.org/jira/browse/SPARK-7721 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Reporter: Reynold Xin > > Would be great to have a test coverage report for Python. Compared with Scala, > it is trickier to understand the coverage without coverage reports in Python > because we employ both docstring tests and unit tests in test files. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22957) ApproxQuantile breaks if the number of rows exceeds MaxInt
[ https://issues.apache.org/jira/browse/SPARK-22957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22957. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20152 [https://github.com/apache/spark/pull/20152] > ApproxQuantile breaks if the number of rows exceeds MaxInt > -- > > Key: SPARK-22957 > URL: https://issues.apache.org/jira/browse/SPARK-22957 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski > Fix For: 2.3.0 > > > ApproxQuantile overflows when number of rows exceeds 2.147B (max int32). > If you run ApproxQuantile on a dataframe with 3B rows of 1 to 3B and ask it > for 1/6 quantiles, it should return [0.5B, 1B, 1.5B, 2B, 2.5B, 3B]. However, > in the [implementation of > ApproxQuantile|https://github.com/apache/spark/blob/v2.2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L195], > it calls .toInt on the target rank, which overflows at 2.147B. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
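A standalone sketch of the arithmetic behind this (not the QuantileSummaries code itself): converting a Double rank above Int.MaxValue with .toInt saturates, so large target ranks collapse to the same value, while keeping the rank as a Long preserves it.
{code:java}
object RankOverflowSketch {
  def main(args: Array[String]): Unit = {
    val rowCount = 3000000000L          // 3B rows
    val targetRank = rowCount * 5.0 / 6 // 2.5B, the true rank for the 5/6 quantile
    println(targetRank.toInt)           // 2147483647 -- saturated at Int.MaxValue, wrong
    println(targetRank.toLong)          // 2500000000 -- the rank we actually want
  }
}
{code}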
[jira] [Assigned] (SPARK-22957) ApproxQuantile breaks if the number of rows exceeds MaxInt
[ https://issues.apache.org/jira/browse/SPARK-22957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22957: --- Assignee: Juliusz Sompolski > ApproxQuantile breaks if the number of rows exceeds MaxInt > -- > > Key: SPARK-22957 > URL: https://issues.apache.org/jira/browse/SPARK-22957 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski > Fix For: 2.3.0 > > > ApproxQuantile overflows when number of rows exceeds 2.147B (max int32). > If you run ApproxQuantile on a dataframe with 3B rows of 1 to 3B and ask it > for 1/6 quantiles, it should return [0.5B, 1B, 1.5B, 2B, 2.5B, 3B]. However, > in the [implementation of > ApproxQuantile|https://github.com/apache/spark/blob/v2.2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L195], > it calls .toInt on the target rank, which overflows at 2.147B. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22632) Fix the behavior of timestamp values for R's DataFrame to respect session timezone
[ https://issues.apache.org/jira/browse/SPARK-22632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312356#comment-16312356 ] Hyukjin Kwon commented on SPARK-22632: -- To me, nope, I don't think so although it might be important to have. In case of PySpark <> Pandas related one, it was fixed with a configuration to control the behaviour. I was trying to take a look at that time but I am not sure if it's safe to have this at this stage and I can make it within 2.3.0 timeline too ... PySpark itself also still has the issue too. FYI. > Fix the behavior of timestamp values for R's DataFrame to respect session > timezone > -- > > Key: SPARK-22632 > URL: https://issues.apache.org/jira/browse/SPARK-22632 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > Note: wording is borrowed from SPARK-22395. Symptom is similar and I think > that JIRA is well descriptive. > When converting R's DataFrame from/to Spark DataFrame using > {{createDataFrame}} or {{collect}}, timestamp values behave to respect R > system timezone instead of session timezone. > For example, let's say we use "America/Los_Angeles" as session timezone and > have a timestamp value "1970-01-01 00:00:01" in the timezone. Btw, I'm in > South Korea so R timezone would be "KST". > The timestamp value from current collect() will be the following: > {code} > > sparkR.session(master = "local[*]", sparkConfig = > > list(spark.sql.session.timeZone = "America/Los_Angeles")) > > collect(sql("SELECT cast(cast(28801 as timestamp) as string) as ts")) >ts > 1 1970-01-01 00:00:01 > > collect(sql("SELECT cast(28801 as timestamp) as ts")) >ts > 1 1970-01-01 17:00:01 > {code} > As you can see, the value becomes "1970-01-01 17:00:01" because it respects R > system timezone. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
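For comparison, here is an illustrative Scala analogue of the R reproduction above (a sketch under the assumption that the JVM's default time zone differs from the session time zone, e.g. KST): the string cast is rendered with spark.sql.session.timeZone, while the collected java.sql.Timestamp is rendered with the JVM default zone.
{code:java}
import org.apache.spark.sql.SparkSession

object SessionTimeZoneSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.session.timeZone", "America/Los_Angeles")
      .getOrCreate()

    // Rendered with the session time zone: 1970-01-01 00:00:01
    spark.sql("SELECT cast(cast(28801 as timestamp) as string) AS ts").show(false)

    // java.sql.Timestamp.toString uses the JVM default time zone,
    // e.g. 1970-01-01 17:00:01.0 when the JVM runs in KST.
    println(spark.sql("SELECT cast(28801 as timestamp) AS ts").first().getTimestamp(0))
  }
}
{code}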
[jira] [Commented] (SPARK-7721) Generate test coverage report from Python
[ https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312331#comment-16312331 ] Hyukjin Kwon commented on SPARK-7721: - I (and possibly a few committers given [the comment above|https://issues.apache.org/jira/browse/SPARK-7721?focusedCommentId=14551198&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14551198]) would run this though .. but yes, sure, it should become actually powerful when we can run it automatically. If we are all fine with having a single up-to-date coverage site ("Simplest one") for now, it's pretty easy and possible. It's just what I have done so far here - https://spark-test.github.io/pyspark-coverage-site and the only thing I should do is to make this automatic, clone the latest commit bit and push it. I know it's better to keep the history of coverages and leave the link in each PR ("Another one") and in this case we should consider how to keep the history of the coverages, etc. This is where I should investigate more and verify the idea. Will anyway test and investigate the integration more and try the "Another one" way too. If I fail, I think we can fall back to "Simplest one" for now. Does this sound good to you? In this way, I think I can make sure we can run this automatically eventually. BTW, can you take a look at https://github.com/apache/spark/pull/20151 too? This way we can make the changes separate for Coverage only and I am trying to isolate such logics as much as we can in case we can bring a better idea in the future. > Generate test coverage report from Python > - > > Key: SPARK-7721 > URL: https://issues.apache.org/jira/browse/SPARK-7721 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Reporter: Reynold Xin > > Would be great to have a test coverage report for Python. Compared with Scala, > it is trickier to understand the coverage without coverage reports in Python > because we employ both docstring tests and unit tests in test files. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22966) Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime
[ https://issues.apache.org/jira/browse/SPARK-22966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22966: Assignee: Apache Spark > Spark SQL should handle Python UDFs that return a datetime.date or > datetime.datetime > > > Key: SPARK-22966 > URL: https://issues.apache.org/jira/browse/SPARK-22966 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Kris Mok >Assignee: Apache Spark > > Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which > should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} > (which should correspond to a Spark SQL {{timestamp}} type), it gets > unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand > internally, and will thus give incorrect results. > e.g. > {code:none} > >>> import datetime > >>> from pyspark.sql import * > >>> py_date = udf(datetime.date) > >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == > >>> lit(datetime.date(2017, 10, 30))).show() > ++ > |(date(2017, 10, 30) = DATE '2017-10-30')| > ++ > | false| > ++ > {code} > (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, > 'date')}} doesn't work either) > We should correctly handle Python UDFs that return objects of such types. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22966) Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime
[ https://issues.apache.org/jira/browse/SPARK-22966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312312#comment-16312312 ] Apache Spark commented on SPARK-22966: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/20163 > Spark SQL should handle Python UDFs that return a datetime.date or > datetime.datetime > > > Key: SPARK-22966 > URL: https://issues.apache.org/jira/browse/SPARK-22966 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Kris Mok > > Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which > should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} > (which should correspond to a Spark SQL {{timestamp}} type), it gets > unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand > internally, and will thus give incorrect results. > e.g. > {code:none} > >>> import datetime > >>> from pyspark.sql import * > >>> py_date = udf(datetime.date) > >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == > >>> lit(datetime.date(2017, 10, 30))).show() > ++ > |(date(2017, 10, 30) = DATE '2017-10-30')| > ++ > | false| > ++ > {code} > (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, > 'date')}} doesn't work either) > We should correctly handle Python UDFs that return objects of such types. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22966) Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime
[ https://issues.apache.org/jira/browse/SPARK-22966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22966: Assignee: (was: Apache Spark) > Spark SQL should handle Python UDFs that return a datetime.date or > datetime.datetime > > > Key: SPARK-22966 > URL: https://issues.apache.org/jira/browse/SPARK-22966 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Kris Mok > > Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which > should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} > (which should correspond to a Spark SQL {{timestamp}} type), it gets > unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand > internally, and will thus give incorrect results. > e.g. > {code:none} > >>> import datetime > >>> from pyspark.sql import * > >>> py_date = udf(datetime.date) > >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == > >>> lit(datetime.date(2017, 10, 30))).show() > ++ > |(date(2017, 10, 30) = DATE '2017-10-30')| > ++ > | false| > ++ > {code} > (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, > 'date')}} doesn't work either) > We should correctly handle Python UDFs that return objects of such types. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22966) Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime
Kris Mok created SPARK-22966: Summary: Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime Key: SPARK-22966 URL: https://issues.apache.org/jira/browse/SPARK-22966 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.1, 2.2.0 Reporter: Kris Mok Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} (which should correspond to a Spark SQL {{timestamp}} type), it gets unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand internally, and will thus give incorrect results. e.g. {code:python} >>> import datetime >>> from pyspark.sql import * >>> py_date = udf(datetime.date) >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == >>> lit(datetime.date(2017, 10, 30))).show() ++ |(date(2017, 10, 30) = DATE '2017-10-30')| ++ | false| ++ {code} (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 'date')}} doesn't work either) We should correctly handle Python UDFs that return objects of such types. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22966) Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime
[ https://issues.apache.org/jira/browse/SPARK-22966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kris Mok updated SPARK-22966: - Description: Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} (which should correspond to a Spark SQL {{timestamp}} type), it gets unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand internally, and will thus give incorrect results. e.g. {code:none} >>> import datetime >>> from pyspark.sql import * >>> py_date = udf(datetime.date) >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == >>> lit(datetime.date(2017, 10, 30))).show() ++ |(date(2017, 10, 30) = DATE '2017-10-30')| ++ | false| ++ {code} (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 'date')}} doesn't work either) We should correctly handle Python UDFs that return objects of such types. was: Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} (which should correspond to a Spark SQL {{timestamp}} type), it gets unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand internally, and will thus give incorrect results. e.g. {code:python} >>> import datetime >>> from pyspark.sql import * >>> py_date = udf(datetime.date) >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == >>> lit(datetime.date(2017, 10, 30))).show() ++ |(date(2017, 10, 30) = DATE '2017-10-30')| ++ | false| ++ {code} (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 'date')}} doesn't work either) We should correctly handle Python UDFs that return objects of such types. > Spark SQL should handle Python UDFs that return a datetime.date or > datetime.datetime > > > Key: SPARK-22966 > URL: https://issues.apache.org/jira/browse/SPARK-22966 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Kris Mok > > Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which > should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} > (which should correspond to a Spark SQL {{timestamp}} type), it gets > unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand > internally, and will thus give incorrect results. > e.g. > {code:none} > >>> import datetime > >>> from pyspark.sql import * > >>> py_date = udf(datetime.date) > >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == > >>> lit(datetime.date(2017, 10, 30))).show() > ++ > |(date(2017, 10, 30) = DATE '2017-10-30')| > ++ > | false| > ++ > {code} > (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, > 'date')}} doesn't work either) > We should correctly handle Python UDFs that return objects of such types. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21972: -- Target Version/s: 2.4.0 (was: 2.3.0) > Allow users to control input data persistence in ML Estimators via a > handlePersistence ml.Param > --- > > Key: SPARK-21972 > URL: https://issues.apache.org/jira/browse/SPARK-21972 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Siddharth Murching > > Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, > etc) call {{cache()}} on uncached input datasets to improve performance. > Unfortunately, these algorithms a) check input persistence inaccurately > ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) > check the persistence level of the input dataset but not any of its parents. > These issues can result in unwanted double-caching of input data & degraded > performance (see > [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). > This ticket proposes adding a boolean {{handlePersistence}} param > (org.apache.spark.ml.param) so that users can specify whether an ML algorithm > should try to cache un-cached input data. {{handlePersistence}} will be > {{true}} by default, corresponding to existing behavior (always persisting > uncached input), but users can achieve finer-grained control over input > persistence by setting {{handlePersistence}} to {{false}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22054) Allow release managers to inject their keys
[ https://issues.apache.org/jira/browse/SPARK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-22054: --- Target Version/s: 2.4.0 (was: 2.3.0) > Allow release managers to inject their keys > --- > > Key: SPARK-22054 > URL: https://issues.apache.org/jira/browse/SPARK-22054 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Blocker > > Right now the current release process signs with Patrick's keys, let's update > the scripts to allow the release manager to sign the release as part of the > job. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22055) Port release scripts
[ https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312292#comment-16312292 ] Sameer Agarwal commented on SPARK-22055: re-targeting for 2.4.0 > Port release scripts > > > Key: SPARK-22055 > URL: https://issues.apache.org/jira/browse/SPARK-22055 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Blocker > > The current Jenkins jobs are generated from scripts in a private repo. We > should port these to enable changes like SPARK-22054 . -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22055) Port release scripts
[ https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-22055: --- Target Version/s: 2.4.0 (was: 2.3.0) > Port release scripts > > > Key: SPARK-22055 > URL: https://issues.apache.org/jira/browse/SPARK-22055 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Blocker > > The current Jenkins jobs are generated from scripts in a private repo. We > should port these to enable changes like SPARK-22054 . -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22054) Allow release managers to inject their keys
[ https://issues.apache.org/jira/browse/SPARK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312293#comment-16312293 ] Sameer Agarwal commented on SPARK-22054: re-targeting for 2.4.0 > Allow release managers to inject their keys > --- > > Key: SPARK-22054 > URL: https://issues.apache.org/jira/browse/SPARK-22054 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Blocker > > Right now the current release process signs with Patrick's keys, let's update > the scripts to allow the release manager to sign the release as part of the > job. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22632) Fix the behavior of timestamp values for R's DataFrame to respect session timezone
[ https://issues.apache.org/jira/browse/SPARK-22632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312287#comment-16312287 ] Sameer Agarwal commented on SPARK-22632: [~hyukjin.kwon] [~felixcheung] should this be a blocker for 2.3? cc [~ueshin] > Fix the behavior of timestamp values for R's DataFrame to respect session > timezone > -- > > Key: SPARK-22632 > URL: https://issues.apache.org/jira/browse/SPARK-22632 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > Note: wording is borrowed from SPARK-22395. Symptom is similar and I think > that JIRA is well descriptive. > When converting R's DataFrame from/to Spark DataFrame using > {{createDataFrame}} or {{collect}}, timestamp values behave to respect R > system timezone instead of session timezone. > For example, let's say we use "America/Los_Angeles" as session timezone and > have a timestamp value "1970-01-01 00:00:01" in the timezone. Btw, I'm in > South Korea so R timezone would be "KST". > The timestamp value from current collect() will be the following: > {code} > > sparkR.session(master = "local[*]", sparkConfig = > > list(spark.sql.session.timeZone = "America/Los_Angeles")) > > collect(sql("SELECT cast(cast(28801 as timestamp) as string) as ts")) >ts > 1 1970-01-01 00:00:01 > > collect(sql("SELECT cast(28801 as timestamp) as ts")) >ts > 1 1970-01-01 17:00:01 > {code} > As you can see, the value becomes "1970-01-01 17:00:01" because it respects R > system timezone. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22739) Additional Expression Support for Objects
[ https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312281#comment-16312281 ] Sameer Agarwal commented on SPARK-22739: Raising priority for this. This would be great to have in 2.3! > Additional Expression Support for Objects > - > > Key: SPARK-22739 > URL: https://issues.apache.org/jira/browse/SPARK-22739 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Aleksander Eskilson >Priority: Critical > > Some discussion in Spark-Avro [1] motivates additions and minor changes to > the {{Objects}} Expressions API [2]. The proposed changes include > * a generalized form of {{initializeJavaBean}} taking a sequence of > initialization expressions that can be applied to instances of varying objects > * an object cast that performs a simple Java type cast against a value > * making {{ExternalMapToCatalyst}} public, for use in outside libraries > These changes would facilitate the writing of custom encoders for varying > objects that cannot already be readily converted to a statically typed > dataset by a JavaBean encoder (e.g. Avro). > [1] -- > https://github.com/databricks/spark-avro/pull/217#issuecomment-342599110 > [2] -- > > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22739) Additional Expression Support for Objects
[ https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-22739: --- Priority: Critical (was: Major) > Additional Expression Support for Objects > - > > Key: SPARK-22739 > URL: https://issues.apache.org/jira/browse/SPARK-22739 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Aleksander Eskilson >Priority: Critical > > Some discussion in Spark-Avro [1] motivates additions and minor changes to > the {{Objects}} Expressions API [2]. The proposed changes include > * a generalized form of {{initializeJavaBean}} taking a sequence of > initialization expressions that can be applied to instances of varying objects > * an object cast that performs a simple Java type cast against a value > * making {{ExternalMapToCatalyst}} public, for use in outside libraries > These changes would facilitate the writing of custom encoders for varying > objects that cannot already be readily converted to a statically typed > dataset by a JavaBean encoder (e.g. Avro). > [1] -- > https://github.com/databricks/spark-avro/pull/217#issuecomment-342599110 > [2] -- > > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set
[ https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-22799: --- Target Version/s: 2.4.0 (was: 2.3.0) > Bucketizer should throw exception if single- and multi-column params are both > set > - > > Key: SPARK-22799 > URL: https://issues.apache.org/jira/browse/SPARK-22799 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath > > See the related discussion: > https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22798) Add multiple column support to PySpark StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-22798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-22798: --- Target Version/s: 2.4.0 (was: 2.3.0) > Add multiple column support to PySpark StringIndexer > > > Key: SPARK-22798 > URL: https://issues.apache.org/jira/browse/SPARK-22798 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Nick Pentreath > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set
[ https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312275#comment-16312275 ] Sameer Agarwal commented on SPARK-22799: [~mgaido] [~mlnick] I'm re-targeting this for 2.4.0. Please let me know if you think this should block 2.3.0. > Bucketizer should throw exception if single- and multi-column params are both > set > - > > Key: SPARK-22799 > URL: https://issues.apache.org/jira/browse/SPARK-22799 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath > > See the related discussion: > https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22960) Make build-push-docker-images.sh more dev-friendly
[ https://issues.apache.org/jira/browse/SPARK-22960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-22960: -- Assignee: Marcelo Vanzin > Make build-push-docker-images.sh more dev-friendly > -- > > Key: SPARK-22960 > URL: https://issues.apache.org/jira/browse/SPARK-22960 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > It's currently kinda hard to test the kubernetes backend. To use > {{build-push-docker-images.sh}} you have to create a Spark dist archive which > slows things down a lot if you're doing iterative development. > The script could be enhanced to make it easier for developers to test their > changes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22960) Make build-push-docker-images.sh more dev-friendly
[ https://issues.apache.org/jira/browse/SPARK-22960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22960. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20154 [https://github.com/apache/spark/pull/20154] > Make build-push-docker-images.sh more dev-friendly > -- > > Key: SPARK-22960 > URL: https://issues.apache.org/jira/browse/SPARK-22960 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > It's currently kinda hard to test the kubernetes backend. To use > {{build-push-docker-images.sh}} you have to create a Spark dist archive which > slows things down a lot if you're doing iterative development. > The script could be enhanced to make it easier for developers to test their > changes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312264#comment-16312264 ] Bago Amirbekian commented on SPARK-22126: - [~bryanc] thanks for taking the time to put together the PR and share thoughts. I like the idea of being able to preserve the existing APIs and not needing to add a new fitMultiple API, but I'm concerned the existing APIs aren't quite flexible enough. One of the use cases that motivated the {{ fitMultiple }} API was optimizing the Pipeline Estimator. The Pipeline Estimator seems like an important one to optimize because I believe it's required in order for CrossValidator to be able to exploit optimized implementations of the {{ fit }}/{{ fitMultiple }} methods of Pipeline stages. The way one would optimize the Pipeline Estimator is to group the paramMaps into a tree structure where each level represents a stage with a param that can take multiple values. One would then traverse the tree in depth first order. Notice that in this case the params need not be estimator params, but could actually be transformer params as well since we can avoid applying expensive transformers multiple times. With this approach all the params for a pipeline estimator after the top level of the tree are "optimizable", so simply grouping on optimizable params isn't sufficient; we need to actually order the paramMaps to match the depth first traversal of the stages tree. I'm still thinking through all this in my head so let me know if any of it is off base or not clear, but I think the advantage of the {{ fitMultiple }} approach is that it gives us the full flexibility needed for these kinds of optimizations. > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 > Anyone who's following might want to scan the design doc (in the links > above), the latest api proposal is: > {code} > def fitMultiple( > dataset: Dataset[_], > paramMaps: Array[ParamMap] > ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]] > {code} > Old discussion: > I copy discussion from gist to here: > I propose to design API as: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] > {code} > Let me use an example to explain the API: > {quote} > It could be possible to still use the current parallelism and still allow > for model-specific optimizations. For example, if we doing cross validation > and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets > say that the cross validator could know that maxIter is optimized for the > model being evaluated (e.g. a new method in Estimator that return such > params). It would then be straightforward for the cross validator to remove > maxIter from the param map that will be parallelized over and use it to > create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, > maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote} > In this example, we can see that, models computed from ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread > code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, > maxIter=10)) in another thread. In this example, there're 4 paramMaps, but > we can at most generate two threads to compute the models for them. > The API above allow "callable.call()" to return multiple models, and return > type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap > index for corresponding model. Use the example above, there're 4 paramMaps, > but only return 2 callable objects, one callable object for ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, > maxIter=5), (regParam=0.3, maxIter=10)). > and the default "fitCallables/fit with paramMaps" can be implemented as > following: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] = { > paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) => > new Callable[Map[Int, M]] { > override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap)) > } > } > } > def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = { >fitCallables(dataset, paramMaps).map { _.call().toSeq } > .flatMap(_).sortBy(_._1).map(_._2) > } > {code} > If use the
[jira] [Assigned] (SPARK-22965) Add deterministic parameter to registerJavaFunction
[ https://issues.apache.org/jira/browse/SPARK-22965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22965: Assignee: Apache Spark (was: Xiao Li) > Add deterministic parameter to registerJavaFunction > --- > > Key: SPARK-22965 > URL: https://issues.apache.org/jira/browse/SPARK-22965 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark > > To register a JAVA UDF in PySpark, users are unable to specify the registered > UDF is not deterministic. The proposal is to add the extra parameter > deterministic at the end of the function registerJavaFunction > Below is an example. > {noformat} > >>> from pyspark.sql.types import DoubleType > >>> sqlContext.registerJavaFunction("javaRand", > ... "test.org.apache.spark.sql.JavaRandUDF", DoubleType(), > deterministic=False) > >>> sqlContext.sql("SELECT javaRand(3)").collect() # doctest: +SKIP > [Row(UDF:javaRand(3)=3.12345)] > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22965) Add deterministic parameter to registerJavaFunction
[ https://issues.apache.org/jira/browse/SPARK-22965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22965: Assignee: Xiao Li (was: Apache Spark) > Add deterministic parameter to registerJavaFunction > --- > > Key: SPARK-22965 > URL: https://issues.apache.org/jira/browse/SPARK-22965 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > To register a JAVA UDF in PySpark, users are unable to specify the registered > UDF is not deterministic. The proposal is to add the extra parameter > deterministic at the end of the function registerJavaFunction > Below is an example. > {noformat} > >>> from pyspark.sql.types import DoubleType > >>> sqlContext.registerJavaFunction("javaRand", > ... "test.org.apache.spark.sql.JavaRandUDF", DoubleType(), > deterministic=False) > >>> sqlContext.sql("SELECT javaRand(3)").collect() # doctest: +SKIP > [Row(UDF:javaRand(3)=3.12345)] > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22965) Add deterministic parameter to registerJavaFunction
[ https://issues.apache.org/jira/browse/SPARK-22965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312232#comment-16312232 ] Apache Spark commented on SPARK-22965: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20162 > Add deterministic parameter to registerJavaFunction > --- > > Key: SPARK-22965 > URL: https://issues.apache.org/jira/browse/SPARK-22965 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > To register a JAVA UDF in PySpark, users are unable to specify the registered > UDF is not deterministic. The proposal is to add the extra parameter > deterministic at the end of the function registerJavaFunction > Below is an example. > {noformat} > >>> from pyspark.sql.types import DoubleType > >>> sqlContext.registerJavaFunction("javaRand", > ... "test.org.apache.spark.sql.JavaRandUDF", DoubleType(), > deterministic=False) > >>> sqlContext.sql("SELECT javaRand(3)").collect() # doctest: +SKIP > [Row(UDF:javaRand(3)=3.12345)] > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22965) Add deterministic parameter to registerJavaFunction
Xiao Li created SPARK-22965: --- Summary: Add deterministic parameter to registerJavaFunction Key: SPARK-22965 URL: https://issues.apache.org/jira/browse/SPARK-22965 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li To register a JAVA UDF in PySpark, users are unable to specify the registered UDF is not deterministic. The proposal is to add the extra parameter deterministic at the end of the function registerJavaFunction Below is an example. {noformat} >>> from pyspark.sql.types import DoubleType >>> sqlContext.registerJavaFunction("javaRand", ... "test.org.apache.spark.sql.JavaRandUDF", DoubleType(), deterministic=False) >>> sqlContext.sql("SELECT javaRand(3)").collect() # doctest: +SKIP [Row(UDF:javaRand(3)=3.12345)] {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22953) Duplicated secret volumes in Spark pods when init-containers are used
[ https://issues.apache.org/jira/browse/SPARK-22953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22953. Resolution: Fixed Issue resolved by pull request 20159 [https://github.com/apache/spark/pull/20159] > Duplicated secret volumes in Spark pods when init-containers are used > - > > Key: SPARK-22953 > URL: https://issues.apache.org/jira/browse/SPARK-22953 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li >Assignee: Yinan Li > Fix For: 2.3.0 > > > User-specified secrets are mounted into both the main container and > init-container (when it is used) in a Spark driver/executor pod, using the > {{MountSecretsBootstrap}}. Because {{MountSecretsBootstrap}} always adds the > secret volumes to the pod, the same secret volumes get added twice, one when > mounting the secrets to the main container, and the other when mounting the > secrets to the init-container. See > https://github.com/apache-spark-on-k8s/spark/issues/594. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22953) Duplicated secret volumes in Spark pods when init-containers are used
[ https://issues.apache.org/jira/browse/SPARK-22953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-22953: -- Assignee: Yinan Li > Duplicated secret volumes in Spark pods when init-containers are used > - > > Key: SPARK-22953 > URL: https://issues.apache.org/jira/browse/SPARK-22953 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li >Assignee: Yinan Li > Fix For: 2.3.0 > > > User-specified secrets are mounted into both the main container and > init-container (when it is used) in a Spark driver/executor pod, using the > {{MountSecretsBootstrap}}. Because {{MountSecretsBootstrap}} always adds the > secret volumes to the pod, the same secret volumes get added twice, one when > mounting the secrets to the main container, and the other when mounting the > secrets to the init-container. See > https://github.com/apache-spark-on-k8s/spark/issues/594. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21525) ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL
[ https://issues.apache.org/jira/browse/SPARK-21525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312202#comment-16312202 ] Apache Spark commented on SPARK-21525: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/20161 > ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL > - > > Key: SPARK-21525 > URL: https://issues.apache.org/jira/browse/SPARK-21525 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Mark Grover > > {{AddBlock}} returns an error code related to whether writing the block to > the WAL was successful or not. In cases where a WAL may be unavailable > temporarily, the write would fail but it seems like we are not using the > return code (see > [here|https://github.com/apache/spark/blob/ba8912e5f3d5c5a366cb3d1f6be91f2471d048d2/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala#L162]). > For example, when using the Flume Receiver, we should be sending a n'ack back > to Flume if the block wasn't written to the WAL. I haven't gone through the > full code path yet but at least from looking at the ReceiverSupervisorImpl, > it doesn't seem like that return code is being used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21525) ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL
[ https://issues.apache.org/jira/browse/SPARK-21525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21525: Assignee: Apache Spark > ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL > - > > Key: SPARK-21525 > URL: https://issues.apache.org/jira/browse/SPARK-21525 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Mark Grover >Assignee: Apache Spark > > {{AddBlock}} returns an error code related to whether writing the block to > the WAL was successful or not. In cases where a WAL may be unavailable > temporarily, the write would fail but it seems like we are not using the > return code (see > [here|https://github.com/apache/spark/blob/ba8912e5f3d5c5a366cb3d1f6be91f2471d048d2/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala#L162]). > For example, when using the Flume Receiver, we should be sending a n'ack back > to Flume if the block wasn't written to the WAL. I haven't gone through the > full code path yet but at least from looking at the ReceiverSupervisorImpl, > it doesn't seem like that return code is being used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21525) ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL
[ https://issues.apache.org/jira/browse/SPARK-21525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21525: Assignee: (was: Apache Spark) > ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL > - > > Key: SPARK-21525 > URL: https://issues.apache.org/jira/browse/SPARK-21525 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Mark Grover > > {{AddBlock}} returns an error code related to whether writing the block to > the WAL was successful or not. In cases where a WAL may be unavailable > temporarily, the write would fail but it seems like we are not using the > return code (see > [here|https://github.com/apache/spark/blob/ba8912e5f3d5c5a366cb3d1f6be91f2471d048d2/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala#L162]). > For example, when using the Flume Receiver, we should be sending a n'ack back > to Flume if the block wasn't written to the WAL. I haven't gone through the > full code path yet but at least from looking at the ReceiverSupervisorImpl, > it doesn't seem like that return code is being used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies
[ https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312172#comment-16312172 ] Apache Spark commented on SPARK-22757: -- User 'liyinan926' has created a pull request for this issue: https://github.com/apache/spark/pull/20160 > Init-container in the driver/executor pods for downloading remote dependencies > -- > > Key: SPARK-22757 > URL: https://issues.apache.org/jira/browse/SPARK-22757 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li >Assignee: Yinan Li > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22953) Duplicated secret volumes in Spark pods when init-containers are used
[ https://issues.apache.org/jira/browse/SPARK-22953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312160#comment-16312160 ] Apache Spark commented on SPARK-22953: -- User 'liyinan926' has created a pull request for this issue: https://github.com/apache/spark/pull/20159 > Duplicated secret volumes in Spark pods when init-containers are used > - > > Key: SPARK-22953 > URL: https://issues.apache.org/jira/browse/SPARK-22953 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li > Fix For: 2.3.0 > > > User-specified secrets are mounted into both the main container and > init-container (when it is used) in a Spark driver/executor pod, using the > {{MountSecretsBootstrap}}. Because {{MountSecretsBootstrap}} always adds the > secret volumes to the pod, the same secret volumes get added twice, one when > mounting the secrets to the main container, and the other when mounting the > secrets to the init-container. See > https://github.com/apache-spark-on-k8s/spark/issues/594. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22948) "SparkPodInitContainer" shouldn't be in "rest" package
[ https://issues.apache.org/jira/browse/SPARK-22948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-22948: -- Assignee: Marcelo Vanzin > "SparkPodInitContainer" shouldn't be in "rest" package > -- > > Key: SPARK-22948 > URL: https://issues.apache.org/jira/browse/SPARK-22948 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Trivial > Fix For: 2.3.0 > > > Just noticed while playing with the code that this class is in > {{org.apache.spark.deploy.rest.k8s}}; "rest" doesn't make sense here since > there's no REST server (and it's the only class in there, too). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22948) "SparkPodInitContainer" shouldn't be in "rest" package
[ https://issues.apache.org/jira/browse/SPARK-22948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22948. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20156 [https://github.com/apache/spark/pull/20156] > "SparkPodInitContainer" shouldn't be in "rest" package > -- > > Key: SPARK-22948 > URL: https://issues.apache.org/jira/browse/SPARK-22948 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Trivial > Fix For: 2.3.0 > > > Just noticed while playing with the code that this class is in > {{org.apache.spark.deploy.rest.k8s}}; "rest" doesn't make sense here since > there's no REST server (and it's the only class in there, too). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22850) Executor page in SHS does not show driver
[ https://issues.apache.org/jira/browse/SPARK-22850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-22850: Assignee: Marcelo Vanzin > Executor page in SHS does not show driver > - > > Key: SPARK-22850 > URL: https://issues.apache.org/jira/browse/SPARK-22850 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.3.0 > > > This bug is sort of related to SPARK-22836. > Starting with Spark 2.2 (at least), the event logs generated by Spark do not > contain a {{SparkListenerBlockManagerAdded}} entry for the driver. That means > when applications are replayed in the SHS, the driver is not listed in the > executors page. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22850) Executor page in SHS does not show driver
[ https://issues.apache.org/jira/browse/SPARK-22850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-22850. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20039 [https://github.com/apache/spark/pull/20039] > Executor page in SHS does not show driver > - > > Key: SPARK-22850 > URL: https://issues.apache.org/jira/browse/SPARK-22850 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.3.0 > > > This bug is sort of related to SPARK-22836. > Starting with Spark 2.2 (at least), the event logs generated by Spark do not > contain a {{SparkListenerBlockManagerAdded}} entry for the driver. That means > when applications are replayed in the SHS, the driver is not listed in the > executors page. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22963) Make failure recovery global and automatic for continuous processing.
[ https://issues.apache.org/jira/browse/SPARK-22963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Torres updated SPARK-22963: Summary: Make failure recovery global and automatic for continuous processing. (was: Clean up continuous processing failure recovery) > Make failure recovery global and automatic for continuous processing. > - > > Key: SPARK-22963 > URL: https://issues.apache.org/jira/browse/SPARK-22963 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres > > Spark native task restarts don't work well for continuous processing. They > will process all data from the task's original start offset - even data which > has already been committed. This is not semantically incorrect under at least > once semantics, but it's awkward and bad. > Fortunately, they're also not necessary; the central coordinator can restart > every task from the checkpointed offsets without losing much. So we should > make that happen on task failures instead. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22963) Make failure recovery global and automatic for continuous processing.
[ https://issues.apache.org/jira/browse/SPARK-22963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Torres updated SPARK-22963: Description: Spark native task restarts don't work well for continuous processing. They will process all data from the task's original start offset - even data which has already been committed. This is not semantically incorrect under at least once semantics, but it's awkward and bad. Fortunately, they're also not necessary; the central coordinator can restart every task from the checkpointed offsets without losing much. So we should make that happen automatically on task failures instead. was: Spark native task restarts don't work well for continuous processing. They will process all data from the task's original start offset - even data which has already been committed. This is not semantically incorrect under at least once semantics, but it's awkward and bad. Fortunately, they're also not necessary; the central coordinator can restart every task from the checkpointed offsets without losing much. So we should make that happen on task failures instead. > Make failure recovery global and automatic for continuous processing. > - > > Key: SPARK-22963 > URL: https://issues.apache.org/jira/browse/SPARK-22963 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres > > Spark native task restarts don't work well for continuous processing. They > will process all data from the task's original start offset - even data which > has already been committed. This is not semantically incorrect under at least > once semantics, but it's awkward and bad. > Fortunately, they're also not necessary; the central coordinator can restart > every task from the checkpointed offsets without losing much. So we should > make that happen automatically on task failures instead. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22963) Clean up continuous processing failure recovery
[ https://issues.apache.org/jira/browse/SPARK-22963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Torres updated SPARK-22963: Description: Spark native task restarts don't work well for continuous processing. They will process all data from the task's original start offset - even data which has already been committed. This is not semantically incorrect under at least once semantics, but it's awkward and bad. Fortunately, they're also not necessary; the central coordinator can restart every task from the checkpointed offsets without losing much. So we should force > Clean up continuous processing failure recovery > --- > > Key: SPARK-22963 > URL: https://issues.apache.org/jira/browse/SPARK-22963 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres > > Spark native task restarts don't work well for continuous processing. They > will process all data from the task's original start offset - even data which > has already been committed. This is not semantically incorrect under at least > once semantics, but it's awkward and bad. > Fortunately, they're also not necessary; the central coordinator can restart > every task from the checkpointed offsets without losing much. So we should > force -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22963) Clean up continuous processing failure recovery
[ https://issues.apache.org/jira/browse/SPARK-22963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Torres updated SPARK-22963: Description: Spark native task restarts don't work well for continuous processing. They will process all data from the task's original start offset - even data which has already been committed. This is not semantically incorrect under at least once semantics, but it's awkward and bad. Fortunately, they're also not necessary; the central coordinator can restart every task from the checkpointed offsets without losing much. So we should make that happen on task failures instead. was: Spark native task restarts don't work well for continuous processing. They will process all data from the task's original start offset - even data which has already been committed. This is not semantically incorrect under at least once semantics, but it's awkward and bad. Fortunately, they're also not necessary; the central coordinator can restart every task from the checkpointed offsets without losing much. So we should force > Clean up continuous processing failure recovery > --- > > Key: SPARK-22963 > URL: https://issues.apache.org/jira/browse/SPARK-22963 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres > > Spark native task restarts don't work well for continuous processing. They > will process all data from the task's original start offset - even data which > has already been committed. This is not semantically incorrect under at least > once semantics, but it's awkward and bad. > Fortunately, they're also not necessary; the central coordinator can restart > every task from the checkpointed offsets without losing much. So we should > make that happen on task failures instead. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22964) don't allow task restarts for continuous processing
Jose Torres created SPARK-22964: --- Summary: don't allow task restarts for continuous processing Key: SPARK-22964 URL: https://issues.apache.org/jira/browse/SPARK-22964 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jose Torres -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22963) Clean up continuous processing failure recovery
[ https://issues.apache.org/jira/browse/SPARK-22963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Torres updated SPARK-22963: Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-20928) > Clean up continuous processing failure recovery > --- > > Key: SPARK-22963 > URL: https://issues.apache.org/jira/browse/SPARK-22963 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22963) Clean up continuous processing failure recovery
Jose Torres created SPARK-22963: --- Summary: Clean up continuous processing failure recovery Key: SPARK-22963 URL: https://issues.apache.org/jira/browse/SPARK-22963 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jose Torres -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21760) Structured streaming terminates with Exception
[ https://issues.apache.org/jira/browse/SPARK-21760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312089#comment-16312089 ] Shixiong Zhu commented on SPARK-21760: -- Could you try 2.2.1? It's probably fixed by this line: https://github.com/apache/spark/commit/6edfff055caea81dc3a98a6b4081313a0c0b0729#diff-aaeb546880508bb771df502318c40a99L126 > Structured streaming terminates with Exception > --- > > Key: SPARK-21760 > URL: https://issues.apache.org/jira/browse/SPARK-21760 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: gagan taneja >Priority: Critical > > We have seen Structured stream stops with exception below > While analyzing the content we found that latest log file as just one line > with version > {quote}hdfs dfs -cat warehouse/latency_internal/_spark_metadata/1683 > v1 > {quote} > Exception is below > Exception in thread "stream execution thread for latency_internal [id = > 39f35d01-60d5-40b4-826e-99e5e38d0077, runId = > 95c95a01-bd4f-4604-8aae-c0c5d3e873e8]" java.lang.IllegalStateException: > Incomplete log file > at > org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.deserialize(CompactibleFileStreamLog.scala:147) > at > org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.deserialize(CompactibleFileStreamLog.scala:42) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:237) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$getLatest$1.apply$mcVJ$sp(HDFSMetadataLog.scala:266) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$getLatest$1.apply(HDFSMetadataLog.scala:265) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$getLatest$1.apply(HDFSMetadataLog.scala:265) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at > scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.getLatest(HDFSMetadataLog.scala:265) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.(FileStreamSource.scala:60) > at > org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:256) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$logicalPlan$1.applyOrElse(StreamExecution.scala:127) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$logicalPlan$1.applyOrElse(StreamExecution.scala:123) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329) >
[jira] [Commented] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies
[ https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312064#comment-16312064 ] Apache Spark commented on SPARK-22757: -- User 'liyinan926' has created a pull request for this issue: https://github.com/apache/spark/pull/20157 > Init-container in the driver/executor pods for downloading remote dependencies > -- > > Key: SPARK-22757 > URL: https://issues.apache.org/jira/browse/SPARK-22757 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Yinan Li >Assignee: Yinan Li > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21475) Change the usage of FileInputStream/OutputStream to Files.newInput/OutputStream in the critical path
[ https://issues.apache.org/jira/browse/SPARK-21475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-21475. -- Resolution: Fixed Fix Version/s: 2.3.0 3.0.0 Issue resolved by pull request 20144 [https://github.com/apache/spark/pull/20144] > Change the usage of FileInputStream/OutputStream to > Files.newInput/OutputStream in the critical path > > > Key: SPARK-21475 > URL: https://issues.apache.org/jira/browse/SPARK-21475 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 3.0.0, 2.3.0 > > > Java's {{FileInputStream}} and {{FileOutputStream}} override {{finalize()}}; > even if the file input/output stream is closed correctly and promptly, it will > still leave a memory footprint that only gets cleaned up in a Full GC. This > will introduce two side effects: > 1. Lots of Finalizer objects will be kept in memory, > and this will increase the memory overhead. In our external > shuffle service use case, a busy shuffle service will have a bunch of these objects and can > potentially lead to OOM. > 2. The Finalizer will only be called in Full GC, and this will increase the > overhead of Full GC and lead to long GC pauses. > So to fix this potential issue, the proposal here is to use NIO's > Files#newInput/OutputStream instead in some critical paths like shuffle. > https://www.cloudbees.com/blog/fileinputstream-fileoutputstream-considered-harmful -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21475) Change to use NIO's Files API for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-21475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-21475: - Summary: Change to use NIO's Files API for external shuffle service (was: Change the usage of FileInputStream/OutputStream to Files.newInput/OutputStream in the critical path) > Change to use NIO's Files API for external shuffle service > -- > > Key: SPARK-21475 > URL: https://issues.apache.org/jira/browse/SPARK-21475 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.3.0, 3.0.0 > > > Java's {{FileInputStream}} and {{FileOutputStream}} override {{finalize()}}; > even if the file input/output stream is closed correctly and promptly, it will > still leave a memory footprint that only gets cleaned up in a Full GC. This > will introduce two side effects: > 1. Lots of Finalizer objects will be kept in memory, > and this will increase the memory overhead. In our external > shuffle service use case, a busy shuffle service will have a bunch of these objects and can > potentially lead to OOM. > 2. The Finalizer will only be called in Full GC, and this will increase the > overhead of Full GC and lead to long GC pauses. > So to fix this potential issue, the proposal here is to use NIO's > Files#newInput/OutputStream instead in some critical paths like shuffle. > https://www.cloudbees.com/blog/fileinputstream-fileoutputstream-considered-harmful -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
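As a concrete illustration of the proposed replacement, a minimal Scala sketch using the JDK NIO API ({{java.nio.file.Files}}); the path and buffer size below are made up for the example:

{code:java}
import java.nio.file.{Files, Paths}

// Instead of `new FileInputStream(file)`, which registers a Finalizer, open the
// stream through the NIO Files API; closing it leaves nothing for Full GC to clean.
val path = Paths.get("/tmp/blockmgr-example/shuffle_0_0_0.data")
val in = Files.newInputStream(path)
try {
  val buf = new Array[Byte](8192)
  var n = in.read(buf)
  while (n != -1) {
    // ... hand the bytes to the shuffle client ...
    n = in.read(buf)
  }
} finally {
  in.close()
}
{code}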
[jira] [Assigned] (SPARK-12344) Remove env-based configurations
[ https://issues.apache.org/jira/browse/SPARK-12344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-12344: -- Assignee: (was: Marcelo Vanzin) > Remove env-based configurations > --- > > Key: SPARK-12344 > URL: https://issues.apache.org/jira/browse/SPARK-12344 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Reporter: Marcelo Vanzin > > We should remove as many env-based configurations as it makes sense, since > they are deprecated and we prefer to use Spark's configuration. > Tools available through the command line should consistently support both a > properties file with configuration keys and the {{--conf}} command line > argument such as the one SparkSubmit supports. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12344) Remove env-based configurations
[ https://issues.apache.org/jira/browse/SPARK-12344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-12344: -- Assignee: Marcelo Vanzin > Remove env-based configurations > --- > > Key: SPARK-12344 > URL: https://issues.apache.org/jira/browse/SPARK-12344 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > > We should remove as many env-based configurations as it makes sense, since > they are deprecated and we prefer to use Spark's configuration. > Tools available through the command line should consistently support both a > properties file with configuration keys and the {{--conf}} command line > argument such as the one SparkSubmit supports. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20657) Speed up Stage page
[ https://issues.apache.org/jira/browse/SPARK-20657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-20657: --- Target Version/s: 2.3.0 I'm targeting 2.3.0 since the stages page is really slow for large apps without this fix. The fix also changes the data saved to the disk store, so adding this to a later release would require revving the store version and re-parsing event logs, so better to avoid that. The patch has been up for review for a while but nobody has looked at it yet. > Speed up Stage page > --- > > Key: SPARK-20657 > URL: https://issues.apache.org/jira/browse/SPARK-20657 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > The Stage page in the UI is very slow when a large number of tasks exist > (tens of thousands). The new work being done in SPARK-18085 makes that worse, > since it adds potential disk access to the mix. > A lot of the slowness is because the code loads all the tasks in memory then > sorts a really large list, and does a lot of calculations on all the data; > both can be avoided with the new app state store by having smarter indices > (so data is read from the store sorted in the desired order) and by keeping > statistics about metrics pre-calculated (instead of re-doing that on every > page access). > Then only the tasks on the current page (100 items by default) need to > actually be loaded. This also saves a lot on memory usage, not just CPU time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22850) Executor page in SHS does not show driver
[ https://issues.apache.org/jira/browse/SPARK-22850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-22850: --- Target Version/s: 2.3.0 > Executor page in SHS does not show driver > - > > Key: SPARK-22850 > URL: https://issues.apache.org/jira/browse/SPARK-22850 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > This bug is sort of related to SPARK-22836. > Starting with Spark 2.2 (at least), the event logs generated by Spark do not > contain a {{SparkListenerBlockManagerAdded}} entry for the driver. That means > when applications are replayed in the SHS, the driver is not listed in the > executors page. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22948) "SparkPodInitContainer" shouldn't be in "rest" package
[ https://issues.apache.org/jira/browse/SPARK-22948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311864#comment-16311864 ] Apache Spark commented on SPARK-22948: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/20156 > "SparkPodInitContainer" shouldn't be in "rest" package > -- > > Key: SPARK-22948 > URL: https://issues.apache.org/jira/browse/SPARK-22948 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Trivial > > Just noticed while playing with the code that this class is in > {{org.apache.spark.deploy.rest.k8s}}; "rest" doesn't make sense here since > there's no REST server (and it's the only class in there, too). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22948) "SparkPodInitContainer" shouldn't be in "rest" package
[ https://issues.apache.org/jira/browse/SPARK-22948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22948: Assignee: Apache Spark > "SparkPodInitContainer" shouldn't be in "rest" package > -- > > Key: SPARK-22948 > URL: https://issues.apache.org/jira/browse/SPARK-22948 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Trivial > > Just noticed while playing with the code that this class is in > {{org.apache.spark.deploy.rest.k8s}}; "rest" doesn't make sense here since > there's no REST server (and it's the only class in there, too). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22948) "SparkPodInitContainer" shouldn't be in "rest" package
[ https://issues.apache.org/jira/browse/SPARK-22948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22948: Assignee: (was: Apache Spark) > "SparkPodInitContainer" shouldn't be in "rest" package > -- > > Key: SPARK-22948 > URL: https://issues.apache.org/jira/browse/SPARK-22948 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Trivial > > Just noticed while playing with the code that this class is in > {{org.apache.spark.deploy.rest.k8s}}; "rest" doesn't make sense here since > there's no REST server (and it's the only class in there, too). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311859#comment-16311859 ] Li Jin commented on SPARK-22947: I have implemented a simple version of "as-of join" using existing Spark SQL syntax: {code:java} from pyspark.sql.functions import monotonically_increasing_id, col, max df1 = spark.range(0, 1000, 2).toDF('time').withColumn('v1', col('time')) df2 = spark.range(1, 1001, 2).toDF('time').withColumn('v2', col('time')) tolerance = 100 {code} {code:java} df2 = df2.withColumn("rowId", monotonically_increasing_id()) df3 = df2.join(df1, (df2.time >= df1.time) & (df1.time >= df2.time - tolerance), 'leftOuter') max_time = df3.groupby('rowId').agg(max(df1['time'])).sort('rowId') df4 = df3.join(max_time, "rowId", 'leftOuter').filter((df1['time'] == col('max(time)')) | (df1['time'].isNull())) df5 = df4.select(df2['time'], df1['time'], df1['v1'], df2['v2']).sort(df2['time']) df5.count() {code} This works and gives me the correct result. [~rxin] I agree with you that we should separate logical plan from physical execution. However, I don't see how it is feasible for the query planner to understand this logical plan and turn it into a "range partition + merge join" physical plan... > SPIP: as-of join in Spark SQL > - > > Key: SPARK-22947 > URL: https://issues.apache.org/jira/browse/SPARK-22947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Li Jin > Attachments: SPIP_ as-of join in Spark SQL (1).pdf > > > h2. Background and Motivation > Time series analysis is one of the most common analysis on financial data. In > time series analysis, as-of join is a very common operation. Supporting as-of > join in Spark SQL will allow many use cases of using Spark SQL for time > series analysis. > As-of join is “join on time” with inexact time matching criteria. Various > library has implemented asof join or similar functionality: > Kdb: https://code.kx.com/wiki/Reference/aj > Pandas: > http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof > R: This functionality is called “Last Observation Carried Forward” > https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf > JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin > Flint: https://github.com/twosigma/flint#temporal-join-functions > This proposal advocates introducing new API in Spark SQL to support as-of > join. > h2. Target Personas > Data scientists, data engineers > h2. Goals > * New API in Spark SQL that allows as-of join > * As-of join of multiple table (>2) should be performant, because it’s very > common that users need to join multiple data sources together for further > analysis. > * Define Distribution, Partitioning and shuffle strategy for ordered time > series data > h2. Non-Goals > These are out of scope for the existing SPIP, should be considered in future > SPIP as improvement to Spark’s time series analysis ability: > * Utilize partition information from data source, i.e, begin/end of each > partition to reduce sorting/shuffling > * Define API for user to implement asof join time spec in business calendar > (i.e. lookback one business day, this is very common in financial data > analysis because of market calendars) > * Support broadcast join > h2. Proposed API Changes > h3. TimeContext > TimeContext is an object that defines the time scope of the analysis, it has > begin time (inclusive) and end time (exclusive). 
User should be able to > change the time scope of the analysis (i.e, from one month to five year) by > just changing the TimeContext. > To Spark engine, TimeContext is a hint that: > can be used to repartition data for join > serve as a predicate that can be pushed down to storage layer > Time context is similar to filtering time by begin/end, the main difference > is that time context can be expanded based on the operation taken (see > example in as-of join). > Time context example: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > {code} > h3. asofJoin > h4. User Case A (join without key) > Join two DataFrames on time, with one day lookback: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... > JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, quantity > 20160101, 100 > 20160102, 50 > 20160104, -50 > 20160105, 100 > dfB: > time, price > 20151231, 100.0 > 20160104, 105.0 > 20160105, 102.0 > output: > time, quantity, price > 20160101, 100, 100.0 > 20160102, 50, null > 20160104, -50, 105.0 > 20160105, 100, 102.0 > {code} > Note row (20160101, 100) of dfA is jo
[jira] [Updated] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-22962: --- Issue Type: Bug (was: Improvement) > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22962) Kubernetes app fails if local files are used
Marcelo Vanzin created SPARK-22962: -- Summary: Kubernetes app fails if local files are used Key: SPARK-22962 URL: https://issues.apache.org/jira/browse/SPARK-22962 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.3.0 Reporter: Marcelo Vanzin If you try to start a Spark app on kubernetes using a local file as the app resource, for example, it will fail: {code} ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar {code} {noformat} + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR Y -Xmx$SPARK_DRIVER_MEMORY -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS $SPARK_DRIVER_ARGS' Error: Could not find or load main class com.cloudera.spark.tests.Sleeper {noformat} Using an http server to provide the app jar solves the problem. The k8s backend should either somehow make these files available to the cluster or error out with a more user-friendly message if that feature is not yet available. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22961) Constant columns no longer picked as constraints in 2.3
[ https://issues.apache.org/jira/browse/SPARK-22961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311771#comment-16311771 ] Apache Spark commented on SPARK-22961: -- User 'adrian-ionescu' has created a pull request for this issue: https://github.com/apache/spark/pull/20155 > Constant columns no longer picked as constraints in 2.3 > --- > > Key: SPARK-22961 > URL: https://issues.apache.org/jira/browse/SPARK-22961 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 3.0.0 >Reporter: Adrian Ionescu > Labels: constraints, optimizer, regression > > We're no longer picking up {{x = 2}} as a constraint from something like > {{df.withColumn("x", lit(2))}} > The unit test below succeeds in {{branch-2.2}}: > {code} > test("constraints should be inferred from aliased literals") { > val originalLeft = testRelation.subquery('left).as("left") > val optimizedLeft = testRelation.subquery('left).where(IsNotNull('a) && > 'a <=> 2).as("left") > val right = Project(Seq(Literal(2).as("two")), > testRelation.subquery('right)).as("right") > val condition = Some("left.a".attr === "right.two".attr) > val original = originalLeft.join(right, Inner, condition) > val correct = optimizedLeft.join(right, Inner, condition) > comparePlans(Optimize.execute(original.analyze), correct.analyze) > } > {code} > but fails in {{branch-2.3}} with: > {code} > == FAIL: Plans do not match === > 'Join Inner, (two#0 = a#0) 'Join Inner, (two#0 = a#0) > !:- Filter isnotnull(a#0) :- Filter ((2 <=> a#0) && > isnotnull(a#0)) > : +- LocalRelation , [a#0, b#0, c#0] : +- LocalRelation , > [a#0, b#0, c#0] > +- Project [2 AS two#0]+- Project [2 AS two#0] > +- LocalRelation , [a#0, b#0, c#0] +- LocalRelation , > [a#0, b#0, c#0] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22961) Constant columns no longer picked as constraints in 2.3
[ https://issues.apache.org/jira/browse/SPARK-22961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22961: Assignee: (was: Apache Spark) > Constant columns no longer picked as constraints in 2.3 > --- > > Key: SPARK-22961 > URL: https://issues.apache.org/jira/browse/SPARK-22961 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 3.0.0 >Reporter: Adrian Ionescu > Labels: constraints, optimizer, regression > > We're no longer picking up {{x = 2}} as a constraint from something like > {{df.withColumn("x", lit(2))}} > The unit test below succeeds in {{branch-2.2}}: > {code} > test("constraints should be inferred from aliased literals") { > val originalLeft = testRelation.subquery('left).as("left") > val optimizedLeft = testRelation.subquery('left).where(IsNotNull('a) && > 'a <=> 2).as("left") > val right = Project(Seq(Literal(2).as("two")), > testRelation.subquery('right)).as("right") > val condition = Some("left.a".attr === "right.two".attr) > val original = originalLeft.join(right, Inner, condition) > val correct = optimizedLeft.join(right, Inner, condition) > comparePlans(Optimize.execute(original.analyze), correct.analyze) > } > {code} > but fails in {{branch-2.3}} with: > {code} > == FAIL: Plans do not match === > 'Join Inner, (two#0 = a#0) 'Join Inner, (two#0 = a#0) > !:- Filter isnotnull(a#0) :- Filter ((2 <=> a#0) && > isnotnull(a#0)) > : +- LocalRelation , [a#0, b#0, c#0] : +- LocalRelation , > [a#0, b#0, c#0] > +- Project [2 AS two#0]+- Project [2 AS two#0] > +- LocalRelation , [a#0, b#0, c#0] +- LocalRelation , > [a#0, b#0, c#0] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22961) Constant columns no longer picked as constraints in 2.3
[ https://issues.apache.org/jira/browse/SPARK-22961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22961: Assignee: Apache Spark > Constant columns no longer picked as constraints in 2.3 > --- > > Key: SPARK-22961 > URL: https://issues.apache.org/jira/browse/SPARK-22961 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 3.0.0 >Reporter: Adrian Ionescu >Assignee: Apache Spark > Labels: constraints, optimizer, regression > > We're no longer picking up {{x = 2}} as a constraint from something like > {{df.withColumn("x", lit(2))}} > The unit test below succeeds in {{branch-2.2}}: > {code} > test("constraints should be inferred from aliased literals") { > val originalLeft = testRelation.subquery('left).as("left") > val optimizedLeft = testRelation.subquery('left).where(IsNotNull('a) && > 'a <=> 2).as("left") > val right = Project(Seq(Literal(2).as("two")), > testRelation.subquery('right)).as("right") > val condition = Some("left.a".attr === "right.two".attr) > val original = originalLeft.join(right, Inner, condition) > val correct = optimizedLeft.join(right, Inner, condition) > comparePlans(Optimize.execute(original.analyze), correct.analyze) > } > {code} > but fails in {{branch-2.3}} with: > {code} > == FAIL: Plans do not match === > 'Join Inner, (two#0 = a#0) 'Join Inner, (two#0 = a#0) > !:- Filter isnotnull(a#0) :- Filter ((2 <=> a#0) && > isnotnull(a#0)) > : +- LocalRelation , [a#0, b#0, c#0] : +- LocalRelation , > [a#0, b#0, c#0] > +- Project [2 AS two#0]+- Project [2 AS two#0] > +- LocalRelation , [a#0, b#0, c#0] +- LocalRelation , > [a#0, b#0, c#0] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22961) Constant columns no longer picked as constraints in 2.3
Adrian Ionescu created SPARK-22961: -- Summary: Constant columns no longer picked as constraints in 2.3 Key: SPARK-22961 URL: https://issues.apache.org/jira/browse/SPARK-22961 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0, 3.0.0 Reporter: Adrian Ionescu We're no longer picking up {{x = 2}} as a constraint from something like {{df.withColumn("x", lit(2))}} The unit test below succeeds in {{branch-2.2}}: {code} test("constraints should be inferred from aliased literals") { val originalLeft = testRelation.subquery('left).as("left") val optimizedLeft = testRelation.subquery('left).where(IsNotNull('a) && 'a <=> 2).as("left") val right = Project(Seq(Literal(2).as("two")), testRelation.subquery('right)).as("right") val condition = Some("left.a".attr === "right.two".attr) val original = originalLeft.join(right, Inner, condition) val correct = optimizedLeft.join(right, Inner, condition) comparePlans(Optimize.execute(original.analyze), correct.analyze) } {code} but fails in {{branch-2.3}} with: {code} == FAIL: Plans do not match === 'Join Inner, (two#0 = a#0) 'Join Inner, (two#0 = a#0) !:- Filter isnotnull(a#0) :- Filter ((2 <=> a#0) && isnotnull(a#0)) : +- LocalRelation , [a#0, b#0, c#0] : +- LocalRelation , [a#0, b#0, c#0] +- Project [2 AS two#0]+- Project [2 AS two#0] +- LocalRelation , [a#0, b#0, c#0] +- LocalRelation , [a#0, b#0, c#0] {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
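For reference, a hedged user-level repro of the same scenario; the DataFrame names below are made up and {{spark}} is an existing {{SparkSession}}. Comparing the optimized plans on 2.2 and 2.3 shows whether the constant-column constraint is inferred on the left side of the join.

{code:java}
import org.apache.spark.sql.functions.lit

// Hypothetical repro of the constraint-inference scenario in the test above.
val left  = spark.range(10).toDF("a")
val right = spark.range(5).toDF("b").withColumn("two", lit(2))

val joined = left.join(right, left("a") === right("two"))
// On 2.2 the optimized plan carries an inferred `(2 <=> a)` filter on the left
// relation; per this report, 2.3 only infers `isnotnull(a)`.
joined.explain(true)
{code}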
[jira] [Assigned] (SPARK-22960) Make build-push-docker-images.sh more dev-friendly
[ https://issues.apache.org/jira/browse/SPARK-22960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22960: Assignee: Apache Spark > Make build-push-docker-images.sh more dev-friendly > -- > > Key: SPARK-22960 > URL: https://issues.apache.org/jira/browse/SPARK-22960 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > It's currently kinda hard to test the kubernetes backend. To use > {{build-push-docker-images.sh}} you have to create a Spark dist archive which > slows things down a lot if you're doing iterative development. > The script could be enhanced to make it easier for developers to test their > changes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22960) Make build-push-docker-images.sh more dev-friendly
[ https://issues.apache.org/jira/browse/SPARK-22960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311746#comment-16311746 ] Apache Spark commented on SPARK-22960: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/20154 > Make build-push-docker-images.sh more dev-friendly > -- > > Key: SPARK-22960 > URL: https://issues.apache.org/jira/browse/SPARK-22960 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Minor > > It's currently kinda hard to test the kubernetes backend. To use > {{build-push-docker-images.sh}} you have to create a Spark dist archive which > slows things down a lot if you're doing iterative development. > The script could be enhanced to make it easier for developers to test their > changes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22960) Make build-push-docker-images.sh more dev-friendly
[ https://issues.apache.org/jira/browse/SPARK-22960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22960: Assignee: (was: Apache Spark) > Make build-push-docker-images.sh more dev-friendly > -- > > Key: SPARK-22960 > URL: https://issues.apache.org/jira/browse/SPARK-22960 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Minor > > It's currently kinda hard to test the kubernetes backend. To use > {{build-push-docker-images.sh}} you have to create a Spark dist archive which > slows things down a lot if you're doing iterative development. > The script could be enhanced to make it easier for developers to test their > changes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22960) Make build-push-docker-images.sh more dev-friendly
Marcelo Vanzin created SPARK-22960: -- Summary: Make build-push-docker-images.sh more dev-friendly Key: SPARK-22960 URL: https://issues.apache.org/jira/browse/SPARK-22960 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.3.0 Reporter: Marcelo Vanzin Priority: Minor It's currently kinda hard to test the kubernetes backend. To use {{build-push-docker-images.sh}} you have to create a Spark dist archive which slows things down a lot if you're doing iterative development. The script could be enhanced to make it easier for developers to test their changes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7721) Generate test coverage report from Python
[ https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311731#comment-16311731 ] Reynold Xin commented on SPARK-7721: We can add it first but in my experience this will only be used when it is automatic :) > Generate test coverage report from Python > - > > Key: SPARK-7721 > URL: https://issues.apache.org/jira/browse/SPARK-7721 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Reporter: Reynold Xin > > Would be great to have a test coverage report for Python. Compared with Scala, > it is trickier to understand the coverage without coverage reports in Python > because we employ both docstring tests and unit tests in test files. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311675#comment-16311675 ] Li Jin commented on SPARK-22947: There could be multiple matches within the range, and in that case, match only with the row that is closest. > SPIP: as-of join in Spark SQL > - > > Key: SPARK-22947 > URL: https://issues.apache.org/jira/browse/SPARK-22947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Li Jin > Attachments: SPIP_ as-of join in Spark SQL (1).pdf > > > h2. Background and Motivation > Time series analysis is one of the most common analysis on financial data. In > time series analysis, as-of join is a very common operation. Supporting as-of > join in Spark SQL will allow many use cases of using Spark SQL for time > series analysis. > As-of join is “join on time” with inexact time matching criteria. Various > library has implemented asof join or similar functionality: > Kdb: https://code.kx.com/wiki/Reference/aj > Pandas: > http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof > R: This functionality is called “Last Observation Carried Forward” > https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf > JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin > Flint: https://github.com/twosigma/flint#temporal-join-functions > This proposal advocates introducing new API in Spark SQL to support as-of > join. > h2. Target Personas > Data scientists, data engineers > h2. Goals > * New API in Spark SQL that allows as-of join > * As-of join of multiple table (>2) should be performant, because it’s very > common that users need to join multiple data sources together for further > analysis. > * Define Distribution, Partitioning and shuffle strategy for ordered time > series data > h2. Non-Goals > These are out of scope for the existing SPIP, should be considered in future > SPIP as improvement to Spark’s time series analysis ability: > * Utilize partition information from data source, i.e, begin/end of each > partition to reduce sorting/shuffling > * Define API for user to implement asof join time spec in business calendar > (i.e. lookback one business day, this is very common in financial data > analysis because of market calendars) > * Support broadcast join > h2. Proposed API Changes > h3. TimeContext > TimeContext is an object that defines the time scope of the analysis, it has > begin time (inclusive) and end time (exclusive). User should be able to > change the time scope of the analysis (i.e, from one month to five year) by > just changing the TimeContext. > To Spark engine, TimeContext is a hint that: > can be used to repartition data for join > serve as a predicate that can be pushed down to storage layer > Time context is similar to filtering time by begin/end, the main difference > is that time context can be expanded based on the operation taken (see > example in as-of join). > Time context example: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > {code} > h3. asofJoin > h4. User Case A (join without key) > Join two DataFrames on time, with one day lookback: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... 
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, quantity > 20160101, 100 > 20160102, 50 > 20160104, -50 > 20160105, 100 > dfB: > time, price > 20151231, 100.0 > 20160104, 105.0 > 20160105, 102.0 > output: > time, quantity, price > 20160101, 100, 100.0 > 20160102, 50, null > 20160104, -50, 105.0 > 20160105, 100, 102.0 > {code} > Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This > is an important illustration of the time context - it is able to expand the > context to 20151231 on dfB because of the 1 day lookback. > h4. Use Case B (join with key) > To join on time and another key (for instance, id), we use “by” to specify > the key. > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... > JoinSpec joinSpec = > JoinSpec(timeContext).on("time").by("id").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, id, quantity > 20160101, 1, 100 > 20160101, 2, 50 > 20160102, 1, -50 > 20160102, 2, 50 > dfB: > time, id, price > 20151231, 1, 100.0 > 20150102, 1, 105.0 > 20150102, 2, 195.0 > Output: > time, id, quantity, price > 20160101, 1, 100, 100.0 > 20160101, 2, 50, null > 20160102, 1, -50, 105.0 > 20160102, 2, 50, 195.0 > {code} > h2. Optional Design Sketch > h3. Implementation A > (This is just ini
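For readers following the SPIP discussion, here is a minimal sketch of the TimeContext idea described above: a begin-inclusive / end-exclusive scope that can be turned into a pushdown-style filter, and that a lookback such as "-1day" expands on the right-hand table (which is how row (20160101, 100) can match (20151231, 100.0) in Use Case A). TimeContext is not an existing Spark class; the helper below is hypothetical and assumes the time column is a yyyyMMdd string, as in the examples.

{code:java}
import java.time.LocalDate
import java.time.format.DateTimeFormatter

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical helper modelling the SPIP's TimeContext: begin inclusive, end exclusive.
case class TimeContext(begin: String, end: String) {
  private val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")

  // A predicate of this shape is what could, in principle, be pushed down to storage.
  // lookbackDays expands the begin bound, matching the "-1day" tolerance in Use Case A.
  def predicate(timeCol: Column, lookbackDays: Long = 0L): Column = {
    val expandedBegin = LocalDate.parse(begin, fmt).minusDays(lookbackDays).format(fmt)
    timeCol >= expandedBegin && timeCol < end
  }
}

// Scoping dfB with a 1-day lookback keeps rows from 20151231 onwards.
def scopeRightSide(dfB: DataFrame): DataFrame =
  dfB.filter(TimeContext("20160101", "20170101").predicate(col("time"), lookbackDays = 1L))
{code}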
[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311673#comment-16311673 ] Reynold Xin commented on SPARK-22947: - Basically we should separate the logical plan from the physical execution. Sometimes the optimizer won't have enough information to come up with the best execution plan, and we can give the optimizer hints. It looks to me in this case the logical plan doesn't really need any extra language constructs; it can be entirely expressed just using normal inner join with range predicates. The physical execution, on the other hand, should be aware of the specific properties of the join (i.e. only one tuple matching). > SPIP: as-of join in Spark SQL > - > > Key: SPARK-22947 > URL: https://issues.apache.org/jira/browse/SPARK-22947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Li Jin > Attachments: SPIP_ as-of join in Spark SQL (1).pdf > > > h2. Background and Motivation > Time series analysis is one of the most common analysis on financial data. In > time series analysis, as-of join is a very common operation. Supporting as-of > join in Spark SQL will allow many use cases of using Spark SQL for time > series analysis. > As-of join is “join on time” with inexact time matching criteria. Various > library has implemented asof join or similar functionality: > Kdb: https://code.kx.com/wiki/Reference/aj > Pandas: > http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof > R: This functionality is called “Last Observation Carried Forward” > https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf > JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin > Flint: https://github.com/twosigma/flint#temporal-join-functions > This proposal advocates introducing new API in Spark SQL to support as-of > join. > h2. Target Personas > Data scientists, data engineers > h2. Goals > * New API in Spark SQL that allows as-of join > * As-of join of multiple table (>2) should be performant, because it’s very > common that users need to join multiple data sources together for further > analysis. > * Define Distribution, Partitioning and shuffle strategy for ordered time > series data > h2. Non-Goals > These are out of scope for the existing SPIP, should be considered in future > SPIP as improvement to Spark’s time series analysis ability: > * Utilize partition information from data source, i.e, begin/end of each > partition to reduce sorting/shuffling > * Define API for user to implement asof join time spec in business calendar > (i.e. lookback one business day, this is very common in financial data > analysis because of market calendars) > * Support broadcast join > h2. Proposed API Changes > h3. TimeContext > TimeContext is an object that defines the time scope of the analysis, it has > begin time (inclusive) and end time (exclusive). User should be able to > change the time scope of the analysis (i.e, from one month to five year) by > just changing the TimeContext. > To Spark engine, TimeContext is a hint that: > can be used to repartition data for join > serve as a predicate that can be pushed down to storage layer > Time context is similar to filtering time by begin/end, the main difference > is that time context can be expanded based on the operation taken (see > example in as-of join). > Time context example: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > {code} > h3. 
asofJoin > h4. User Case A (join without key) > Join two DataFrames on time, with one day lookback: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... > JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, quantity > 20160101, 100 > 20160102, 50 > 20160104, -50 > 20160105, 100 > dfB: > time, price > 20151231, 100.0 > 20160104, 105.0 > 20160105, 102.0 > output: > time, quantity, price > 20160101, 100, 100.0 > 20160102, 50, null > 20160104, -50, 105.0 > 20160105, 100, 102.0 > {code} > Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This > is an important illustration of the time context - it is able to expand the > context to 20151231 on dfB because of the 1 day lookback. > h4. Use Case B (join with key) > To join on time and another key (for instance, id), we use “by” to specify > the key. > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... > JoinSpec joinSpec = > JoinSpec(timeContext).on("time").by("id").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: >
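To make the preceding comment concrete, the sketch below expresses Use Case A with the existing DataFrame API: a join on a range predicate, followed by keeping only the closest dfB row per dfA row. This is only an illustration of the semantics under discussion, not the proposed asofJoin API and not a statement about how Spark would plan it. Column names and yyyyMMdd string dates follow the example above; a left join is used so that unmatched rows keep a null price.

{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, date_sub, row_number, to_date}

// Emulate "on time, tolerance -1day" from Use Case A. Assumes each dfA row is
// identified by its time value, as in the example data.
def asOfJoinSketch(dfA: DataFrame, dfB: DataFrame): DataFrame = {
  val a = dfA.withColumn("a_date", to_date(col("time"), "yyyyMMdd"))
  val b = dfB
    .withColumnRenamed("time", "b_time")
    .withColumn("b_date", to_date(col("b_time"), "yyyyMMdd"))

  // Range predicate: b.time <= a.time && b.time >= a.time - 1 day.
  val withinTolerance =
    col("b_date") <= col("a_date") && col("b_date") >= date_sub(col("a_date"), 1)

  // Rank the candidate matches per left row by recency and keep only the closest one.
  val closestFirst = Window.partitionBy(col("time")).orderBy(col("b_date").desc)
  a.join(b, withinTolerance, "left")
    .withColumn("rn", row_number().over(closestFirst))
    .filter(col("rn") === 1)
    .select(col("time"), col("quantity"), col("price"))
}
{code}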
[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311667#comment-16311667 ] Reynold Xin commented on SPARK-22947: - So this is just a hint that there is only one matching tuple right? > SPIP: as-of join in Spark SQL > - > > Key: SPARK-22947 > URL: https://issues.apache.org/jira/browse/SPARK-22947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Li Jin > Attachments: SPIP_ as-of join in Spark SQL (1).pdf > > > h2. Background and Motivation > Time series analysis is one of the most common analysis on financial data. In > time series analysis, as-of join is a very common operation. Supporting as-of > join in Spark SQL will allow many use cases of using Spark SQL for time > series analysis. > As-of join is “join on time” with inexact time matching criteria. Various > library has implemented asof join or similar functionality: > Kdb: https://code.kx.com/wiki/Reference/aj > Pandas: > http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof > R: This functionality is called “Last Observation Carried Forward” > https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf > JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin > Flint: https://github.com/twosigma/flint#temporal-join-functions > This proposal advocates introducing new API in Spark SQL to support as-of > join. > h2. Target Personas > Data scientists, data engineers > h2. Goals > * New API in Spark SQL that allows as-of join > * As-of join of multiple table (>2) should be performant, because it’s very > common that users need to join multiple data sources together for further > analysis. > * Define Distribution, Partitioning and shuffle strategy for ordered time > series data > h2. Non-Goals > These are out of scope for the existing SPIP, should be considered in future > SPIP as improvement to Spark’s time series analysis ability: > * Utilize partition information from data source, i.e, begin/end of each > partition to reduce sorting/shuffling > * Define API for user to implement asof join time spec in business calendar > (i.e. lookback one business day, this is very common in financial data > analysis because of market calendars) > * Support broadcast join > h2. Proposed API Changes > h3. TimeContext > TimeContext is an object that defines the time scope of the analysis, it has > begin time (inclusive) and end time (exclusive). User should be able to > change the time scope of the analysis (i.e, from one month to five year) by > just changing the TimeContext. > To Spark engine, TimeContext is a hint that: > can be used to repartition data for join > serve as a predicate that can be pushed down to storage layer > Time context is similar to filtering time by begin/end, the main difference > is that time context can be expanded based on the operation taken (see > example in as-of join). > Time context example: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > {code} > h3. asofJoin > h4. User Case A (join without key) > Join two DataFrames on time, with one day lookback: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... 
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, quantity > 20160101, 100 > 20160102, 50 > 20160104, -50 > 20160105, 100 > dfB: > time, price > 20151231, 100.0 > 20160104, 105.0 > 20160105, 102.0 > output: > time, quantity, price > 20160101, 100, 100.0 > 20160102, 50, null > 20160104, -50, 105.0 > 20160105, 100, 102.0 > {code} > Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This > is an important illustration of the time context - it is able to expand the > context to 20151231 on dfB because of the 1 day lookback. > h4. Use Case B (join with key) > To join on time and another key (for instance, id), we use “by” to specify > the key. > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... > JoinSpec joinSpec = > JoinSpec(timeContext).on("time").by("id").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, id, quantity > 20160101, 1, 100 > 20160101, 2, 50 > 20160102, 1, -50 > 20160102, 2, 50 > dfB: > time, id, price > 20151231, 1, 100.0 > 20150102, 1, 105.0 > 20150102, 2, 195.0 > Output: > time, id, quantity, price > 20160101, 1, 100, 100.0 > 20160101, 2, 50, null > 20160102, 1, -50, 105.0 > 20160102, 2, 50, 195.0 > {code} > h2. Optional Design Sketch > h3. Implementation A > (This is just initial thought of how to implemen
[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311566#comment-16311566 ] Li Jin commented on SPARK-22947: Thanks for the feedback. [~rxin], as-of join is different from range join: an as-of join joins each left row with 0 or 1 right rows (the closest row within the range), whereas a range join joins each left row with 0 to n right rows. To [~marmbrus]'s point, as-of join can be implemented as sugar on top of range join, e.g. a range join followed by a group-by that filters out the right rows that are not the closest. However, implementing as-of join this way is fairly expensive: a range predicate like 'dfA.time > dfB.time & dfB.time > dfA.time - tolerance' results in a CartesianProduct, which needs much more memory and CPU (I think O(N*M)), plus the cost of the group-by to filter down the joined result. An as-of join can be done much faster (basically a range partition followed by a merge join), something like O(M+N) if the input data is already sorted. Regarding SPARK-8682, it seems to target the case where one of the tables can be broadcast; I am not sure it helps the general case (both tables are large) ... > SPIP: as-of join in Spark SQL > - > > Key: SPARK-22947 > URL: https://issues.apache.org/jira/browse/SPARK-22947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Li Jin > Attachments: SPIP_ as-of join in Spark SQL (1).pdf > > > h2. Background and Motivation > Time series analysis is one of the most common analysis on financial data. In > time series analysis, as-of join is a very common operation. Supporting as-of > join in Spark SQL will allow many use cases of using Spark SQL for time > series analysis. > As-of join is “join on time” with inexact time matching criteria. Various > library has implemented asof join or similar functionality: > Kdb: https://code.kx.com/wiki/Reference/aj > Pandas: > http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof > R: This functionality is called “Last Observation Carried Forward” > https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf > JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin > Flint: https://github.com/twosigma/flint#temporal-join-functions > This proposal advocates introducing new API in Spark SQL to support as-of > join. > h2. Target Personas > Data scientists, data engineers > h2. Goals > * New API in Spark SQL that allows as-of join > * As-of join of multiple table (>2) should be performant, because it’s very > common that users need to join multiple data sources together for further > analysis. > * Define Distribution, Partitioning and shuffle strategy for ordered time > series data > h2. Non-Goals > These are out of scope for the existing SPIP, should be considered in future > SPIP as improvement to Spark’s time series analysis ability: > * Utilize partition information from data source, i.e, begin/end of each > partition to reduce sorting/shuffling > * Define API for user to implement asof join time spec in business calendar > (i.e. lookback one business day, this is very common in financial data > analysis because of market calendars) > * Support broadcast join > h2. Proposed API Changes > h3. TimeContext > TimeContext is an object that defines the time scope of the analysis, it has > begin time (inclusive) and end time (exclusive).
User should be able to > change the time scope of the analysis (i.e, from one month to five year) by > just changing the TimeContext. > To Spark engine, TimeContext is a hint that: > can be used to repartition data for join > serve as a predicate that can be pushed down to storage layer > Time context is similar to filtering time by begin/end, the main difference > is that time context can be expanded based on the operation taken (see > example in as-of join). > Time context example: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > {code} > h3. asofJoin > h4. User Case A (join without key) > Join two DataFrames on time, with one day lookback: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... > JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, quantity > 20160101, 100 > 20160102, 50 > 20160104, -50 > 20160105, 100 > dfB: > time, price > 20151231, 100.0 > 20160104, 105.0 > 20160105, 102.0 > output: > time, quantity, price > 20160101, 100, 100.0 > 20160102, 50, null > 20160104, -50, 105.0 > 20160105, 100, 102.0 > {code} > Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This > is
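The O(M+N) claim in the comment above can be illustrated with a tiny single-partition sketch of the sorted-merge idea: with both sides sorted by time, one forward pass finds, for each left timestamp, the latest right timestamp within the tolerance. This is only an illustration of the algorithm on plain arrays, not Spark's execution machinery.

{code:java}
// left:  sorted left timestamps
// right: sorted (timestamp, value) pairs
// Returns, for each left timestamp, the value of the closest right row within
// [leftTime - tolerance, leftTime], or None if there is no such row.
def asOfMerge(
    left: Array[Long],
    right: Array[(Long, Double)],
    tolerance: Long): Array[Option[Double]] = {
  val out = Array.fill[Option[Double]](left.length)(None)
  var j = 0
  for (i <- left.indices) {
    // Advance the right pointer past every right row at or before the current left time;
    // j never moves backwards, so the whole pass is O(M + N).
    while (j < right.length && right(j)._1 <= left(i)) j += 1
    // right(j - 1), if it exists, is the latest right row <= left(i).
    if (j > 0 && left(i) - right(j - 1)._1 <= tolerance) out(i) = Some(right(j - 1)._2)
  }
  out
}
{code}

With the Use Case A data converted to epoch days and a one-day tolerance, this pass reproduces the expected output: 20160101 matches 20151231, and 20160102 is left unmatched.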
[jira] [Assigned] (SPARK-22392) columnar reader interface
[ https://issues.apache.org/jira/browse/SPARK-22392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22392: Assignee: (was: Apache Spark) > columnar reader interface > -- > > Key: SPARK-22392 > URL: https://issues.apache.org/jira/browse/SPARK-22392 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >
[jira] [Commented] (SPARK-22392) columnar reader interface
[ https://issues.apache.org/jira/browse/SPARK-22392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311487#comment-16311487 ] Apache Spark commented on SPARK-22392: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20153 > columnar reader interface > -- > > Key: SPARK-22392 > URL: https://issues.apache.org/jira/browse/SPARK-22392 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >
[jira] [Assigned] (SPARK-22392) columnar reader interface
[ https://issues.apache.org/jira/browse/SPARK-22392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22392: Assignee: Apache Spark > columnar reader interface > -- > > Key: SPARK-22392 > URL: https://issues.apache.org/jira/browse/SPARK-22392 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >
[jira] [Commented] (SPARK-20436) NullPointerException when restart from checkpoint file
[ https://issues.apache.org/jira/browse/SPARK-20436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311402#comment-16311402 ] Riccardo Vincelli commented on SPARK-20436: --- I have tested this myself and it does seem to be the case, thanks [~ffbin] and [~zsxwing]! Spark team, this basically means there is an issue with the serialization of unused streams, or something along those lines, right? It took me a couple of days to figure this out, so it really should be documented in the checkpointing section of the docs, because it is quite unexpected. By the way, I do not think this can be closed as "Not A Problem": it is rather surprising behavior, but perhaps easy to fix. Thanks, > NullPointerException when restart from checkpoint file > -- > > Key: SPARK-20436 > URL: https://issues.apache.org/jira/browse/SPARK-20436 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 1.5.0 >Reporter: fangfengbin > > I have written a Spark Streaming application which have two DStreams. > Code is : > {code} > object KafkaTwoInkfk { > def main(args: Array[String]) { > val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args > val ssc = StreamingContext.getOrCreate(checkPointDir, () => > createContext(args)) > ssc.start() > ssc.awaitTermination() > } > def createContext(args : Array[String]) : StreamingContext = { > val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args > val sparkConf = new SparkConf().setAppName("KafkaWordCount") > val ssc = new StreamingContext(sparkConf, Seconds(batchSize.toLong)) > ssc.checkpoint(checkPointDir) > val topicArr1 = topic1.split(",") > val topicSet1 = topicArr1.toSet > val topicArr2 = topic2.split(",") > val topicSet2 = topicArr2.toSet > val kafkaParams = Map[String, String]( > "metadata.broker.list" -> brokers > ) > val lines1 = KafkaUtils.createDirectStream[String, String, StringDecoder, > StringDecoder](ssc, kafkaParams, topicSet1) > val words1 = lines1.map(_._2).flatMap(_.split(" ")) > val wordCounts1 = words1.map(x => { > (x, 1L)}).reduceByKey(_ + _) > wordCounts1.print() > val lines2 = KafkaUtils.createDirectStream[String, String, StringDecoder, > StringDecoder](ssc, kafkaParams, topicSet2) > val words2 = lines1.map(_._2).flatMap(_.split(" ")) > val wordCounts2 = words2.map(x => { > (x, 1L)}).reduceByKey(_ + _) > wordCounts2.print() > return ssc > } > } > {code} > when restart from checkpoint file, it throw NullPointerException: > java.lang.NullPointerException > at > org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply$mcV$sp(DStreamCheckpointData.scala:126) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291) > at > org.apache.spark.streaming.dstream.DStreamCheckpointData.writeObject(DStreamCheckpointData.scala:124) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:528) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291) > at > org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:523) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAcc
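Based on the discussion above (the problem being a DStream that is created but never consumed: words2 is derived from lines1, so the stream built from topicSet2 has no output operation attached), the snippet below sketches the presumably intended wiring of the second half of createContext. It reuses ssc, kafkaParams and topicSet2 from the reproduction and is the presumed intent of the original code, not a confirmed patch.

{code:java}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Second stream: read from lines2 (not lines1), so the DStream created from topicSet2
// actually has an output operation attached and takes part in checkpointing.
val lines2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicSet2)
val words2 = lines2.map(_._2).flatMap(_.split(" "))
val wordCounts2 = words2.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts2.print()
{code}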
[jira] [Commented] (SPARK-22955) Error generating jobs when Stopping JobGenerator gracefully
[ https://issues.apache.org/jira/browse/SPARK-22955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311399#comment-16311399 ] Sean Owen commented on SPARK-22955: --- [~tdas] should the order of these two stops be reversed? the comment on stopping the event loop suggests it needs to happen 'first', also. > Error generating jobs when Stopping JobGenerator gracefully > --- > > Key: SPARK-22955 > URL: https://issues.apache.org/jira/browse/SPARK-22955 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: zhaoshijie > > when I stop a spark-streaming application with parameter > spark.streaming.stopGracefullyOnShutdown, I get ERROR as follows: > {code:java} > 2018-01-04 17:31:17,524 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: > RECEIVED SIGNAL TERM > 2018-01-04 17:31:17,527 INFO org.apache.spark.streaming.StreamingContext: > Invoking stop(stopGracefully=true) from shutdown hook > 2018-01-04 17:31:17,530 INFO > org.apache.spark.streaming.scheduler.ReceiverTracker: ReceiverTracker stopped > 2018-01-04 17:31:17,531 INFO > org.apache.spark.streaming.scheduler.JobGenerator: Stopping JobGenerator > gracefully > 2018-01-04 17:31:17,532 INFO > org.apache.spark.streaming.scheduler.JobGenerator: Waiting for all received > blocks to be consumed for job generation > 2018-01-04 17:31:17,533 INFO > org.apache.spark.streaming.scheduler.JobGenerator: Waited for all received > blocks to be consumed for job generation > 2018-01-04 17:31:17,747 INFO > org.apache.spark.streaming.scheduler.JobScheduler: Added jobs for time > 1515058267000 ms > 2018-01-04 17:31:18,302 INFO > org.apache.spark.streaming.scheduler.JobScheduler: Added jobs for time > 1515058268000 ms > 2018-01-04 17:31:18,785 INFO > org.apache.spark.streaming.scheduler.JobScheduler: Added jobs for time > 1515058269000 ms > 2018-01-04 17:31:19,001 INFO org.apache.spark.streaming.util.RecurringTimer: > Stopped timer for JobGenerator after time 1515058279000 > 2018-01-04 17:31:19,200 INFO > org.apache.spark.streaming.scheduler.JobScheduler: Added jobs for time > 151505827 ms > 2018-01-04 17:31:19,207 INFO > org.apache.spark.streaming.scheduler.JobGenerator: Stopped generation timer > 2018-01-04 17:31:19,207 INFO > org.apache.spark.streaming.scheduler.JobGenerator: Waiting for jobs to be > processed and checkpoints to be written > 2018-01-04 17:31:19,210 ERROR > org.apache.spark.streaming.scheduler.JobScheduler: Error generating jobs for > time 1515058271000 ms > java.lang.IllegalStateException: This consumer has already been closed. 
> at > org.apache.kafka.clients.consumer.KafkaConsumer.ensureNotClosed(KafkaConsumer.java:1417) > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1428) > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.paranoidPoll(DirectKafkaInputDStream.scala:161) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:180) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:208) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collecti