[jira] [Comment Edited] (SPARK-22967) VersionSuite failed on Windows caused by unescapeSQLString()

2018-01-04 Thread wuyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312579#comment-16312579
 ] 

wuyi edited comment on SPARK-22967 at 1/5/18 6:51 AM:
--

@[~srowen]


was (Author: ngone51):
@[~srowen]

> VersionSuite failed on Windows caused by unescapeSQLString()
> 
>
> Key: SPARK-22967
> URL: https://issues.apache.org/jira/browse/SPARK-22967
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Windows 7
>Reporter: wuyi
>Priority: Minor
>  Labels: build, test, windows
>
> On a Windows system, two unit test cases fail while running VersionSuite 
> ("A simple set of tests that call the methods of a `HiveClient`, loading 
> different version of hive from maven central.")
> Failed A : test(s"$version: read avro file containing decimal") 
> {code:java}
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(message:java.lang.IllegalArgumentException: Can not create a 
> Path from an empty string);
> {code}
> Failed B: test(s"$version: SPARK-17920: Insert into/overwrite avro table")
> {code:java}
> Unable to infer the schema. The schema specification is required to create 
> the table `default`.`tab2`.;
> org.apache.spark.sql.AnalysisException: Unable to infer the schema. The 
> schema specification is required to create the table `default`.`tab2`.;
> {code}
> As I dug into this problem, I found it is related to 
> ParserUtils#unescapeSQLString().
> These are two lines at the beginning of Failed A:
> {code:java}
> val url = 
> Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
> val location = new File(url.getFile)
> {code}
> And in my environment, `location` (the path value) is
> {code:java}
> D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal
> {code}
> And then, in SparkSqlParser#visitCreateHiveTable()#L1128:
> {code:java}
> val location = Option(ctx.locationSpec).map(visitLocationSpec)
> {code}
> This line first gets the LocationSpecContext's content, which is equal to 
> `location` above.
> Then, the content is passed to visitLocationSpec(), and finally to 
> unescapeSQLString().
> Let's have a look at unescapeSQLString():
> {code:java}
> /** Unescape backslash-escaped string enclosed by quotes. */
>   def unescapeSQLString(b: String): String = {
> var enclosure: Character = null
> val sb = new StringBuilder(b.length())
> def appendEscapedChar(n: Char) {
>   n match {
> case '0' => sb.append('\u0000')
> case '\'' => sb.append('\'')
> case '"' => sb.append('\"')
> case 'b' => sb.append('\b')
> case 'n' => sb.append('\n')
> case 'r' => sb.append('\r')
> case 't' => sb.append('\t')
> case 'Z' => sb.append('\u001A')
> case '\\' => sb.append('\\')
> // The following 2 lines are exactly what MySQL does TODO: why do we 
> do this?
> case '%' => sb.append("\\%")
> case '_' => sb.append("\\_")
> case _ => sb.append(n)
>   }
> }
> var i = 0
> val strLength = b.length
> while (i < strLength) {
>   val currentChar = b.charAt(i)
>   if (enclosure == null) {
> if (currentChar == '\'' || currentChar == '\"') {
>   enclosure = currentChar
> }
>   } else if (enclosure == currentChar) {
> enclosure = null
>   } else if (currentChar == '\\') {
> if ((i + 6 < strLength) && b.charAt(i + 1) == 'u') {
>   // \u0000 style character literals.
>   val base = i + 2
>   val code = (0 until 4).foldLeft(0) { (mid, j) =>
> val digit = Character.digit(b.charAt(j + base), 16)
> (mid << 4) + digit
>   }
>   sb.append(code.asInstanceOf[Char])
>   i += 5
> } else if (i + 4 < strLength) {
>   // \000 style character literals.
>   val i1 = b.charAt(i + 1)
>   val i2 = b.charAt(i + 2)
>   val i3 = b.charAt(i + 3)
>   if ((i1 >= '0' && i1 <= '1') && (i2 >= '0' && i2 <= '7') && (i3 >= 
> '0' && i3 <= '7')) {
> val tmp = ((i3 - '0') + ((i2 - '0') << 3) + ((i1 - '0') << 
> 6)).asInstanceOf[Char]
> sb.append(tmp)
> i += 3
>   } else {
> appendEscapedChar(i1)
> i += 1
>   }
> } else if (i + 2 < strLength) {
>   // escaped character literals.
>   val n = b.charAt(i + 1)
>   appendEscapedChar(n)
>   i += 1
> }
>   } else {
> // non-escaped character literals.
> sb.append(currentChar)
>   }
>   i += 1
> }
> sb.toString()
>   }
> {code}
>  Again, here, variable `b` is equal to 
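To make the failure mode concrete, here is a minimal, self-contained sketch (not the Spark code itself; the object and helper names are invented for illustration) of what a backslash-unescaping pass does to the Windows path above:

{code:java}
// Simplified stand-in for the escape handling in unescapeSQLString(); only the
// cases relevant to this path are modeled.
object WindowsPathEscapeDemo {
  private def escaped(c: Char): String = c match {
    case 't'   => "\t"             // "\t" in "...\target\..." becomes a real TAB
    case 'n'   => "\n"
    case other => other.toString   // for "\w", "\I", "\s", ... the backslash is dropped
  }

  def main(args: Array[String]): Unit = {
    val path = """D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal"""
    val sb = new StringBuilder
    var i = 0
    while (i < path.length) {
      if (path.charAt(i) == '\\' && i + 1 < path.length) {
        sb.append(escaped(path.charAt(i + 1)))
        i += 2
      } else {
        sb.append(path.charAt(i))
        i += 1
      }
    }
    // Prints the path with every backslash gone and each "\t" turned into a tab:
    // "D:workspaceIdeaProjectssparksqlhive<TAB>argetscala-2.11<TAB>est-classesavroDecimal"
    println(sb.toString)
  }
}
{code}

A string mangled this way no longer points at the avroDecimal test directory, which is consistent with the two failures quoted above.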

[jira] [Commented] (SPARK-22967) VersionSuite failed on Windows caused by unescapeSQLString()

2018-01-04 Thread wuyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312579#comment-16312579
 ] 

wuyi commented on SPARK-22967:
--

[~srowen]


[jira] [Comment Edited] (SPARK-22967) VersionSuite failed on Windows caused by unescapeSQLString()

2018-01-04 Thread wuyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312579#comment-16312579
 ] 

wuyi edited comment on SPARK-22967 at 1/5/18 6:51 AM:
--

@[~srowen]


was (Author: ngone51):
[~srowen]


[jira] [Updated] (SPARK-22967) VersionSuite failed on Windows caused by unescapeSQLString()

2018-01-04 Thread wuyi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-22967:
-
Shepherd: Sean Owen


[jira] [Resolved] (SPARK-22949) Reduce memory requirement for TrainValidationSplit

2018-01-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-22949.
---
  Resolution: Fixed
   Fix Version/s: 2.4.0
  2.3.0
Target Version/s: 2.3.0, 2.4.0

Resolved by https://github.com/apache/spark/pull/20143

> Reduce memory requirement for TrainValidationSplit
> --
>
> Key: SPARK-22949
> URL: https://issues.apache.org/jira/browse/SPARK-22949
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Critical
> Fix For: 2.3.0, 2.4.0
>
>
> There was a fix in {{ CrossValidator }} to reduce memory usage on the driver; 
> the same patch should be applied to {{ TrainValidationSplit }}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22949) Reduce memory requirement for TrainValidationSplit

2018-01-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-22949:
-

Assignee: Bago Amirbekian

> Reduce memory requirement for TrainValidationSplit
> --
>
> Key: SPARK-22949
> URL: https://issues.apache.org/jira/browse/SPARK-22949
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Critical
>
> There was a fix in {{ CrossValidator }} to reduce memory usage on the driver; 
> the same patch should be applied to {{ TrainValidationSplit }}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22825) Incorrect results of Casting Array to String

2018-01-04 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22825.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20024
[https://github.com/apache/spark/pull/20024]

> Incorrect results of Casting Array to String
> 
>
> Key: SPARK-22825
> URL: https://issues.apache.org/jira/browse/SPARK-22825
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
> Fix For: 2.3.0
>
>
> {code}
> val df = 
> spark.range(10).select('id.cast("integer")).agg(collect_list('id).as('ids))   
> df.write.saveAsTable("t")
> sql("SELECT cast(ids as String) FROM t").show(false)
> {code}
> The output is like
> {code}
> +------------------------------------------------------------------+
> |ids                                                                |
> +------------------------------------------------------------------+
> |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@8bc285df |
> +------------------------------------------------------------------+
> {code}
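For reference, one workaround sketch on 2.2.x (an assumption about a usable alternative, not part of the fix in pull request 20024) is to build the readable string explicitly instead of casting the array itself:

{code}
// Hypothetical workaround: cast the array elements (not the array) to string and
// join them, avoiding the UnsafeArrayData.toString output shown above.
sql("SELECT concat_ws(', ', cast(ids AS array<string>)) AS ids FROM t").show(false)
{code}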



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22825) Incorrect results of Casting Array to String

2018-01-04 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22825:
---

Assignee: Takeshi Yamamuro

> Incorrect results of Casting Array to String
> 
>
> Key: SPARK-22825
> URL: https://issues.apache.org/jira/browse/SPARK-22825
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
> Fix For: 2.3.0
>
>
> {code}
> val df = 
> spark.range(10).select('id.cast("integer")).agg(collect_list('id).as('ids))   
> df.write.saveAsTable("t")
> sql("SELECT cast(ids as String) FROM t").show(false)
> {code}
> The output is like
> {code}
> +------------------------------------------------------------------+
> |ids                                                                |
> +------------------------------------------------------------------+
> |org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@8bc285df |
> +------------------------------------------------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19547) KafkaUtil throw 'No current assignment for partition' Exception

2018-01-04 Thread Jepson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312505#comment-16312505
 ] 

Jepson commented on SPARK-19547:


I have encountered this problem too.
https://issues.apache.org/jira/browse/SPARK-22968

> KafkaUtil throw 'No current assignment for partition' Exception
> ---
>
> Key: SPARK-19547
> URL: https://issues.apache.org/jira/browse/SPARK-19547
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 1.6.1
>Reporter: wuchang
>
> Below is my Scala code to create the Spark Kafka stream:
> val kafkaParams = Map[String, Object](
>   "bootstrap.servers" -> "server110:2181,server110:9092",
>   "zookeeper" -> "server110:2181",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> classOf[StringDeserializer],
>   "group.id" -> "example",
>   "auto.offset.reset" -> "latest",
>   "enable.auto.commit" -> (false: java.lang.Boolean)
> )
> val topics = Array("ABTest")
> val stream = KafkaUtils.createDirectStream[String, String](
>   ssc,
>   PreferConsistent,
>   Subscribe[String, String](topics, kafkaParams)
> )
> But after running for 10 hours, it throws exceptions:
> 2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.ConsumerCoordinator: 
> Revoking previously assigned partitions [ABTest-0, ABTest-1] for group example
> 2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.AbstractCoordinator: 
> (Re-)joining group example
> 2017-02-10 10:56:20,011 INFO  [JobGenerator] internals.AbstractCoordinator: 
> (Re-)joining group example
> 2017-02-10 10:56:40,057 INFO  [JobGenerator] internals.AbstractCoordinator: 
> Successfully joined group example with generation 5
> 2017-02-10 10:56:40,058 INFO  [JobGenerator] internals.ConsumerCoordinator: 
> Setting newly assigned partitions [ABTest-1] for group example
> 2017-02-10 10:56:40,080 ERROR [JobScheduler] scheduler.JobScheduler: Error 
> generating jobs for time 1486695380000 ms
> java.lang.IllegalStateException: No current assignment for partition ABTest-0
> at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
> at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:179)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:196)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
> at scala.Option.orElse(Option.scala:289)
> at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
> at 
> org.

[jira] [Commented] (SPARK-22805) Use aliases for StorageLevel in event logs

2018-01-04 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312499#comment-16312499
 ] 

Imran Rashid commented on SPARK-22805:
--

I'm leaning slightly against this, though I could go either way.

For 2.3+, the gains are pretty small, and it means an old history server can't 
read new logs (I know we don't guarantee that anyway, but we might as well keep it 
if we can).

For < 2.3, there would be notable improvements in log sizes, but I don't like 
the compatibility story. I don't think there are any explicit guarantees, but it 
seems pretty annoying to have a 2.2.1 SHS that cannot read logs from Spark 2.2.2.

Sorry [~lebedev], I appreciate the work you've put into this anyhow.

> Use aliases for StorageLevel in event logs
> --
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Sergei Lebedev
>Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor; therefore the list of 
> predefined levels is not extendable (by the users).
> Fact 2: The format of event logs uses a redundant representation for storage 
> levels:
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, 
> >>> "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of 
> partitions, because every partition carries the storage level field, which 
> is 60-70 bytes larger than it needs to be.
> Suggested quick win: use the names of the predefined levels to identify them 
> in the event log.
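A sketch of that quick win (the alias list and the fallback format below are illustrative assumptions, not the exact JsonProtocol change):

{code}
import org.apache.spark.storage.StorageLevel

object StorageLevelAliases {
  // Names of the predefined levels; the reverse map is built through the public
  // StorageLevel.fromString() helper.
  private val names = Seq(
    "NONE", "DISK_ONLY", "DISK_ONLY_2", "MEMORY_ONLY", "MEMORY_ONLY_2",
    "MEMORY_ONLY_SER", "MEMORY_ONLY_SER_2", "MEMORY_AND_DISK", "MEMORY_AND_DISK_2",
    "MEMORY_AND_DISK_SER", "MEMORY_AND_DISK_SER_2", "OFF_HEAP")
  private val aliasOf: Map[StorageLevel, String] =
    names.map(n => StorageLevel.fromString(n) -> n).toMap

  // Emit the short alias when the level is one of the predefined ones, and fall
  // back to the verbose representation otherwise.
  def toEventLogValue(level: StorageLevel): String =
    aliasOf.getOrElse(level,
      s"""{"Use Disk": ${level.useDisk}, "Use Memory": ${level.useMemory}, """ +
        s""""Deserialized": ${level.deserialized}, "Replication": ${level.replication}}""")
}
{code}

Reading the value back could then go through StorageLevel.fromString whenever it matches one of the aliases.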



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2

2018-01-04 Thread Jepson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jepson updated SPARK-22968:
---
Description: 
*Kafka Broker:*
{code:java}
   message.max.bytes : 2621440  
{code}

*Spark Streaming+Kafka Code:*
{code:java}
, "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 1048576
, "request.timeout.ms" -> (9: java.lang.Integer) //default: 6
, "session.timeout.ms" -> (6: java.lang.Integer) //default: 3
, "heartbeat.interval.ms" -> (5000: java.lang.Integer)
, "receive.buffer.bytes" -> (10485760: java.lang.Integer)
{code}


*Error message:*
{code:java}
8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously 
assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, 
kssh-1] for group use_a_separate_group_id_for_each_stream
18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group 
use_a_separate_group_id_for_each_stream
18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined group 
use_a_separate_group_id_for_each_stream with generation 4
18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned 
partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group 
use_a_separate_group_id_for_each_stream
18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for time 
1515116907000 ms
java.lang.IllegalStateException: No current assignment for partition kssh-2
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
at scala.Option.orElse(Option.scala:289)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at 
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
18/01/05 09:48:27 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
18/01/05 09:48:27 INFO streaming.StreamingContext: Invoking 
stop(stopGracefully=false) from shutdown hook
18/01/05 09:48:27 INFO scheduler.ReceiverT

[jira] [Updated] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2

2018-01-04 Thread Jepson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jepson updated SPARK-22968:
---
Description: 
*Kafka Broker:*
   message.max.bytes : 2621440  

*Spark Streaming+Kafka Code:*
*, "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 
1048576*
, "request.timeout.ms" -> (9: java.lang.Integer) //default: 6
, "session.timeout.ms" -> (6: java.lang.Integer) //default: 3
, "heartbeat.interval.ms" -> (5000: java.lang.Integer)
*, "receive.buffer.bytes" -> (10485760: java.lang.Integer)*


*Error message:*

{code:java}
8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously 
assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, 
kssh-1] for group use_a_separate_group_id_for_each_stream
18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group 
use_a_separate_group_id_for_each_stream
18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined group 
use_a_separate_group_id_for_each_stream with generation 4
18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned 
partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group 
use_a_separate_group_id_for_each_stream
18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for time 
1515116907000 ms
java.lang.IllegalStateException: No current assignment for partition kssh-2
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
at scala.Option.orElse(Option.scala:289)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at 
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
18/01/05 09:48:27 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
18/01/05 09:48:27 INFO streaming.StreamingContext: Invoking 
stop(stopGracefully=false) from shutdown hook
18/01/05 09:48:27 INFO scheduler.ReceiverTracker: ReceiverTracker stopped

[jira] [Updated] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2

2018-01-04 Thread Jepson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jepson updated SPARK-22968:
---
Description: 
CDH-KAFKA:
   message.max.bytes : 2621440  


, "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 1048576
, "request.timeout.ms" -> (9: java.lang.Integer) //default: 6
, "session.timeout.ms" -> (6: java.lang.Integer) //default: 3
, "heartbeat.interval.ms" -> (5000: java.lang.Integer)
, "receive.buffer.bytes" -> (10485760: java.lang.Integer)


*Error message:*

{code:java}
8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously 
assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, 
kssh-1] for group use_a_separate_group_id_for_each_stream
18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group 
use_a_separate_group_id_for_each_stream
18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined group 
use_a_separate_group_id_for_each_stream with generation 4
18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned 
partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group 
use_a_separate_group_id_for_each_stream
18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for time 
1515116907000 ms
java.lang.IllegalStateException: No current assignment for partition kssh-2
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
at scala.Option.orElse(Option.scala:289)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at 
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
18/01/05 09:48:27 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
18/01/05 09:48:27 INFO streaming.StreamingContext: Invoking 
stop(stopGracefully=false) from shutdown hook
18/01/05 09:48:27 INFO scheduler.ReceiverTracker: ReceiverTracker stopped
18/01/05 09:48:27 INFO scheduler.JobGen

[jira] [Closed] (SPARK-17282) Implement ALTER TABLE UPDATE STATISTICS SET

2018-01-04 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang closed SPARK-17282.


> Implement ALTER TABLE UPDATE STATISTICS SET
> ---
>
> Key: SPARK-17282
> URL: https://issues.apache.org/jira/browse/SPARK-17282
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> Users can change the statistics by the DDL statement:
> {noformat}
> ALTER TABLE UPDATE STATISTICS SET
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17282) Implement ALTER TABLE UPDATE STATISTICS SET

2018-01-04 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang resolved SPARK-17282.
--
Resolution: Won't Fix

> Implement ALTER TABLE UPDATE STATISTICS SET
> ---
>
> Key: SPARK-17282
> URL: https://issues.apache.org/jira/browse/SPARK-17282
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> Users can change the statistics by the DDL statement:
> {noformat}
> ALTER TABLE UPDATE STATISTICS SET
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2

2018-01-04 Thread Jepson (JIRA)
Jepson created SPARK-22968:
--

 Summary: java.lang.IllegalStateException: No current assignment 
for partition kssh-2
 Key: SPARK-22968
 URL: https://issues.apache.org/jira/browse/SPARK-22968
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
 Environment: Kafka:  0.10.0  (CDH5.12.0)
Apache Spark 2.1.1

Spark streaming+Kafka
Reporter: Jepson


*Error message:*

{code:java}
8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously 
assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, 
kssh-1] for group use_a_separate_group_id_for_each_stream
18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group 
use_a_separate_group_id_for_each_stream
18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined group 
use_a_separate_group_id_for_each_stream with generation 4
18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned 
partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group 
use_a_separate_group_id_for_each_stream
18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for time 
1515116907000 ms
java.lang.IllegalStateException: No current assignment for partition kssh-2
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
at scala.Option.orElse(Option.scala:289)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at 
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
18/01/05 09:48:27 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
18/01/05 09:48:27 INFO streaming.StreamingContext: Invoking 
stop(stopGracefully=false) from shutdown hook
18/01/05 09:48:27 INFO scheduler.ReceiverTracker: ReceiverTracker stopped
18/01/05 09:48:27 INFO scheduler.JobGenerator: Stopping JobGenerator immediately
18/01/05 09:48:27 INFO util.RecurringTim

[jira] [Commented] (SPARK-17282) Implement ALTER TABLE UPDATE STATISTICS SET

2018-01-04 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312487#comment-16312487
 ] 

Zhenhua Wang commented on SPARK-17282:
--

I'm closing this for now. [~smilegator], [~hvanhovell], [~cloud_fan] If you 
think this functionality is necessary, we can reopen it later.

> Implement ALTER TABLE UPDATE STATISTICS SET
> ---
>
> Key: SPARK-17282
> URL: https://issues.apache.org/jira/browse/SPARK-17282
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> Users can change the statistics by the DDL statement:
> {noformat}
> ALTER TABLE UPDATE STATISTICS SET
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21209) Implement Incremental PCA algorithm for ML

2018-01-04 Thread Sandeep Kumar Choudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312477#comment-16312477
 ] 

Sandeep Kumar Choudhary commented on SPARK-21209:
-

I would like to work on it. I have used IPCA in Python. I am reading a few papers 
to figure out the best possible solution. I will get back to you on this in a few days.

> Implement Incremental PCA algorithm for ML
> --
>
> Key: SPARK-21209
> URL: https://issues.apache.org/jira/browse/SPARK-21209
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Ben St. Clair
>  Labels: features
>
> Incremental Principal Component Analysis is a method for calculating PCAs in 
> an incremental fashion, allowing one to update an existing PCA model as new 
> evidence arrives. Furthermore, an alpha parameter can be used to enable 
> task-specific weighting of new and old evidence.
> This algorithm would be useful for streaming applications, where a fast and 
> adaptive feature subspace calculation could be applied. Furthermore, it can 
> be applied to combine PCAs from subcomponents of large datasets.
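A minimal sketch of the kind of alpha-weighted incremental update such an implementation could build on (one common formulation chosen here for illustration; the eventual design may differ): keep exponentially weighted mean and covariance estimates, and re-derive the principal components from the covariance when needed.

{code}
object IncrementalMoments {
  final case class Moments(mean: Array[Double], cov: Array[Array[Double]])

  // Blend the old statistics with one new sample x; alpha in (0, 1] weights new
  // evidence against old evidence, as described above.
  def update(m: Moments, x: Array[Double], alpha: Double): Moments = {
    val d = x.length
    val newMean = Array.tabulate(d)(i => (1 - alpha) * m.mean(i) + alpha * x(i))
    val newCov = Array.tabulate(d, d) { (i, j) =>
      (1 - alpha) * m.cov(i)(j) + alpha * (x(i) - newMean(i)) * (x(j) - newMean(j))
    }
    Moments(newMean, newCov)
  }
}
{code}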



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2018-01-04 Thread Neil Alexander McQuarrie (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312466#comment-16312466
 ] 

Neil Alexander McQuarrie commented on SPARK-21727:
--

I was able to get a version working by checking for length > 1, as well as 
whether the type is integer, character, logical, double, numeric, Date, 
POSIXlt, or POSIXct.

[https://github.com/neilalex/spark/blob/neilalex-sparkr-arraytype/R/pkg/R/serialize.R#L39]

The reason I'm checking the specific type is that it seems quite a few objects 
flow through getSerdeType() besides just the object we're converting -- for 
example, after calling 'as.DataFrame(myDf)' per above, I saw a jobj "Java ref 
type org.apache.spark.sql.SparkSession id 1" of length 2, as well as a raw 
object of length ~700, both pass through getSerdeType(). So, it seems we can't 
just check length?

Also, in general this makes me a little nervous -- will there be scenarios when 
a 2+ length vector of one of the above types should also not be converted to 
array? I'll familiarize myself further with how getSerdeType() is called, to 
see if I can think this through.

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil Alexander McQuarrie
>Assignee: Neil Alexander McQuarrie
>
> I previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a Stack Overflow question, but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22967) VersionSuite failed on Windows caused by unescapeSQLString()

2018-01-04 Thread wuyi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-22967:
-
Description: 
On a Windows system, two unit test cases fail while running VersionSuite ("A 
simple set of tests that call the methods of a `HiveClient`, loading different 
version of hive from maven central.")

Failed A : test(s"$version: read avro file containing decimal") 

{code:java}
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path 
from an empty string);
{code}

Failed B: test(s"$version: SPARK-17920: Insert into/overwrite avro table")

{code:java}
Unable to infer the schema. The schema specification is required to create the 
table `default`.`tab2`.;
org.apache.spark.sql.AnalysisException: Unable to infer the schema. The schema 
specification is required to create the table `default`.`tab2`.;
{code}

As I dug into this problem, I found it is related to 
ParserUtils#unescapeSQLString().

These are two lines at the beginning of Failed A:

{code:java}
val url = 
Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
val location = new File(url.getFile)
{code}

And in my environment, the `location` (path) value is

{code:java}
D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal
{code}

And then, in SparkSqlParser#visitCreateHiveTable()#L1128:

{code:java}
val location = Option(ctx.locationSpec).map(visitLocationSpec)
{code}
This line first gets the LocationSpecContext's content, which is equal to 
`location` above.
Then, the content is passed to visitLocationSpec(), and finally to 
unescapeSQLString().

Let's have a look at unescapeSQLString():
{code:java}
/** Unescape backslash-escaped string enclosed by quotes. */
  def unescapeSQLString(b: String): String = {
var enclosure: Character = null
val sb = new StringBuilder(b.length())

def appendEscapedChar(n: Char) {
  n match {
case '0' => sb.append('\u0000')
case '\'' => sb.append('\'')
case '"' => sb.append('\"')
case 'b' => sb.append('\b')
case 'n' => sb.append('\n')
case 'r' => sb.append('\r')
case 't' => sb.append('\t')
case 'Z' => sb.append('\u001A')
case '\\' => sb.append('\\')
// The following 2 lines are exactly what MySQL does TODO: why do we do 
this?
case '%' => sb.append("\\%")
case '_' => sb.append("\\_")
case _ => sb.append(n)
  }
}

var i = 0
val strLength = b.length
while (i < strLength) {
  val currentChar = b.charAt(i)
  if (enclosure == null) {
if (currentChar == '\'' || currentChar == '\"') {
  enclosure = currentChar
}
  } else if (enclosure == currentChar) {
enclosure = null
  } else if (currentChar == '\\') {

if ((i + 6 < strLength) && b.charAt(i + 1) == 'u') {
  // \u style character literals.

  val base = i + 2
  val code = (0 until 4).foldLeft(0) { (mid, j) =>
val digit = Character.digit(b.charAt(j + base), 16)
(mid << 4) + digit
  }
  sb.append(code.asInstanceOf[Char])
  i += 5
} else if (i + 4 < strLength) {
  // \000 style character literals.

  val i1 = b.charAt(i + 1)
  val i2 = b.charAt(i + 2)
  val i3 = b.charAt(i + 3)

  if ((i1 >= '0' && i1 <= '1') && (i2 >= '0' && i2 <= '7') && (i3 >= 
'0' && i3 <= '7')) {
val tmp = ((i3 - '0') + ((i2 - '0') << 3) + ((i1 - '0') << 
6)).asInstanceOf[Char]
sb.append(tmp)
i += 3
  } else {
appendEscapedChar(i1)
i += 1
  }
} else if (i + 2 < strLength) {
  // escaped character literals.
  val n = b.charAt(i + 1)
  appendEscapedChar(n)
  i += 1
}
  } else {
// non-escaped character literals.
sb.append(currentChar)
  }
  i += 1
}
sb.toString()
  }
{code}
Again, here, variable `b` is equal to the content and `location` above, i.e. its value is

{code:java}
D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal
{code}

From unescapeSQLString()'s strategy above, we can see that it transforms the 
string "\t" into the escape character '\t' and strips all the other backslashes.
So, our originally correct location turns into:

{code:java}
D:workspaceIdeaProjectssparksqlhive\targetscala-2.11\test-classesavroDecimal
{code}
after unescapeSQLString() completes.
Note that, here, [ \t ] is no longer a two-character string, but a single escape 
(tab) character. 
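
For illustration, here is a minimal Scala sketch of that transformation. One 
assumption on my part: the parser hands unescapeSQLString() the still-quoted STRING 
token, so the path is wrapped in double quotes below.

{code:java}
import org.apache.spark.sql.catalyst.parser.ParserUtils

// The still-quoted token as the parser would see it (backslashes are escaped for
// the Scala string literal; the actual token contains single backslashes).
val token = "\"D:\\workspace\\IdeaProjects\\spark\\sql\\hive\\target\\scala-2.11\\test-classes\\avroDecimal\""

// Prints the mangled path: every lone backslash is swallowed and each "\t" becomes
// a real TAB (\u0009), e.g.
// D:workspaceIdeaProjectssparksqlhive<TAB>argetscala-2.11<TAB>est-classesavroDecimal
println(ParserUtils.unescapeSQLString(token))
{code}

Running this on any OS (not just Windows) shows the backslashes being consumed, 
which is why a Windows-style path cannot survive this code path.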

Then, returning to SparkSqlParser#visitCreateHiveTable(), move on to L1134:

{code:java}
val locUri = location.map(CatalogUtils.stringToURI(_))
{code}

`location` is passed to stringToURI(), resulting in:

{code:java}
file:/D:workspaceIdeaProjectssparksqlhive%09arg

[jira] [Created] (SPARK-22967) VersionSuite failed on Windows caused by unescapeSQLString()

2018-01-04 Thread wuyi (JIRA)
wuyi created SPARK-22967:


 Summary: VersionSuite failed on Windows caused by 
unescapeSQLString()
 Key: SPARK-22967
 URL: https://issues.apache.org/jira/browse/SPARK-22967
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
 Environment: Windows 7
Reporter: wuyi
Priority: Minor


On Windows systems, two unit test cases fail while running VersionSuite ("A 
simple set of tests that call the methods of a `HiveClient`, loading different 
version of hive from maven central.")

Failed A : test(s"$version: read avro file containing decimal") 

{code:java}
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.lang.IllegalArgumentException: Can not create a Path 
from an empty string);
{code}

Failed B: test(s"$version: SPARK-17920: Insert into/overwrite avro table")

{code:java}
Unable to infer the schema. The schema specification is required to create the 
table `default`.`tab2`.;
org.apache.spark.sql.AnalysisException: Unable to infer the schema. The schema 
specification is required to create the table `default`.`tab2`.;
{code}

As I dug into this problem, I found it is related to 
ParserUtils#unescapeSQLString().

These are two lines at the beginning of Failed A:

{code:java}
val url = 
Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
val location = new File(url.getFile)
{code}

And in my environment, the `location` (path) value is

{code:java}
D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal
{code}

And then, in SparkSqlParser#visitCreateHiveTable()#L1128:

{code:java}
val location = Option(ctx.locationSpec).map(visitLocationSpec)
{code}
This line first gets the LocationSpecContext's content, which is equal to 
`location` above.
Then, the content is passed to visitLocationSpec(), and finally to 
unescapeSQLString().

Let's have a look at unescapeSQLString():
{code:java}
/** Unescape backslash-escaped string enclosed by quotes. */
  def unescapeSQLString(b: String): String = {
var enclosure: Character = null
val sb = new StringBuilder(b.length())

def appendEscapedChar(n: Char) {
  n match {
case '0' => sb.append('\u0000')
case '\'' => sb.append('\'')
case '"' => sb.append('\"')
case 'b' => sb.append('\b')
case 'n' => sb.append('\n')
case 'r' => sb.append('\r')
case 't' => sb.append('\t')
case 'Z' => sb.append('\u001A')
case '\\' => sb.append('\\')
// The following 2 lines are exactly what MySQL does TODO: why do we do 
this?
case '%' => sb.append("\\%")
case '_' => sb.append("\\_")
case _ => sb.append(n)
  }
}

var i = 0
val strLength = b.length
while (i < strLength) {
  val currentChar = b.charAt(i)
  if (enclosure == null) {
if (currentChar == '\'' || currentChar == '\"') {
  enclosure = currentChar
}
  } else if (enclosure == currentChar) {
enclosure = null
  } else if (currentChar == '\\') {

if ((i + 6 < strLength) && b.charAt(i + 1) == 'u') {
  // \u style character literals.

  val base = i + 2
  val code = (0 until 4).foldLeft(0) { (mid, j) =>
val digit = Character.digit(b.charAt(j + base), 16)
(mid << 4) + digit
  }
  sb.append(code.asInstanceOf[Char])
  i += 5
} else if (i + 4 < strLength) {
  // \000 style character literals.

  val i1 = b.charAt(i + 1)
  val i2 = b.charAt(i + 2)
  val i3 = b.charAt(i + 3)

  if ((i1 >= '0' && i1 <= '1') && (i2 >= '0' && i2 <= '7') && (i3 >= 
'0' && i3 <= '7')) {
val tmp = ((i3 - '0') + ((i2 - '0') << 3) + ((i1 - '0') << 
6)).asInstanceOf[Char]
sb.append(tmp)
i += 3
  } else {
appendEscapedChar(i1)
i += 1
  }
} else if (i + 2 < strLength) {
  // escaped character literals.
  val n = b.charAt(i + 1)
  appendEscapedChar(n)
  i += 1
}
  } else {
// non-escaped character literals.
sb.append(currentChar)
  }
  i += 1
}
sb.toString()
  }
{code}
Again, here, variable `b` is equal to the content and `location` above, i.e. its value is

{code:java}
D:\workspace\IdeaProjects\spark\sql\hive\target\scala-2.11\test-classes\avroDecimal
{code}

From unescapeSQLString()'s strategy above, we can see that it transforms the 
string "\t" into the escape character '\t' and strips all the other backslashes.
So, our originally correct location turns into:

{code:java}
D:workspaceIdeaProjectssparksqlhive\targetscala-2.11\test-classesavroDecimal
{code}
after unescapeSQLString() completes.

Then, returning to SparkSqlParser#visitCreateHiveTable(), move on to L1134:

{code:java}
val locUr

[jira] [Commented] (SPARK-7721) Generate test coverage report from Python

2018-01-04 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312398#comment-16312398
 ] 

Reynold Xin commented on SPARK-7721:


I think it's fine even if you don't preserve the history forever ...


> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>
> It would be great to have a test coverage report for Python. Compared with Scala, 
> it is trickier to understand the coverage without coverage reports in Python 
> because we employ both docstring tests and unit tests in test files. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22957) ApproxQuantile breaks if the number of rows exceeds MaxInt

2018-01-04 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22957.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20152
[https://github.com/apache/spark/pull/20152]

> ApproxQuantile breaks if the number of rows exceeds MaxInt
> --
>
> Key: SPARK-22957
> URL: https://issues.apache.org/jira/browse/SPARK-22957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
> Fix For: 2.3.0
>
>
> ApproxQuantile overflows when the number of rows exceeds 2.147B (max int32).
> If you run ApproxQuantile on a dataframe with 3B rows of 1 to 3B and ask it 
> for 1/6 quantiles, it should return [0.5B, 1B, 1.5B, 2B, 2.5B, 3B]. However, 
> in the [implementation of 
> ApproxQuantile|https://github.com/apache/spark/blob/v2.2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L195],
>  it calls .toInt on the target rank, which overflows at 2.147B.
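
For illustration only (not part of the original report), the core of the overflow is 
that a fractional rank computed as a Double saturates when narrowed to Int:

{code:java}
// The target rank of the 5/6 quantile over 3B rows is about 2.5B, which does not fit in an Int.
val targetRank: Double = 2.5e9

targetRank.toInt   // 2147483647 -- silently saturates at Int.MaxValue, so the rank is wrong
targetRank.toLong  // 2500000000 -- keeping the rank as a Long avoids the clamp
{code}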



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22957) ApproxQuantile breaks if the number of rows exceeds MaxInt

2018-01-04 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22957:
---

Assignee: Juliusz Sompolski

> ApproxQuantile breaks if the number of rows exceeds MaxInt
> --
>
> Key: SPARK-22957
> URL: https://issues.apache.org/jira/browse/SPARK-22957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
> Fix For: 2.3.0
>
>
> ApproxQuantile overflows when the number of rows exceeds 2.147B (max int32).
> If you run ApproxQuantile on a dataframe with 3B rows of 1 to 3B and ask it 
> for 1/6 quantiles, it should return [0.5B, 1B, 1.5B, 2B, 2.5B, 3B]. However, 
> in the [implementation of 
> ApproxQuantile|https://github.com/apache/spark/blob/v2.2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L195],
>  it calls .toInt on the target rank, which overflows at 2.147B.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22632) Fix the behavior of timestamp values for R's DataFrame to respect session timezone

2018-01-04 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312356#comment-16312356
 ] 

Hyukjin Kwon commented on SPARK-22632:
--

To me, no, I don't think so, although it might be important to have. In the case 
of the related PySpark <> Pandas issue, it was fixed with a configuration to 
control the behaviour.

I was trying to take a look at that time, but I am not sure if it's safe to have 
this at this stage or whether I can make it within the 2.3.0 timeline ... PySpark 
itself still has the issue too, FYI.


> Fix the behavior of timestamp values for R's DataFrame to respect session 
> timezone
> --
>
> Key: SPARK-22632
> URL: https://issues.apache.org/jira/browse/SPARK-22632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> Note: wording is borrowed from SPARK-22395. Symptom is similar and I think 
> that JIRA is well descriptive.
> When converting R's DataFrame from/to Spark DataFrame using 
> {{createDataFrame}} or {{collect}}, timestamp values respect the R system 
> timezone instead of the session timezone.
> For example, let's say we use "America/Los_Angeles" as session timezone and 
> have a timestamp value "1970-01-01 00:00:01" in the timezone. Btw, I'm in 
> South Korea so R timezone would be "KST".
> The timestamp value from current collect() will be the following:
> {code}
> > sparkR.session(master = "local[*]", sparkConfig = 
> > list(spark.sql.session.timeZone = "America/Los_Angeles"))
> > collect(sql("SELECT cast(cast(28801 as timestamp) as string) as ts"))
>ts
> 1 1970-01-01 00:00:01
> > collect(sql("SELECT cast(28801 as timestamp) as ts"))
>ts
> 1 1970-01-01 17:00:01
> {code}
> As you can see, the value becomes "1970-01-01 17:00:01" because it respects R 
> system timezone.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7721) Generate test coverage report from Python

2018-01-04 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312331#comment-16312331
 ] 

Hyukjin Kwon commented on SPARK-7721:
-

I (and possibly a few committers, given [the comment 
above|https://issues.apache.org/jira/browse/SPARK-7721?focusedCommentId=14551198&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14551198])
 would run this, though ... but yes, sure, it becomes really powerful when we can 
run it automatically.

If we are all fine with having a single up-to-date coverage site ("Simplest one") 
for now, it's pretty easy and doable. It's just what I have done so far here - 
https://spark-test.github.io/pyspark-coverage-site - and the only thing I still 
need to do is make this automatic: clone the latest commit and push it.

I know it's better to keep the history of coverage reports and leave the link in 
each PR ("Another one"), and in that case we should consider how to keep the 
history of the coverage reports, etc. This is where I should investigate more and 
verify the idea.

I will anyway test and investigate the integration more and try the "Another one" 
way too. If that fails, I think we can fall back to the "Simplest one" for now. 
Does this sound good to you? This way, I think I can make sure we can run this 
automatically eventually.

BTW, can you take a look for https://github.com/apache/spark/pull/20151 too?

That way we can keep the coverage-only changes separate, and I am trying to 
isolate such logic as much as we can in case we come up with a better idea in the 
future.


> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>
> It would be great to have a test coverage report for Python. Compared with Scala, 
> it is trickier to understand the coverage without coverage reports in Python 
> because we employ both docstring tests and unit tests in test files. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22966) Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime

2018-01-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22966:


Assignee: Apache Spark

> Spark SQL should handle Python UDFs that return a datetime.date or 
> datetime.datetime
> 
>
> Key: SPARK-22966
> URL: https://issues.apache.org/jira/browse/SPARK-22966
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Kris Mok
>Assignee: Apache Spark
>
> Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which 
> should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} 
> (which should correspond to a Spark SQL {{timestamp}} type), it gets 
> unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand 
> internally, and will thus give incorrect results.
> e.g.
> {code:none}
> >>> import datetime
> >>> from pyspark.sql import *
> >>> py_date = udf(datetime.date)
> >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == 
> >>> lit(datetime.date(2017, 10, 30))).show()
> ++
> |(date(2017, 10, 30) = DATE '2017-10-30')|
> ++
> |   false|
> ++
> {code}
> (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 
> 'date')}} doesn't work either)
> We should correctly handle Python UDFs that return objects of such types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22966) Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime

2018-01-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312312#comment-16312312
 ] 

Apache Spark commented on SPARK-22966:
--

User 'rednaxelafx' has created a pull request for this issue:
https://github.com/apache/spark/pull/20163

> Spark SQL should handle Python UDFs that return a datetime.date or 
> datetime.datetime
> 
>
> Key: SPARK-22966
> URL: https://issues.apache.org/jira/browse/SPARK-22966
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Kris Mok
>
> Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which 
> should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} 
> (which should correspond to a Spark SQL {{timestamp}} type), it gets 
> unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand 
> internally, and will thus give incorrect results.
> e.g.
> {code:none}
> >>> import datetime
> >>> from pyspark.sql import *
> >>> py_date = udf(datetime.date)
> >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == 
> >>> lit(datetime.date(2017, 10, 30))).show()
> ++
> |(date(2017, 10, 30) = DATE '2017-10-30')|
> ++
> |   false|
> ++
> {code}
> (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 
> 'date')}} doesn't work either)
> We should correctly handle Python UDFs that return objects of such types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22966) Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime

2018-01-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22966:


Assignee: (was: Apache Spark)

> Spark SQL should handle Python UDFs that return a datetime.date or 
> datetime.datetime
> 
>
> Key: SPARK-22966
> URL: https://issues.apache.org/jira/browse/SPARK-22966
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Kris Mok
>
> Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which 
> should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} 
> (which should correspond to a Spark SQL {{timestamp}} type), it gets 
> unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand 
> internally, and will thus give incorrect results.
> e.g.
> {code:none}
> >>> import datetime
> >>> from pyspark.sql import *
> >>> py_date = udf(datetime.date)
> >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == 
> >>> lit(datetime.date(2017, 10, 30))).show()
> ++
> |(date(2017, 10, 30) = DATE '2017-10-30')|
> ++
> |   false|
> ++
> {code}
> (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 
> 'date')}} doesn't work either)
> We should correctly handle Python UDFs that return objects of such types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22966) Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime

2018-01-04 Thread Kris Mok (JIRA)
Kris Mok created SPARK-22966:


 Summary: Spark SQL should handle Python UDFs that return a 
datetime.date or datetime.datetime
 Key: SPARK-22966
 URL: https://issues.apache.org/jira/browse/SPARK-22966
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.1, 2.2.0
Reporter: Kris Mok


Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which 
should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} (which 
should correspond to a Spark SQL {{timestamp}} type), it gets unpickled into a 
{{java.util.Calendar}} which Spark SQL doesn't understand internally, and will 
thus give incorrect results.

e.g.
{code:python}
>>> import datetime
>>> from pyspark.sql import *
>>> py_date = udf(datetime.date)
>>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == 
>>> lit(datetime.date(2017, 10, 30))).show()
++
|(date(2017, 10, 30) = DATE '2017-10-30')|
++
|   false|
++
{code}
(changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 
'date')}} doesn't work either)

We should correctly handle Python UDFs that return objects of such types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22966) Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime

2018-01-04 Thread Kris Mok (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kris Mok updated SPARK-22966:
-
Description: 
Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which 
should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} (which 
should correspond to a Spark SQL {{timestamp}} type), it gets unpickled into a 
{{java.util.Calendar}} which Spark SQL doesn't understand internally, and will 
thus give incorrect results.

e.g.
{code:none}
>>> import datetime
>>> from pyspark.sql import *
>>> py_date = udf(datetime.date)
>>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == 
>>> lit(datetime.date(2017, 10, 30))).show()
++
|(date(2017, 10, 30) = DATE '2017-10-30')|
++
|   false|
++
{code}
(changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 
'date')}} doesn't work either)

We should correctly handle Python UDFs that return objects of such types.

  was:
Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which 
should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} (which 
should correspond to a Spark SQL {{timestamp}} type), it gets unpickled into a 
{{java.util.Calendar}} which Spark SQL doesn't understand internally, and will 
thus give incorrect results.

e.g.
{code:python}
>>> import datetime
>>> from pyspark.sql import *
>>> py_date = udf(datetime.date)
>>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == 
>>> lit(datetime.date(2017, 10, 30))).show()
++
|(date(2017, 10, 30) = DATE '2017-10-30')|
++
|   false|
++
{code}
(changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 
'date')}} doesn't work either)

We should correctly handle Python UDFs that return objects of such types.


> Spark SQL should handle Python UDFs that return a datetime.date or 
> datetime.datetime
> 
>
> Key: SPARK-22966
> URL: https://issues.apache.org/jira/browse/SPARK-22966
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Kris Mok
>
> Currently, in Spark SQL, if a Python UDF returns a {{datetime.date}} (which 
> should correspond to a Spark SQL {{date}} type) or {{datetime.datetime}} 
> (which should correspond to a Spark SQL {{timestamp}} type), it gets 
> unpickled into a {{java.util.Calendar}} which Spark SQL doesn't understand 
> internally, and will thus give incorrect results.
> e.g.
> {code:none}
> >>> import datetime
> >>> from pyspark.sql import *
> >>> py_date = udf(datetime.date)
> >>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30)) == 
> >>> lit(datetime.date(2017, 10, 30))).show()
> ++
> |(date(2017, 10, 30) = DATE '2017-10-30')|
> ++
> |   false|
> ++
> {code}
> (changing the definition of {{py_date}} from {{udf(date)}} to {{udf(date, 
> 'date')}} doesn't work either)
> We should correctly handle Python UDFs that return objects of such types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2018-01-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21972:
--
Target Version/s: 2.4.0  (was: 2.3.0)

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately 
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) 
> check the persistence level of the input dataset but not any of its parents. 
> These issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.
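
As a rough illustration of the double-caching scenario described above (my own 
sketch with toy data and an assumed local session; the proposed 
{{handlePersistence}} param itself does not exist yet, so it is not shown):

{code:java}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// The user has already cached the parent DataFrame...
val parent = Seq(Vectors.dense(0.0), Vectors.dense(1.0), Vectors.dense(9.0))
  .map(Tuple1.apply).toDF("features").cache()

// ...but the derived input handed to the estimator reports StorageLevel.NONE.
val training = parent.repartition(2)

// The algorithm only inspects `training`, concludes nothing is cached, and persists
// its own copy of the data, so the same rows can end up stored twice.
val model = new KMeans().setK(2).fit(training)
{code}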



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22054) Allow release managers to inject their keys

2018-01-04 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-22054:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Allow release managers to inject their keys
> ---
>
> Key: SPARK-22054
> URL: https://issues.apache.org/jira/browse/SPARK-22054
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Blocker
>
> Right now the current release process signs with Patrick's keys, let's update 
> the scripts to allow the release manager to sign the release as part of the 
> job.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22055) Port release scripts

2018-01-04 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312292#comment-16312292
 ] 

Sameer Agarwal commented on SPARK-22055:


re-targeting for 2.4.0

> Port release scripts
> 
>
> Key: SPARK-22055
> URL: https://issues.apache.org/jira/browse/SPARK-22055
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Blocker
>
> The current Jenkins jobs are generated from scripts in a private repo. We 
> should port these to enable changes like SPARK-22054 .



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22055) Port release scripts

2018-01-04 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-22055:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Port release scripts
> 
>
> Key: SPARK-22055
> URL: https://issues.apache.org/jira/browse/SPARK-22055
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Blocker
>
> The current Jenkins jobs are generated from scripts in a private repo. We 
> should port these to enable changes like SPARK-22054 .



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22054) Allow release managers to inject their keys

2018-01-04 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312293#comment-16312293
 ] 

Sameer Agarwal commented on SPARK-22054:


re-targeting for 2.4.0

> Allow release managers to inject their keys
> ---
>
> Key: SPARK-22054
> URL: https://issues.apache.org/jira/browse/SPARK-22054
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Blocker
>
> Right now the current release process signs with Patrick's keys, let's update 
> the scripts to allow the release manager to sign the release as part of the 
> job.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22632) Fix the behavior of timestamp values for R's DataFrame to respect session timezone

2018-01-04 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312287#comment-16312287
 ] 

Sameer Agarwal commented on SPARK-22632:


[~hyukjin.kwon] [~felixcheung] should this be a blocker for 2.3?

cc [~ueshin]

> Fix the behavior of timestamp values for R's DataFrame to respect session 
> timezone
> --
>
> Key: SPARK-22632
> URL: https://issues.apache.org/jira/browse/SPARK-22632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> Note: wording is borrowed from SPARK-22395. Symptom is similar and I think 
> that JIRA is well descriptive.
> When converting R's DataFrame from/to Spark DataFrame using 
> {{createDataFrame}} or {{collect}}, timestamp values respect the R system 
> timezone instead of the session timezone.
> For example, let's say we use "America/Los_Angeles" as session timezone and 
> have a timestamp value "1970-01-01 00:00:01" in the timezone. Btw, I'm in 
> South Korea so R timezone would be "KST".
> The timestamp value from current collect() will be the following:
> {code}
> > sparkR.session(master = "local[*]", sparkConfig = 
> > list(spark.sql.session.timeZone = "America/Los_Angeles"))
> > collect(sql("SELECT cast(cast(28801 as timestamp) as string) as ts"))
>ts
> 1 1970-01-01 00:00:01
> > collect(sql("SELECT cast(28801 as timestamp) as ts"))
>ts
> 1 1970-01-01 17:00:01
> {code}
> As you can see, the value becomes "1970-01-01 17:00:01" because it respects R 
> system timezone.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22739) Additional Expression Support for Objects

2018-01-04 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312281#comment-16312281
 ] 

Sameer Agarwal commented on SPARK-22739:


Raising priority for this. This would be great to have in 2.3!

> Additional Expression Support for Objects
> -
>
> Key: SPARK-22739
> URL: https://issues.apache.org/jira/browse/SPARK-22739
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Aleksander Eskilson
>Priority: Critical
>
> Some discussion in Spark-Avro [1] motivates additions and minor changes to 
> the {{Objects}} Expressions API [2]. The proposed changes include
> * a generalized form of {{initializeJavaBean}} taking a sequence of 
> initialization expressions that can be applied to instances of varying objects
> * an object cast that performs a simple Java type cast against a value
> * making {{ExternalMapToCatalyst}} public, for use in outside libraries
> These changes would facilitate the writing of custom encoders for varying 
> objects that cannot already be readily converted to a statically typed 
> dataset by a JavaBean encoder (e.g. Avro).
> [1] -- 
> https://github.com/databricks/spark-avro/pull/217#issuecomment-342599110
> [2] --
>  
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22739) Additional Expression Support for Objects

2018-01-04 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-22739:
---
Priority: Critical  (was: Major)

> Additional Expression Support for Objects
> -
>
> Key: SPARK-22739
> URL: https://issues.apache.org/jira/browse/SPARK-22739
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Aleksander Eskilson
>Priority: Critical
>
> Some discussion in Spark-Avro [1] motivates additions and minor changes to 
> the {{Objects}} Expressions API [2]. The proposed changes include
> * a generalized form of {{initializeJavaBean}} taking a sequence of 
> initialization expressions that can be applied to instances of varying objects
> * an object cast that performs a simple Java type cast against a value
> * making {{ExternalMapToCatalyst}} public, for use in outside libraries
> These changes would facilitate the writing of custom encoders for varying 
> objects that cannot already be readily converted to a statically typed 
> dataset by a JavaBean encoder (e.g. Avro).
> [1] -- 
> https://github.com/databricks/spark-avro/pull/217#issuecomment-342599110
> [2] --
>  
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2018-01-04 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-22799:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22798) Add multiple column support to PySpark StringIndexer

2018-01-04 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-22798:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Add multiple column support to PySpark StringIndexer
> 
>
> Key: SPARK-22798
> URL: https://issues.apache.org/jira/browse/SPARK-22798
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2018-01-04 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312275#comment-16312275
 ] 

Sameer Agarwal commented on SPARK-22799:


[~mgaido] [~mlnick] I'm re-targeting this for 2.4.0. Please let me know if you 
think this should block 2.3.0.

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22960) Make build-push-docker-images.sh more dev-friendly

2018-01-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22960:
--

Assignee: Marcelo Vanzin

> Make build-push-docker-images.sh more dev-friendly
> --
>
> Key: SPARK-22960
> URL: https://issues.apache.org/jira/browse/SPARK-22960
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.3.0
>
>
> It's currently kinda hard to test the kubernetes backend. To use 
> {{build-push-docker-images.sh}} you have to create a Spark dist archive, which 
> slows things down a lot if you're doing iterative development.
> The script could be enhanced to make it easier for developers to test their 
> changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22960) Make build-push-docker-images.sh more dev-friendly

2018-01-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22960.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20154
[https://github.com/apache/spark/pull/20154]

> Make build-push-docker-images.sh more dev-friendly
> --
>
> Key: SPARK-22960
> URL: https://issues.apache.org/jira/browse/SPARK-22960
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.3.0
>
>
> It's currently kinda hard to test the kubernetes backend. To use 
> {{build-push-docker-images.sh}} you have to create a Spark dist archive, which 
> slows things down a lot if you're doing iterative development.
> The script could be enhanced to make it easier for developers to test their 
> changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning

2018-01-04 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312264#comment-16312264
 ] 

Bago Amirbekian commented on SPARK-22126:
-

[~bryanc] thanks for taking the time to put together the PR and share thoughts. 
I like the idea of being able to preserve the existing APIs and not needing to 
add a new fitMultiple API but I'm concerned the existing APIs aren't quite 
flexible enough.

One of the use cases that motivated the {{ fitMultiple }} API was optimizing 
the Pipeline Estimator. The Pipeline Estimator seems like an important one to 
optimize because I believe it's required in order for CrossValidator to be able 
to exploit optimized implementations of the {{ fit }}/{{ fitMultiple }} methods 
of Pipeline stages.

The way one would optimize the Pipeline Estimator is to group the paramMaps 
into a tree structure where each level represents a stage with a param that can 
take multiple values. One would then traverse the tree in depth first order. 
Notice that in this case the params need not be estimator params, but could 
actually be transformer params as well since we can avoid applying expensive 
transformers multiple times.

With this approach all the params for a pipeline estimator after the top level 
of the tree are "optimizable" so simply being group on optimizable params isn't 
sufficient, we need to actually order the paramMaps to match the depth first 
traversal of the stages tree.

I'm still thinking through all this in my head so let me know if any of it is 
off base or not clear, but I think the advantage of the {{ fitMultiple }} 
approach is that it gives us full flexibility to do these kinds of optimizations.
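
To make the contrast concrete, here is a rough Scala sketch (my own, not a proposed 
API) of the plain "group on optimizable params" idea that the comment above argues 
is not flexible enough for Pipelines; {{optimizable}} is a hypothetical set of param 
names reported by the estimator:

{code:java}
import org.apache.spark.ml.param.ParamMap

// Bucket the param maps so that, within a bucket, only the optimizable params
// (e.g. maxIter) vary and everything else (e.g. regParam) is fixed. Each bucket
// could then be served by a single optimized fit call.
def groupForOptimizedFit(
    paramMaps: Array[ParamMap],
    optimizable: Set[String]): Map[Seq[(String, Any)], Array[ParamMap]] = {
  paramMaps.groupBy { pm =>
    pm.toSeq
      .filterNot(pair => optimizable.contains(pair.param.name))
      .map(pair => pair.param.name -> (pair.value: Any))
      .sortBy(_._1)
  }
}
{code}

For a Pipeline, as noted above, this flat grouping is not enough: the "optimizable" 
params live at different stages, so the param maps have to be ordered to match a 
depth-first traversal of the stage tree, which is the flexibility {{ fitMultiple }} 
is meant to provide.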

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357; more discussion is here:
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0
> Anyone who's following might want to scan the design doc (in the links 
> above), the latest api proposal is:
> {code}
> def fitMultiple(
> dataset: Dataset[_],
> paramMaps: Array[ParamMap]
>   ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]]
> {code}
> Old discussion:
> I copy discussion from gist to here:
> I propose to design API as:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
> Array[Callable[Map[Int, M]]]
> {code}
> Let me use an example to explain the API:
> {quote}
>  It could be possible to still use the current parallelism and still allow 
> for model-specific optimizations. For example, if we doing cross validation 
> and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets 
> say that the cross validator could know that maxIter is optimized for the 
> model being evaluated (e.g. a new method in Estimator that return such 
> params). It would then be straightforward for the cross validator to remove 
> maxIter from the param map that will be parallelized over and use it to 
> create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote}
> In this example, we can see that, models computed from ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread 
> code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, 
> maxIter=10))  in another thread. In this example, there're 4 paramMaps, but 
> we can at most generate two threads to compute the models for them.
> The API above allow "callable.call()" to return multiple models, and return 
> type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap 
> index for corresponding model. Use the example above, there're 4 paramMaps, 
> but only return 2 callable objects, one callable object for ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, 
> maxIter=5), (regParam=0.3, maxIter=10)).
> and the default "fitCallables/fit with paramMaps" can be implemented as 
> following:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
> Array[Callable[Map[Int, M]]] = {
>   paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
> new Callable[Map[Int, M]] {
>   override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
> }
>   }
> }
> def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
>fitCallables(dataset, paramMaps).map { _.call().toSeq }
>  .flatMap(_).sortBy(_._1).map(_._2)
> }
> {code}
> If use the 

[jira] [Assigned] (SPARK-22965) Add deterministic parameter to registerJavaFunction

2018-01-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22965:


Assignee: Apache Spark  (was: Xiao Li)

> Add deterministic parameter to registerJavaFunction
> ---
>
> Key: SPARK-22965
> URL: https://issues.apache.org/jira/browse/SPARK-22965
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> To register a Java UDF in PySpark, users are unable to specify that the 
> registered UDF is not deterministic. The proposal is to add an extra parameter, 
> deterministic, at the end of the function registerJavaFunction.
> Below is an example. 
> {noformat}
> >>> from pyspark.sql.types import DoubleType
> >>> sqlContext.registerJavaFunction("javaRand",
> ...   "test.org.apache.spark.sql.JavaRandUDF", DoubleType(), 
> deterministic=False)
> >>> sqlContext.sql("SELECT javaRand(3)").collect()  # doctest: +SKIP
> [Row(UDF:javaRand(3)=3.12345)]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22965) Add deterministic parameter to registerJavaFunction

2018-01-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22965:


Assignee: Xiao Li  (was: Apache Spark)

> Add deterministic parameter to registerJavaFunction
> ---
>
> Key: SPARK-22965
> URL: https://issues.apache.org/jira/browse/SPARK-22965
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> To register a Java UDF in PySpark, users are unable to specify that the 
> registered UDF is not deterministic. The proposal is to add an extra parameter, 
> deterministic, at the end of the function registerJavaFunction.
> Below is an example. 
> {noformat}
> >>> from pyspark.sql.types import DoubleType
> >>> sqlContext.registerJavaFunction("javaRand",
> ...   "test.org.apache.spark.sql.JavaRandUDF", DoubleType(), 
> deterministic=False)
> >>> sqlContext.sql("SELECT javaRand(3)").collect()  # doctest: +SKIP
> [Row(UDF:javaRand(3)=3.12345)]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22965) Add deterministic parameter to registerJavaFunction

2018-01-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312232#comment-16312232
 ] 

Apache Spark commented on SPARK-22965:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20162

> Add deterministic parameter to registerJavaFunction
> ---
>
> Key: SPARK-22965
> URL: https://issues.apache.org/jira/browse/SPARK-22965
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> To register a Java UDF in PySpark, users are unable to specify that the 
> registered UDF is not deterministic. The proposal is to add an extra parameter, 
> deterministic, at the end of the function registerJavaFunction.
> Below is an example. 
> {noformat}
> >>> from pyspark.sql.types import DoubleType
> >>> sqlContext.registerJavaFunction("javaRand",
> ...   "test.org.apache.spark.sql.JavaRandUDF", DoubleType(), 
> deterministic=False)
> >>> sqlContext.sql("SELECT javaRand(3)").collect()  # doctest: +SKIP
> [Row(UDF:javaRand(3)=3.12345)]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22965) Add deterministic parameter to registerJavaFunction

2018-01-04 Thread Xiao Li (JIRA)
Xiao Li created SPARK-22965:
---

 Summary: Add deterministic parameter to registerJavaFunction
 Key: SPARK-22965
 URL: https://issues.apache.org/jira/browse/SPARK-22965
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Xiao Li
Assignee: Xiao Li


When registering a Java UDF in PySpark, users are currently unable to specify that 
the registered UDF is non-deterministic. The proposal is to add an extra 
deterministic parameter at the end of registerJavaFunction.

Below is an example:
{noformat}
>>> from pyspark.sql.types import DoubleType
>>> sqlContext.registerJavaFunction("javaRand",
...   "test.org.apache.spark.sql.JavaRandUDF", DoubleType(), 
deterministic=False)
>>> sqlContext.sql("SELECT javaRand(3)").collect()  # doctest: +SKIP
[Row(UDF:javaRand(3)=3.12345)]
{noformat}
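
For reference, here is a minimal Scala sketch of the kind of UDF class that 
registerJavaFunction targets, i.e. a class implementing 
org.apache.spark.sql.api.java.UDF1. The actual source of 
test.org.apache.spark.sql.JavaRandUDF is not shown in this thread, so the class 
below is purely illustrative of why a deterministic flag is needed:

{code}
import java.util.Random

import org.apache.spark.sql.api.java.UDF1

// Illustrative only: a UDF whose output varies between calls for the same input,
// which is exactly the case where deterministic=False matters at registration time.
class RandUDF extends UDF1[java.lang.Integer, java.lang.Double] {
  private val rng = new Random()

  def call(seed: java.lang.Integer): java.lang.Double =
    java.lang.Double.valueOf(seed.intValue().toDouble + rng.nextDouble())
}
{code}

Marking such a UDF as non-deterministic is meant to keep the optimizer from 
treating repeated invocations as interchangeable.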




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22953) Duplicated secret volumes in Spark pods when init-containers are used

2018-01-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22953.

Resolution: Fixed

Issue resolved by pull request 20159
[https://github.com/apache/spark/pull/20159]

> Duplicated secret volumes in Spark pods when init-containers are used
> -
>
> Key: SPARK-22953
> URL: https://issues.apache.org/jira/browse/SPARK-22953
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
>Assignee: Yinan Li
> Fix For: 2.3.0
>
>
> User-specified secrets are mounted into both the main container and the 
> init-container (when one is used) in a Spark driver/executor pod, using the 
> {{MountSecretsBootstrap}}. Because {{MountSecretsBootstrap}} always adds the 
> secret volumes to the pod, the same secret volumes get added twice: once when 
> mounting the secrets into the main container, and again when mounting them 
> into the init-container. See 
> https://github.com/apache-spark-on-k8s/spark/issues/594.
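
To make the duplication easier to see, here is a simplified, self-contained Scala 
model of the behaviour described above; it does not use the real Kubernetes client 
API, and the names are illustrative only:

{code}
// Simplified model: the bootstrap step is applied once for the main container and
// once for the init-container, and each application appends the secret volumes.
case class Volume(name: String)
case class Pod(volumes: Seq[Volume])

def addSecretVolumes(pod: Pod, secretNames: Seq[String]): Pod =
  pod.copy(volumes = pod.volumes ++ secretNames.map(s => Volume(s + "-volume")))

val secrets = Seq("my-secret")
val afterMain = addSecretVolumes(Pod(Nil), secrets)   // mounting into the main container
val afterInit = addSecretVolumes(afterMain, secrets)  // mounting into the init-container

// afterInit.volumes now contains Volume("my-secret-volume") twice; a fix is to add
// the volumes only once, or to de-duplicate them by name before building the pod:
val deduped = afterInit.copy(volumes = afterInit.volumes.distinct)
{code}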



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22953) Duplicated secret volumes in Spark pods when init-containers are used

2018-01-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22953:
--

Assignee: Yinan Li

> Duplicated secret volumes in Spark pods when init-containers are used
> -
>
> Key: SPARK-22953
> URL: https://issues.apache.org/jira/browse/SPARK-22953
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
>Assignee: Yinan Li
> Fix For: 2.3.0
>
>
> User-specified secrets are mounted into both the main container and the 
> init-container (when one is used) in a Spark driver/executor pod, using the 
> {{MountSecretsBootstrap}}. Because {{MountSecretsBootstrap}} always adds the 
> secret volumes to the pod, the same secret volumes get added twice: once when 
> mounting the secrets into the main container, and again when mounting them 
> into the init-container. See 
> https://github.com/apache-spark-on-k8s/spark/issues/594.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21525) ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL

2018-01-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312202#comment-16312202
 ] 

Apache Spark commented on SPARK-21525:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20161

> ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL
> -
>
> Key: SPARK-21525
> URL: https://issues.apache.org/jira/browse/SPARK-21525
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> {{AddBlock}} returns an error code indicating whether writing the block to 
> the WAL was successful. In cases where a WAL may be temporarily unavailable, 
> the write would fail, but it seems like we are not using the 
> return code (see 
> [here|https://github.com/apache/spark/blob/ba8912e5f3d5c5a366cb3d1f6be91f2471d048d2/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala#L162]).
> For example, when using the Flume receiver, we should be sending a nack 
> (negative acknowledgement) back to Flume if the block wasn't written to the 
> WAL. I haven't gone through the full code path yet, but at least from looking 
> at ReceiverSupervisorImpl, it doesn't seem like that return code is being used.
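
A hypothetical sketch of the behaviour being asked for, in plain Scala (this is not 
the actual ReceiverSupervisorImpl code; the trait and method names are made up for 
illustration):

{code}
// The point: propagate the WAL write result to the upstream source instead of ignoring it.
trait WriteAheadLog {
  def write(block: Array[Byte]): Boolean  // true if the block was durably written
}

trait UpstreamSource {
  def ack(): Unit   // tell the source (e.g. Flume) the block is safe
  def nack(): Unit  // ask the source to hold on to / resend the block
}

def storeBlock(block: Array[Byte], wal: WriteAheadLog, source: UpstreamSource): Unit = {
  val written = wal.write(block)
  if (written) source.ack() else source.nack()
}
{code}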



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21525) ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL

2018-01-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21525:


Assignee: Apache Spark

> ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL
> -
>
> Key: SPARK-21525
> URL: https://issues.apache.org/jira/browse/SPARK-21525
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Apache Spark
>
> {{AddBlock}} returns an error code indicating whether writing the block to 
> the WAL was successful. In cases where a WAL may be temporarily unavailable, 
> the write would fail, but it seems like we are not using the 
> return code (see 
> [here|https://github.com/apache/spark/blob/ba8912e5f3d5c5a366cb3d1f6be91f2471d048d2/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala#L162]).
> For example, when using the Flume receiver, we should be sending a nack 
> (negative acknowledgement) back to Flume if the block wasn't written to the 
> WAL. I haven't gone through the full code path yet, but at least from looking 
> at ReceiverSupervisorImpl, it doesn't seem like that return code is being used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21525) ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL

2018-01-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21525:


Assignee: (was: Apache Spark)

> ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL
> -
>
> Key: SPARK-21525
> URL: https://issues.apache.org/jira/browse/SPARK-21525
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> {{AddBlock}} returns an error code indicating whether writing the block to 
> the WAL was successful. In cases where a WAL may be temporarily unavailable, 
> the write would fail, but it seems like we are not using the 
> return code (see 
> [here|https://github.com/apache/spark/blob/ba8912e5f3d5c5a366cb3d1f6be91f2471d048d2/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala#L162]).
> For example, when using the Flume receiver, we should be sending a nack 
> (negative acknowledgement) back to Flume if the block wasn't written to the 
> WAL. I haven't gone through the full code path yet, but at least from looking 
> at ReceiverSupervisorImpl, it doesn't seem like that return code is being used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies

2018-01-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312172#comment-16312172
 ] 

Apache Spark commented on SPARK-22757:
--

User 'liyinan926' has created a pull request for this issue:
https://github.com/apache/spark/pull/20160

> Init-container in the driver/executor pods for downloading remote dependencies
> --
>
> Key: SPARK-22757
> URL: https://issues.apache.org/jira/browse/SPARK-22757
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
>Assignee: Yinan Li
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22953) Duplicated secret volumes in Spark pods when init-containers are used

2018-01-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312160#comment-16312160
 ] 

Apache Spark commented on SPARK-22953:
--

User 'liyinan926' has created a pull request for this issue:
https://github.com/apache/spark/pull/20159

> Duplicated secret volumes in Spark pods when init-containers are used
> -
>
> Key: SPARK-22953
> URL: https://issues.apache.org/jira/browse/SPARK-22953
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
> Fix For: 2.3.0
>
>
> User-specified secrets are mounted into both the main container and the 
> init-container (when one is used) in a Spark driver/executor pod, using the 
> {{MountSecretsBootstrap}}. Because {{MountSecretsBootstrap}} always adds the 
> secret volumes to the pod, the same secret volumes get added twice: once when 
> mounting the secrets into the main container, and again when mounting them 
> into the init-container. See 
> https://github.com/apache-spark-on-k8s/spark/issues/594.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22948) "SparkPodInitContainer" shouldn't be in "rest" package

2018-01-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22948:
--

Assignee: Marcelo Vanzin

> "SparkPodInitContainer" shouldn't be in "rest" package
> --
>
> Key: SPARK-22948
> URL: https://issues.apache.org/jira/browse/SPARK-22948
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Just noticed while playing with the code that this class is in 
> {{org.apache.spark.deploy.rest.k8s}}; "rest" doesn't make sense here since 
> there's no REST server (and it's the only class in there, too).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22948) "SparkPodInitContainer" shouldn't be in "rest" package

2018-01-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22948.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20156
[https://github.com/apache/spark/pull/20156]

> "SparkPodInitContainer" shouldn't be in "rest" package
> --
>
> Key: SPARK-22948
> URL: https://issues.apache.org/jira/browse/SPARK-22948
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Just noticed while playing with the code that this class is in 
> {{org.apache.spark.deploy.rest.k8s}}; "rest" doesn't make sense here since 
> there's no REST server (and it's the only class in there, too).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22850) Executor page in SHS does not show driver

2018-01-04 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-22850:


Assignee: Marcelo Vanzin

> Executor page in SHS does not show driver
> -
>
> Key: SPARK-22850
> URL: https://issues.apache.org/jira/browse/SPARK-22850
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> This bug is sort of related to SPARK-22836.
> Starting with Spark 2.2 (at least), the event logs generated by Spark do not 
> contain a {{SparkListenerBlockManagerAdded}} entry for the driver. That means 
> when applications are replayed in the SHS, the driver is not listed in the 
> executors page.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22850) Executor page in SHS does not show driver

2018-01-04 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-22850.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20039
[https://github.com/apache/spark/pull/20039]

> Executor page in SHS does not show driver
> -
>
> Key: SPARK-22850
> URL: https://issues.apache.org/jira/browse/SPARK-22850
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> This bug is sort of related to SPARK-22836.
> Starting with Spark 2.2 (at least), the event logs generated by Spark do not 
> contain a {{SparkListenerBlockManagerAdded}} entry for the driver. That means 
> when applications are replayed in the SHS, the driver is not listed in the 
> executors page.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22963) Make failure recovery global and automatic for continuous processing.

2018-01-04 Thread Jose Torres (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Torres updated SPARK-22963:

Summary: Make failure recovery global and automatic for continuous 
processing.  (was: Clean up continuous processing failure recovery)

> Make failure recovery global and automatic for continuous processing.
> -
>
> Key: SPARK-22963
> URL: https://issues.apache.org/jira/browse/SPARK-22963
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>
> Spark native task restarts don't work well for continuous processing. They 
> will process all data from the task's original start offset - even data which 
> has already been committed. This is not semantically incorrect under at least 
> once semantics, but it's awkward and bad.
> Fortunately, they're also not necessary; the central coordinator can restart 
> every task from the checkpointed offsets without losing much. So we should 
> make that happen on task failures instead.
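
A toy Scala illustration of the difference (this is not Spark's continuous-processing 
code; the types are invented for the example):

{code}
case class ContinuousTask(originalStartOffset: Long)

// Native task restart: reprocess from where the task originally started,
// re-reading data that may already have been committed.
def nativeRestartOffset(task: ContinuousTask): Long =
  task.originalStartOffset

// Coordinator-driven restart: resume every task from the last checkpointed offset.
def coordinatedRestartOffset(task: ContinuousTask, checkpointedOffset: Long): Long =
  math.max(task.originalStartOffset, checkpointedOffset)

val task = ContinuousTask(originalStartOffset = 0L)
assert(nativeRestartOffset(task) == 0L)                 // re-reads offsets 0..999
assert(coordinatedRestartOffset(task, 1000L) == 1000L)  // skips already-committed data
{code}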



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22963) Make failure recovery global and automatic for continuous processing.

2018-01-04 Thread Jose Torres (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Torres updated SPARK-22963:

Description: 
Spark native task restarts don't work well for continuous processing. They will 
process all data from the task's original start offset - even data which has 
already been committed. This is not semantically incorrect under at least once 
semantics, but it's awkward and bad.

Fortunately, they're also not necessary; the central coordinator can restart 
every task from the checkpointed offsets without losing much. So we should make 
that happen automatically on task failures instead.

  was:
Spark native task restarts don't work well for continuous processing. They will 
process all data from the task's original start offset - even data which has 
already been committed. This is not semantically incorrect under at least once 
semantics, but it's awkward and bad.

Fortunately, they're also not necessary; the central coordinator can restart 
every task from the checkpointed offsets without losing much. So we should make 
that happen on task failures instead.


> Make failure recovery global and automatic for continuous processing.
> -
>
> Key: SPARK-22963
> URL: https://issues.apache.org/jira/browse/SPARK-22963
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>
> Spark native task restarts don't work well for continuous processing. They 
> will process all data from the task's original start offset - even data which 
> has already been committed. This is not semantically incorrect under at least 
> once semantics, but it's awkward and bad.
> Fortunately, they're also not necessary; the central coordinator can restart 
> every task from the checkpointed offsets without losing much. So we should 
> make that happen automatically on task failures instead.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22963) Clean up continuous processing failure recovery

2018-01-04 Thread Jose Torres (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Torres updated SPARK-22963:

Description: 
Spark native task restarts don't work well for continuous processing. They will 
process all data from the task's original start offset - even data which has 
already been committed. This is not semantically incorrect under at least once 
semantics, but it's awkward and bad.

Fortunately, they're also not necessary; the central coordinator can restart 
every task from the checkpointed offsets without losing much. So we should 
force 

> Clean up continuous processing failure recovery
> ---
>
> Key: SPARK-22963
> URL: https://issues.apache.org/jira/browse/SPARK-22963
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>
> Spark native task restarts don't work well for continuous processing. They 
> will process all data from the task's original start offset - even data which 
> has already been committed. This is not semantically incorrect under at least 
> once semantics, but it's awkward and bad.
> Fortunately, they're also not necessary; the central coordinator can restart 
> every task from the checkpointed offsets without losing much. So we should 
> force 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22963) Clean up continuous processing failure recovery

2018-01-04 Thread Jose Torres (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Torres updated SPARK-22963:

Description: 
Spark native task restarts don't work well for continuous processing. They will 
process all data from the task's original start offset - even data which has 
already been committed. This is not semantically incorrect under at least once 
semantics, but it's awkward and bad.

Fortunately, they're also not necessary; the central coordinator can restart 
every task from the checkpointed offsets without losing much. So we should make 
that happen on task failures instead.

  was:
Spark native task restarts don't work well for continuous processing. They will 
process all data from the task's original start offset - even data which has 
already been committed. This is not semantically incorrect under at least once 
semantics, but it's awkward and bad.

Fortunately, they're also not necessary; the central coordinator can restart 
every task from the checkpointed offsets without losing much. So we should 
force 


> Clean up continuous processing failure recovery
> ---
>
> Key: SPARK-22963
> URL: https://issues.apache.org/jira/browse/SPARK-22963
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>
> Spark native task restarts don't work well for continuous processing. They 
> will process all data from the task's original start offset - even data which 
> has already been committed. This is not semantically incorrect under at least 
> once semantics, but it's awkward and bad.
> Fortunately, they're also not necessary; the central coordinator can restart 
> every task from the checkpointed offsets without losing much. So we should 
> make that happen on task failures instead.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22964) don't allow task restarts for continuous processing

2018-01-04 Thread Jose Torres (JIRA)
Jose Torres created SPARK-22964:
---

 Summary: don't allow task restarts for continuous processing
 Key: SPARK-22964
 URL: https://issues.apache.org/jira/browse/SPARK-22964
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Jose Torres






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22963) Clean up continuous processing failure recovery

2018-01-04 Thread Jose Torres (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Torres updated SPARK-22963:

Issue Type: Improvement  (was: Sub-task)
Parent: (was: SPARK-20928)

> Clean up continuous processing failure recovery
> ---
>
> Key: SPARK-22963
> URL: https://issues.apache.org/jira/browse/SPARK-22963
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22963) Clean up continuous processing failure recovery

2018-01-04 Thread Jose Torres (JIRA)
Jose Torres created SPARK-22963:
---

 Summary: Clean up continuous processing failure recovery
 Key: SPARK-22963
 URL: https://issues.apache.org/jira/browse/SPARK-22963
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Jose Torres






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21760) Structured streaming terminates with Exception

2018-01-04 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312089#comment-16312089
 ] 

Shixiong Zhu commented on SPARK-21760:
--

Could you try 2.2.1? It's probably fixed by this line: 
https://github.com/apache/spark/commit/6edfff055caea81dc3a98a6b4081313a0c0b0729#diff-aaeb546880508bb771df502318c40a99L126

> Structured streaming terminates with Exception 
> ---
>
> Key: SPARK-21760
> URL: https://issues.apache.org/jira/browse/SPARK-21760
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>Priority: Critical
>
> We have seen Structured Streaming stop with the exception below. 
> While analyzing the content, we found that the latest log file has just one 
> line, containing only the version: 
> {quote}hdfs dfs -cat warehouse/latency_internal/_spark_metadata/1683
> v1
> {quote}
> The exception is below: 
> Exception in thread "stream execution thread for latency_internal [id = 
> 39f35d01-60d5-40b4-826e-99e5e38d0077, runId = 
> 95c95a01-bd4f-4604-8aae-c0c5d3e873e8]" java.lang.IllegalStateException: 
> Incomplete log file
> at 
> org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.deserialize(CompactibleFileStreamLog.scala:147)
> at 
> org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.deserialize(CompactibleFileStreamLog.scala:42)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:237)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$getLatest$1.apply$mcVJ$sp(HDFSMetadataLog.scala:266)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$getLatest$1.apply(HDFSMetadataLog.scala:265)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$getLatest$1.apply(HDFSMetadataLog.scala:265)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at 
> scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246)
> at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.getLatest(HDFSMetadataLog.scala:265)
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSource.(FileStreamSource.scala:60)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:256)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$logicalPlan$1.applyOrElse(StreamExecution.scala:127)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$logicalPlan$1.applyOrElse(StreamExecution.scala:123)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
> 

[jira] [Commented] (SPARK-22757) Init-container in the driver/executor pods for downloading remote dependencies

2018-01-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312064#comment-16312064
 ] 

Apache Spark commented on SPARK-22757:
--

User 'liyinan926' has created a pull request for this issue:
https://github.com/apache/spark/pull/20157

> Init-container in the driver/executor pods for downloading remote dependencies
> --
>
> Key: SPARK-22757
> URL: https://issues.apache.org/jira/browse/SPARK-22757
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
>Assignee: Yinan Li
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21475) Change the usage of FileInputStream/OutputStream to Files.newInput/OutputStream in the critical path

2018-01-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-21475.
--
   Resolution: Fixed
Fix Version/s: 2.3.0
   3.0.0

Issue resolved by pull request 20144
[https://github.com/apache/spark/pull/20144]

> Change the usage of FileInputStream/OutputStream to 
> Files.newInput/OutputStream in the critical path
> 
>
> Key: SPARK-21475
> URL: https://issues.apache.org/jira/browse/SPARK-21475
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 3.0.0, 2.3.0
>
>
> Java's {{FileInputStream}} and {{FileOutputStream}} override {{finalize()}}. 
> Even when such a stream is closed correctly and promptly, it still leaves a 
> memory footprint that only gets cleaned up in a full GC. This 
> introduces two side effects:
> 1. Lots of Finalizer-related objects are kept in memory, which increases the 
> memory overhead. In our use case of the external shuffle service, a busy 
> shuffle service will have a bunch of these objects and can potentially 
> lead to an OOM.
> 2. The finalizers are only run during a full GC, which increases the 
> overhead of full GC and leads to long GC pauses.
> So, to fix this potential issue, this proposes using NIO's 
> Files#newInput/OutputStream instead in some critical paths like shuffle.
> https://www.cloudbees.com/blog/fileinputstream-fileoutputstream-considered-harmful
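
A small Scala illustration of the proposed substitution (the path below is a 
placeholder created just so the snippet runs):

{code}
import java.io.InputStream
import java.nio.file.{Files, Paths}

val path = Paths.get("/tmp/spark-21475-example.data")
Files.write(path, Array[Byte](1, 2, 3))  // placeholder file so the example is runnable

// Before: new java.io.FileInputStream(path.toFile) registers a finalizer that is
// only cleaned up during a full GC.
// After: Files.newInputStream returns a stream without the finalize() overhead.
val in: InputStream = Files.newInputStream(path)
try {
  println(in.read())  // 1
} finally {
  in.close()
}
Files.delete(path)
{code}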



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21475) Change to use NIO's Files API for external shuffle service

2018-01-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21475:
-
Summary: Change to use NIO's Files API for external shuffle service  (was: 
Change the usage of FileInputStream/OutputStream to Files.newInput/OutputStream 
in the critical path)

> Change to use NIO's Files API for external shuffle service
> --
>
> Key: SPARK-21475
> URL: https://issues.apache.org/jira/browse/SPARK-21475
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.3.0, 3.0.0
>
>
> Java's {{FileInputStream}} and {{FileOutputStream}} override {{finalize()}}. 
> Even when such a stream is closed correctly and promptly, it still leaves a 
> memory footprint that only gets cleaned up in a full GC. This 
> introduces two side effects:
> 1. Lots of Finalizer-related objects are kept in memory, which increases the 
> memory overhead. In our use case of the external shuffle service, a busy 
> shuffle service will have a bunch of these objects and can potentially 
> lead to an OOM.
> 2. The finalizers are only run during a full GC, which increases the 
> overhead of full GC and leads to long GC pauses.
> So, to fix this potential issue, this proposes using NIO's 
> Files#newInput/OutputStream instead in some critical paths like shuffle.
> https://www.cloudbees.com/blog/fileinputstream-fileoutputstream-considered-harmful



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12344) Remove env-based configurations

2018-01-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-12344:
--

Assignee: (was: Marcelo Vanzin)

> Remove env-based configurations
> ---
>
> Key: SPARK-12344
> URL: https://issues.apache.org/jira/browse/SPARK-12344
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Reporter: Marcelo Vanzin
>
> We should remove as many env-based configurations as makes sense, since 
> they are deprecated and we prefer to use Spark's configuration.
> Tools available through the command line should consistently support both a 
> properties file with configuration keys and the {{--conf}} command line 
> argument, such as the one SparkSubmit supports.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12344) Remove env-based configurations

2018-01-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-12344:
--

Assignee: Marcelo Vanzin

> Remove env-based configurations
> ---
>
> Key: SPARK-12344
> URL: https://issues.apache.org/jira/browse/SPARK-12344
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>
> We should remove as many env-based configurations as makes sense, since 
> they are deprecated and we prefer to use Spark's configuration.
> Tools available through the command line should consistently support both a 
> properties file with configuration keys and the {{--conf}} command line 
> argument, such as the one SparkSubmit supports.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20657) Speed up Stage page

2018-01-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-20657:
---
Target Version/s: 2.3.0

I'm targeting 2.3.0 since the stages page is really slow for large apps without 
this fix. The fix also changes the data saved to the disk store, so adding it in 
a later release would require revving the store version and re-parsing event 
logs; better to avoid that.

The patch has been up for review for a while, but nobody has looked at it yet.

> Speed up Stage page
> ---
>
> Key: SPARK-20657
> URL: https://issues.apache.org/jira/browse/SPARK-20657
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> The Stage page in the UI is very slow when a large number of tasks exist 
> (tens of thousands). The new work being done in SPARK-18085 makes that worse, 
> since it adds potential disk access to the mix.
> A lot of the slowness is because the code loads all the tasks in memory then 
> sorts a really large list, and does a lot of calculations on all the data; 
> both can be avoided with the new app state store by having smarter indices 
> (so data is read from the store sorted in the desired order) and by keeping 
> statistics about metrics pre-calculated (instead of re-doing that on every 
> page access).
> Then only the tasks on the current page (100 items by default) need to 
> actually be loaded. This also saves a lot on memory usage, not just CPU time.
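
A toy Scala sketch of the paging idea described above (the store trait is 
hypothetical; the real implementation would keep the index in the disk store 
rather than in memory):

{code}
case class TaskRow(id: Long, durationMs: Long)

// Assumed capability: the store can hand back tasks already ordered by an index,
// so a page view does not need to load and sort every task.
trait TaskStore {
  def sortedByDuration: Iterator[TaskRow]
}

def page(store: TaskStore, pageNum: Int, pageSize: Int = 100): Seq[TaskRow] =
  store.sortedByDuration.slice(pageNum * pageSize, (pageNum + 1) * pageSize).toSeq

// In-memory stand-in, only so the sketch runs; it sorts eagerly, which is exactly
// what an indexed disk store would avoid.
val demoStore = new TaskStore {
  private val rows = (1L to 50000L).map(i => TaskRow(i, durationMs = i % 977))
  def sortedByDuration: Iterator[TaskRow] = rows.sortBy(_.durationMs).iterator
}

assert(page(demoStore, pageNum = 0).size == 100)
{code}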



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22850) Executor page in SHS does not show driver

2018-01-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-22850:
---
Target Version/s: 2.3.0

> Executor page in SHS does not show driver
> -
>
> Key: SPARK-22850
> URL: https://issues.apache.org/jira/browse/SPARK-22850
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> This bug is sort of related to SPARK-22836.
> Starting with Spark 2.2 (at least), the event logs generated by Spark do not 
> contain a {{SparkListenerBlockManagerAdded}} entry for the driver. That means 
> when applications are replayed in the SHS, the driver is not listed in the 
> executors page.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22948) "SparkPodInitContainer" shouldn't be in "rest" package

2018-01-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311864#comment-16311864
 ] 

Apache Spark commented on SPARK-22948:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20156

> "SparkPodInitContainer" shouldn't be in "rest" package
> --
>
> Key: SPARK-22948
> URL: https://issues.apache.org/jira/browse/SPARK-22948
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Trivial
>
> Just noticed while playing with the code that this class is in 
> {{org.apache.spark.deploy.rest.k8s}}; "rest" doesn't make sense here since 
> there's no REST server (and it's the only class in there, too).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22948) "SparkPodInitContainer" shouldn't be in "rest" package

2018-01-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22948:


Assignee: Apache Spark

> "SparkPodInitContainer" shouldn't be in "rest" package
> --
>
> Key: SPARK-22948
> URL: https://issues.apache.org/jira/browse/SPARK-22948
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Trivial
>
> Just noticed while playing with the code that this class is in 
> {{org.apache.spark.deploy.rest.k8s}}; "rest" doesn't make sense here since 
> there's no REST server (and it's the only class in there, too).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22948) "SparkPodInitContainer" shouldn't be in "rest" package

2018-01-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22948:


Assignee: (was: Apache Spark)

> "SparkPodInitContainer" shouldn't be in "rest" package
> --
>
> Key: SPARK-22948
> URL: https://issues.apache.org/jira/browse/SPARK-22948
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Trivial
>
> Just noticed while playing with the code that this class is in 
> {{org.apache.spark.deploy.rest.k8s}}; "rest" doesn't make sense here since 
> there's no REST server (and it's the only class in there, too).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-01-04 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311859#comment-16311859
 ] 

Li Jin commented on SPARK-22947:


I have implemented a simple version of "as-of join" using existing Spark SQL 
syntax:

{code:java}
from pyspark.sql.functions import monotonically_increasing_id, col, max

df1 = spark.range(0, 1000, 2).toDF('time').withColumn('v1', col('time'))
df2 = spark.range(1, 1001, 2).toDF('time').withColumn('v2', col('time'))

tolerance = 100
{code}

{code:java}
df2 = df2.withColumn("rowId", monotonically_increasing_id()) 
df3 = df2.join(df1, (df2.time >= df1.time) & (df1.time >= df2.time - 
tolerance), 'leftOuter')
max_time = df3.groupby('rowId').agg(max(df1['time'])).sort('rowId')
df4 = df3.join(max_time, "rowId", 'leftOuter').filter((df1['time'] == 
col('max(time)')) | (df1['time'].isNull()))
df5 = df4.select(df2['time'], df1['time'], df1['v1'], 
df2['v2']).sort(df2['time'])

df5.count()
{code}

This works and gives me the correct result. [~rxin] I agree with you that we 
should separate the logical plan from the physical execution. However, I don't 
see how it is feasible for the query planner to understand this logical plan 
and turn it into a "range partition + merge join" physical plan... 

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analysis on financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from data source, i.e, begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis, it has 
> begin time (inclusive) and end time (exclusive). User should be able to 
> change the time scope of the analysis (i.e, from one month to five year) by 
> just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that:
> * can be used to repartition data for joins
> * serves as a predicate that can be pushed down to the storage layer
> Time context is similar to filtering time by begin/end; the main difference 
> is that a time context can be expanded based on the operation taken (see the 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. User Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is jo

[jira] [Updated] (SPARK-22962) Kubernetes app fails if local files are used

2018-01-04 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-22962:
---
Issue Type: Bug  (was: Improvement)

> Kubernetes app fails if local files are used
> 
>
> Key: SPARK-22962
> URL: https://issues.apache.org/jira/browse/SPARK-22962
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> If you try to start a Spark app on kubernetes using a local file as the app 
> resource, for example, it will fail:
> {code}
> ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar
> {code}
> {noformat}
> + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && 
> env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'
> \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < 
> /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x}
>  ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&   
>   if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP
> ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if 
> ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK
> _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java 
> "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR
> Y -Xmx$SPARK_DRIVER_MEMORY 
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
> $SPARK_DRIVER_ARGS'
> Error: Could not find or load main class com.cloudera.spark.tests.Sleeper
> {noformat}
> Using an http server to provide the app jar solves the problem.
> The k8s backend should either somehow make these files available to the 
> cluster or error out with a more user-friendly message if that feature is not 
> yet available.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22962) Kubernetes app fails if local files are used

2018-01-04 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-22962:
--

 Summary: Kubernetes app fails if local files are used
 Key: SPARK-22962
 URL: https://issues.apache.org/jira/browse/SPARK-22962
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin


If you try to start a Spark app on kubernetes using a local file as the app 
resource, for example, it will fail:

{code}
./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar
{code}

{noformat}
+ /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && 
env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'
\'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < 
/tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x}
 ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && 
if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP
ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if ! 
[ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK
_MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java 
"${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR
Y -Xmx$SPARK_DRIVER_MEMORY 
-Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS 
$SPARK_DRIVER_ARGS'
Error: Could not find or load main class com.cloudera.spark.tests.Sleeper
{noformat}

Using an http server to provide the app jar solves the problem.

The k8s backend should either somehow make these files available to the cluster 
or error out with a more user-friendly message if that feature is not yet 
available.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22961) Constant columns no longer picked as constraints in 2.3

2018-01-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311771#comment-16311771
 ] 

Apache Spark commented on SPARK-22961:
--

User 'adrian-ionescu' has created a pull request for this issue:
https://github.com/apache/spark/pull/20155

> Constant columns no longer picked as constraints in 2.3
> ---
>
> Key: SPARK-22961
> URL: https://issues.apache.org/jira/browse/SPARK-22961
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 3.0.0
>Reporter: Adrian Ionescu
>  Labels: constraints, optimizer, regression
>
> We're no longer picking up {{x = 2}} as a constraint from something like 
> {{df.withColumn("x", lit(2))}}
> The unit test below succeeds in {{branch-2.2}}:
> {code}
> test("constraints should be inferred from aliased literals") {
> val originalLeft = testRelation.subquery('left).as("left")
> val optimizedLeft = testRelation.subquery('left).where(IsNotNull('a) && 
> 'a <=> 2).as("left")
> val right = Project(Seq(Literal(2).as("two")), 
> testRelation.subquery('right)).as("right")
> val condition = Some("left.a".attr === "right.two".attr)
> val original = originalLeft.join(right, Inner, condition)
> val correct = optimizedLeft.join(right, Inner, condition)
> comparePlans(Optimize.execute(original.analyze), correct.analyze)
>   }
> {code}
> but fails in {{branch-2.3}} with:
> {code}
> == FAIL: Plans do not match ===
>  'Join Inner, (two#0 = a#0) 'Join Inner, (two#0 = a#0)
> !:- Filter isnotnull(a#0)   :- Filter ((2 <=> a#0) && 
> isnotnull(a#0))
>  :  +- LocalRelation , [a#0, b#0, c#0]   :  +- LocalRelation , 
> [a#0, b#0, c#0]
>  +- Project [2 AS two#0]+- Project [2 AS two#0]
> +- LocalRelation , [a#0, b#0, c#0]  +- LocalRelation , 
> [a#0, b#0, c#0] 
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22961) Constant columns no longer picked as constraints in 2.3

2018-01-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22961:


Assignee: (was: Apache Spark)

> Constant columns no longer picked as constraints in 2.3
> ---
>
> Key: SPARK-22961
> URL: https://issues.apache.org/jira/browse/SPARK-22961
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 3.0.0
>Reporter: Adrian Ionescu
>  Labels: constraints, optimizer, regression
>
> We're no longer picking up {{x = 2}} as a constraint from something like 
> {{df.withColumn("x", lit(2))}}
> The unit test below succeeds in {{branch-2.2}}:
> {code}
> test("constraints should be inferred from aliased literals") {
> val originalLeft = testRelation.subquery('left).as("left")
> val optimizedLeft = testRelation.subquery('left).where(IsNotNull('a) && 
> 'a <=> 2).as("left")
> val right = Project(Seq(Literal(2).as("two")), 
> testRelation.subquery('right)).as("right")
> val condition = Some("left.a".attr === "right.two".attr)
> val original = originalLeft.join(right, Inner, condition)
> val correct = optimizedLeft.join(right, Inner, condition)
> comparePlans(Optimize.execute(original.analyze), correct.analyze)
>   }
> {code}
> but fails in {{branch-2.3}} with:
> {code}
> == FAIL: Plans do not match ===
>  'Join Inner, (two#0 = a#0) 'Join Inner, (two#0 = a#0)
> !:- Filter isnotnull(a#0)   :- Filter ((2 <=> a#0) && 
> isnotnull(a#0))
>  :  +- LocalRelation , [a#0, b#0, c#0]   :  +- LocalRelation , 
> [a#0, b#0, c#0]
>  +- Project [2 AS two#0]+- Project [2 AS two#0]
> +- LocalRelation , [a#0, b#0, c#0]  +- LocalRelation , 
> [a#0, b#0, c#0] 
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22961) Constant columns no longer picked as constraints in 2.3

2018-01-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22961:


Assignee: Apache Spark

> Constant columns no longer picked as constraints in 2.3
> ---
>
> Key: SPARK-22961
> URL: https://issues.apache.org/jira/browse/SPARK-22961
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 3.0.0
>Reporter: Adrian Ionescu
>Assignee: Apache Spark
>  Labels: constraints, optimizer, regression
>
> We're no longer picking up {{x = 2}} as a constraint from something like 
> {{df.withColumn("x", lit(2))}}
> The unit test below succeeds in {{branch-2.2}}:
> {code}
> test("constraints should be inferred from aliased literals") {
> val originalLeft = testRelation.subquery('left).as("left")
> val optimizedLeft = testRelation.subquery('left).where(IsNotNull('a) && 
> 'a <=> 2).as("left")
> val right = Project(Seq(Literal(2).as("two")), 
> testRelation.subquery('right)).as("right")
> val condition = Some("left.a".attr === "right.two".attr)
> val original = originalLeft.join(right, Inner, condition)
> val correct = optimizedLeft.join(right, Inner, condition)
> comparePlans(Optimize.execute(original.analyze), correct.analyze)
>   }
> {code}
> but fails in {{branch-2.3}} with:
> {code}
> == FAIL: Plans do not match ===
>  'Join Inner, (two#0 = a#0) 'Join Inner, (two#0 = a#0)
> !:- Filter isnotnull(a#0)   :- Filter ((2 <=> a#0) && 
> isnotnull(a#0))
>  :  +- LocalRelation , [a#0, b#0, c#0]   :  +- LocalRelation , 
> [a#0, b#0, c#0]
>  +- Project [2 AS two#0]+- Project [2 AS two#0]
> +- LocalRelation , [a#0, b#0, c#0]  +- LocalRelation , 
> [a#0, b#0, c#0] 
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22961) Constant columns no longer picked as constraints in 2.3

2018-01-04 Thread Adrian Ionescu (JIRA)
Adrian Ionescu created SPARK-22961:
--

 Summary: Constant columns no longer picked as constraints in 2.3
 Key: SPARK-22961
 URL: https://issues.apache.org/jira/browse/SPARK-22961
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0, 3.0.0
Reporter: Adrian Ionescu


We're no longer picking up {{x = 2}} as a constraint from something like 
{{df.withColumn("x", lit(2))}}

The unit test below succeeds in {{branch-2.2}}:
{code}
test("constraints should be inferred from aliased literals") {
val originalLeft = testRelation.subquery('left).as("left")
val optimizedLeft = testRelation.subquery('left).where(IsNotNull('a) && 'a 
<=> 2).as("left")

val right = Project(Seq(Literal(2).as("two")), 
testRelation.subquery('right)).as("right")
val condition = Some("left.a".attr === "right.two".attr)

val original = originalLeft.join(right, Inner, condition)
val correct = optimizedLeft.join(right, Inner, condition)

comparePlans(Optimize.execute(original.analyze), correct.analyze)
  }
{code}
but fails in {{branch-2.3}} with:
{code}
== FAIL: Plans do not match ===
 'Join Inner, (two#0 = a#0) 'Join Inner, (two#0 = a#0)
!:- Filter isnotnull(a#0)   :- Filter ((2 <=> a#0) && 
isnotnull(a#0))
 :  +- LocalRelation , [a#0, b#0, c#0]   :  +- LocalRelation , 
[a#0, b#0, c#0]
 +- Project [2 AS two#0]+- Project [2 AS two#0]
+- LocalRelation , [a#0, b#0, c#0]  +- LocalRelation , 
[a#0, b#0, c#0] 
{code}
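For context, here is a minimal sketch of the user-facing scenario (it assumes a SparkSession named {{spark}}; the table and column names are illustrative, not taken from the test above):
{code}
import org.apache.spark.sql.functions.lit

// A left table with an integer column "a" and a right side that is just a
// constant column produced via lit(2).
val left = spark.range(10).toDF("a")
val right = spark.range(1).select(lit(2).as("two"))

val joined = left.join(right, left("a") === right("two"))
// In branch-2.2, constraint propagation infers `a <=> 2` (plus isnotnull(a)) on
// the left side from the constant column on the right; in branch-2.3 only
// isnotnull(a) is inferred, which is the regression described here.
{code}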



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22960) Make build-push-docker-images.sh more dev-friendly

2018-01-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22960:


Assignee: Apache Spark

> Make build-push-docker-images.sh more dev-friendly
> --
>
> Key: SPARK-22960
> URL: https://issues.apache.org/jira/browse/SPARK-22960
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> It's currently kinda hard to test the Kubernetes backend. To use 
> {{build-push-docker-images.sh}} you have to create a Spark dist archive, which 
> slows things down a lot if you're doing iterative development.
> The script could be enhanced to make it easier for developers to test their 
> changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22960) Make build-push-docker-images.sh more dev-friendly

2018-01-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311746#comment-16311746
 ] 

Apache Spark commented on SPARK-22960:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20154

> Make build-push-docker-images.sh more dev-friendly
> --
>
> Key: SPARK-22960
> URL: https://issues.apache.org/jira/browse/SPARK-22960
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> It's currently kinda hard to test the Kubernetes backend. To use 
> {{build-push-docker-images.sh}} you have to create a Spark dist archive, which 
> slows things down a lot if you're doing iterative development.
> The script could be enhanced to make it easier for developers to test their 
> changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22960) Make build-push-docker-images.sh more dev-friendly

2018-01-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22960:


Assignee: (was: Apache Spark)

> Make build-push-docker-images.sh more dev-friendly
> --
>
> Key: SPARK-22960
> URL: https://issues.apache.org/jira/browse/SPARK-22960
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> It's currently kinda hard to test the Kubernetes backend. To use 
> {{build-push-docker-images.sh}} you have to create a Spark dist archive, which 
> slows things down a lot if you're doing iterative development.
> The script could be enhanced to make it easier for developers to test their 
> changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22960) Make build-push-docker-images.sh more dev-friendly

2018-01-04 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-22960:
--

 Summary: Make build-push-docker-images.sh more dev-friendly
 Key: SPARK-22960
 URL: https://issues.apache.org/jira/browse/SPARK-22960
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin
Priority: Minor


It's currently kinda hard to test the Kubernetes backend. To use 
{{build-push-docker-images.sh}} you have to create a Spark dist archive, which 
slows things down a lot if you're doing iterative development.

The script could be enhanced to make it easier for developers to test their 
changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7721) Generate test coverage report from Python

2018-01-04 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311731#comment-16311731
 ] 

Reynold Xin commented on SPARK-7721:


We can add it first, but in my experience this will only be used when it is 
automatic :)


> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>
> It would be great to have a test coverage report for Python. Compared with Scala, 
> it is trickier to understand the coverage without coverage reports in Python 
> because we employ both docstring tests and unit tests in test files. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-01-04 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311675#comment-16311675
 ] 

Li Jin commented on SPARK-22947:


There could be multiple matches within the range; in that case, we match only 
the closest row.

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analyses of financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from the data source, i.e., begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis; it has a 
> begin time (inclusive) and an end time (exclusive). The user should be able to 
> change the time scope of the analysis (e.g., from one month to five years) by 
> just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that:
> * can be used to repartition data for the join
> * can serve as a predicate that can be pushed down to the storage layer
> Time context is similar to filtering time by begin/end, the main difference 
> is that time context can be expanded based on the operation taken (see 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, id, quantity
> 20160101, 1, 100
> 20160101, 2, 50
> 20160102, 1, -50
> 20160102, 2, 50
> dfB:
> time, id, price
> 20151231, 1, 100.0
> 20150102, 1, 105.0
> 20150102, 2, 195.0
> Output:
> time, id, quantity, price
> 20160101, 1, 100, 100.0
> 20160101, 2, 50, null
> 20160102, 1, -50, 105.0
> 20160102, 2, 50, 195.0
> {code}
> h2. Optional Design Sketch
> h3. Implementation A
> (This is just ini

[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-01-04 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311673#comment-16311673
 ] 

Reynold Xin commented on SPARK-22947:
-

Basically we should separate the logical plan from the physical execution. 
Sometimes the optimizer won't have enough information to come up with the best 
execution plan, and we can give the optimizer hints. It looks to me like in this 
case the logical plan doesn't really need any extra language constructs; it can 
be expressed entirely using a normal inner join with range predicates. The 
physical execution, on the other hand, should be aware of the specific 
properties of the join (i.e., only one matching tuple).
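As a rough sketch of that point (names such as {{dfA}}, {{dfB}}, their {{time}} columns and the tolerance are assumptions, not part of any proposed API):
{code}
// The logical side expressed as a plain join with range predicates; picking
// the single closest match is then a concern for the physical execution.
val tolerance = 86400L  // e.g. one day, if `time` is a unix timestamp in seconds
val candidates = dfA.join(dfB,
  dfB("time") <= dfA("time") && dfB("time") > dfA("time") - tolerance)
{code}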
 

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analyses of financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from the data source, i.e., begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis; it has a 
> begin time (inclusive) and an end time (exclusive). The user should be able to 
> change the time scope of the analysis (e.g., from one month to five years) by 
> just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that:
> * can be used to repartition data for the join
> * can serve as a predicate that can be pushed down to the storage layer
> Time context is similar to filtering time by begin/end, the main difference 
> is that time context can be expanded based on the operation taken (see 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
>

[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-01-04 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311667#comment-16311667
 ] 

Reynold Xin commented on SPARK-22947:
-

So this is just a hint that there is only one matching tuple, right? 

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analyses of financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from the data source, i.e., begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis; it has a 
> begin time (inclusive) and an end time (exclusive). The user should be able to 
> change the time scope of the analysis (e.g., from one month to five years) by 
> just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that:
> * can be used to repartition data for the join
> * can serve as a predicate that can be pushed down to the storage layer
> Time context is similar to filtering time by begin/end, the main difference 
> is that time context can be expanded based on the operation taken (see 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is an important illustration of the time context - it is able to expand the 
> context to 20151231 on dfB because of the 1 day lookback.
> h4. Use Case B (join with key)
> To join on time and another key (for instance, id), we use “by” to specify 
> the key.
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = 
> JoinSpec(timeContext).on("time").by("id").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, id, quantity
> 20160101, 1, 100
> 20160101, 2, 50
> 20160102, 1, -50
> 20160102, 2, 50
> dfB:
> time, id, price
> 20151231, 1, 100.0
> 20150102, 1, 105.0
> 20150102, 2, 195.0
> Output:
> time, id, quantity, price
> 20160101, 1, 100, 100.0
> 20160101, 2, 50, null
> 20160102, 1, -50, 105.0
> 20160102, 2, 50, 195.0
> {code}
> h2. Optional Design Sketch
> h3. Implementation A
> (This is just initial thought of how to implemen

[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL

2018-01-04 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311566#comment-16311566
 ] 

Li Jin commented on SPARK-22947:


Thanks for the feedback.

[~rxin], as-of join is different from range join: as-of join joins each left 
row with 0 or 1 right rows (the closest row within the range), whereas range join 
joins each left row with 0 to n right rows. To [~marmbrus]'s point, as-of join 
can be implemented as sugar on top of range join, maybe as a range join 
followed by a groupby to filter out the right rows that are not the closest. 
However, implementing as-of join using this approach is fairly expensive: a range 
query like 'dfA.time > dfB.time & dfB.time > dfA.time - tolerance' results in a 
CartesianProduct, which requires much more memory and CPU (I think O(N*M)), plus 
there is the cost of the groupby to filter down the joined result. An as-of join 
can be done much faster (basically a range partition followed by a merge join), 
which should be something like O(M+N) if the input data is already sorted. 

Regarding SPARK-8682, it seems to target the case where one of the tables can be 
broadcast. I am not sure it helps the general case (both tables are large) ...
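For illustration, here is a rough DataFrame-level sketch of the "range join + keep only the closest right row" formulation discussed above ({{dfA}}, {{dfB}}, their {{time}} columns and the tolerance are assumed, and this is exactly the expensive path, since the range join below can still degenerate into a cartesian product):
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Range join: all dfB rows within the one-day lookback window of each dfA row.
val tolerance = 86400L  // one day, assuming `time` is a unix timestamp in seconds
val candidates = dfA.join(dfB,
  dfB("time") <= dfA("time") && dfB("time") > dfA("time") - tolerance,
  "left_outer")

// Keep only the closest (latest) dfB row per dfA row, i.e. the groupby/filter
// step described above, done here with a window function. This assumes `time`
// uniquely identifies a dfA row; otherwise partition by a proper row key.
val w = Window.partitionBy(dfA("time")).orderBy(dfB("time").desc)
val asOfJoined = candidates
  .withColumn("rank", row_number().over(w))
  .where(col("rank") === 1)
  .drop("rank")
{code}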

> SPIP: as-of join in Spark SQL
> -
>
> Key: SPARK-22947
> URL: https://issues.apache.org/jira/browse/SPARK-22947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Li Jin
> Attachments: SPIP_ as-of join in Spark SQL (1).pdf
>
>
> h2. Background and Motivation
> Time series analysis is one of the most common analyses of financial data. In 
> time series analysis, as-of join is a very common operation. Supporting as-of 
> join in Spark SQL will allow many use cases of using Spark SQL for time 
> series analysis.
> As-of join is “join on time” with inexact time matching criteria. Various 
> libraries have implemented as-of join or similar functionality:
> Kdb: https://code.kx.com/wiki/Reference/aj
> Pandas: 
> http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof
> R: This functionality is called “Last Observation Carried Forward”
> https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf
> JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin
> Flint: https://github.com/twosigma/flint#temporal-join-functions
> This proposal advocates introducing new API in Spark SQL to support as-of 
> join.
> h2. Target Personas
> Data scientists, data engineers
> h2. Goals
> * New API in Spark SQL that allows as-of join
> * As-of join of multiple tables (>2) should be performant, because it’s very 
> common that users need to join multiple data sources together for further 
> analysis.
> * Define Distribution, Partitioning and shuffle strategy for ordered time 
> series data
> h2. Non-Goals
> These are out of scope for the existing SPIP, should be considered in future 
> SPIP as improvement to Spark’s time series analysis ability:
> * Utilize partition information from the data source, i.e., begin/end of each 
> partition to reduce sorting/shuffling
> * Define API for user to implement asof join time spec in business calendar 
> (i.e. lookback one business day, this is very common in financial data 
> analysis because of market calendars)
> * Support broadcast join
> h2. Proposed API Changes
> h3. TimeContext
> TimeContext is an object that defines the time scope of the analysis; it has a 
> begin time (inclusive) and an end time (exclusive). The user should be able to 
> change the time scope of the analysis (e.g., from one month to five years) by 
> just changing the TimeContext. 
> To the Spark engine, TimeContext is a hint that:
> * can be used to repartition data for the join
> * can serve as a predicate that can be pushed down to the storage layer
> Time context is similar to filtering time by begin/end, the main difference 
> is that time context can be expanded based on the operation taken (see 
> example in as-of join).
> Time context example:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> {code}
> h3. asofJoin
> h4. Use Case A (join without key)
> Join two DataFrames on time, with one day lookback:
> {code:java}
> TimeContext timeContext = TimeContext("20160101", "20170101")
> dfA = ...
> dfB = ...
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day")
> result = dfA.asofJoin(dfB, joinSpec)
> {code}
> Example input/output:
> {code:java}
> dfA:
> time, quantity
> 20160101, 100
> 20160102, 50
> 20160104, -50
> 20160105, 100
> dfB:
> time, price
> 20151231, 100.0
> 20160104, 105.0
> 20160105, 102.0
> output:
> time, quantity, price
> 20160101, 100, 100.0
> 20160102, 50, null
> 20160104, -50, 105.0
> 20160105, 100, 102.0
> {code}
> Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This 
> is 

[jira] [Assigned] (SPARK-22392) columnar reader interface

2018-01-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22392:


Assignee: (was: Apache Spark)

> columnar reader interface 
> --
>
> Key: SPARK-22392
> URL: https://issues.apache.org/jira/browse/SPARK-22392
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22392) columnar reader interface

2018-01-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311487#comment-16311487
 ] 

Apache Spark commented on SPARK-22392:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20153

> columnar reader interface 
> --
>
> Key: SPARK-22392
> URL: https://issues.apache.org/jira/browse/SPARK-22392
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22392) columnar reader interface

2018-01-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22392:


Assignee: Apache Spark

> columnar reader interface 
> --
>
> Key: SPARK-22392
> URL: https://issues.apache.org/jira/browse/SPARK-22392
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20436) NullPointerException when restart from checkpoint file

2018-01-04 Thread Riccardo Vincelli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311402#comment-16311402
 ] 

Riccardo Vincelli commented on SPARK-20436:
---

I have tested this myself and it seems that this is the case, thanks [~ffbin] 
and [~zsxwing]!

Spark team, this basically means that there are issues around the serialization 
of unused streams, or something like that, right? It is a shame that it took me 
a while, like a couple of days, to figure this out: it should be documented in 
the checkpointing section of the docs, because this is really unexpected.

By the way, I do not think that this can be closed as "not a problem", again 
because it is rather surprising behavior, but it is perhaps easy to fix.

Thanks,

> NullPointerException when restart from checkpoint file
> --
>
> Key: SPARK-20436
> URL: https://issues.apache.org/jira/browse/SPARK-20436
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 1.5.0
>Reporter: fangfengbin
>
> I have written a Spark Streaming application which has two DStreams.
> Code is :
> {code}
> object KafkaTwoInkfk {
>   def main(args: Array[String]) {
> val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args
> val ssc = StreamingContext.getOrCreate(checkPointDir, () => 
> createContext(args))
> ssc.start()
> ssc.awaitTermination()
>   }
>   def createContext(args : Array[String]) : StreamingContext = {
> val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args
> val sparkConf = new SparkConf().setAppName("KafkaWordCount")
> val ssc = new StreamingContext(sparkConf, Seconds(batchSize.toLong))
> ssc.checkpoint(checkPointDir)
> val topicArr1 = topic1.split(",")
> val topicSet1 = topicArr1.toSet
> val topicArr2 = topic2.split(",")
> val topicSet2 = topicArr2.toSet
> val kafkaParams = Map[String, String](
>   "metadata.broker.list" -> brokers
> )
> val lines1 = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topicSet1)
> val words1 = lines1.map(_._2).flatMap(_.split(" "))
> val wordCounts1 = words1.map(x => {
>   (x, 1L)}).reduceByKey(_ + _)
> wordCounts1.print()
> val lines2 = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topicSet2)
> val words2 = lines1.map(_._2).flatMap(_.split(" "))
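> // note: words2 is derived from lines1 (not lines2), so lines2 is created but
> // never used downstream, which matches the unused-stream issue discussed above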
> val wordCounts2 = words2.map(x => {
>   (x, 1L)}).reduceByKey(_ + _)
> wordCounts2.print()
> return ssc
>   }
> }
> {code}
> When restarting from the checkpoint file, it throws a NullPointerException:
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply$mcV$sp(DStreamCheckpointData.scala:126)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.writeObject(DStreamCheckpointData.scala:124)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:528)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291)
>   at 
> org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:523)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAcc

[jira] [Commented] (SPARK-22955) Error generating jobs when Stopping JobGenerator gracefully

2018-01-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311399#comment-16311399
 ] 

Sean Owen commented on SPARK-22955:
---

[~tdas] should the order of these two stops be reversed? The comment on 
stopping the event loop also suggests it needs to happen 'first'.

> Error generating jobs when Stopping JobGenerator gracefully
> ---
>
> Key: SPARK-22955
> URL: https://issues.apache.org/jira/browse/SPARK-22955
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: zhaoshijie
>
> when I stop a spark-streaming application with the parameter 
>  spark.streaming.stopGracefullyOnShutdown, I get an ERROR as follows:
> {code:java}
> 2018-01-04 17:31:17,524 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: 
> RECEIVED SIGNAL TERM
> 2018-01-04 17:31:17,527 INFO org.apache.spark.streaming.StreamingContext: 
> Invoking stop(stopGracefully=true) from shutdown hook
> 2018-01-04 17:31:17,530 INFO 
> org.apache.spark.streaming.scheduler.ReceiverTracker: ReceiverTracker stopped
> 2018-01-04 17:31:17,531 INFO 
> org.apache.spark.streaming.scheduler.JobGenerator: Stopping JobGenerator 
> gracefully
> 2018-01-04 17:31:17,532 INFO 
> org.apache.spark.streaming.scheduler.JobGenerator: Waiting for all received 
> blocks to be consumed for job generation
> 2018-01-04 17:31:17,533 INFO 
> org.apache.spark.streaming.scheduler.JobGenerator: Waited for all received 
> blocks to be consumed for job generation
> 2018-01-04 17:31:17,747 INFO 
> org.apache.spark.streaming.scheduler.JobScheduler: Added jobs for time 
> 1515058267000 ms
> 2018-01-04 17:31:18,302 INFO 
> org.apache.spark.streaming.scheduler.JobScheduler: Added jobs for time 
> 1515058268000 ms
> 2018-01-04 17:31:18,785 INFO 
> org.apache.spark.streaming.scheduler.JobScheduler: Added jobs for time 
> 1515058269000 ms
> 2018-01-04 17:31:19,001 INFO org.apache.spark.streaming.util.RecurringTimer: 
> Stopped timer for JobGenerator after time 1515058279000
> 2018-01-04 17:31:19,200 INFO 
> org.apache.spark.streaming.scheduler.JobScheduler: Added jobs for time 
> 151505827 ms
> 2018-01-04 17:31:19,207 INFO 
> org.apache.spark.streaming.scheduler.JobGenerator: Stopped generation timer
> 2018-01-04 17:31:19,207 INFO 
> org.apache.spark.streaming.scheduler.JobGenerator: Waiting for jobs to be 
> processed and checkpoints to be written
> 2018-01-04 17:31:19,210 ERROR 
> org.apache.spark.streaming.scheduler.JobScheduler: Error generating jobs for 
> time 1515058271000 ms
> java.lang.IllegalStateException: This consumer has already been closed.
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.ensureNotClosed(KafkaConsumer.java:1417)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1428)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.paranoidPoll(DirectKafkaInputDStream.scala:161)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:180)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:208)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collecti
