[jira] [Updated] (SPARK-47085) Performance issue in the Thrift API
[ https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-47085:

Description:
This new complexity was introduced in SPARK-39041. Before that PR the code was:
{code:scala}
while (curRow < maxRows && iter.hasNext) {
  val sparkRow = iter.next()
  val row = ArrayBuffer[Any]()
  var curCol = 0
  while (curCol < sparkRow.length) {
    if (sparkRow.isNullAt(curCol)) {
      row += null
    } else {
      addNonNullColumnValue(sparkRow, row, curCol, timeFormatters)
    }
    curCol += 1
  }
  resultRowSet.addRow(row.toArray.asInstanceOf[Array[Object]])
  curRow += 1
}
{code}
i.e. a plain iterator loop without the _*O(n^2)*_ complexity, so this change simply restores the previous behavior.

In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
{code:scala}
...
while (i < rowSize) {
  val row = rows(i)
...
{code}
It can easily be converted back to _*O(n)*_ complexity.

was:
This new complexity was introduced in SPARK-39041. Before that PR the code was:
{code:scala}
def toTTableSchema(schema: StructType): TTableSchema = {
  val tTableSchema = new TTableSchema()
  schema.zipWithIndex.foreach { case (f, i) =>
    tTableSchema.addToColumns(toTColumnDesc(f, i))
  }
  tTableSchema
}
{code}
i.e. a plain foreach without the _*O(n^2)*_ complexity, so this change simply restores the previous behavior.

In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
{code:scala}
...
while (i < rowSize) {
  val row = rows(i)
...
{code}
It can easily be converted back to _*O(n)*_ complexity.
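To make the cost concrete: when `rows` is a linear `Seq` such as `List`, `rows(i)` walks `i` elements, so the indexed loop is quadratic overall. A minimal sketch of both shapes (illustrative only, not the actual `RowSetUtils` code; `process` is a stand-in for the per-row conversion):
{code:scala}
object ComplexityDemo {
  def process(row: String): Unit = () // stand-in for the per-row conversion

  // Indexed access: rows(i) is O(i) on a linear Seq such as List,
  // so the whole loop is O(n^2).
  def quadratic(rows: Seq[String]): Unit = {
    var i = 0
    val rowSize = rows.length
    while (i < rowSize) {
      val row = rows(i)
      process(row)
      i += 1
    }
  }

  // Iterating directly visits each element exactly once: O(n) overall.
  def linear(rows: Seq[String]): Unit = {
    val iter = rows.iterator
    while (iter.hasNext) {
      process(iter.next())
    }
  }
}
{code}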
[jira] [Commented] (SPARK-47085) Performance issue in the Thrift API
[ https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818932#comment-17818932 ] Izek Greenfield commented on SPARK-47085:

[~dongjoon] I updated the details.
[jira] [Updated] (SPARK-47085) Performance issue in the Thrift API
[ https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-47085:

Description:
This new complexity was introduced in SPARK-39041. Before that PR the code was:
{code:scala}
def toTTableSchema(schema: StructType): TTableSchema = {
  val tTableSchema = new TTableSchema()
  schema.zipWithIndex.foreach { case (f, i) =>
    tTableSchema.addToColumns(toTColumnDesc(f, i))
  }
  tTableSchema
}
{code}
i.e. a plain foreach without the _*O(n^2)*_ complexity, so this change simply restores the previous behavior.

In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
{code:scala}
...
while (i < rowSize) {
  val row = rows(i)
...
{code}
It can easily be converted back to _*O(n)*_ complexity.

was:
This new complexity was introduced in SPARK-39041.

In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
{code:scala}
...
while (i < rowSize) {
  val row = rows(i)
...
{code}
It can easily be converted back to _*O(n)*_ complexity.
[jira] [Updated] (SPARK-47085) Performance issue in the Thrift API
[ https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-47085:

Description:
This new complexity was introduced in SPARK-39041.

In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
{code:scala}
...
while (i < rowSize) {
  val row = rows(i)
...
{code}
It can easily be converted back to _*O(n)*_ complexity.

was:
This new complexity was introduced in SPARK-39041.

In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
{code:scala}
...
while (i < rowSize) {
  val row = rows(i)
...
{code}
It can easily be converted back to _*O(n)*_ complexity.
[jira] [Updated] (SPARK-47085) Performance issue in the Thrift API
[ https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-47085:

Description:
This new complexity was introduced in SPARK-39041.

In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
{code:scala}
...
while (i < rowSize) {
  val row = rows(i)
...
{code}
It can easily be converted back to _*O(n)*_ complexity.

was:
In class `RowSetUtils` there is a loop that has O(n^2) complexity:
{code:scala}
...
while (i < rowSize) {
  val row = rows(i)
...
{code}
It can easily be converted to O(n) complexity.
[jira] [Updated] (SPARK-47085) Performance issue in the Thrift API
[ https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-47085:

Issue Type: Bug (was: Improvement)
[jira] [Commented] (SPARK-47085) Performance issue in the Thrift API
[ https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818256#comment-17818256 ] Izek Greenfield commented on SPARK-47085:

https://github.com/apache/spark/pull/45155
[jira] [Created] (SPARK-47085) Performance issue in the Thrift API
Izek Greenfield created SPARK-47085:
---

Summary: Performance issue in the Thrift API
Key: SPARK-47085
URL: https://issues.apache.org/jira/browse/SPARK-47085
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.5.0, 3.4.1
Reporter: Izek Greenfield

In class `RowSetUtils` there is a loop that has O(n^2) complexity:
{code:scala}
...
while (i < rowSize) {
  val row = rows(i)
...
{code}
It can easily be converted to O(n) complexity.
[jira] [Created] (SPARK-46891) Allow configuring the statistics logical plan visitor
Izek Greenfield created SPARK-46891:
---

Summary: Allow configuring the statistics logical plan visitor
Key: SPARK-46891
URL: https://issues.apache.org/jira/browse/SPARK-46891
Project: Spark
Issue Type: Improvement
Components: Optimizer
Affects Versions: 3.5.0
Reporter: Izek Greenfield

Allow a configuration setting for choosing the statistics plan visitor.
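For context: today the visitor is chosen internally (SizeInBytesOnlyStatsPlanVisitor, or BasicStatsPlanVisitor when spark.sql.cbo.enabled is true); the ticket asks to make that choice pluggable. The configuration key and class name below are hypothetical, only sketching what such a setting might look like:
{code:scala}
// Hypothetical key and class name -- no such configuration exists today.
spark.conf.set(
  "spark.sql.statistics.planVisitor.class",
  "com.example.MyStatsPlanVisitor")
{code}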
[jira] [Created] (SPARK-46137) Janino compiler has a new version that fixes a compilation issue
Izek Greenfield created SPARK-46137:
---

Summary: Janino compiler has a new version that fixes a compilation issue
Key: SPARK-46137
URL: https://issues.apache.org/jira/browse/SPARK-46137
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.5.0
Reporter: Izek Greenfield

Janino released a new version (3.1.11) with a fix for the compilation of:
{code:java}
{ } while (false)
{code}
[Link to GitHub issue|https://github.com/janino-compiler/janino/issues/208]
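A sketch of how such a construct can be checked against a given Janino version. The class body below is a guess at the intended shape (the ticket's snippet appears truncated; it looks like a do-while with a constant-false condition):
{code:scala}
import org.codehaus.janino.SimpleCompiler

// Compiles a small Java source with Janino; with the affected versions this
// loop shape reportedly failed to compile, and 3.1.11 is said to fix it.
val compiler = new SimpleCompiler()
compiler.cook(
  """public class Repro {
    |  public static void run() {
    |    do { } while (false);
    |  }
    |}""".stripMargin)
{code}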
[jira] [Updated] (SPARK-44472) Change the external catalog thread-safety mechanism
[ https://issues.apache.org/jira/browse/SPARK-44472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-44472:

Labels: (was: spark-)
[jira] [Updated] (SPARK-44472) Change the external catalog thread-safety mechanism
[ https://issues.apache.org/jira/browse/SPARK-44472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-44472:

Labels: spark- (was: )
[jira] [Commented] (SPARK-44472) Change the external catalog thread-safety mechanism
[ https://issues.apache.org/jira/browse/SPARK-44472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780741#comment-17780741 ] Izek Greenfield commented on SPARK-44472:

Hi, any comments on this?
[jira] [Updated] (SPARK-44472) Change the external catalog thread-safety mechanism
[ https://issues.apache.org/jira/browse/SPARK-44472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-44472:

Attachment: add_hive_concurrent_connections.diff
[jira] [Created] (SPARK-44472) Change the external catalog thread-safety mechanism
Izek Greenfield created SPARK-44472:
---

Summary: Change the external catalog thread-safety mechanism
Key: SPARK-44472
URL: https://issues.apache.org/jira/browse/SPARK-44472
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.4.1
Reporter: Izek Greenfield

We tested changing the synchronization of the external catalog to use thread-locals instead of synchronized methods. In our tests, this improved the runtime of parallel actions by about 45% for certain workloads (time reduced from ~15 min to ~9 min).
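A minimal sketch of the idea (an assumption about the approach, not the attached patch; `MetastoreClient` is a stand-in interface, not Spark's actual `HiveClient`): replace one globally synchronized client with one client per thread, so parallel metastore calls stop serializing on a single lock.
{code:scala}
trait MetastoreClient {
  def getTable(db: String, name: String): String
}

// Before: every caller queues on one lock around a single connection.
class SynchronizedCatalog(client: MetastoreClient) {
  def getTable(db: String, name: String): String = synchronized {
    client.getTable(db, name)
  }
}

// After: each thread lazily gets its own connection; no shared lock.
class ThreadLocalCatalog(newClient: () => MetastoreClient) {
  private val local = ThreadLocal.withInitial[MetastoreClient](() => newClient())
  def getTable(db: String, name: String): String =
    local.get().getTable(db, name)
}
{code}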
[jira] [Created] (SPARK-37745) Add a new option to CSVOptions: ignoreEmptyLines
Izek Greenfield created SPARK-37745:
---

Summary: Add a new option to CSVOptions: ignoreEmptyLines
Key: SPARK-37745
URL: https://issues.apache.org/jira/browse/SPARK-37745
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.3.0
Reporter: Izek Greenfield

In many cases, users need to read the full CSV file including all empty lines, so it would be good to have such an option. The default will be true, so the default behavior will not change.
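A usage sketch of the proposed option (hypothetical: the option name and default are as described above and do not exist in released Spark; `spark` is assumed to be an existing SparkSession):
{code:scala}
// Default (true) keeps today's behavior: empty lines are skipped.
// Setting it to false would preserve empty lines as rows.
val df = spark.read
  .option("header", "true")
  .option("ignoreEmptyLines", "false") // proposed option
  .csv("/tmp/input.csv")
{code}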
[jira] [Updated] (SPARK-37321) Wrong size estimation leads to "Cannot broadcast the table that is larger than 8GB: 8 GB"
[ https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-37321: Description: When CBO is enabled then a situation occurs where spark tries to broadcast very large DataFrame due to wrong output size estimation. In `EstimationUtils.getSizePerRow`, if there is no statistics then spark will use `DataType.defaultSize`. In the case where the output contains `functions.concat_ws`, the `getSizePerRow` function will estimate the size to be 20 bytes, while in our case the actual size can be a lot larger. As a result, we in some cases end up with an estimated size of < 300K while the actual size can be > 8GB, thus leading to exceptions as spark thinks the tables may be broadcast but later realizes the data size is too large. Code sample to reproduce: for running that I used `-Xmx45G` {code:scala} import spark.implicits._ (1 to 10).toDF("index").withColumn("index", col("index").cast("string")).write.parquet("/tmp/a") (1 to 1000).toDF("index_b").withColumn("index_b", col("index_b").cast("string")).write.parquet("/tmp/b") val a = spark.read .parquet("/tmp/a") .withColumn("b", col("index")) .withColumn("l1", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l2", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l3", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l4", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l5", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) val r = Random.alphanumeric val l = 220 val i = 2800 val b = spark.read .parquet("/tmp/b") .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) a.join(b, col("index") === col("index_b")).show(2000) {code} was: When CBO is enabled then a situation occurs where spark tries to broadcast very large DataFrame due to wrong output size estimation. In `EstimationUtils.getSizePerRow`, if there is no statistics then spark will use `DataType.defaultSize`. 
In the case where the output contains `functions.concat_ws`, the `getSizePerRow` function will estimate the size to be 20 bytes, while in our case the actual size can be a lot larger. As a result, we in some cases end up with an estimated size of < 300K while the actual size can be > 8GB, thus leading to exceptions as spark thinks the tables may be broadcast but later realizes the data size is too large. Code sample to reproduce: {code:scala} import spark.implicits._ (1 to 10).toDF("index").withColumn("index", col("index").cast("string")).write.parquet("/tmp/a") (1 to 1000).toDF("index_b").withColumn("index_b", col("index_b").cast("string")).write.parquet("/tmp/b") val a = spark.read .parquet("/tmp/a") .withColumn("b", col("index")) .withColumn("l1", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l2", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l3", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l4", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l
[jira] [Updated] (SPARK-37321) Wrong size estimation that leads to "Cannot broadcast the table that is larger than 8GB: 8 GB"
[ https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-37321: Summary: Wrong size estimation that leads to "Cannot broadcast the table that is larger than 8GB: 8 GB" (was: Wrong size estimation that leads to Cannot broadcast the table that is larger than 8GB: 8 GB) > Wrong size estimation that leads to "Cannot broadcast the table that is > larger than 8GB: 8 GB" > -- > > Key: SPARK-37321 > URL: https://issues.apache.org/jira/browse/SPARK-37321 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.1.1, 3.2.0 >Reporter: Izek Greenfield >Priority: Major > > When CBO is enabled then a situation occurs where spark tries to broadcast > very large DataFrame due to wrong output size estimation. > > In `EstimationUtils.getSizePerRow`, if there is no statistics then spark will > use `DataType.defaultSize`. > In the case where the output contains `functions.concat_ws`, the > `getSizePerRow` function will estimate the size to be 20 bytes, while in our > case the actual size can be a lot larger. > As a result, we in some cases end up with an estimated size of < 300K while > the actual size can be > 8GB, thus leading to exceptions as spark thinks the > tables may be broadcast but later realizes the data size is too large. > > Code sample to reproduce: > {code:scala} > import spark.implicits._ > (1 to 10).toDF("index").withColumn("index", > col("index").cast("string")).write.parquet("/tmp/a") > (1 to 1000).toDF("index_b").withColumn("index_b", > col("index_b").cast("string")).write.parquet("/tmp/b") > val a = spark.read > .parquet("/tmp/a") > .withColumn("b", col("index")) > .withColumn("l1", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > .withColumn("l2", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > .withColumn("l3", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > .withColumn("l4", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > .withColumn("l5", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > val r = Random.alphanumeric > val l = 220 > val i = 2800 > val b = spark.read > .parquet("/tmp/b") > .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l7", 
functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > > a.join(b, col("index") === col("index_b")).show(2000) > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37321) Wrong size estimation leads to "Cannot broadcast the table that is larger than 8GB: 8 GB"
[ https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-37321: Component/s: SQL > Wrong size estimation leads to "Cannot broadcast the table that is larger > than 8GB: 8 GB" > - > > Key: SPARK-37321 > URL: https://issues.apache.org/jira/browse/SPARK-37321 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: Izek Greenfield >Priority: Major > > When CBO is enabled then a situation occurs where spark tries to broadcast > very large DataFrame due to wrong output size estimation. > > In `EstimationUtils.getSizePerRow`, if there is no statistics then spark will > use `DataType.defaultSize`. > In the case where the output contains `functions.concat_ws`, the > `getSizePerRow` function will estimate the size to be 20 bytes, while in our > case the actual size can be a lot larger. > As a result, we in some cases end up with an estimated size of < 300K while > the actual size can be > 8GB, thus leading to exceptions as spark thinks the > tables may be broadcast but later realizes the data size is too large. > > Code sample to reproduce: > {code:scala} > import spark.implicits._ > (1 to 10).toDF("index").withColumn("index", > col("index").cast("string")).write.parquet("/tmp/a") > (1 to 1000).toDF("index_b").withColumn("index_b", > col("index_b").cast("string")).write.parquet("/tmp/b") > val a = spark.read > .parquet("/tmp/a") > .withColumn("b", col("index")) > .withColumn("l1", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > .withColumn("l2", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > .withColumn("l3", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > .withColumn("l4", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > .withColumn("l5", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > val r = Random.alphanumeric > val l = 220 > val i = 2800 > val b = spark.read > .parquet("/tmp/b") > .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > > a.join(b, col("index") === col("index_b")).show(2000) > {code} 
> -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37321) Wrong size estimation leads to "Cannot broadcast the table that is larger than 8GB: 8 GB"
[ https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-37321: Summary: Wrong size estimation leads to "Cannot broadcast the table that is larger than 8GB: 8 GB" (was: Wrong size estimation that leads to "Cannot broadcast the table that is larger than 8GB: 8 GB") > Wrong size estimation leads to "Cannot broadcast the table that is larger > than 8GB: 8 GB" > - > > Key: SPARK-37321 > URL: https://issues.apache.org/jira/browse/SPARK-37321 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.1.1, 3.2.0 >Reporter: Izek Greenfield >Priority: Major > > When CBO is enabled then a situation occurs where spark tries to broadcast > very large DataFrame due to wrong output size estimation. > > In `EstimationUtils.getSizePerRow`, if there is no statistics then spark will > use `DataType.defaultSize`. > In the case where the output contains `functions.concat_ws`, the > `getSizePerRow` function will estimate the size to be 20 bytes, while in our > case the actual size can be a lot larger. > As a result, we in some cases end up with an estimated size of < 300K while > the actual size can be > 8GB, thus leading to exceptions as spark thinks the > tables may be broadcast but later realizes the data size is too large. > > Code sample to reproduce: > {code:scala} > import spark.implicits._ > (1 to 10).toDF("index").withColumn("index", > col("index").cast("string")).write.parquet("/tmp/a") > (1 to 1000).toDF("index_b").withColumn("index_b", > col("index_b").cast("string")).write.parquet("/tmp/b") > val a = spark.read > .parquet("/tmp/a") > .withColumn("b", col("index")) > .withColumn("l1", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > .withColumn("l2", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > .withColumn("l3", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > .withColumn("l4", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > .withColumn("l5", functions.concat_ws("/", col("index"), > functions.current_date(), functions.current_date(), functions.current_date(), > functions.current_date())) > val r = Random.alphanumeric > val l = 220 > val i = 2800 > val b = spark.read > .parquet("/tmp/b") > .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > .withColumn("l7", 
functions.concat_ws("/", (0 to i).flatMap(a => > List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) > > a.join(b, col("index") === col("index_b")).show(2000) > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37321) Wrong size estimation that leads to Cannot broadcast the table that is larger than 8GB: 8 GB
[ https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-37321: Description: When CBO is enabled then a situation occurs where spark tries to broadcast very large DataFrame due to wrong output size estimation. In `EstimationUtils.getSizePerRow`, if there is no statistics then spark will use `DataType.defaultSize`. In the case where the output contains `functions.concat_ws`, the `getSizePerRow` function will estimate the size to be 20 bytes, while in our case the actual size can be a lot larger. As a result, we in some cases end up with an estimated size of < 300K while the actual size can be > 8GB, thus leading to exceptions as spark thinks the tables may be broadcast but later realizes the data size is too large. Code sample to reproduce: {code:scala} import spark.implicits._ (1 to 10).toDF("index").withColumn("index", col("index").cast("string")).write.parquet("/tmp/a") (1 to 1000).toDF("index_b").withColumn("index_b", col("index_b").cast("string")).write.parquet("/tmp/b") val a = spark.read .parquet("/tmp/a") .withColumn("b", col("index")) .withColumn("l1", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l2", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l3", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l4", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l5", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) val r = Random.alphanumeric val l = 220 val i = 2800 val b = spark.read .parquet("/tmp/b") .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) a.join(b, col("index") === col("index_b")).show(2000) {code} was: When CBO is enabled then a situation occurs where spark tries to broadcast very large DataFrame due to wrong output size estimation. In `EstimationUtils.getSizePerRow`, if there is no statistics then spark will use `DataType.defaultSize`. In the case where the output contains `functions.concat_ws`, the `getSizePerRow` function will estimate the size to be 20 bytes, while in our case the actual size can be a lot larger. 
As a result, we in some cases end up with an estimated size of < 300K while the actual size can be > 8GB, thus leading to exceptions as spark thinks the tables may be broadcast but later realizes the data size is too large. Code sample to reproduce: {code:scala} import spark.implicits._ (1 to 10).toDF("index").withColumn("index", col("index").cast("string")).write.parquet("/tmp/a") (1 to 1000).toDF("index_b").withColumn("index_b", col("index_b").cast("string")).write.parquet("/tmp/b") val a = spark.read .parquet("/tmp/a") .withColumn("b", col("index")) .withColumn("l1", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l2", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l3", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l4", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l5", functions.concat_ws("/", col("
[jira] [Created] (SPARK-37321) Wrong size estimation that leads to Cannot broadcast the table that is larger than 8GB: 8 GB
Izek Greenfield created SPARK-37321: --- Summary: Wrong size estimation that leads to Cannot broadcast the table that is larger than 8GB: 8 GB Key: SPARK-37321 URL: https://issues.apache.org/jira/browse/SPARK-37321 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 3.2.0, 3.1.1 Reporter: Izek Greenfield When CBO is enabled then a situation occurs where spark tries to broadcast very large DataFrame due to wrong output size estimation. In `EstimationUtils.getSizePerRow`, if there is no statistics then spark will use `DataType.defaultSize`. In the case where the output contains `functions.concat_ws`, the `getSizePerRow` function will estimate the size to be 20 bytes, while in our case the actual size can be a lot larger. As a result, we in some cases end up with an estimated size of < 300K while the actual size can be > 8GB, thus leading to exceptions as spark thinks the tables may be broadcast but later realizes the data size is too large. Code sample to reproduce: {code:scala} import spark.implicits._ (1 to 10).toDF("index").withColumn("index", col("index").cast("string")).write.parquet("/tmp/a") (1 to 1000).toDF("index_b").withColumn("index_b", col("index_b").cast("string")).write.parquet("/tmp/b") val a = spark.read .parquet("/tmp/a") .withColumn("b", col("index")) .withColumn("l1", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l2", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l3", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l4", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) .withColumn("l5", functions.concat_ws("/", col("index"), functions.current_date(), functions.current_date(), functions.current_date(), functions.current_date())) val r = Random.alphanumeric val l = 220 val i = 2800 val b = spark.read .parquet("/tmp/b") .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*)) a.join(b, col("index") === col("index_b")).show(2000) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
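A rough illustration of the estimation gap described above (a sketch, not `EstimationUtils` itself): with no column statistics, each string column is assumed to take `StringType.defaultSize` bytes, regardless of the actual values.
{code:scala}
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("index_b", StringType),
  StructField("l1", StringType)))

// 2 string columns * 20 bytes = 40 bytes estimated per row.
val estimatedRowBytes = schema.map(_.dataType.defaultSize).sum
println(estimatedRowBytes) // 40

// In the repro above, each concat_ws value is ~2800 segments of ~220 chars,
// i.e. hundreds of kilobytes per row -- orders of magnitude above the estimate.
{code}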
[jira] [Commented] (SPARK-36329) show API of Dataset should take the output method as input
[ https://issues.apache.org/jira/browse/SPARK-36329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391159#comment-17391159 ] Izek Greenfield commented on SPARK-36329:

# If you do it like that, you have to write it again and again.
# It is very handy to have it in the same format as show.
[jira] [Updated] (SPARK-36329) show API of Dataset should take the output method as input
[ https://issues.apache.org/jira/browse/SPARK-36329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-36329:

Description:
For now show is:
{code:scala}
def show(numRows: Int, truncate: Boolean): Unit = if (truncate) {
  println(showString(numRows, truncate = 20))
} else {
  println(showString(numRows, truncate = 0))
}
{code}
It can be turned into:
{code:scala}
def show(numRows: Int, truncate: Boolean, out: String => Unit = println): Unit = if (truncate) {
  out(showString(numRows, truncate = 20))
} else {
  out(showString(numRows, truncate = 0))
}
{code}
so users will be able to send the output to a file/log...

was:
For now show is:
{code:scala}
def show(numRows: Int, truncate: Boolean): Unit = if (truncate) {
  println(showString(numRows, truncate = 20))
} else {
  println(showString(numRows, truncate = 0))
}
{code}
It can be turned into:
{code:scala}
def show(numRows: Int, truncate: Boolean, out: Any => Unit = println): Unit = if (truncate) {
  out(showString(numRows, truncate = 20))
} else {
  out(showString(numRows, truncate = 0))
}
{code}
so users will be able to send the output to a file/log...
[jira] [Created] (SPARK-36329) show API of Dataset should take the output method as input
Izek Greenfield created SPARK-36329:
---

Summary: show API of Dataset should take the output method as input
Key: SPARK-36329
URL: https://issues.apache.org/jira/browse/SPARK-36329
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.1.2
Reporter: Izek Greenfield

For now show is:
{code:scala}
def show(numRows: Int, truncate: Boolean): Unit = if (truncate) {
  println(showString(numRows, truncate = 20))
} else {
  println(showString(numRows, truncate = 0))
}
{code}
It can be turned into:
{code:scala}
def show(numRows: Int, truncate: Boolean, out: Any => Unit = println): Unit = if (truncate) {
  out(showString(numRows, truncate = 20))
} else {
  out(showString(numRows, truncate = 0))
}
{code}
so users will be able to send the output to a file/log...
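A usage sketch of the proposed overload (not an existing Spark API; `df` stands for any existing Dataset):
{code:scala}
import org.slf4j.LoggerFactory

val log = LoggerFactory.getLogger("example")

// Route the rendered table into the application log instead of stdout.
df.show(20, truncate = true, out = s => log.info(s))
{code}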
[jira] [Commented] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice
[ https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285663#comment-17285663 ] Izek Greenfield commented on SPARK-8582:

[~zsxwing] Why was that Jira closed? I still see this as an issue in the code...

> Optimize checkpointing to avoid computing an RDD twice
> ------------------------------------------------------
>
> Key: SPARK-8582
> URL: https://issues.apache.org/jira/browse/SPARK-8582
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0
> Reporter: Andrew Or
> Assignee: Shixiong Zhu
> Priority: Major
> Labels: bulk-closed
>
> In Spark, checkpointing allows the user to truncate the lineage of an RDD
> and save the intermediate contents to HDFS for fault tolerance. However, this
> is not currently implemented very efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during
> the action that triggered the checkpointing in the first place, and once
> while we checkpoint (we iterate through an RDD's partitions and write them to
> disk). See this line for more detail:
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingIterator` that writes checkpoint
> data to HDFS while we run the action. This will speed up many usages of
> `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, but
> this is not always viable for very large input data. It's also not a great
> API to use in general.)
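The workaround the quoted issue mentions, caching before checkpointing so the data is computed only once, looks roughly like this (a sketch; `rdd` is assumed to be an existing RDD):
{code:scala}
import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK) // keep partitions after the first job
rdd.checkpoint()                          // marked now; written after the next action
rdd.count()                               // computes once; the checkpoint re-reads the cache
rdd.unpersist()                           // drop the cache once the checkpoint exists
{code}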
[jira] [Created] (SPARK-32764) Comparison -0.0 < 0.0 returns true
Izek Greenfield created SPARK-32764:
---

Summary: Comparison -0.0 < 0.0 returns true
Key: SPARK-32764
URL: https://issues.apache.org/jira/browse/SPARK-32764
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.0
Reporter: Izek Greenfield

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark: SparkSession = SparkSession
  .builder()
  .master("local")
  .appName("SparkByExamples.com")
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.sqlContext.implicits._

val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
  .withColumn("comp", col("neg") < col("pos"))
df.show(false)

// Output:
// +----+---+----+
// |neg |pos|comp|
// +----+---+----+
// |-0.0|0.0|true|
// +----+---+----+
{code}
I think the result should be false.
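For reference, the two comparison semantics in play: primitive IEEE 754 comparison treats -0.0 and 0.0 as equal, while `java.lang.Double.compare` imposes a total ordering that places -0.0 strictly below 0.0, which matches the behavior observed above:
{code:scala}
println(-0.0 < 0.0)                          // false (IEEE 754: equal)
println(-0.0 == 0.0)                         // true
println(java.lang.Double.compare(-0.0, 0.0)) // -1 (total ordering: -0.0 before 0.0)
{code}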
[jira] [Commented] (SPARK-31769) Add support for MDC in Spark driver logs
[ https://issues.apache.org/jira/browse/SPARK-31769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111955#comment-17111955 ] Izek Greenfield commented on SPARK-31769:

Hi [~cloud_fan], following previous discussions on [PR-26624|https://github.com/apache/spark/pull/26624], I'd like us to agree on the necessity of this feature, after which I'll create a dedicated PR. What are your thoughts?
[jira] [Created] (SPARK-31769) Add support for MDC in Spark driver logs
Izek Greenfield created SPARK-31769:
---

Summary: Add support for MDC in Spark driver logs
Key: SPARK-31769
URL: https://issues.apache.org/jira/browse/SPARK-31769
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.0.0
Reporter: Izek Greenfield

In order to align driver logs with Spark executor logs, add support for logging MDC state for properties that are shared between the driver and the executors (e.g. application name, task id).

The use case is applicable in 2 cases:
# streaming
# running Spark under a job server (in this case you have a long-running Spark context that handles many tasks from different sources)

See also [SPARK-8981|https://issues.apache.org/jira/browse/SPARK-8981], which handles the MDC in the executor context.
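A minimal sketch of the desired driver-side behavior using SLF4J's MDC API (the API calls are real; the property names and values are illustrative only):
{code:scala}
import org.slf4j.{LoggerFactory, MDC}

val logger = LoggerFactory.getLogger("driver")

// Set shared properties once on the driver thread, mirroring the
// executor-side MDC support added in SPARK-8981.
MDC.put("appName", "my-streaming-app")
MDC.put("taskId", "task-42")
try {
  logger.info("this driver log line carries the MDC context")
} finally {
  MDC.clear() // avoid leaking context across jobs on reused threads
}
{code}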
[jira] [Commented] (SPARK-30332) When running a SQL query with limit, Catalyst throws a StackOverflow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096340#comment-17096340 ] Izek Greenfield commented on SPARK-30332: - Can someone check that? > When running sql query with limit catalyst throw StackOverFlow exception > - > > Key: SPARK-30332 > URL: https://issues.apache.org/jira/browse/SPARK-30332 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Environment: spark version 3.0.0-preview > Reporter: Izek Greenfield > Priority: Major > Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, T_41233.csv
[jira] [Reopened] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield reopened SPARK-30332: - Added code to reproduce > When running sql query with limit catalyst throw StackOverFlow exception > - > > Key: SPARK-30332 > URL: https://issues.apache.org/jira/browse/SPARK-30332 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Environment: spark version 3.0.0-preview > Reporter: Izek Greenfield > Priority: Major > Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, T_41233.csv
[jira] [Commented] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042851#comment-17042851 ] Izek Greenfield commented on SPARK-30332: - Code to reproduce the problem: {code:scala} import java.nio.file.{Files, Paths} import org.apache.spark.sql.SparkSession object Test { def main(args: Array[String]): Unit = { val spark = { SparkSession .builder() .master("local[*]") .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .config("spark.sql.cbo.enabled", "true") .config("spark.scheduler.mode", "FAIR") .config("spark.sql.crossJoin.enabled", "true") .config("spark.sql.adaptive.enabled", "true") .config("spark.sql.parquet.filterPushdown", "true") .config("spark.sql.shuffle.partitions", "500") .config("spark.executor.heartbeatInterval", "600s") .config("spark.network.timeout", "1200s") .config("spark.sql.broadcastTimeout", "1200s") .config("spark.shuffle.file.buffer", "64k") .appName("error") .enableHiveSupport() .getOrCreate() } val pathToCsvFiles = "db" import scala.collection.JavaConverters._ Files.walk(Paths.get(pathToCsvFiles)).iterator().asScala.map(_.toFile).foreach{ file => if (!file.isDirectory){ val name = file.getName spark.read.format("csv") .option("inferSchema", "true") .option("header", "true") .option("mode", "DROPMALFORMED") .load(file.getAbsolutePath) .createOrReplaceGlobalTempView(name.split("\\.").head) } } spark.sql( """ |SELECT BT_capital.asof_date, |BT_capital.run_id, |BT_capital.v, |BT_capital.id, |BT_capital.entity, |BT_capital.level_1, |BT_capital.level_2, |BT_capital.level_3, |BT_capital.level_4, |BT_capital.level_5, |BT_capital.level_6, |BT_capital.path_bt_capital, |BT_capital.line_item, |t0.target_line_item, |t0.line_description, |BT_capital.col_item, |BT_capital.rep_amount, |root.orgUnitId, |root.cptyId, |root.instId, |root.startDate, |root.maturityDate, |root.amount, |root.nominalAmount, |root.quantity, |root.lkupAssetLiability, |root.lkupCurrency, |root.lkupProdType, |root.interestResetDate, |root.interestResetTerm, |root.noticePeriod, |root.historicCostAmount, |root.dueDate, |root.lkupResidence, |root.lkupCountryOfUltimateRisk, |root.lkupSector, |root.lkupIndustry, |root.lkupAccountingPortfolioType, |root.lkupLoanDepositTerm, |root.lkupFixedFloating, |root.lkupCollateralType, |root.lkupRiskType, |root.lkupEligibleRefinancing, |root.lkupHedging, |root.lkupIsOwnIssued, |root.lkupIsSubordinated, |root.lkupIsQuoted, |root.lkupIsSecuritised, |root.lkupIsSecuritisedServiced, |root.lkupIsSyndicated, |root.lkupIsDeRecognised, |root.lkupIsRenegotiated, |root.lkupIsTransferable, |root.lkupIsNewBusiness, |root.lkupIsFiduciary, |root.lkupIsNonPerforming, |root.lkupIsInterGroup, |root.lkupIsIntraGroup, |root.lkupIsRediscounted, |root.lkupIsCollateral, |root.lkupIsExercised, |root.lkupIsImpaired, |root.facilityId, |root.lkupIsOTC, |root.lkupIsDefaulted, |root.lkupIsSavingsPosition, |root.lkupIsForborne, |root.lkupIsDebtRestructuringLoan, |root.interestRateAAR, |root.interestRateAPRC, |root.custom1, |root.custom2, |root.custom3, |root.lkupSecuritisationType, |root.lkupIsCashPooling, |root.lkupIsEquityParticipationGTE10, |root.lkupIsConvertible, |root.lkupEconomicHedge, |root.lkupIsNonCurrHeldForSale, |root.lkupIsEmbeddedDerivative, |root.lkupLoanPurpose, |root.lkupRegulated, |root.lkupRepaymentType, |root.glAccount, |root.lkupIsRecourse, |root.lkupIsNotFullyGuaranteed, |root.lkupImpairmentStage, |root.lkupIsEntireAmountWrittenOff, |root.lkupIsLowCreditRisk, |root.lkupIsOBSWithinIFRS9, 
|root.lkupIsUnderSpecialSurveillance, |root.lkupProtection, |root.lkupIsGeneralAllowance, |root.lkupSectorUltimateRisk, |root.cptyOrgUnitId, |root.name, |root.lkupNationality, |root.lkupSize, |root.lkupIsSPV, |root.lkupIsCentralCounterparty, |root.lkupIsMMRMFI, |root.lkupIsKeyManagement, |root.lkupIsOtherRelat
[jira] [Reopened] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield reopened SPARK-8981: Created a PR > Set applicationId and appName in log4j MDC > -- > > Key: SPARK-8981 > URL: https://issues.apache.org/jira/browse/SPARK-8981 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Paweł Kopiczko >Priority: Minor > > It would be nice to have, because it's good to have logs in one file when > using log agents (like logentries) in standalone mode. Also allows > configuring rolling file appender without a mess when multiple applications > are running. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield reopened SPARK-30332: - I attached the datasets (DS) needed to run the query. > When running sql query with limit catalyst throw StackOverFlow exception > - > > Key: SPARK-30332 > URL: https://issues.apache.org/jira/browse/SPARK-30332 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Environment: spark version 3.0.0-preview > Reporter: Izek Greenfield > Priority: Major > Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, T_41233.csv
[jira] [Updated] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-30332: Attachment: AGGR_41406.csv > When running sql query with limit catalyst throw StackOverFlow exception > - > > Key: SPARK-30332 > URL: https://issues.apache.org/jira/browse/SPARK-30332 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Environment: spark version 3.0.0-preview > Reporter: Izek Greenfield > Priority: Major > Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, T_41233.csv
[jira] [Updated] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-30332: Attachment: T_41233.csv > When running sql query with limit catalyst throw StackOverFlow exception > - > > Key: SPARK-30332 > URL: https://issues.apache.org/jira/browse/SPARK-30332 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Environment: spark version 3.0.0-preview > Reporter: Izek Greenfield > Priority: Major > Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, T_41233.csv
[jira] [Updated] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-30332: Attachment: AGGR_41418.csv AGGR_41410.csv AGGR_41406.csv AGGR_41390.csv AGGR_41380.csv PORTFOLIO_41446.csv > When running sql query with limit catalyst throw StackOverFlow exception > - > > Key: SPARK-30332 > URL: https://issues.apache.org/jira/browse/SPARK-30332 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Environment: spark version 3.0.0-preview > Reporter: Izek Greenfield > Priority: Major > Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, T_41233.csv
[jira] [Commented] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer
[ https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036203#comment-17036203 ] Izek Greenfield commented on SPARK-16387: - Hi [~dongjoon], thanks for the quick response. We found that on Oracle the added double quotes cause errors for our other servers that read the table: if the column name is `t_id` and the table is created with "t_id", then `select T_ID from table` fails, because Oracle treats quoted identifiers as case-sensitive. The docs on the method say it escapes only reserved words, but it actually escapes everything, so I think it could be changed to really escape only reserved words, as I do in the link I put in the previous comment. > Reserved SQL words are not escaped by JDBC writer > - > > Key: SPARK-16387 > URL: https://issues.apache.org/jira/browse/SPARK-16387 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Lev >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.0.0 > > > Here is a code (imports are omitted) > object Main extends App { > val sqlSession = SparkSession.builder().config(new SparkConf(). > setAppName("Sql Test").set("spark.app.id", "SQLTest"). > set("spark.master", "local[2]"). > set("spark.ui.enabled", "false") > .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar" )) > ).getOrCreate() > import sqlSession.implicits._ > val localprops = new Properties > localprops.put("user", "") > localprops.put("password", "") > val df = sqlSession.createDataset(Seq("a","b","c")).toDF("order") > val writer = df.write > .mode(SaveMode.Append) > writer > .jdbc("jdbc:mysql://localhost:3306/test3", s"jira_test", localprops) > } > End error is : > com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error > in your SQL syntax; check the manual that corresponds to your MySQL server > version for the right syntax to use near 'order TEXT )' at line 1 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > Clearly the reserved word has to be quoted -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer
[ https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035395#comment-17035395 ] Izek Greenfield commented on SPARK-16387: - [~dongjoon] I think that PR fixes one issue but creates another, and it should be fixed differently: not by quoting every identifier, but by checking, for each supported dialect, whether the word is reserved and quoting only in that case, as the scaladoc of the function says: /** * Quotes the identifier. This is used to put quotes around the identifier in case the column * name is a reserved keyword, or in case it contains characters that require quotes (e.g. space). */ Something like: [link|https://stackoverflow.com/a/60167628] > Reserved SQL words are not escaped by JDBC writer > - > > Key: SPARK-16387 > URL: https://issues.apache.org/jira/browse/SPARK-16387 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Lev >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
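To make the suggested fix concrete, a dialect-aware quoting helper could look roughly like this (a minimal sketch under assumptions: the reserved-word list is abbreviated and the object and method names are hypothetical, not Spark's actual JdbcDialect API):
{code:scala}
object DialectQuoting {
  // Hypothetical, abbreviated reserved-word list; a real dialect would
  // carry the complete list for its database.
  private val oracleReservedWords = Set("ORDER", "SELECT", "TABLE", "LEVEL", "FROM")

  private def needsQuoting(colName: String): Boolean =
    oracleReservedWords.contains(colName.toUpperCase) ||
      colName.exists(c => !(c.isLetterOrDigit || c == '_'))

  // Quote only reserved words or names containing special characters, so a
  // plain name like t_id stays unquoted and case-insensitive in Oracle.
  def quoteIdentifier(colName: String): String =
    if (needsQuoting(colName)) s""""$colName"""" else colName
}
{code}
With this, `quoteIdentifier("t_id")` returns `t_id` unchanged, while `quoteIdentifier("order")` returns `"order"`, which is the behaviour the scaladoc describes.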
[jira] [Commented] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007373#comment-17007373 ] Izek Greenfield commented on SPARK-30332: - Yes, when removing the limit everything works and we get results. > When running sql query with limit catalyst throw StackOverFlow exception > - > > Key: SPARK-30332 > URL: https://issues.apache.org/jira/browse/SPARK-30332 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Environment: spark version 3.0.0-preview > Reporter: Izek Greenfield > Priority: Major
[jira] [Commented] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004146#comment-17004146 ] Izek Greenfield commented on SPARK-30332: - The task failed; we do not get any result. > When running sql query with limit catalyst throw StackOverFlow exception > - > > Key: SPARK-30332 > URL: https://issues.apache.org/jira/browse/SPARK-30332 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Environment: spark version 3.0.0-preview > Reporter: Izek Greenfield > Priority: Major
[jira] [Created] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
Izek Greenfield created SPARK-30332: --- Summary: When running sql query with limit catalyst throw StackOverFlow exception Key: SPARK-30332 URL: https://issues.apache.org/jira/browse/SPARK-30332 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: spark version 3.0.0-preview Reporter: Izek Greenfield Running that SQL: {code:sql} SELECT BT_capital.asof_date, BT_capital.run_id, BT_capital.v, BT_capital.id, BT_capital.entity, BT_capital.level_1, BT_capital.level_2, BT_capital.level_3, BT_capital.level_4, BT_capital.level_5, BT_capital.level_6, BT_capital.path_bt_capital, BT_capital.line_item, t0.target_line_item, t0.line_description, BT_capital.col_item, BT_capital.rep_amount, root.orgUnitId, root.cptyId, root.instId, root.startDate, root.maturityDate, root.amount, root.nominalAmount, root.quantity, root.lkupAssetLiability, root.lkupCurrency, root.lkupProdType, root.interestResetDate, root.interestResetTerm, root.noticePeriod, root.historicCostAmount, root.dueDate, root.lkupResidence, root.lkupCountryOfUltimateRisk, root.lkupSector, root.lkupIndustry, root.lkupAccountingPortfolioType, root.lkupLoanDepositTerm, root.lkupFixedFloating, root.lkupCollateralType, root.lkupRiskType, root.lkupEligibleRefinancing, root.lkupHedging, root.lkupIsOwnIssued, root.lkupIsSubordinated, root.lkupIsQuoted, root.lkupIsSecuritised, root.lkupIsSecuritisedServiced, root.lkupIsSyndicated, root.lkupIsDeRecognised, root.lkupIsRenegotiated, root.lkupIsTransferable, root.lkupIsNewBusiness, root.lkupIsFiduciary, root.lkupIsNonPerforming, root.lkupIsInterGroup, root.lkupIsIntraGroup, root.lkupIsRediscounted, root.lkupIsCollateral, root.lkupIsExercised, root.lkupIsImpaired, root.facilityId, root.lkupIsOTC, root.lkupIsDefaulted, root.lkupIsSavingsPosition, root.lkupIsForborne, root.lkupIsDebtRestructuringLoan, root.interestRateAAR, root.interestRateAPRC, root.custom1, root.custom2, root.custom3, root.lkupSecuritisationType, root.lkupIsCashPooling, root.lkupIsEquityParticipationGTE10, root.lkupIsConvertible, root.lkupEconomicHedge, root.lkupIsNonCurrHeldForSale, root.lkupIsEmbeddedDerivative, root.lkupLoanPurpose, root.lkupRegulated, root.lkupRepaymentType, root.glAccount, root.lkupIsRecourse, root.lkupIsNotFullyGuaranteed, root.lkupImpairmentStage, root.lkupIsEntireAmountWrittenOff, root.lkupIsLowCreditRisk, root.lkupIsOBSWithinIFRS9, root.lkupIsUnderSpecialSurveillance, root.lkupProtection, root.lkupIsGeneralAllowance, root.lkupSectorUltimateRisk, root.cptyOrgUnitId, root.name, root.lkupNationality, root.lkupSize, root.lkupIsSPV, root.lkupIsCentralCounterparty, root.lkupIsMMRMFI, root.lkupIsKeyManagement, root.lkupIsOtherRelatedParty, root.lkupResidenceProvince, root.lkupIsTradingBook, root.entityHierarchy_entityId, root.entityHierarchy_Residence, root.lkupLocalCurrency, root.cpty_entityhierarchy_entityId, root.lkupRelationship, root.cpty_lkupRelationship, root.entityNationality, root.lkupRepCurrency, root.startDateFinancialYear, root.numEmployees, root.numEmployeesTotal, root.collateralAmount, root.guaranteeAmount, root.impairmentSpecificIndividual, root.impairmentSpecificCollective, root.impairmentGeneral, root.creditRiskAmount, root.provisionSpecificIndividual, root.provisionSpecificCollective, root.provisionGeneral, root.writeOffAmount, root.interest, root.fairValueAmount, root.grossCarryingAmount, root.carryingAmount, root.code, root.lkupInstrumentType, root.price, root.amountAtIssue, root.yield, root.totalFacilityAmount, root.facility_rate, root.spec_indiv_est, root.spec_coll_est, 
root.coll_inc_loss, root.impairment_amount, root.provision_amount, root.accumulated_impairment, root.exclusionFlag, root.lkupIsHoldingCompany, root.instrument_startDate, root.entityResidence, fxRate.enumerator, fxRate.lkupFromCurrency, fxRate.rate, fxRate.custom1, fxRate.custom2, fxRate.custom3, GB_position.lkupIsECGDGuaranteed, GB_position.lkupIsMultiAcctOffsetMortgage, GB_position.lkupIsIndexLinked, GB_position.lkupIsRetail, GB_position.lkupCollateralLocation, GB_position.percentAboveBBR, GB_position.lkupIsMoreInArrears, GB_position.lkupIsArrearsCapitalised, GB_position.lkupCollateralPossession, GB_position.lkupIsLifetimeMortgage, GB_position.lkupLoanConcessionType, GB_position.lkupIsMultiCurrency, GB_position.lkupIsJointIncomeBasis, GB_position.ratioIncomeMultiple, GB_position.interestRate, GB_position.exclusionFlag, GB_position.lkupFDIDirection, GB_position.lkupIsRTGS, GB_positionExtended.nonRecourseFinanceAmount, GB_positionExtended.arrearsAmount, GB_Counterparty.lkupIsClearingFirm, GB_Counterparty.lkupIsIntermediateFinCorp, GB_Counterparty.lkupIsImpairedCreditHistory, GB_Counterparty.lkupFDIRelationship FROM portfolio_41446 BT_capital JOIN aggr_41390 root ON (root.id = BT_capital.id AND root.entity = BT_capital.entity AND (root.instance_id = 'e3b82807-9371-44f4-9c97-d63cde
[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979236#comment-16979236 ] Izek Greenfield commented on SPARK-8981: [~srowen] I created a pull request for that. > Set applicationId and appName in log4j MDC > -- > > Key: SPARK-8981 > URL: https://issues.apache.org/jira/browse/SPARK-8981 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Paweł Kopiczko >Priority: Minor > > It would be nice to have, because it's good to have logs in one file when > using log agents (like logentries) in standalone mode. Also allows > configuring rolling file appender without a mess when multiple applications > are running. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
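For context, the wiring this feature request asks for could look roughly like the following sketch (the MDC keys and values are assumptions for illustration, not the PR's actual implementation):
{code:scala}
import org.slf4j.{Logger, LoggerFactory, MDC}

object MdcLoggingExample {
  private val log: Logger = LoggerFactory.getLogger(getClass)

  def main(args: Array[String]): Unit = {
    // Hypothetical values; inside Spark these would come from the running
    // application (e.g. SparkContext.applicationId and SparkContext.appName).
    MDC.put("applicationId", "app-20191121123456-0001")
    MDC.put("appName", "my-streaming-job")

    // A log4j PatternLayout can then reference %X{applicationId} and
    // %X{appName}, so each application's lines can be routed to its own file.
    log.info("this line now carries the application id and name via the MDC")
  }
}
{code}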
[jira] [Commented] (SPARK-13346) Using DataFrames iteratively leads to slow query planning
[ https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948443#comment-16948443 ] Izek Greenfield commented on SPARK-13346: - [~davies] why was this issue closed? > Using DataFrames iteratively leads to slow query planning > - > > Key: SPARK-13346 > URL: https://issues.apache.org/jira/browse/SPARK-13346 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley >Priority: Major > Labels: bulk-closed > > I have an iterative algorithm based on DataFrames, and the query plan grows > very quickly with each iteration. Caching the current DataFrame at the end > of an iteration does not fix the problem. However, converting the DataFrame > to an RDD and back at the end of each iteration does fix the problem. > Printing the query plans shows that the plan explodes quickly (10 lines, to > several hundred lines, to several thousand lines, ...) with successive > iterations. > The desired behavior is for the analyzer to recognize that a big chunk of the > query plan does not need to be computed since it is already cached. The > computation on each iteration should be the same. > If useful, I can push (complex) code to reproduce the issue. But it should > be simple to see if you create an iterative algorithm which produces a new > DataFrame from an old one on each iteration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
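The workaround described in the quoted issue, cutting the lineage by converting the DataFrame to an RDD and back at the end of each iteration, can be sketched as follows (the numeric column `v` and the per-iteration update are assumptions for illustration, not code from the issue):
{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Each round trip through an RDD replaces the accumulated logical plan with
// a fresh scan, so planning time stays flat across iterations.
def iterate(spark: SparkSession, initial: DataFrame, steps: Int): DataFrame = {
  var df = initial
  for (_ <- 1 to steps) {
    df = df.withColumn("v", col("v") * 2)          // hypothetical update step
    df = spark.createDataFrame(df.rdd, df.schema)  // truncates the query plan
  }
  df
}
{code}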
[jira] [Updated] (SPARK-27862) Upgrade json4s-jackson to 3.6.6
[ https://issues.apache.org/jira/browse/SPARK-27862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-27862: Summary: Upgrade json4s-jackson to 3.6.6 (was: Upgrade json4s-jackson to 3.6.5) > Upgrade json4s-jackson to 3.6.6 > --- > > Key: SPARK-27862 > URL: https://issues.apache.org/jira/browse/SPARK-27862 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0, 2.4.3 >Reporter: Izek Greenfield >Priority: Minor > > It would be very good to upgrade to a newer version -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-27862) Upgrade json4s-jackson to 3.6.5
[ https://issues.apache.org/jira/browse/SPARK-27862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-27862: Comment: was deleted (was: create pull request for 2.4 branch: https://github.com/apache/spark/pull/24729 create pull request for master branch: https://github.com/apache/spark/pull/24736 ) > Upgrade json4s-jackson to 3.6.5 > --- > > Key: SPARK-27862 > URL: https://issues.apache.org/jira/browse/SPARK-27862 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0, 2.4.3 >Reporter: Izek Greenfield >Priority: Minor > > It would be very good to upgrade to a newer version -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27862) Upgrade json4s-jackson to 3.6.5
[ https://issues.apache.org/jira/browse/SPARK-27862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849703#comment-16849703 ] Izek Greenfield edited comment on SPARK-27862 at 5/29/19 5:47 AM: -- Created a pull request for the 2.4 branch: https://github.com/apache/spark/pull/24729 Created a pull request for the master branch: https://github.com/apache/spark/pull/24736 was (Author: igreenfi): create pull request for 2.4 branch: https://github.com/apache/spark/pull/24729 > Upgrade json4s-jackson to 3.6.5 > --- > > Key: SPARK-27862 > URL: https://issues.apache.org/jira/browse/SPARK-27862 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0, 2.4.3 >Reporter: Izek Greenfield >Priority: Minor > > It would be very good to upgrade to a newer version -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27862) Upgrade json4s-jackson to 3.6.5
[ https://issues.apache.org/jira/browse/SPARK-27862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849703#comment-16849703 ] Izek Greenfield commented on SPARK-27862: - Created a pull request for the 2.4 branch: https://github.com/apache/spark/pull/24729 > Upgrade json4s-jackson to 3.6.5 > --- > > Key: SPARK-27862 > URL: https://issues.apache.org/jira/browse/SPARK-27862 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0, 2.4.3 >Reporter: Izek Greenfield >Priority: Minor > > It would be very good to upgrade to a newer version -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27862) Upgrade json4s-jackson to 3.6.5
Izek Greenfield created SPARK-27862: --- Summary: Upgrade json4s-jackson to 3.6.5 Key: SPARK-27862 URL: https://issues.apache.org/jira/browse/SPARK-27862 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.3, 3.0.0 Reporter: Izek Greenfield It would be very good to upgrade to a newer version -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27745) build/mvn take wrong scala version when compile for scala 2.12
Izek Greenfield created SPARK-27745: --- Summary: build/mvn take wrong scala version when compile for scala 2.12 Key: SPARK-27745 URL: https://issues.apache.org/jira/browse/SPARK-27745 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.3 Reporter: Izek Greenfield In `build/mvn`, this line: local scala_binary_version=`grep "scala.binary.version" "${_DIR}/../pom.xml" | head -n1 | awk -F '[<>]' '{print $3}'` greps the pom, where the default scala.binary.version is 2.11, so even if I set -Pscala-2.12 it still picks up 2.11. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820058#comment-16820058 ] Izek Greenfield commented on SPARK-23904: - [~DaveDeCaprio] Does that PR go into 2.4.1 release? > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I create a question in > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark create the text representation of query in any case even if I don't > need it. > That causes many garbage object and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10746) count ( distinct columnref) over () returns wrong result set
[ https://issues.apache.org/jira/browse/SPARK-10746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790428#comment-16790428 ] Izek Greenfield edited comment on SPARK-10746 at 3/12/19 10:41 AM: --- you can implement that by using: {code:scala} import org.apache.spark.sql.functions._ size(collect_set(column).over(window)) {code} was (Author: igreenfi): you can implement that by using: {code:java} // Some comments here size(collect_set(column).over(window)) {code} > count ( distinct columnref) over () returns wrong result set > > > Key: SPARK-10746 > URL: https://issues.apache.org/jira/browse/SPARK-10746 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: N Campbell >Priority: Major > > Same issue as report against HIVE (HIVE-9534) > Result set was expected to contain 5 rows instead of 1 row as others vendors > (ORACLE, Netezza etc) would. > select count( distinct column) over () from t1 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
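For completeness, here is a self-contained sketch of the workaround above (the local session and toy data are assumptions for illustration):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val t1 = Seq(1, 2, 2, 3, 5).toDF("c1")

// An empty window spec corresponds to OVER (): one frame over the whole result set.
val w = Window.partitionBy()

// count(distinct c1) over () expressed as the size of the collected set;
// every input row keeps its own output row (here: 5 rows, each showing 4).
t1.select($"c1", size(collect_set($"c1").over(w)).as("distinct_c1")).show()
{code}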
[jira] [Commented] (SPARK-10746) count ( distinct columnref) over () returns wrong result set
[ https://issues.apache.org/jira/browse/SPARK-10746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790428#comment-16790428 ] Izek Greenfield commented on SPARK-10746: - you can implement that by using: {code:java} // Some comments here size(collect_set(column).over(window)) {code} > count ( distinct columnref) over () returns wrong result set > > > Key: SPARK-10746 > URL: https://issues.apache.org/jira/browse/SPARK-10746 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: N Campbell >Priority: Major > > Same issue as report against HIVE (HIVE-9534) > Result set was expected to contain 5 rows instead of 1 row as others vendors > (ORACLE, Netezza etc) would. > select count( distinct column) over () from t1 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745454#comment-16745454 ] Izek Greenfield commented on SPARK-23904: - [~staslos] The plan description is created in any case by this code: `queryExecution.toString`

{code:scala}
object SQLExecution {
  ...
    withSQLConfPropagated(sparkSession) {
      sc.listenerBus.post(SparkListenerSQLExecutionStart(
        executionId, callSite.shortForm, callSite.longForm, queryExecution.toString,
        SparkPlanInfo.fromSparkPlan(queryExecution.executedPlan), System.currentTimeMillis()))
      try {
        body
      } finally {
        sc.listenerBus.post(SparkListenerSQLExecutionEnd(
          executionId, System.currentTimeMillis()))
      }
    }
  } finally {
    executionIdToQueryExecution.remove(executionId)
    sc.setLocalProperty(EXECUTION_ID_KEY, oldExecutionId)
  }
  ...
}
{code}

So the new PR will solve one issue: memory usage. But the walk down the tree to build the unneeded string will still happen and waste execution time... can't the description be created lazily, only when it is needed? > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I create a question in > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark create the text representation of query in any case even if I don't > need it. > That causes many garbage object and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
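To make the lazy alternative concrete, here is a minimal sketch (the types and names are hypothetical, not Spark's actual listener API): pass the description by name and memoize it, so the expensive queryExecution.toString runs only if some consumer actually reads it.

{code:scala}
// Hypothetical event type: `description` is a by-name parameter, so nothing is
// computed when the event is constructed; `lazy val` memoizes on first access.
class LazySQLExecutionStart(val executionId: Long, description: => String) {
  lazy val planDescription: String = description
}

// Stand-in for listenerBus.post:
def post(event: LazySQLExecutionStart): Unit = ()

// Building and posting the event is now cheap, even for a huge plan:
val event = new LazySQLExecutionStart(42L, { println("building plan string"); "plan..." })
post(event)               // prints nothing; the description block never ran
// event.planDescription  // only this access would build the string
{code}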
[jira] [Commented] (SPARK-25094) proccesNext() failed to compile size is over 64kb
[ https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578275#comment-16578275 ] Izek Greenfield commented on SPARK-25094: - Looking at the code, the problem is here:

{code:scala}
def splitExpressionsWithCurrentInputs(
    expressions: Seq[String],
    funcName: String = "apply",
    extraArguments: Seq[(String, String)] = Nil,
    returnType: String = "void",
    makeSplitFunction: String => String = identity,
    foldFunctions: Seq[String] => String = _.mkString("", ";\n", ";")): String = {
  // TODO: support whole stage codegen
  if (INPUT_ROW == null || currentVars != null) {
    expressions.mkString("\n")
  } else {
    splitExpressions(
      expressions,
      funcName,
      ("InternalRow", INPUT_ROW) +: extraArguments,
      returnType,
      makeSplitFunction,
      foldFunctions)
  }
}
{code}

The TODO section is the problem: under whole-stage codegen the expressions are only concatenated, never split into separate methods, so processNext() can grow past the 64 KB limit. > proccesNext() failed to compile size is over 64kb > - > > Key: SPARK-25094 > URL: https://issues.apache.org/jira/browse/SPARK-25094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Major > Attachments: generated_code.txt > > > I have this tree: > 2018-08-12T07:14:31,289 WARN [] > org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen > disabled for plan (id=1): > *(1) Project [, ... 10 more fields] > +- *(1) Filter NOT exposure_calc_method#10141 IN > (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES) >+- InMemoryTableScan [, ... 11 more fields], [NOT > exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)] > +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, > deserialized, 1 replicas) >+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner > :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0 > : +- Exchange(coordinator id: 1456511137) > UnknownPartitioning(9), coordinator[target post-shuffle partition size: > 67108864] > : +- *(1) Project [, ... 6 more fields] > :+- *(1) Filter (isnotnull(v#49) && > isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) > && (v#49 = DATA_REG)) && isnotnull(unique_id#39)) > : +- InMemoryTableScan [, ... 6 more fields], [, > ... 6 more fields] > : +- InMemoryRelation [, ... 6 more > fields], StorageLevel(memory, deserialized, 1 replicas) > : +- *(1) FileScan csv [,... 6 more > fields] , ... 6 more fields > +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0 > +- Exchange(coordinator id: 1456511137) > UnknownPartitioning(9), coordinator[target post-shuffle partition size: > 67108864] > +- *(3) Project [, ... 74 more fields] >+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 > <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54)) > +- InMemoryTableScan [, ... 74 more fields], [, > ... 4 more fields] > +- InMemoryRelation [, ... 74 more > fields], StorageLevel(memory, deserialized, 1 replicas) > +- *(1) FileScan csv [,... 74 more > fields] , ... 6 more fields > Compiling "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" > grows beyond 64 KB > and the generated code failed to compile. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25094) proccesNext() failed to compile size is over 64kb
[ https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578127#comment-16578127 ] Izek Greenfield edited comment on SPARK-25094 at 8/13/18 1:03 PM: -- the code that creates this plan is very complex. I will try to reproduce it in simple code in the meanwhile I can attach the generated code so you can see the problem is that the code does not create functions and inline all the Plan into the processNext method. [^generated_code.txt] it contains 2 DataFrames on with 80 columns 10 of them built from `case when` expressions: like that: CASE WHEN (`predefined_hc` IS NOT NULL) THEN '/Predefined_hc/' WHEN (`zero_volatility_adj_ind` = 'Y') THEN '/Zero_Haircuts_cases/' WHEN (`collateral_allocation_method` = 'FCSM') THEN '/FCSM_Collaterals/' WHEN `underlying_type` = 'DEBT') AND (`issuer_type` = 'CGVT')) AND ((`instrument_cqs_st` <= 4) OR ((`instrument_cqs_st` = 7) AND (`instrument_cqs_lt` <= 4 AND (`residual_maturity_instrument` <= 12.0D)) THEN '/Debt/Central_Government_Issuer/Eligible/res_mat_1Y/' WHEN `underlying_type` = 'DEBT') AND (`issuer_type` = 'CGVT')) AND ((`instrument_cqs_st` <= 4) OR ((`instrument_cqs_st` = 7) AND (`instrument_cqs_lt` <= 4 AND (`residual_maturity_instrument` <= 60.0D)) THEN '/Debt/Central_Government_Issuer/Eligible/res_mat_5Y/' WHEN (((`underlying_type` = 'DEBT') AND (`issuer_type` = 'CGVT')) AND ((`instrument_cqs_st` <= 4) OR ((`instrument_cqs_st` = 7) AND (`instrument_cqs_lt` <= 4 THEN '/Debt/Central_Government_Issuer/Eligible/res_mat_G5/' WHEN ((`underlying_type` = 'DEBT') AND (`issuer_type` = 'CGVT')) THEN '/Debt/Central_Government_Issuer/Non_Eligible/' WHEN `underlying_type` = 'DEBT') AND (`issuer_type` IN ('INST', 'CORP', 'PSE', 'RGLA', 'IO_LISTED'))) AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) AND (`instrument_cqs_lt` <= 3 AND (`residual_maturity_instrument` <= 12.0D)) THEN '/Debt/Other_Issuers/Eligible/res_mat_1Y/' WHEN `underlying_type` = 'DEBT') AND (`issuer_type` IN ('INST', 'CORP', 'PSE', 'RGLA', 'IO_LISTED'))) AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) AND (`instrument_cqs_lt` <= 3 AND (`residual_maturity_instrument` <= 60.0D)) THEN '/Debt/Other_Issuers/Eligible/res_mat_5Y/' WHEN (((`underlying_type` = 'DEBT') AND (`issuer_type` IN ('INST', 'CORP', 'PSE', 'RGLA', 'IO_LISTED'))) AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) AND (`instrument_cqs_lt` <= 3 THEN '/Debt/Other_Issuers/Eligible/res_mat_G5/' WHEN ((`underlying_type` = 'DEBT') AND (`issuer_type` IN ('INST', 'CORP', 'PSE', 'RGLA', 'IO_LISTED'))) THEN '/Debt/Other_Issuers/Non_Eligible/' WHEN (((`underlying_type` = 'SECURITISATION') AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) AND (`instrument_cqs_lt` <= 3 AND (`residual_maturity_instrument` <= 12.0D)) THEN '/Securitisation/Eligible/res_mat_1Y/' WHEN (((`underlying_type` = 'SECURITISATION') AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) AND (`instrument_cqs_lt` <= 3 AND (`residual_maturity_instrument` <= 60.0D)) THEN '/Securitisation/Eligible/res_mat_5Y/' WHEN ((`underlying_type` = 'SECURITISATION') AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) AND (`instrument_cqs_lt` <= 3 THEN '/Securitisation/Eligible/res_mat_G5/' WHEN (`underlying_type` = 'SECURITISATION') THEN '/Securitisation/Non_Eligible/' WHEN ((`underlying_type` IN ('EQUITY', 'MAIN_INDEX_EQUITY', 'COMMODITY', 'NON_ELIGIBLE_SECURITY')) AND (`underlying_type` = 'MAIN_INDEX_EQUITY')) THEN 
'/Other_Securities/Main_index/' WHEN (`underlying_type` IN ('EQUITY', 'MAIN_INDEX_EQUITY', 'COMMODITY', 'NON_ELIGIBLE_SECURITY')) THEN '/Other_Securities/Others/' WHEN (`underlying_type` = 'CASH') THEN '/Cash/' WHEN (`underlying_type` = 'GOLD') THEN '/Gold/' WHEN (`underlying_type` = 'CIU') THEN '/CIU/' WHEN true THEN '/Others/' END AS `108_0___Portfolio_CRD4_Art_224_Volatility_Adjustments_Codespath_CRD4_Art_224_Volatil` was (Author: igreenfi): the code that creates this plan is very complex. I will try to reproduce it in simple code in the meanwhile I can attach the generated code so you can see the problem is that the code does not create functions and inline all the Plan into the processNext method. [^generated_code.txt] > proccesNext() failed to compile size is over 64kb > - > > Key: SPARK-25094 > URL: https://issues.apache.org/jira/browse/SPARK-25094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Major > Attachments: generated_code.txt > > > I have this tree: > 2018-08-12T07:14:31,289 WARN [] > org.apache.spark.sql.execution.WholeStageCo
[jira] [Commented] (SPARK-25094) proccesNext() failed to compile size is over 64kb
[ https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578127#comment-16578127 ] Izek Greenfield commented on SPARK-25094: - The code that creates this plan is very complex. I will try to reproduce it in simple code; in the meanwhile I have attached the generated code so you can see it. The problem is that the generated code does not create functions and inlines the whole plan into the processNext method. [^generated_code.txt] > proccesNext() failed to compile size is over 64kb > - > > Key: SPARK-25094 > URL: https://issues.apache.org/jira/browse/SPARK-25094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Major > Attachments: generated_code.txt > > > I have this tree: > 2018-08-12T07:14:31,289 WARN [] > org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen > disabled for plan (id=1): > *(1) Project [, ... 10 more fields] > +- *(1) Filter NOT exposure_calc_method#10141 IN > (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES) >+- InMemoryTableScan [, ... 11 more fields], [NOT > exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)] > +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, > deserialized, 1 replicas) >+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner > :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0 > : +- Exchange(coordinator id: 1456511137) > UnknownPartitioning(9), coordinator[target post-shuffle partition size: > 67108864] > : +- *(1) Project [, ... 6 more fields] > :+- *(1) Filter (isnotnull(v#49) && > isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) > && (v#49 = DATA_REG)) && isnotnull(unique_id#39)) > : +- InMemoryTableScan [, ... 6 more fields], [, > ... 6 more fields] > : +- InMemoryRelation [, ... 6 more > fields], StorageLevel(memory, deserialized, 1 replicas) > : +- *(1) FileScan csv [,... 6 more > fields] , ... 6 more fields > +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0 > +- Exchange(coordinator id: 1456511137) > UnknownPartitioning(9), coordinator[target post-shuffle partition size: > 67108864] > +- *(3) Project [, ... 74 more fields] >+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 > <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54)) > +- InMemoryTableScan [, ... 74 more fields], [, > ... 4 more fields] > +- InMemoryRelation [, ... 74 more > fields], StorageLevel(memory, deserialized, 1 replicas) > +- *(1) FileScan csv [,... 74 more > fields] , ... 6 more fields > Compiling "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" > grows beyond 64 KB > and the generated code failed to compile. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25094) proccesNext() failed to compile size is over 64kb
[ https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-25094: Attachment: generated_code.txt > proccesNext() failed to compile size is over 64kb > - > > Key: SPARK-25094 > URL: https://issues.apache.org/jira/browse/SPARK-25094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Major > Attachments: generated_code.txt > > > I have this tree: > 2018-08-12T07:14:31,289 WARN [] > org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen > disabled for plan (id=1): > *(1) Project [, ... 10 more fields] > +- *(1) Filter NOT exposure_calc_method#10141 IN > (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES) >+- InMemoryTableScan [, ... 11 more fields], [NOT > exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)] > +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, > deserialized, 1 replicas) >+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner > :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0 > : +- Exchange(coordinator id: 1456511137) > UnknownPartitioning(9), coordinator[target post-shuffle partition size: > 67108864] > : +- *(1) Project [, ... 6 more fields] > :+- *(1) Filter (isnotnull(v#49) && > isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) > && (v#49 = DATA_REG)) && isnotnull(unique_id#39)) > : +- InMemoryTableScan [, ... 6 more fields], [, > ... 6 more fields] > : +- InMemoryRelation [, ... 6 more > fields], StorageLevel(memory, deserialized, 1 replicas) > : +- *(1) FileScan csv [,... 6 more > fields] , ... 6 more fields > +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0 > +- Exchange(coordinator id: 1456511137) > UnknownPartitioning(9), coordinator[target post-shuffle partition size: > 67108864] > +- *(3) Project [, ... 74 more fields] >+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 > <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54)) > +- InMemoryTableScan [, ... 74 more fields], [, > ... 4 more fields] > +- InMemoryRelation [, ... 74 more > fields], StorageLevel(memory, deserialized, 1 replicas) > +- *(1) FileScan csv [,... 74 more > fields] , ... 6 more fields > Compiling "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" > grows beyond 64 KB > and the generated code failed to compile. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25094) proccesNext() failed to compile size is over 64kb
[ https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578037#comment-16578037 ] Izek Greenfield commented on SPARK-25094: - [~hyukjin.kwon] Is the full plan OK? > proccesNext() failed to compile size is over 64kb > - > > Key: SPARK-25094 > URL: https://issues.apache.org/jira/browse/SPARK-25094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Major > > I have this tree: > 2018-08-12T07:14:31,289 WARN [] > org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen > disabled for plan (id=1): > *(1) Project [, ... 10 more fields] > +- *(1) Filter NOT exposure_calc_method#10141 IN > (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES) >+- InMemoryTableScan [, ... 11 more fields], [NOT > exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)] > +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, > deserialized, 1 replicas) >+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner > :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0 > : +- Exchange(coordinator id: 1456511137) > UnknownPartitioning(9), coordinator[target post-shuffle partition size: > 67108864] > : +- *(1) Project [, ... 6 more fields] > :+- *(1) Filter (isnotnull(v#49) && > isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) > && (v#49 = DATA_REG)) && isnotnull(unique_id#39)) > : +- InMemoryTableScan [, ... 6 more fields], [, > ... 6 more fields] > : +- InMemoryRelation [, ... 6 more > fields], StorageLevel(memory, deserialized, 1 replicas) > : +- *(1) FileScan csv [,... 6 more > fields] , ... 6 more fields > +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0 > +- Exchange(coordinator id: 1456511137) > UnknownPartitioning(9), coordinator[target post-shuffle partition size: > 67108864] > +- *(3) Project [, ... 74 more fields] >+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 > <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54)) > +- InMemoryTableScan [, ... 74 more fields], [, > ... 4 more fields] > +- InMemoryRelation [, ... 74 more > fields], StorageLevel(memory, deserialized, 1 replicas) > +- *(1) FileScan csv [,... 74 more > fields] , ... 6 more fields > Compiling "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" > grows beyond 64 KB > and the generated code failed to compile. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25094) proccesNext() failed to compile size is over 64kb
Izek Greenfield created SPARK-25094: --- Summary: proccesNext() failed to compile size is over 64kb Key: SPARK-25094 URL: https://issues.apache.org/jira/browse/SPARK-25094 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Izek Greenfield I have this tree: 2018-08-12T07:14:31,289 WARN [] org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen disabled for plan (id=1): *(1) Project [, ... 10 more fields] +- *(1) Filter NOT exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES) +- InMemoryTableScan [, ... 11 more fields], [NOT exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)] +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, deserialized, 1 replicas) +- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0 : +- Exchange(coordinator id: 1456511137) UnknownPartitioning(9), coordinator[target post-shuffle partition size: 67108864] : +- *(1) Project [, ... 6 more fields] :+- *(1) Filter (isnotnull(v#49) && isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) && (v#49 = DATA_REG)) && isnotnull(unique_id#39)) : +- InMemoryTableScan [, ... 6 more fields], [, ... 6 more fields] : +- InMemoryRelation [, ... 6 more fields], StorageLevel(memory, deserialized, 1 replicas) : +- *(1) FileScan csv [,... 6 more fields] , ... 6 more fields +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0 +- Exchange(coordinator id: 1456511137) UnknownPartitioning(9), coordinator[target post-shuffle partition size: 67108864] +- *(3) Project [, ... 74 more fields] +- *(3) Filter (((isnotnull(v#51) && (asof_date#42 <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54)) +- InMemoryTableScan [, ... 74 more fields], [, ... 4 more fields] +- InMemoryRelation [, ... 74 more fields], StorageLevel(memory, deserialized, 1 replicas) +- *(1) FileScan csv [,... 74 more fields] , ... 6 more fields Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1" grows beyond 64 KB and the generated code failed to compile. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
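A hedged stopgap for SPARK-25094, not a fix for the root cause: whole-stage codegen can be disabled so the plan runs operator by operator instead of being fused into one giant processNext() method (assumes an active SparkSession named `spark`).

{code:scala}
// Workaround sketch: disable whole-stage codegen for the current session.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}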
[jira] [Created] (SPARK-25093) CodeFormatter could avoid creating regex object again and again
Izek Greenfield created SPARK-25093: --- Summary: CodeFormatter could avoid creating regex object again and again Key: SPARK-25093 URL: https://issues.apache.org/jira/browse/SPARK-25093 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: Izek Greenfield In class `CodeFormatter`, the method `stripExtraNewLinesAndComments` could be refactored to:

{code:scala}
// Compiled once at class load, instead of on every call:
val commentReg = ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" + // strip /*comment*/
  """([ |\t]*?\/\/[\s\S]*?\n)""").r                            // strip //comment
val emptyRowsReg = """\n\s*\n""".r

def stripExtraNewLinesAndComments(input: String): String = {
  val codeWithoutComment = commentReg.replaceAllIn(input, "")
  emptyRowsReg.replaceAllIn(codeWithoutComment, "\n") // strip extra new lines
}
{code}

so the regexes would be compiled only once. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
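A quick usage sketch of the proposed helper (the input string is made up; assumes the definitions above are in scope):

{code:scala}
val input = "int a = 1; /* inline note */\n\n// full-line comment\nint b = 2;\n"
// Both comment styles are stripped and the blank run collapses to one newline:
println(stripExtraNewLinesAndComments(input))
// int a = 1;
// int b = 2;
{code}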
[jira] [Updated] (SPARK-25028) AnalyzePartitionCommand failed with NPE if value is null
[ https://issues.apache.org/jira/browse/SPARK-25028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-25028: Description: on line 143: val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString) If the value is NULL the code will fail with NPE *sample:* {code:scala} val df = List((1, null , "first"), (2, null , "second")).toDF("index", "name", "value").withColumn("name", $"name".cast("string")) df.write.partitionBy("name").saveAsTable("df13") spark.sql("ANALYZE TABLE df13 PARTITION (name) COMPUTE STATISTICS") {code} output: 2018-08-08 09:25:43 WARN BaseSessionStateBuilder$$anon$1:66 - Max iterations (2) reached for batch Resolution java.lang.NullPointerException at org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143) at org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.Range.foreach(Range.scala:160) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:143) at org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:142) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.execution.command.AnalyzePartitionCommand.calculateRowCountsPerPartition(AnalyzePartitionCommand.scala:142) at org.apache.spark.sql.execution.command.AnalyzePartitionCommand.run(AnalyzePartitionCommand.scala:104) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253) at org.apache.spark.sql.Dataset.(Dataset.scala:190) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641) ... 
49 elided was: on line 143: val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString) If the value is NULL the code will fail with NPE sample: val df = List((1, null , "first"), (2, null , "second")).toDF("index", "name", "value").withColumn("name", $"name".cast("string")) df.write.partitionBy("name").saveAsTable("df13") spark.sql("ANALYZE TABLE df13 PARTITION (name) COMPUTE STATISTICS") output: 2018-08-08 09:25:43 WARN BaseSessionStateBuilder$$anon$1:66 - Max iterations (2) reached for batch Resolution java.lang.NullPointerException at org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143) at org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.Range.foreach(Range.scala:160) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:143) at org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:142) at scala.collecti
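To make the failure mode on line 143 concrete, a small generic sketch (plain Scala, not the actual Spark internals; the sentinel value is an assumption borrowed from Hive's convention for null partitions):

{code:scala}
val partitionValues: Seq[Any] = Seq("first", null, "second")

// Mirrors the failing pattern: calling toString on a null partition value.
// partitionValues.map(_.toString)  // throws java.lang.NullPointerException

// A null-safe variant: wrap in Option and substitute a sentinel for nulls.
val safe = partitionValues.map(v =>
  Option(v).map(_.toString).getOrElse("__HIVE_DEFAULT_PARTITION__"))
println(safe) // List(first, __HIVE_DEFAULT_PARTITION__, second)
{code}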
[jira] [Reopened] (SPARK-25028) AnalyzePartitionCommand failed with NPE if value is null
[ https://issues.apache.org/jira/browse/SPARK-25028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield reopened SPARK-25028: - Add sample code > AnalyzePartitionCommand failed with NPE if value is null > > > Key: SPARK-25028 > URL: https://issues.apache.org/jira/browse/SPARK-25028 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Major > > on line 143: val partitionColumnValues = > partitionColumns.indices.map(r.get(_).toString) > If the value is NULL the code will fail with NPE > sample: > val df = List((1, null , "first"), (2, null , "second")).toDF("index", > "name", "value").withColumn("name", $"name".cast("string")) > df.write.partitionBy("name").saveAsTable("df13") > spark.sql("ANALYZE TABLE df13 PARTITION (name) COMPUTE STATISTICS") > output: > 2018-08-08 09:25:43 WARN BaseSessionStateBuilder$$anon$1:66 - Max iterations > (2) reached for batch Resolution > java.lang.NullPointerException > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143) > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:143) > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:142) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand.calculateRowCountsPerPartition(AnalyzePartitionCommand.scala:142) > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand.run(AnalyzePartitionCommand.scala:104) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253) > at org.apache.spark.sql.Dataset.(Dataset.scala:190) > at 
org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641) > ... 49 elided -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25028) AnalyzePartitionCommand failed with NPE if value is null
[ https://issues.apache.org/jira/browse/SPARK-25028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-25028: Description: on line 143: val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString) If the value is NULL the code will fail with NPE sample: val df = List((1, null , "first"), (2, null , "second")).toDF("index", "name", "value").withColumn("name", $"name".cast("string")) df.write.partitionBy("name").saveAsTable("df13") spark.sql("ANALYZE TABLE df13 PARTITION (name) COMPUTE STATISTICS") output: 2018-08-08 09:25:43 WARN BaseSessionStateBuilder$$anon$1:66 - Max iterations (2) reached for batch Resolution java.lang.NullPointerException at org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143) at org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.Range.foreach(Range.scala:160) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:143) at org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:142) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.execution.command.AnalyzePartitionCommand.calculateRowCountsPerPartition(AnalyzePartitionCommand.scala:142) at org.apache.spark.sql.execution.command.AnalyzePartitionCommand.run(AnalyzePartitionCommand.scala:104) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253) at org.apache.spark.sql.Dataset.(Dataset.scala:190) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641) ... 
49 elided was: on line 143: val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString) if the value is NULL the code will fail with NPE > AnalyzePartitionCommand failed with NPE if value is null > > > Key: SPARK-25028 > URL: https://issues.apache.org/jira/browse/SPARK-25028 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Major > > on line 143: val partitionColumnValues = > partitionColumns.indices.map(r.get(_).toString) > If the value is NULL the code will fail with NPE > sample: > val df = List((1, null , "first"), (2, null , "second")).toDF("index", > "name", "value").withColumn("name", $"name".cast("string")) > df.write.partitionBy("name").saveAsTable("df13") > spark.sql("ANALYZE TABLE df13 PARTITION (name) COMPUTE STATISTICS") > output: > 2018-08-08 09:25:43 WARN BaseSessionStateBuilder$$anon$1:66 - Max iterations > (2) reached for batch Resolution > java.lang.NullPointerException > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143) > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfu
[jira] [Created] (SPARK-25028) AnalyzePartitionCommand failed with NPE if value is null
Izek Greenfield created SPARK-25028: --- Summary: AnalyzePartitionCommand failed with NPE if value is null Key: SPARK-25028 URL: https://issues.apache.org/jira/browse/SPARK-25028 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Izek Greenfield On line 143: val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString) If the value is NULL, the code will fail with an NPE. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13346) Using DataFrames iteratively leads to slow query planning
[ https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564885#comment-16564885 ] Izek Greenfield commented on SPARK-13346: - What is the status of this? We face this issue too! > Using DataFrames iteratively leads to slow query planning > - > > Key: SPARK-13346 > URL: https://issues.apache.org/jira/browse/SPARK-13346 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley >Priority: Major > > I have an iterative algorithm based on DataFrames, and the query plan grows > very quickly with each iteration. Caching the current DataFrame at the end > of an iteration does not fix the problem. However, converting the DataFrame > to an RDD and back at the end of each iteration does fix the problem. > Printing the query plans shows that the plan explodes quickly (10 lines, to > several hundred lines, to several thousand lines, ...) with successive > iterations. > The desired behavior is for the analyzer to recognize that a big chunk of the > query plan does not need to be computed since it is already cached. The > computation on each iteration should be the same. > If useful, I can push (complex) code to reproduce the issue. But it should > be simple to see if you create an iterative algorithm which produces a new > DataFrame from an old one on each iteration. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
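For reference, a minimal sketch of the RDD round-trip workaround described in the issue (toy data; a local session is an assumption for illustration):

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

var df: DataFrame = Seq(1, 2, 3).toDF("v")
for (_ <- 1 to 50) {
  df = df.withColumn("v", col("v") * 2)
  // Rebuilding the DataFrame from its RDD truncates the ever-growing
  // logical plan, so planning cost stays flat across iterations.
  df = spark.createDataFrame(df.rdd, df.schema)
}
println(df.count()) // forces execution; still 3 rows
{code}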
[jira] [Created] (SPARK-24831) Spark Executor class create new serializer each time
Izek Greenfield created SPARK-24831: --- Summary: Spark Executor class create new serializer each time Key: SPARK-24831 URL: https://issues.apache.org/jira/browse/SPARK-24831 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: Izek Greenfield When I looked at the code I found this line (org.apache.spark.executor.Executor:319): val resultSer = env.serializer.newInstance() Why not hold this in a ThreadLocal and reuse it? [question in StackOverflow|https://stackoverflow.com/questions/51298877/spark-executor-class-create-new-serializer-each-time] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
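A hedged sketch of the suggested reuse (stand-in types, not Spark's real classes): serializer instances are generally not shared across threads, and each task runs on a single thread, so a per-thread cache would keep correctness while avoiding one allocation per task.

{code:scala}
object ThreadLocalSerializerSketch {
  trait SerializerInstance                        // stand-in for Spark's type
  class Serializer {
    def newInstance(): SerializerInstance = new SerializerInstance {}
  }
  val serializer = new Serializer                 // stand-in for env.serializer

  private val cachedSer = new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance = serializer.newInstance()
  }

  // was: val resultSer = env.serializer.newInstance()  (a new object per task)
  def resultSer: SerializerInstance = cachedSer.get() // one per thread, reused
}
{code}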
[jira] [Commented] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524720#comment-16524720 ] Izek Greenfield commented on SPARK-23904: - [~RBerenguel] Have you find something? BTW your workaround help a lot! thanks for that. > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I create a question in > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark create the text representation of query in any case even if I don't > need it. > That causes many garbage object and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499315#comment-16499315 ] Izek Greenfield edited comment on SPARK-23904 at 6/3/18 6:11 AM: - [~RBerenguel] Did you remember to run with the flag `-XX:+UseG1GC`? I think this is part of the problem because of `Humongous objects`. was (Author: igreenfi): [~RBerenguel] Did you remember to run with this flag `-XX:+UseG1GC` I think think this is part of the problem because of the `homogenous regions` > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I create a question in > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark create the text representation of query in any case even if I don't > need it. > That causes many garbage object and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499315#comment-16499315 ] Izek Greenfield commented on SPARK-23904: - [~RBerenguel] Did you remember to run with this flag `-XX:+UseG1GC`? I think this is part of the problem because of the `homogenous regions` > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I create a question in > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark create the text representation of query in any case even if I don't > need it. > That causes many garbage object and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496272#comment-16496272 ] Izek Greenfield edited comment on SPARK-23904 at 5/31/18 8:43 AM: -- [~RBerenguel] Class: SQLExecution Method: withNewExecutionId Line: 73 {code:scala} sparkSession.sparkContext.listenerBus.post(SparkListenerSQLExecutionStart( executionId, callSite.shortForm, callSite.longForm, queryExecution.toString, SparkPlanInfo.fromSparkPlan(queryExecution.executedPlan), System.currentTimeMillis())) {code} was (Author: igreenfi): [~RBerenguel] Class: SQLExecution Method: withNewExecutionId Line: 73 {code:scala} sparkSession.sparkContext.listenerBus.post(SparkListenerSQLExecutionStart( executionId, callSite.shortForm, callSite.longForm, "queryExecution.toString", SparkPlanInfo.fromSparkPlan(queryExecution.executedPlan), System.currentTimeMillis())) {code} > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I create a question in > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark create the text representation of query in any case even if I don't > need it. > That causes many garbage object and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496272#comment-16496272 ] Izek Greenfield commented on SPARK-23904: - [~RBerenguel] Class: SQLExecution Method: withNewExecutionId Line: 73 {code:scala} sparkSession.sparkContext.listenerBus.post(SparkListenerSQLExecutionStart( executionId, callSite.shortForm, callSite.longForm, "queryExecution.toString", SparkPlanInfo.fromSparkPlan(queryExecution.executedPlan), System.currentTimeMillis())) {code} > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I create a question in > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark create the text representation of query in any case even if I don't > need it. > That causes many garbage object and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494701#comment-16494701 ] Izek Greenfield commented on SPARK-23904: - [~RBerenguel] Regarding `setting completeString to no-op`: what do you mean by this? After I comment out the string generation in the code, I no longer get the OOM. > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I create a question in > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark create the text representation of query in any case even if I don't > need it. > That causes many garbage object and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493323#comment-16493323 ] Izek Greenfield commented on SPARK-23904: - [~RBerenguel] Did you manage to reproduce it? On my side it has become a major issue: I compiled version 2.3 with the string generation commented out and got a very big performance boost. > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I create a question in > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark create the text representation of query in any case even if I don't > need it. > That causes many garbage object and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473827#comment-16473827 ] Izek Greenfield commented on SPARK-23904: - [~RBerenguel] Any update on that? > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I create a question in > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark create the text representation of query in any case even if I don't > need it. > That causes many garbage object and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438591#comment-16438591 ] Izek Greenfield commented on SPARK-23904: - [~viirya] this does not help because Spark creates these strings when it notifies the bus... > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I create a question in > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark create the text representation of query in any case even if I don't > need it. > That causes many garbage object and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23904) Big execution plan cause OOM
[ https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Izek Greenfield updated SPARK-23904: Description: I created a question on [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] Spark creates the text representation of the query in any case, even if I don't need it. That creates many garbage objects and unneeded GC... [Gist with code to reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] was: I create a question in [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] Spark create the text representation of query in any case even if I don't need it. That causes many garbage object and unneeded GC... > Big execution plan cause OOM > > > Key: SPARK-23904 > URL: https://issues.apache.org/jira/browse/SPARK-23904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Izek Greenfield >Priority: Major > Labels: SQL, query > > I create a question in > [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] > > Spark create the text representation of query in any case even if I don't > need it. > That causes many garbage object and unneeded GC... > [Gist with code to > reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23904) Big execution plan cause OOM
Izek Greenfield created SPARK-23904: --- Summary: Big execution plan cause OOM Key: SPARK-23904 URL: https://issues.apache.org/jira/browse/SPARK-23904 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Reporter: Izek Greenfield I created a question on [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big] Spark creates the text representation of the query in any case, even if I don't need it. That creates many garbage objects and unneeded GC... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org