[jira] [Updated] (SPARK-47085) Performance issue on thrift API

2024-02-20 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-47085:

Description: 
This new complexity was introduced in SPARK-39041.

Before this PR, the code was:
{code:java}
while (curRow < maxRows && iter.hasNext) {
  val sparkRow = iter.next()
  val row = ArrayBuffer[Any]()
  var curCol = 0
  while (curCol < sparkRow.length) {
if (sparkRow.isNullAt(curCol)) {
  row += null
} else {
  addNonNullColumnValue(sparkRow, row, curCol, timeFormatters)
}
curCol += 1
  }
  resultRowSet.addRow(row.toArray.asInstanceOf[Array[Object]])
  curRow += 1
}{code}
This loop did not have the _*O(n^2)*_ complexity, so this change just returns the state 
to what it was before.

 

In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
{code:scala}
...
 while (i < rowSize) {
  val row = rows(i)
  ...
{code}
It can be easily converted back into _*O( n )*_ complexity.
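For illustration, a minimal sketch (not the actual `RowSetUtils` code) of why the indexed loop is quadratic when `rows` is a `List`-backed `Seq`, while a single pass over the same data stays linear:
{code:scala}
// Hedged sketch only; the real code does more conversion work per row.
def quadratic(rows: Seq[String]): Unit = {
  var i = 0
  val rowSize = rows.length
  while (i < rowSize) {
    val row = rows(i) // indexed lookup is O(i) on a List-backed Seq => O(n^2) overall
    println(row)      // stand-in for the per-row conversion work
    i += 1
  }
}

def linear(rows: Seq[String]): Unit = {
  // a single pass touches each element once => O( n ) overall
  rows.foreach(row => println(row))
}
{code}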

 

 

  was:
This new complexity was introduced in SPARK-39041.

Before this PR, the code was:


{code:java}
def toTTableSchema(schema: StructType): TTableSchema = {
  val tTableSchema = new TTableSchema()
  schema.zipWithIndex.foreach { case (f, i) =>
tTableSchema.addToColumns(toTColumnDesc(f, i))
  }
  tTableSchema
} {code}
This was a simple foreach without the _*O(n^2)*_ complexity, so this change just returns 
the state to what it was before.

 

In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
{code:scala}
...
 while (i < rowSize) {
  val row = rows(i)
  ...
{code}
It can be easily converted back into _*O( n )*_ complexity.

 

 


> Performance issue on thrift API
> ---
>
> Key: SPARK-47085
> URL: https://issues.apache.org/jira/browse/SPARK-47085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Izek Greenfield
>Assignee: Izek Greenfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This new complexity was introduced in SPARK-39041.
> Before this PR, the code was:
> {code:java}
> while (curRow < maxRows && iter.hasNext) {
>   val sparkRow = iter.next()
>   val row = ArrayBuffer[Any]()
>   var curCol = 0
>   while (curCol < sparkRow.length) {
> if (sparkRow.isNullAt(curCol)) {
>   row += null
> } else {
>   addNonNullColumnValue(sparkRow, row, curCol, timeFormatters)
> }
> curCol += 1
>   }
>   resultRowSet.addRow(row.toArray.asInstanceOf[Array[Object]])
>   curRow += 1
> }{code}
> This loop did not have the _*O(n^2)*_ complexity, so this change just returns 
> the state to what it was before.
>  
> In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
> {code:scala}
> ...
>  while (i < rowSize) {
>   val row = rows(i)
>   ...
> {code}
> It can be easily converted back into _*O( n )*_ complexity.
>  
>  






[jira] [Commented] (SPARK-47085) Performance issue on thrift API

2024-02-20 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818932#comment-17818932
 ] 

Izek Greenfield commented on SPARK-47085:
-

[~dongjoon] I updated the details

> Performance issue on thrift API
> ---
>
> Key: SPARK-47085
> URL: https://issues.apache.org/jira/browse/SPARK-47085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Izek Greenfield
>Assignee: Izek Greenfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This new complexity was introduced in SPARK-39041.
> Before this PR, the code was:
> {code:java}
> def toTTableSchema(schema: StructType): TTableSchema = {
>   val tTableSchema = new TTableSchema()
>   schema.zipWithIndex.foreach { case (f, i) =>
> tTableSchema.addToColumns(toTColumnDesc(f, i))
>   }
>   tTableSchema
> } {code}
> This was a simple foreach without the _*O(n^2)*_ complexity, so this change just 
> returns the state to what it was before.
>  
> In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
> {code:scala}
> ...
>  while (i < rowSize) {
>   val row = rows(i)
>   ...
> {code}
> It can be easily converted back into _*O( n )*_ complexity.
>  
>  






[jira] [Updated] (SPARK-47085) Performance issue on thrift API

2024-02-20 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-47085:

Description: 
This new complexity was introduced in SPARK-39041.

Before this PR, the code was:


{code:java}
def toTTableSchema(schema: StructType): TTableSchema = {
  val tTableSchema = new TTableSchema()
  schema.zipWithIndex.foreach { case (f, i) =>
tTableSchema.addToColumns(toTColumnDesc(f, i))
  }
  tTableSchema
} {code}
This was a simple foreach without the _*O(n^2)*_ complexity, so this change just returns 
the state to what it was before.

 

In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
{code:scala}
...
 while (i < rowSize) {
  val row = rows(i)
  ...
{code}
It can be easily converted back into _*O( n )*_ complexity.

 

 

  was:
This new complexity was introduced in SPARK-39041.

In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
{code:scala}
...
 while (i < rowSize) {
  val row = rows(i)
  ...
{code}
It can be easily converted back into _*O( n )*_ complexity.

 

 


> Performance issue on thrift API
> ---
>
> Key: SPARK-47085
> URL: https://issues.apache.org/jira/browse/SPARK-47085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Izek Greenfield
>Assignee: Izek Greenfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This new complexity was introduced in SPARK-39041.
> Before this PR, the code was:
> {code:java}
> def toTTableSchema(schema: StructType): TTableSchema = {
>   val tTableSchema = new TTableSchema()
>   schema.zipWithIndex.foreach { case (f, i) =>
> tTableSchema.addToColumns(toTColumnDesc(f, i))
>   }
>   tTableSchema
> } {code}
> This was a simple foreach without the _*O(n^2)*_ complexity, so this change just 
> returns the state to what it was before.
>  
> In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
> {code:scala}
> ...
>  while (i < rowSize) {
>   val row = rows(i)
>   ...
> {code}
> It can be easily converted back into _*O( n )*_ complexity.
>  
>  






[jira] [Updated] (SPARK-47085) Performance issue on thrift API

2024-02-19 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-47085:

Description: 
This new complexity was introduced in SPARK-39041.

In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
{code:scala}
...
 while (i < rowSize) {
  val row = rows(i)
  ...
{code}
It can be easily converted back into _*O( n )*_ complexity.

 

 

  was:
This new complexity introduced in SPARK-39041.

In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
{code:scala}
...
 while (i < rowSize) {
  val row = rows(i)
  ...
{code}
It can be easily converted back into _*O(n)*_ complexity.

 

 


> Performance issue on thrift API
> ---
>
> Key: SPARK-47085
> URL: https://issues.apache.org/jira/browse/SPARK-47085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Izek Greenfield
>Assignee: Izek Greenfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This new complexity was introduced in SPARK-39041.
> In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
> {code:scala}
> ...
>  while (i < rowSize) {
>   val row = rows(i)
>   ...
> {code}
> It can be easily converted back into _*O( n )*_ complexity.
>  
>  






[jira] [Updated] (SPARK-47085) Performance issue on thrift API

2024-02-19 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-47085:

Description: 
This new complexity introduced in SPARK-39041.

In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
{code:scala}
...
 while (i < rowSize) {
  val row = rows(i)
  ...
{code}
It can be easily converted back into _*O(n)*_ complexity.

 

 

  was:
In class `RowSetUtils` there is a loop that has O(n^2) complexity:


{code:scala}
...
 while (i < rowSize) {
  val row = rows(i)
  ...
{code}

It can be easily converted into O( n ) complexity. 


> Performance issue on thrift API
> ---
>
> Key: SPARK-47085
> URL: https://issues.apache.org/jira/browse/SPARK-47085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Izek Greenfield
>Assignee: Izek Greenfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This new complexity introduced in SPARK-39041.
> In class `RowSetUtils` there is a loop that has _*O(n^2)*_ complexity:
> {code:scala}
> ...
>  while (i < rowSize) {
>   val row = rows(i)
>   ...
> {code}
> It can be easily converted back into _*O(n)*_ complexity.
>  
>  






[jira] [Updated] (SPARK-47085) Performance issue on thrift API

2024-02-19 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-47085:

Issue Type: Bug  (was: Improvement)

> Performance issue on thrift API
> ---
>
> Key: SPARK-47085
> URL: https://issues.apache.org/jira/browse/SPARK-47085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Izek Greenfield
>Assignee: Izek Greenfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In class `RowSetUtils` there is a loop that has O(n^2) complexity:
> {code:scala}
> ...
>  while (i < rowSize) {
>   val row = rows(i)
>   ...
> {code}
> It can be easily converted into O( n ) complexity. 






[jira] [Commented] (SPARK-47085) Performance issue on thrift API

2024-02-18 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818256#comment-17818256
 ] 

Izek Greenfield commented on SPARK-47085:
-

https://github.com/apache/spark/pull/45155

> Performance issue on thrift API
> ---
>
> Key: SPARK-47085
> URL: https://issues.apache.org/jira/browse/SPARK-47085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Izek Greenfield
>Priority: Major
>
> In class `RowSetUtils` there is a loop that has O(n^2) complexity:
> {code:scala}
> ...
>  while (i < rowSize) {
>   val row = rows(i)
>   ...
> {code}
> It can be easily converted into O( n ) complexity. 






[jira] [Created] (SPARK-47085) Performance issue on thrift API

2024-02-18 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-47085:
---

 Summary: Performance issue on thrift API
 Key: SPARK-47085
 URL: https://issues.apache.org/jira/browse/SPARK-47085
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0, 3.4.1
Reporter: Izek Greenfield


In class `RowSetUtils` there is a loop that has O(n^2) complexity:


{code:scala}
...
 while (i < rowSize) {
  val row = rows(i)
  ...
{code}

It can be easily converted into O( n ) complexity. 






[jira] [Created] (SPARK-46891) allow configuration of the statistics logical plan visitor

2024-01-28 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-46891:
---

 Summary: allow configuration of the statistics logical plan visitor
 Key: SPARK-46891
 URL: https://issues.apache.org/jira/browse/SPARK-46891
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 3.5.0
Reporter: Izek Greenfield


Allow configuration for setting the statistics plan visitor.






[jira] [Created] (SPARK-46137) Janino compiler has a new version that fixes an issue with compilation

2023-11-28 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-46137:
---

 Summary: Janino compiler has a new version that fixes an issue with 
compilation
 Key: SPARK-46137
 URL: https://issues.apache.org/jira/browse/SPARK-46137
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Izek Greenfield


Janino released a new version (3.1.11) that fixes the compilation of:
{code:java}
{
} while (false) {code}
[Link to github issue|https://github.com/janino-compiler/janino/issues/208]






[jira] [Updated] (SPARK-44472) change the external catalog thread safety way

2023-10-29 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-44472:

Labels:   (was: spark-)

> change the external catalog thread safety way
> -
>
> Key: SPARK-44472
> URL: https://issues.apache.org/jira/browse/SPARK-44472
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: add_hive_concurrent_connections.diff
>
>
> We tested changing the synchronization of the external catalog to use thread-locals instead 
> of synchronized methods.
> In our tests, it improved the runtime of parallel actions by about 45% for 
> certain workloads (time reduced from ~15 min to ~9 min).






[jira] [Updated] (SPARK-44472) change the external catalog thread safety way

2023-10-29 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-44472:

Labels: spark-  (was: )

> change the external catalog thread safety way
> -
>
> Key: SPARK-44472
> URL: https://issues.apache.org/jira/browse/SPARK-44472
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: spark-
> Attachments: add_hive_concurrent_connections.diff
>
>
> We tested changing the synchronization of the external catalog to use thread-locals instead 
> of synchronized methods.
> In our tests, it improved the runtime of parallel actions by about 45% for 
> certain workloads (time reduced from ~15 min to ~9 min).






[jira] [Commented] (SPARK-44472) change the external catalog thread safety way

2023-10-29 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780741#comment-17780741
 ] 

Izek Greenfield commented on SPARK-44472:
-

Hi, Any comments about this?

> change the external catalog thread safety way
> -
>
> Key: SPARK-44472
> URL: https://issues.apache.org/jira/browse/SPARK-44472
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: add_hive_concurrent_connections.diff
>
>
> We tested changing the synchronization of the external catalog to use thread-locals instead 
> of synchronized methods.
> In our tests, it improved the runtime of parallel actions by about 45% for 
> certain workloads (time reduced from ~15 min to ~9 min).






[jira] [Updated] (SPARK-44472) change the external catalog thread safety way

2023-07-18 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-44472:

Attachment: add_hive_concurrent_connections.diff

> change the external catalog thread safety way
> -
>
> Key: SPARK-44472
> URL: https://issues.apache.org/jira/browse/SPARK-44472
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: add_hive_concurrent_connections.diff
>
>
> We tested changing the synchronization of the external catalog to use thread-locals instead 
> of synchronized methods.
> In our tests, it improved the runtime of parallel actions by about 45% for 
> certain workloads (time reduced from ~15 min to ~9 min).






[jira] [Created] (SPARK-44472) change the external catalog thread safety way

2023-07-18 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-44472:
---

 Summary: change the external catalog thread safety way
 Key: SPARK-44472
 URL: https://issues.apache.org/jira/browse/SPARK-44472
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Izek Greenfield


We tested changing the synchronization of the external catalog to use thread-locals instead 
of synchronized methods.

In our tests, it improved the runtime of parallel actions by about 45% for 
certain workloads (time reduced from ~15 min to ~9 min).
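
A minimal sketch of the idea being tested; `MetastoreClient` and `newClient` below are illustrative stand-ins, not the actual Spark or Hive client classes:
{code:scala}
// Illustrative only: contrasts one shared synchronized client with a per-thread client.
trait MetastoreClient { def getTable(db: String, name: String): String }

class SynchronizedCatalog(client: MetastoreClient) {
  // one shared client: every caller serializes on this lock
  def getTable(db: String, name: String): String = synchronized {
    client.getTable(db, name)
  }
}

class ThreadLocalCatalog(newClient: () => MetastoreClient) {
  // one client per thread: parallel callers no longer contend on a single lock
  private val client = new ThreadLocal[MetastoreClient] {
    override def initialValue(): MetastoreClient = newClient()
  }
  def getTable(db: String, name: String): String =
    client.get().getTable(db, name)
}
{code}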






[jira] [Created] (SPARK-37745) Add new option to CSVOption ignoreEmptyLines

2021-12-26 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-37745:
---

 Summary: Add new option to CSVOption ignoreEmptyLines
 Key: SPARK-37745
 URL: https://issues.apache.org/jira/browse/SPARK-37745
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Izek Greenfield


In many cases, users need to read the full CSV file including all empty lines, so it would 
be good to have such an option. The default will be true so the default behavior 
will not change.
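
A sketch of how the proposed option might be used; note that `ignoreEmptyLines` is the option being requested here, not an existing Spark CSV option:
{code:scala}
// Proposed usage only; the "ignoreEmptyLines" option does not exist yet.
val df = spark.read
  .option("header", "true")
  .option("ignoreEmptyLines", "false") // keep empty lines instead of silently dropping them
  .csv("/tmp/data.csv")
{code}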






[jira] [Updated] (SPARK-37321) Wrong size estimation leads to "Cannot broadcast the table that is larger than 8GB: 8 GB"

2021-11-14 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-37321:

Description: 
When CBO is enabled, a situation occurs where Spark tries to broadcast a very 
large DataFrame due to wrong output size estimation.

 

In `EstimationUtils.getSizePerRow`, if there are no statistics then Spark will 
use `DataType.defaultSize`.

In the case where the output contains `functions.concat_ws`, the 
`getSizePerRow` function will estimate the size to be 20 bytes, while in our 
case the actual size can be a lot larger.

As a result, we in some cases end up with an estimated size of < 300K while the 
actual size can be > 8GB, thus leading to exceptions as spark thinks the tables 
may be broadcast but later realizes the data size is too large.

 

Code sample to reproduce:

for running that I used `-Xmx45G`
{code:scala}
import spark.implicits._

(1 to 10).toDF("index").withColumn("index", 
col("index").cast("string")).write.parquet("/tmp/a")
(1 to 1000).toDF("index_b").withColumn("index_b", 
col("index_b").cast("string")).write.parquet("/tmp/b")

val a = spark.read
   .parquet("/tmp/a")
   .withColumn("b", col("index"))
   .withColumn("l1", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l2", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l3", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l4", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l5", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))

val r = Random.alphanumeric
val l = 220
val i = 2800

val b = spark.read
   .parquet("/tmp/b")
   .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
 
a.join(b, col("index") === col("index_b")).show(2000)
{code}
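A rough back-of-envelope on the repro above (approximate figures, just to show the gap between the estimate and reality):
{code:scala}
// Approximate arithmetic for DataFrame b above.
val groups   = 2801                       // (0 to i) with i = 2800
val litLen   = 220                        // length of each random literal
val perCol   = groups.toLong * 2 * litLen // ~1.2 MB of literal text per lN column
val perRow   = 7 * perCol                 // ~8.6 MB per row (7 lN columns)
val actual   = 1000L * perRow             // ~8.6 GB for 1000 rows -> over the 8 GB limit
val estimate = 1000L * 8 * 20             // defaultSize-style estimate: ~20 bytes per string column -> ~160 KB
{code}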
 

  was:
When CBO is enabled, a situation occurs where Spark tries to broadcast a very 
large DataFrame due to wrong output size estimation.

 

In `EstimationUtils.getSizePerRow`, if there are no statistics then Spark will 
use `DataType.defaultSize`.

In the case where the output contains `functions.concat_ws`, the 
`getSizePerRow` function will estimate the size to be 20 bytes, while in our 
case the actual size can be a lot larger.

As a result, we in some cases end up with an estimated size of < 300K while the 
actual size can be > 8GB, thus leading to exceptions as spark thinks the tables 
may be broadcast but later realizes the data size is too large.

 

Code sample to reproduce:
{code:scala}
import spark.implicits._

(1 to 10).toDF("index").withColumn("index", 
col("index").cast("string")).write.parquet("/tmp/a")
(1 to 1000).toDF("index_b").withColumn("index_b", 
col("index_b").cast("string")).write.parquet("/tmp/b")

val a = spark.read
   .parquet("/tmp/a")
   .withColumn("b", col("index"))
   .withColumn("l1", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l2", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l3", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l4", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l

[jira] [Updated] (SPARK-37321) Wrong size estimation that leads to "Cannot broadcast the table that is larger than 8GB: 8 GB"

2021-11-14 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-37321:

Summary: Wrong size estimation that leads to "Cannot broadcast the table 
that is larger than 8GB: 8 GB"  (was: Wrong size estimation that leads to 
Cannot broadcast the table that is larger than 8GB: 8 GB)

> Wrong size estimation that leads to "Cannot broadcast the table that is 
> larger than 8GB: 8 GB"
> --
>
> Key: SPARK-37321
> URL: https://issues.apache.org/jira/browse/SPARK-37321
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Izek Greenfield
>Priority: Major
>
> When CBO is enabled, a situation occurs where Spark tries to broadcast a 
> very large DataFrame due to wrong output size estimation.
>  
> In `EstimationUtils.getSizePerRow`, if there are no statistics then Spark will 
> use `DataType.defaultSize`.
> In the case where the output contains `functions.concat_ws`, the 
> `getSizePerRow` function will estimate the size to be 20 bytes, while in our 
> case the actual size can be a lot larger.
> As a result, we in some cases end up with an estimated size of < 300K while 
> the actual size can be > 8GB, thus leading to exceptions as spark thinks the 
> tables may be broadcast but later realizes the data size is too large.
>  
> Code sample to reproduce:
> {code:scala}
> import spark.implicits._
> (1 to 10).toDF("index").withColumn("index", 
> col("index").cast("string")).write.parquet("/tmp/a")
> (1 to 1000).toDF("index_b").withColumn("index_b", 
> col("index_b").cast("string")).write.parquet("/tmp/b")
> val a = spark.read
>    .parquet("/tmp/a")
>    .withColumn("b", col("index"))
>    .withColumn("l1", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l2", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l3", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l4", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l5", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
> val r = Random.alphanumeric
> val l = 220
> val i = 2800
> val b = spark.read
>    .parquet("/tmp/b")
>    .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>  
> a.join(b, col("index") === col("index_b")).show(2000)
> {code}
>  






[jira] [Updated] (SPARK-37321) Wrong size estimation leads to "Cannot broadcast the table that is larger than 8GB: 8 GB"

2021-11-14 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-37321:

Component/s: SQL

> Wrong size estimation leads to "Cannot broadcast the table that is larger 
> than 8GB: 8 GB"
> -
>
> Key: SPARK-37321
> URL: https://issues.apache.org/jira/browse/SPARK-37321
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Izek Greenfield
>Priority: Major
>
> When CBO is enabled, a situation occurs where Spark tries to broadcast a 
> very large DataFrame due to wrong output size estimation.
>  
> In `EstimationUtils.getSizePerRow`, if there are no statistics then Spark will 
> use `DataType.defaultSize`.
> In the case where the output contains `functions.concat_ws`, the 
> `getSizePerRow` function will estimate the size to be 20 bytes, while in our 
> case the actual size can be a lot larger.
> As a result, we in some cases end up with an estimated size of < 300K while 
> the actual size can be > 8GB, thus leading to exceptions as spark thinks the 
> tables may be broadcast but later realizes the data size is too large.
>  
> Code sample to reproduce:
> {code:scala}
> import spark.implicits._
> (1 to 10).toDF("index").withColumn("index", 
> col("index").cast("string")).write.parquet("/tmp/a")
> (1 to 1000).toDF("index_b").withColumn("index_b", 
> col("index_b").cast("string")).write.parquet("/tmp/b")
> val a = spark.read
>    .parquet("/tmp/a")
>    .withColumn("b", col("index"))
>    .withColumn("l1", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l2", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l3", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l4", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l5", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
> val r = Random.alphanumeric
> val l = 220
> val i = 2800
> val b = spark.read
>    .parquet("/tmp/b")
>    .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>  
> a.join(b, col("index") === col("index_b")).show(2000)
> {code}
>  






[jira] [Updated] (SPARK-37321) Wrong size estimation leads to "Cannot broadcast the table that is larger than 8GB: 8 GB"

2021-11-14 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-37321:

Summary: Wrong size estimation leads to "Cannot broadcast the table that is 
larger than 8GB: 8 GB"  (was: Wrong size estimation that leads to "Cannot 
broadcast the table that is larger than 8GB: 8 GB")

> Wrong size estimation leads to "Cannot broadcast the table that is larger 
> than 8GB: 8 GB"
> -
>
> Key: SPARK-37321
> URL: https://issues.apache.org/jira/browse/SPARK-37321
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Izek Greenfield
>Priority: Major
>
> When CBO is enabled, a situation occurs where Spark tries to broadcast a 
> very large DataFrame due to wrong output size estimation.
>  
> In `EstimationUtils.getSizePerRow`, if there are no statistics then Spark will 
> use `DataType.defaultSize`.
> In the case where the output contains `functions.concat_ws`, the 
> `getSizePerRow` function will estimate the size to be 20 bytes, while in our 
> case the actual size can be a lot larger.
> As a result, we in some cases end up with an estimated size of < 300K while 
> the actual size can be > 8GB, thus leading to exceptions as spark thinks the 
> tables may be broadcast but later realizes the data size is too large.
>  
> Code sample to reproduce:
> {code:scala}
> import spark.implicits._
> (1 to 10).toDF("index").withColumn("index", 
> col("index").cast("string")).write.parquet("/tmp/a")
> (1 to 1000).toDF("index_b").withColumn("index_b", 
> col("index_b").cast("string")).write.parquet("/tmp/b")
> val a = spark.read
>    .parquet("/tmp/a")
>    .withColumn("b", col("index"))
>    .withColumn("l1", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l2", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l3", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l4", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l5", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
> val r = Random.alphanumeric
> val l = 220
> val i = 2800
> val b = spark.read
>    .parquet("/tmp/b")
>    .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>  
> a.join(b, col("index") === col("index_b")).show(2000)
> {code}
>  






[jira] [Updated] (SPARK-37321) Wrong size estimation that leads to Cannot broadcast the table that is larger than 8GB: 8 GB

2021-11-14 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-37321:

Description: 
When CBO is enabled, a situation occurs where Spark tries to broadcast a very 
large DataFrame due to wrong output size estimation.

 

In `EstimationUtils.getSizePerRow`, if there are no statistics then Spark will 
use `DataType.defaultSize`.

In the case where the output contains `functions.concat_ws`, the 
`getSizePerRow` function will estimate the size to be 20 bytes, while in our 
case the actual size can be a lot larger.

As a result, we in some cases end up with an estimated size of < 300K while the 
actual size can be > 8GB, thus leading to exceptions as spark thinks the tables 
may be broadcast but later realizes the data size is too large.

 

Code sample to reproduce:
{code:scala}
import spark.implicits._

(1 to 10).toDF("index").withColumn("index", 
col("index").cast("string")).write.parquet("/tmp/a")
(1 to 1000).toDF("index_b").withColumn("index_b", 
col("index_b").cast("string")).write.parquet("/tmp/b")

val a = spark.read
   .parquet("/tmp/a")
   .withColumn("b", col("index"))
   .withColumn("l1", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l2", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l3", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l4", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l5", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))

val r = Random.alphanumeric
val l = 220
val i = 2800

val b = spark.read
   .parquet("/tmp/b")
   .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
 
a.join(b, col("index") === col("index_b")).show(2000)
{code}
 

  was:
When CBO is enabled, a situation occurs where Spark tries to broadcast a very 
large DataFrame due to wrong output size estimation.

 

In `EstimationUtils.getSizePerRow`, if there are no statistics then Spark will 
use `DataType.defaultSize`.

In the case where the output contains `functions.concat_ws`, the 
`getSizePerRow` function will estimate the size to be 20 bytes, while in our 
case the actual size can be a lot larger.

As a result, we in some cases end up with an estimated size of < 300K while the 
actual size can be > 8GB, thus leading to exceptions as spark thinks the tables 
may be broadcast but later realizes the data size is too large.

 

Code sample to reproduce:
{code:scala}
import spark.implicits._

(1 to 10).toDF("index").withColumn("index", 
col("index").cast("string")).write.parquet("/tmp/a")

(1 to 1000).toDF("index_b").withColumn("index_b", 
col("index_b").cast("string")).write.parquet("/tmp/b")

val a = spark.read
   .parquet("/tmp/a")
   .withColumn("b", col("index"))
   .withColumn("l1", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l2", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l3", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l4", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l5", functions.concat_ws("/", col("

[jira] [Created] (SPARK-37321) Wrong size estimation that leads to Cannot broadcast the table that is larger than 8GB: 8 GB

2021-11-14 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-37321:
---

 Summary: Wrong size estimation that leads to Cannot broadcast the 
table that is larger than 8GB: 8 GB
 Key: SPARK-37321
 URL: https://issues.apache.org/jira/browse/SPARK-37321
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 3.2.0, 3.1.1
Reporter: Izek Greenfield


When CBO is enabled, a situation occurs where Spark tries to broadcast a very 
large DataFrame due to wrong output size estimation.

 

In `EstimationUtils.getSizePerRow`, if there are no statistics then Spark will 
use `DataType.defaultSize`.

In the case where the output contains `functions.concat_ws`, the 
`getSizePerRow` function will estimate the size to be 20 bytes, while in our 
case the actual size can be a lot larger.

As a result, we in some cases end up with an estimated size of < 300K while the 
actual size can be > 8GB, thus leading to exceptions as spark thinks the tables 
may be broadcast but later realizes the data size is too large.

 

Code sample to reproduce:
{code:scala}
import spark.implicits._

(1 to 10).toDF("index").withColumn("index", 
col("index").cast("string")).write.parquet("/tmp/a")

(1 to 1000).toDF("index_b").withColumn("index_b", 
col("index_b").cast("string")).write.parquet("/tmp/b")

val a = spark.read
   .parquet("/tmp/a")
   .withColumn("b", col("index"))
   .withColumn("l1", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l2", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l3", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l4", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l5", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))

val r = Random.alphanumeric
val l = 220
val i = 2800

val b = spark.read
   .parquet("/tmp/b")
   .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
 
a.join(b, col("index") === col("index_b")).show(2000)
{code}
 






[jira] [Commented] (SPARK-36329) show api of Dataset should get as input the output method

2021-08-01 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391159#comment-17391159
 ] 

Izek Greenfield commented on SPARK-36329:
-

# If you do it like that, you have to write it again and again.
 # It is very handy to have it in the same format as show.

> show api of Dataset should get as input the output method
> -
>
> Key: SPARK-36329
> URL: https://issues.apache.org/jira/browse/SPARK-36329
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Izek Greenfield
>Priority: Major
>
> For now show is:
> {code:scala}
> def show(numRows: Int, truncate: Boolean): Unit = if (truncate) {
> println(showString(numRows, truncate = 20))
>   } else {
> println(showString(numRows, truncate = 0))
>   }
> {code}
> it can be turned into:
> {code:scala}
> def show(numRows: Int, truncate: Boolean, out: String => Unit = println): 
> Unit = if (truncate) {
> out(showString(numRows, truncate = 20))
>   } else {
> out(showString(numRows, truncate = 0))
>   }
> {code}
> so the user will be able to send the output to a file or log...






[jira] [Updated] (SPARK-36329) show api of Dataset should get as input the output method

2021-07-28 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-36329:

Description: 
For now show is:
{code:scala}
def show(numRows: Int, truncate: Boolean): Unit = if (truncate) {
println(showString(numRows, truncate = 20))
  } else {
println(showString(numRows, truncate = 0))
  }
{code}
it can be turned into:
{code:scala}
def show(numRows: Int, truncate: Boolean, out: String => Unit = println): Unit 
= if (truncate) {
out(showString(numRows, truncate = 20))
  } else {
out(showString(numRows, truncate = 0))
  }
{code}
so the user will be able to send the output to a file or log...
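
If the `out` parameter were added as proposed, usage could look like this (a sketch of the proposed overload; it does not exist in Spark today):
{code:scala}
// Sketch of the proposed API; `out` is the new parameter suggested above.
val logger = org.slf4j.LoggerFactory.getLogger("DatasetShow")
df.show(20, truncate = true, out = s => logger.info(s)) // send the rendered table to a log instead of stdout
{code}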

  was:
For now show is: 

{code:scala}
def show(numRows: Int, truncate: Boolean): Unit = if (truncate) {
println(showString(numRows, truncate = 20))
  } else {
println(showString(numRows, truncate = 0))
  }
{code}

it can be turned into:


{code:scala}
def show(numRows: Int, truncate: Boolean, out: Any => Unit = println): Unit = 
if (truncate) {
out(showString(numRows, truncate = 20))
  } else {
out(showString(numRows, truncate = 0))
  }
{code}

so the user will be able to send the output to a file or log...




> show api of Dataset should get as input the output method
> -
>
> Key: SPARK-36329
> URL: https://issues.apache.org/jira/browse/SPARK-36329
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Izek Greenfield
>Priority: Major
>
> For now show is:
> {code:scala}
> def show(numRows: Int, truncate: Boolean): Unit = if (truncate) {
> println(showString(numRows, truncate = 20))
>   } else {
> println(showString(numRows, truncate = 0))
>   }
> {code}
> it can be turned into:
> {code:scala}
> def show(numRows: Int, truncate: Boolean, out: String => Unit = println): 
> Unit = if (truncate) {
> out(showString(numRows, truncate = 20))
>   } else {
> out(showString(numRows, truncate = 0))
>   }
> {code}
> so the user will be able to send the output to a file or log...






[jira] [Created] (SPARK-36329) show api of Dataset should get as input the output method

2021-07-28 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-36329:
---

 Summary: show api of Dataset should get as input the output method
 Key: SPARK-36329
 URL: https://issues.apache.org/jira/browse/SPARK-36329
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: Izek Greenfield


For now show is: 

{code:scala}
def show(numRows: Int, truncate: Boolean): Unit = if (truncate) {
println(showString(numRows, truncate = 20))
  } else {
println(showString(numRows, truncate = 0))
  }
{code}

it can be turned into:


{code:scala}
def show(numRows: Int, truncate: Boolean, out: Any => Unit = println): Unit = 
if (truncate) {
out(showString(numRows, truncate = 20))
  } else {
out(showString(numRows, truncate = 0))
  }
{code}

so the user will be able to send the output to a file or log...








[jira] [Commented] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice

2021-02-16 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285663#comment-17285663
 ] 

Izek Greenfield commented on SPARK-8582:


[~zsxwing] Why was that Jira closed? I still see that as an issue in the code...

> Optimize checkpointing to avoid computing an RDD twice
> --
>
> Key: SPARK-8582
> URL: https://issues.apache.org/jira/browse/SPARK-8582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: bulk-closed
>
> In Spark, checkpointing allows the user to truncate the lineage of his RDD 
> and save the intermediate contents to HDFS for fault tolerance. However, this 
> is not currently implemented super efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during 
> the action that triggered the checkpointing in the first place, and once 
> while we checkpoint (we iterate through an RDD's partitions and write them to 
> disk). See this line for more detail: 
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingInterator` that writes checkpoint 
> data to HDFS while we run the action. This will speed up many usages of 
> `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, but 
> this is not always viable for very large input data. It's also not a great 
> API to use in general.)
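
The caching workaround mentioned above, sketched with the standard RDD API:
{code:scala}
// Cache before checkpointing so the checkpoint write reads cached partitions
// instead of recomputing the whole lineage.
sc.setCheckpointDir("/tmp/checkpoints")
val rdd = sc.textFile("/tmp/input").map(_.toUpperCase)
rdd.cache()        // keep partitions after the first computation
rdd.checkpoint()   // marks the RDD; the actual write happens after the next action
rdd.count()        // computes once; the checkpoint then reads from the cache
{code}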






[jira] [Created] (SPARK-32764) comparison of -0.0 < 0.0 returns true

2020-09-01 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-32764:
---

 Summary: comparison of -0.0 < 0.0 returns true
 Key: SPARK-32764
 URL: https://issues.apache.org/jira/browse/SPARK-32764
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Izek Greenfield


{code:scala}
 val spark: SparkSession = SparkSession
  .builder()
  .master("local")
  .appName("SparkByExamples.com")
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

import spark.sqlContext.implicits._

val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
  .withColumn("comp", col("neg") < col("pos"))
  df.show(false)

==

+----+---+----+
|neg |pos|comp|
+----+---+----+
|-0.0|0.0|true|
+----+---+----+{code}

I think that result should be false.
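
For reference, plain JVM double comparison agrees with that expectation (IEEE 754 treats -0.0 and 0.0 as equal), while `java.lang.Double.compare` orders -0.0 before 0.0:
{code:scala}
val a = -0.0
val b = 0.0
println(a < b)                          // false: primitive IEEE 754 comparison
println(a == b)                         // true: -0.0 and 0.0 compare equal
println(java.lang.Double.compare(a, b)) // -1: the total ordering puts -0.0 before 0.0
{code}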






[jira] [Commented] (SPARK-31769) Add support of MDC in spark driver logs.

2020-05-20 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111955#comment-17111955
 ] 

Izek Greenfield commented on SPARK-31769:
-

Hi [~cloud_fan] Following previous discussions on 
[PR-26624|https://github.com/apache/spark/pull/26624], I'd like us to agree on 
the necessity of this feature, after which I'll create a dedicated PR.

What are your thoughts?

> Add support of MDC in spark driver logs.
> 
>
> Key: SPARK-31769
> URL: https://issues.apache.org/jira/browse/SPARK-31769
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Izek Greenfield
>Priority: Major
>
> In order to align driver logs with Spark executor logs, add support to log 
> MDC state for properties that are shared between the driver and the 
> executors (e.g. application name, task id).
>  The use case is applicable in 2 cases:
>  # streaming
>  # running spark under job server (in this case you have a long-running spark 
> context that handles many tasks from different sources).
> See also [SPARK-8981|https://issues.apache.org/jira/browse/SPARK-8981] that 
> handles the MDC in executor context.
>  






[jira] [Created] (SPARK-31769) Add support of MDC in spark driver logs.

2020-05-20 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-31769:
---

 Summary: Add support of MDC in spark driver logs.
 Key: SPARK-31769
 URL: https://issues.apache.org/jira/browse/SPARK-31769
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Izek Greenfield


In order to align driver logs with Spark executor logs, add support to log MDC 
state for properties that are shared between the driver and the executors 
(e.g. application name, task id).
 The use case is applicable in 2 cases:
 # streaming
 # running spark under job server (in this case you have a long-running spark 
context that handles many tasks from different sources).

See also [SPARK-8981|https://issues.apache.org/jira/browse/SPARK-8981] that 
handles the MDC in executor context.
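For illustration, a minimal driver-side sketch of the idea, assuming an SLF4J/Log4j backend and hypothetical MDC keys {{appName}} and {{applicationId}} (note that MDC is thread-local, so only the thread that sets it is covered):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.slf4j.{LoggerFactory, MDC}

object DriverMdcExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("mdc-demo")
      .getOrCreate()

    // Put shared properties into the MDC of the driver thread so that a layout
    // pattern such as %X{appName} %X{applicationId} can emit them.
    MDC.put("appName", spark.sparkContext.appName)
    MDC.put("applicationId", spark.sparkContext.applicationId)

    val log = LoggerFactory.getLogger(getClass)
    log.info("driver-side log line that carries the MDC context")

    spark.stop()
  }
}
{code}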

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception

2020-04-30 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096340#comment-17096340
 ] 

Izek Greenfield commented on SPARK-30332:
-

Can someone check that?

> When running sql query with limit catalyst throw StackOverFlow exception 
> -
>
> Key: SPARK-30332
> URL: https://issues.apache.org/jira/browse/SPARK-30332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark version 3.0.0-preview
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, 
> AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, 
> T_41233.csv
>
>
> Running that SQL:
> {code:sql}
> SELECT  BT_capital.asof_date,
> BT_capital.run_id,
> BT_capital.v,
> BT_capital.id,
> BT_capital.entity,
> BT_capital.level_1,
> BT_capital.level_2,
> BT_capital.level_3,
> BT_capital.level_4,
> BT_capital.level_5,
> BT_capital.level_6,
> BT_capital.path_bt_capital,
> BT_capital.line_item,
> t0.target_line_item,
> t0.line_description,
> BT_capital.col_item,
> BT_capital.rep_amount,
> root.orgUnitId,
> root.cptyId,
> root.instId,
> root.startDate,
> root.maturityDate,
> root.amount,
> root.nominalAmount,
> root.quantity,
> root.lkupAssetLiability,
> root.lkupCurrency,
> root.lkupProdType,
> root.interestResetDate,
> root.interestResetTerm,
> root.noticePeriod,
> root.historicCostAmount,
> root.dueDate,
> root.lkupResidence,
> root.lkupCountryOfUltimateRisk,
> root.lkupSector,
> root.lkupIndustry,
> root.lkupAccountingPortfolioType,
> root.lkupLoanDepositTerm,
> root.lkupFixedFloating,
> root.lkupCollateralType,
> root.lkupRiskType,
> root.lkupEligibleRefinancing,
> root.lkupHedging,
> root.lkupIsOwnIssued,
> root.lkupIsSubordinated,
> root.lkupIsQuoted,
> root.lkupIsSecuritised,
> root.lkupIsSecuritisedServiced,
> root.lkupIsSyndicated,
> root.lkupIsDeRecognised,
> root.lkupIsRenegotiated,
> root.lkupIsTransferable,
> root.lkupIsNewBusiness,
> root.lkupIsFiduciary,
> root.lkupIsNonPerforming,
> root.lkupIsInterGroup,
> root.lkupIsIntraGroup,
> root.lkupIsRediscounted,
> root.lkupIsCollateral,
> root.lkupIsExercised,
> root.lkupIsImpaired,
> root.facilityId,
> root.lkupIsOTC,
> root.lkupIsDefaulted,
> root.lkupIsSavingsPosition,
> root.lkupIsForborne,
> root.lkupIsDebtRestructuringLoan,
> root.interestRateAAR,
> root.interestRateAPRC,
> root.custom1,
> root.custom2,
> root.custom3,
> root.lkupSecuritisationType,
> root.lkupIsCashPooling,
> root.lkupIsEquityParticipationGTE10,
> root.lkupIsConvertible,
> root.lkupEconomicHedge,
> root.lkupIsNonCurrHeldForSale,
> root.lkupIsEmbeddedDerivative,
> root.lkupLoanPurpose,
> root.lkupRegulated,
> root.lkupRepaymentType,
> root.glAccount,
> root.lkupIsRecourse,
> root.lkupIsNotFullyGuaranteed,
> root.lkupImpairmentStage,
> root.lkupIsEntireAmountWrittenOff,
> root.lkupIsLowCreditRisk,
> root.lkupIsOBSWithinIFRS9,
> root.lkupIsUnderSpecialSurveillance,
> root.lkupProtection,
> root.lkupIsGeneralAllowance,
> root.lkupSectorUltimateRisk,
> root.cptyOrgUnitId,
> root.name,
> root.lkupNationality,
> root.lkupSize,
> root.lkupIsSPV,
> root.lkupIsCentralCounterparty,
> root.lkupIsMMRMFI,
> root.lkupIsKeyManagement,
> root.lkupIsOtherRelatedParty,
> root.lkupResidenceProvince,
> root.lkupIsTradingBook,
> root.entityHierarchy_entityId,
> root.entityHierarchy_Residence,
> root.lkupLocalCurrency,
> root.cpty_entityhierarchy_entityId,
> root.lkupRelationship,
> root.cpty_lkupRelationship,
> root.entityNationality,
> root.lkupRepCurrency,
> root.startDateFinancialYear,
> root.numEmployees,
> root.numEmployeesTotal,
> root.collateralAmount,
> root.guaranteeAmount,
> root.impairmentSpecificIndividual,
> root.impairmentSpecificCollective,
> root.impairmentGeneral,
> root.creditRiskAmount,
> root.provisionSpecificIndividual,
> root.provisionSpecificCollective,
> root.provisionGeneral,
> root.writeOffAmount,
> root.interest,
> root.fairValueAmount,
> root.grossCarryingAmount,
> root.carryingAmount,
> root.code,
> root.lkupInstrumentType,
> root.price,
> root.amountAtIssue,
> root.yield,
> root.totalFacilityAmount,
> root.facility_rate,
> root.spec_indiv_est,
> root.spec_coll_est,
> root.coll_inc_loss,
> root.impairment_amount,
> root.provision_amount,
> root.accumulated_impairment,
> root.exclusionFlag,
> root.lkupIsHoldingCompany,
> root.instrument_startDate,
> root.entityResidence,
> fxRate.enumerator,
> fxRate.lkupFromCurrency,
> fxRate.rate,
> fxRate.custom1,
> fxRate.custom2,
> fxRate.custom3,
> GB_position.lkupIsECGDGuaranteed,
> GB_position.lkupIsMultiAcctOffsetMortgage,
> GB_position.lkupIsIndexLinked,
> GB_position.lkupIsRetail,
> GB_position.lkupCollateralLocation,
> GB_position.percentAboveBBR,
> GB_position.lkupIsM

[jira] [Reopened] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception

2020-02-23 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield reopened SPARK-30332:
-

Added code to reproduce the issue.

> When running sql query with limit catalyst throw StackOverFlow exception 
> -
>
> Key: SPARK-30332
> URL: https://issues.apache.org/jira/browse/SPARK-30332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark version 3.0.0-preview
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, 
> AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, 
> T_41233.csv
>
>
> Running that SQL:
> {code:sql}
> SELECT  BT_capital.asof_date,
> BT_capital.run_id,
> BT_capital.v,
> BT_capital.id,
> BT_capital.entity,
> BT_capital.level_1,
> BT_capital.level_2,
> BT_capital.level_3,
> BT_capital.level_4,
> BT_capital.level_5,
> BT_capital.level_6,
> BT_capital.path_bt_capital,
> BT_capital.line_item,
> t0.target_line_item,
> t0.line_description,
> BT_capital.col_item,
> BT_capital.rep_amount,
> root.orgUnitId,
> root.cptyId,
> root.instId,
> root.startDate,
> root.maturityDate,
> root.amount,
> root.nominalAmount,
> root.quantity,
> root.lkupAssetLiability,
> root.lkupCurrency,
> root.lkupProdType,
> root.interestResetDate,
> root.interestResetTerm,
> root.noticePeriod,
> root.historicCostAmount,
> root.dueDate,
> root.lkupResidence,
> root.lkupCountryOfUltimateRisk,
> root.lkupSector,
> root.lkupIndustry,
> root.lkupAccountingPortfolioType,
> root.lkupLoanDepositTerm,
> root.lkupFixedFloating,
> root.lkupCollateralType,
> root.lkupRiskType,
> root.lkupEligibleRefinancing,
> root.lkupHedging,
> root.lkupIsOwnIssued,
> root.lkupIsSubordinated,
> root.lkupIsQuoted,
> root.lkupIsSecuritised,
> root.lkupIsSecuritisedServiced,
> root.lkupIsSyndicated,
> root.lkupIsDeRecognised,
> root.lkupIsRenegotiated,
> root.lkupIsTransferable,
> root.lkupIsNewBusiness,
> root.lkupIsFiduciary,
> root.lkupIsNonPerforming,
> root.lkupIsInterGroup,
> root.lkupIsIntraGroup,
> root.lkupIsRediscounted,
> root.lkupIsCollateral,
> root.lkupIsExercised,
> root.lkupIsImpaired,
> root.facilityId,
> root.lkupIsOTC,
> root.lkupIsDefaulted,
> root.lkupIsSavingsPosition,
> root.lkupIsForborne,
> root.lkupIsDebtRestructuringLoan,
> root.interestRateAAR,
> root.interestRateAPRC,
> root.custom1,
> root.custom2,
> root.custom3,
> root.lkupSecuritisationType,
> root.lkupIsCashPooling,
> root.lkupIsEquityParticipationGTE10,
> root.lkupIsConvertible,
> root.lkupEconomicHedge,
> root.lkupIsNonCurrHeldForSale,
> root.lkupIsEmbeddedDerivative,
> root.lkupLoanPurpose,
> root.lkupRegulated,
> root.lkupRepaymentType,
> root.glAccount,
> root.lkupIsRecourse,
> root.lkupIsNotFullyGuaranteed,
> root.lkupImpairmentStage,
> root.lkupIsEntireAmountWrittenOff,
> root.lkupIsLowCreditRisk,
> root.lkupIsOBSWithinIFRS9,
> root.lkupIsUnderSpecialSurveillance,
> root.lkupProtection,
> root.lkupIsGeneralAllowance,
> root.lkupSectorUltimateRisk,
> root.cptyOrgUnitId,
> root.name,
> root.lkupNationality,
> root.lkupSize,
> root.lkupIsSPV,
> root.lkupIsCentralCounterparty,
> root.lkupIsMMRMFI,
> root.lkupIsKeyManagement,
> root.lkupIsOtherRelatedParty,
> root.lkupResidenceProvince,
> root.lkupIsTradingBook,
> root.entityHierarchy_entityId,
> root.entityHierarchy_Residence,
> root.lkupLocalCurrency,
> root.cpty_entityhierarchy_entityId,
> root.lkupRelationship,
> root.cpty_lkupRelationship,
> root.entityNationality,
> root.lkupRepCurrency,
> root.startDateFinancialYear,
> root.numEmployees,
> root.numEmployeesTotal,
> root.collateralAmount,
> root.guaranteeAmount,
> root.impairmentSpecificIndividual,
> root.impairmentSpecificCollective,
> root.impairmentGeneral,
> root.creditRiskAmount,
> root.provisionSpecificIndividual,
> root.provisionSpecificCollective,
> root.provisionGeneral,
> root.writeOffAmount,
> root.interest,
> root.fairValueAmount,
> root.grossCarryingAmount,
> root.carryingAmount,
> root.code,
> root.lkupInstrumentType,
> root.price,
> root.amountAtIssue,
> root.yield,
> root.totalFacilityAmount,
> root.facility_rate,
> root.spec_indiv_est,
> root.spec_coll_est,
> root.coll_inc_loss,
> root.impairment_amount,
> root.provision_amount,
> root.accumulated_impairment,
> root.exclusionFlag,
> root.lkupIsHoldingCompany,
> root.instrument_startDate,
> root.entityResidence,
> fxRate.enumerator,
> fxRate.lkupFromCurrency,
> fxRate.rate,
> fxRate.custom1,
> fxRate.custom2,
> fxRate.custom3,
> GB_position.lkupIsECGDGuaranteed,
> GB_position.lkupIsMultiAcctOffsetMortgage,
> GB_position.lkupIsIndexLinked,
> GB_position.lkupIsRetail,
> GB_position.lkupCollateralLocation,
> GB_position.percentAboveBBR,
> GB_position.lkupIsMoreInArrears,
> GB_position.lkupIsArrearsCapitalised,

[jira] [Commented] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception

2020-02-23 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042851#comment-17042851
 ] 

Izek Greenfield commented on SPARK-30332:
-

Code to reproduce the problem:

{code:scala}

import java.nio.file.{Files, Paths}

import org.apache.spark.sql.SparkSession

object Test {

  def main(args: Array[String]): Unit = {
    val spark = {
      SparkSession
        .builder()
        .master("local[*]")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.sql.cbo.enabled", "true")
        .config("spark.scheduler.mode", "FAIR")
        .config("spark.sql.crossJoin.enabled", "true")
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.parquet.filterPushdown", "true")
        .config("spark.sql.shuffle.partitions", "500")
        .config("spark.executor.heartbeatInterval", "600s")
        .config("spark.network.timeout", "1200s")
        .config("spark.sql.broadcastTimeout", "1200s")
        .config("spark.shuffle.file.buffer", "64k")
        .appName("error")
        .enableHiveSupport()
        .getOrCreate()
    }

    val pathToCsvFiles = "db"
    import scala.collection.JavaConverters._

    // Register each CSV file under the directory as a global temp view named after the file.
    Files.walk(Paths.get(pathToCsvFiles)).iterator().asScala.map(_.toFile).foreach { file =>
      if (!file.isDirectory) {
        val name = file.getName
        spark.read.format("csv")
          .option("inferSchema", "true")
          .option("header", "true")
          .option("mode", "DROPMALFORMED")
          .load(file.getAbsolutePath)
          .createOrReplaceGlobalTempView(name.split("\\.").head)
      }
    }

spark.sql(
  """
|SELECT  BT_capital.asof_date,
|BT_capital.run_id,
|BT_capital.v,
|BT_capital.id,
|BT_capital.entity,
|BT_capital.level_1,
|BT_capital.level_2,
|BT_capital.level_3,
|BT_capital.level_4,
|BT_capital.level_5,
|BT_capital.level_6,
|BT_capital.path_bt_capital,
|BT_capital.line_item,
|t0.target_line_item,
|t0.line_description,
|BT_capital.col_item,
|BT_capital.rep_amount,
|root.orgUnitId,
|root.cptyId,
|root.instId,
|root.startDate,
|root.maturityDate,
|root.amount,
|root.nominalAmount,
|root.quantity,
|root.lkupAssetLiability,
|root.lkupCurrency,
|root.lkupProdType,
|root.interestResetDate,
|root.interestResetTerm,
|root.noticePeriod,
|root.historicCostAmount,
|root.dueDate,
|root.lkupResidence,
|root.lkupCountryOfUltimateRisk,
|root.lkupSector,
|root.lkupIndustry,
|root.lkupAccountingPortfolioType,
|root.lkupLoanDepositTerm,
|root.lkupFixedFloating,
|root.lkupCollateralType,
|root.lkupRiskType,
|root.lkupEligibleRefinancing,
|root.lkupHedging,
|root.lkupIsOwnIssued,
|root.lkupIsSubordinated,
|root.lkupIsQuoted,
|root.lkupIsSecuritised,
|root.lkupIsSecuritisedServiced,
|root.lkupIsSyndicated,
|root.lkupIsDeRecognised,
|root.lkupIsRenegotiated,
|root.lkupIsTransferable,
|root.lkupIsNewBusiness,
|root.lkupIsFiduciary,
|root.lkupIsNonPerforming,
|root.lkupIsInterGroup,
|root.lkupIsIntraGroup,
|root.lkupIsRediscounted,
|root.lkupIsCollateral,
|root.lkupIsExercised,
|root.lkupIsImpaired,
|root.facilityId,
|root.lkupIsOTC,
|root.lkupIsDefaulted,
|root.lkupIsSavingsPosition,
|root.lkupIsForborne,
|root.lkupIsDebtRestructuringLoan,
|root.interestRateAAR,
|root.interestRateAPRC,
|root.custom1,
|root.custom2,
|root.custom3,
|root.lkupSecuritisationType,
|root.lkupIsCashPooling,
|root.lkupIsEquityParticipationGTE10,
|root.lkupIsConvertible,
|root.lkupEconomicHedge,
|root.lkupIsNonCurrHeldForSale,
|root.lkupIsEmbeddedDerivative,
|root.lkupLoanPurpose,
|root.lkupRegulated,
|root.lkupRepaymentType,
|root.glAccount,
|root.lkupIsRecourse,
|root.lkupIsNotFullyGuaranteed,
|root.lkupImpairmentStage,
|root.lkupIsEntireAmountWrittenOff,
|root.lkupIsLowCreditRisk,
|root.lkupIsOBSWithinIFRS9,
|root.lkupIsUnderSpecialSurveillance,
|root.lkupProtection,
|root.lkupIsGeneralAllowance,
|root.lkupSectorUltimateRisk,
|root.cptyOrgUnitId,
|root.name,
|root.lkupNationality,
|root.lkupSize,
|root.lkupIsSPV,
|root.lkupIsCentralCounterparty,
|root.lkupIsMMRMFI,
|root.lkupIsKeyManagement,
|root.lkupIsOtherRelat

[jira] [Reopened] (SPARK-8981) Set applicationId and appName in log4j MDC

2020-02-17 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield reopened SPARK-8981:


Created PR

> Set applicationId and appName in log4j MDC
> --
>
> Key: SPARK-8981
> URL: https://issues.apache.org/jira/browse/SPARK-8981
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Paweł Kopiczko
>Priority: Minor
>
> It would be nice to have, because it's good to have logs in one file when 
> using log agents (like logentires) in standalone mode. Also allows 
> configuring rolling file appender without a mess when multiple applications 
> are running.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception

2020-02-17 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield reopened SPARK-30332:
-

I attached the datasets needed to run the query.

> When running sql query with limit catalyst throw StackOverFlow exception 
> -
>
> Key: SPARK-30332
> URL: https://issues.apache.org/jira/browse/SPARK-30332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark version 3.0.0-preview
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, 
> AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, 
> T_41233.csv
>
>
> Running that SQL:
> {code:sql}
> SELECT  BT_capital.asof_date,
> BT_capital.run_id,
> BT_capital.v,
> BT_capital.id,
> BT_capital.entity,
> BT_capital.level_1,
> BT_capital.level_2,
> BT_capital.level_3,
> BT_capital.level_4,
> BT_capital.level_5,
> BT_capital.level_6,
> BT_capital.path_bt_capital,
> BT_capital.line_item,
> t0.target_line_item,
> t0.line_description,
> BT_capital.col_item,
> BT_capital.rep_amount,
> root.orgUnitId,
> root.cptyId,
> root.instId,
> root.startDate,
> root.maturityDate,
> root.amount,
> root.nominalAmount,
> root.quantity,
> root.lkupAssetLiability,
> root.lkupCurrency,
> root.lkupProdType,
> root.interestResetDate,
> root.interestResetTerm,
> root.noticePeriod,
> root.historicCostAmount,
> root.dueDate,
> root.lkupResidence,
> root.lkupCountryOfUltimateRisk,
> root.lkupSector,
> root.lkupIndustry,
> root.lkupAccountingPortfolioType,
> root.lkupLoanDepositTerm,
> root.lkupFixedFloating,
> root.lkupCollateralType,
> root.lkupRiskType,
> root.lkupEligibleRefinancing,
> root.lkupHedging,
> root.lkupIsOwnIssued,
> root.lkupIsSubordinated,
> root.lkupIsQuoted,
> root.lkupIsSecuritised,
> root.lkupIsSecuritisedServiced,
> root.lkupIsSyndicated,
> root.lkupIsDeRecognised,
> root.lkupIsRenegotiated,
> root.lkupIsTransferable,
> root.lkupIsNewBusiness,
> root.lkupIsFiduciary,
> root.lkupIsNonPerforming,
> root.lkupIsInterGroup,
> root.lkupIsIntraGroup,
> root.lkupIsRediscounted,
> root.lkupIsCollateral,
> root.lkupIsExercised,
> root.lkupIsImpaired,
> root.facilityId,
> root.lkupIsOTC,
> root.lkupIsDefaulted,
> root.lkupIsSavingsPosition,
> root.lkupIsForborne,
> root.lkupIsDebtRestructuringLoan,
> root.interestRateAAR,
> root.interestRateAPRC,
> root.custom1,
> root.custom2,
> root.custom3,
> root.lkupSecuritisationType,
> root.lkupIsCashPooling,
> root.lkupIsEquityParticipationGTE10,
> root.lkupIsConvertible,
> root.lkupEconomicHedge,
> root.lkupIsNonCurrHeldForSale,
> root.lkupIsEmbeddedDerivative,
> root.lkupLoanPurpose,
> root.lkupRegulated,
> root.lkupRepaymentType,
> root.glAccount,
> root.lkupIsRecourse,
> root.lkupIsNotFullyGuaranteed,
> root.lkupImpairmentStage,
> root.lkupIsEntireAmountWrittenOff,
> root.lkupIsLowCreditRisk,
> root.lkupIsOBSWithinIFRS9,
> root.lkupIsUnderSpecialSurveillance,
> root.lkupProtection,
> root.lkupIsGeneralAllowance,
> root.lkupSectorUltimateRisk,
> root.cptyOrgUnitId,
> root.name,
> root.lkupNationality,
> root.lkupSize,
> root.lkupIsSPV,
> root.lkupIsCentralCounterparty,
> root.lkupIsMMRMFI,
> root.lkupIsKeyManagement,
> root.lkupIsOtherRelatedParty,
> root.lkupResidenceProvince,
> root.lkupIsTradingBook,
> root.entityHierarchy_entityId,
> root.entityHierarchy_Residence,
> root.lkupLocalCurrency,
> root.cpty_entityhierarchy_entityId,
> root.lkupRelationship,
> root.cpty_lkupRelationship,
> root.entityNationality,
> root.lkupRepCurrency,
> root.startDateFinancialYear,
> root.numEmployees,
> root.numEmployeesTotal,
> root.collateralAmount,
> root.guaranteeAmount,
> root.impairmentSpecificIndividual,
> root.impairmentSpecificCollective,
> root.impairmentGeneral,
> root.creditRiskAmount,
> root.provisionSpecificIndividual,
> root.provisionSpecificCollective,
> root.provisionGeneral,
> root.writeOffAmount,
> root.interest,
> root.fairValueAmount,
> root.grossCarryingAmount,
> root.carryingAmount,
> root.code,
> root.lkupInstrumentType,
> root.price,
> root.amountAtIssue,
> root.yield,
> root.totalFacilityAmount,
> root.facility_rate,
> root.spec_indiv_est,
> root.spec_coll_est,
> root.coll_inc_loss,
> root.impairment_amount,
> root.provision_amount,
> root.accumulated_impairment,
> root.exclusionFlag,
> root.lkupIsHoldingCompany,
> root.instrument_startDate,
> root.entityResidence,
> fxRate.enumerator,
> fxRate.lkupFromCurrency,
> fxRate.rate,
> fxRate.custom1,
> fxRate.custom2,
> fxRate.custom3,
> GB_position.lkupIsECGDGuaranteed,
> GB_position.lkupIsMultiAcctOffsetMortgage,
> GB_position.lkupIsIndexLinked,
> GB_position.lkupIsRetail,
> GB_position.lkupCollateralLocation,
> GB_position.percentAboveBBR,
> GB_position.lkupIsMoreInArrears,
> GB_position.lkupIsArrearsCa

[jira] [Updated] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception

2020-02-17 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-30332:

Attachment: AGGR_41406.csv

> When running sql query with limit catalyst throw StackOverFlow exception 
> -
>
> Key: SPARK-30332
> URL: https://issues.apache.org/jira/browse/SPARK-30332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark version 3.0.0-preview
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, 
> AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, 
> T_41233.csv
>
>
> Running that SQL:
> {code:sql}
> SELECT  BT_capital.asof_date,
> BT_capital.run_id,
> BT_capital.v,
> BT_capital.id,
> BT_capital.entity,
> BT_capital.level_1,
> BT_capital.level_2,
> BT_capital.level_3,
> BT_capital.level_4,
> BT_capital.level_5,
> BT_capital.level_6,
> BT_capital.path_bt_capital,
> BT_capital.line_item,
> t0.target_line_item,
> t0.line_description,
> BT_capital.col_item,
> BT_capital.rep_amount,
> root.orgUnitId,
> root.cptyId,
> root.instId,
> root.startDate,
> root.maturityDate,
> root.amount,
> root.nominalAmount,
> root.quantity,
> root.lkupAssetLiability,
> root.lkupCurrency,
> root.lkupProdType,
> root.interestResetDate,
> root.interestResetTerm,
> root.noticePeriod,
> root.historicCostAmount,
> root.dueDate,
> root.lkupResidence,
> root.lkupCountryOfUltimateRisk,
> root.lkupSector,
> root.lkupIndustry,
> root.lkupAccountingPortfolioType,
> root.lkupLoanDepositTerm,
> root.lkupFixedFloating,
> root.lkupCollateralType,
> root.lkupRiskType,
> root.lkupEligibleRefinancing,
> root.lkupHedging,
> root.lkupIsOwnIssued,
> root.lkupIsSubordinated,
> root.lkupIsQuoted,
> root.lkupIsSecuritised,
> root.lkupIsSecuritisedServiced,
> root.lkupIsSyndicated,
> root.lkupIsDeRecognised,
> root.lkupIsRenegotiated,
> root.lkupIsTransferable,
> root.lkupIsNewBusiness,
> root.lkupIsFiduciary,
> root.lkupIsNonPerforming,
> root.lkupIsInterGroup,
> root.lkupIsIntraGroup,
> root.lkupIsRediscounted,
> root.lkupIsCollateral,
> root.lkupIsExercised,
> root.lkupIsImpaired,
> root.facilityId,
> root.lkupIsOTC,
> root.lkupIsDefaulted,
> root.lkupIsSavingsPosition,
> root.lkupIsForborne,
> root.lkupIsDebtRestructuringLoan,
> root.interestRateAAR,
> root.interestRateAPRC,
> root.custom1,
> root.custom2,
> root.custom3,
> root.lkupSecuritisationType,
> root.lkupIsCashPooling,
> root.lkupIsEquityParticipationGTE10,
> root.lkupIsConvertible,
> root.lkupEconomicHedge,
> root.lkupIsNonCurrHeldForSale,
> root.lkupIsEmbeddedDerivative,
> root.lkupLoanPurpose,
> root.lkupRegulated,
> root.lkupRepaymentType,
> root.glAccount,
> root.lkupIsRecourse,
> root.lkupIsNotFullyGuaranteed,
> root.lkupImpairmentStage,
> root.lkupIsEntireAmountWrittenOff,
> root.lkupIsLowCreditRisk,
> root.lkupIsOBSWithinIFRS9,
> root.lkupIsUnderSpecialSurveillance,
> root.lkupProtection,
> root.lkupIsGeneralAllowance,
> root.lkupSectorUltimateRisk,
> root.cptyOrgUnitId,
> root.name,
> root.lkupNationality,
> root.lkupSize,
> root.lkupIsSPV,
> root.lkupIsCentralCounterparty,
> root.lkupIsMMRMFI,
> root.lkupIsKeyManagement,
> root.lkupIsOtherRelatedParty,
> root.lkupResidenceProvince,
> root.lkupIsTradingBook,
> root.entityHierarchy_entityId,
> root.entityHierarchy_Residence,
> root.lkupLocalCurrency,
> root.cpty_entityhierarchy_entityId,
> root.lkupRelationship,
> root.cpty_lkupRelationship,
> root.entityNationality,
> root.lkupRepCurrency,
> root.startDateFinancialYear,
> root.numEmployees,
> root.numEmployeesTotal,
> root.collateralAmount,
> root.guaranteeAmount,
> root.impairmentSpecificIndividual,
> root.impairmentSpecificCollective,
> root.impairmentGeneral,
> root.creditRiskAmount,
> root.provisionSpecificIndividual,
> root.provisionSpecificCollective,
> root.provisionGeneral,
> root.writeOffAmount,
> root.interest,
> root.fairValueAmount,
> root.grossCarryingAmount,
> root.carryingAmount,
> root.code,
> root.lkupInstrumentType,
> root.price,
> root.amountAtIssue,
> root.yield,
> root.totalFacilityAmount,
> root.facility_rate,
> root.spec_indiv_est,
> root.spec_coll_est,
> root.coll_inc_loss,
> root.impairment_amount,
> root.provision_amount,
> root.accumulated_impairment,
> root.exclusionFlag,
> root.lkupIsHoldingCompany,
> root.instrument_startDate,
> root.entityResidence,
> fxRate.enumerator,
> fxRate.lkupFromCurrency,
> fxRate.rate,
> fxRate.custom1,
> fxRate.custom2,
> fxRate.custom3,
> GB_position.lkupIsECGDGuaranteed,
> GB_position.lkupIsMultiAcctOffsetMortgage,
> GB_position.lkupIsIndexLinked,
> GB_position.lkupIsRetail,
> GB_position.lkupCollateralLocation,
> GB_position.percentAboveBBR,
> GB_position.lkupIsMoreInArrears,
> GB_position.lkupIsArrearsCapitalise

[jira] [Updated] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception

2020-02-17 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-30332:

Attachment: T_41233.csv

> When running sql query with limit catalyst throw StackOverFlow exception 
> -
>
> Key: SPARK-30332
> URL: https://issues.apache.org/jira/browse/SPARK-30332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark version 3.0.0-preview
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, 
> AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, T_41233.csv
>
>
> Running that SQL:
> {code:sql}
> SELECT  BT_capital.asof_date,
> BT_capital.run_id,
> BT_capital.v,
> BT_capital.id,
> BT_capital.entity,
> BT_capital.level_1,
> BT_capital.level_2,
> BT_capital.level_3,
> BT_capital.level_4,
> BT_capital.level_5,
> BT_capital.level_6,
> BT_capital.path_bt_capital,
> BT_capital.line_item,
> t0.target_line_item,
> t0.line_description,
> BT_capital.col_item,
> BT_capital.rep_amount,
> root.orgUnitId,
> root.cptyId,
> root.instId,
> root.startDate,
> root.maturityDate,
> root.amount,
> root.nominalAmount,
> root.quantity,
> root.lkupAssetLiability,
> root.lkupCurrency,
> root.lkupProdType,
> root.interestResetDate,
> root.interestResetTerm,
> root.noticePeriod,
> root.historicCostAmount,
> root.dueDate,
> root.lkupResidence,
> root.lkupCountryOfUltimateRisk,
> root.lkupSector,
> root.lkupIndustry,
> root.lkupAccountingPortfolioType,
> root.lkupLoanDepositTerm,
> root.lkupFixedFloating,
> root.lkupCollateralType,
> root.lkupRiskType,
> root.lkupEligibleRefinancing,
> root.lkupHedging,
> root.lkupIsOwnIssued,
> root.lkupIsSubordinated,
> root.lkupIsQuoted,
> root.lkupIsSecuritised,
> root.lkupIsSecuritisedServiced,
> root.lkupIsSyndicated,
> root.lkupIsDeRecognised,
> root.lkupIsRenegotiated,
> root.lkupIsTransferable,
> root.lkupIsNewBusiness,
> root.lkupIsFiduciary,
> root.lkupIsNonPerforming,
> root.lkupIsInterGroup,
> root.lkupIsIntraGroup,
> root.lkupIsRediscounted,
> root.lkupIsCollateral,
> root.lkupIsExercised,
> root.lkupIsImpaired,
> root.facilityId,
> root.lkupIsOTC,
> root.lkupIsDefaulted,
> root.lkupIsSavingsPosition,
> root.lkupIsForborne,
> root.lkupIsDebtRestructuringLoan,
> root.interestRateAAR,
> root.interestRateAPRC,
> root.custom1,
> root.custom2,
> root.custom3,
> root.lkupSecuritisationType,
> root.lkupIsCashPooling,
> root.lkupIsEquityParticipationGTE10,
> root.lkupIsConvertible,
> root.lkupEconomicHedge,
> root.lkupIsNonCurrHeldForSale,
> root.lkupIsEmbeddedDerivative,
> root.lkupLoanPurpose,
> root.lkupRegulated,
> root.lkupRepaymentType,
> root.glAccount,
> root.lkupIsRecourse,
> root.lkupIsNotFullyGuaranteed,
> root.lkupImpairmentStage,
> root.lkupIsEntireAmountWrittenOff,
> root.lkupIsLowCreditRisk,
> root.lkupIsOBSWithinIFRS9,
> root.lkupIsUnderSpecialSurveillance,
> root.lkupProtection,
> root.lkupIsGeneralAllowance,
> root.lkupSectorUltimateRisk,
> root.cptyOrgUnitId,
> root.name,
> root.lkupNationality,
> root.lkupSize,
> root.lkupIsSPV,
> root.lkupIsCentralCounterparty,
> root.lkupIsMMRMFI,
> root.lkupIsKeyManagement,
> root.lkupIsOtherRelatedParty,
> root.lkupResidenceProvince,
> root.lkupIsTradingBook,
> root.entityHierarchy_entityId,
> root.entityHierarchy_Residence,
> root.lkupLocalCurrency,
> root.cpty_entityhierarchy_entityId,
> root.lkupRelationship,
> root.cpty_lkupRelationship,
> root.entityNationality,
> root.lkupRepCurrency,
> root.startDateFinancialYear,
> root.numEmployees,
> root.numEmployeesTotal,
> root.collateralAmount,
> root.guaranteeAmount,
> root.impairmentSpecificIndividual,
> root.impairmentSpecificCollective,
> root.impairmentGeneral,
> root.creditRiskAmount,
> root.provisionSpecificIndividual,
> root.provisionSpecificCollective,
> root.provisionGeneral,
> root.writeOffAmount,
> root.interest,
> root.fairValueAmount,
> root.grossCarryingAmount,
> root.carryingAmount,
> root.code,
> root.lkupInstrumentType,
> root.price,
> root.amountAtIssue,
> root.yield,
> root.totalFacilityAmount,
> root.facility_rate,
> root.spec_indiv_est,
> root.spec_coll_est,
> root.coll_inc_loss,
> root.impairment_amount,
> root.provision_amount,
> root.accumulated_impairment,
> root.exclusionFlag,
> root.lkupIsHoldingCompany,
> root.instrument_startDate,
> root.entityResidence,
> fxRate.enumerator,
> fxRate.lkupFromCurrency,
> fxRate.rate,
> fxRate.custom1,
> fxRate.custom2,
> fxRate.custom3,
> GB_position.lkupIsECGDGuaranteed,
> GB_position.lkupIsMultiAcctOffsetMortgage,
> GB_position.lkupIsIndexLinked,
> GB_position.lkupIsRetail,
> GB_position.lkupCollateralLocation,
> GB_position.percentAboveBBR,
> GB_position.lkupIsMoreInArrears,
> GB_position.lkupIsArrearsCapitalised,
> GB_position.lkupC

[jira] [Updated] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception

2020-02-17 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-30332:

Attachment: AGGR_41418.csv
AGGR_41410.csv
AGGR_41406.csv
AGGR_41390.csv
AGGR_41380.csv
PORTFOLIO_41446.csv

> When running sql query with limit catalyst throw StackOverFlow exception 
> -
>
> Key: SPARK-30332
> URL: https://issues.apache.org/jira/browse/SPARK-30332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark version 3.0.0-preview
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, 
> AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, T_41233.csv
>
>
> Running that SQL:
> {code:sql}
> SELECT  BT_capital.asof_date,
> BT_capital.run_id,
> BT_capital.v,
> BT_capital.id,
> BT_capital.entity,
> BT_capital.level_1,
> BT_capital.level_2,
> BT_capital.level_3,
> BT_capital.level_4,
> BT_capital.level_5,
> BT_capital.level_6,
> BT_capital.path_bt_capital,
> BT_capital.line_item,
> t0.target_line_item,
> t0.line_description,
> BT_capital.col_item,
> BT_capital.rep_amount,
> root.orgUnitId,
> root.cptyId,
> root.instId,
> root.startDate,
> root.maturityDate,
> root.amount,
> root.nominalAmount,
> root.quantity,
> root.lkupAssetLiability,
> root.lkupCurrency,
> root.lkupProdType,
> root.interestResetDate,
> root.interestResetTerm,
> root.noticePeriod,
> root.historicCostAmount,
> root.dueDate,
> root.lkupResidence,
> root.lkupCountryOfUltimateRisk,
> root.lkupSector,
> root.lkupIndustry,
> root.lkupAccountingPortfolioType,
> root.lkupLoanDepositTerm,
> root.lkupFixedFloating,
> root.lkupCollateralType,
> root.lkupRiskType,
> root.lkupEligibleRefinancing,
> root.lkupHedging,
> root.lkupIsOwnIssued,
> root.lkupIsSubordinated,
> root.lkupIsQuoted,
> root.lkupIsSecuritised,
> root.lkupIsSecuritisedServiced,
> root.lkupIsSyndicated,
> root.lkupIsDeRecognised,
> root.lkupIsRenegotiated,
> root.lkupIsTransferable,
> root.lkupIsNewBusiness,
> root.lkupIsFiduciary,
> root.lkupIsNonPerforming,
> root.lkupIsInterGroup,
> root.lkupIsIntraGroup,
> root.lkupIsRediscounted,
> root.lkupIsCollateral,
> root.lkupIsExercised,
> root.lkupIsImpaired,
> root.facilityId,
> root.lkupIsOTC,
> root.lkupIsDefaulted,
> root.lkupIsSavingsPosition,
> root.lkupIsForborne,
> root.lkupIsDebtRestructuringLoan,
> root.interestRateAAR,
> root.interestRateAPRC,
> root.custom1,
> root.custom2,
> root.custom3,
> root.lkupSecuritisationType,
> root.lkupIsCashPooling,
> root.lkupIsEquityParticipationGTE10,
> root.lkupIsConvertible,
> root.lkupEconomicHedge,
> root.lkupIsNonCurrHeldForSale,
> root.lkupIsEmbeddedDerivative,
> root.lkupLoanPurpose,
> root.lkupRegulated,
> root.lkupRepaymentType,
> root.glAccount,
> root.lkupIsRecourse,
> root.lkupIsNotFullyGuaranteed,
> root.lkupImpairmentStage,
> root.lkupIsEntireAmountWrittenOff,
> root.lkupIsLowCreditRisk,
> root.lkupIsOBSWithinIFRS9,
> root.lkupIsUnderSpecialSurveillance,
> root.lkupProtection,
> root.lkupIsGeneralAllowance,
> root.lkupSectorUltimateRisk,
> root.cptyOrgUnitId,
> root.name,
> root.lkupNationality,
> root.lkupSize,
> root.lkupIsSPV,
> root.lkupIsCentralCounterparty,
> root.lkupIsMMRMFI,
> root.lkupIsKeyManagement,
> root.lkupIsOtherRelatedParty,
> root.lkupResidenceProvince,
> root.lkupIsTradingBook,
> root.entityHierarchy_entityId,
> root.entityHierarchy_Residence,
> root.lkupLocalCurrency,
> root.cpty_entityhierarchy_entityId,
> root.lkupRelationship,
> root.cpty_lkupRelationship,
> root.entityNationality,
> root.lkupRepCurrency,
> root.startDateFinancialYear,
> root.numEmployees,
> root.numEmployeesTotal,
> root.collateralAmount,
> root.guaranteeAmount,
> root.impairmentSpecificIndividual,
> root.impairmentSpecificCollective,
> root.impairmentGeneral,
> root.creditRiskAmount,
> root.provisionSpecificIndividual,
> root.provisionSpecificCollective,
> root.provisionGeneral,
> root.writeOffAmount,
> root.interest,
> root.fairValueAmount,
> root.grossCarryingAmount,
> root.carryingAmount,
> root.code,
> root.lkupInstrumentType,
> root.price,
> root.amountAtIssue,
> root.yield,
> root.totalFacilityAmount,
> root.facility_rate,
> root.spec_indiv_est,
> root.spec_coll_est,
> root.coll_inc_loss,
> root.impairment_amount,
> root.provision_amount,
> root.accumulated_impairment,
> root.exclusionFlag,
> root.lkupIsHoldingCompany,
> root.instrument_startDate,
> root.entityResidence,
> fxRate.enumerator,
> fxRate.lkupFromCurrency,
> fxRate.rate,
> fxRate.custom1,
> fxRate.custom2,
> fxRate.custom3,
> GB_position.lkupIsECGDGuaranteed,
> GB_position.lkupIsMultiAcctOffsetMortgage,
> GB_position.lkupIsIndexLinked,
> GB_position.lkupIsRetail,

[jira] [Commented] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer

2020-02-13 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036203#comment-17036203
 ] 

Izek Greenfield commented on SPARK-16387:
-

Hi [~dongjoon]

Thanks for the quick response. We found that on Oracle the added double quotes cause 
errors in our other servers that try to read that table.

If the column name is `t_id` and the table is created with "t_id", then running 
`select T_ID from table` returns an error from Oracle.

The docs on the method say that it escapes only reserved words, but it actually 
escapes every identifier, so I think it could be changed to really escape only 
reserved words, as I do in the link I posted in the previous comment.

> Reserved SQL words are not escaped by JDBC writer
> -
>
> Key: SPARK-16387
> URL: https://issues.apache.org/jira/browse/SPARK-16387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Lev
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.0.0
>
>
> Here is a code (imports are omitted)
> object Main extends App {
>   val sqlSession = SparkSession.builder().config(new SparkConf().
> setAppName("Sql Test").set("spark.app.id", "SQLTest").
> set("spark.master", "local[2]").
> set("spark.ui.enabled", "false")
> .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar" ))
>   ).getOrCreate()
>   import sqlSession.implicits._
>   val localprops = new Properties
>   localprops.put("user", "")
>   localprops.put("password", "")
>   val df = sqlSession.createDataset(Seq("a","b","c")).toDF("order")
>   val writer = df.write
>   .mode(SaveMode.Append)
>   writer
>   .jdbc("jdbc:mysql://localhost:3306/test3", s"jira_test", localprops)
> }
> End error is :
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
> in your SQL syntax; check the manual that corresponds to your MySQL server 
> version for the right syntax to use near 'order TEXT )' at line 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> Clearly the reserved word  has to be quoted



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer

2020-02-12 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035395#comment-17035395
 ] 

Izek Greenfield commented on SPARK-16387:
-

 

[~dongjoon]
I think that PR fixes one issue but creates another, and it should be solved 
differently: instead of quoting every identifier, each supported dialect should 
check whether the word is reserved and quote it only in that case, as the Scaladoc 
of the function says:

/**
 * Quotes the identifier. This is used to put quotes around the identifier in case the column
 * name is a reserved keyword, or in case it contains characters that require quotes (e.g. space).
 */

Something like: [link|https://stackoverflow.com/a/60167628]
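For illustration only, a rough sketch of that direction (a hypothetical dialect with a deliberately incomplete reserved-word list; not the actual Spark implementation):
{code:scala}
import java.util.Locale

import org.apache.spark.sql.jdbc.JdbcDialect

// Hypothetical example: quote an identifier only when it is a reserved word
// or contains characters that would otherwise require quoting.
object SelectiveQuotingOracleDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  // Illustrative, incomplete list of reserved words.
  private val reservedWords = Set("ORDER", "SELECT", "TABLE", "GROUP", "DATE")

  override def quoteIdentifier(colName: String): String = {
    val needsQuoting =
      reservedWords.contains(colName.toUpperCase(Locale.ROOT)) ||
        !colName.matches("[A-Za-z][A-Za-z0-9_]*")
    if (needsQuoting) s""""$colName"""" else colName
  }
}

// e.g. org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(SelectiveQuotingOracleDialect)
// at application startup would make the JDBC writer use the selective quoting.
{code}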

> Reserved SQL words are not escaped by JDBC writer
> -
>
> Key: SPARK-16387
> URL: https://issues.apache.org/jira/browse/SPARK-16387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Lev
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.0.0
>
>
> Here is a code (imports are omitted)
> object Main extends App {
>   val sqlSession = SparkSession.builder().config(new SparkConf().
> setAppName("Sql Test").set("spark.app.id", "SQLTest").
> set("spark.master", "local[2]").
> set("spark.ui.enabled", "false")
> .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar" ))
>   ).getOrCreate()
>   import sqlSession.implicits._
>   val localprops = new Properties
>   localprops.put("user", "")
>   localprops.put("password", "")
>   val df = sqlSession.createDataset(Seq("a","b","c")).toDF("order")
>   val writer = df.write
>   .mode(SaveMode.Append)
>   writer
>   .jdbc("jdbc:mysql://localhost:3306/test3", s"jira_test", localprops)
> }
> End error is :
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
> in your SQL syntax; check the manual that corresponds to your MySQL server 
> version for the right syntax to use near 'order TEXT )' at line 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> Clearly the reserved word  has to be quoted



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception

2020-01-03 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007373#comment-17007373
 ] 

Izek Greenfield commented on SPARK-30332:
-

Yes, when the limit is removed everything works and results are returned.

> When running sql query with limit catalyst throw StackOverFlow exception 
> -
>
> Key: SPARK-30332
> URL: https://issues.apache.org/jira/browse/SPARK-30332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark version 3.0.0-preview
>Reporter: Izek Greenfield
>Priority: Major
>
> Running that SQL:
> {code:sql}
> SELECT  BT_capital.asof_date,
> BT_capital.run_id,
> BT_capital.v,
> BT_capital.id,
> BT_capital.entity,
> BT_capital.level_1,
> BT_capital.level_2,
> BT_capital.level_3,
> BT_capital.level_4,
> BT_capital.level_5,
> BT_capital.level_6,
> BT_capital.path_bt_capital,
> BT_capital.line_item,
> t0.target_line_item,
> t0.line_description,
> BT_capital.col_item,
> BT_capital.rep_amount,
> root.orgUnitId,
> root.cptyId,
> root.instId,
> root.startDate,
> root.maturityDate,
> root.amount,
> root.nominalAmount,
> root.quantity,
> root.lkupAssetLiability,
> root.lkupCurrency,
> root.lkupProdType,
> root.interestResetDate,
> root.interestResetTerm,
> root.noticePeriod,
> root.historicCostAmount,
> root.dueDate,
> root.lkupResidence,
> root.lkupCountryOfUltimateRisk,
> root.lkupSector,
> root.lkupIndustry,
> root.lkupAccountingPortfolioType,
> root.lkupLoanDepositTerm,
> root.lkupFixedFloating,
> root.lkupCollateralType,
> root.lkupRiskType,
> root.lkupEligibleRefinancing,
> root.lkupHedging,
> root.lkupIsOwnIssued,
> root.lkupIsSubordinated,
> root.lkupIsQuoted,
> root.lkupIsSecuritised,
> root.lkupIsSecuritisedServiced,
> root.lkupIsSyndicated,
> root.lkupIsDeRecognised,
> root.lkupIsRenegotiated,
> root.lkupIsTransferable,
> root.lkupIsNewBusiness,
> root.lkupIsFiduciary,
> root.lkupIsNonPerforming,
> root.lkupIsInterGroup,
> root.lkupIsIntraGroup,
> root.lkupIsRediscounted,
> root.lkupIsCollateral,
> root.lkupIsExercised,
> root.lkupIsImpaired,
> root.facilityId,
> root.lkupIsOTC,
> root.lkupIsDefaulted,
> root.lkupIsSavingsPosition,
> root.lkupIsForborne,
> root.lkupIsDebtRestructuringLoan,
> root.interestRateAAR,
> root.interestRateAPRC,
> root.custom1,
> root.custom2,
> root.custom3,
> root.lkupSecuritisationType,
> root.lkupIsCashPooling,
> root.lkupIsEquityParticipationGTE10,
> root.lkupIsConvertible,
> root.lkupEconomicHedge,
> root.lkupIsNonCurrHeldForSale,
> root.lkupIsEmbeddedDerivative,
> root.lkupLoanPurpose,
> root.lkupRegulated,
> root.lkupRepaymentType,
> root.glAccount,
> root.lkupIsRecourse,
> root.lkupIsNotFullyGuaranteed,
> root.lkupImpairmentStage,
> root.lkupIsEntireAmountWrittenOff,
> root.lkupIsLowCreditRisk,
> root.lkupIsOBSWithinIFRS9,
> root.lkupIsUnderSpecialSurveillance,
> root.lkupProtection,
> root.lkupIsGeneralAllowance,
> root.lkupSectorUltimateRisk,
> root.cptyOrgUnitId,
> root.name,
> root.lkupNationality,
> root.lkupSize,
> root.lkupIsSPV,
> root.lkupIsCentralCounterparty,
> root.lkupIsMMRMFI,
> root.lkupIsKeyManagement,
> root.lkupIsOtherRelatedParty,
> root.lkupResidenceProvince,
> root.lkupIsTradingBook,
> root.entityHierarchy_entityId,
> root.entityHierarchy_Residence,
> root.lkupLocalCurrency,
> root.cpty_entityhierarchy_entityId,
> root.lkupRelationship,
> root.cpty_lkupRelationship,
> root.entityNationality,
> root.lkupRepCurrency,
> root.startDateFinancialYear,
> root.numEmployees,
> root.numEmployeesTotal,
> root.collateralAmount,
> root.guaranteeAmount,
> root.impairmentSpecificIndividual,
> root.impairmentSpecificCollective,
> root.impairmentGeneral,
> root.creditRiskAmount,
> root.provisionSpecificIndividual,
> root.provisionSpecificCollective,
> root.provisionGeneral,
> root.writeOffAmount,
> root.interest,
> root.fairValueAmount,
> root.grossCarryingAmount,
> root.carryingAmount,
> root.code,
> root.lkupInstrumentType,
> root.price,
> root.amountAtIssue,
> root.yield,
> root.totalFacilityAmount,
> root.facility_rate,
> root.spec_indiv_est,
> root.spec_coll_est,
> root.coll_inc_loss,
> root.impairment_amount,
> root.provision_amount,
> root.accumulated_impairment,
> root.exclusionFlag,
> root.lkupIsHoldingCompany,
> root.instrument_startDate,
> root.entityResidence,
> fxRate.enumerator,
> fxRate.lkupFromCurrency,
> fxRate.rate,
> fxRate.custom1,
> fxRate.custom2,
> fxRate.custom3,
> GB_position.lkupIsECGDGuaranteed,
> GB_position.lkupIsMultiAcctOffsetMortgage,
> GB_position.lkupIsIndexLinked,
> GB_position.lkupIsRetail,
> GB_position.lkupCollateralLocation,
> GB_position.percentAboveBBR,
> GB_position.lkupIsMoreInArrears,
> GB_position.lkupIsArrearsCapitalised,
> GB_position.lkupCollateralPossession,
> GB_position.lkupIsLifetimeMortgage

[jira] [Commented] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception

2019-12-27 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004146#comment-17004146
 ] 

Izek Greenfield commented on SPARK-30332:
-

The task fails; I do not get any result.

> When running sql query with limit catalyst throw StackOverFlow exception 
> -
>
> Key: SPARK-30332
> URL: https://issues.apache.org/jira/browse/SPARK-30332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark version 3.0.0-preview
>Reporter: Izek Greenfield
>Priority: Major
>
> Running that SQL:
> {code:sql}
> SELECT  BT_capital.asof_date,
> BT_capital.run_id,
> BT_capital.v,
> BT_capital.id,
> BT_capital.entity,
> BT_capital.level_1,
> BT_capital.level_2,
> BT_capital.level_3,
> BT_capital.level_4,
> BT_capital.level_5,
> BT_capital.level_6,
> BT_capital.path_bt_capital,
> BT_capital.line_item,
> t0.target_line_item,
> t0.line_description,
> BT_capital.col_item,
> BT_capital.rep_amount,
> root.orgUnitId,
> root.cptyId,
> root.instId,
> root.startDate,
> root.maturityDate,
> root.amount,
> root.nominalAmount,
> root.quantity,
> root.lkupAssetLiability,
> root.lkupCurrency,
> root.lkupProdType,
> root.interestResetDate,
> root.interestResetTerm,
> root.noticePeriod,
> root.historicCostAmount,
> root.dueDate,
> root.lkupResidence,
> root.lkupCountryOfUltimateRisk,
> root.lkupSector,
> root.lkupIndustry,
> root.lkupAccountingPortfolioType,
> root.lkupLoanDepositTerm,
> root.lkupFixedFloating,
> root.lkupCollateralType,
> root.lkupRiskType,
> root.lkupEligibleRefinancing,
> root.lkupHedging,
> root.lkupIsOwnIssued,
> root.lkupIsSubordinated,
> root.lkupIsQuoted,
> root.lkupIsSecuritised,
> root.lkupIsSecuritisedServiced,
> root.lkupIsSyndicated,
> root.lkupIsDeRecognised,
> root.lkupIsRenegotiated,
> root.lkupIsTransferable,
> root.lkupIsNewBusiness,
> root.lkupIsFiduciary,
> root.lkupIsNonPerforming,
> root.lkupIsInterGroup,
> root.lkupIsIntraGroup,
> root.lkupIsRediscounted,
> root.lkupIsCollateral,
> root.lkupIsExercised,
> root.lkupIsImpaired,
> root.facilityId,
> root.lkupIsOTC,
> root.lkupIsDefaulted,
> root.lkupIsSavingsPosition,
> root.lkupIsForborne,
> root.lkupIsDebtRestructuringLoan,
> root.interestRateAAR,
> root.interestRateAPRC,
> root.custom1,
> root.custom2,
> root.custom3,
> root.lkupSecuritisationType,
> root.lkupIsCashPooling,
> root.lkupIsEquityParticipationGTE10,
> root.lkupIsConvertible,
> root.lkupEconomicHedge,
> root.lkupIsNonCurrHeldForSale,
> root.lkupIsEmbeddedDerivative,
> root.lkupLoanPurpose,
> root.lkupRegulated,
> root.lkupRepaymentType,
> root.glAccount,
> root.lkupIsRecourse,
> root.lkupIsNotFullyGuaranteed,
> root.lkupImpairmentStage,
> root.lkupIsEntireAmountWrittenOff,
> root.lkupIsLowCreditRisk,
> root.lkupIsOBSWithinIFRS9,
> root.lkupIsUnderSpecialSurveillance,
> root.lkupProtection,
> root.lkupIsGeneralAllowance,
> root.lkupSectorUltimateRisk,
> root.cptyOrgUnitId,
> root.name,
> root.lkupNationality,
> root.lkupSize,
> root.lkupIsSPV,
> root.lkupIsCentralCounterparty,
> root.lkupIsMMRMFI,
> root.lkupIsKeyManagement,
> root.lkupIsOtherRelatedParty,
> root.lkupResidenceProvince,
> root.lkupIsTradingBook,
> root.entityHierarchy_entityId,
> root.entityHierarchy_Residence,
> root.lkupLocalCurrency,
> root.cpty_entityhierarchy_entityId,
> root.lkupRelationship,
> root.cpty_lkupRelationship,
> root.entityNationality,
> root.lkupRepCurrency,
> root.startDateFinancialYear,
> root.numEmployees,
> root.numEmployeesTotal,
> root.collateralAmount,
> root.guaranteeAmount,
> root.impairmentSpecificIndividual,
> root.impairmentSpecificCollective,
> root.impairmentGeneral,
> root.creditRiskAmount,
> root.provisionSpecificIndividual,
> root.provisionSpecificCollective,
> root.provisionGeneral,
> root.writeOffAmount,
> root.interest,
> root.fairValueAmount,
> root.grossCarryingAmount,
> root.carryingAmount,
> root.code,
> root.lkupInstrumentType,
> root.price,
> root.amountAtIssue,
> root.yield,
> root.totalFacilityAmount,
> root.facility_rate,
> root.spec_indiv_est,
> root.spec_coll_est,
> root.coll_inc_loss,
> root.impairment_amount,
> root.provision_amount,
> root.accumulated_impairment,
> root.exclusionFlag,
> root.lkupIsHoldingCompany,
> root.instrument_startDate,
> root.entityResidence,
> fxRate.enumerator,
> fxRate.lkupFromCurrency,
> fxRate.rate,
> fxRate.custom1,
> fxRate.custom2,
> fxRate.custom3,
> GB_position.lkupIsECGDGuaranteed,
> GB_position.lkupIsMultiAcctOffsetMortgage,
> GB_position.lkupIsIndexLinked,
> GB_position.lkupIsRetail,
> GB_position.lkupCollateralLocation,
> GB_position.percentAboveBBR,
> GB_position.lkupIsMoreInArrears,
> GB_position.lkupIsArrearsCapitalised,
> GB_position.lkupCollateralPossession,
> GB_position.lkupIsLifetimeMortgage,
> GB_position

[jira] [Created] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception

2019-12-23 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-30332:
---

 Summary: When running sql query with limit catalyst throw 
StackOverFlow exception 
 Key: SPARK-30332
 URL: https://issues.apache.org/jira/browse/SPARK-30332
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
 Environment: spark version 3.0.0-preview
Reporter: Izek Greenfield


Running that SQL:
{code:sql}
SELECT  BT_capital.asof_date,
BT_capital.run_id,
BT_capital.v,
BT_capital.id,
BT_capital.entity,
BT_capital.level_1,
BT_capital.level_2,
BT_capital.level_3,
BT_capital.level_4,
BT_capital.level_5,
BT_capital.level_6,
BT_capital.path_bt_capital,
BT_capital.line_item,
t0.target_line_item,
t0.line_description,
BT_capital.col_item,
BT_capital.rep_amount,
root.orgUnitId,
root.cptyId,
root.instId,
root.startDate,
root.maturityDate,
root.amount,
root.nominalAmount,
root.quantity,
root.lkupAssetLiability,
root.lkupCurrency,
root.lkupProdType,
root.interestResetDate,
root.interestResetTerm,
root.noticePeriod,
root.historicCostAmount,
root.dueDate,
root.lkupResidence,
root.lkupCountryOfUltimateRisk,
root.lkupSector,
root.lkupIndustry,
root.lkupAccountingPortfolioType,
root.lkupLoanDepositTerm,
root.lkupFixedFloating,
root.lkupCollateralType,
root.lkupRiskType,
root.lkupEligibleRefinancing,
root.lkupHedging,
root.lkupIsOwnIssued,
root.lkupIsSubordinated,
root.lkupIsQuoted,
root.lkupIsSecuritised,
root.lkupIsSecuritisedServiced,
root.lkupIsSyndicated,
root.lkupIsDeRecognised,
root.lkupIsRenegotiated,
root.lkupIsTransferable,
root.lkupIsNewBusiness,
root.lkupIsFiduciary,
root.lkupIsNonPerforming,
root.lkupIsInterGroup,
root.lkupIsIntraGroup,
root.lkupIsRediscounted,
root.lkupIsCollateral,
root.lkupIsExercised,
root.lkupIsImpaired,
root.facilityId,
root.lkupIsOTC,
root.lkupIsDefaulted,
root.lkupIsSavingsPosition,
root.lkupIsForborne,
root.lkupIsDebtRestructuringLoan,
root.interestRateAAR,
root.interestRateAPRC,
root.custom1,
root.custom2,
root.custom3,
root.lkupSecuritisationType,
root.lkupIsCashPooling,
root.lkupIsEquityParticipationGTE10,
root.lkupIsConvertible,
root.lkupEconomicHedge,
root.lkupIsNonCurrHeldForSale,
root.lkupIsEmbeddedDerivative,
root.lkupLoanPurpose,
root.lkupRegulated,
root.lkupRepaymentType,
root.glAccount,
root.lkupIsRecourse,
root.lkupIsNotFullyGuaranteed,
root.lkupImpairmentStage,
root.lkupIsEntireAmountWrittenOff,
root.lkupIsLowCreditRisk,
root.lkupIsOBSWithinIFRS9,
root.lkupIsUnderSpecialSurveillance,
root.lkupProtection,
root.lkupIsGeneralAllowance,
root.lkupSectorUltimateRisk,
root.cptyOrgUnitId,
root.name,
root.lkupNationality,
root.lkupSize,
root.lkupIsSPV,
root.lkupIsCentralCounterparty,
root.lkupIsMMRMFI,
root.lkupIsKeyManagement,
root.lkupIsOtherRelatedParty,
root.lkupResidenceProvince,
root.lkupIsTradingBook,
root.entityHierarchy_entityId,
root.entityHierarchy_Residence,
root.lkupLocalCurrency,
root.cpty_entityhierarchy_entityId,
root.lkupRelationship,
root.cpty_lkupRelationship,
root.entityNationality,
root.lkupRepCurrency,
root.startDateFinancialYear,
root.numEmployees,
root.numEmployeesTotal,
root.collateralAmount,
root.guaranteeAmount,
root.impairmentSpecificIndividual,
root.impairmentSpecificCollective,
root.impairmentGeneral,
root.creditRiskAmount,
root.provisionSpecificIndividual,
root.provisionSpecificCollective,
root.provisionGeneral,
root.writeOffAmount,
root.interest,
root.fairValueAmount,
root.grossCarryingAmount,
root.carryingAmount,
root.code,
root.lkupInstrumentType,
root.price,
root.amountAtIssue,
root.yield,
root.totalFacilityAmount,
root.facility_rate,
root.spec_indiv_est,
root.spec_coll_est,
root.coll_inc_loss,
root.impairment_amount,
root.provision_amount,
root.accumulated_impairment,
root.exclusionFlag,
root.lkupIsHoldingCompany,
root.instrument_startDate,
root.entityResidence,
fxRate.enumerator,
fxRate.lkupFromCurrency,
fxRate.rate,
fxRate.custom1,
fxRate.custom2,
fxRate.custom3,
GB_position.lkupIsECGDGuaranteed,
GB_position.lkupIsMultiAcctOffsetMortgage,
GB_position.lkupIsIndexLinked,
GB_position.lkupIsRetail,
GB_position.lkupCollateralLocation,
GB_position.percentAboveBBR,
GB_position.lkupIsMoreInArrears,
GB_position.lkupIsArrearsCapitalised,
GB_position.lkupCollateralPossession,
GB_position.lkupIsLifetimeMortgage,
GB_position.lkupLoanConcessionType,
GB_position.lkupIsMultiCurrency,
GB_position.lkupIsJointIncomeBasis,
GB_position.ratioIncomeMultiple,
GB_position.interestRate,
GB_position.exclusionFlag,
GB_position.lkupFDIDirection,
GB_position.lkupIsRTGS,
GB_positionExtended.nonRecourseFinanceAmount,
GB_positionExtended.arrearsAmount,
GB_Counterparty.lkupIsClearingFirm,
GB_Counterparty.lkupIsIntermediateFinCorp,
GB_Counterparty.lkupIsImpairedCreditHistory,
GB_Counterparty.lkupFDIRelationship  FROM portfolio_41446 BT_capital
JOIN aggr_41390 root ON (root.id = BT_capital.id AND root.entity = 
BT_capital.entity AND (root.instance_id = 
'e3b82807-9371-44f4-9c97-d63cde

[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC

2019-11-21 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979236#comment-16979236
 ] 

Izek Greenfield commented on SPARK-8981:


[~srowen] I created a pull request for that.

> Set applicationId and appName in log4j MDC
> --
>
> Key: SPARK-8981
> URL: https://issues.apache.org/jira/browse/SPARK-8981
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Paweł Kopiczko
>Priority: Minor
>
> It would be nice to have, because it's good to have logs in one file when 
> using log agents (like logentires) in standalone mode. Also allows 
> configuring rolling file appender without a mess when multiple applications 
> are running.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13346) Using DataFrames iteratively leads to slow query planning

2019-10-10 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948443#comment-16948443
 ] 

Izek Greenfield commented on SPARK-13346:
-

[~davies] Why was this issue closed?

> Using DataFrames iteratively leads to slow query planning
> -
>
> Key: SPARK-13346
> URL: https://issues.apache.org/jira/browse/SPARK-13346
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Priority: Major
>  Labels: bulk-closed
>
> I have an iterative algorithm based on DataFrames, and the query plan grows 
> very quickly with each iteration.  Caching the current DataFrame at the end 
> of an iteration does not fix the problem.  However, converting the DataFrame 
> to an RDD and back at the end of each iteration does fix the problem.
> Printing the query plans shows that the plan explodes quickly (10 lines, to 
> several hundred lines, to several thousand lines, ...) with successive 
> iterations.
> The desired behavior is for the analyzer to recognize that a big chunk of the 
> query plan does not need to be computed since it is already cached.  The 
> computation on each iteration should be the same.
> If useful, I can push (complex) code to reproduce the issue.  But it should 
> be simple to see if you create an iterative algorithm which produces a new 
> DataFrame from an old one on each iteration.
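
For reference, a minimal sketch of the RDD round-trip workaround mentioned above (illustrative only; the `step` function and session handling are assumptions, not code from this ticket):

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Rebuilding the DataFrame from its RDD after every iteration truncates the
// accumulated query plan, so the analyzer does not re-plan all previous steps.
def iterate(spark: SparkSession, initial: DataFrame, iterations: Int)
           (step: DataFrame => DataFrame): DataFrame = {
  (1 to iterations).foldLeft(initial) { (df, _) =>
    val next = step(df)
    spark.createDataFrame(next.rdd, next.schema) // breaks the plan lineage
  }
}
{code}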



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27862) Upgrade json4s-jackson to 3.6.6

2019-05-30 Thread Izek Greenfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-27862:

Summary: Upgrade json4s-jackson to 3.6.6  (was: Upgrade json4s-jackson to 
3.6.5)

> Upgrade json4s-jackson to 3.6.6
> ---
>
> Key: SPARK-27862
> URL: https://issues.apache.org/jira/browse/SPARK-27862
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Izek Greenfield
>Priority: Minor
>
> it will be very good to upgrade to newer version



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-27862) Upgrade json4s-jackson to 3.6.5

2019-05-28 Thread Izek Greenfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-27862:

Comment: was deleted

(was: create pull request for 2.4 branch: 
https://github.com/apache/spark/pull/24729
create pull request for master branch: 
https://github.com/apache/spark/pull/24736
)

> Upgrade json4s-jackson to 3.6.5
> ---
>
> Key: SPARK-27862
> URL: https://issues.apache.org/jira/browse/SPARK-27862
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Izek Greenfield
>Priority: Minor
>
> it will be very good to upgrade to newer version



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27862) Upgrade json4s-jackson to 3.6.5

2019-05-28 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849703#comment-16849703
 ] 

Izek Greenfield edited comment on SPARK-27862 at 5/29/19 5:47 AM:
--

create pull request for 2.4 branch: https://github.com/apache/spark/pull/24729
create pull request for master branch: 
https://github.com/apache/spark/pull/24736



was (Author: igreenfi):
create pull request for 2.4 branch: https://github.com/apache/spark/pull/24729

> Upgrade json4s-jackson to 3.6.5
> ---
>
> Key: SPARK-27862
> URL: https://issues.apache.org/jira/browse/SPARK-27862
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Izek Greenfield
>Priority: Minor
>
> it will be very good to upgrade to newer version



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27862) Upgrade json4s-jackson to 3.6.5

2019-05-28 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849703#comment-16849703
 ] 

Izek Greenfield commented on SPARK-27862:
-

create pull request for 2.4 branch: https://github.com/apache/spark/pull/24729

> Upgrade json4s-jackson to 3.6.5
> ---
>
> Key: SPARK-27862
> URL: https://issues.apache.org/jira/browse/SPARK-27862
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Izek Greenfield
>Priority: Minor
>
> it will be very good to upgrade to newer version



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27862) Upgrade json4s-jackson to 3.6.5

2019-05-28 Thread Izek Greenfield (JIRA)
Izek Greenfield created SPARK-27862:
---

 Summary: Upgrade json4s-jackson to 3.6.5
 Key: SPARK-27862
 URL: https://issues.apache.org/jira/browse/SPARK-27862
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.3, 3.0.0
Reporter: Izek Greenfield


It would be very good to upgrade to a newer version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27745) build/mvn take wrong scala version when compile for scala 2.12

2019-05-16 Thread Izek Greenfield (JIRA)
Izek Greenfield created SPARK-27745:
---

 Summary: build/mvn take wrong scala version when compile for scala 
2.12
 Key: SPARK-27745
 URL: https://issues.apache.org/jira/browse/SPARK-27745
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.3
Reporter: Izek Greenfield


In `build/mvn`, this line:

local scala_binary_version=`grep "scala.binary.version" "${_DIR}/../pom.xml" | head -n1 | awk -F '[<>]' '{print $3}'`

greps the pom, whose first match is 2.11, so even if I set -Pscala-2.12 it will still take 
2.11.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2019-04-17 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820058#comment-16820058
 ] 

Izek Greenfield commented on SPARK-23904:
-

[~DaveDeCaprio] Does that PR go into the 2.4.1 release?

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark create the text representation of query in any case even if I don't 
> need it.
> That causes many garbage object and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10746) count ( distinct columnref) over () returns wrong result set

2019-03-12 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790428#comment-16790428
 ] 

Izek Greenfield edited comment on SPARK-10746 at 3/12/19 10:41 AM:
---

you can implement that by using: 
{code:scala}
import org.apache.spark.sql.functions._
size(collect_set(column).over(window))
{code}
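
For example (illustrative only; the DataFrame `df`, the column name and the empty window spec are assumptions), a window over the whole result set can be expressed like this:

{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_set, size}

// An empty partitionBy() corresponds to OVER (): one partition containing all rows,
// so the size of the collected set is the distinct count over the full result.
val w = Window.partitionBy()
val result = df.withColumn("distinct_cnt", size(collect_set(col("column")).over(w)))
{code}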



was (Author: igreenfi):
you can implement that by using: 
{code:java}
// Some comments here
size(collect_set(column).over(window))
{code}


> count ( distinct columnref) over () returns wrong result set
> 
>
> Key: SPARK-10746
> URL: https://issues.apache.org/jira/browse/SPARK-10746
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>Priority: Major
>
> Same issue as report against HIVE (HIVE-9534) 
> Result set was expected to contain 5 rows instead of 1 row as others vendors 
> (ORACLE, Netezza etc) would.
> select count( distinct column) over () from t1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10746) count ( distinct columnref) over () returns wrong result set

2019-03-12 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790428#comment-16790428
 ] 

Izek Greenfield commented on SPARK-10746:
-

you can implement that by using: 
{code:java}
// Some comments here
size(collect_set(column).over(window))
{code}


> count ( distinct columnref) over () returns wrong result set
> 
>
> Key: SPARK-10746
> URL: https://issues.apache.org/jira/browse/SPARK-10746
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>Priority: Major
>
> Same issue as report against HIVE (HIVE-9534) 
> Result set was expected to contain 5 rows instead of 1 row as others vendors 
> (ORACLE, Netezza etc) would.
> select count( distinct column) over () from t1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2019-01-17 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745454#comment-16745454
 ] 

Izek Greenfield commented on SPARK-23904:
-

[~staslos] The plan description is created in any case by this code: 
`queryExecution.toString`

{code:scala}
object SQLExecution {
  // ...
  withSQLConfPropagated(sparkSession) {
    sc.listenerBus.post(SparkListenerSQLExecutionStart(
      executionId, callSite.shortForm, callSite.longForm, queryExecution.toString,
      SparkPlanInfo.fromSparkPlan(queryExecution.executedPlan), System.currentTimeMillis()))
    try {
      body
    } finally {
      sc.listenerBus.post(SparkListenerSQLExecutionEnd(
        executionId, System.currentTimeMillis()))
    }
  }
} finally {
  executionIdToQueryExecution.remove(executionId)
  sc.setLocalProperty(EXECUTION_ID_KEY, oldExecutionId)
}
// ...
}
{code}

So the new PR will solve one issue, memory usage, but walking down the tree to build the 
unneeded string will still happen and waste execution time...

Can't this be created lazily, only when needed?
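
A rough sketch of the kind of laziness being asked for (illustrative only, not the actual Spark change; the class and usage are assumptions): the event could carry a lazily built description instead of an eagerly built String.

{code:scala}
// Illustrative only: defer building the expensive plan description until a
// consumer (for example the UI) actually reads it.
final class LazyPlanDescription(build: () => String) {
  lazy val value: String = build()
  override def toString: String = value
}

// Hypothetical call site, reusing the names from the snippet above:
// val planDescription = new LazyPlanDescription(() => queryExecution.toString)
{code}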

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark create the text representation of query in any case even if I don't 
> need it.
> That causes many garbage object and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25094) proccesNext() failed to compile size is over 64kb

2018-08-13 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578275#comment-16578275
 ] 

Izek Greenfield commented on SPARK-25094:
-

Looking at the code, the problem is here:

{code:scala}
def splitExpressionsWithCurrentInputs(
    expressions: Seq[String],
    funcName: String = "apply",
    extraArguments: Seq[(String, String)] = Nil,
    returnType: String = "void",
    makeSplitFunction: String => String = identity,
    foldFunctions: Seq[String] => String = _.mkString("", ";\n", ";")): String = {
  // TODO: support whole stage codegen
  if (INPUT_ROW == null || currentVars != null) {
    expressions.mkString("\n")
  } else {
    splitExpressions(
      expressions,
      funcName,
      ("InternalRow", INPUT_ROW) +: extraArguments,
      returnType,
      makeSplitFunction,
      foldFunctions)
  }
}
{code}

The TODO section!!

> proccesNext() failed to compile size is over 64kb
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: generated_code.txt
>
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25094) proccesNext() failed to compile size is over 64kb

2018-08-13 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578127#comment-16578127
 ] 

Izek Greenfield edited comment on SPARK-25094 at 8/13/18 1:03 PM:
--

The code that creates this plan is very complex. 
I will try to reproduce it with simple code; in the meanwhile I am attaching the 
generated code, so you can see that the problem is that the generated code does not 
split the work into functions and instead inlines the whole plan into the 
processNext method. 
[^generated_code.txt]  

It contains 2 DataFrames with 80 columns, 10 of them built from `case when` 
expressions like this one: 
CASE WHEN (`predefined_hc` IS NOT NULL) THEN '/Predefined_hc/' WHEN 
(`zero_volatility_adj_ind` = 'Y') THEN '/Zero_Haircuts_cases/' WHEN 
(`collateral_allocation_method` = 'FCSM') THEN '/FCSM_Collaterals/' WHEN 
`underlying_type` = 'DEBT') AND (`issuer_type` = 'CGVT')) AND 
((`instrument_cqs_st` <= 4) OR ((`instrument_cqs_st` = 7) AND 
(`instrument_cqs_lt` <= 4 AND (`residual_maturity_instrument` <= 12.0D)) 
THEN '/Debt/Central_Government_Issuer/Eligible/res_mat_1Y/' WHEN 
`underlying_type` = 'DEBT') AND (`issuer_type` = 'CGVT')) AND 
((`instrument_cqs_st` <= 4) OR ((`instrument_cqs_st` = 7) AND 
(`instrument_cqs_lt` <= 4 AND (`residual_maturity_instrument` <= 60.0D)) 
THEN '/Debt/Central_Government_Issuer/Eligible/res_mat_5Y/' WHEN 
(((`underlying_type` = 'DEBT') AND (`issuer_type` = 'CGVT')) AND 
((`instrument_cqs_st` <= 4) OR ((`instrument_cqs_st` = 7) AND 
(`instrument_cqs_lt` <= 4 THEN 
'/Debt/Central_Government_Issuer/Eligible/res_mat_G5/' WHEN ((`underlying_type` 
= 'DEBT') AND (`issuer_type` = 'CGVT')) THEN 
'/Debt/Central_Government_Issuer/Non_Eligible/' WHEN `underlying_type` = 
'DEBT') AND (`issuer_type` IN ('INST', 'CORP', 'PSE', 'RGLA', 'IO_LISTED'))) 
AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) AND 
(`instrument_cqs_lt` <= 3 AND (`residual_maturity_instrument` <= 12.0D)) 
THEN '/Debt/Other_Issuers/Eligible/res_mat_1Y/' WHEN `underlying_type` = 
'DEBT') AND (`issuer_type` IN ('INST', 'CORP', 'PSE', 'RGLA', 'IO_LISTED'))) 
AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) AND 
(`instrument_cqs_lt` <= 3 AND (`residual_maturity_instrument` <= 60.0D)) 
THEN '/Debt/Other_Issuers/Eligible/res_mat_5Y/' WHEN (((`underlying_type` = 
'DEBT') AND (`issuer_type` IN ('INST', 'CORP', 'PSE', 'RGLA', 'IO_LISTED'))) 
AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) AND 
(`instrument_cqs_lt` <= 3 THEN '/Debt/Other_Issuers/Eligible/res_mat_G5/' 
WHEN ((`underlying_type` = 'DEBT') AND (`issuer_type` IN ('INST', 'CORP', 
'PSE', 'RGLA', 'IO_LISTED'))) THEN '/Debt/Other_Issuers/Non_Eligible/' WHEN 
(((`underlying_type` = 'SECURITISATION') AND ((`instrument_cqs_st` <= 3) OR 
((`instrument_cqs_st` = 7) AND (`instrument_cqs_lt` <= 3 AND 
(`residual_maturity_instrument` <= 12.0D)) THEN 
'/Securitisation/Eligible/res_mat_1Y/' WHEN (((`underlying_type` = 
'SECURITISATION') AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) 
AND (`instrument_cqs_lt` <= 3 AND (`residual_maturity_instrument` <= 
60.0D)) THEN '/Securitisation/Eligible/res_mat_5Y/' WHEN ((`underlying_type` = 
'SECURITISATION') AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) 
AND (`instrument_cqs_lt` <= 3 THEN '/Securitisation/Eligible/res_mat_G5/' 
WHEN (`underlying_type` = 'SECURITISATION') THEN 
'/Securitisation/Non_Eligible/' WHEN ((`underlying_type` IN ('EQUITY', 
'MAIN_INDEX_EQUITY', 'COMMODITY', 'NON_ELIGIBLE_SECURITY')) AND 
(`underlying_type` = 'MAIN_INDEX_EQUITY')) THEN '/Other_Securities/Main_index/' 
WHEN (`underlying_type` IN ('EQUITY', 'MAIN_INDEX_EQUITY', 'COMMODITY', 
'NON_ELIGIBLE_SECURITY')) THEN '/Other_Securities/Others/' WHEN 
(`underlying_type` = 'CASH') THEN '/Cash/' WHEN (`underlying_type` = 'GOLD') 
THEN '/Gold/' WHEN (`underlying_type` = 'CIU') THEN '/CIU/' WHEN true THEN 
'/Others/' END AS 
`108_0___Portfolio_CRD4_Art_224_Volatility_Adjustments_Codespath_CRD4_Art_224_Volatil`


was (Author: igreenfi):
the code that creates this plan is very complex. 
I will try to reproduce it in simple code in the meanwhile I can attach the 
generated code so you can see the problem is that the code does not create 
functions and inline all the Plan into the processNext method. 
[^generated_code.txt]  

> proccesNext() failed to compile size is over 64kb
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: generated_code.txt
>
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCo

[jira] [Commented] (SPARK-25094) proccesNext() failed to compile size is over 64kb

2018-08-13 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578127#comment-16578127
 ] 

Izek Greenfield commented on SPARK-25094:
-

The code that creates this plan is very complex. 
I will try to reproduce it with simple code; in the meanwhile I am attaching the 
generated code, so you can see that the problem is that the generated code does not 
split the work into functions and instead inlines the whole plan into the 
processNext method. 
[^generated_code.txt]  

> proccesNext() failed to compile size is over 64kb
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: generated_code.txt
>
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25094) proccesNext() failed to compile size is over 64kb

2018-08-13 Thread Izek Greenfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-25094:

Attachment: generated_code.txt

> proccesNext() failed to compile size is over 64kb
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: generated_code.txt
>
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25094) proccesNext() failed to compile size is over 64kb

2018-08-13 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578037#comment-16578037
 ] 

Izek Greenfield commented on SPARK-25094:
-

[~hyukjin.kwon] Is the full plan OK?

> proccesNext() failed to compile size is over 64kb
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25094) proccesNext() failed to compile size is over 64kb

2018-08-12 Thread Izek Greenfield (JIRA)
Izek Greenfield created SPARK-25094:
---

 Summary: proccesNext() failed to compile size is over 64kb
 Key: SPARK-25094
 URL: https://issues.apache.org/jira/browse/SPARK-25094
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Izek Greenfield


I have this tree:

2018-08-12T07:14:31,289 WARN  [] 
org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
disabled for plan (id=1):
 *(1) Project [, ... 10 more fields]
+- *(1) Filter NOT exposure_calc_method#10141 IN 
(UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
   +- InMemoryTableScan [, ... 11 more fields], [NOT exposure_calc_method#10141 
IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
 +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
deserialized, 1 replicas)
   +- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
  :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
  :  +- Exchange(coordinator id: 1456511137) 
UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
67108864]
  : +- *(1) Project [, ... 6 more fields]
  :+- *(1) Filter (isnotnull(v#49) && 
isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
&& (v#49 = DATA_REG)) && isnotnull(unique_id#39))
  :   +- InMemoryTableScan [, ... 6 more fields], [, 
... 6 more fields]
  : +- InMemoryRelation [, ... 6 more fields], 
StorageLevel(memory, deserialized, 1 replicas)
  :   +- *(1) FileScan csv [,... 6 more 
fields] , ... 6 more fields
  +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
 +- Exchange(coordinator id: 1456511137) 
UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
67108864]
+- *(3) Project [, ... 74 more fields]
   +- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
<=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
  +- InMemoryTableScan [, ... 74 more fields], [, 
... 4 more fields]
+- InMemoryRelation [, ... 74 more fields], 
StorageLevel(memory, deserialized, 1 replicas)
  +- *(1) FileScan csv [,... 74 more 
fields] , ... 6 more fields

Compiling "GeneratedClass": Code of method "processNext()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
 grows beyond 64 KB

and the generated code failed to compile.
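
As a side note (not part of the original report): the warning above shows Spark already falling back for this plan; whole-stage code generation can also be turned off up front so the oversized method is never generated, e.g.:

{code:scala}
// Fall back to the interpreted (non-codegen) execution path for this session.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}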



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25093) CodeFormatter could avoid creating regex object again and again

2018-08-11 Thread Izek Greenfield (JIRA)
Izek Greenfield created SPARK-25093:
---

 Summary: CodeFormatter could avoid creating regex object again and 
again
 Key: SPARK-25093
 URL: https://issues.apache.org/jira/browse/SPARK-25093
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Izek Greenfield


In class `CodeFormatter`, the method `stripExtraNewLinesAndComments` could be refactored to:

{code:scala}
// Compile the regexes once, when the class is initialized, instead of on every call.
val commentReg =
  ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" + // strip /*comment*/
   """([ |\t]*?\/\/[\s\S]*?\n)""").r            // strip //comment
val emptyRowsReg = """\n\s*\n""".r

def stripExtraNewLinesAndComments(input: String): String = {
  val codeWithoutComment = commentReg.replaceAllIn(input, "")
  emptyRowsReg.replaceAllIn(codeWithoutComment, "\n") // strip extra new lines
}
{code}

so the regexes would be compiled only once.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25028) AnalyzePartitionCommand failed with NPE if value is null

2018-08-08 Thread Izek Greenfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-25028:

Description: 
on line 143: val partitionColumnValues = 
partitionColumns.indices.map(r.get(_).toString)

If the value is NULL the code will fail with NPE

*sample:*

{code:scala}
val df = List((1, null , "first"), (2, null , "second")).toDF("index", "name", 
"value").withColumn("name", $"name".cast("string"))
df.write.partitionBy("name").saveAsTable("df13")
spark.sql("ANALYZE TABLE df13 PARTITION (name) COMPUTE STATISTICS")
{code}

output:

2018-08-08 09:25:43 WARN  BaseSessionStateBuilder$$anon$1:66 - Max iterations 
(2) reached for batch Resolution
java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143)
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.Range.foreach(Range.scala:160)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:143)
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:142)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand.calculateRowCountsPerPartition(AnalyzePartitionCommand.scala:142)
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand.run(AnalyzePartitionCommand.scala:104)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
  at org.apache.spark.sql.Dataset.(Dataset.scala:190)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
  ... 49 elided
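
A null-safe variant of the quoted line could look like the sketch below (illustrative only, not the actual fix; `partitionColumns` and `r` come from the code quoted above, and the default partition name is an assumption):

{code:scala}
val partitionColumnValues = partitionColumns.indices.map { i =>
  // Avoid calling toString on null; map NULL partition values to Hive's
  // default partition name instead (assumed constant, shown literally here).
  if (r.isNullAt(i)) "__HIVE_DEFAULT_PARTITION__" else r.get(i).toString
}
{code}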

  was:
on line 143: val partitionColumnValues = 
partitionColumns.indices.map(r.get(_).toString)

If the value is NULL the code will fail with NPE

sample:

val df = List((1, null , "first"), (2, null , "second")).toDF("index", "name", 
"value").withColumn("name", $"name".cast("string"))
df.write.partitionBy("name").saveAsTable("df13")
spark.sql("ANALYZE TABLE df13 PARTITION (name) COMPUTE STATISTICS")


output:

2018-08-08 09:25:43 WARN  BaseSessionStateBuilder$$anon$1:66 - Max iterations 
(2) reached for batch Resolution
java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143)
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.Range.foreach(Range.scala:160)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:143)
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:142)
  at 
scala.collecti

[jira] [Reopened] (SPARK-25028) AnalyzePartitionCommand failed with NPE if value is null

2018-08-08 Thread Izek Greenfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield reopened SPARK-25028:
-

Add sample code

> AnalyzePartitionCommand failed with NPE if value is null
> 
>
> Key: SPARK-25028
> URL: https://issues.apache.org/jira/browse/SPARK-25028
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
>
> on line 143: val partitionColumnValues = 
> partitionColumns.indices.map(r.get(_).toString)
> If the value is NULL the code will fail with NPE
> sample:
> val df = List((1, null , "first"), (2, null , "second")).toDF("index", 
> "name", "value").withColumn("name", $"name".cast("string"))
> df.write.partitionBy("name").saveAsTable("df13")
> spark.sql("ANALYZE TABLE df13 PARTITION (name) COMPUTE STATISTICS")
> output:
> 2018-08-08 09:25:43 WARN  BaseSessionStateBuilder$$anon$1:66 - Max iterations 
> (2) reached for batch Resolution
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.Range.foreach(Range.scala:160)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:142)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand.calculateRowCountsPerPartition(AnalyzePartitionCommand.scala:142)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand.run(AnalyzePartitionCommand.scala:104)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
>   ... 49 elided



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25028) AnalyzePartitionCommand failed with NPE if value is null

2018-08-08 Thread Izek Greenfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-25028:

Description: 
on line 143: val partitionColumnValues = 
partitionColumns.indices.map(r.get(_).toString)

If the value is NULL the code will fail with NPE

sample:

val df = List((1, null , "first"), (2, null , "second")).toDF("index", "name", 
"value").withColumn("name", $"name".cast("string"))
df.write.partitionBy("name").saveAsTable("df13")
spark.sql("ANALYZE TABLE df13 PARTITION (name) COMPUTE STATISTICS")


output:

2018-08-08 09:25:43 WARN  BaseSessionStateBuilder$$anon$1:66 - Max iterations 
(2) reached for batch Resolution
java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143)
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.Range.foreach(Range.scala:160)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:143)
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:142)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand.calculateRowCountsPerPartition(AnalyzePartitionCommand.scala:142)
  at 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand.run(AnalyzePartitionCommand.scala:104)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
  at org.apache.spark.sql.Dataset.(Dataset.scala:190)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
  ... 49 elided

  was:
on line 143: val partitionColumnValues = 
partitionColumns.indices.map(r.get(_).toString)

if the value is NULL the code will fail with NPE


> AnalyzePartitionCommand failed with NPE if value is null
> 
>
> Key: SPARK-25028
> URL: https://issues.apache.org/jira/browse/SPARK-25028
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
>
> on line 143: val partitionColumnValues = 
> partitionColumns.indices.map(r.get(_).toString)
> If the value is NULL the code will fail with NPE
> sample:
> val df = List((1, null , "first"), (2, null , "second")).toDF("index", 
> "name", "value").withColumn("name", $"name".cast("string"))
> df.write.partitionBy("name").saveAsTable("df13")
> spark.sql("ANALYZE TABLE df13 PARTITION (name) COMPUTE STATISTICS")
> output:
> 2018-08-08 09:25:43 WARN  BaseSessionStateBuilder$$anon$1:66 - Max iterations 
> (2) reached for batch Resolution
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfu

[jira] [Created] (SPARK-25028) AnalyzePartitionCommand failed with NPE if value is null

2018-08-05 Thread Izek Greenfield (JIRA)
Izek Greenfield created SPARK-25028:
---

 Summary: AnalyzePartitionCommand failed with NPE if value is null
 Key: SPARK-25028
 URL: https://issues.apache.org/jira/browse/SPARK-25028
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Izek Greenfield


on line 143: val partitionColumnValues = 
partitionColumns.indices.map(r.get(_).toString)

if the value is NULL the code will fail with NPE



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13346) Using DataFrames iteratively leads to slow query planning

2018-08-01 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564885#comment-16564885
 ] 

Izek Greenfield commented on SPARK-13346:
-

What is the status of this? We face this issue too!

> Using DataFrames iteratively leads to slow query planning
> -
>
> Key: SPARK-13346
> URL: https://issues.apache.org/jira/browse/SPARK-13346
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> I have an iterative algorithm based on DataFrames, and the query plan grows 
> very quickly with each iteration.  Caching the current DataFrame at the end 
> of an iteration does not fix the problem.  However, converting the DataFrame 
> to an RDD and back at the end of each iteration does fix the problem.
> Printing the query plans shows that the plan explodes quickly (10 lines, to 
> several hundred lines, to several thousand lines, ...) with successive 
> iterations.
> The desired behavior is for the analyzer to recognize that a big chunk of the 
> query plan does not need to be computed since it is already cached.  The 
> computation on each iteration should be the same.
> If useful, I can push (complex) code to reproduce the issue.  But it should 
> be simple to see if you create an iterative algorithm which produces a new 
> DataFrame from an old one on each iteration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24831) Spark Executor class create new serializer each time

2018-07-17 Thread Izek Greenfield (JIRA)
Izek Greenfield created SPARK-24831:
---

 Summary: Spark Executor class create new serializer each time
 Key: SPARK-24831
 URL: https://issues.apache.org/jira/browse/SPARK-24831
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Izek Greenfield


When I look at the code I find this line (org.apache.spark.executor.Executor:319):

 val resultSer = env.serializer.newInstance()

Why not hold this in a ThreadLocal and reuse it?

[question on 
StackOverflow|https://stackoverflow.com/questions/51298877/spark-executor-class-create-new-serializer-each-time]
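
A minimal sketch of the suggestion (illustrative only; `env` is the SparkEnv field already available inside Executor, and the field name is an assumption):

{code:scala}
import org.apache.spark.serializer.SerializerInstance

// Cache one SerializerInstance per thread instead of allocating a new one for
// every task result (SerializerInstance is not thread-safe, hence ThreadLocal).
val cachedResultSer: ThreadLocal[SerializerInstance] =
  new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance = env.serializer.newInstance()
  }

// At the former call site:
// val resultSer = cachedResultSer.get()
{code}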



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-06-27 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524720#comment-16524720
 ] 

Izek Greenfield commented on SPARK-23904:
-

[~RBerenguel] Have you found something? BTW your workaround helps a lot! Thanks 
for that.

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark create the text representation of query in any case even if I don't 
> need it.
> That causes many garbage object and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23904) Big execution plan cause OOM

2018-06-02 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499315#comment-16499315
 ] 

Izek Greenfield edited comment on SPARK-23904 at 6/3/18 6:11 AM:
-

[~RBerenguel] Did you remember to run with the flag `-XX:+UseG1GC`? I think 
this is part of the problem because of the `Humongous objects`.


was (Author: igreenfi):
[~RBerenguel] Did you remember to run with this flag `-XX:+UseG1GC` I think 
think this is part of the problem because of the `homogenous regions`

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark create the text representation of query in any case even if I don't 
> need it.
> That causes many garbage object and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-06-02 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499315#comment-16499315
 ] 

Izek Greenfield commented on SPARK-23904:
-

[~RBerenguel] Did you remember to run with the flag `-XX:+UseG1GC`? I think 
this is part of the problem because of the `homogenous regions`.

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark create the text representation of query in any case even if I don't 
> need it.
> That causes many garbage object and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23904) Big execution plan cause OOM

2018-05-31 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496272#comment-16496272
 ] 

Izek Greenfield edited comment on SPARK-23904 at 5/31/18 8:43 AM:
--

[~RBerenguel] 
Class: SQLExecution
Method: withNewExecutionId
Line: 73

{code:scala}
sparkSession.sparkContext.listenerBus.post(SparkListenerSQLExecutionStart(
executionId, callSite.shortForm, callSite.longForm, 
queryExecution.toString,
SparkPlanInfo.fromSparkPlan(queryExecution.executedPlan), 
System.currentTimeMillis()))
{code}
 


was (Author: igreenfi):
[~RBerenguel] 
Class: SQLExecution
Method: withNewExecutionId
Line: 73

{code:scala}
sparkSession.sparkContext.listenerBus.post(SparkListenerSQLExecutionStart(
executionId, callSite.shortForm, callSite.longForm, 
"queryExecution.toString",
SparkPlanInfo.fromSparkPlan(queryExecution.executedPlan), 
System.currentTimeMillis()))
{code}
 

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark create the text representation of query in any case even if I don't 
> need it.
> That causes many garbage object and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-05-31 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496272#comment-16496272
 ] 

Izek Greenfield commented on SPARK-23904:
-

[~RBerenguel] 
Class: SQLExecution
Method: withNewExecutionId
Line: 73

{code:scala}
sparkSession.sparkContext.listenerBus.post(SparkListenerSQLExecutionStart(
executionId, callSite.shortForm, callSite.longForm, 
"queryExecution.toString",
SparkPlanInfo.fromSparkPlan(queryExecution.executedPlan), 
System.currentTimeMillis()))
{code}
 

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark create the text representation of query in any case even if I don't 
> need it.
> That causes many garbage object and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-05-29 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494701#comment-16494701
 ] 

Izek Greenfield commented on SPARK-23904:
-

[~RBerenguel] What do you mean by `setting completeString to no-op`? After I commented out 
the generation of the string in the code, I no longer got the OOM.
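
For illustration, a hedged sketch of the kind of local patch described above (an assumption 
about what "commenting out the generation of the string" looks like, not the change that was 
merged into Spark): post a placeholder instead of the eagerly rendered plan text.

{code:scala}
// Hypothetical local patch in SQLExecution.withNewExecutionId:
sparkSession.sparkContext.listenerBus.post(SparkListenerSQLExecutionStart(
  executionId, callSite.shortForm, callSite.longForm,
  "<plan description omitted>", // skips building queryExecution.toString
  SparkPlanInfo.fromSparkPlan(queryExecution.executedPlan),
  System.currentTimeMillis()))
{code}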

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark create the text representation of query in any case even if I don't 
> need it.
> That causes many garbage object and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-05-29 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493323#comment-16493323
 ] 

Izek Greenfield commented on SPARK-23904:
-

[~RBerenguel] Did you manage to reproduce it? On my side it has become a major issue: I 
compiled version 2.3 with the generation of the string commented out and got a very big 
performance boost.

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark create the text representation of query in any case even if I don't 
> need it.
> That causes many garbage object and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-05-13 Thread Izek Greenfield (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473827#comment-16473827
 ] 

Izek Greenfield commented on SPARK-23904:
-

[~RBerenguel] Any update on that? 

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark create the text representation of query in any case even if I don't 
> need it.
> That causes many garbage object and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-04-14 Thread Izek Greenfield (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438591#comment-16438591
 ] 

Izek Greenfield commented on SPARK-23904:
-

[~viirya] That does not help, because Spark creates this string when it notifies the bus...

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark create the text representation of query in any case even if I don't 
> need it.
> That causes many garbage object and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23904) Big execution plan cause OOM

2018-04-12 Thread Izek Greenfield (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-23904:

Description: 
I created a question on 
[StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big].

Spark creates the text representation of the query in every case, even if I don't need it.

That creates many garbage objects and unneeded GC... 

 [Gist with code to 
reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]

 

  was:
I create a question in 
[StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
 

Spark create the text representation of query in any case even if I don't need 
it.

That causes many garbage object and unneeded GC... 


> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I create a question in 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark create the text representation of query in any case even if I don't 
> need it.
> That causes many garbage object and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23904) Big execution plan cause OOM

2018-04-08 Thread Izek Greenfield (JIRA)
Izek Greenfield created SPARK-23904:
---

 Summary: Big execution plan cause OOM
 Key: SPARK-23904
 URL: https://issues.apache.org/jira/browse/SPARK-23904
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Izek Greenfield


I created a question on 
[StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big].

Spark creates the text representation of the query in every case, even if I don't need it.

That creates many garbage objects and unneeded GC... 
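
A minimal sketch of how a deeply nested query inflates the plan's text representation (an 
illustrative assumption, not the linked gist; it assumes an existing SparkSession named 
`spark`):

{code:scala}
// Each withColumn widens and deepens the logical plan, so the rendered
// plan text grows quickly with the number of iterations.
val wide = (1 to 200).foldLeft(spark.range(10).toDF("c")) { (df, i) =>
  df.withColumn(s"c$i", df("c") + i)
}
println(wide.queryExecution.toString.length) // the large string that is built eagerly
{code}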



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org