[jira] [Commented] (SPARK-32755) Maintain the order of expressions in AttributeSet and ExpressionSet

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192667#comment-17192667
 ] 

Apache Spark commented on SPARK-32755:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29689

> Maintain the order of expressions in AttributeSet and ExpressionSet 
> 
>
> Key: SPARK-32755
> URL: https://issues.apache.org/jira/browse/SPARK-32755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Major
> Fix For: 3.1.0
>
>
> Expression identity is based on the ExprId, which is an auto-incremented 
> number. This means that the same query can yield query plans with different 
> expression ids in different runs. AttributeSet and ExpressionSet internally 
> use a HashSet as the underlying data structure and therefore cannot 
> guarantee a fixed order of operations in different runs. This can be 
> problematic in cases where we want to check for plan changes across runs.
> We make the following changes to AttributeSet and ExpressionSet to 
> maintain the insertion order of the elements:
>  * We change the underlying data structure of AttributeSet from HashSet to 
> LinkedHashSet to maintain the insertion order.
>  * ExpressionSet already uses a list to keep track of the expressions; 
> however, since it extends Scala's immutable.Set class, operations such 
> as map and flatMap are delegated to the immutable.Set itself. This means that 
> the result of these operations is no longer an instance of ExpressionSet; 
> rather, it is an implementation picked up by the parent class. We therefore remove 
> this inheritance from immutable.Set and implement the needed methods 
> directly. ExpressionSet has very specific semantics, and it does not make 
> sense to extend immutable.Set anyway.
>  * We change the PlanStabilitySuite to not sort the attributes, so that it can 
> catch changes in the order of expressions across runs.
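The core of the change is easy to illustrate outside Spark. Below is a minimal Scala sketch (not Spark's actual AttributeSet code; the object name and element strings are illustrative) of why LinkedHashSet yields a deterministic, insertion-ordered iteration while a HashSet's order depends on hashing:

{code:scala}
import scala.collection.mutable

object InsertionOrderDemo {
  def main(args: Array[String]): Unit = {
    val elems = Seq("c#3", "a#1", "b#2")

    val hashed = mutable.HashSet.empty[String]
    val linked = mutable.LinkedHashSet.empty[String]
    elems.foreach { e => hashed += e; linked += e }

    // Iteration order of a HashSet depends on the elements' hash values.
    println(hashed.mkString(", "))
    // A LinkedHashSet always iterates in insertion order: c#3, a#1, b#2
    println(linked.mkString(", "))
  }
}
{code}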



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32755) Maintain the order of expressions in AttributeSet and ExpressionSet

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192665#comment-17192665
 ] 

Apache Spark commented on SPARK-32755:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29689

> Maintain the order of expressions in AttributeSet and ExpressionSet 
> 
>
> Key: SPARK-32755
> URL: https://issues.apache.org/jira/browse/SPARK-32755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Major
> Fix For: 3.1.0
>
>
> Expression identity is based on the ExprId, which is an auto-incremented 
> number. This means that the same query can yield query plans with different 
> expression ids in different runs. AttributeSet and ExpressionSet internally 
> use a HashSet as the underlying data structure and therefore cannot 
> guarantee a fixed order of operations in different runs. This can be 
> problematic in cases where we want to check for plan changes across runs.
> We make the following changes to AttributeSet and ExpressionSet to 
> maintain the insertion order of the elements:
>  * We change the underlying data structure of AttributeSet from HashSet to 
> LinkedHashSet to maintain the insertion order.
>  * ExpressionSet already uses a list to keep track of the expressions; 
> however, since it extends Scala's immutable.Set class, operations such 
> as map and flatMap are delegated to the immutable.Set itself. This means that 
> the result of these operations is no longer an instance of ExpressionSet; 
> rather, it is an implementation picked up by the parent class. We therefore remove 
> this inheritance from immutable.Set and implement the needed methods 
> directly. ExpressionSet has very specific semantics, and it does not make 
> sense to extend immutable.Set anyway.
>  * We change the PlanStabilitySuite to not sort the attributes, so that it can 
> catch changes in the order of expressions across runs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192628#comment-17192628
 ] 

Hyukjin Kwon commented on SPARK-32810:
--

Thanks [~dongjoon] for fixing it here and in other JIRAs.

> CSV/JSON data sources should avoid globbing paths when inferring schema
> ---
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.7, 3.1.0, 3.0.2
>
>
> The problem is that when the user doesn't specify a schema when reading a 
> CSV table, the CSV file format and data source need to infer the schema, and they 
> do so by creating a base DataSource relation, and there's a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, but 
> *DataSource.paths* expects file paths as glob patterns.
>  An example is demonstrated below:
> {code:java}
> ^
> | DataSource.resolveRelation    tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> | DataSource.apply              ^
> |   CSVDataSource.inferSchema   |
> | CSVFileFormat.inferSchema     |
> |   ...                         |
> |   DataSource.resolveRelation  globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |   DataSource.apply            ^
> | DataFrameReader.load          |
> |   input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.
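A minimal repro sketch of the mismatch described above (the file path below is illustrative and assumes a CSV file literally named "[abc].csv" exists under /tmp; this is not the fix itself, only an illustration of the two read paths):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object GlobInferenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("glob-inference-sketch")
      .getOrCreate()

    // The user escapes the glob metacharacters once.
    val escapedPath = """/tmp/\[abc\].csv"""

    // With schema inference, the already-resolved literal path may be globbed a
    // second time, as described in the call-stack diagram above.
    // spark.read.csv(escapedPath).show()

    // With an explicit schema there is no inference step, so the path only goes
    // through DataSource's globbing once.
    val schema = StructType(Seq(StructField("value", StringType)))
    spark.read.schema(schema).csv(escapedPath).show()

    spark.stop()
  }
}
{code}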



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32810:
--
Affects Version/s: 2.4.6
   3.0.0
   3.0.1

> CSV/JSON data sources should avoid globbing paths when inferring schema
> ---
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.7, 3.1.0, 3.0.2
>
>
> The problem is that when the user doesn't specify a schema when reading a 
> CSV table, the CSV file format and data source need to infer the schema, and they 
> do so by creating a base DataSource relation, and there's a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, but 
> *DataSource.paths* expects file paths as glob patterns.
>  An example is demonstrated below:
> {code:java}
> ^
> | DataSource.resolveRelation    tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> | DataSource.apply              ^
> |   CSVDataSource.inferSchema   |
> | CSVFileFormat.inferSchema     |
> |   ...                         |
> |   DataSource.resolveRelation  globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |   DataSource.apply            ^
> | DataFrameReader.load          |
> |   input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32827) Add spark.sql.maxMetadataStringLength config

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192621#comment-17192621
 ] 

Apache Spark commented on SPARK-32827:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29688

> Add spark.sql.maxMetadataStringLength config
> 
>
> Key: SPARK-32827
> URL: https://issues.apache.org/jira/browse/SPARK-32827
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> Add a new config `spark.sql.maxMetadataStringLength`. This config aims to 
> limit the length of metadata values, e.g. the file location.
> We found that metadata had been abbreviated with `...` when trying to add some tests 
> in `SQLQueryTestSuite`. As a result, we could not replace the location value with 
> `className`, since the `className` itself had been abbreviated.
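A rough usage sketch, assuming a spark-shell session where `spark` is the active SparkSession; the value and the query are illustrative, and the exact semantics of the config are defined by the pull request:

{code:scala}
// Raise the limit so long values such as file locations are not cut off with "..."
spark.conf.set("spark.sql.maxMetadataStringLength", "1000")

// Plans printed afterwards keep more of the metadata strings, e.g. scan locations.
spark.sql("EXPLAIN EXTENDED SELECT * FROM some_table").show(truncate = false)
{code}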



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32827) Add spark.sql.maxMetadataStringLength config

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192620#comment-17192620
 ] 

Apache Spark commented on SPARK-32827:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29688

> Add spark.sql.maxMetadataStringLength config
> 
>
> Key: SPARK-32827
> URL: https://issues.apache.org/jira/browse/SPARK-32827
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> Add a new config `spark.sql.maxMetadataStringLength`. This config aims to 
> limit the length of metadata values, e.g. the file location.
> We found that metadata had been abbreviated with `...` when trying to add some tests 
> in `SQLQueryTestSuite`. As a result, we could not replace the location value with 
> `className`, since the `className` itself had been abbreviated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32827) Add spark.sql.maxMetadataStringLength config

2020-09-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32827:


Assignee: Apache Spark

> Add spark.sql.maxMetadataStringLength config
> 
>
> Key: SPARK-32827
> URL: https://issues.apache.org/jira/browse/SPARK-32827
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>
> Add a new config `spark.sql.maxMetadataStringLength`. This config aims to 
> limit the length of metadata values, e.g. the file location.
> We found that metadata had been abbreviated with `...` when trying to add some tests 
> in `SQLQueryTestSuite`. As a result, we could not replace the location value with 
> `className`, since the `className` itself had been abbreviated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32827) Add spark.sql.maxMetadataStringLength config

2020-09-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32827:


Assignee: (was: Apache Spark)

> Add spark.sql.maxMetadataStringLength config
> 
>
> Key: SPARK-32827
> URL: https://issues.apache.org/jira/browse/SPARK-32827
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> Add a new config `spark.sql.maxMetadataStringLength`. This config aims to 
> limit the length of metadata values, e.g. the file location.
> We found that metadata had been abbreviated with `...` when trying to add some tests 
> in `SQLQueryTestSuite`. As a result, we could not replace the location value with 
> `className`, since the `className` itself had been abbreviated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32827) Add spark.sql.maxMetadataStringLength config

2020-09-08 Thread ulysses you (Jira)
ulysses you created SPARK-32827:
---

 Summary: Add spark.sql.maxMetadataStringLength config
 Key: SPARK-32827
 URL: https://issues.apache.org/jira/browse/SPARK-32827
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: ulysses you


Add a new config `spark.sql.maxMetadataStringLength`. This config aims to limit 
the length of metadata values, e.g. the file location.

We found that metadata had been abbreviated with `...` when trying to add some tests 
in `SQLQueryTestSuite`. As a result, we could not replace the location value with 
`className`, since the `className` itself had been abbreviated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-09-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192607#comment-17192607
 ] 

Hyukjin Kwon commented on SPARK-32187:
--

Hey [~fhoering] are you back now :-)?

> User Guide - Shipping Python Package
> 
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Fabian Höring
>Priority: Major
>
> - Zipped file
> - Python files
> - Virtualenv with Yarn
> - PEX \(?\) (see also SPARK-25433)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32826) Add test case for get null columns using SparkGetColumnsOperation

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192603#comment-17192603
 ] 

Apache Spark commented on SPARK-32826:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29687

> Add test case for get null columns using SparkGetColumnsOperation
> -
>
> Key: SPARK-32826
> URL: https://issues.apache.org/jira/browse/SPARK-32826
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> In Spark 3.0.0, SparkGetColumnsOperation could not recognize NULL columns, 
> but now it can, as a side effect of 
> https://issues.apache.org/jira/browse/SPARK-32696



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32826) Add test case for get null columns using SparkGetColumnsOperation

2020-09-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32826:


Assignee: (was: Apache Spark)

> Add test case for get null columns using SparkGetColumnsOperation
> -
>
> Key: SPARK-32826
> URL: https://issues.apache.org/jira/browse/SPARK-32826
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> In Spark 3.0.0, SparkGetColumnsOperation could not recognize NULL columns, 
> but now it can, as a side effect of 
> https://issues.apache.org/jira/browse/SPARK-32696



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32826) Add test case for get null columns using SparkGetColumnsOperation

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192602#comment-17192602
 ] 

Apache Spark commented on SPARK-32826:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29687

> Add test case for get null columns using SparkGetColumnsOperation
> -
>
> Key: SPARK-32826
> URL: https://issues.apache.org/jira/browse/SPARK-32826
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> In Spark 3.0.0, SparkGetColumnsOperation could not recognize NULL columns, 
> but now it can, as a side effect of 
> https://issues.apache.org/jira/browse/SPARK-32696



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32826) Add test case for get null columns using SparkGetColumnsOperation

2020-09-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32826:


Assignee: Apache Spark

> Add test case for get null columns using SparkGetColumnsOperation
> -
>
> Key: SPARK-32826
> URL: https://issues.apache.org/jira/browse/SPARK-32826
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Minor
>
> In Spark 3.0.0, SparkGetColumnsOperation could not recognize NULL columns, 
> but now it can, as a side effect of 
> https://issues.apache.org/jira/browse/SPARK-32696



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32826) Add test case for get null columns using SparkGetColumnsOperation

2020-09-08 Thread Kent Yao (Jira)
Kent Yao created SPARK-32826:


 Summary: Add test case for get null columns using 
SparkGetColumnsOperation
 Key: SPARK-32826
 URL: https://issues.apache.org/jira/browse/SPARK-32826
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kent Yao


In Spark 3.0.0, SparkGetColumnsOperation could not recognize NULL columns, but 
now it can, as a side effect of 
https://issues.apache.org/jira/browse/SPARK-32696



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32187) User Guide - Shipping Python Package

2020-09-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32187:
-
Description: 
- Zipped file
- Python files
- Virtualenv with Yarn
- PEX \(?\) (see also SPARK-25433)

  was:
- Zipped file
- Python files
- PEX \(?\) (see also SPARK-25433)


> User Guide - Shipping Python Package
> 
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Fabian Höring
>Priority: Major
>
> - Zipped file
> - Python files
> - Virtualenv with Yarn
> - PEX \(?\) (see also SPARK-25433)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment

2020-09-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32813.
--
Fix Version/s: 3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 29667
[https://github.com/apache/spark/pull/29667]

> Reading parquet rdd in non columnar mode fails in multithreaded environment
> ---
>
> Key: SPARK-32813
> URL: https://issues.apache.org/jira/browse/SPARK-32813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
> Environment: Spark 3.0.0, Scala 2.12.12
>Reporter: Vladimir Klyushnikov
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> Reading a parquet RDD in non-columnar mode (i.e. with list fields) fails if the Spark 
> session was created in one thread and the RDD is read in another, because the 
> InheritableThreadLocal holding the active session is not propagated. The code below 
> worked perfectly in Spark 2.x, but fails in Spark 3:
> {code:scala}
> import java.util.concurrent.Executors
> import org.apache.spark.sql.SparkSession
> import scala.concurrent.{Await, ExecutionContext, Future}
> import scala.concurrent.duration._
> object Main {
>   final case class Data(list: List[Int])
>   def main(args: Array[String]): Unit = {
>     val executor1 = Executors.newSingleThreadExecutor()
>     val executor2 = Executors.newSingleThreadExecutor()
>     try {
>       val ds = Await.result(Future {
>         val session = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
>         import session.implicits._
>         val path = "test.parquet"
>         session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path)
>         session.read.parquet(path).as[Data]
>       }(ExecutionContext.fromExecutorService(executor1)), 1.minute)
>       Await.result(Future {
>         ds.rdd.collect().foreach(println(_))
>       }(ExecutionContext.fromExecutorService(executor2)), 1.minute)
>     } finally {
>       executor1.shutdown()
>       executor2.shutdown()
>     }
>   }
> }
> {code}
> This code fails with the following exception:
> {code}
> Exception in thread "main" java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:529)
>   at scala.None$.get(Option.scala:527)
>   at org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178)
>   at org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176)
>   at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
>   at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198)
>   at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment

2020-09-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32813:


Assignee: L. C. Hsieh

> Reading parquet rdd in non columnar mode fails in multithreaded environment
> ---
>
> Key: SPARK-32813
> URL: https://issues.apache.org/jira/browse/SPARK-32813
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
> Environment: Spark 3.0.0, Scala 2.12.12
>Reporter: Vladimir Klyushnikov
>Assignee: L. C. Hsieh
>Priority: Major
>
> Reading a parquet RDD in non-columnar mode (i.e. with list fields) fails if the Spark 
> session was created in one thread and the RDD is read in another, because the 
> InheritableThreadLocal holding the active session is not propagated. The code below 
> worked perfectly in Spark 2.x, but fails in Spark 3:
> {code:scala}
> import java.util.concurrent.Executors
> import org.apache.spark.sql.SparkSession
> import scala.concurrent.{Await, ExecutionContext, Future}
> import scala.concurrent.duration._
> object Main {
>   final case class Data(list: List[Int])
>   def main(args: Array[String]): Unit = {
>     val executor1 = Executors.newSingleThreadExecutor()
>     val executor2 = Executors.newSingleThreadExecutor()
>     try {
>       val ds = Await.result(Future {
>         val session = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
>         import session.implicits._
>         val path = "test.parquet"
>         session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path)
>         session.read.parquet(path).as[Data]
>       }(ExecutionContext.fromExecutorService(executor1)), 1.minute)
>       Await.result(Future {
>         ds.rdd.collect().foreach(println(_))
>       }(ExecutionContext.fromExecutorService(executor2)), 1.minute)
>     } finally {
>       executor1.shutdown()
>       executor2.shutdown()
>     }
>   }
> }
> {code}
> This code fails with the following exception:
> {code}
> Exception in thread "main" java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:529)
>   at scala.None$.get(Option.scala:527)
>   at org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178)
>   at org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176)
>   at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
>   at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198)
>   at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2020-09-08 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192574#comment-17192574
 ] 

Jungtaek Lim commented on SPARK-24295:
--

[~sta...@gmail.com]

Thanks for sharing the workaround. I've proposed applying a TTL on FileStreamSink 
output, which does something similar to your workaround but purges on every 
compact batch. Unfortunately it hasn't drawn enough interest from committers, 
though.

SPARK-27188 ([https://github.com/apache/spark/pull/28363])

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> 
>
> Key: SPARK-24295
> URL: https://issues.apache.org/jira/browse/SPARK-24295
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Iqbal Singh
>Priority: Major
> Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz
>
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file 
> after a defined compact interval.
> For long-running jobs, the compact file can grow to tens of GBs, causing 
> slowness while reading the data from the FileStreamSinkLog dir, as Spark 
> defaults to the "__spark__metadata" dir for the read.
> We need functionality to purge the compact file data.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32821) cannot group by with window in sql statement for structured streaming with watermark

2020-09-08 Thread Johnny Bai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192573#comment-17192573
 ] 

Johnny Bai commented on SPARK-32821:


Let's discuss a watermark-with-window grammar for SQL statements; I would like 
to implement it.

> cannot group by with window in sql statement for structured streaming with 
> watermark
> 
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Johnny Bai
>Priority: Major
>
> Currently only the DSL style is supported, as below:
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"word").count()
>  
> but group by with window is not supported in SQL style, as below:
> "select ts_field, count(\*) as cnt over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field"
>  
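For reference, a runnable sketch of the DSL form that already works today (the rate stream source and column names are illustrative, and a spark-shell `spark` session is assumed), for contrast with the proposed SQL syntax:

{code:scala}
import org.apache.spark.sql.functions.{col, window}

// Streaming DataFrame with columns: timestamp, word (renamed from the rate source's value)
val words = spark.readStream
  .format("rate")
  .load()
  .withColumnRenamed("value", "word")

// Watermark plus sliding-window grouping, expressed only through the DSL today.
val windowedCounts = words
  .withWatermark("timestamp", "1 minute")
  .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("word"))
  .count()
{code}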



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32788) non-partitioned table scan should not have partition filter

2020-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32788:
--
Affects Version/s: (was: 3.0.0)
   3.0.1

> non-partitioned table scan should not have partition filter
> ---
>
> Key: SPARK-32788
> URL: https://issues.apache.org/jira/browse/SPARK-32788
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32788) non-partitioned table scan should not have partition filter

2020-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32788:
--
Affects Version/s: 3.0.0

> non-partitioned table scan should not have partition filter
> ---
>
> Key: SPARK-32788
> URL: https://issues.apache.org/jira/browse/SPARK-32788
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27089) Loss of precision during decimal division

2020-09-08 Thread Daeho Ro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192566#comment-17192566
 ] 

Daeho Ro commented on SPARK-27089:
--

It seems that the bug persists in Spark 3.0.0.

> Loss of precision during decimal division
> -
>
> Key: SPARK-27089
> URL: https://issues.apache.org/jira/browse/SPARK-27089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: ylo0ztlmtusq
>Priority: Major
>
> Spark loses decimal places when dividing decimal numbers.
>  
> Expected behavior (In Spark 2.2.3 or before)
>  
> {code:java}
> scala> val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val"""
> sql: String = select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val
> scala> spark.sql(sql).show
> 19/03/07 21:23:51 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> ++
> | val|
> ++
> |0.33|
> ++
> {code}
>  
> Current behavior (In Spark 2.3.2 and later)
>  
> {code:java}
> scala> val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val"""
> sql: String = select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val
> scala> spark.sql(sql).show
> ++
> | val|
> ++
> |0.33|
> ++
> {code}
>  
> Seems to be caused by {{promote_precision}} with {{DecimalType(38,6)}}.
>  
> {code:java}
> scala> spark.sql(sql).explain(true)
> == Parsed Logical Plan ==
> Project [cast((cast(3 as decimal(38,14)) / cast(9 as decimal(38,14))) as 
> decimal(38,14)) AS val#20]
> +- OneRowRelation
> == Analyzed Logical Plan ==
> val: decimal(38,14)
> Project [cast(CheckOverflow((promote_precision(cast(cast(3 as decimal(38,14)) 
> as decimal(38,14))) / promote_precision(cast(cast(9 as decimal(38,14)) as 
> decimal(38,14, DecimalType(38,6)) as decimal(38,14)) AS val#20]
> +- OneRowRelation
> == Optimized Logical Plan ==
> Project [0.33 AS val#20]
> +- OneRowRelation
> == Physical Plan ==
> *(1) Project [0.33 AS val#20]
> +- Scan OneRowRelation[]
> {code}
>  
> Source https://stackoverflow.com/q/55046492



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32821) cannot group by with window in sql statement for structured streaming with watermark

2020-09-08 Thread Johnny Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Bai updated SPARK-32821:
---
Summary: cannot group by with window in sql statement for structured 
streaming with watermark  (was: cannot group by with window in sql sentence for 
structured streaming with watermark)

> cannot group by with window in sql statement for structured streaming with 
> watermark
> 
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Johnny Bai
>Priority: Major
>
> Currently only the DSL style is supported, as below:
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"word").count()
>  
> but group by with window is not supported in SQL style, as below:
> "select ts_field, count(\*) as cnt over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field"
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Johnny Bai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192563#comment-17192563
 ] 

Johnny Bai edited comment on SPARK-32821 at 9/9/20, 1:45 AM:
-

[~kabhwan] As Structured Streaming evolves, I think it is necessary to build a 
relatively complete Structured Streaming SQL specification, similar to the ANSI 
SQL standard.


was (Author: johnny bai):
[~kabhwan] as Structured Streaming evolves, I think it is necessary to build a 
relatively complete Structured Streaming SQL specification, similar to the ANSI 
SQL standard.

> cannot group by with window in sql sentence for structured streaming with 
> watermark
> ---
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Johnny Bai
>Priority: Major
>
> Currently only the DSL style is supported, as below:
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"word").count()
>  
> but group by with window is not supported in SQL style, as below:
> "select ts_field, count(\*) as cnt over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field"
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Johnny Bai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192563#comment-17192563
 ] 

Johnny Bai commented on SPARK-32821:


[~kabhwan] As Structured Streaming evolves, I think it is necessary to build a 
relatively complete Structured Streaming SQL specification, similar to the ANSI 
SQL standard.

> cannot group by with window in sql sentence for structured streaming with 
> watermark
> ---
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Johnny Bai
>Priority: Major
>
> Currently only the DSL style is supported, as below:
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"word").count()
>  
> but group by with window is not supported in SQL style, as below:
> "select ts_field, count(\*) as cnt over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field"
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32823) Standalone Master UI resources in use wrong

2020-09-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32823.
--
Fix Version/s: 3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 29683
[https://github.com/apache/spark/pull/29683]

> Standalone Master UI resources in use wrong
> ---
>
> Key: SPARK-32823
> URL: https://issues.apache.org/jira/browse/SPARK-32823
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> I was using the standalone deployment with GPU workers and the master 
> UI was wrong for:
>  * *Resources in use:* 0 / 4 gpu
> In this case I had 2 workers, each with 4 GPUs, so this total should have 
> been 8. It seems like it's just looking at a single worker.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32823) Standalone Master UI resources in use wrong

2020-09-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32823:


Assignee: Thomas Graves

> Standalone Master UI resources in use wrong
> ---
>
> Key: SPARK-32823
> URL: https://issues.apache.org/jira/browse/SPARK-32823
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>
> I was using the standalone deployment with GPU workers and the master 
> UI was wrong for:
>  * *Resources in use:* 0 / 4 gpu
> In this case I had 2 workers, each with 4 GPUs, so this total should have 
> been 8. It seems like it's just looking at a single worker.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32824) The error is confusing when resource .amount not provided

2020-09-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32824.
--
Fix Version/s: 3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 29685
[https://github.com/apache/spark/pull/29685]

> The error is confusing when resource .amount not provided 
> --
>
> Key: SPARK-32824
> URL: https://issues.apache.org/jira/browse/SPARK-32824
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> If the user forgets to specify the .amount suffix when specifying a resource, the 
> error that comes out is confusing; we should improve it.
>  
> $ $SPARK_HOME/bin/spark-shell  --master spark://host9:7077 --conf 
> spark.executor.resource.gpu=1
>  
> {code:java}
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> 20/09/08 08:19:35 ERROR SparkContext: Error initializing SparkContext.
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>   at java.lang.String.substring(String.java:1967)
>   at org.apache.spark.resource.ResourceUtils$.$anonfun$listResourceIds$1(ResourceUtils.scala:151)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>   at org.apache.spark.resource.ResourceUtils$.listResourceIds(ResourceUtils.scala:150)
>   at org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:158)
>   at org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
>   at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2555)
>   at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:930)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
>   at $line3.$read$$iw$$iw.<init>(<console>:15)
>   at $line3.$read$$iw.<init>(<console>:42)
>   at $line3.$read.<init>(<console>:44)
> {code}
> '
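For contrast, a sketch of the intended form of this configuration (the master URL is taken from the report above; the value is illustrative, and the key follows the documented spark.executor.resource.{resourceName}.amount pattern):

{code:scala}
import org.apache.spark.SparkConf

// What the user presumably meant: the amount goes under the ".amount" suffix.
// Omitting the suffix, as in "spark.executor.resource.gpu=1" above, currently
// fails with the confusing StringIndexOutOfBoundsException shown in the log.
val conf = new SparkConf()
  .setMaster("spark://host9:7077")
  .set("spark.executor.resource.gpu.amount", "1")
{code}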



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32824) The error is confusing when resource .amount not provided

2020-09-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32824:


Assignee: Thomas Graves

> The error is confusing when resource .amount not provided 
> --
>
> Key: SPARK-32824
> URL: https://issues.apache.org/jira/browse/SPARK-32824
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>
> If the user forgets to specify the .amount suffix when specifying a resource, the 
> error that comes out is confusing; we should improve it.
>  
> $ $SPARK_HOME/bin/spark-shell  --master spark://host9:7077 --conf 
> spark.executor.resource.gpu=1
>  
> {code:java}
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> 20/09/08 08:19:35 ERROR SparkContext: Error initializing SparkContext.
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>   at java.lang.String.substring(String.java:1967)
>   at org.apache.spark.resource.ResourceUtils$.$anonfun$listResourceIds$1(ResourceUtils.scala:151)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>   at org.apache.spark.resource.ResourceUtils$.listResourceIds(ResourceUtils.scala:150)
>   at org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:158)
>   at org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
>   at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2555)
>   at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:930)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
>   at $line3.$read$$iw$$iw.<init>(<console>:15)
>   at $line3.$read$$iw.<init>(<console>:42)
>   at $line3.$read.<init>(<console>:44)
> {code}
> '



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32638) WidenSetOperationTypes in subquery attribute missing

2020-09-08 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32638:
-
Fix Version/s: 3.0.2

> WidenSetOperationTypes in subquery  attribute  missing
> --
>
> Key: SPARK-32638
> URL: https://issues.apache.org/jira/browse/SPARK-32638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Guojian Li
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> I am migrating SQL from MySQL to Spark SQL and have met a very strange case. Below 
> is code to reproduce the exception:
>  
> {code:java}
> import java.util
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
> val spark = SparkSession.builder()
>   .master("local")
>   .appName("Word Count")
>   .getOrCreate()
> spark.sparkContext.setLogLevel("TRACE")
> val DecimalType = DataTypes.createDecimalType(20, 2)
> val schema = StructType(List(
>   StructField("a", DecimalType, true)
> ))
> val dataList = new util.ArrayList[Row]()
> val df = spark.createDataFrame(dataList, schema)
> df.printSchema()
> df.createTempView("test")
> val sql =
>   """
>     |SELECT t.kpi_04 FROM
>     |(
>     | SELECT a as `kpi_04` FROM test
>     | UNION ALL
>     | SELECT a+a as `kpi_04` FROM test
>     |) t
>     |
>   """.stripMargin
> spark.sql(sql)
> {code}
>  
> Exception Message:
>  
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved 
> attribute(s) kpi_04#2 missing from kpi_04#4 in operator !Project [kpi_04#2]. 
> Attribute(s) with the same name appear in the operation: kpi_04. Please check 
> if the right attribute(s) are used.;;
> !Project [kpi_04#2]
> +- SubqueryAlias t
>  +- Union
>  :- Project [cast(kpi_04#2 as decimal(21,2)) AS kpi_04#4]
>  : +- Project [a#0 AS kpi_04#2]
>  : +- SubqueryAlias test
>  : +- LocalRelation , [a#0]
>  +- Project [kpi_04#3]
>  +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + 
> promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS 
> kpi_04#3]
>  +- SubqueryAlias test
>  +- LocalRelation , [a#0]{code}
>  
>  
> Based on the trace log, it seems the WidenSetOperationTypes rule adds a new outer Project 
> layer. This causes the parent query to lose its reference to the subquery.
>  
>  
> {code:java}
>  
> === Applying Rule 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$WidenSetOperationTypes ===
> !'Project [kpi_04#2] !Project [kpi_04#2]
> !+- 'SubqueryAlias t +- SubqueryAlias t
> ! +- 'Union +- Union
> ! :- Project [a#0 AS kpi_04#2] :- Project [cast(kpi_04#2 as decimal(21,2)) AS 
> kpi_04#4]
> ! : +- SubqueryAlias test : +- Project [a#0 AS kpi_04#2]
> ! : +- LocalRelation , [a#0] : +- SubqueryAlias test
> ! +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + 
> promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS 
> kpi_04#3] : +- LocalRelation , [a#0]
> ! +- SubqueryAlias test +- Project [kpi_04#3]
> ! +- LocalRelation , [a#0] +- Project 
> [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + 
> promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS 
> kpi_04#3]
> ! +- SubqueryAlias test
> ! +- LocalRelation , [a#0]
> {code}
>  
> In the source code (WidenSetOperationTypes.scala) this is intended behavior, 
> but it possibly misses this edge case.
> I hope someone can help me fix it.
>  
>  
> {code:java}
> if (targetTypes.nonEmpty) {
>  // Add an extra Project if the targetTypes are different from the original 
> types.
>  children.map(widenTypes(_, targetTypes))
> } else {
>  // Unable to find a target type to widen, then just return the original set.
>  children
> }{code}
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32812) Run tests script for Python fails in certain environments

2020-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32812:
--
Fix Version/s: (was: 2.4.8)
   2.4.7

> Run tests script for Python fails in certain environments
> -
>
> Key: SPARK-32812
> URL: https://issues.apache.org/jira/browse/SPARK-32812
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 2.4.7, 3.1.0, 3.0.2
>
>
> When running the PySpark tests in the local environment with the "python/run-tests" 
> command, the following error can occur.
>  {code}
> Traceback (most recent call last):
>  File "<string>", line 1, in <module>
> ...
> raise RuntimeError('''
> RuntimeError:
>  An attempt has been made to start a new process before the
>  current process has finished its bootstrapping phase.
> This probably means that you are not using fork to start your
>  child processes and you have forgotten to use the proper idiom
>  in the main module:
> if __name__ == '__main__':
>  freeze_support()
>  ...
> The "freeze_support()" line can be omitted if the program
>  is not going to be frozen to produce an executable.
> Traceback (most recent call last):
> ...
>  raise EOFError
> EOFError
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32810:
--
Fix Version/s: (was: 2.4.8)
   2.4.7

> CSV/JSON data sources should avoid globbing paths when inferring schema
> ---
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.7, 3.1.0, 3.0.2
>
>
> The problem is that when the user doesn't specify a schema when reading a 
> CSV table, the CSV file format and data source need to infer the schema, and they 
> do so by creating a base DataSource relation, and there's a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, but 
> *DataSource.paths* expects file paths as glob patterns.
>  An example is demonstrated below:
> {code:java}
> ^
> | DataSource.resolveRelation    tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> | DataSource.apply              ^
> |   CSVDataSource.inferSchema   |
> | CSVFileFormat.inferSchema     |
> |   ...                         |
> |   DataSource.resolveRelation  globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |   DataSource.apply            ^
> | DataFrameReader.load          |
> |   input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.
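>  
> A minimal sketch of how to hit this (the path is illustrative): with a file literally named [abc].csv, escape the brackets when reading so the first glob resolves to that file; schema inference then globs the already-resolved path a second time.
> {code:java}
> // assumes a spark-shell session where `spark` is predefined and /tmp/[abc].csv exists
> val df = spark.read.csv("""/tmp/\[abc\].csv""")
> df.printSchema()
> {code}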



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32810:
-

Assignee: Maxim Gekk  (was: Apache Spark)

> CSV/JSON data sources should avoid globbing paths when inferring schema
> ---
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.8, 3.1.0, 3.0.2
>
>
> The problem is that when the user doesn't specify the schema when reading a 
> CSV table, the CSV file format and data source need to infer the schema; they 
> do so by creating a base DataSource relation, and there's a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, but 
> *DataSource.paths* expects file paths given as glob patterns.
>  An example is demonstrated below:
> {code:java}
> ^
> | DataSource.resolveRelation          tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> |   DataSource.apply                  ^
> |     CSVDataSource.inferSchema       |
> |       CSVFileFormat.inferSchema     |
> |         ...                         |
> |           DataSource.resolveRelation  globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |             DataSource.apply        ^
> |               DataFrameReader.load  |
> |                 input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32312) Upgrade Apache Arrow to 1.0.0

2020-09-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32312:


Assignee: (was: Apache Spark)

> Upgrade Apache Arrow to 1.0.0
> -
>
> Key: SPARK-32312
> URL: https://issues.apache.org/jira/browse/SPARK-32312
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Apache Arrow will soon release v1.0.0 which provides backward/forward 
> compatibility guarantees as well as a number of fixes and improvements. This 
> will upgrade the Java artifact and PySpark API. Although PySpark will not 
> need special changes, it might be a good idea to bump up minimum supported 
> version and CI testing.
> TBD: list of important improvements and fixes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32312) Upgrade Apache Arrow to 1.0.0

2020-09-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32312:


Assignee: Apache Spark

> Upgrade Apache Arrow to 1.0.0
> -
>
> Key: SPARK-32312
> URL: https://issues.apache.org/jira/browse/SPARK-32312
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Major
>
> Apache Arrow will soon release v1.0.0 which provides backward/forward 
> compatibility guarantees as well as a number of fixes and improvements. This 
> will upgrade the Java artifact and PySpark API. Although PySpark will not 
> need special changes, it might be a good idea to bump up minimum supported 
> version and CI testing.
> TBD: list of important improvements and fixes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32312) Upgrade Apache Arrow to 1.0.0

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192422#comment-17192422
 ] 

Apache Spark commented on SPARK-32312:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/29686

> Upgrade Apache Arrow to 1.0.0
> -
>
> Key: SPARK-32312
> URL: https://issues.apache.org/jira/browse/SPARK-32312
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Apache Arrow will soon release v1.0.0 which provides backward/forward 
> compatibility guarantees as well as a number of fixes and improvements. This 
> will upgrade the Java artifact and PySpark API. Although PySpark will not 
> need special changes, it might be a good idea to bump up minimum supported 
> version and CI testing.
> TBD: list of important improvements and fixes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32824) The error is confusing when resource .amount not provided

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192394#comment-17192394
 ] 

Apache Spark commented on SPARK-32824:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/29685

> The error is confusing when resource .amount not provided 
> --
>
> Key: SPARK-32824
> URL: https://issues.apache.org/jira/browse/SPARK-32824
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> If the user forgets to specify the .amount when specifying a resource, the 
> error that comes out is confusing; we should improve it.
>  
> $ $SPARK_HOME/bin/spark-shell  --master spark://host9:7077 --conf 
> spark.executor.resource.gpu=1
>  
> {code:java}
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.propertiesSetting default log level to 
> "WARN".To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).20/09/08 08:19:35 ERROR SparkContext: Error 
> initializing SparkContext.java.lang.StringIndexOutOfBoundsException: String 
> index out of range: -1 at java.lang.String.substring(String.java:1967) at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$listResourceIds$1(ResourceUtils.scala:151)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at 
> org.apache.spark.resource.ResourceUtils$.listResourceIds(ResourceUtils.scala:150)
>  at 
> org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:158)
>  at 
> org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
>  at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
>  at org.apache.spark.SparkContext.(SparkContext.scala:528) at 
> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2555) at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:930)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) 
> at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106) at 
> $line3.$read$$iw$$iw.(:15) at 
> $line3.$read$$iw.(:42) at 
> $line3.$read.(:44){code}
> '



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32824) The error is confusing when resource .amount not provided

2020-09-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32824:


Assignee: Apache Spark

> The error is confusing when resource .amount not provided 
> --
>
> Key: SPARK-32824
> URL: https://issues.apache.org/jira/browse/SPARK-32824
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Major
>
> If the user forgets to specify the .amount when specifying a resource, the 
> error that comes out is confusing; we should improve it.
>  
> $ $SPARK_HOME/bin/spark-shell  --master spark://host9:7077 --conf 
> spark.executor.resource.gpu=1
>  
> {code:java}
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.propertiesSetting default log level to 
> "WARN".To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).20/09/08 08:19:35 ERROR SparkContext: Error 
> initializing SparkContext.java.lang.StringIndexOutOfBoundsException: String 
> index out of range: -1 at java.lang.String.substring(String.java:1967) at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$listResourceIds$1(ResourceUtils.scala:151)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at 
> org.apache.spark.resource.ResourceUtils$.listResourceIds(ResourceUtils.scala:150)
>  at 
> org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:158)
>  at 
> org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
>  at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
>  at org.apache.spark.SparkContext.(SparkContext.scala:528) at 
> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2555) at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:930)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) 
> at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106) at 
> $line3.$read$$iw$$iw.(:15) at 
> $line3.$read$$iw.(:42) at 
> $line3.$read.(:44){code}
> '



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32824) The error is confusing when resource .amount not provided

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192393#comment-17192393
 ] 

Apache Spark commented on SPARK-32824:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/29685

> The error is confusing when resource .amount not provided 
> --
>
> Key: SPARK-32824
> URL: https://issues.apache.org/jira/browse/SPARK-32824
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> If the user forgets to specify the .amount when specifying a resource, the 
> error that comes out is confusing; we should improve it.
>  
> $ $SPARK_HOME/bin/spark-shell  --master spark://host9:7077 --conf 
> spark.executor.resource.gpu=1
>  
> {code:java}
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.propertiesSetting default log level to 
> "WARN".To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).20/09/08 08:19:35 ERROR SparkContext: Error 
> initializing SparkContext.java.lang.StringIndexOutOfBoundsException: String 
> index out of range: -1 at java.lang.String.substring(String.java:1967) at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$listResourceIds$1(ResourceUtils.scala:151)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at 
> org.apache.spark.resource.ResourceUtils$.listResourceIds(ResourceUtils.scala:150)
>  at 
> org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:158)
>  at 
> org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
>  at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
>  at org.apache.spark.SparkContext.(SparkContext.scala:528) at 
> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2555) at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:930)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) 
> at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106) at 
> $line3.$read$$iw$$iw.(:15) at 
> $line3.$read$$iw.(:42) at 
> $line3.$read.(:44){code}
> '



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32824) The error is confusing when resource .amount not provided

2020-09-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32824:


Assignee: (was: Apache Spark)

> The error is confusing when resource .amount not provided 
> --
>
> Key: SPARK-32824
> URL: https://issues.apache.org/jira/browse/SPARK-32824
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> If the user forgets to specify the .amount when specifying a resource, the 
> error that comes out is confusing; we should improve it.
>  
> $ $SPARK_HOME/bin/spark-shell  --master spark://host9:7077 --conf 
> spark.executor.resource.gpu=1
>  
> {code:java}
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.propertiesSetting default log level to 
> "WARN".To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).20/09/08 08:19:35 ERROR SparkContext: Error 
> initializing SparkContext.java.lang.StringIndexOutOfBoundsException: String 
> index out of range: -1 at java.lang.String.substring(String.java:1967) at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$listResourceIds$1(ResourceUtils.scala:151)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at 
> org.apache.spark.resource.ResourceUtils$.listResourceIds(ResourceUtils.scala:150)
>  at 
> org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:158)
>  at 
> org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
>  at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
>  at org.apache.spark.SparkContext.(SparkContext.scala:528) at 
> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2555) at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:930)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) 
> at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106) at 
> $line3.$read$$iw$$iw.(:15) at 
> $line3.$read$$iw.(:42) at 
> $line3.$read.(:44){code}
> '



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32825) CTE support on MSSQL

2020-09-08 Thread Ankit Sinha (Jira)
Ankit Sinha created SPARK-32825:
---

 Summary: CTE support on MSSQL
 Key: SPARK-32825
 URL: https://issues.apache.org/jira/browse/SPARK-32825
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.6
Reporter: Ankit Sinha


The issue, along with the stack trace, is described in detail 
[here|https://github.com/microsoft/mssql-jdbc/issues/1340]. In summary, the 
WITH CTE clause does not work with MSSQL. This may be because WITH CTE is not 
supported in the FROM clause of MSSQL.
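
A minimal sketch of the failing pattern (the table, column names, and JDBC URL below are placeholders, not taken from an actual job): the JDBC reader ends up placing the statement inside a FROM clause as a subquery, which appears to be where a leading WITH ... CTE gets rejected by SQL Server.

{code:java}
// assumes a spark-shell session where `spark` is predefined and a reachable SQL Server instance
val cteQuery =
  """WITH recent AS (SELECT id, amount FROM dbo.orders)
    |SELECT * FROM recent""".stripMargin

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://example-host:1433;databaseName=exampledb") // placeholder URL
  .option("query", cteQuery) // gets wrapped as a subquery in a FROM clause before being sent to the server
  .load()
{code}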



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192343#comment-17192343
 ] 

Apache Spark commented on SPARK-32810:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29684

> CSV/JSON data sources should avoid globbing paths when inferring schema
> ---
>
> Key: SPARK-32810
> URL: https://issues.apache.org/jira/browse/SPARK-32810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.8, 3.1.0, 3.0.2
>
>
> The problem is that when the user doesn't specify the schema when reading a 
> CSV table, the CSV file format and data source need to infer the schema; they 
> do so by creating a base DataSource relation, and there's a mismatch: 
> *FileFormat.inferSchema* expects actual file paths without glob patterns, but 
> *DataSource.paths* expects file paths given as glob patterns.
>  An example is demonstrated below:
> {code:java}
> ^
> | DataSource.resolveRelation          tries to glob again (incorrectly) on glob pattern """[abc].csv"""
> |   DataSource.apply                  ^
> |     CSVDataSource.inferSchema       |
> |       CSVFileFormat.inferSchema     |
> |         ...                         |
> |           DataSource.resolveRelation  globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern
> |             DataSource.apply        ^
> |               DataFrameReader.load  |
> |                 input """\[abc\].csv"""
> {code}
> The same problem exists in the JSON data source as well. Ditto for MLlib's 
> LibSVM data source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32135) Show Spark Driver name on Spark history web page

2020-09-08 Thread Gaurangi Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurangi Saxena updated SPARK-32135:

Description: 
Our service dynamically creates short-lived YARN clusters in cloud. Spark 
applications run on these dynamically created clusters. Log data for these 
applications is stored on a remote file-system. We configure a static instance 
of SparkHistoryServer to view information on jobs that ran on these clusters.

Since we are using a single History Server for multiple clusters, it will be 
useful for users to have information on where the jobs were executed. We would 
like to display this information on the main web-page, instead of having users 
go through multiple links to retrieve this information.

  was:We would like to see spark driver host on the history server web page


> Show Spark Driver name on Spark history web page
> 
>
> Key: SPARK-32135
> URL: https://issues.apache.org/jira/browse/SPARK-32135
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Priority: Minor
> Attachments: image-2020-09-02-12-37-55-860.png
>
>
> Our service dynamically creates short-lived YARN clusters in cloud. Spark 
> applications run on these dynamically created clusters. Log data for these 
> applications is stored on a remote file-system. We configure a static 
> instance of SparkHistoryServer to view information on jobs that ran on these 
> clusters.
> Since we are using a single History Server for multiple clusters, it will be 
> useful for users to have information on where the jobs were executed. We 
> would like to display this information on the main web-page, instead of 
> having users go through multiple links to retrieve this information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32097) Allow reading history log files from multiple directories

2020-09-08 Thread Gaurangi Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurangi Saxena updated SPARK-32097:

Description: Our service dynamically creates short-lived YARN clusters in 
cloud. Spark applications run on these dynamically created clusters. Log data 
for these applications is stored on a remote file-system. We want a static 
instance of SparkHistoryServer to view information on jobs that ran on these 
clusters. We use glob because we cannot have a static list of directories where 
the log files reside.   (was: We would like to configure SparkHistoryServer to 
display applications from multiple clusters/environments. Data displayed on 
this UI comes from directory configured as log-directory. It would be nice if 
this log-directory also accepted regex. This way we will be able to read and 
display applications from multiple directories.

 )

> Allow reading history log files from multiple directories
> -
>
> Key: SPARK-32097
> URL: https://issues.apache.org/jira/browse/SPARK-32097
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Gaurangi Saxena
>Priority: Minor
>
> Our service dynamically creates short-lived YARN clusters in cloud. Spark 
> applications run on these dynamically created clusters. Log data for these 
> applications is stored on a remote file-system. We want a static instance of 
> SparkHistoryServer to view information on jobs that ran on these clusters. We 
> use glob because we cannot have a static list of directories where the log 
> files reside. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32823) Standalone Master UI resources in use wrong

2020-09-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32823:


Assignee: Apache Spark

> Standalone Master UI resources in use wrong
> ---
>
> Key: SPARK-32823
> URL: https://issues.apache.org/jira/browse/SPARK-32823
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Major
>
> I was using the standalone deployment with workers with GPUs and the master 
> ui was wrong for:
>  * *Resources in use:* 0 / 4 gpu
> In this case I had 2 workers, each with 4 gpus, so this total should have 
> been 8. It seems like it's just looking at a single worker.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32823) Standalone Master UI resources in use wrong

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192295#comment-17192295
 ] 

Apache Spark commented on SPARK-32823:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/29683

> Standalone Master UI resources in use wrong
> ---
>
> Key: SPARK-32823
> URL: https://issues.apache.org/jira/browse/SPARK-32823
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was using the standalone deployment with workers with GPUs and the master 
> ui was wrong for:
>  * *Resources in use:* 0 / 4 gpu
> In this case I had 2 workers, each with 4 gpus, so this total should have 
> been 8. It seems like it's just looking at a single worker.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32823) Standalone Master UI resources in use wrong

2020-09-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32823:


Assignee: (was: Apache Spark)

> Standalone Master UI resources in use wrong
> ---
>
> Key: SPARK-32823
> URL: https://issues.apache.org/jira/browse/SPARK-32823
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was using the standalone deployment with workers with GPUs and the master 
> ui was wrong for:
>  * *Resources in use:* 0 / 4 gpu
> In this case I had 2 workers, each with 4 gpus, so this total should have 
> been 8. It seems like it's just looking at a single worker.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2020-09-08 Thread Avner Livne (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192280#comment-17192280
 ] 

Avner Livne edited comment on SPARK-24295 at 9/8/20, 3:45 PM:
--

for those looking for a temporary workaround:

run this code before you start the streaming:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.execution.streaming._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

/**
  * regex to find last compact file
  */
val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored
val fs = FileSystem.get(sc.hadoopConfiguration)

/**
  * implicit hadoop RemoteIterator convertor
  */
implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): 
Iterator[T] = {
case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = underlying.hasNext
  override def next: T = underlying.next
}
wrapper(underlying)
  }

/**
  * delete file or folder recursively 
  */
def removePath(dstPath: String, fs: FileSystem): Unit = {
val path = new Path(dstPath)
if (fs.exists(path)) {
println(s"deleting ${dstPath}...")
fs.delete(path, true)
}
}

/**
  * remove json entries older than `days` from compact file
  * preserve `v1` at the head of the file
  * re write the small file back to the original destination
  */
def compact(file: Path, days: Int = 20) = {
val ttl = new java.util.Date().getTime - 
java.util.concurrent.TimeUnit.DAYS.toMillis(days)
val compacted_file = s"/tmp/${file.getName.toString}"
removePath(compacted_file, fs)
val lines = sc.textFile(file.toString)
val reduced_lines = lines.mapPartitions({
p => 
implicit val formats = DefaultFormats
p.collect({
case "v1" => "v1"
case x if { 
parse(x).extract[SinkFileStatus].modificationTime > ttl 
} => x
})
}).coalesce(1)
println(s"removing ${lines.count - reduced_lines.count} lines from 
${file.toString}...")
reduced_lines.saveAsTextFile(compacted_file)
FileUtil.copy(fs, new Path(compacted_file + "/part-00000"), fs, file, 
false, sc.hadoopConfiguration)
removePath(compacted_file, fs)
}

/**
  * get last compacted files if exists
  */
def getLastCompactFile(path: Path) = {
fs.listFiles(path, 
true).toList.sortBy(_.getModificationTime).reverse.collectFirst({
case x if (file_pattern.findFirstMatchIn(x.getPath.toString).isDefined) 
=> 
x.getPath
})
}

val my_folder = "/my/root/spark/structured/streaming/output/folder"
val metadata_folder = new Path(s"$my_folder/_spark_metadata")
getLastCompactFile(metadata_folder).map(x => compact(x, 20))

val df = spark
  .readStream
  .format("kafka") ///. whatever stream you like

df.writeStream
  .trigger(Trigger.ProcessingTime(30))
  .format("parquet")
  .outputMode(OutputMode.Append())
  .option("checkpointLocation", "/my/checkpoint/path")
  .option("path", my_folder)
  .start()
{code}


This example will retain SinkFileStatus entries from the last 20 days and will purge everything else.
I run this code on driver startup, but it can certainly run async in some sidecar cronjob.

Tested on Spark 3.0.0 writing parquet files.



was (Author: sta...@gmail.com):
for those looking for a temporary workaround:

run this code before you start the streaming:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.execution.streaming._

/**
  * regex to find last compact file
  */
val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored
val fs = FileSystem.get(sc.hadoopConfiguration)

/**
  * implicit hadoop RemoteIterator convertor
  */
implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): 
Iterator[T] = {
case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = underlying.hasNext
  override def next: T = underlying.next
}
wrapper(underlying)
  }

/**
  * delete file or folder recursively 
  */
def removePath(dstPath: String, fs: FileSystem): Unit = {
val path = new Path(dstPath)
if (fs.exists(path)) {
println(s"deleting ${dstPath}...")
fs.delete(path, true)
}
}

/**
  * remove json entries older than `days` from compact file
  * preserve `v1` at the head of the file
  * re write the small file back to the original destination
  */
def compact(file: Path, days: Int = 20) = {
val ttl = new java.util.Date().getTime - 
java.util.concurrent.TimeUnit.DAYS.toMillis(days)
val compacted_file = s"/tmp/${fil

[jira] [Comment Edited] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2020-09-08 Thread Avner Livne (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192280#comment-17192280
 ] 

Avner Livne edited comment on SPARK-24295 at 9/8/20, 3:45 PM:
--

for those looking for a temporary workaround:

run this code before you start the streaming:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.execution.streaming._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

/**
  * regex to find last compact file
  */
val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored
val fs = FileSystem.get(sc.hadoopConfiguration)

/**
  * implicit hadoop RemoteIterator convertor
  */
implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): 
Iterator[T] = {
case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = underlying.hasNext
  override def next: T = underlying.next
}
wrapper(underlying)
  }

/**
  * delete file or folder recursively 
  */
def removePath(dstPath: String, fs: FileSystem): Unit = {
val path = new Path(dstPath)
if (fs.exists(path)) {
println(s"deleting ${dstPath}...")
fs.delete(path, true)
}
}

/**
  * remove json entries older than `days` from compact file
  * preserve `v1` at the head of the file
  * re write the small file back to the original destination
  */
def compact(file: Path, days: Int = 20) = {
val ttl = new java.util.Date().getTime - 
java.util.concurrent.TimeUnit.DAYS.toMillis(days)
val compacted_file = s"/tmp/${file.getName.toString}"
removePath(compacted_file, fs)
val lines = sc.textFile(file.toString)
val reduced_lines = lines.mapPartitions({
p => 
implicit val formats = DefaultFormats
p.collect({
case "v1" => "v1"
case x if { 
parse(x).extract[SinkFileStatus].modificationTime > ttl 
} => x
})
}).coalesce(1)
println(s"removing ${lines.count - reduced_lines.count} lines from 
${file.toString}...")
reduced_lines.saveAsTextFile(compacted_file)
FileUtil.copy(fs, new Path(compacted_file + "/part-00000"), fs, file, 
false, sc.hadoopConfiguration)
removePath(compacted_file, fs)
}

/**
  * get last compacted files if exists
  */
def getLastCompactFile(path: Path) = {
fs.listFiles(path, 
true).toList.sortBy(_.getModificationTime).reverse.collectFirst({
case x if (file_pattern.findFirstMatchIn(x.getPath.toString).isDefined) 
=> 
x.getPath
})
}

val my_folder = "/my/root/spark/structured/streaming/output/folder"
val metadata_folder = new Path(s"$my_folder/_spark_metadata")
getLastCompactFile(metadata_folder).map(x => compact(x, 20))

val df = spark
  .readStream
  .format("kafka") ///. whatever stream you like

df.writeStream
  .trigger(Trigger.ProcessingTime(30))
  .format("parquet")
  .outputMode(OutputMode.Append())
  .option("checkpointLocation", "/my/checkpoint/path")
  .option("path", my_folder )
  .start()
{code}


This example will retain SinkFileStatus entries from the last 20 days and will purge everything else.
I run this code on driver startup, but it can certainly run async in some sidecar cronjob.

Tested on Spark 3.0.0 writing parquet files.



was (Author: sta...@gmail.com):
for those looking for a temporary workaround:

run this code before you start the streaming:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.execution.streaming._

/**
  * regex to find last compact file
  */
val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored
val fs = FileSystem.get(sc.hadoopConfiguration)

/**
  * implicit hadoop RemoteIterator convertor
  */
implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): 
Iterator[T] = {
case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = underlying.hasNext
  override def next: T = underlying.next
}
wrapper(underlying)
  }

/**
  * delete file or folder recursively 
  */
def removePath(dstPath: String, fs: FileSystem): Unit = {
val path = new Path(dstPath)
if (fs.exists(path)) {
println(s"deleting ${dstPath}...")
fs.delete(path, true)
}
}

/**
  * remove json entries older than `days` from compact file
  * preserve `v1` at the head of the file
  * re write the small file back to the original destination
  */
def compact(file: Path, days: Int = 20) = {
val ttl = new java.util.Date().getTime - 
java.util.concurrent.TimeUnit.DAYS.toMillis(days)
val compacted_file = s"/tmp/${fi

[jira] [Comment Edited] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2020-09-08 Thread Avner Livne (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192280#comment-17192280
 ] 

Avner Livne edited comment on SPARK-24295 at 9/8/20, 3:44 PM:
--

for those looking for a temporary workaround:

run this code before you start the streaming:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.execution.streaming._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

/**
  * regex to find last compact file
  */
val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored
val fs = FileSystem.get(sc.hadoopConfiguration)

/**
  * implicit hadoop RemoteIterator convertor
  */
implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): 
Iterator[T] = {
case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = underlying.hasNext
  override def next: T = underlying.next
}
wrapper(underlying)
  }

/**
  * delete file or folder recursively 
  */
def removePath(dstPath: String, fs: FileSystem): Unit = {
val path = new Path(dstPath)
if (fs.exists(path)) {
println(s"deleting ${dstPath}...")
fs.delete(path, true)
}
}

/**
  * remove json entries older than `days` from compact file
  * preserve `v1` at the head of the file
  * re write the small file back to the original destination
  */
def compact(file: Path, days: Int = 20) = {
val ttl = new java.util.Date().getTime - 
java.util.concurrent.TimeUnit.DAYS.toMillis(days)
val compacted_file = s"/tmp/${file.getName.toString}"
removePath(compacted_file, fs)
val lines = sc.textFile(file.toString)
val reduced_lines = lines.mapPartitions({
p => 
implicit val formats = DefaultFormats
p.collect({
case "v1" => "v1"
case x if { 
parse(x).extract[SinkFileStatus].modificationTime > ttl 
} => x
})
}).coalesce(1)
println(s"removing ${lines.count - reduced_lines.count} lines from 
${file.toString}...")
reduced_lines.saveAsTextFile(compacted_file)
FileUtil.copy(fs, new Path(compacted_file + "/part-00000"), fs, file, 
false, sc.hadoopConfiguration)
removePath(compacted_file, fs)
}

/**
  * get last compacted files if exists
  */
def getLastCompactFile(path: Path) = {
fs.listFiles(path, 
true).toList.sortBy(_.getModificationTime).reverse.collectFirst({
case x if (file_pattern.findFirstMatchIn(x.getPath.toString).isDefined) 
=> 
x.getPath
})
}

val my_folder = "/my/root/spark/structured/streaming/output/folder"
val metadata_folder = new Path(s"$my_folder/_spark_metadata")
getLastCompactFile(metadata_folder).map(x => compact(x, 20))

val df = spark
  .readStream
  .format("kafka") ///. whatever stream you like

df
  .writeStream
  .trigger(Trigger.ProcessingTime(30))
  .format("parquet")
  .outputMode(OutputMode.Append())
  .option("checkpointLocation", "/my/checkpoint/path")
  .option("path", my_folder )
  .start()
{code}


This example will retain SinkFileStatus entries from the last 20 days and will purge everything else.
I run this code on driver startup, but it can certainly run async in some sidecar cronjob.

Tested on Spark 3.0.0 writing parquet files.



was (Author: sta...@gmail.com):
for those looking for a temporary workaround:

run this code before you start the streaming:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.execution.streaming._

/**
  * regex to find last compact file
  */
val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored
val fs = FileSystem.get(sc.hadoopConfiguration)

/**
  * implicit hadoop RemoteIterator convertor
  */
implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): 
Iterator[T] = {
case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = underlying.hasNext
  override def next: T = underlying.next
}
wrapper(underlying)
  }

/**
  * delete file or folder recursively 
  */
def removePath(dstPath: String, fs: FileSystem): Unit = {
val path = new Path(dstPath)
if (fs.exists(path)) {
println(s"deleting ${dstPath}...")
fs.delete(path, true)
}
}

/**
  * remove json entries older than `days` from compact file
  * preserve `v1` at the head of the file
  * re write the small file back to the original destination
  */
def compact(file: Path, days: Int = 20) = {
val ttl = new java.util.Date().getTime - 
java.util.concurrent.TimeUnit.DAYS.toMillis(days)
val compacted_file = s"/

[jira] [Comment Edited] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2020-09-08 Thread Avner Livne (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192280#comment-17192280
 ] 

Avner Livne edited comment on SPARK-24295 at 9/8/20, 3:42 PM:
--

for those looking for a temporary workaround:

run this code before you start the streaming:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.execution.streaming._

/**
  * regex to find last compact file
  */
val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored
val fs = FileSystem.get(sc.hadoopConfiguration)

/**
  * implicit hadoop RemoteIterator convertor
  */
implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): 
Iterator[T] = {
case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = underlying.hasNext
  override def next: T = underlying.next
}
wrapper(underlying)
  }

/**
  * delete file or folder recursively 
  */
def removePath(dstPath: String, fs: FileSystem): Unit = {
val path = new Path(dstPath)
if (fs.exists(path)) {
println(s"deleting ${dstPath}...")
fs.delete(path, true)
}
}

/**
  * remove json entries older than `days` from compact file
  * preserve `v1` at the head of the file
  * re write the small file back to the original destination
  */
def compact(file: Path, days: Int = 20) = {
val ttl = new java.util.Date().getTime - 
java.util.concurrent.TimeUnit.DAYS.toMillis(days)
val compacted_file = s"/tmp/${file.getName.toString}"
removePath(compacted_file, fs)
val lines = sc.textFile(file.toString)
val reduced_lines = lines.mapPartitions({
p => 
implicit val formats = DefaultFormats
p.collect({
case "v1" => "v1"
case x if { 
parse(x).extract[SinkFileStatus].modificationTime > ttl 
} => x
})
}).coalesce(1)
println(s"removing ${lines.count - reduced_lines.count} lines from 
${file.toString}...")
reduced_lines.saveAsTextFile(compacted_file)
FileUtil.copy(fs, new Path(compacted_file + "/part-00000"), fs, file, 
false, sc.hadoopConfiguration)
removePath(compacted_file, fs)
}

/**
  * get last compacted files if exists
  */
def getLastCompactFile(path: Path) = {
fs.listFiles(path, 
true).toList.sortBy(_.getModificationTime).reverse.collectFirst({
case x if (file_pattern.findFirstMatchIn(x.getPath.toString).isDefined) 
=> 
x.getPath
})
}

val my_folder = "/my/root/spark/structured/streaming/output/folder"
val metadata_folder = new Path(s"$my_folder/_spark_metadata")
getLastCompactFile(metadata_folder).map(x => compact(x, 20))

val df = spark
  .readStream
  .format("kafka") ///. whatever stream you like
{code}


This example will retain SinkFileStatus entries from the last 20 days and will purge everything else.
I run this code on driver startup, but it can certainly run async in some sidecar cronjob.

Tested on Spark 3.0.0 writing parquet files.



was (Author: sta...@gmail.com):
for those looking for a temporary workaround:

run this code before you start the streaming:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.execution.streaming._

/**
  * regex to find last compact file
  */
val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored
val fs = FileSystem.get(sc.hadoopConfiguration)

/**
  * implicit hadoop RemoteIterator convertor
  */
implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): 
Iterator[T] = {
case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = underlying.hasNext
  override def next: T = underlying.next
}
wrapper(underlying)
  }

/**
  * delete file or folder recursively 
  */
def removePath(dstPath: String, fs: FileSystem): Unit = {
val path = new Path(dstPath)
if (fs.exists(path)) {
println(s"deleting ${dstPath}...")
fs.delete(path, true)
}
}

/**
  * remove json entries older than `days` from compact file
  * preserve `v1` at the head of the file
  * re write the small file back to the original destination
  */
def compact(file: Path, days: Int = 20) = {
val ttl = new java.util.Date().getTime - 
java.util.concurrent.TimeUnit.DAYS.toMillis(days)
val compacted_file = s"/tmp/${file.getName.toString}"
removePath(compacted_file, fs)
val lines = sc.textFile(file.toString)
val reduced_lines = lines.mapPartitions({
p => 
implicit val formats = DefaultFormats
p.collect({
 

[jira] [Comment Edited] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2020-09-08 Thread Avner Livne (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192280#comment-17192280
 ] 

Avner Livne edited comment on SPARK-24295 at 9/8/20, 3:42 PM:
--

for those looking for a temporary workaround:

run this code before you start the streaming:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.execution.streaming._

/**
  * regex to find last compact file
  */
val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored
val fs = FileSystem.get(sc.hadoopConfiguration)

/**
  * implicit hadoop RemoteIterator convertor
  */
implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): 
Iterator[T] = {
case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = underlying.hasNext
  override def next: T = underlying.next
}
wrapper(underlying)
  }

/**
  * delete file or folder recursively 
  */
def removePath(dstPath: String, fs: FileSystem): Unit = {
val path = new Path(dstPath)
if (fs.exists(path)) {
println(s"deleting ${dstPath}...")
fs.delete(path, true)
}
}

/**
  * remove json entries older than `days` from compact file
  * preserve `v1` at the head of the file
  * re write the small file back to the original destination
  */
def compact(file: Path, days: Int = 20) = {
val ttl = new java.util.Date().getTime - 
java.util.concurrent.TimeUnit.DAYS.toMillis(days)
val compacted_file = s"/tmp/${file.getName.toString}"
removePath(compacted_file, fs)
val lines = sc.textFile(file.toString)
val reduced_lines = lines.mapPartitions({
p => 
implicit val formats = DefaultFormats
p.collect({
case "v1" => "v1"
case x if { 
parse(x).extract[SinkFileStatus].modificationTime > ttl 
} => x
})
}).coalesce(1)
println(s"removing ${lines.count - reduced_lines.count} lines from 
${file.toString}...")
reduced_lines.saveAsTextFile(compacted_file)
FileUtil.copy(fs, new Path(compacted_file + "/part-00000"), fs, file, 
false, sc.hadoopConfiguration)
removePath(compacted_file, fs)
}

/**
  * get last compacted files if exists
  */
def getLastCompactFile(path: Path) = {
fs.listFiles(path, 
true).toList.sortBy(_.getModificationTime).reverse.collectFirst({
case x if (file_pattern.findFirstMatchIn(x.getPath.toString).isDefined) 
=> 
x.getPath
})
}

val my_folder = "/my/root/spark/structured/streaming/output/folder"
val metadata_folder = new Path(s"$my_folder/_spark_metadata")
getLastCompactFile(metadata_folder).map(x => compact(x, 20))

val df = spark
  .readStream
  .format("kafka") ///. whatever stream you like
{code}


This example will retain SinkFileStatus entries from the last 20 days and will purge everything else.
I run this code on driver startup, but it can certainly run async in some sidecar cronjob.

Tested on Spark 3.0.0 writing parquet files.



was (Author: sta...@gmail.com):
for those looking for a temporary workaround:

run this code before you start the streaming:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.execution.streaming._

/**
  * regex to find last compact file
  */
val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored
val fs = FileSystem.get(sc.hadoopConfiguration)

/**
  * implicit hadoop RemoteIterator convertor
  */
implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): 
Iterator[T] = {
case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = underlying.hasNext
  override def next: T = underlying.next
}
wrapper(underlying)
  }

/**
  * delete file or folder recursively 
  */
def removePath(dstPath: String, fs: FileSystem): Unit = {
val path = new Path(dstPath)
if (fs.exists(path)) {
println(s"deleting ${dstPath}...")
fs.delete(path, true)
}
}

/**
  * remove json entries older than `days` from compact file
  * preserve `v1` at the head of the file
  * re write the small file back to the original destination
  */
def compact(file: Path, days: Int = 20) = {
val ttl = new java.util.Date().getTime - 
java.util.concurrent.TimeUnit.DAYS.toMillis(days)
val compacted_file = s"/tmp/${file.getName.toString}"
removePath(compacted_file, fs)
val lines = sc.textFile(file.toString)
val reduced_lines = lines.mapPartitions({
p => 
implicit val formats = DefaultFormats
p.collect({
 

[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2020-09-08 Thread Avner Livne (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192280#comment-17192280
 ] 

Avner Livne commented on SPARK-24295:
-

for those looking for a temporary workaround:

run this code before you start the streaming:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.execution.streaming._

/**
  * regex to find last compact file
  */
val file_pattern = """(hdfs://.*/_spark_metadata/\d+\.compact)""".r.unanchored
val fs = FileSystem.get(sc.hadoopConfiguration)

/**
  * implicit hadoop RemoteIterator convertor
  */
implicit def convertToScalaIterator[T](underlying: RemoteIterator[T]): 
Iterator[T] = {
case class wrapper(underlying: RemoteIterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = underlying.hasNext
  override def next: T = underlying.next
}
wrapper(underlying)
  }

/**
  * delete file or folder recursively 
  */
def removePath(dstPath: String, fs: FileSystem): Unit = {
val path = new Path(dstPath)
if (fs.exists(path)) {
println(s"deleting ${dstPath}...")
fs.delete(path, true)
}
}

/**
  * remove json entries older than `days` from compact file
  * preserve `v1` at the head of the file
  * re write the small file back to the original destination
  */
def compact(file: Path, days: Int = 20) = {
val ttl = new java.util.Date().getTime - 
java.util.concurrent.TimeUnit.DAYS.toMillis(days)
val compacted_file = s"/tmp/${file.getName.toString}"
removePath(compacted_file, fs)
val lines = sc.textFile(file.toString)
val reduced_lines = lines.mapPartitions({
p => 
implicit val formats = DefaultFormats
p.collect({
case "v1" => "v1"
case x if { 
parse(x).extract[SinkFileStatus].modificationTime > ttl 
} => x
})
}).coalesce(1)
println(s"removing ${lines.count - reduced_lines.count} lines from 
${file.toString}...")
reduced_lines.saveAsTextFile(compacted_file)
FileUtil.copy(fs, new Path(compacted_file + "/part-00000"), fs, file, 
false, sc.hadoopConfiguration)
removePath(compacted_file, fs)
}

/**
  * get last compacted files if exists
  */
def getLastCompactFile(path: Path) = {
fs.listFiles(path, 
true).toList.sortBy(_.getModificationTime).reverse.collectFirst({
case x if (file_pattern.findFirstMatchIn(x.getPath.toString).isDefined) 
=> 
x.getPath
})
}

val my_folder = "/my/root/spark/structured/streaming/output/folder"
val metadata_folder = new Path(s"$my_folder/_spark_metadata")
getLastCompactFile(metadata_folder).map(x => compact(x, 20))

val df1 = spark
  .readStream
  .format("kafka") ///. whatever stream you like
{code}


This example will retain SinkFileStatus entries from the last 20 days and will purge everything else.
I run this code on driver startup, but it can certainly run async in some sidecar cronjob.

Tested on Spark 3.0.0 writing parquet files.


> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> 
>
> Key: SPARK-24295
> URL: https://issues.apache.org/jira/browse/SPARK-24295
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Iqbal Singh
>Priority: Major
> Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz
>
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file 
> after a defined compact interval.
> For long-running jobs, the compact file can grow up to tens of GBs, causing 
> slowness while reading the data from the FileStreamSinkLog dir, as Spark 
> defaults to the "__spark__metadata" dir for the read.
> We need functionality to purge the compact file data.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32824) The error is confusing when resource .amount not provided

2020-09-08 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-32824:
-

 Summary: The error is confusing when resource .amount not provided 
 Key: SPARK-32824
 URL: https://issues.apache.org/jira/browse/SPARK-32824
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Thomas Graves


If the user forgets to specify the .amount when specifying a resource, the 
error that comes out is confusing; we should improve it.

 

$ $SPARK_HOME/bin/spark-shell  --master spark://host9:7077 --conf 
spark.executor.resource.gpu=1

 
{code:java}
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/09/08 08:19:35 ERROR SparkContext: Error initializing SparkContext.
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
  at java.lang.String.substring(String.java:1967)
  at org.apache.spark.resource.ResourceUtils$.$anonfun$listResourceIds$1(ResourceUtils.scala:151)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
  at org.apache.spark.resource.ResourceUtils$.listResourceIds(ResourceUtils.scala:150)
  at org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:158)
  at org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
  at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
  at org.apache.spark.SparkContext.(SparkContext.scala:528)
  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2555)
  at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:930)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
  at $line3.$read$$iw$$iw.(:15)
  at $line3.$read$$iw.(:42)
  at $line3.$read.(:44)
{code}
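For comparison, the conf name the parser expects carries the .amount suffix 
(same illustrative host as above; the GPU discovery-script setup is omitted here):

$ $SPARK_HOME/bin/spark-shell  --master spark://host9:7077 --conf 
spark.executor.resource.gpu.amount=1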



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32823) Standalone Master UI resources in use wrong

2020-09-08 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-32823:
-

 Summary: Standalone Master UI resources in use wrong
 Key: SPARK-32823
 URL: https://issues.apache.org/jira/browse/SPARK-32823
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Thomas Graves


I was using the standalone deployment with workers with GPUs and the Master UI 
was wrong for:
 * *Resources in use:* 0 / 4 gpu

In this case I had 2 workers, each with 4 GPUs, so this total should have been 
8. It seems like it's just looking at a single worker.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32823) Standalone Master UI resources in use wrong

2020-09-08 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192256#comment-17192256
 ] 

Thomas Graves commented on SPARK-32823:
---

I'm looking into this.

> Standalone Master UI resources in use wrong
> ---
>
> Key: SPARK-32823
> URL: https://issues.apache.org/jira/browse/SPARK-32823
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was using the standalone deployment with workers with GPUs and the master 
> ui was wrong for:
>  * *Resources in use:* 0 / 4 gpu
> In this case I had 2 workers, each with 4 gpus, so this total should have 
> been 8.  It seems like its just looking at a single worker.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192252#comment-17192252
 ] 

Apache Spark commented on SPARK-32753:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29682

> Deduplicating and repartitioning the same column create duplicate rows with 
> AQE
> ---
>
> Key: SPARK-32753
> URL: https://issues.apache.org/jira/browse/SPARK-32753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Major
>  Labels: correctness
> Fix For: 3.1.0, 3.0.2
>
>
> To reproduce:
> {code:java}
> spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
> val df = spark.sql("select id from v1 group by id distribute by id") 
> println(df.collect().toArray.mkString(","))
> println(df.queryExecution.executedPlan)
> // With AQE
> [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
> AdaptiveSparkPlan(isFinalPlan=true)
> +- CustomShuffleReader local
>+- ShuffleQueryStage 0
>   +- Exchange hashpartitioning(id#183L, 10), true
>  +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
> +- Union
>:- *(1) Range (0, 10, step=1, splits=2)
>+- *(2) Range (0, 10, step=1, splits=2)
> // Without AQE
> [4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
> *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
> +- Exchange hashpartitioning(id#206L, 10), true
>+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
>   +- Union
>  :- *(1) Range (0, 10, step=1, splits=2)
>  +- *(2) Range (0, 10, step=1, splits=2){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192251#comment-17192251
 ] 

Apache Spark commented on SPARK-32753:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29682

> Deduplicating and repartitioning the same column create duplicate rows with 
> AQE
> ---
>
> Key: SPARK-32753
> URL: https://issues.apache.org/jira/browse/SPARK-32753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Major
>  Labels: correctness
> Fix For: 3.1.0, 3.0.2
>
>
> To reproduce:
> {code:java}
> spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
> val df = spark.sql("select id from v1 group by id distribute by id") 
> println(df.collect().toArray.mkString(","))
> println(df.queryExecution.executedPlan)
> // With AQE
> [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
> AdaptiveSparkPlan(isFinalPlan=true)
> +- CustomShuffleReader local
>+- ShuffleQueryStage 0
>   +- Exchange hashpartitioning(id#183L, 10), true
>  +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
> +- Union
>:- *(1) Range (0, 10, step=1, splits=2)
>+- *(2) Range (0, 10, step=1, splits=2)
> // Without AQE
> [4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
> *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
> +- Exchange hashpartitioning(id#206L, 10), true
>+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
>   +- Union
>  :- *(1) Range (0, 10, step=1, splits=2)
>  +- *(2) Range (0, 10, step=1, splits=2){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32815) Fix LibSVM data source loading error on file paths with glob metacharacters

2020-09-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-32815:

Fix Version/s: 3.0.2
   2.4.8

> Fix LibSVM data source loading error on file paths with glob metacharacters
> ---
>
> Key: SPARK-32815
> URL: https://issues.apache.org/jira/browse/SPARK-32815
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.8, 3.1.0, 3.0.2
>
>
> SPARK-32810 fixed a long standing bug in a few Spark built-in data sources 
> that fails to read files whose names contain glob metacharacters, such as [, 
> ], \{, }, etc.
> CSV and JSON data source on the Spark side were affected. We've also noticed 
> that the LibSVM data source had the same code pattern that leads to the bug, 
> so the fix https://github.com/apache/spark/pull/29659 included a fix for that 
> data source as well, but it did not include a test for the LibSVM data source.
> This ticket tracks adding a test case for LibSVM, similar to the ones for 
> CSV/JSON, to verify whether or not the fix works as intended.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32815) Fix LibSVM data source loading error on file paths with glob metacharacters

2020-09-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32815.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29670
[https://github.com/apache/spark/pull/29670]

> Fix LibSVM data source loading error on file paths with glob metacharacters
> ---
>
> Key: SPARK-32815
> URL: https://issues.apache.org/jira/browse/SPARK-32815
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK-32810 fixed a long standing bug in a few Spark built-in data sources 
> that fails to read files whose names contain glob metacharacters, such as [, 
> ], \{, }, etc.
> CSV and JSON data source on the Spark side were affected. We've also noticed 
> that the LibSVM data source had the same code pattern that leads to the bug, 
> so the fix https://github.com/apache/spark/pull/29659 included a fix for that 
> data source as well, but it did not include a test for the LibSVM data source.
> This ticket tracks adding a test case for LibSVM, similar to the ones for 
> CSV/JSON, to verify whether or not the fix works as intended.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32815) Fix LibSVM data source loading error on file paths with glob metacharacters

2020-09-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32815:
---

Assignee: Maxim Gekk

> Fix LibSVM data source loading error on file paths with glob metacharacters
> ---
>
> Key: SPARK-32815
> URL: https://issues.apache.org/jira/browse/SPARK-32815
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> SPARK-32810 fixed a long standing bug in a few Spark built-in data sources 
> that fails to read files whose names contain glob metacharacters, such as [, 
> ], \{, }, etc.
> CSV and JSON data source on the Spark side were affected. We've also noticed 
> that the LibSVM data source had the same code pattern that leads to the bug, 
> so the fix https://github.com/apache/spark/pull/29659 included a fix for that 
> data source as well, but it did not include a test for the LibSVM data source.
> This ticket tracks adding a test case for LibSVM, similar to the ones for 
> CSV/JSON, to verify whether or not the fix works as intended.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE

2020-09-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-32753:

Fix Version/s: 3.0.2

> Deduplicating and repartitioning the same column create duplicate rows with 
> AQE
> ---
>
> Key: SPARK-32753
> URL: https://issues.apache.org/jira/browse/SPARK-32753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Major
>  Labels: correctness
> Fix For: 3.1.0, 3.0.2
>
>
> To reproduce:
> {code:java}
> spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
> val df = spark.sql("select id from v1 group by id distribute by id") 
> println(df.collect().toArray.mkString(","))
> println(df.queryExecution.executedPlan)
> // With AQE
> [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
> AdaptiveSparkPlan(isFinalPlan=true)
> +- CustomShuffleReader local
>+- ShuffleQueryStage 0
>   +- Exchange hashpartitioning(id#183L, 10), true
>  +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
> +- Union
>:- *(1) Range (0, 10, step=1, splits=2)
>+- *(2) Range (0, 10, step=1, splits=2)
> // Without AQE
> [4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
> *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
> +- Exchange hashpartitioning(id#206L, 10), true
>+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
>   +- Union
>  :- *(1) Range (0, 10, step=1, splits=2)
>  +- *(2) Range (0, 10, step=1, splits=2){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32817) DPP throws error when broadcast side is empty

2020-09-08 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32817:
-
Affects Version/s: (was: 3.0.0)
   3.1.0

> DPP throws error when broadcast side is empty
> -
>
> Key: SPARK-32817
> URL: https://issues.apache.org/jira/browse/SPARK-32817
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> In `SubqueryBroadcastExec.relationFuture`, if the `broadcastRelation` is an 
> `EmptyHashedRelation`, then `broadcastRelation.keys()` will throw 
> `UnsupportedOperationException`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32817) DPP throws error when broadcast side is empty

2020-09-08 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32817.
--
Fix Version/s: 3.1.0
 Assignee: Zhenhua Wang
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/29671

> DPP throws error when broadcast side is empty
> -
>
> Key: SPARK-32817
> URL: https://issues.apache.org/jira/browse/SPARK-32817
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> In `SubqueryBroadcastExec.relationFuture`, if the `broadcastRelation` is an 
> `EmptyHashedRelation`, then `broadcastRelation.keys()` will throw 
> `UnsupportedOperationException`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32822) Change the number of partitions to zero when a range is empty with WholeStageCodegen disabled or falled back

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192150#comment-17192150
 ] 

Apache Spark commented on SPARK-32822:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29681

> Change the number of partitions to zero when a range is empty with 
> WholeStageCodegen disabled or falled back
> 
>
> Key: SPARK-32822
> URL: https://issues.apache.org/jira/browse/SPARK-32822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> If WholeStageCodegen effects, the number of partitions of an empty range will 
> be changed to zero. But it doesn't changed when WholeStageCodegen is disabled 
> or falled back.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32822) Change the number of partitions to zero when a range is empty with WholeStageCodegen disabled or falled back

2020-09-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32822:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Change the number of partitions to zero when a range is empty with 
> WholeStageCodegen disabled or falled back
> 
>
> Key: SPARK-32822
> URL: https://issues.apache.org/jira/browse/SPARK-32822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> If WholeStageCodegen effects, the number of partitions of an empty range will 
> be changed to zero. But it doesn't changed when WholeStageCodegen is disabled 
> or falled back.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32822) Change the number of partitions to zero when a range is empty with WholeStageCodegen disabled or falled back

2020-09-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32822:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Change the number of partitions to zero when a range is empty with 
> WholeStageCodegen disabled or falled back
> 
>
> Key: SPARK-32822
> URL: https://issues.apache.org/jira/browse/SPARK-32822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> If WholeStageCodegen effects, the number of partitions of an empty range will 
> be changed to zero. But it doesn't changed when WholeStageCodegen is disabled 
> or falled back.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec

2020-09-08 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32748.
--
Resolution: Won't Fix

> Support local property propagation in SubqueryBroadcastExec
> ---
>
> Key: SPARK-32748
> URL: https://issues.apache.org/jira/browse/SPARK-32748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
>
> Since SPARK-22590, local property propagation is supported through 
> `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and 
> `SubqueryExec` when computing `relationFuture`.
> The propagation is missed in `SubqueryBroadcastExec`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec

2020-09-08 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro reopened SPARK-32748:
--

> Support local property propagation in SubqueryBroadcastExec
> ---
>
> Key: SPARK-32748
> URL: https://issues.apache.org/jira/browse/SPARK-32748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
>
> Since SPARK-22590, local property propagation is supported through 
> `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and 
> `SubqueryExec` when computing `relationFuture`.
> The propagation is missed in `SubqueryBroadcastExec`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec

2020-09-08 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32748:
-
Fix Version/s: (was: 3.1.0)

> Support local property propagation in SubqueryBroadcastExec
> ---
>
> Key: SPARK-32748
> URL: https://issues.apache.org/jira/browse/SPARK-32748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
>
> Since SPARK-22590, local property propagation is supported through 
> `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and 
> `SubqueryExec` when computing `relationFuture`.
> The propagation is missed in `SubqueryBroadcastExec`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32822) Change the number of partitions to zero when a range is empty with WholeStageCodegen disabled or falled back

2020-09-08 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32822:
---
Summary: Change the number of partitions to zero when a range is empty with 
WholeStageCodegen disabled or falled back  (was: Change the number of partition 
to zero when a range is empty with WholeStageCodegen disabled or falled back)

> Change the number of partitions to zero when a range is empty with 
> WholeStageCodegen disabled or falled back
> 
>
> Key: SPARK-32822
> URL: https://issues.apache.org/jira/browse/SPARK-32822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> If WholeStageCodegen effects, the number of partition of an empty range will 
> be changed to zero. But it doesn't changed when WholeStageCodegen is disabled 
> or falled back.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32822) Change the number of partitions to zero when a range is empty with WholeStageCodegen disabled or falled back

2020-09-08 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32822:
---
Description: If WholeStageCodegen effects, the number of partitions of an 
empty range will be changed to zero. But it doesn't changed when 
WholeStageCodegen is disabled or falled back.  (was: If WholeStageCodegen 
effects, the number of partition of an empty range will be changed to zero. But 
it doesn't changed when WholeStageCodegen is disabled or falled back.)

> Change the number of partitions to zero when a range is empty with 
> WholeStageCodegen disabled or falled back
> 
>
> Key: SPARK-32822
> URL: https://issues.apache.org/jira/browse/SPARK-32822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> If WholeStageCodegen effects, the number of partitions of an empty range will 
> be changed to zero. But it doesn't changed when WholeStageCodegen is disabled 
> or falled back.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32822) Change the number of partition to zero when a range is empty with WholeStageCodegen disabled or falled back

2020-09-08 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-32822:
--

 Summary: Change the number of partition to zero when a range is 
empty with WholeStageCodegen disabled or falled back
 Key: SPARK-32822
 URL: https://issues.apache.org/jira/browse/SPARK-32822
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


If WholeStageCodegen effects, the number of partition of an empty range will be 
changed to zero. But it doesn't changed when WholeStageCodegen is disabled or 
falled back.
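
A quick way to observe the behavior described above (a rough sketch; the 
expected counts are exactly what this ticket is about, so treat the comments as 
illustrative):

{code:java}
// With whole-stage codegen enabled, an empty range is planned with zero
// partitions; with it disabled (or fallen back), the count is left unchanged.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
val empty = spark.range(1, 1)        // an empty range
println(empty.rdd.getNumPartitions)  // currently non-zero when codegen is off
{code}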



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192138#comment-17192138
 ] 

Jungtaek Lim commented on SPARK-32821:
--

Let's leave the fix version field empty - the field will be filled when the 
issue is resolved with a specific patch.

It's challenging work to bind all Dataset APIs to SQL statements, especially for 
Structured Streaming (SS), given that streaming SQL doesn't have a standard.
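
For what it's worth, a rough sketch of a workaround that mixes the two styles 
today - unverified across versions, and the watermark still has to come from the 
Dataset API (`words` here is the streaming DataFrame from the description below):

{code:java}
// Set the watermark on the Dataset, register a view, then group by the
// built-in window() function in plain SQL.
val events = words.withWatermark("timestamp", "1 minute")
events.createOrReplaceTempView("tableX")
val windowedCounts = spark.sql("""
  SELECT window(timestamp, '1 minute') AS w, word, count(*) AS cnt
  FROM tableX
  GROUP BY window(timestamp, '1 minute'), word
""")
{code}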

> cannot group by with window in sql sentence for structured streaming with 
> watermark
> ---
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Johnny Bai
>Priority: Major
>
> current only support dsl style as below: 
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
>  
> but not support group by with window in sql style as below:
> "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 
> minute') with watermark 1 minute from tableX group by ts_field"
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-32821:
-
Fix Version/s: (was: 3.0.1)

> cannot group by with window in sql sentence for structured streaming with 
> watermark
> ---
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Johnny Bai
>Priority: Major
>  Labels: 2.1.0
>
> current only support dsl style as below: 
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
>  
> but not support group by with window in sql style as below:
> "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 
> minute') with watermark 1 minute from tableX group by ts_field"
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-32821:
-
Labels:   (was: 2.1.0)

> cannot group by with window in sql sentence for structured streaming with 
> watermark
> ---
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Johnny Bai
>Priority: Major
>
> current only support dsl style as below: 
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
>  
> but not support group by with window in sql style as below:
> "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 
> minute') with watermark 1 minute from tableX group by ts_field"
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Johnny Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Bai updated SPARK-32821:
---
Description: 
current only support dsl style as below: 

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: 
String }

// Group the data by window and word and compute the count of each group

val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
minutes"),$"word").count()

 
but not support group by with window in sql style as below:

"select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 minute') 
with watermark 1 minute from tableX group by ts_field"

 

  was:
 

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: 
String }

// Group the data by window and word and compute the count of each group

val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
minutes"),$"word").count()

 
but not support group by with window in sql style as below:

"select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 minute') 
with watermark 1 minute from tableX group by ts_field"

 


> cannot group by with window in sql sentence for structured streaming with 
> watermark
> ---
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Johnny Bai
>Priority: Major
>  Labels: 2.1.0
> Fix For: 3.0.1
>
>
> current only support dsl style as below: 
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
>  
> but not support group by with window in sql style as below:
> "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 
> minute') with watermark 1 minute from tableX group by ts_field"
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Johnny Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Bai updated SPARK-32821:
---
Affects Version/s: 2.2.0
   2.3.0
   2.4.0

> cannot group by with window in sql sentence for structured streaming with 
> watermark
> ---
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Johnny Bai
>Priority: Major
>  Labels: 2.1.0
> Fix For: 3.0.1
>
>
>  
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
>  
> but not support group by with window in sql style as below:
> "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 
> minute') with watermark 1 minute from tableX group by ts_field"
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Johnny Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Bai updated SPARK-32821:
---
Labels: 2.1.0  (was: )

> cannot group by with window in sql sentence for structured streaming with 
> watermark
> ---
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Johnny Bai
>Priority: Major
>  Labels: 2.1.0
> Fix For: 3.0.1
>
>
>  
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
>  
> but not support group by with window in sql style as below:
> "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 
> minute') with watermark 1 minute from tableX group by ts_field"
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Johnny Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Bai updated SPARK-32821:
---
Affects Version/s: (was: 3.0.0)
   2.1.0

> cannot group by with window in sql sentence for structured streaming with 
> watermark
> ---
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Johnny Bai
>Priority: Major
> Fix For: 3.0.1
>
>
>  
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
>  
> but not support group by with window in sql style as below:
> "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 
> minute') with watermark 1 minute from tableX group by ts_field"
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Johnny Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Bai updated SPARK-32821:
---
Target Version/s:   (was: 2.4.3)

> cannot group by with window in sql sentence for structured streaming with 
> watermark
> ---
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Johnny Bai
>Priority: Major
> Fix For: 3.0.1
>
>
>  
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
>  
> but not support group by with window in sql style as below:
> "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 
> minute') with watermark 1 minute from tableX group by ts_field"
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32811) Replace IN predicate of continuous range with boundary checks

2020-09-08 Thread Vu Ho (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vu Ho updated SPARK-32811:
--
Description: 
This expression 
{code:java}
select a from t where a in (1, 2, 3, 3, 4){code}
can be translated to 
{code:java}
select a from t where a >= 1 and a <= 4 {code}
This would speed up parquet row group filter (currently or(or(or(or(or(eq(x, 
1), eq(x, 2)), eq(x, 3), eq(x, 4. and make query more compact

 

  was:
This expression 
{code:java}
select a from t where a in (1, 2, 3, 3, 4){code}
should be translated to 
{code:java}
select a from t where a >= 1 and a <= 4 {code}


> Replace IN predicate of continuous range with boundary checks
> -
>
> Key: SPARK-32811
> URL: https://issues.apache.org/jira/browse/SPARK-32811
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Vu Ho
>Priority: Minor
>
> This expression 
> {code:java}
> select a from t where a in (1, 2, 3, 3, 4){code}
> can be translated to 
> {code:java}
> select a from t where a >= 1 and a <= 4 {code}
> This would speed up parquet row group filter (currently or(or(or(or(or(eq(x, 
> 1), eq(x, 2)), eq(x, 3), eq(x, 4. and make query more compact
>  
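
As a side note, the rewrite is only safe when the distinct values actually cover 
every point between min and max; a tiny sketch of that contiguity check (not the 
actual Catalyst rule, just an illustration of the condition):

{code:java}
// Hypothetical helper: an IN list of integer literals can be replaced by
// boundary checks only if its distinct values form a contiguous range.
def isContiguous(values: Seq[Int]): Boolean = {
  val sorted = values.distinct.sorted
  sorted.nonEmpty && sorted.last - sorted.head + 1 == sorted.size
}

isContiguous(Seq(1, 2, 3, 3, 4)) // true  -> a >= 1 AND a <= 4
isContiguous(Seq(1, 2, 4))       // false -> keep the IN predicate
{code}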



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Johnny Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Bai updated SPARK-32821:
---
Description: 
 

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: 
String }

// Group the data by window and word and compute the count of each group

val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
minutes"),$"word").count()

 
but not support group by with window in sql style as below:

"select ts_field,count(*) as cnt over window(ts_field, '1 minute', '1 minute') 
with watermark 1 minute from tableX group by ts_field"

 

  was:
 

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: 
String }

// Group the data by window and word and compute the count of each group

val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
minutes"),$"word").count()

 
but not support group by with window in sql style as below:

"select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') with 
watermark 1 minute from tableX group by ts_field"

 


> cannot group by with window in sql sentence for structured streaming with 
> watermark
> ---
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Johnny Bai
>Priority: Major
> Fix For: 3.0.1
>
>
>  
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
>  
> but not support group by with window in sql style as below:
> "select ts_field,count(*) as cnt over window(ts_field, '1 minute', '1 
> minute') with watermark 1 minute from tableX group by ts_field"
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Johnny Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Bai updated SPARK-32821:
---
Description: 
 

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: 
String }

// Group the data by window and word and compute the count of each group

val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
minutes"),$"word").count()

 
but not support group by with window in sql style as below:

"select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 minute') 
with watermark 1 minute from tableX group by ts_field"

 

  was:
 

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: 
String }

// Group the data by window and word and compute the count of each group

val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
minutes"),$"word").count()

 
but not support group by with window in sql style as below:

"select ts_field,count(*) as cnt over window(ts_field, '1 minute', '1 minute') 
with watermark 1 minute from tableX group by ts_field"

 


> cannot group by with window in sql sentence for structured streaming with 
> watermark
> ---
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Johnny Bai
>Priority: Major
> Fix For: 3.0.1
>
>
>  
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
>  
> but not support group by with window in sql style as below:
> "select ts_field,count(\*) as cnt over window(ts_field, '1 minute', '1 
> minute') with watermark 1 minute from tableX group by ts_field"
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Johnny Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Bai updated SPARK-32821:
---
Description: 
 

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: 
String }

// Group the data by window and word and compute the count of each group

val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
minutes"),$"word").count()

 
but not support group by with window in sql style as below:

"select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') with 
watermark 1 minute from tableX group by ts_field"

 

  was:
 

 

import spark.implicits._

val words = ... // streaming DataFrame of schema \{ timestamp: Timestamp, 
word: String }

// Group the data by window and word and compute the count of each group

val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
minutes"),$"word").count()

 

{{but not support group by with window in sql style as below:}}

{{"select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') with 
watermark 1 minute from tableX group by ts_field"}}




> cannot group by with window in sql sentence for structured streaming with 
> watermark
> ---
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Johnny Bai
>Priority: Major
> Fix For: 3.0.1
>
>
>  
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
>  
> but not support group by with window in sql style as below:
> "select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') with 
> watermark 1 minute from tableX group by ts_field"
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Johnny Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Bai updated SPARK-32821:
---
Description: 
 

 

import spark.implicits._

val words = ... // streaming DataFrame of schema \{ timestamp: Timestamp, 
word: String }

// Group the data by window and word and compute the count of each group

val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
minutes"),$"word").count()

 

{{but not support group by with window in sql style as below:}}

{{"select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') with 
watermark 1 minute from tableX group by ts_field"}}



  was:
 

{{import spark.implicits._}}

{{val words = ... // streaming DataFrame of schema \{ timestamp: Timestamp, 
word: String }}}

{{// Group the data by window and word and compute the count of each group}}

{{val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
minutes"),$"word").count()}}

{{}}

{{but not support group by with window in sql style as below:}}

{{"select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') with 
watermark 1 minute from tableX group by ts_field"}}

{{}}


> cannot group by with window in sql sentence for structured streaming with 
> watermark
> ---
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Johnny Bai
>Priority: Major
> Fix For: 3.0.1
>
>
>  
>  
> import spark.implicits._
> val words = ... // streaming DataFrame of schema \{ timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each 
> group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
>  
> {{but not support group by with window in sql style as below:}}
> {{"select ts_field,count(*) over window(ts_field, '1 minute', '1 minute') 
> with watermark 1 minute from tableX group by ts_field"}}
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32638) WidenSetOperationTypes in subquery attribute missing

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192111#comment-17192111
 ] 

Apache Spark commented on SPARK-32638:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29680

> WidenSetOperationTypes in subquery  attribute  missing
> --
>
> Key: SPARK-32638
> URL: https://issues.apache.org/jira/browse/SPARK-32638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Guojian Li
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.1.0
>
>
> I am migrating SQL from MySQL to Spark SQL and met a very strange case. Below 
> is code to reproduce the exception:
>  
> {code:java}
> val spark = SparkSession.builder()
>  .master("local")
>  .appName("Word Count")
>  .getOrCreate()
> spark.sparkContext.setLogLevel("TRACE")
> val DecimalType = DataTypes.createDecimalType(20, 2)
> val schema = StructType(List(
>  StructField("a", DecimalType, true)
> ))
> val dataList = new util.ArrayList[Row]()
> val df=spark.createDataFrame(dataList,schema)
> df.printSchema()
> df.createTempView("test")
> val sql=
>  """
>  |SELECT t.kpi_04 FROM
>  |(
>  | SELECT a as `kpi_04` FROM test
>  | UNION ALL
>  | SELECT a+a as `kpi_04` FROM test
>  |) t
>  |
>  """.stripMargin
> spark.sql(sql)
> {code}
>  
> Exception Message:
>  
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved 
> attribute(s) kpi_04#2 missing from kpi_04#4 in operator !Project [kpi_04#2]. 
> Attribute(s) with the same name appear in the operation: kpi_04. Please check 
> if the right attribute(s) are used.;;
> !Project [kpi_04#2]
> +- SubqueryAlias t
>  +- Union
>  :- Project [cast(kpi_04#2 as decimal(21,2)) AS kpi_04#4]
>  : +- Project [a#0 AS kpi_04#2]
>  : +- SubqueryAlias test
>  : +- LocalRelation , [a#0]
>  +- Project [kpi_04#3]
>  +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + 
> promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS 
> kpi_04#3]
>  +- SubqueryAlias test
>  +- LocalRelation , [a#0]{code}
>  
>  
> Based on the trace log, it seems WidenSetOperationTypes adds a new outer 
> Project layer, which causes the parent query to lose the reference to the subquery. 
>  
>  
> {code:java}
>  
> === Applying Rule 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$WidenSetOperationTypes ===
> !'Project [kpi_04#2] !Project [kpi_04#2]
> !+- 'SubqueryAlias t +- SubqueryAlias t
> ! +- 'Union +- Union
> ! :- Project [a#0 AS kpi_04#2] :- Project [cast(kpi_04#2 as decimal(21,2)) AS 
> kpi_04#4]
> ! : +- SubqueryAlias test : +- Project [a#0 AS kpi_04#2]
> ! : +- LocalRelation , [a#0] : +- SubqueryAlias test
> ! +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + 
> promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS 
> kpi_04#3] : +- LocalRelation , [a#0]
> ! +- SubqueryAlias test +- Project [kpi_04#3]
> ! +- LocalRelation , [a#0] +- Project 
> [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + 
> promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS 
> kpi_04#3]
> ! +- SubqueryAlias test
> ! +- LocalRelation , [a#0]
> {code}
>  
>   In the source code (WidenSetOperationTypes.scala) this is intended behavior, 
> but it possibly misses this edge case. 
> I hope someone can help me out to fix it.
>  
>  
> {code:java}
> if (targetTypes.nonEmpty) {
>  // Add an extra Project if the targetTypes are different from the original 
> types.
>  children.map(widenTypes(_, targetTypes))
> } else {
>  // Unable to find a target type to widen, then just return the original set.
>  children
> }{code}
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32638) WidenSetOperationTypes in subquery attribute missing

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192110#comment-17192110
 ] 

Apache Spark commented on SPARK-32638:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29680

> WidenSetOperationTypes in subquery  attribute  missing
> --
>
> Key: SPARK-32638
> URL: https://issues.apache.org/jira/browse/SPARK-32638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Guojian Li
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.1.0
>
>
> I am migrating SQL from MySQL to Spark SQL and met a very strange case. Below 
> is code to reproduce the exception:
>  
> {code:java}
> val spark = SparkSession.builder()
>  .master("local")
>  .appName("Word Count")
>  .getOrCreate()
> spark.sparkContext.setLogLevel("TRACE")
> val DecimalType = DataTypes.createDecimalType(20, 2)
> val schema = StructType(List(
>  StructField("a", DecimalType, true)
> ))
> val dataList = new util.ArrayList[Row]()
> val df=spark.createDataFrame(dataList,schema)
> df.printSchema()
> df.createTempView("test")
> val sql=
>  """
>  |SELECT t.kpi_04 FROM
>  |(
>  | SELECT a as `kpi_04` FROM test
>  | UNION ALL
>  | SELECT a+a as `kpi_04` FROM test
>  |) t
>  |
>  """.stripMargin
> spark.sql(sql)
> {code}
>  
> Exception Message:
>  
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved 
> attribute(s) kpi_04#2 missing from kpi_04#4 in operator !Project [kpi_04#2]. 
> Attribute(s) with the same name appear in the operation: kpi_04. Please check 
> if the right attribute(s) are used.;;
> !Project [kpi_04#2]
> +- SubqueryAlias t
>  +- Union
>  :- Project [cast(kpi_04#2 as decimal(21,2)) AS kpi_04#4]
>  : +- Project [a#0 AS kpi_04#2]
>  : +- SubqueryAlias test
>  : +- LocalRelation , [a#0]
>  +- Project [kpi_04#3]
>  +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + 
> promote_precision(cast(a#0 as decimal(21,2, DecimalType(21,2)) AS 
> kpi_04#3]
>  +- SubqueryAlias test
>  +- LocalRelation , [a#0]{code}
>  
>  
> Based on the trace log, it seems WidenSetOperationTypes adds a new outer 
> Project layer, which causes the parent query to lose the reference to the subquery. 
>  
>  
> {code:java}
>  
> === Applying Rule org.apache.spark.sql.catalyst.analysis.TypeCoercion$WidenSetOperationTypes ===
> 
> Before:
> 'Project [kpi_04#2]
> +- 'SubqueryAlias t
>    +- 'Union
>       :- Project [a#0 AS kpi_04#2]
>       :  +- SubqueryAlias test
>       :     +- LocalRelation <empty>, [a#0]
>       +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + promote_precision(cast(a#0 as decimal(21,2)))), DecimalType(21,2)) AS kpi_04#3]
>          +- SubqueryAlias test
>             +- LocalRelation <empty>, [a#0]
> 
> After:
> !Project [kpi_04#2]
> +- SubqueryAlias t
>    +- Union
>       :- Project [cast(kpi_04#2 as decimal(21,2)) AS kpi_04#4]
>       :  +- Project [a#0 AS kpi_04#2]
>       :     +- SubqueryAlias test
>       :        +- LocalRelation <empty>, [a#0]
>       +- Project [kpi_04#3]
>          +- Project [CheckOverflow((promote_precision(cast(a#0 as decimal(21,2))) + promote_precision(cast(a#0 as decimal(21,2)))), DecimalType(21,2)) AS kpi_04#3]
>             +- SubqueryAlias test
>                +- LocalRelation <empty>, [a#0]
> {code}
>  
> In the source code (WidenSetOperationTypes.scala) this is intentional behavior, 
> but it possibly misses this edge case. 
> I hope someone can help me fix it. 
>  
>  
> {code:java}
> if (targetTypes.nonEmpty) {
>   // Add an extra Project if the targetTypes are different from the original types.
>   children.map(widenTypes(_, targetTypes))
> } else {
>   // Unable to find a target type to widen, then just return the original set.
>   children
> }{code}
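> 
> Until this is fixed, a possible workaround (only a sketch against the reproduction above, assuming the widening cast is what triggers the extra Project) is to cast both UNION branches to decimal(21,2) explicitly, so WidenSetOperationTypes finds nothing to widen:
> 
> {code:java}
> // Hypothetical workaround, reusing the "test" view and the `spark` session from
> // the reproduction above: both branches already share the same type, so no
> // widening Project is inserted and the outer reference stays valid.
> val workaroundSql =
>   """
>     |SELECT t.kpi_04 FROM
>     |(
>     |  SELECT CAST(a AS DECIMAL(21,2)) AS `kpi_04` FROM test
>     |  UNION ALL
>     |  SELECT CAST(a + a AS DECIMAL(21,2)) AS `kpi_04` FROM test
>     |) t
>   """.stripMargin
> spark.sql(workaroundSql)
> {code}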
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32182) Getting Started - Quickstart

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192103#comment-17192103
 ] 

Apache Spark commented on SPARK-32182:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29679

> Getting Started - Quickstart
> 
>
> Key: SPARK-32182
> URL: https://issues.apache.org/jira/browse/SPARK-32182
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Example:
> https://koalas.readthedocs.io/en/latest/getting_started/10min.html
> https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html
> https://pandas.pydata.org/docs/getting_started/10min.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32182) Getting Started - Quickstart

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192102#comment-17192102
 ] 

Apache Spark commented on SPARK-32182:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29679

> Getting Started - Quickstart
> 
>
> Key: SPARK-32182
> URL: https://issues.apache.org/jira/browse/SPARK-32182
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Example:
> https://koalas.readthedocs.io/en/latest/getting_started/10min.html
> https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html
> https://pandas.pydata.org/docs/getting_started/10min.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32821) cannot group by with window in sql sentence for structured streaming with watermark

2020-09-08 Thread Johnny Bai (Jira)
Johnny Bai created SPARK-32821:
--

 Summary: cannot group by with window in sql sentence for 
structured streaming with watermark
 Key: SPARK-32821
 URL: https://issues.apache.org/jira/browse/SPARK-32821
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Johnny Bai
 Fix For: 3.0.1


 

The DataFrame API supports grouping a stream by a time window:

{code:java}
import spark.implicits._

// streaming DataFrame of schema { timestamp: Timestamp, word: String }
val words = ...

// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"word").count()
{code}

But group by with window is not supported in SQL style, e.g. a statement like:

"select ts_field, count(*) over window(ts_field, '1 minute', '1 minute') with watermark 1 minute from tableX group by ts_field"
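
For comparison, a sketch of what already seems possible today (the rate source and the view name here are my own choices, not part of any proposal): window() can be called from SQL inside GROUP BY, but the watermark still has to be declared on the DataFrame before the view is registered, which is the gap this issue is about.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("windowed-sql-sketch").getOrCreate()

// Hypothetical input: the built-in rate source, with columns (timestamp, value).
val events = spark.readStream.format("rate").load()

// The watermark can only be attached on the DataFrame side today.
events.withWatermark("timestamp", "1 minute").createOrReplaceTempView("events")

// window() is usable in a SQL GROUP BY, but there is no SQL syntax for the watermark itself.
val windowedCounts = spark.sql(
  """
    |SELECT window(timestamp, '10 minutes', '5 minutes') AS w, count(*) AS cnt
    |FROM events
    |GROUP BY window(timestamp, '10 minutes', '5 minutes')
  """.stripMargin)
{code}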



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32182) Getting Started - Quickstart

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192101#comment-17192101
 ] 

Apache Spark commented on SPARK-32182:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29679

> Getting Started - Quickstart
> 
>
> Key: SPARK-32182
> URL: https://issues.apache.org/jira/browse/SPARK-32182
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Example:
> https://koalas.readthedocs.io/en/latest/getting_started/10min.html
> https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html
> https://pandas.pydata.org/docs/getting_started/10min.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32204) Binder Integration

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192100#comment-17192100
 ] 

Apache Spark commented on SPARK-32204:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29679

> Binder Integration
> --
>
> Key: SPARK-32204
> URL: https://issues.apache.org/jira/browse/SPARK-32204
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> For example,
> https://github.com/databricks/koalas
> https://mybinder.org/v2/gh/databricks/koalas/master?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32204) Binder Integration

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192097#comment-17192097
 ] 

Apache Spark commented on SPARK-32204:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29679

> Binder Integration
> --
>
> Key: SPARK-32204
> URL: https://issues.apache.org/jira/browse/SPARK-32204
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> For example,
> https://github.com/databricks/koalas
> https://mybinder.org/v2/gh/databricks/koalas/master?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32815) Fix LibSVM data source loading error on file paths with glob metacharacters

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192093#comment-17192093
 ] 

Apache Spark commented on SPARK-32815:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29678

> Fix LibSVM data source loading error on file paths with glob metacharacters
> ---
>
> Key: SPARK-32815
> URL: https://issues.apache.org/jira/browse/SPARK-32815
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> SPARK-32810 fixed a long-standing bug in a few Spark built-in data sources 
> that failed to read files whose names contain glob metacharacters, such as [, 
> ], \{, }, etc.
> The CSV and JSON data sources on the Spark side were affected. We also noticed 
> that the LibSVM data source had the same code pattern that leads to the bug, 
> so the fix https://github.com/apache/spark/pull/29659 covered that data source 
> as well, but it did not include a test for the LibSVM data source.
> This ticket tracks adding a test case for LibSVM, similar to the ones for 
> CSV/JSON, to verify whether or not the fix works as intended.
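> 
> A rough sketch of what such a test could look like (the directory name, file contents, and the assumption of an active SparkSession named {{spark}} are all mine):
> 
> {code:java}
> import java.nio.file.{Files, Paths}
> 
> // Directory whose name contains glob metacharacters.
> val dir = Files.createTempDirectory("libsvm-[test]").toFile
> Files.write(Paths.get(dir.getAbsolutePath, "part-00000"),
>   "1 1:1.0 3:2.0\n0 2:4.0\n".getBytes("UTF-8"))
> 
> // Before the SPARK-32810 fix this failed, because the path was treated as a
> // glob pattern rather than a literal directory name.
> val df = spark.read.format("libsvm").load(dir.getAbsolutePath)
> assert(df.count() == 2)
> {code}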



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32811) Replace IN predicate of continuous range with boundary checks

2020-09-08 Thread Vu Ho (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vu Ho updated SPARK-32811:
--
Priority: Minor  (was: Major)

> Replace IN predicate of continuous range with boundary checks
> -
>
> Key: SPARK-32811
> URL: https://issues.apache.org/jira/browse/SPARK-32811
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Vu Ho
>Priority: Minor
>
> This expression 
> {code:java}
> select a from t where a in (1, 2, 3, 3, 4){code}
> should be translated to 
> {code:java}
> select a from t where a >= 1 and a <= 4 {code}
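> 
> One possible shape for the rewrite, as a sketch only (it assumes integer literals, ignores null handling, and is not the proposed implementation):
> 
> {code:java}
> import org.apache.spark.sql.catalyst.expressions._
> import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
> import org.apache.spark.sql.catalyst.rules.Rule
> import org.apache.spark.sql.types.IntegerType
> 
> object InToRangeCheck extends Rule[LogicalPlan] {
>   def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
>     case in @ In(value, list)
>         if list.nonEmpty && list.forall(e =>
>           e.isInstanceOf[Literal] && e.dataType == IntegerType &&
>             e.asInstanceOf[Literal].value != null) =>
>       val ints = list.map(_.asInstanceOf[Literal].value.asInstanceOf[Int]).distinct.sorted
>       // Rewrite only when the distinct literals form a continuous integer range,
>       // so membership is equivalent to two boundary checks.
>       if (ints.last - ints.head + 1 == ints.length) {
>         And(GreaterThanOrEqual(value, Literal(ints.head)),
>             LessThanOrEqual(value, Literal(ints.last)))
>       } else {
>         in
>       }
>   }
> }
> {code}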



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32820) Remove redundant shuffle exchanges inserted by EnsureRequirements

2020-09-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32820:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Remove redundant shuffle exchanges inserted by EnsureRequirements
> -
>
> Key: SPARK-32820
> URL: https://issues.apache.org/jira/browse/SPARK-32820
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Redundant repartition operations are removed by the CollapseRepartition rule, 
> but EnsureRequirements can insert another HashPartitioning or RangePartitioning 
> immediately after the repartition, so adjacent ShuffleExchanges end up in the 
> physical plan.
>  
> {code:java}
> val ordered = spark.range(1, 100).repartitionByRange(10, 
> $"id".desc).orderBy($"id")
> ordered.explain(true)
> ...
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200), true, [id=#15]
>+- Exchange rangepartitioning(id#0L DESC NULLS LAST, 10), false, [id=#14]
>   +- *(1) Range (1, 100, step=1, splits=12){code}
> {code:java}
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 0)
> val left = Seq(1,2,3).toDF.repartition(10, $"value")
> val right = Seq(1,2,3).toDF
> val joined = left.join(right, left("value") + 1 === right("value"))
> joined.explain(true)
> ...
> == Physical Plan ==
> *(3) SortMergeJoin [(value#7 + 1)], [value#12], Inner
> :- *(1) Sort [(value#7 + 1) ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning((value#7 + 1), 200), true, [id=#67]
> : +- Exchange hashpartitioning(value#7, 10), false, [id=#63]
> :+- LocalTableScan [value#7]
> +- *(2) Sort [value#12 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(value#12, 200), true, [id=#68]
>   +- LocalTableScan [value#12]{code}
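> 
> A minimal sketch of the kind of cleanup this asks for (not the actual fix; a real rule would also need to prove that dropping the inner shuffle preserves the required distribution and ordering):
> 
> {code:java}
> import org.apache.spark.sql.catalyst.rules.Rule
> import org.apache.spark.sql.execution.SparkPlan
> import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
> 
> object CollapseAdjacentShuffleExchanges extends Rule[SparkPlan] {
>   def apply(plan: SparkPlan): SparkPlan = plan transformUp {
>     // If a shuffle's direct child is another shuffle, the inner one is wasted work:
>     // its partitioning is immediately overwritten by the parent exchange.
>     case parent: ShuffleExchangeExec if parent.child.isInstanceOf[ShuffleExchangeExec] =>
>       val inner = parent.child.asInstanceOf[ShuffleExchangeExec]
>       parent.withNewChildren(Seq(inner.child))
>   }
> }
> {code}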



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32820) Remove redundant shuffle exchanges inserted by EnsureRequirements

2020-09-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32820:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Remove redundant shuffle exchanges inserted by EnsureRequirements
> -
>
> Key: SPARK-32820
> URL: https://issues.apache.org/jira/browse/SPARK-32820
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> Redundant repartition operations are removed by the CollapseRepartition rule, 
> but EnsureRequirements can insert another HashPartitioning or RangePartitioning 
> immediately after the repartition, so adjacent ShuffleExchanges end up in the 
> physical plan.
>  
> {code:java}
> val ordered = spark.range(1, 100).repartitionByRange(10, 
> $"id".desc).orderBy($"id")
> ordered.explain(true)
> ...
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200), true, [id=#15]
>+- Exchange rangepartitioning(id#0L DESC NULLS LAST, 10), false, [id=#14]
>   +- *(1) Range (1, 100, step=1, splits=12){code}
> {code:java}
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 0)
> val left = Seq(1,2,3).toDF.repartition(10, $"value")
> val right = Seq(1,2,3).toDF
> val joined = left.join(right, left("value") + 1 === right("value"))
> joined.explain(true)
> ...
> == Physical Plan ==
> *(3) SortMergeJoin [(value#7 + 1)], [value#12], Inner
> :- *(1) Sort [(value#7 + 1) ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning((value#7 + 1), 200), true, [id=#67]
> : +- Exchange hashpartitioning(value#7, 10), false, [id=#63]
> :+- LocalTableScan [value#7]
> +- *(2) Sort [value#12 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(value#12, 200), true, [id=#68]
>   +- LocalTableScan [value#12]{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32820) Remove redundant shuffle exchanges inserted by EnsureRequirements

2020-09-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192081#comment-17192081
 ] 

Apache Spark commented on SPARK-32820:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29677

> Remove redundant shuffle exchanges inserted by EnsureRequirements
> -
>
> Key: SPARK-32820
> URL: https://issues.apache.org/jira/browse/SPARK-32820
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Redundant repartition operations are removed by the CollapseRepartition rule, 
> but EnsureRequirements can insert another HashPartitioning or RangePartitioning 
> immediately after the repartition, so adjacent ShuffleExchanges end up in the 
> physical plan.
>  
> {code:java}
> val ordered = spark.range(1, 100).repartitionByRange(10, 
> $"id".desc).orderBy($"id")
> ordered.explain(true)
> ...
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200), true, [id=#15]
>+- Exchange rangepartitioning(id#0L DESC NULLS LAST, 10), false, [id=#14]
>   +- *(1) Range (1, 100, step=1, splits=12){code}
> {code:java}
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 0)
> val left = Seq(1,2,3).toDF.repartition(10, $"value")
> val right = Seq(1,2,3).toDF
> val joined = left.join(right, left("value") + 1 === right("value"))
> joined.explain(true)
> ...
> == Physical Plan ==
> *(3) SortMergeJoin [(value#7 + 1)], [value#12], Inner
> :- *(1) Sort [(value#7 + 1) ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning((value#7 + 1), 200), true, [id=#67]
> : +- Exchange hashpartitioning(value#7, 10), false, [id=#63]
> :+- LocalTableScan [value#7]
> +- *(2) Sort [value#12 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(value#12, 200), true, [id=#68]
>   +- LocalTableScan [value#12]{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



  1   2   >