[jira] [Updated] (SPARK-46981) Driver OOM happens in query planning phase with empty tables

2024-02-05 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-46981:
--
Description: 
We have observed that a driver OOM occurs in the query planning phase with empty 
tables when we run specific patterns of queries.
h2. Issue details

If we run the query with the WHERE condition {{pt>='20231004' and 
pt<='20231004'}}, the query fails in the planning phase due to a driver OOM, 
specifically {{java.lang.OutOfMemoryError: GC overhead limit exceeded}}.

If we change the WHERE condition from {{pt>='20231004' and pt<='20231004'}} to 
{{pt='20231004' or pt='20231005'}}, the SQL runs without any error.
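A minimal sketch of the failing vs. passing runs (assuming a SparkSession {{spark}}, the tables created by the attached script, and the attached SQL file in the working directory; the predicate text must match the file exactly):
{code:scala}
// Hypothetical driver snippet: run the attached reproduction query as-is,
// then again with the range predicate rewritten to the OR form.
val failing = scala.io.Source.fromFile("test_and_twodays_simplified.sql").mkString

// Fails while planning: java.lang.OutOfMemoryError: GC overhead limit exceeded
spark.sql(failing).collect()

// Rewriting the range predicate to equality predicates lets the same query
// plan and run normally against the same empty tables.
val passing = failing.replace(
  "pt>='20231004' and pt<='20231004'",
  "pt='20231004' or pt='20231005'")
spark.sql(passing).collect()
{code}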

 

This issue happens even with empty tables, and it occurs before any actual data 
is loaded. It appears to be an issue on the Catalyst side.
h2. Reproduction steps

Attaching a script and a query to reproduce the issue.
 * create_sanitized_tables.py: Script to create table definitions
 ** No data files need to be placed, as the issue occurs even with an empty location.
 * test_and_twodays_simplified.sql: Query to reproduce the issue

Here's the typical stacktrace:

at scala.collection.immutable.Vector.iterator(Vector.scala:100)
at scala.collection.immutable.Vector.iterator(Vector.scala:69)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)
at scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)
at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)
at org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)
at org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown Source)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:307)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2114/1104718965.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2117/2079515765.apply(Unknown Source)
at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)
GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.Vector.iterator(Vector.scala:100)
at scala.collection.immutable.Vector.iterator(Vector.scala:69)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)
at scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)
at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)
at org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)
at org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown Source)
at scala.Option.getOrElse(Option.scala:189)


[jira] [Updated] (SPARK-46981) Driver OOM happens in query planning phase with empty tables

2024-02-05 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-46981:
--
Attachment: test_and_twodays_simplified.sql

> Driver OOM happens in query planning phase with empty tables
> 
>
> Key: SPARK-46981
> URL: https://issues.apache.org/jira/browse/SPARK-46981
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: * OSS Spark 3.5.0
>  * Amazon EMR Spark 3.3.0 (EMR release label 6.9.0)
>  * AWS Glue Spark 3.3.0 (Glue version 4.0)
>Reporter: Noritaka Sekiyama
>Priority: Major
> Attachments: create_sanitized_tables.py, 
> test_and_twodays_simplified.sql
>
>
> We have observed that a driver OOM occurs in the query planning phase with 
> empty tables when we run specific patterns of queries.
> h2. Issue details
> If we run the query with the WHERE condition {{pt>='20231004' and 
> pt<='20231004'}}, the query fails in the planning phase due to a driver OOM, 
> specifically {{java.lang.OutOfMemoryError: GC overhead limit exceeded}}.
> If we change the WHERE condition from {{pt>='20231004' and pt<='20231004'}} 
> to {{pt='20231004' or pt='20231005'}}, the SQL runs without any error.
>  
> This issue happens even with empty tables, and it occurs before any actual 
> data is loaded. It appears to be an issue on the Catalyst side.
> h2. Reproduction steps
> Attaching a script and a query to reproduce the issue.
>  * create_sanitized_tables.py: Script to create table definitions
>  * test_and_twodays_simplified.sql: Query to reproduce the issue

[jira] [Created] (SPARK-46981) Driver OOM happens in query planning phase with empty tables

2024-02-05 Thread Noritaka Sekiyama (Jira)
Noritaka Sekiyama created SPARK-46981:
-

 Summary: Driver OOM happens in query planning phase with empty 
tables
 Key: SPARK-46981
 URL: https://issues.apache.org/jira/browse/SPARK-46981
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
 Environment: * OSS Spark 3.5.0
 * Amazon EMR Spark 3.3.0 (EMR release label 6.9.0)
 * AWS Glue Spark 3.3.0 (Glue version 4.0)
Reporter: Noritaka Sekiyama
 Attachments: create_sanitized_tables.py

We have observed that a driver OOM occurs in the query planning phase with empty 
tables when we run specific patterns of queries.
h2. Issue details

If we run the query with the WHERE condition {{pt>='20231004' and 
pt<='20231004'}}, the query fails in the planning phase due to a driver OOM, 
specifically {{java.lang.OutOfMemoryError: GC overhead limit exceeded}}.

If we change the WHERE condition from {{pt>='20231004' and pt<='20231004'}} to 
{{pt='20231004' or pt='20231005'}}, the SQL runs without any error.

 

This issue happens even with empty tables, and it occurs before any actual data 
is loaded. It appears to be an issue on the Catalyst side.
h2. Reproduction steps

Attaching a script and a query to reproduce the issue.
 * create_sanitized_tables.py: Script to create table definitions
 * test_and_twodays_simplified.sql: Query to reproduce the issue


[jira] [Updated] (SPARK-46981) Driver OOM happens in query planning phase with empty tables

2024-02-05 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-46981:
--
Attachment: create_sanitized_tables.py

> Driver OOM happens in query planning phase with empty tables
> 
>
> Key: SPARK-46981
> URL: https://issues.apache.org/jira/browse/SPARK-46981
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: * OSS Spark 3.5.0
>  * Amazon EMR Spark 3.3.0 (EMR release label 6.9.0)
>  * AWS Glue Spark 3.3.0 (Glue version 4.0)
>Reporter: Noritaka Sekiyama
>Priority: Major
> Attachments: create_sanitized_tables.py
>
>
> We have observed that a driver OOM occurs in the query planning phase with 
> empty tables when we run specific patterns of queries.
> h2. Issue details
> If we run the query with the WHERE condition {{pt>='20231004' and 
> pt<='20231004'}}, the query fails in the planning phase due to a driver OOM, 
> specifically {{java.lang.OutOfMemoryError: GC overhead limit exceeded}}.
> If we change the WHERE condition from {{pt>='20231004' and pt<='20231004'}} 
> to {{pt='20231004' or pt='20231005'}}, the SQL runs without any error.
>  
> This issue happens even with empty tables, and it occurs before any actual 
> data is loaded. It appears to be an issue on the Catalyst side.
> h2. Reproduction steps
> Attaching a script and a query to reproduce the issue.
>  * create_sanitized_tables.py: Script to create table definitions
>  * test_and_twodays_simplified.sql: Query to reproduce the issue

[jira] [Updated] (SPARK-33266) Add total duration, read duration, and write duration as task level metrics

2020-10-27 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-33266:
--
Description: 
Sometimes we need to identify performance bottlenecks, for example, how long it 
took to read from a data store or how long it took to write into another data store.

It would be great if we could have total duration, read duration, and write 
duration as task-level metrics.

Currently it seems that neither `InputMetrics` nor `OutputMetrics` has 
duration-related metrics.

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/InputMetrics.scala#L42-L58]

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/OutputMetrics.scala#L41-L56]

 

On the other hand, other metrics classes such as `ShuffleWriteMetrics` already 
have a write time. We might need similar metrics for input/output.

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ShuffleWriteMetrics.scala]
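As an illustration of the kind of task-level timing being requested, here is a minimal sketch (using the standard `SparkListener` API) that surfaces the one duration that already exists, `ShuffleWriteMetrics.writeTime`; equivalent read/write durations in `InputMetrics`/`OutputMetrics` are what this issue asks for:
{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs the per-task shuffle write time (nanoseconds), the closest existing
// analogue to the requested input/output durations.
class WriteTimeListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"shuffleWriteTimeNs=${metrics.shuffleWriteMetrics.writeTime}")
    }
  }
}

// Register it on an existing session:
// spark.sparkContext.addSparkListener(new WriteTimeListener())
{code}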

 

  was:
Sometimes we need to identify performance bottlenecks, for example, how long it 
took to read from data store, how long it took to write into another data store.

It would be great if we can have total duration, read duration, and write 
duration as task level metrics.

Currently it seems that both `InputMetrics` and `OutputMetrics` do not have 
duration related metrics.

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/InputMetrics.scala#L42-L58]

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/OutputMetrics.scala#L41-L56
]

On the other hand, other metrics such as `ShuffleWriteMetrics` has write time. 
We might need similar metrics for input/output.

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ShuffleWriteMetrics.scala]

 


> Add total duration, read duration, and write duration as task level metrics
> ---
>
> Key: SPARK-33266
> URL: https://issues.apache.org/jira/browse/SPARK-33266
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> Sometimes we need to identify performance bottlenecks, for example, how long 
> it took to read from data store, how long it took to write into another data 
> store.
> It would be great if we can have total duration, read duration, and write 
> duration as task level metrics.
> Currently it seems that both `InputMetrics` and `OutputMetrics` do not have 
> duration related metrics.
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/InputMetrics.scala#L42-L58]
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/OutputMetrics.scala#L41-L56]
>  
> On the other hand, other metrics such as `ShuffleWriteMetrics` has write 
> time. We might need similar metrics for input/output.
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ShuffleWriteMetrics.scala]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33266) Add total duration, read duration, and write duration as task level metrics

2020-10-27 Thread Noritaka Sekiyama (Jira)
Noritaka Sekiyama created SPARK-33266:
-

 Summary: Add total duration, read duration, and write duration as 
task level metrics
 Key: SPARK-33266
 URL: https://issues.apache.org/jira/browse/SPARK-33266
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Noritaka Sekiyama


Sometimes we need to identify performance bottlenecks, for example, how long it 
took to read from data store, how long it took to write into another data store.

It would be great if we can have total duration, read duration, and write 
duration as task level metrics.

Currently it seems that both `InputMetrics` and `OutputMetrics` do not have 
duration related metrics.

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/InputMetrics.scala#L42-L58]

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/OutputMetrics.scala#L41-L56
]

On the other hand, other metrics such as `ShuffleWriteMetrics` has write time. 
We might need similar metrics for input/output.

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ShuffleWriteMetrics.scala]

 






[jira] [Updated] (SPARK-32432) Add support for reading ORC/Parquet files with SymlinkTextInputFormat

2020-07-28 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-32432:
--
Description: 
Hive-style symlink tables (SymlinkTextInputFormat) are commonly used in different 
analytic engines, including prestodb and prestosql.

Currently, SymlinkTextInputFormat works with JSON/CSV files but does not work 
with ORC/Parquet files in Apache Spark (and Apache Hive).

On the other hand, prestodb and prestosql support SymlinkTextInputFormat with 
ORC/Parquet files.

This issue is to add support for reading ORC/Parquet files with 
SymlinkTextInputFormat in Apache Spark.
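For context, here is a minimal sketch of the symlink table pattern in question (hypothetical table name, schema, and S3 paths; assumes a Hive-enabled SparkSession). prestodb/prestosql can already read Parquet data files through such a manifest, while Spark currently cannot, which is the capability this issue requests:
{code:scala}
// Each file under LOCATION is a manifest listing one data-file path per line.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS sales_symlink (id BIGINT, amount DOUBLE)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION 's3://my-bucket/manifests/sales/'
""")

// Works in prestodb/prestosql; in Spark the symlink manifest is not resolved
// to the underlying Parquet files today.
spark.sql("SELECT count(*) FROM sales_symlink").show()
{code}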

 

Related links
 * Hive's SymlinkTextInputFormat: 
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java]
 * prestosql's implementation to add support for reading avro files with 
SymlinkTextInputFormat: 
[https://github.com/vincentpoon/prestosql/blob/master/presto-hive/src/main/java/io/prestosql/plugin/hive/BackgroundHiveSplitLoader.java]

  was:
Hive style symlink (SymlinkTextInputFormat) is commonly used in different 
analytic engines including prestodb and prestosql.

Currently SymlinkTextInputFormat works with JSON/CSV files but does not work 
with ORC/Parquet files in Apache Spark.

This issue is to add support for reading ORC/Parquet files with 
SymlinkTextInputFormat.


> Add support for reading ORC/Parquet files with SymlinkTextInputFormat
> -
>
> Key: SPARK-32432
> URL: https://issues.apache.org/jira/browse/SPARK-32432
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> Hive style symlink (SymlinkTextInputFormat) is commonly used in different 
> analytic engines including prestodb and prestosql.
> Currently SymlinkTextInputFormat works with JSON/CSV files but does not work 
> with ORC/Parquet files in Apache Spark (and Apache Hive).
> On the other hand, prestodb and prestosql support SymlinkTextInputFormat with 
> ORC/Parquet files.
> This issue is to add support for reading ORC/Parquet files with 
> SymlinkTextInputFormat in Apache Spark.
>  
> Related links
>  * Hive's SymlinkTextInputFormat: 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java]
>  * prestosql's implementation to add support for reading avro files with 
> SymlinkTextInputFormat: 
> [https://github.com/vincentpoon/prestosql/blob/master/presto-hive/src/main/java/io/prestosql/plugin/hive/BackgroundHiveSplitLoader.java]






[jira] [Created] (SPARK-32432) Add support for reading ORC/Parquet files with SymlinkTextInputFormat

2020-07-24 Thread Noritaka Sekiyama (Jira)
Noritaka Sekiyama created SPARK-32432:
-

 Summary: Add support for reading ORC/Parquet files with 
SymlinkTextInputFormat
 Key: SPARK-32432
 URL: https://issues.apache.org/jira/browse/SPARK-32432
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Noritaka Sekiyama


Hive style symlink (SymlinkTextInputFormat) is commonly used in different 
analytic engines including prestodb and prestosql.

Currently SymlinkTextInputFormat works with JSON/CSV files but does not work 
with ORC/Parquet files in Apache Spark.

This issue is to add support for reading ORC/Parquet files with 
SymlinkTextInputFormat.






[jira] [Updated] (SPARK-32112) Easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time

2020-06-28 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-32112:
--
Description: 
Repartitioning/coalescing is very important for optimizing a Spark application's 
performance; however, a lot of users struggle with determining the number of 
partitions.
 This issue is to add an easier way to repartition/coalesce DataFrames based on 
the number of parallel tasks that Spark can process at the same time.

It will help Spark users determine the optimal number of partitions.

Expected use-cases:
 - repartition with the calculated parallel tasks

Notes:
 - `SparkContext.maxNumConcurrentTasks` might help but it cannot be accessed by 
Spark apps.
 - `SparkContext.getExecutorMemoryStatus` might help to calculate the number of 
available slots to process tasks (see the sketch below).
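Here is a minimal sketch of the kind of helper being requested (assuming a spark-shell style `spark` session, static allocation, and homogeneous executors; the slot calculation and the `* 2` multiplier are illustrative assumptions, not an agreed design):
{code:scala}
val sc = spark.sparkContext

// getExecutorMemoryStatus also lists the driver's block manager; subtract it.
val numExecutors = math.max(sc.getExecutorMemoryStatus.size - 1, 1)
val coresPerExecutor = sc.getConf.getInt("spark.executor.cores", 1)
val cpusPerTask = sc.getConf.getInt("spark.task.cpus", 1)

// Rough number of tasks Spark can run at the same time.
val parallelTasks = numExecutors * coresPerExecutor / cpusPerTask

// Repartition an example DataFrame to a small multiple of that parallelism.
val df = spark.range(0, 1000000).toDF("id")
val repartitioned = df.repartition(parallelTasks * 2)
{code}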

  was:
Repartition/coalesce is very important to optimize Spark application's 
performance, however, a lot of users are struggling with determining the number 
of partitions.
 This issue is to add a easier way to repartition/coalesce DataFrames based on 
the number of parallel tasks that Spark can process at the same time.

It will help Spark users to determine the optimal number of partitions.

Expected use-cases:
 - repartition with the calculated parallel tasks

 

There is `SparkContext.maxNumConcurrentTasks` but it cannot be accessed by 
Spark apps.


> Easier way to repartition/coalesce DataFrames based on the number of parallel 
> tasks that Spark can process at the same time
> ---
>
> Key: SPARK-32112
> URL: https://issues.apache.org/jira/browse/SPARK-32112
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> Repartition/coalesce is very important to optimize Spark application's 
> performance, however, a lot of users are struggling with determining the 
> number of partitions.
>  This issue is to add a easier way to repartition/coalesce DataFrames based 
> on the number of parallel tasks that Spark can process at the same time.
> It will help Spark users to determine the optimal number of partitions.
> Expected use-cases:
>  - repartition with the calculated parallel tasks
> Notes:
>  - `SparkContext.maxNumConcurrentTasks` might help but it cannot be accessed 
> by Spark apps.
>  - `SparkContext.getExecutorMemoryStatus` might help to calculate the number 
> of available slots to process tasks.






[jira] [Updated] (SPARK-32112) Easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time

2020-06-27 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-32112:
--
Description: 
Repartition/coalesce is very important to optimize Spark application's 
performance, however, a lot of users are struggling with determining the number 
of partitions.
 This issue is to add a easier way to repartition/coalesce DataFrames based on 
the number of parallel tasks that Spark can process at the same time.

It will help Spark users to determine the optimal number of partitions.

Expected use-cases:
 - repartition with the calculated parallel tasks

 

There is `SparkContext.maxNumConcurrentTasks` but it cannot be accessed by 
Spark apps.

  was:
Repartition/coalesce is very important to optimize Spark application's 
performance, however, a lot of users are struggling with determining the number 
of partitions.
This issue is to add a method to calculate the number of parallel tasks that 
Spark can process at the same time.

It will help Spark users to determine the optimal number of partitions.

Expected use-cases:

- repartition with the calculated parallel tasks


> Easier way to repartition/coalesce DataFrames based on the number of parallel 
> tasks that Spark can process at the same time
> ---
>
> Key: SPARK-32112
> URL: https://issues.apache.org/jira/browse/SPARK-32112
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> Repartition/coalesce is very important to optimize Spark application's 
> performance, however, a lot of users are struggling with determining the 
> number of partitions.
>  This issue is to add a easier way to repartition/coalesce DataFrames based 
> on the number of parallel tasks that Spark can process at the same time.
> It will help Spark users to determine the optimal number of partitions.
> Expected use-cases:
>  - repartition with the calculated parallel tasks
>  
> There is `SparkContext.maxNumConcurrentTasks` but it cannot be accessed by 
> Spark apps.






[jira] [Updated] (SPARK-32112) Easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time

2020-06-27 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-32112:
--
Summary: Easier way to repartition/coalesce DataFrames based on the number 
of parallel tasks that Spark can process at the same time  (was: Add a method 
to calculate the number of parallel tasks that Spark can process at the same 
time)

> Easier way to repartition/coalesce DataFrames based on the number of parallel 
> tasks that Spark can process at the same time
> ---
>
> Key: SPARK-32112
> URL: https://issues.apache.org/jira/browse/SPARK-32112
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> Repartition/coalesce is very important to optimize Spark application's 
> performance, however, a lot of users are struggling with determining the 
> number of partitions.
> This issue is to add a method to calculate the number of parallel tasks that 
> Spark can process at the same time.
> It will help Spark users to determine the optimal number of partitions.
> Expected use-cases:
> - repartition with the calculated parallel tasks






[jira] [Created] (SPARK-32112) Add a method to calculate the number of parallel tasks that Spark can process at the same time

2020-06-26 Thread Noritaka Sekiyama (Jira)
Noritaka Sekiyama created SPARK-32112:
-

 Summary: Add a method to calculate the number of parallel tasks 
that Spark can process at the same time
 Key: SPARK-32112
 URL: https://issues.apache.org/jira/browse/SPARK-32112
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Noritaka Sekiyama


Repartition/coalesce is very important to optimize Spark application's 
performance, however, a lot of users are struggling with determining the number 
of partitions.
This issue is to add a method to calculate the number of parallel tasks that 
Spark can process at the same time.

It will help Spark users to determine the optimal number of partitions.

Expected use-cases:

- repartition with the calculated parallel tasks






[jira] [Updated] (SPARK-32013) Support query execution before/after reading/writing over JDBC

2020-06-22 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-32013:
--
Description: 
For ETL workloads, there is a common requirement to execute SQL statements 
before/after reading/writing over JDBC.
 Here are some examples:
 - Create a view with specific conditions
 - Delete/Update some records
 - Truncate a table (already possible with the `truncate` option)
 - Execute a stored procedure (also requested in SPARK-32014)

Currently, the `query` option is available to specify a SQL statement against a 
JDBC data source when loading data as a DataFrame.
 However, this query is only for reading data, and it does not support the 
common examples listed above.

On the other hand, there is the `sessionInitStatement` option, which runs custom 
SQL to implement session-initialization code once each database session is 
opened. Since it runs per session, it cannot be used for non-idempotent 
operations.

 

If Spark can support executing SQL statements against JDBC data sources 
before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar options, `preactions` 
and `postactions`. [https://github.com/databricks/spark-redshift]
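For illustration, here is a minimal sketch of the current workaround (hypothetical JDBC URL, table names, and SQL statements; assumes the JDBC driver is on the classpath and a `spark` session): the pre/post statements are executed manually around a DataFrame JDBC write, which is exactly what built-in `preactions`/`postactions`-style options would replace.
{code:scala}
import java.sql.DriverManager

val url = "jdbc:postgresql://host:5432/db?user=etl&password=secret"

def runStatement(sql: String): Unit = {
  val conn = DriverManager.getConnection(url)
  try conn.createStatement().execute(sql) finally conn.close()
}

runStatement("TRUNCATE TABLE staging_orders")   // "preaction"

val df = spark.range(0, 100).selectExpr("id AS order_id")
df.write
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "staging_orders")
  .mode("append")
  .save()

runStatement("CALL refresh_orders_mart()")      // "postaction"
{code}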

  was:
For ETL workload, there is a common requirement to perform SQL statement 
before/after reading/writing over JDBC.
 Here's examples;
 - Create a view with specific conditions
 - Delete/Update some records
 - Truncate a table (it is already possible in `truncate` option)
 - Execute stored procedure (it is also requested in SPARK-32014)

Currently `query` options is available to specify SQL statement against JDBC 
datasource when loading data as DataFrame.
 However, this query is only for reading data, and it does not support the 
common examples listed above.

On the other hand, there is `sessionInitStatement` option available before 
writing data from DataFrame.
This option is to run custom SQL in order to implement session initialization 
code. Since it runs per session, it cannot be used for write operations.

 

If Spark can support executing SQL statement against JDBC datasources 
before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar option like `preactions` 
and `postactions`. [https://github.com/databricks/spark-redshift]


> Support query execution before/after reading/writing over JDBC
> --
>
> Key: SPARK-32013
> URL: https://issues.apache.org/jira/browse/SPARK-32013
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> For ETL workload, there is a common requirement to perform SQL statement 
> before/after reading/writing over JDBC.
>  Here's examples;
>  - Create a view with specific conditions
>  - Delete/Update some records
>  - Truncate a table (it is already possible in `truncate` option)
>  - Execute stored procedure (it is also requested in SPARK-32014)
> Currently `query` options is available to specify SQL statement against JDBC 
> datasource when loading data as DataFrame.
>  However, this query is only for reading data, and it does not support the 
> common examples listed above.
> On the other hand, there is `sessionInitStatement` option available before 
> writing data from DataFrame.
>  This option is to run custom SQL in order to implement session 
> initialization code. Since it runs per session, it cannot be used for 
> non-idempotent operations.
>  
> If Spark can support executing SQL statement against JDBC datasources 
> before/after reading/writing over JDBC, it can cover a lot of common 
> use-cases.
> Note: Databricks' old Redshift connector has similar option like `preactions` 
> and `postactions`. [https://github.com/databricks/spark-redshift]






[jira] [Updated] (SPARK-32013) Support query execution before/after reading/writing over JDBC

2020-06-22 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-32013:
--
Description: 
For ETL workload, there is a common requirement to perform SQL statement 
before/after reading/writing over JDBC.
 Here's examples;
 - Create a view with specific conditions
 - Delete/Update some records
 - Truncate a table (it is already possible in `truncate` option)
 - Execute stored procedure (it is also requested in SPARK-32014)

Currently `query` options is available to specify SQL statement against JDBC 
datasource when loading data as DataFrame.
 However, this query is only for reading data, and it does not support the 
common examples listed above.

On the other hand, there is `sessionInitStatement` option available before 
writing data from DataFrame.
This option is to run custom SQL in order to implement session initialization 
code. Since it runs per session, it cannot be used for write operations.

 

If Spark can support executing SQL statement against JDBC datasources 
before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar option like `preactions` 
and `postactions`. [https://github.com/databricks/spark-redshift]

  was:
For ETL workload, there is a common requirement to perform SQL statement 
before/after reading/writing over JDBC.
 Here's examples;
 - Create a view with specific conditions
 - Delete/Update some records
 - Truncate a table (it is already possible in `truncate` option)
 - Execute stored procedure (it is also requested in SPARK-32014)

Currently `query` options is available to specify SQL statement against JDBC 
datasource when loading data as DataFrame.
 [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html]
 However, this query is only for reading data, and it does not support the 
common examples listed above.

If Spark can support executing SQL statement against JDBC datasources 
before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar option like `preactions` 
and `postactions`. [https://github.com/databricks/spark-redshift]


> Support query execution before/after reading/writing over JDBC
> --
>
> Key: SPARK-32013
> URL: https://issues.apache.org/jira/browse/SPARK-32013
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> For ETL workload, there is a common requirement to perform SQL statement 
> before/after reading/writing over JDBC.
>  Here's examples;
>  - Create a view with specific conditions
>  - Delete/Update some records
>  - Truncate a table (it is already possible in `truncate` option)
>  - Execute stored procedure (it is also requested in SPARK-32014)
> Currently `query` options is available to specify SQL statement against JDBC 
> datasource when loading data as DataFrame.
>  However, this query is only for reading data, and it does not support the 
> common examples listed above.
> On the other hand, there is `sessionInitStatement` option available before 
> writing data from DataFrame.
> This option is to run custom SQL in order to implement session initialization 
> code. Since it runs per session, it cannot be used for write operations.
>  
> If Spark can support executing SQL statement against JDBC datasources 
> before/after reading/writing over JDBC, it can cover a lot of common 
> use-cases.
> Note: Databricks' old Redshift connector has similar option like `preactions` 
> and `postactions`. [https://github.com/databricks/spark-redshift]






[jira] [Updated] (SPARK-32013) Support query execution before/after reading/writing over JDBC

2020-06-17 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-32013:
--
Description: 
For ETL workload, there is a common requirement to perform SQL statement 
before/after reading/writing over JDBC.
 Here's examples;
 - Create a view with specific conditions
 - Delete/Update some records
 - Truncate a table (it is already possible in `truncate` option)
 - Execute stored procedure (it is also requested in SPARK-32014)

Currently `query` options is available to specify SQL statement against JDBC 
datasource when loading data as DataFrame.
 [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html]
 However, this query is only for reading data, and it does not support the 
common examples listed above.

If Spark can support executing SQL statement against JDBC datasources 
before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar option like `preactions` 
and `postactions`. [https://github.com/databricks/spark-redshift]

  was:
For ETL workload, there is a common requirement to perform SQL statement 
before/after reading/writing over JDBC.
 Here's examples;
 - Create a view with specific conditions
 - Delete/Update some records
 - Truncate a table (it is already possible in `truncate` option)
 - Execute stored procedure

Currently `query` options is available to specify SQL statement against JDBC 
datasource when loading data as DataFrame.
 [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html]
 However, this query is only for reading data, and it does not support the 
common examples listed above.

If Spark can support executing SQL statement against JDBC datasources 
before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar option like `preactions` 
and `postactions`. https://github.com/databricks/spark-redshift


> Support query execution before/after reading/writing over JDBC
> --
>
> Key: SPARK-32013
> URL: https://issues.apache.org/jira/browse/SPARK-32013
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> For ETL workload, there is a common requirement to perform SQL statement 
> before/after reading/writing over JDBC.
>  Here's examples;
>  - Create a view with specific conditions
>  - Delete/Update some records
>  - Truncate a table (it is already possible in `truncate` option)
>  - Execute stored procedure (it is also requested in SPARK-32014)
> Currently `query` options is available to specify SQL statement against JDBC 
> datasource when loading data as DataFrame.
>  [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html]
>  However, this query is only for reading data, and it does not support the 
> common examples listed above.
> If Spark can support executing SQL statement against JDBC datasources 
> before/after reading/writing over JDBC, it can cover a lot of common 
> use-cases.
> Note: Databricks' old Redshift connector has similar option like `preactions` 
> and `postactions`. [https://github.com/databricks/spark-redshift]






[jira] [Updated] (SPARK-32013) Support query execution before/after reading/writing over JDBC

2020-06-17 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-32013:
--
Description: 
For ETL workload, there is a common requirement to perform SQL statement 
before/after reading/writing over JDBC.
 Here's examples;
 - Create a view with specific conditions
 - Delete/Update some records
 - Truncate a table (it is already possible in `truncate` option)
 - Execute stored procedure

Currently `query` options is available to specify SQL statement against JDBC 
datasource when loading data as DataFrame.
 [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html]
 However, this query is only for reading data, and it does not support the 
common examples listed above.

If Spark can support executing SQL statement against JDBC datasources 
before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar option like `preactions` 
and `postactions`. https://github.com/databricks/spark-redshift

  was:
For ETL workload, there is a common requirement to perform SQL statement 
before/after reading/writing over JDBC.
Here's examples;
- Create a view with specific conditions
- Delete/Update some records
- Truncate a table (it is already possible in `truncate` option)
- Execute stored procedure

Currently `query` options is available to specify SQL statement against JDBC 
datasource when loading data as DataFrame.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
However, this query is only for reading data, and it does not support the 
common examples listed above.

If Spark can support executing SQL statement against JDBC datasources 
before/after reading/writing over JDBC, it can cover a lot of common use-cases.

Note: Databricks' old Redshift connector has similar option like `preactions` 
and `postactions`.



> Support query execution before/after reading/writing over JDBC
> --
>
> Key: SPARK-32013
> URL: https://issues.apache.org/jira/browse/SPARK-32013
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> For ETL workload, there is a common requirement to perform SQL statement 
> before/after reading/writing over JDBC.
>  Here's examples;
>  - Create a view with specific conditions
>  - Delete/Update some records
>  - Truncate a table (it is already possible in `truncate` option)
>  - Execute stored procedure
> Currently `query` options is available to specify SQL statement against JDBC 
> datasource when loading data as DataFrame.
>  [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html]
>  However, this query is only for reading data, and it does not support the 
> common examples listed above.
> If Spark can support executing SQL statement against JDBC datasources 
> before/after reading/writing over JDBC, it can cover a lot of common 
> use-cases.
> Note: Databricks' old Redshift connector has similar option like `preactions` 
> and `postactions`. https://github.com/databricks/spark-redshift






[jira] [Updated] (SPARK-32013) Support query execution before/after reading/writing over JDBC

2020-06-17 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-32013:
--
Description: 
For ETL workloads, there is a common requirement to run SQL statements 
before/after reading from or writing to a database over JDBC.
Here are some examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure

Currently the `query` option is available to specify a SQL statement against a 
JDBC data source when loading data as a DataFrame.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
However, this option only applies to reads, and it does not cover the common 
examples listed above.

If Spark could execute SQL statements against JDBC data sources before/after 
reading/writing over JDBC, it would cover many common use cases.

Note: Databricks' old Redshift connector offers similar options, `preactions` 
and `postactions`.


  was:
For ETL workloads, there is a common requirement to run SQL statements 
before/after reading from or writing to a database over JDBC.
Here are some examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure

Currently the `query` option is available to specify a SQL statement against a 
JDBC data source when loading data as a DataFrame.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
However, this option only applies to reads, and it does not cover the common 
examples listed above.

If Spark could execute SQL statements against JDBC data sources before/after 
reading/writing over JDBC, it would cover many common use cases.



> Support query execution before/after reading/writing over JDBC
> --
>
> Key: SPARK-32013
> URL: https://issues.apache.org/jira/browse/SPARK-32013
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> For ETL workloads, there is a common requirement to run SQL statements 
> before/after reading from or writing to a database over JDBC.
> Here are some examples:
> - Create a view with specific conditions
> - Delete/update some records
> - Truncate a table (already possible via the `truncate` option)
> - Execute a stored procedure
> Currently the `query` option is available to specify a SQL statement against 
> a JDBC data source when loading data as a DataFrame.
> https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
> However, this option only applies to reads, and it does not cover the common 
> examples listed above.
> If Spark could execute SQL statements against JDBC data sources before/after 
> reading/writing over JDBC, it would cover many common use cases.
> Note: Databricks' old Redshift connector offers similar options, `preactions` 
> and `postactions`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32013) Support query execution before/after reading/writing over JDBC

2020-06-17 Thread Noritaka Sekiyama (Jira)
Noritaka Sekiyama created SPARK-32013:
-

 Summary: Support query execution before/after reading/writing over 
JDBC
 Key: SPARK-32013
 URL: https://issues.apache.org/jira/browse/SPARK-32013
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Noritaka Sekiyama


For ETL workloads, there is a common requirement to run SQL statements 
before/after reading from or writing to a database over JDBC.
Here are some examples:
- Create a view with specific conditions
- Delete/update some records
- Truncate a table (already possible via the `truncate` option)
- Execute a stored procedure

Currently the `query` option is available to specify a SQL statement against a 
JDBC data source when loading data as a DataFrame.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
However, this option only applies to reads, and it does not cover the common 
examples listed above.

If Spark could execute SQL statements against JDBC data sources before/after 
reading/writing over JDBC, it would cover many common use cases.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28069) Switch log directory from Spark UI without restarting history server

2019-06-16 Thread Noritaka Sekiyama (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-28069:
--
Description: 
History server polls the directory specified in spark.history.fs.logDirectory, 
and displays the event logs based on the data included in the directory.

In the Cloud, event logs might be written into different directories from 
different clusters.
Currently it is not so easy to launch history server for multiple clusters 
because there is no ways to switch log directory after starting the history 
server.
If users can switch log directory from Spark UI without restarting history 
server, it would be beneficial for users who wants to use Cloud storage for 
event log directory.

Suggested implementation: 
* History server polls the directory specified in spark.history.fs.logDirectory 
when it started.
* Add a button of "Switch log directory" at the top page of Web UI.
* The directory which users selected is stored only in memory. (It won't be 
persistent configuration.)
* Next time when history server started, the directory specified in 
spark.history.fs.logDirectory is used.
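
A minimal, self-contained sketch of the suggested behaviour (not actual History 
Server code; the class name and directories are made up): the selected 
directory lives only in memory, and the configured value wins again after a 
restart.

{code:scala}
// Sketch only: models the proposal, not the real HistoryServer implementation.
class LogDirectorySelection(configuredDir: String) {
  // Volatile so a UI handler thread and the polling thread see updates.
  @volatile private var overrideDir: Option[String] = None

  /** Called when the user clicks the proposed "Switch log directory" button. */
  def switchTo(dir: String): Unit = { overrideDir = Some(dir) }

  /** Directory the polling loop should scan right now. */
  def currentDir: String = overrideDir.getOrElse(configuredDir)
}

object LogDirectorySelectionDemo extends App {
  // Value that would come from spark.history.fs.logDirectory.
  val selection = new LogDirectorySelection("s3a://bucket-a/spark-events")
  println(selection.currentDir)                     // configured default
  selection.switchTo("s3a://bucket-b/spark-events")
  println(selection.currentDir)                     // in-memory override
  // A restarted history server builds a fresh instance from the config,
  // so the override is intentionally not persisted.
}
{code}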


  was:
The history server polls the directory specified in 
spark.history.fs.logDirectory and displays event logs based on the data in that 
directory.

In the cloud, event logs might be written to different directories by different 
clusters.
Currently it is not easy to run a single history server for multiple clusters 
because there is no way to switch the log directory after the history server 
has started.
If users could switch the log directory from the Spark UI without restarting 
the history server, it would benefit users who want to keep event logs in cloud 
storage.

Suggested implementation:
* The history server polls the directory specified in 
spark.history.fs.logDirectory when it starts.
* Add a "Switch log directory" button to the top page of the Web UI.
* The directory the user selects is stored only in memory (it is not a 
persistent configuration).
* The next time the history server starts, the directory specified in 
spark.history.fs.logDirectory is used again.



> Switch log directory from Spark UI without restarting history server
> 
>
> Key: SPARK-28069
> URL: https://issues.apache.org/jira/browse/SPARK-28069
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.3
>Reporter: Noritaka Sekiyama
>Priority: Minor
>
> The history server polls the directory specified in 
> spark.history.fs.logDirectory and displays event logs based on the data in 
> that directory.
> In the cloud, event logs might be written to different directories by 
> different clusters.
> Currently it is not easy to run a single history server for multiple clusters 
> because there is no way to switch the log directory after the history server 
> has started.
> If users could switch the log directory from the Spark UI without restarting 
> the history server, it would benefit users who want to keep event logs in 
> cloud storage.
> Suggested implementation:
> * The history server polls the directory specified in 
> spark.history.fs.logDirectory when it starts.
> * Add a "Switch log directory" button to the top page of the Web UI.
> * The directory the user selects is stored only in memory (it is not a 
> persistent configuration).
> * The next time the history server starts, the directory specified in 
> spark.history.fs.logDirectory is used again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28069) Switch log directory from Spark UI without restarting history server

2019-06-16 Thread Noritaka Sekiyama (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-28069:
--
Description: 
History server polls the directory specified in spark.history.fs.logDirectory, 
and displays the event logs based on the data included in the directory.

In the Cloud, event logs might be written into different directories from 
different clusters.
Currently it is not so easy to launch history server for multiple clusters 
because there is no ways to switch log directory after starting the history 
server.
If users can switch log directory from Spark UI without restarting history 
server, it would be beneficial for users who wants to use Cloud storage for 
event log directory.

Suggested implementation
* History server polls the directory specified in spark.history.fs.logDirectory 
when it started.
* Add a button of "Switch log directory" at the top page of Web UI.
* The directory which users selected is stored only in memory. (It won't be 
persistent configuration.)
* Next time when history server started, the directory specified in 
spark.history.fs.logDirectory is used.


  was:
The history server polls the directory specified in 
spark.history.fs.logDirectory and displays event logs based on the data in that 
directory.

In the cloud, event logs might be written to different directories by different 
clusters.
Currently it is not easy to run a single history server for multiple clusters 
because there is no way to switch the log directory after the history server 
has started.
If users could switch the log directory from the Spark UI without restarting 
the history server, it would benefit users who want to keep event logs in cloud 
storage.



> Switch log directory from Spark UI without restarting history server
> 
>
> Key: SPARK-28069
> URL: https://issues.apache.org/jira/browse/SPARK-28069
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.3
>Reporter: Noritaka Sekiyama
>Priority: Minor
>
> The history server polls the directory specified in 
> spark.history.fs.logDirectory and displays event logs based on the data in 
> that directory.
> In the cloud, event logs might be written to different directories by 
> different clusters.
> Currently it is not easy to run a single history server for multiple clusters 
> because there is no way to switch the log directory after the history server 
> has started.
> If users could switch the log directory from the Spark UI without restarting 
> the history server, it would benefit users who want to keep event logs in 
> cloud storage.
> Suggested implementation:
> * The history server polls the directory specified in 
> spark.history.fs.logDirectory when it starts.
> * Add a "Switch log directory" button to the top page of the Web UI.
> * The directory the user selects is stored only in memory (it is not a 
> persistent configuration).
> * The next time the history server starts, the directory specified in 
> spark.history.fs.logDirectory is used again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28069) Switch log directory from Spark UI without restarting history server

2019-06-16 Thread Noritaka Sekiyama (JIRA)
Noritaka Sekiyama created SPARK-28069:
-

 Summary: Switch log directory from Spark UI without restarting 
history server
 Key: SPARK-28069
 URL: https://issues.apache.org/jira/browse/SPARK-28069
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.4.3
Reporter: Noritaka Sekiyama


The history server polls the directory specified in 
spark.history.fs.logDirectory and displays event logs based on the data in that 
directory.

In the cloud, event logs might be written to different directories by different 
clusters.
Currently it is not easy to run a single history server for multiple clusters 
because there is no way to switch the log directory after the history server 
has started.
If users could switch the log directory from the Spark UI without restarting 
the history server, it would benefit users who want to keep event logs in cloud 
storage.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21514) Hive has updated with new support for S3 and InsertIntoHiveTable.scala should update also

2019-06-16 Thread Noritaka Sekiyama (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865228#comment-16865228
 ] 

Noritaka Sekiyama commented on SPARK-21514:
---

There is a problem when moving data from S3 (s3a) to HDFS: the current Hive 1.2 
implementation does not support data movement across different file systems 
(Hive 2.0 does).

If we try to implement this without a Hive version upgrade, we need to backport 
part of the implementation from Hive 2.0. When I tried, the resulting patch 
contained a very large diff.

It would be better to upgrade the Hive version first; then I can submit the 
patch without backporting.

> Hive has updated with new support for S3 and InsertIntoHiveTable.scala should 
> update also
> -
>
> Key: SPARK-21514
> URL: https://issues.apache.org/jira/browse/SPARK-21514
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Javier Ros
>Priority: Major
>
> Hive has been updated with new parameters to optimize the usage of S3; you 
> can now avoid using S3 as the staging directory via the parameters 
> hive.blobstore.supported.schemes & hive.blobstore.optimizations.enabled.
> The InsertIntoHiveTable.scala file should be updated with the same 
> improvement to match the behavior of Hive.
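
For context only, a sketch (not from the original report, and not a claim that 
Spark honors these settings; whether Spark's Hive insert path respects them is 
exactly what this issue raises): the two keys above are Hive 2.x configuration 
parameters, and they could in principle be supplied to a Hive-enabled 
SparkSession as below. The table names are hypothetical.

{code:scala}
import org.apache.spark.sql.SparkSession

object HiveBlobstoreConfigSketch {
  def main(args: Array[String]): Unit = {
    // hive.blobstore.* are Hive 2.x parameters; Spark's InsertIntoHiveTable
    // path does not necessarily honor them, which is the point of this issue.
    val spark = SparkSession.builder()
      .appName("hive-blobstore-config-sketch")
      .master("local[*]")
      .config("hive.blobstore.supported.schemes", "s3,s3a,s3n")
      .config("hive.blobstore.optimizations.enabled", "true")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical tables; the insert goes through the Hive write path.
    spark.sql(
      "INSERT OVERWRITE TABLE reporting.daily_orders " +
      "SELECT * FROM staging.orders")

    spark.stop()
  }
}
{code}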



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21514) Hive has updated with new support for S3 and InsertIntoHiveTable.scala should update also

2018-12-18 Thread Noritaka Sekiyama (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724763#comment-16724763
 ] 

Noritaka Sekiyama commented on SPARK-21514:
---

I'm working on fixing this and will update the ticket once it is done.
Please let me know if anyone is already working on this.

> Hive has updated with new support for S3 and InsertIntoHiveTable.scala should 
> update also
> -
>
> Key: SPARK-21514
> URL: https://issues.apache.org/jira/browse/SPARK-21514
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Javier Ros
>Priority: Major
>
> Hive has been updated with new parameters to optimize the usage of S3; you 
> can now avoid using S3 as the staging directory via the parameters 
> hive.blobstore.supported.schemes & hive.blobstore.optimizations.enabled.
> The InsertIntoHiveTable.scala file should be updated with the same 
> improvement to match the behavior of Hive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18432) Fix HDFS block size in programming guide

2016-11-13 Thread Noritaka Sekiyama (JIRA)
Noritaka Sekiyama created SPARK-18432:
-

 Summary: Fix HDFS block size in programming guide
 Key: SPARK-18432
 URL: https://issues.apache.org/jira/browse/SPARK-18432
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.0.1
Reporter: Noritaka Sekiyama
Priority: Minor


http://spark.apache.org/docs/latest/programming-guide.html
"By default, Spark creates one partition for each block of the file (blocks 
being 64MB by default in HDFS)"

The current default block size in HDFS is 128MB.
The default was already increased in Hadoop 2.2.0, the oldest Hadoop version 
supported by Spark. https://issues.apache.org/jira/browse/HDFS-4053

Since this explanation is confusing, I'd like to fix the value from 64MB to 
128MB.
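
As a small illustration of why the number matters (a sketch only; the HDFS path 
and the ~1 GB file size are hypothetical): textFile creates one partition per 
block by default, so the 128MB default yields roughly half as many partitions 
as the guide's 64MB figure suggests.

{code:scala}
import org.apache.spark.sql.SparkSession

object BlockSizePartitioningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("block-size-partitioning-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical path: assume a ~1 GB file stored on HDFS.
    val rdd = sc.textFile("hdfs:///data/logs/one-gb-file.txt")
    // With the 128MB default block size this yields about 8 partitions,
    // not the ~16 that the old 64MB figure would suggest.
    println(rdd.getNumPartitions)

    // More partitions can still be requested explicitly:
    val rdd16 = sc.textFile("hdfs:///data/logs/one-gb-file.txt", minPartitions = 16)
    println(rdd16.getNumPartitions)

    spark.stop()
  }
}
{code}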



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org