[jira] [Commented] (SPARK-26314) support Confluent encoded Avro in Spark Structured Streaming

2023-03-02 Thread Gustavo Martin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695828#comment-17695828
 ] 

Gustavo Martin commented on SPARK-26314:


My team just stumbled upon this problem :(

I was hoping Spark would be making use of the AVRO capabilities for finding the 
right schema associated with some event when using a Schema Registy.

 

> support Confluent encoded Avro in Spark Structured Streaming
> 
>
> Key: SPARK-26314
> URL: https://issues.apache.org/jira/browse/SPARK-26314
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: David Ahern
>Priority: Major
>
> As Avro has now been added as a first class citizen,
> [https://spark.apache.org/docs/latest/sql-data-sources-avro.html]
> please make Confluent encoded avro work out of the box with Spark Structured 
> Streaming
> as described in this link, Avro messages on Kafka encoded with confluent 
> serializer also need to be decoded with confluent.  It would be great if this 
> worked out of the box
> [https://developer.ibm.com/answers/questions/321440/ibm-iidr-cdc-db2-to-kafka.html?smartspace=blockchain]
> here are details on the Confluent encoding
> [https://www.sderosiaux.com/articles/2017/03/02/serializing-data-efficiently-with-apache-avro-and-dealing-with-a-schema-registry/#encodingdecoding-the-messages-with-the-schema-id]
> It's been a year since i worked on anything to do with Avro and Spark 
> Structured Streaming, but i had to take an approach such as this when getting 
> it to work.  This is what i  used as a reference at that time
> [https://github.com/tubular/confluent-spark-avro]
> Also, here is another link i found that someone has done in the meantime
> [https://github.com/AbsaOSS/ABRiS]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-37625) update log4j to 2.15

2021-12-13 Thread Gustavo Martin (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37625 ]


Gustavo Martin deleted comment on SPARK-37625:


was (Author: gumartinm):
What about other Spark versions?

Will there be any patch?

> update log4j to 2.15 
> -
>
> Key: SPARK-37625
> URL: https://issues.apache.org/jira/browse/SPARK-37625
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: weifeng zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37625) update log4j to 2.15

2021-12-13 Thread Gustavo Martin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458398#comment-17458398
 ] 

Gustavo Martin edited comment on SPARK-37625 at 12/13/21, 1:49 PM:
---

What about other Spark versions?

Will there be any patch?


was (Author: gumartinm):
What about other Spark versions?

> update log4j to 2.15 
> -
>
> Key: SPARK-37625
> URL: https://issues.apache.org/jira/browse/SPARK-37625
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: weifeng zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37625) update log4j to 2.15

2021-12-13 Thread Gustavo Martin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458398#comment-17458398
 ] 

Gustavo Martin commented on SPARK-37625:


What about other Spark versions?

> update log4j to 2.15 
> -
>
> Key: SPARK-37625
> URL: https://issues.apache.org/jira/browse/SPARK-37625
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: weifeng zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

2021-11-01 Thread Gustavo Martin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436740#comment-17436740
 ] 

Gustavo Martin edited comment on SPARK-23977 at 11/1/21, 10:48 AM:
---

Thank you very much [~ste...@apache.org] for your explanations.

I am experiencing the same problems as [~danzhi] and your comments helped me a 
lot.

Sad magic committer does not work with dynamic partition overwrite because it 
has an amazing performance when writing loads of JSON partitioned data.


was (Author: gumartinm):
Thank you very much [~ste...@apache.org] for your explanations.

I am experiencing the same problems as [~danzhi] and your comments helped me a 
lot.

Sad magic committer does not work with partition overwrite because it has an 
amazing performance when writing loads of JSON partitioned data.

> Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
> ---
>
> Key: SPARK-23977
> URL: https://issues.apache.org/jira/browse/SPARK-23977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.0.0
>
>
> Hadoop 3.1 adds a mechanism for job-specific and store-specific committers 
> (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, 
> HADOOP-13786
> These committers deliver high-performance output of MR and spark jobs to S3, 
> and offer the key semantics which Spark depends on: no visible output until 
> job commit, a failure of a task at an stage, including partway through task 
> commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues:
> * Awful performance because files are copied by rename
> * FileOutputFormat v1: weak task commit failure recovery semantics as the 
> (v1) expectation: "directory renames are atomic" doesn't hold.
> * S3 metadata eventual consistency can cause rename to miss files or fail 
> entirely (SPARK-15849)
> Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of 
> the commit semantics w.r.t observability of or recovery from task commit 
> failure, on any filesystem.
> The S3A committers address these by way of uploading all data to the 
> destination through multipart uploads, uploads which are only completed in 
> job commit.
> The new {{PathOutputCommitter}} factory mechanism allows applications to work 
> with the S3A committers and any other, by adding a plugin mechanism into the 
> MRv2 FileOutputFormat class, where it job config and filesystem configuration 
> options can dynamically choose the output committer.
> Spark can use these with some binding classes to 
> # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 
> classes and {{PathOutputCommitterFactory}} to create the committers.
> # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}}
> to wire up Parquet output even when code requires the committer to be a 
> subclass of {{ParquetOutputCommitter}}
> This patch builds on SPARK-23807 for setting up the dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

2021-11-01 Thread Gustavo Martin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436740#comment-17436740
 ] 

Gustavo Martin edited comment on SPARK-23977 at 11/1/21, 10:35 AM:
---

Thank you very much [~ste...@apache.org] for your explanations.

I am experiencing the same problems as [~danzhi] and your comments helped me a 
lot.

Sad magic committer does not work with partition overwrite because it has an 
amazing performance when writing loads of JSON partitioned data.


was (Author: gumartinm):
Thank you very much [~ste...@apache.org] for your explanations.

I am experiencing the same problems as [~danzhi] and your comments helped me a 
lot.

Sad magic committer does not work with partition overwrite because it has an 
amazing performance when writing loads of partitioned data.

> Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
> ---
>
> Key: SPARK-23977
> URL: https://issues.apache.org/jira/browse/SPARK-23977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.0.0
>
>
> Hadoop 3.1 adds a mechanism for job-specific and store-specific committers 
> (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, 
> HADOOP-13786
> These committers deliver high-performance output of MR and spark jobs to S3, 
> and offer the key semantics which Spark depends on: no visible output until 
> job commit, a failure of a task at an stage, including partway through task 
> commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues:
> * Awful performance because files are copied by rename
> * FileOutputFormat v1: weak task commit failure recovery semantics as the 
> (v1) expectation: "directory renames are atomic" doesn't hold.
> * S3 metadata eventual consistency can cause rename to miss files or fail 
> entirely (SPARK-15849)
> Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of 
> the commit semantics w.r.t observability of or recovery from task commit 
> failure, on any filesystem.
> The S3A committers address these by way of uploading all data to the 
> destination through multipart uploads, uploads which are only completed in 
> job commit.
> The new {{PathOutputCommitter}} factory mechanism allows applications to work 
> with the S3A committers and any other, by adding a plugin mechanism into the 
> MRv2 FileOutputFormat class, where it job config and filesystem configuration 
> options can dynamically choose the output committer.
> Spark can use these with some binding classes to 
> # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 
> classes and {{PathOutputCommitterFactory}} to create the committers.
> # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}}
> to wire up Parquet output even when code requires the committer to be a 
> subclass of {{ParquetOutputCommitter}}
> This patch builds on SPARK-23807 for setting up the dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

2021-11-01 Thread Gustavo Martin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436740#comment-17436740
 ] 

Gustavo Martin edited comment on SPARK-23977 at 11/1/21, 10:34 AM:
---

Thank you very much [~ste...@apache.org] for your explanations.

I am experiencing the same problems as [~danzhi] and your comments helped me a 
lot.

Sad magic committer does not work with partition overwrite because it has an 
amazing performance when writing loads of partitioned data.


was (Author: gumartinm):
Thank you ver much [~ste...@apache.org] for your explanations.

I am experiencing the same problems as [~danzhi] and your comments helped me a 
lot.

Sad magic committer does not work with partition overwrite because it has an 
amazing performance when writing loads of partitioned data.

> Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
> ---
>
> Key: SPARK-23977
> URL: https://issues.apache.org/jira/browse/SPARK-23977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.0.0
>
>
> Hadoop 3.1 adds a mechanism for job-specific and store-specific committers 
> (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, 
> HADOOP-13786
> These committers deliver high-performance output of MR and spark jobs to S3, 
> and offer the key semantics which Spark depends on: no visible output until 
> job commit, a failure of a task at an stage, including partway through task 
> commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues:
> * Awful performance because files are copied by rename
> * FileOutputFormat v1: weak task commit failure recovery semantics as the 
> (v1) expectation: "directory renames are atomic" doesn't hold.
> * S3 metadata eventual consistency can cause rename to miss files or fail 
> entirely (SPARK-15849)
> Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of 
> the commit semantics w.r.t observability of or recovery from task commit 
> failure, on any filesystem.
> The S3A committers address these by way of uploading all data to the 
> destination through multipart uploads, uploads which are only completed in 
> job commit.
> The new {{PathOutputCommitter}} factory mechanism allows applications to work 
> with the S3A committers and any other, by adding a plugin mechanism into the 
> MRv2 FileOutputFormat class, where it job config and filesystem configuration 
> options can dynamically choose the output committer.
> Spark can use these with some binding classes to 
> # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 
> classes and {{PathOutputCommitterFactory}} to create the committers.
> # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}}
> to wire up Parquet output even when code requires the committer to be a 
> subclass of {{ParquetOutputCommitter}}
> This patch builds on SPARK-23807 for setting up the dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

2021-11-01 Thread Gustavo Martin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436740#comment-17436740
 ] 

Gustavo Martin commented on SPARK-23977:


Thank you ver much [~ste...@apache.org] for your explanations.

I am experiencing the same problems as [~danzhi] and your comments helped me a 
lot.

Sad magic committer does not work with partition overwrite because it has an 
amazing performance when writing loads of partitioned data.

> Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
> ---
>
> Key: SPARK-23977
> URL: https://issues.apache.org/jira/browse/SPARK-23977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.0.0
>
>
> Hadoop 3.1 adds a mechanism for job-specific and store-specific committers 
> (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, 
> HADOOP-13786
> These committers deliver high-performance output of MR and spark jobs to S3, 
> and offer the key semantics which Spark depends on: no visible output until 
> job commit, a failure of a task at an stage, including partway through task 
> commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues:
> * Awful performance because files are copied by rename
> * FileOutputFormat v1: weak task commit failure recovery semantics as the 
> (v1) expectation: "directory renames are atomic" doesn't hold.
> * S3 metadata eventual consistency can cause rename to miss files or fail 
> entirely (SPARK-15849)
> Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of 
> the commit semantics w.r.t observability of or recovery from task commit 
> failure, on any filesystem.
> The S3A committers address these by way of uploading all data to the 
> destination through multipart uploads, uploads which are only completed in 
> job commit.
> The new {{PathOutputCommitter}} factory mechanism allows applications to work 
> with the S3A committers and any other, by adding a plugin mechanism into the 
> MRv2 FileOutputFormat class, where it job config and filesystem configuration 
> options can dynamically choose the output committer.
> Spark can use these with some binding classes to 
> # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 
> classes and {{PathOutputCommitterFactory}} to create the committers.
> # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}}
> to wire up Parquet output even when code requires the committer to be a 
> subclass of {{ParquetOutputCommitter}}
> This patch builds on SPARK-23807 for setting up the dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23086) Spark SQL cannot support high concurrency for lock in HiveMetastoreCatalog

2020-12-09 Thread Gustavo Martin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246641#comment-17246641
 ] 

Gustavo Martin edited comment on SPARK-23086 at 12/9/20, 4:13 PM:
--

AFAIK the same synchronization point exists in Spark 2.4.X and Spark 3.X

It has not been resolved :( 



was (Author: gumartinm):
AFAIK the same synchronization point exists in Spark 2.4.X and Spark 3.X


> Spark SQL cannot support high concurrency for lock in HiveMetastoreCatalog
> --
>
> Key: SPARK-23086
> URL: https://issues.apache.org/jira/browse/SPARK-23086
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: * Spark 2.2.1
>Reporter: pin_zhang
>Priority: Major
>  Labels: bulk-closed
>
> * Hive metastore is mysql
> * Set hive.server2.thrift.max.worker.threads=500
> create table test (id string ) partitioned by (index int) stored as  
> parquet;
> insert into test  partition (index=1) values('id1');
>  * 100 Clients run SQL“select * from table” on table
>  * Many clients (97%) blocked at HiveExternalCatalog.withClient
>  * Is synchronized expected when only run query against tables?   
> "pool-21-thread-65" #1178 prio=5 os_prio=0 tid=0x2aaac8e06800 nid=0x1e70 
> waiting for monitor entry [0x4e19a000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>   - waiting to lock <0xc06a3ba8> (a 
> org.apache.spark.sql.hive.HiveExternalCatalog)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:674)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:667)
>   - locked <0xc41ab748> (a 
> org.apache.spark.sql.hive.HiveSessionCatalog)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:646)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:601)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:631)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:624)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:59)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:59)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:59)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:624)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:570)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
>   - locked <0xff491c48> (a 
> org.apache.spark.sql.execution.QueryExecution)
>   at 
> 

[jira] [Commented] (SPARK-23086) Spark SQL cannot support high concurrency for lock in HiveMetastoreCatalog

2020-12-09 Thread Gustavo Martin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246641#comment-17246641
 ] 

Gustavo Martin commented on SPARK-23086:


AFAIK the same synchronization point exists in Spark 2.4.X and Spark 3.X


> Spark SQL cannot support high concurrency for lock in HiveMetastoreCatalog
> --
>
> Key: SPARK-23086
> URL: https://issues.apache.org/jira/browse/SPARK-23086
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: * Spark 2.2.1
>Reporter: pin_zhang
>Priority: Major
>  Labels: bulk-closed
>
> * Hive metastore is mysql
> * Set hive.server2.thrift.max.worker.threads=500
> create table test (id string ) partitioned by (index int) stored as  
> parquet;
> insert into test  partition (index=1) values('id1');
>  * 100 Clients run SQL“select * from table” on table
>  * Many clients (97%) blocked at HiveExternalCatalog.withClient
>  * Is synchronized expected when only run query against tables?   
> "pool-21-thread-65" #1178 prio=5 os_prio=0 tid=0x2aaac8e06800 nid=0x1e70 
> waiting for monitor entry [0x4e19a000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>   - waiting to lock <0xc06a3ba8> (a 
> org.apache.spark.sql.hive.HiveExternalCatalog)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:674)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:667)
>   - locked <0xc41ab748> (a 
> org.apache.spark.sql.hive.HiveSessionCatalog)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:646)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:601)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:631)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:624)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:59)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:59)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:59)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:624)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:570)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
>   - locked <0xff491c48> (a 
> org.apache.spark.sql.execution.QueryExecution)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
>   at 

[jira] [Issue Comment Deleted] (SPARK-23086) Spark SQL cannot support high concurrency for lock in HiveMetastoreCatalog

2020-12-09 Thread Gustavo Martin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gustavo Martin updated SPARK-23086:
---
Comment: was deleted

(was: Why was this ticket closed?

AFAIK the same synchronization point exists in Spark 2.4.X.

Has it been resolved in Spark 3.X?

)

> Spark SQL cannot support high concurrency for lock in HiveMetastoreCatalog
> --
>
> Key: SPARK-23086
> URL: https://issues.apache.org/jira/browse/SPARK-23086
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: * Spark 2.2.1
>Reporter: pin_zhang
>Priority: Major
>  Labels: bulk-closed
>
> * Hive metastore is mysql
> * Set hive.server2.thrift.max.worker.threads=500
> create table test (id string ) partitioned by (index int) stored as  
> parquet;
> insert into test  partition (index=1) values('id1');
>  * 100 Clients run SQL“select * from table” on table
>  * Many clients (97%) blocked at HiveExternalCatalog.withClient
>  * Is synchronized expected when only run query against tables?   
> "pool-21-thread-65" #1178 prio=5 os_prio=0 tid=0x2aaac8e06800 nid=0x1e70 
> waiting for monitor entry [0x4e19a000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>   - waiting to lock <0xc06a3ba8> (a 
> org.apache.spark.sql.hive.HiveExternalCatalog)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:674)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:667)
>   - locked <0xc41ab748> (a 
> org.apache.spark.sql.hive.HiveSessionCatalog)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:646)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:601)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:631)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:624)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:59)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:59)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:59)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:624)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:570)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
>   - locked <0xff491c48> (a 
> org.apache.spark.sql.execution.QueryExecution)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
>   at 
> 

[jira] [Commented] (SPARK-23086) Spark SQL cannot support high concurrency for lock in HiveMetastoreCatalog

2020-12-09 Thread Gustavo Martin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246626#comment-17246626
 ] 

Gustavo Martin commented on SPARK-23086:


Why was this ticket closed?

AFAIK the same synchronization point exists in Spark 2.4.X.

Has it been resolved in Spark 3.X?



> Spark SQL cannot support high concurrency for lock in HiveMetastoreCatalog
> --
>
> Key: SPARK-23086
> URL: https://issues.apache.org/jira/browse/SPARK-23086
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: * Spark 2.2.1
>Reporter: pin_zhang
>Priority: Major
>  Labels: bulk-closed
>
> * Hive metastore is mysql
> * Set hive.server2.thrift.max.worker.threads=500
> create table test (id string ) partitioned by (index int) stored as  
> parquet;
> insert into test  partition (index=1) values('id1');
>  * 100 Clients run SQL“select * from table” on table
>  * Many clients (97%) blocked at HiveExternalCatalog.withClient
>  * Is synchronized expected when only run query against tables?   
> "pool-21-thread-65" #1178 prio=5 os_prio=0 tid=0x2aaac8e06800 nid=0x1e70 
> waiting for monitor entry [0x4e19a000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>   - waiting to lock <0xc06a3ba8> (a 
> org.apache.spark.sql.hive.HiveExternalCatalog)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:674)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:667)
>   - locked <0xc41ab748> (a 
> org.apache.spark.sql.hive.HiveSessionCatalog)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:646)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:601)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:631)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:624)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:59)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:59)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:59)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:624)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:570)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
>   - locked <0xff491c48> (a 
> org.apache.spark.sql.execution.QueryExecution)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
>   at 
> 

[jira] [Updated] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect

2020-11-16 Thread Gustavo Martin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gustavo Martin updated SPARK-33463:
---
Fix Version/s: (was: 3.0.1)
   3.1.0

> Spark Thrift Server, keep Job Id when using incremental collect
> ---
>
> Key: SPARK-33463
> URL: https://issues.apache.org/jira/browse/SPARK-33463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.7, 3.0.1
>Reporter: Gustavo Martin
>Priority: Major
> Fix For: 3.1.0
>
>
> When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost 
> and tracing queries in Spark Thrift Server ends up being too complicated.
> By fixing the Job Id, queries and Spark jobs are unequivocally related one to 
> each other.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect

2020-11-16 Thread Gustavo Martin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233124#comment-17233124
 ] 

Gustavo Martin commented on SPARK-33463:


PR: https://github.com/apache/spark/pull/30390

> Spark Thrift Server, keep Job Id when using incremental collect
> ---
>
> Key: SPARK-33463
> URL: https://issues.apache.org/jira/browse/SPARK-33463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.7, 3.0.1
>Reporter: Gustavo Martin
>Priority: Major
> Fix For: 3.0.1
>
>
> When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost 
> and tracing queries in Spark Thrift Server ends up being too complicated.
> By fixing the Job Id, queries and Spark jobs are unequivocally related one to 
> each other.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33463) Spark Thrift Server, keep Job Id when using incremental collect

2020-11-16 Thread Gustavo Martin (Jira)
Gustavo Martin created SPARK-33463:
--

 Summary: Spark Thrift Server, keep Job Id when using incremental 
collect
 Key: SPARK-33463
 URL: https://issues.apache.org/jira/browse/SPARK-33463
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 2.4.7, 2.3.4
Reporter: Gustavo Martin
 Fix For: 3.0.1


When enabling *spark.sql.thriftServer.incrementalCollect* Job Ids get lost and 
tracing queries in Spark Thrift Server ends up being too complicated.

By fixing the Job Id, queries and Spark jobs are unequivocally related one to 
each other.






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org