[jira] [Updated] (SPARK-45800) FileOutputCommiter race condition

2023-11-05 Thread Sayed Mohammad Hossein Torabi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sayed Mohammad Hossein Torabi updated SPARK-45800:
--
Attachment: spark.log

> FileOutputCommiter race condition
> -
>
> Key: SPARK-45800
> URL: https://issues.apache.org/jira/browse/SPARK-45800
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3
>Reporter: Sayed Mohammad Hossein Torabi
>Priority: Major
> Attachments: main.py, spark.log
>
>
> A race condition occurs when multiple Spark jobs write to the same table
> simultaneously on the *FileOutputCommitter* side.
> A code example and log output are attached to the ticket.






[jira] [Updated] (SPARK-45800) FileOutputCommiter race condition

2023-11-05 Thread Sayed Mohammad Hossein Torabi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sayed Mohammad Hossein Torabi updated SPARK-45800:
--
Description: 
A race condition occurs when multiple Spark jobs write to the same table simultaneously 
on the *FileOutputCommitter* side.

A code example and log output are attached to the ticket.

  was:
A race condition occurs when multiple Spark jobs write to the same table simultaneously 
on the *FileOutputCommitter* side.

Here is an example:
{code:python}
import tempfile
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, rand
from pyspark.sql.types import IntegerType

TABLE_NAME = "test_table"
table_directory = tempfile.TemporaryDirectory()

spark = (
    SparkSession.builder.master("local[*]")
    .config("spark.sql.shuffle.partitions", 1)
    .config("spark.driver.host", "localhost")
    .config("spark.sql.session.timeZone", "Europe/Amsterdam")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .enableHiveSupport()
    .getOrCreate()
)
spark.sql(
    f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS {TABLE_NAME} (
        value INT
    )
    PARTITIONED BY (
        date STRING
    )
    STORED AS PARQUET
    LOCATION '{table_directory.name}'
    TBLPROPERTIES (
        "parquet.compression"="SNAPPY"
    )
    """
)

# Three concurrent jobs each overwrite a different partition of the same table.
with ThreadPoolExecutor(max_workers=3) as executor:
    executor.map(
        lambda date: (
            spark.range(1e6)
            .withColumn("value", (rand(seed=42) * 100).cast(IntegerType()))
            .withColumn("date", lit(date))
            .select("value", "date")
            .write.insertInto(TABLE_NAME, True)
        ),
        [
            "1970-01-01",
            "1970-01-02",
            "1970-01-03",
        ],
    )

spark.sql(f"DROP TABLE {TABLE_NAME}")
table_directory.cleanup()
spark.stop()
{code}
Log:
{code:java}
23/11/06 00:12:44 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does 
not exist
23/11/06 00:12:44 WARN HiveConf: HiveConf of name hive.stats.retries.wait does 
not exist
23/11/06 00:12:45 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 2.3.0
23/11/06 00:12:45 WARN ObjectStore: setMetaStoreSchemaVersion called but 
recording version is disabled: version = 2.3.0, comment = Set by MetaStore 
blcksrx@127.0.0.1
23/11/06 00:12:45 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
23/11/06 00:12:45 WARN HiveConf: HiveConf of name 
hive.internal.ss.authz.settings.applied.marker does not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does 
not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.retries.wait does 
not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name 
hive.internal.ss.authz.settings.applied.marker does not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does 
not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.retries.wait does 
not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name 
hive.internal.ss.authz.settings.applied.marker does not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does 
not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.retries.wait does 
not exist
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 75.08% for 9 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 67.58% for 10 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 75.08% for 9 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 

[jira] [Created] (SPARK-45800) FileOutputCommiter race condition

2023-11-05 Thread Sayed Mohammad Hossein Torabi (Jira)
Sayed Mohammad Hossein Torabi created SPARK-45800:
-

 Summary: FileOutputCommiter race condition
 Key: SPARK-45800
 URL: https://issues.apache.org/jira/browse/SPARK-45800
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.3
Reporter: Sayed Mohammad Hossein Torabi
 Attachments: main.py

A race condition occurs when multiple Spark jobs write to the same table simultaneously 
on the *FileOutputCommitter* side.

Here is an example:
{code:python}
import tempfile
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, rand
from pyspark.sql.types import IntegerType

TABLE_NAME = "test_table"
table_directory = tempfile.TemporaryDirectory()

spark = (
    SparkSession.builder.master("local[*]")
    .config("spark.sql.shuffle.partitions", 1)
    .config("spark.driver.host", "localhost")
    .config("spark.sql.session.timeZone", "Europe/Amsterdam")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .enableHiveSupport()
    .getOrCreate()
)
spark.sql(
    f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS {TABLE_NAME} (
        value INT
    )
    PARTITIONED BY (
        date STRING
    )
    STORED AS PARQUET
    LOCATION '{table_directory.name}'
    TBLPROPERTIES (
        "parquet.compression"="SNAPPY"
    )
    """
)

# Three concurrent jobs each overwrite a different partition of the same table.
with ThreadPoolExecutor(max_workers=3) as executor:
    executor.map(
        lambda date: (
            spark.range(1e6)
            .withColumn("value", (rand(seed=42) * 100).cast(IntegerType()))
            .withColumn("date", lit(date))
            .select("value", "date")
            .write.insertInto(TABLE_NAME, True)
        ),
        [
            "1970-01-01",
            "1970-01-02",
            "1970-01-03",
        ],
    )

spark.sql(f"DROP TABLE {TABLE_NAME}")
table_directory.cleanup()
spark.stop()
{code}
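
The race appears to come from the three concurrent insertInto jobs all committing through the same staging area under the table location, so one job's commit/cleanup can interfere with files another job is still writing. A minimal workaround sketch (not part of the original report, and it sidesteps the race rather than fixing the committer) is to serialize the inserts; it assumes the `spark` session and TABLE_NAME defined in the repro above:
{code:python}
from pyspark.sql.functions import lit, rand
from pyspark.sql.types import IntegerType

# Run the three overwrites one after another so only a single
# FileOutputCommitter job uses the table's staging directory at a time.
for date in ["1970-01-01", "1970-01-02", "1970-01-03"]:
    (
        spark.range(1_000_000)
        .withColumn("value", (rand(seed=42) * 100).cast(IntegerType()))
        .withColumn("date", lit(date))
        .select("value", "date")
        .write.insertInto(TABLE_NAME, overwrite=True)
    )
{code}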
Log:
{code:java}
23/11/06 00:12:44 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does 
not exist
23/11/06 00:12:44 WARN HiveConf: HiveConf of name hive.stats.retries.wait does 
not exist
23/11/06 00:12:45 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 2.3.0
23/11/06 00:12:45 WARN ObjectStore: setMetaStoreSchemaVersion called but 
recording version is disabled: version = 2.3.0, comment = Set by MetaStore 
blcksrx@127.0.0.1
23/11/06 00:12:45 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
23/11/06 00:12:45 WARN HiveConf: HiveConf of name 
hive.internal.ss.authz.settings.applied.marker does not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does 
not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.retries.wait does 
not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name 
hive.internal.ss.authz.settings.applied.marker does not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does 
not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.retries.wait does 
not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name 
hive.internal.ss.authz.settings.applied.marker does not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does 
not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.retries.wait does 
not exist
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 75.08% for 9 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 67.58% for 10 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 75.08% for 9 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
(906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
23/11/06 00:12:47 WARN MemoryManager

[jira] [Updated] (SPARK-45800) FileOutputCommiter race condition

2023-11-05 Thread Sayed Mohammad Hossein Torabi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sayed Mohammad Hossein Torabi updated SPARK-45800:
--
Attachment: main.py

> FileOutputCommiter race condition
> -
>
> Key: SPARK-45800
> URL: https://issues.apache.org/jira/browse/SPARK-45800
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3
>Reporter: Sayed Mohammad Hossein Torabi
>Priority: Major
> Attachments: main.py
>
>
> A race condition occurs when multiple Spark jobs write to the same table 
> simultaneously on the *FileOutputCommitter* side.
> Here is an example:
> {code:python}
> import tempfile
> from concurrent.futures import ThreadPoolExecutor
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import lit, rand
> from pyspark.sql.types import IntegerType
> TABLE_NAME = "test_table"
> table_directory = tempfile.TemporaryDirectory()
> spark = (
>     SparkSession.builder.master("local[*]")
>     .config("spark.sql.shuffle.partitions", 1)
>     .config("spark.driver.host", "localhost")
>     .config("spark.sql.session.timeZone", "Europe/Amsterdam")
>     .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
>     .enableHiveSupport()
>     .getOrCreate()
> )
> spark.sql(
>     f"""
>     CREATE EXTERNAL TABLE IF NOT EXISTS {TABLE_NAME} (
>         value INT
>     )
>     PARTITIONED BY (
>         date STRING
>     )
>     STORED AS PARQUET
>     LOCATION '{table_directory.name}'
>     TBLPROPERTIES (
>         "parquet.compression"="SNAPPY"
>     )
>     """
> )
> # Three concurrent jobs each overwrite a different partition of the same table.
> with ThreadPoolExecutor(max_workers=3) as executor:
>     executor.map(
>         lambda date: (
>             spark.range(1e6)
>             .withColumn("value", (rand(seed=42) * 100).cast(IntegerType()))
>             .withColumn("date", lit(date))
>             .select("value", "date")
>             .write.insertInto(TABLE_NAME, True)
>         ),
>         [
>             "1970-01-01",
>             "1970-01-02",
>             "1970-01-03",
>         ],
>     )
> spark.sql(f"DROP TABLE {TABLE_NAME}")
> table_directory.cleanup()
> spark.stop(){code}
> Log:
> {code:java}
> 23/11/06 00:12:44 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does 
> not exist
> 23/11/06 00:12:44 WARN HiveConf: HiveConf of name hive.stats.retries.wait 
> does not exist
> 23/11/06 00:12:45 WARN ObjectStore: Version information not found in 
> metastore. hive.metastore.schema.verification is not enabled so recording the 
> schema version 2.3.0
> 23/11/06 00:12:45 WARN ObjectStore: setMetaStoreSchemaVersion called but 
> recording version is disabled: version = 2.3.0, comment = Set by MetaStore 
> blcksrx@127.0.0.1
> 23/11/06 00:12:45 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> 23/11/06 00:12:45 WARN HiveConf: HiveConf of name 
> hive.internal.ss.authz.settings.applied.marker does not exist
> 23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout 
> does not exist
> 23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.retries.wait 
> does not exist
> 23/11/06 00:12:45 WARN HiveConf: HiveConf of name 
> hive.internal.ss.authz.settings.applied.marker does not exist
> 23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout 
> does not exist
> 23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.retries.wait 
> does not exist
> 23/11/06 00:12:45 WARN HiveConf: HiveConf of name 
> hive.internal.ss.authz.settings.applied.marker does not exist
> 23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout 
> does not exist
> 23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.retries.wait 
> does not exist
> 23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
> (906,992,014 bytes) of heap memory
> Scaling row group sizes to 96.54% for 7 writers
> 23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
> (906,992,014 bytes) of heap memory
> Scaling row group sizes to 84.47% for 8 writers
> 23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
> (906,992,014 bytes) of heap memory
> Scaling row group sizes to 75.08% for 9 writers
> 23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
> (906,992,014 bytes) of heap memory
> Scaling row group sizes to 67.58% for 10 writers
> 23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
> (906,992,014 bytes) of heap memory
> Scaling row group sizes to 75.08% for 9 writers
> 23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
> (906,992,014 bytes) of heap memory
> Scaling row group sizes to 84.47% for 8 writers
> 23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% 
> (906,992,014 b

[jira] [Created] (SPARK-45422) Update Partition Stats

2023-10-05 Thread Sayed Mohammad Hossein Torabi (Jira)
Sayed Mohammad Hossein Torabi created SPARK-45422:
-

 Summary: Update Partition Stats
 Key: SPARK-45422
 URL: https://issues.apache.org/jira/browse/SPARK-45422
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Sayed Mohammad Hossein Torabi


Spark introduced *spark.sql.statistics.size.autoUpdate.enabled*, which is a good 
feature for small tables or tables that do not contain many files.
It would also be great to introduce an option that calculates statistics at the 
partition level. In other words, instead of altering/updating the statistics of the 
whole table, it would only gather the statistics of the partitions that Spark 
writes to.
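
As a rough illustration (an editorial sketch, not from the ticket) of what such an option would automate: partition-level statistics can already be refreshed manually with ANALYZE TABLE ... PARTITION after each write, assuming an active `spark` session and a table partitioned by `date` like the one in SPARK-45800:
{code:python}
# Refresh statistics for just the partition that was written, instead of
# re-scanning the whole table. Table and partition values are illustrative.
spark.sql(
    "ANALYZE TABLE test_table PARTITION (date = '1970-01-01') COMPUTE STATISTICS"
)
{code}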






[jira] [Created] (SPARK-43325) regexp_extract_all DataFrame API

2023-04-29 Thread Sayed Mohammad Hossein Torabi (Jira)
Sayed Mohammad Hossein Torabi created SPARK-43325:
-

 Summary: regexp_extract_all DataFrame API 
 Key: SPARK-43325
 URL: https://issues.apache.org/jira/browse/SPARK-43325
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.4.0
Reporter: Sayed Mohammad Hossein Torabi
 Fix For: 3.4.1


Implement the `regexp_extract_all` DataFrame API and make it available for 
both Scala/Java and Python Spark.
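
Until such an API lands, a sketch of the usual workaround (assuming Spark 3.1+, where the SQL function regexp_extract_all already exists) is to reach it through expr():
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("100-200, 300-400",)], ["s"])

# regexp_extract_all is only reachable via SQL/expr() here; the ticket asks for
# a first-class DataFrame function so this detour is no longer needed.
df.select(
    expr(r"regexp_extract_all(s, '(\\d+)-(\\d+)', 1)").alias("first_groups")
).show()  # ["100", "300"]
{code}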






[jira] [Created] (SPARK-38263) StructType explode

2022-02-20 Thread Sayed Mohammad Hossein Torabi (Jira)
Sayed Mohammad Hossein Torabi created SPARK-38263:
-

 Summary: StructType explode
 Key: SPARK-38263
 URL: https://issues.apache.org/jira/browse/SPARK-38263
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.1
Reporter: Sayed Mohammad Hossein Torabi


Currently the explode function only supports Array and Map data types, but not 
StructType. Supporting StructType would help Spark users flatten their datasets, 
which would be helpful when dealing with semi-structured and unstructured data.
The idea is to support StructType first, and also to add `prefix` and `postfix` 
options to it.
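
For comparison, a sketch (not from the ticket) of how a StructType column is usually flattened today; the aliases emulate the proposed `prefix` behaviour:
{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, {"city": "Berlin", "zip": "10115"})],
    "id INT, address STRUCT<city: STRING, zip: STRING>",
)

# df.select("id", "address.*") flattens without any prefix; the comprehension
# below prefixes each field with the struct column's name, as the ticket suggests.
flat = df.select(
    "id",
    *[
        F.col(f"address.{field.name}").alias(f"address_{field.name}")
        for field in df.schema["address"].dataType.fields
    ],
)
flat.show()
{code}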






[jira] [Updated] (SPARK-34652) Support SchemaRegistry in from_avro method

2021-12-21 Thread Sayed Mohammad Hossein Torabi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sayed Mohammad Hossein Torabi updated SPARK-34652:
--
Component/s: SQL
 (was: PySpark)
 (was: Structured Streaming)

> Support SchemaRegistry in from_avro method
> --
>
> Key: SPARK-34652
> URL: https://issues.apache.org/jira/browse/SPARK-34652
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Sayed Mohammad Hossein Torabi
>Priority: Minor
>
> Confluent Schema Registry provides a serving layer for your metadata. It 
> provides a RESTful interface for storing and retrieving your Avro®, JSON 
> Schema, and Protobuf schemas. 
> It would be nice to implement a new method that uses the Schema Registry instead 
> of the raw schema.
> In addition, Databricks has already implemented this function; see this link:
> https://docs.databricks.com/spark/latest/structured-streaming/avro-dataframe.html#example-with-schema-registry
> It may be a simple and short method, but it would be really useful.






[jira] [Updated] (SPARK-34652) Support SchemaRegistry in from_avro method

2021-12-21 Thread Sayed Mohammad Hossein Torabi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sayed Mohammad Hossein Torabi updated SPARK-34652:
--
Affects Version/s: 3.3.0
   (was: 3.1.1)

> Support SchemaRegistry in from_avro method
> --
>
> Key: SPARK-34652
> URL: https://issues.apache.org/jira/browse/SPARK-34652
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Sayed Mohammad Hossein Torabi
>Priority: Minor
>
> Confluent Schema Registry provides a serving layer for your metadata. It 
> provides a RESTful interface for storing and retrieving your Avro®, JSON 
> Schema, and Protobuf schemas. 
> It would be nice to implement a new method that uses the Schema Registry instead 
> of the raw schema.
> In addition, Databricks has already implemented this function; see this link:
> https://docs.databricks.com/spark/latest/structured-streaming/avro-dataframe.html#example-with-schema-registry
> It may be a simple and short method, but it would be really useful.






[jira] [Created] (SPARK-34652) Support SchemaRegistry in from_avro method

2021-03-07 Thread Sayed Mohammad Hossein Torabi (Jira)
Sayed Mohammad Hossein Torabi created SPARK-34652:
-

 Summary: Support SchemaRegistry in from_avro method
 Key: SPARK-34652
 URL: https://issues.apache.org/jira/browse/SPARK-34652
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Structured Streaming
Affects Versions: 3.1.1
Reporter: Sayed Mohammad Hossein Torabi


Confluent Schema Registry provides a serving layer for your metadata. It 
provides a RESTful interface for storing and retrieving your Avro®, JSON 
Schema, and Protobuf schemas. 
It would be nice to implement a new method that uses the Schema Registry instead of 
the raw schema.
In addition, Databricks has already implemented this function; see this link:
https://docs.databricks.com/spark/latest/structured-streaming/avro-dataframe.html#example-with-schema-registry

It may be a simple and short method, but it would be really useful.
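
For reference, a sketch (not from the ticket) of what from_avro requires today: the Avro schema is passed by hand as a JSON string, which is what a Schema Registry lookup would replace. It assumes the spark-avro package is on the classpath and uses an illustrative Kafka topic; note also that Confluent-framed records carry a 5-byte magic/schema-id prefix that plain from_avro does not strip.
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.avro.functions import from_avro

spark = SparkSession.builder.getOrCreate()

# Hand-written schema; the proposal is to resolve this from a Schema Registry
# subject instead of pasting the raw JSON here.
user_schema = """
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age", "type": "int"}]}
"""

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # illustrative broker
    .option("subscribe", "users")                          # illustrative topic
    .load()
)

decoded = stream.select(from_avro(stream.value, user_schema).alias("user"))
{code}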






[jira] [Created] (SPARK-33508) Why can't the user assign another `key.deserializer`, and why must it always be `ByteArrayDeserializer`?

2020-11-21 Thread Sayed Mohammad Hossein Torabi (Jira)
Sayed Mohammad Hossein Torabi created SPARK-33508:
-

 Summary: Why can't the user assign another `key.deserializer`, and 
why must it always be `ByteArrayDeserializer`?
 Key: SPARK-33508
 URL: https://issues.apache.org/jira/browse/SPARK-33508
 Project: Spark
  Issue Type: Question
  Components: Structured Streaming
Affects Versions: 3.0.1
Reporter: Sayed Mohammad Hossein Torabi
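
For context, an editorial note (the ticket itself carries no description): the Structured Streaming Kafka source always reads keys and values with ByteArrayDeserializer and exposes them as binary columns, and deserialization is expected to happen afterwards inside the query. A minimal sketch, with an illustrative broker and topic:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # illustrative broker
    .option("subscribe", "events")                         # illustrative topic
    .load()
)

# key and value arrive as BINARY columns; deserialize them in the DataFrame
# (CAST, from_json, from_avro, ...) rather than via key.deserializer.
parsed = df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
{code}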









[jira] [Created] (SPARK-32359) Implement max_error metric evaluator for spark regression mllib

2020-07-19 Thread Sayed Mohammad Hossein Torabi (Jira)
Sayed Mohammad Hossein Torabi created SPARK-32359:
-

 Summary: Implement max_error metric evaluator for spark regression 
mllib
 Key: SPARK-32359
 URL: https://issues.apache.org/jira/browse/SPARK-32359
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 3.0.0
Reporter: Sayed Mohammad Hossein Torabi
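
For context, an editorial sketch (the ticket carries no description): max_error is the largest absolute difference between label and prediction, and until RegressionEvaluator offers it as a metric it can be computed with a plain aggregation. The predictions below are hard-coded only to keep the sketch self-contained; they would normally come from a fitted regression model's transform().
{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

preds = spark.createDataFrame(
    [(3.0, 2.5), (-0.5, 0.0), (2.0, 2.0), (7.0, 8.0)],
    ["label", "prediction"],
)

# max_error = max(|label - prediction|)
max_error = preds.agg(
    F.max(F.abs(F.col("label") - F.col("prediction"))).alias("max_error")
).first()["max_error"]
print(max_error)  # 1.0
{code}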





