[jira] [Updated] (SPARK-45800) FileOutputCommitter race condition
[ https://issues.apache.org/jira/browse/SPARK-45800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sayed Mohammad Hossein Torabi updated SPARK-45800:
--------------------------------------------------
    Attachment: spark.log

> FileOutputCommitter race condition
> ----------------------------------
>
>                 Key: SPARK-45800
>                 URL: https://issues.apache.org/jira/browse/SPARK-45800
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.3
>            Reporter: Sayed Mohammad Hossein Torabi
>            Priority: Major
>         Attachments: main.py, spark.log
>
> A race condition occurs on the *FileOutputCommitter* side when multiple Spark jobs write to the same table simultaneously.
> A code example and the log output are attached to the ticket.
[jira] [Updated] (SPARK-45800) FileOutputCommitter race condition
[ https://issues.apache.org/jira/browse/SPARK-45800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sayed Mohammad Hossein Torabi updated SPARK-45800:
--------------------------------------------------
    Description: 
A race condition occurs on the *FileOutputCommitter* side when multiple Spark jobs write to the same table simultaneously.
A code example and the log output are attached to the ticket.

  was:
A race condition occurs on the *FileOutputCommitter* side when multiple Spark jobs write to the same table simultaneously. Here is an example:
[code example and log output identical to the original issue report below]
[jira] [Created] (SPARK-45800) FileOutputCommitter race condition
Sayed Mohammad Hossein Torabi created SPARK-45800:
--------------------------------------------------

             Summary: FileOutputCommitter race condition
                 Key: SPARK-45800
                 URL: https://issues.apache.org/jira/browse/SPARK-45800
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.3
            Reporter: Sayed Mohammad Hossein Torabi
         Attachments: main.py

A race condition occurs on the *FileOutputCommitter* side when multiple Spark jobs write to the same table simultaneously. Here is an example:

{code:python}
import tempfile
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, rand
from pyspark.sql.types import IntegerType

TABLE_NAME = "test_table"
table_directory = tempfile.TemporaryDirectory()

spark = (
    SparkSession.builder.master("local[*]")
    # fixed key; the original report used the misspelled "spark.sql.shuffle.partition"
    .config("spark.sql.shuffle.partitions", 1)
    .config("spark.driver.host", "localhost")
    .config("spark.sql.session.timeZone", "Europe/Amsterdam")
    # fixed key; the original report used "spark.sql.source.partitionOverwriteMode"
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql(
    f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS {TABLE_NAME} (
        value INT
    )
    PARTITIONED BY (
        date STRING
    )
    STORED AS PARQUET
    LOCATION '{table_directory.name}'
    TBLPROPERTIES (
        "parquet.compress"="SNAPPY"
    )
    """
)

# Three threads overwrite three different partitions of the same table concurrently.
with ThreadPoolExecutor(max_workers=3) as executor:
    executor.map(
        lambda date: (
            spark.range(1e6)
            .withColumn("value", (rand(seed=42) * 100).cast(IntegerType()))
            .withColumn("date", lit(date))
            .select("value", "date")
            .write.insertInto(TABLE_NAME, True)
        ),
        [
            "1970-01-01",
            "1970-01-02",
            "1970-01-03",
        ],
    )

spark.sql(f"DROP TABLE {TABLE_NAME}")
table_directory.cleanup()
spark.stop()
{code}

Log:

{code}
23/11/06 00:12:44 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
23/11/06 00:12:44 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
23/11/06 00:12:45 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
23/11/06 00:12:45 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore blcksrx@127.0.0.1
23/11/06 00:12:45 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
23/11/06 00:12:45 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 75.08% for 9 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 67.58% for 10 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 75.08% for 9 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
23/11/06 00:12:47 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
23/11/06 00:12:47 WARN MemoryManager
{code}
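For reference, here is a minimal workaround sketch, not part of the ticket: serializing the overlapping insertInto calls from the driver so that concurrent write jobs never overlap inside FileOutputCommitter's shared staging output under the table location. The helper and lock names below are illustrative and assume the repro above, where all three writes target the same table.

{code:python}
import threading
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql.functions import lit, rand
from pyspark.sql.types import IntegerType

# Illustrative names: _write_lock and write_partition are not from the ticket.
_write_lock = threading.Lock()

def write_partition(spark, table_name, date):
    df = (
        spark.range(1_000_000)
        .withColumn("value", (rand(seed=42) * 100).cast(IntegerType()))
        .withColumn("date", lit(date))
        .select("value", "date")
    )
    # Only one insert job runs at a time, so concurrent jobs never race in
    # FileOutputCommitter's shared staging directory under the table location.
    with _write_lock:
        df.write.insertInto(table_name, overwrite=True)

# Usage, assuming the `spark` session and `test_table` from the repro above:
# with ThreadPoolExecutor(max_workers=3) as executor:
#     list(executor.map(lambda d: write_partition(spark, "test_table", d),
#                       ["1970-01-01", "1970-01-02", "1970-01-03"]))
{code}

Serializing the writes trades away parallelism, so it is a stopgap rather than a fix for the committer itself.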
[jira] [Updated] (SPARK-45800) FileOutputCommitter race condition
[ https://issues.apache.org/jira/browse/SPARK-45800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sayed Mohammad Hossein Torabi updated SPARK-45800:
--------------------------------------------------
    Attachment: main.py

> FileOutputCommitter race condition
> ----------------------------------
>
>                 Key: SPARK-45800
>                 URL: https://issues.apache.org/jira/browse/SPARK-45800
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.3
>            Reporter: Sayed Mohammad Hossein Torabi
>            Priority: Major
>         Attachments: main.py
>
> A race condition occurs on the *FileOutputCommitter* side when multiple Spark jobs write to the same table simultaneously. Here is an example:
> [code example and log output identical to the original issue report above]
[jira] [Created] (SPARK-45422) Update Partition Stats
Sayed Mohammad Hossein Torabi created SPARK-45422:
--------------------------------------------------

             Summary: Update Partition Stats
                 Key: SPARK-45422
                 URL: https://issues.apache.org/jira/browse/SPARK-45422
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Sayed Mohammad Hossein Torabi

Spark provides *spark.sql.statistics.size.autoUpdate.enabled*, which works well for small tables or tables that do not contain many files. It would be great to also introduce an option that calculates statistics at the partition level: instead of altering/updating the statistics of the whole table, it would only gather statistics for the partitions that Spark writes to.
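As a rough illustration of what such an option could automate, this is the per-partition statistics command users run manually today after a write; the table name and partition values are placeholders, not from the ticket.

{code:python}
# Manual per-partition statistics, which the proposed option could automate
# after each write. Table and partition values below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

written_partitions = ["1970-01-01", "1970-01-02"]  # partitions a job just wrote
for date in written_partitions:
    spark.sql(
        f"ANALYZE TABLE test_table PARTITION (date = '{date}') COMPUTE STATISTICS"
    )
{code}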
[jira] [Created] (SPARK-43325) regexp_extract_all DataFrame API
Sayed Mohammad Hossein Torabi created SPARK-43325:
--------------------------------------------------

             Summary: regexp_extract_all DataFrame API
                 Key: SPARK-43325
                 URL: https://issues.apache.org/jira/browse/SPARK-43325
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 3.4.0
            Reporter: Sayed Mohammad Hossein Torabi
             Fix For: 3.4.1

Implement the `regexp_extract_all` DataFrame API and make it available for both Scala/Java and Python Spark.
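Until such a DataFrame function exists, the SQL built-in can already be reached through expr(); a small sketch, with sample data, column name, and pattern all illustrative:

{code:python}
# Stopgap: call the SQL function regexp_extract_all through expr() until a
# dedicated DataFrame/Python API exists. Data and pattern are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("id=1,id=22,id=333",)], ["raw"])

df.select(
    F.expr(r"regexp_extract_all(raw, 'id=(\\d+)', 1)").alias("ids")
).show(truncate=False)
# ids = [1, 22, 333]
{code}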
[jira] [Created] (SPARK-38263) StructType explode
Sayed Mohammad Hossein Torabi created SPARK-38263:
--------------------------------------------------

             Summary: StructType explode
                 Key: SPARK-38263
                 URL: https://issues.apache.org/jira/browse/SPARK-38263
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: Sayed Mohammad Hossein Torabi

Currently the explode function only supports Array and Map data types, not StructType. Supporting StructType would help Spark users flatten their datasets, which is useful when dealing with semi-structured and unstructured data. The idea is to support StructType in the first place and also to add `prefix` and `postfix` options to it.
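For context, a rough sketch of the flattening such an option would provide, done today with a plain select over the struct's fields; the DataFrame, helper name, and prefix below are illustrative, not an existing API:

{code:python}
# Illustrative helper: flatten a struct column into top-level columns with a
# prefix, which is roughly what a StructType-aware explode could offer.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, ("Amsterdam", "1011"))],
    "id INT, address STRUCT<city: STRING, zip: STRING>",
)

def flatten_struct(df, column, prefix=""):
    """Replace a struct column with one top-level column per struct field."""
    field_names = df.schema[column].dataType.fieldNames()
    flattened = [F.col(f"{column}.{name}").alias(f"{prefix}{name}") for name in field_names]
    others = [c for c in df.columns if c != column]
    return df.select(*others, *flattened)

flatten_struct(df, "address", prefix="address_").show()
# columns: id, address_city, address_zip
{code}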
[jira] [Updated] (SPARK-34652) Support SchemaRegistry in from_avro method
[ https://issues.apache.org/jira/browse/SPARK-34652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sayed Mohammad Hossein Torabi updated SPARK-34652:
--------------------------------------------------
    Component/s: SQL
                     (was: PySpark)
                     (was: Structured Streaming)

> Support SchemaRegistry in from_avro method
> ------------------------------------------
>
>                 Key: SPARK-34652
>                 URL: https://issues.apache.org/jira/browse/SPARK-34652
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Sayed Mohammad Hossein Torabi
>            Priority: Minor
>
> [issue description identical to the original report below]
[jira] [Updated] (SPARK-34652) Support SchemaRegistry in from_avro method
[ https://issues.apache.org/jira/browse/SPARK-34652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sayed Mohammad Hossein Torabi updated SPARK-34652:
--------------------------------------------------
    Affects Version/s: 3.3.0
                           (was: 3.1.1)

> Support SchemaRegistry in from_avro method
> ------------------------------------------
>
>                 Key: SPARK-34652
>                 URL: https://issues.apache.org/jira/browse/SPARK-34652
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Sayed Mohammad Hossein Torabi
>            Priority: Minor
>
> [issue description identical to the original report below]
[jira] [Created] (SPARK-34652) Support SchemaRegistry in from_avro method
Sayed Mohammad Hossein Torabi created SPARK-34652:
--------------------------------------------------

             Summary: Support SchemaRegistry in from_avro method
                 Key: SPARK-34652
                 URL: https://issues.apache.org/jira/browse/SPARK-34652
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, Structured Streaming
    Affects Versions: 3.1.1
            Reporter: Sayed Mohammad Hossein Torabi

Confluent Schema Registry provides a serving layer for metadata: a RESTful interface for storing and retrieving Avro®, JSON Schema, and Protobuf schemas. It would be nice to implement a variant of from_avro that takes a Schema Registry reference instead of a raw schema string. Databricks has already implemented this; see https://docs.databricks.com/spark/latest/structured-streaming/avro-dataframe.html#example-with-schema-registry
It would probably be a small and simple method, but a really useful one.
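As a rough sketch of the manual pattern such a method would replace (requires the spark-avro package): fetch the schema over the registry's REST API and pass it to from_avro. The registry URL, topic/subject names, and the assumption that payloads use Confluent's wire format (one magic byte plus a 4-byte schema id before the Avro body, hence the 5 skipped bytes) are illustrative, not from the ticket.

{code:python}
# Hedged sketch: resolve the latest schema from a Confluent Schema Registry
# and decode Kafka values with from_avro(). Names and URLs are placeholders.
import requests
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.avro.functions import from_avro

spark = SparkSession.builder.getOrCreate()

registry_url = "http://schema-registry:8081"
subject = "my_topic-value"
schema_json = requests.get(
    f"{registry_url}/subjects/{subject}/versions/latest"
).json()["schema"]

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "my_topic")
    .load()
)

decoded = raw.select(
    from_avro(
        # strip the assumed 5-byte Confluent header before Avro decoding
        F.expr("substring(value, 6, length(value) - 5)"),
        schema_json,
    ).alias("payload")
)
{code}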
[jira] [Created] (SPARK-33508) Why can the user not assign a different `key.deserializer`, and why must it always be `ByteArrayDeserializer`?
Sayed Mohammad Hossein Torabi created SPARK-33508:
--------------------------------------------------

             Summary: Why can the user not assign a different `key.deserializer`, and why must it always be `ByteArrayDeserializer`?
                 Key: SPARK-33508
                 URL: https://issues.apache.org/jira/browse/SPARK-33508
             Project: Spark
          Issue Type: Question
          Components: Structured Streaming
    Affects Versions: 3.0.1
            Reporter: Sayed Mohammad Hossein Torabi
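For context on the question: the Structured Streaming Kafka source reads keys and values as binary and leaves deserialization to DataFrame operations. A minimal sketch of that pattern, with broker, topic, and schema as placeholders:

{code:python}
# The Kafka source returns key/value as BINARY columns; deserialization is
# done in Spark, e.g. with cast() and from_json(). Names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "my_topic")
    .load()  # key and value arrive as BINARY columns
)

schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])

parsed = raw.select(
    col("key").cast("string").alias("key"),  # "deserialize" the key in Spark
    from_json(col("value").cast("string"), schema).alias("value"),
)
{code}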
[jira] [Created] (SPARK-32359) Implement max_error metric evaluator for spark regression mllib
Sayed Mohammad Hossein Torabi created SPARK-32359:
--------------------------------------------------

             Summary: Implement max_error metric evaluator for spark regression mllib
                 Key: SPARK-32359
                 URL: https://issues.apache.org/jira/browse/SPARK-32359
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 3.0.0
            Reporter: Sayed Mohammad Hossein Torabi
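For reference, a hedged sketch of the metric such an evaluator would report: the largest absolute difference between label and prediction (scikit-learn's max_error). The column names follow MLlib's usual defaults and the data is illustrative.

{code:python}
# Compute max error by hand on a predictions DataFrame with "label" and
# "prediction" columns, i.e. what a max_error evaluator would return.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
predictions = spark.createDataFrame(
    [(3.0, 2.5), (0.5, 0.0), (2.0, 2.0), (7.0, 8.0)],
    ["label", "prediction"],
)

max_error = (
    predictions.select(
        F.max(F.abs(F.col("label") - F.col("prediction"))).alias("max_error")
    )
    .first()["max_error"]
)
print(max_error)  # 1.0
{code}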