[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs

Ivan Sadikov (Jira) Mon, 18 Mar 2024 18:35:40 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-46990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828143#comment-17828143
 ]


Ivan Sadikov commented on SPARK-46990:
--------------------------------------

Opened PR https://github.com/apache/spark/pull/45578.

> Regression: Unable to load empty avro files emitted by event-hubs
> -----------------------------------------------------------------
>
>                 Key: SPARK-46990
>                 URL: https://issues.apache.org/jira/browse/SPARK-46990
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.5.0
>         Environment: Databricks 14.0 - 14.3 (spark 3.5.0)
>            Reporter: Kamil Kandzia
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: second=02.avro
>
>
> In azure, I use databricks and event-hubs. Up until spark version 3.4.1 (in 
> databricks as 13.3 LTS) empty avro files emitted by event-hubs can be read. 
> Since version 3.5.0, it is impossible to load these files (even if I have 
> multiple avro files to load and one of them is empty, it can't perform an 
> operation like count or save). I tested this on databricks versions 14.0, 
> 14.1, 14.2, 14.3 and it doesn't work properly in any of them.
> I use the following code:
>  
> {code:java}
> df = spark.read.format("avro") \                 
> .load('abfss://<container>@<storage>.dfs.core.windows.net/<evh-namespace>/<evh>/0/2024/02/05/22/46/10.avro')
>     
> df.count() <- in this operation the spark hangs{code}
> I am sending a fragment of logs from databricks and query plan:
> {code:java}
> 24/02/06 10:03:10 INFO ProgressReporter$: Added result fetcher for 
> 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 1; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 24/02/06 10:03:11 INFO InMemoryFileIndex: It took 9 ms to list leaf files for 
> 1 paths.
> 24/02/06 10:03:11 INFO ProgressReporter$: Removed result fetcher for 
> 2734305632140666820_7640723027790427455_4f56f528d4a44796a98821713778d5f9
> 24/02/06 10:03:12 INFO ProgressReporter$: Added result fetcher for 
> 2734305632140666820_6526693737104909881_a07acddb350f44a284cac52db0b2fb21
> 24/02/06 10:03:12 INFO ClusterLoadMonitor: Added query with execution ID:38. 
> Current active queries:1
> 24/02/06 10:03:12 INFO FileSourceStrategy: Pushed Filters: 
> 24/02/06 10:03:12 INFO FileSourceStrategy: Post-Scan Filters: 
> 24/02/06 10:03:12 INFO CodeGenerator: Code generated in 10.636308 ms
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34 stored as values in 
> memory (estimated size 409.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes 
> in memory (estimated size 14.5 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory 
> on <IP_ADDRESS_2>:43781 (size: 14.5 KiB, free: 3.3 GiB)
> 24/02/06 10:03:12 INFO SparkContext: Created broadcast 34 from 
> $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63
> 24/02/06 10:03:12 INFO FileSourceScanExec: Planning scan with bin packing, 
> max split size: 4194304 bytes, max partition size: 4194304, open cost is 
> considered as scanning 4194304 bytes.
> 24/02/06 10:03:12 INFO DAGScheduler: Registering RDD 104 
> ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) as input 
> to shuffle 11
> 24/02/06 10:03:12 INFO DAGScheduler: Got map stage job 22 
> ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) with 1 
> output partitions
> 24/02/06 10:03:12 INFO DAGScheduler: Final stage: ShuffleMapStage 31 
> ($anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63)
> 24/02/06 10:03:12 INFO DAGScheduler: Parents of final stage: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Missing parents: List()
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting ShuffleMapStage 31 
> (MapPartitionsRDD[104] at $anonfun$withThreadLocalCaptured$5 at 
> LexicalThreadLocal.scala:63), which has no missing parents
> 24/02/06 10:03:12 INFO DAGScheduler: Submitting 1 missing tasks from 
> ShuffleMapStage 31 (MapPartitionsRDD[104] at 
> $anonfun$withThreadLocalCaptured$5 at LexicalThreadLocal.scala:63) (first 15 
> tasks are for partitions Vector(0))
> 24/02/06 10:03:12 INFO TaskSchedulerImpl: Adding task set 31.0 with 1 tasks 
> resource profile 0
> 24/02/06 10:03:12 INFO TaskSetManager: TaskSet 31.0 using PreferredLocationsV1
> 24/02/06 10:03:12 WARN FairSchedulableBuilder: A job was submitted with 
> scheduler pool 2734305632140666820, which has not been configured. This can 
> happen when the file that pools are read from isn't set, or when that file 
> doesn't contain 2734305632140666820. Created 2734305632140666820 with default 
> configuration (schedulingMode: FIFO, minShare: 0, weight: 1)
> 24/02/06 10:03:12 INFO FairSchedulableBuilder: Added task set TaskSet_31.0 
> tasks to pool 2734305632140666820
> 24/02/06 10:03:12 INFO TaskSetManager: Starting task 0.0 in stage 31.0 (TID 
> 449) (<IP_ADDRESS>, executor 3, partition 0, PROCESS_LOCAL, 
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_35 stored as values in 
> memory (estimated size 137.2 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO MemoryStore: Block broadcast_35_piece0 stored as bytes 
> in memory (estimated size 41.3 KiB, free 3.3 GiB)
> 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_35_piece0 in memory 
> on <IP_ADDRESS_2>:43781 (size: 41.3 KiB, free: 3.3 GiB)
> 24/02/06 10:03:12 INFO SparkContext: Created broadcast 35 from broadcast at 
> TaskSetManager.scala:723
> 24/02/06 10:03:12 INFO BlockManagerInfo: Added broadcast_35_piece0 in memory 
> on <IP_ADDRESS>:40825 (size: 41.3 KiB, free: 3.6 GiB)
> 24/02/06 10:03:13 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory 
> on <IP_ADDRESS>:40825 (size: 17.6 KiB, free: 3.6 GiB)
> 24/02/06 10:03:14 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 0.0, New Ema: 1.0 
> 24/02/06 10:03:15 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory 
> on <IP_ADDRESS>:40825 (size: 14.5 KiB, free: 3.6 GiB)
> 24/02/06 10:03:17 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:03:20 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:03:23 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:03:26 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:03:29 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:03:32 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:03:35 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:03:38 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:03:41 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:03:44 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:03:47 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:03:50 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:03:53 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:03:56 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:03:58 INFO DataSourceFactory$: DataSource Jdbc URL: 
> jdbc:mariadb://<DELETED FOR JIRA PURPOSES>
> 24/02/06 10:03:58 INFO HikariDataSource: metastore-monitor - Starting...
> 24/02/06 10:03:58 INFO HikariDataSource: metastore-monitor - Start completed.
> 24/02/06 10:03:58 INFO HikariDataSource: metastore-monitor - Shutdown 
> initiated...
> 24/02/06 10:03:58 INFO HikariDataSource: metastore-monitor - Shutdown 
> completed.
> 24/02/06 10:03:58 INFO MetastoreMonitor: Metastore healthcheck successful 
> (connection duration = 302 milliseconds)
> 24/02/06 10:03:59 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:02 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:05 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:08 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:11 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:13 INFO HiveMetaStore: 1: get_database: default
> 24/02/06 10:04:13 INFO audit: ugi=root    ip=unknown-ip-addr    
> cmd=get_database: default    
> 24/02/06 10:04:13 INFO DriverCorral: DBFS health check ok
> 24/02/06 10:04:13 INFO DriverCorral: Metastore health check ok
> 24/02/06 10:04:14 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:17 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:20 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:23 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:26 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:29 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:32 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:35 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:38 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:41 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:44 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:47 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:50 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:53 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:56 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:04:59 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:05:02 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:05:05 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:05:08 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:05:11 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:05:14 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:05:17 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:05:20 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:05:23 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:05:26 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:05:29 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:05:32 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:05:35 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:05:38 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> 24/02/06 10:05:41 INFO ClusterLoadAvgHelper: Current cluster load: 1, Old 
> Ema: 1.0, New Ema: 1.0 
> == Parsed Logical Plan ==
> Relation 
> [SequenceNumber#451L,Offset#452,EnqueuedTimeUtc#453,SystemProperties#454,Properties#455,Body#456]
>  avro== Analyzed Logical Plan ==
> SequenceNumber: bigint, Offset: string, EnqueuedTimeUtc: string, 
> SystemProperties: 
> map<string,struct<member0:bigint,member1:double,member2:string,member3:binary>>,
>  Properties: 
> map<string,struct<member0:bigint,member1:double,member2:string,member3:binary>>,
>  Body: binary
> Relation 
> [SequenceNumber#451L,Offset#452,EnqueuedTimeUtc#453,SystemProperties#454,Properties#455,Body#456]
>  avro== Optimized Logical Plan ==
> Relation 
> [SequenceNumber#451L,Offset#452,EnqueuedTimeUtc#453,SystemProperties#454,Properties#455,Body#456]
>  avro== Physical Plan ==
> FileScan avro 
> [SequenceNumber#451L,Offset#452,EnqueuedTimeUtc#453,SystemProperties#454,Properties#455,Body#456]
>  Batched: false, DataFilters: [], Format: Avro, Location: InMemoryFileIndex(1 
> paths)[abfss://<container>@<storage>.dfs.core..., PartitionFilters: [], 
> PushedFilters: [], ReadSchema: 
> struct<SequenceNumber:bigint,Offset:string,EnqueuedTimeUtc:string,SystemProperties:map<string,str...
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-46990) Regression: Unable to load empty avro files emitted by event-hubs

Reply via email to