Hi folks, I'm contributing to the OpenLineage project, specifically the Apache Spark integration. My current focus is on extending the project to support data lineage extraction for Spark Streaming, beginning with Apache Kafka sources and sinks.
I've hit an obstacle when trying to access information essential for lineage extraction from Apache Kafka-related classes within the OpenLineage Spark code base. Specifically, I need to read details such as Kafka topic names and bootstrap servers from objects like StreamingDataSourceV2Relation. I can access these details when the Kafka JARs are placed directly in the 'spark/jars' directory, but I'm unable to do so when the dependencies are supplied through the `--packages` option. This is a significant limitation for users who rely on `--packages` for their Spark applications.

I've taken initial steps to investigate (see this GitHub PR: <https://github.com/OpenLineage/OpenLineage/pull/2647>; the class in question is *StreamingDataSourceV2RelationVisitor*), but I'd greatly appreciate any insights or guidance on the following:

*1. Understanding the Issue:* Are there known reasons within Spark that could explain this difference in behavior when loading dependencies via `--packages` versus placing JARs directly in 'spark/jars'?

*2. Alternative Approaches:* Are there recommended techniques or patterns for accessing the necessary Kafka class information within a SparkListener extension, especially when dependencies are managed via `--packages`?

I'm eager to find a solution that avoids heavy reliance on reflection.

Thank you for your time and assistance!

Kind regards,
Damien
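
P.S. In case it helps narrow things down, here is a minimal diagnostic sketch (not part of the PR) for checking which class loaders can see a given class. It rests on an unconfirmed assumption: that JARs resolved via `--packages` end up on a different class loader than the JARs in 'spark/jars'. The class name `ClassVisibilityProbe` is hypothetical, and `org.apache.spark.sql.kafka010.KafkaSourceProvider` is only used as an example of a class I would expect to come from the Kafka connector JAR. It uses `Class.forName` purely for diagnosis, not as a proposed solution.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic helper: reports which class loaders can see a class.
final class ClassVisibilityProbe {

    // Returns true if the loader can load the class; the class is not initialized.
    public static boolean isVisible(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Checks the class against three loaders that may differ inside a Spark driver.
    public static Map<String, Boolean> probe(String className) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        result.put("loader of this class",
                isVisible(className, ClassVisibilityProbe.class.getClassLoader()));
        result.put("thread context loader",
                isVisible(className, Thread.currentThread().getContextClassLoader()));
        result.put("system loader",
                isVisible(className, ClassLoader.getSystemClassLoader()));
        return result;
    }

    public static void main(String[] args) {
        // Assumption: this class ships in the Spark Kafka connector JAR.
        String name = args.length > 0
                ? args[0]
                : "org.apache.spark.sql.kafka010.KafkaSourceProvider";
        probe(name).forEach((loader, visible) ->
                System.out.println(loader + ": " + visible));
    }
}
```

Calling `ClassVisibilityProbe.probe(...)` from inside the visitor under both setups and comparing the results should show whether the difference is a class loader visibility problem or something else.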