Yes, it's a reasonable argument that putting N more external integration modules on the default spark-submit classpath might bring in more third-party dependencies that clash with what users have in their own apps. I think the convenience factor isn't a big deal; users can also just declare a dependency on said module in their own app, once. It does seem like we could at least *ship* the binary bits in "external-jars/" or something; they're not even compiled in the binary distro. Not shipping them also means users have to make sure the version of spark-kafka they integrate works with their cluster, which means not just making sure their app matches the user-facing API of spark-kafka, but also ensuring that the spark-kafka module's interface to Spark works -- whatever internal details there may be there.
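For what it's worth, declaring that dependency once might look like the build.sbt fragment below. This is only a sketch under assumptions that aren't in this thread: an sbt build, the Structured Streaming Kafka source (spark-sql-kafka-0-10) rather than the DStream one, and a cluster running the 2.3.1 / Scala 2.11 binaries from the jars/ listing quoted below. The version value is the thing to keep in step with the cluster.

```scala
// build.sbt -- illustrative fragment, not taken from the Spark build itself.
// Assumption: the cluster runs Spark 2.3.1 built for Scala 2.11.
val sparkVersion = "2.3.1"

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Already on the cluster's classpath (jars/), so not bundled into the app jar.
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  // Not in the binary distro's jars/, so it must travel with the app
  // (e.g. in an assembly jar) or be resolved at submit time.
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
)
```

Pinning both artifacts to one sparkVersion value at least keeps the module in line with the spark-sql it was built against; whether that also matches the cluster is still up to the user. The other route, as Jacek notes, is --packages with the same coordinates (org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1) at submit time.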
On Sat, Aug 4, 2018 at 9:15 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:
> I think that traditionally, the reason *not* to include these has been if
> they brought additional dependencies that users don’t really need, but that
> might clash with what the users have in their own app. Maybe this used to
> be the case for Kafka. We could analyze it and include it by default, or
> perhaps make it easier to add it in spark-submit and spark-shell. I feel
> that in an IDE, it won’t be a huge problem because you just add it once,
> but it is annoying for spark-submit.
>
> Matei
>
> > On Aug 4, 2018, at 2:19 PM, Sean Owen <sro...@gmail.com> wrote:
> >
> > Hm OK I am crazy then. I think I never noticed it because I had always
> > used a distro that did actually supply this on the classpath.
> > Well ... I think it would be reasonable to include these things (at
> > least, Kafka integration) by default in the binary distro. I'll update the
> > JIRA to reflect that this is at best a Wish.
> >
> > On Sat, Aug 4, 2018 at 4:17 PM Jacek Laskowski <ja...@japila.pl> wrote:
> > Hi Sean,
> >
> > It's been for years I'd say that you had to specify --packages to get
> > the Kafka-related jars on the classpath. I simply got used to this
> > annoyance (as did others). Could it be that it's an external package
> > (although an integral part of Spark)?!
> >
> > I'm very glad you've brought it up since I think Kafka data source is so
> > important that it should be included in spark-shell and spark-submit by
> > default. THANKS!
> >
> > Regards,
> > Jacek Laskowski
> > ----
> > https://about.me/JacekLaskowski
> > Mastering Spark SQL https://bit.ly/mastering-spark-sql
> > Spark Structured Streaming https://bit.ly/spark-structured-streaming
> > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> > Follow me at https://twitter.com/jaceklaskowski
> >
> > On Sat, Aug 4, 2018 at 9:56 PM, Sean Owen <sro...@gmail.com> wrote:
> > Let's take this to https://issues.apache.org/jira/browse/SPARK-25026 --
> > I provisionally marked this a Blocker, as if it's correct, then the release
> > is missing an important piece and we'll want to remedy that ASAP. I still
> > have this feeling I am missing something. The classes really aren't there
> > in the release but ... *nobody* noticed all this time? I guess maybe
> > Spark-Kafka users may be using a vendor distro that does package these bits.
> >
> > On Sat, Aug 4, 2018 at 10:48 AM Sean Owen <sro...@gmail.com> wrote:
> > I was debugging why a Kafka-based streaming app doesn't seem to find
> > Kafka-related integration classes when run standalone from our latest 2.3.1
> > release, and noticed that there doesn't seem to be any Kafka-related jars
> > from Spark in the distro.
> > In jars/, I see:
> >
> > spark-catalyst_2.11-2.3.1.jar
> > spark-core_2.11-2.3.1.jar
> > spark-graphx_2.11-2.3.1.jar
> > spark-hive-thriftserver_2.11-2.3.1.jar
> > spark-hive_2.11-2.3.1.jar
> > spark-kubernetes_2.11-2.3.1.jar
> > spark-kvstore_2.11-2.3.1.jar
> > spark-launcher_2.11-2.3.1.jar
> > spark-mesos_2.11-2.3.1.jar
> > spark-mllib-local_2.11-2.3.1.jar
> > spark-mllib_2.11-2.3.1.jar
> > spark-network-common_2.11-2.3.1.jar
> > spark-network-shuffle_2.11-2.3.1.jar
> > spark-repl_2.11-2.3.1.jar
> > spark-sketch_2.11-2.3.1.jar
> > spark-sql_2.11-2.3.1.jar
> > spark-streaming_2.11-2.3.1.jar
> > spark-tags_2.11-2.3.1.jar
> > spark-unsafe_2.11-2.3.1.jar
> > spark-yarn_2.11-2.3.1.jar
> >
> > I checked make-distribution.sh, and it copies a bunch of JARs into the
> > distro, but does not seem to touch the kafka modules.
> >
> > Am I crazy or missing something obvious -- those should be in the
> > release, right?