Shixiong, Your suggestion works if I use the pyspark-shell directly. In this case I want to setup a Spark Session from within my Jupyter Notebook.
My question/issue is related to this SO question: https://stackoverflow.com/questions/35762459/add-jar-to-standalone-pyspark so basically I want to add --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 to the my python code that creates the session Something like.... >>>> # Spin up a local Spark Session spark = SparkSession.builder.appName('my_awesome')\ .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0')\ .getOrCreate() >>> Unfortunately this doesn't actually work.. :) I'm sure it's straightforward to have Kafka work with PySpark... I'm just naive about how the packages get loaded... On Wed, Aug 23, 2017 at 4:51 PM, Shixiong(Ryan) Zhu <shixi...@databricks.com > wrote: > You can use `bin/pyspark --packages > org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0` > to start "pyspark". If you want to use "spark-submit", you also need to > provide your Python file. > > On Wed, Aug 23, 2017 at 1:41 PM, Brian Wylie <briford.wy...@gmail.com> > wrote: > >> Hi All, >> >> I'm trying the new hotness of using Kafka and Structured Streaming. >> >> Resources that I've looked at >> - https://spark.apache.org/docs/latest/streaming-programming-guide.html >> - https://databricks.com/blog/2016/07/28/structured-streamin >> g-in-apache-spark.html >> - https://spark.apache.org/docs/latest/streaming-custom-receivers.html >> - http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/St >> ructured%20Streaming%20using%20Python%20DataFrames%20API.html >> >> My setup is a bit weird (yes.. yes.. I know...) >> - Eventually I'll just use a DataBricks cluster and life will be bliss :) >> - But for now I want to test/try stuff out on my little Mac Laptop >> >> The newest version of PySpark will install a local Spark server with a >> simple: >> $ pip install pyspark >> >> This is very nice. I've put together a little notebook using that kewl >> feature: >> - https://github.com/Kitware/BroThon/blob/master/notebooks/B >> ro_to_Spark_Cheesy.ipynb >> >> So the next step is the setup/use a Kafka message queue and that went >> well/works fine. >> >> $ kafka-console-consumer --bootstrap-server localhost:9092 --topic dns >> >> *I get messages spitting out....* >> >> {"ts":1503513688.232274,"uid":"CdA64S2Z6Xh555","id.orig_h":"192.168.1.7","id.orig_p":58528,"id.resp_h":"192.168.1.1","id.resp_p":53,"proto":"udp","trans_id":43933,"rtt":0.02226,"query":"brian.wylie.is.awesome.tk","qclass":1,"qclass_name":"C_INTERNET","qtype":1,"qtype_name":"A","rcode":0,"rcode_name":"NOERROR","AA":false,"TC":false,"RD":true,"RA":true,"Z":0,"answers":["17.188.137.55","17.188.142.54","17.188.138.55","17.188.141.184","17.188.129.50","17.188.128.178","17.188.129.178","17.188.141.56"],"TTLs":[25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0],"rejected":false} >> >> >> Okay, finally getting to my question: >> - Local spark server (good) >> - Local kafka server and messages getting produced (good) >> - Trying to this line of PySpark code (not good) >> >> # Setup connection to Kafka Stream dns_events = >> spark.readStream.format('kafka')\ >> .option('kafka.bootstrap.servers', 'localhost:9092')\ >> .option('subscribe', 'dns')\ >> .option('startingOffsets', 'latest')\ >> .load() >> >> >> fails with: >> java.lang.ClassNotFoundException: Failed to find data source: kafka. >> Please find packages at http://spark.apache.org/third-party-projects.html >> >> I've looked that the URL listed... and poking around I can see that maybe >> I need the kafka jar file as part of my local server. >> >> I lamely tried this: >> $ spark-submit --packages org.apache.spark:spark-sql-kaf >> ka-0-10_2.11:2.2.0 >> >> Exception in thread "main" java.lang.IllegalArgumentException: Missing >> application resource. at org.apache.spark.launcher.Comm >> andBuilderUtils.checkArgument(CommandBuilderUtils.java:241) at >> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSpa >> rkSubmitArgs(SparkSubmitCommandBuilder.java:160) at >> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSpa >> rkSubmitCommand(SparkSubmitCommandBuilder.java:274) at >> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCom >> mand(SparkSubmitCommandBuilder.java:151) at >> org.apache.spark.launcher.Main.main(Main.java:86) >> >> >> Anyway, all my code/versions/etc are in this notebook: >> - https://github.com/Kitware/BroThon/blob/master/notebooks/Bro >> _to_Spark.ipynb >> >> I'd be tremendously appreciative of some super nice, smart person if they >> could point me in the right direction :) >> >> -Brian Wylie >> > >