
Your suggestion works if I use the pyspark-shell directly. In this case I
want to set up a Spark Session from within my Jupyter Notebook.

My question/issue is related to this SO question:

So basically I want to add the --packages option
to the Python code that creates the session.

Something like:
# Spin up a local Spark Session
spark = SparkSession.builder.appName('my_awesome')\

Unfortunately this doesn't actually work.. :)

I'm sure it's straightforward to have Kafka work with PySpark... I'm just
naive about how the packages get loaded...
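
A minimal sketch of one commonly used way to do this from a notebook, assuming
the package coordinates from the reply quoted below match the local install:
set the PYSPARK_SUBMIT_ARGS environment variable before the first session is
created. The helper names (build_submit_args, create_session) are hypothetical
and only for illustration. Setting the spark.jars.packages config on the
builder is another documented route, but it only takes effect if the JVM has
not been started yet.

```python
import os

# Assumption: coordinates copied from the quoted reply; the Scala (2.11) and
# Spark (2.2.0) versions must match the locally installed pyspark.
KAFKA_PACKAGE = "org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0"


def build_submit_args(packages):
    """Hypothetical helper: build a PYSPARK_SUBMIT_ARGS value from package coordinates."""
    # The trailing 'pyspark-shell' token is the application resource; without
    # one, the launcher reports "Missing application resource".
    return "--packages {} pyspark-shell".format(",".join(packages))


def create_session(app_name="my_awesome"):
    """Hypothetical helper: start a local SparkSession with the Kafka source available."""
    # The variable is read when the JVM is launched, so it must be set before
    # the first SparkSession/SparkContext is created in this Python process.
    os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args([KAFKA_PACKAGE])

    # Imported inside the function so build_submit_args can be used without pyspark.
    from pyspark.sql import SparkSession

    return SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()


# Usage in a fresh notebook kernel (not executed here):
# spark = create_session()
# dns_events = spark.readStream.format("kafka") \
#     .option("kafka.bootstrap.servers", "localhost:9092") \
#     .option("subscribe", "dns") \
#     .load()
```

If a session already exists in the kernel, the variable is ignored, so the
kernel likely needs a restart before this has any effect.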

On Wed, Aug 23, 2017 at 4:51 PM, Shixiong(Ryan) Zhu wrote:

> You can use `bin/pyspark --packages 
> org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0`
> to start "pyspark". If you want to use "spark-submit", you also need to
> provide your Python file.
> On Wed, Aug 23, 2017 at 1:41 PM, Brian Wylie wrote:
>> Hi All,
>> I'm trying the new hotness of using Kafka and Structured Streaming.
>> Resources that I've looked at
>> -
>> -
>> g-in-apache-spark.html
>> -
>> -
>> ructured%20Streaming%20using%20Python%20DataFrames%20API.html
>> My setup is a bit weird (yes.. yes.. I know...)
>> - Eventually I'll just use a Databricks cluster and life will be bliss :)
>> - But for now I want to test/try stuff out on my little Mac Laptop
>> The newest version of PySpark will install a local Spark server with a
>> simple:
>> $ pip install pyspark
>> This is very nice. I've put together a little notebook using that kewl
>> feature:
>> -
>> ro_to_Spark_Cheesy.ipynb
>> So the next step was to set up and use a Kafka message queue, and that
>> went well and works fine.
>> $ kafka-console-consumer --bootstrap-server localhost:9092 --topic dns
>> *I get messages spitting out....*
>> {"ts":1503513688.232274,"uid":"CdA64S2Z6Xh555","id.orig_h":"","id.orig_p":58528,"id.resp_h":"","id.resp_p":53,"proto":"udp","trans_id":43933,"rtt":0.02226,"query":"","qclass":1,"qclass_name":"C_INTERNET","qtype":1,"qtype_name":"A","rcode":0,"rcode_name":"NOERROR","AA":false,"TC":false,"RD":true,"RA":true,"Z":0,"answers":["","","","","","","",""],"TTLs":[25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0],"rejected":false}
>> Okay, finally getting to my question:
>> - Local spark server (good)
>> - Local kafka server and messages getting produced (good)
>> - Trying to run this PySpark code (not good)
>> # Setup connection to Kafka Stream
>> dns_events = spark.readStream.format('kafka')\
>>   .option('kafka.bootstrap.servers', 'localhost:9092')\
>>   .option('subscribe', 'dns')\
>>   .option('startingOffsets', 'latest')\
>>   .load()
>> fails with:
>> java.lang.ClassNotFoundException: Failed to find data source: kafka.
>> Please find packages at
>> I've looked at the URL listed... and poking around I can see that maybe
>> I need the Kafka jar file as part of my local server.
>> I lamely tried this:
>> $ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
>> Exception in thread "main" java.lang.IllegalArgumentException: Missing application resource.
>>   at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(
>>   at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(
>>   at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(
>>   at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(
>>   at org.apache.spark.launcher.Main.main(
>> Anyway, all my code/versions/etc are in this notebook:
>> -
>> _to_Spark.ipynb
>> I'd be tremendously appreciative of some super nice, smart person if they
>> could point me in the right direction :)
>> -Brian Wylie

