Re: How to add jars to standalone pyspark program
Thank you for your response. However, I'm afraid I still can't get it to work. This is my code:

    jar_path = '/home/mj/apps/spark_jars/spark-csv_2.11-1.0.3.jar'
    spark_config = SparkConf().setMaster('local').setAppName('data_frame_test').set('spark.jars', jar_path)
    sc = SparkContext(conf=spark_config)

I'm still getting this error:

    Failed to load class for data source: com.databricks.spark.csv
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:194)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:205)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685p22784.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: How to add jars to standalone pyspark program
I've worked around this by dropping the jars into a directory (spark_jars) and then creating a spark-defaults.conf file in conf containing this:

    spark.driver.extraClassPath /home/mj/apps/spark_jars/*

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685p22787.html
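For reference, the workaround above relies on the spark-defaults.conf format, where each line is a property name, then whitespace, then the value. The following is only a minimal sketch of generating such a line; the helper name `render_spark_defaults` is hypothetical and not part of Spark, and the directory is the one from the message.

```python
def render_spark_defaults(properties):
    """Render a dict as spark-defaults.conf text: one 'key value' pair per line."""
    lines = []
    for key, value in sorted(properties.items()):
        # Key and value are separated by whitespace, not by '=' or nothing at all.
        lines.append("{0} {1}".format(key, value))
    return "\n".join(lines) + "\n"


# Directory taken from the message above; adjust it to your own layout.
conf_text = render_spark_defaults({
    "spark.driver.extraClassPath": "/home/mj/apps/spark_jars/*",
})
print(conf_text)
```

The resulting text would go into conf/spark-defaults.conf under the Spark installation that the program actually launches.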
Re: How to add jars to standalone pyspark program
Ah, I just noticed that you are using an external package. You can add that like this:

    conf = SparkConf().set('spark.jars', jar_path)

or, if it is a Python package:

    sc.addPyFile(path)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685p22688.html
Re: How to add jars to standalone pyspark program
Hi Mark,

That does not look like a Python path issue. The spark-assembly jar should have those packaged and should make them available to the workers. Have you built the jar yourself?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685p22687.html
Re: How to add jars to standalone pyspark program
Can you specify what 'running via PyCharm' means? How are you executing the script, with spark-submit?

In PySpark I guess you used --jars databricks-csv.jar. With spark-submit you might need the additional --driver-class-path databricks-csv.jar.

Neither parameter can be set via the SparkConf object.

Cheers,
Fabian

On 04/28/2015 10:06 AM, mj wrote:
> Hi,
>
> I'm trying to figure out how to use a third party jar inside a python program which I'm running via PyCharm in order to debug it. I am normally able to run spark code in python such as this:
>
>     from pyspark import SparkConf, SparkContext
>     spark_conf = SparkConf().setMaster('local').setAppName('test')
>     sc = SparkContext(conf=spark_conf)
>     cars = sc.textFile('c:/cars.csv')
>     print(cars.count())
>     sc.stop()
>
> The code I'm trying to run is below - it uses the databricks spark csv jar. I can get it working fine in pyspark shell using the --packages argument, but I can't figure out how to get it to work via PyCharm.
>
>     from pyspark.sql import SQLContext
>     from pyspark import SparkConf, SparkContext
>     spark_conf = SparkConf().setMaster('local').setAppName('test')
>     sc = SparkContext(conf=spark_conf)
>     sqlContext = SQLContext(sc)
>     df = sqlContext.load(source='com.databricks.spark.csv', header='true', path='c:/cars.csv', delimiter='\t')
>     df.select('year')
>
> The error message I'm getting is:
>
>     py4j.protocol.Py4JJavaError: An error occurred while calling o20.load.
>     : java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
>         at scala.sys.package$.error(package.scala:27)
>         at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:194)
>         at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:205)
>         at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
>         at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:483)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>         at py4j.Gateway.invoke(Gateway.java:259)
>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:207)
>         at java.lang.Thread.run(Thread.java:745)
>
> I presume I need to set the spark classpath somehow but I'm not sure of the right way to do it. Any advice/guidance would be appreciated.
>
> Thanks,
> Mark.
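When a script is started from an IDE rather than through spark-submit, one possible way to pass the --jars and --driver-class-path options mentioned in this message is the PYSPARK_SUBMIT_ARGS environment variable, which PySpark reads when it launches the JVM. The sketch below is hedged: the helper `build_submit_args` and its `trailer` parameter are hypothetical names, and whether the value must end in `pyspark-shell` depends on the Spark version (as far as I know, newer releases expect it and older ones append it themselves), so check this against your installation.

```python
import os


def build_submit_args(jar_paths, trailer="pyspark-shell"):
    """Build a PYSPARK_SUBMIT_ARGS value from a list of jar paths.

    Assumption: paths contain no spaces. Pass trailer="" if your Spark
    version appends 'pyspark-shell' on its own.
    """
    jars = ",".join(jar_paths)                      # --jars takes a comma-separated list
    driver_class_path = os.pathsep.join(jar_paths)  # classpath uses ':' or ';' per platform
    parts = ["--jars", jars, "--driver-class-path", driver_class_path]
    if trailer:
        parts.append(trailer)
    return " ".join(parts)


# Jar path as posted elsewhere in this thread; replace with your own.
jar_path = "/home/mj/apps/spark_jars/spark-csv_2.11-1.0.3.jar"

# This must happen before the SparkContext is created, because the JVM is
# started at that point and the classpath cannot be changed afterwards.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args([jar_path])
```

After this runs, the usual `SparkContext(conf=spark_conf)` call follows unchanged. The same variable can also be set in the PyCharm run configuration instead of in code.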
Re: How to add jars to standalone pyspark program
It's a Windows thing. Please escape the forward slash in the string. Basically, it is not able to find the file.

On 28 Apr 2015 22:09, Fabian Böhnlein fabian.boehnl...@gmail.com wrote:
> Can you specify what 'running via PyCharm' means? How are you executing the script, with spark-submit?
>
> In PySpark I guess you used --jars databricks-csv.jar. With spark-submit you might need the additional --driver-class-path databricks-csv.jar.
>
> Neither parameter can be set via the SparkConf object.
>
> Cheers,
> Fabian
How to add jars to standalone pyspark program
Hi,

I'm trying to figure out how to use a third party jar inside a python program which I'm running via PyCharm in order to debug it. I am normally able to run spark code in python such as this:

    from pyspark import SparkConf, SparkContext
    spark_conf = SparkConf().setMaster('local').setAppName('test')
    sc = SparkContext(conf=spark_conf)
    cars = sc.textFile('c:/cars.csv')
    print(cars.count())
    sc.stop()

The code I'm trying to run is below - it uses the databricks spark csv jar. I can get it working fine in pyspark shell using the --packages argument, but I can't figure out how to get it to work via PyCharm.

    from pyspark.sql import SQLContext
    from pyspark import SparkConf, SparkContext
    spark_conf = SparkConf().setMaster('local').setAppName('test')
    sc = SparkContext(conf=spark_conf)
    sqlContext = SQLContext(sc)
    df = sqlContext.load(source='com.databricks.spark.csv', header='true', path='c:/cars.csv', delimiter='\t')
    df.select('year')

The error message I'm getting is:

    py4j.protocol.Py4JJavaError: An error occurred while calling o20.load.
    : java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:194)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:205)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

I presume I need to set the spark classpath somehow but I'm not sure of the right way to do it. Any advice/guidance would be appreciated.

Thanks,
Mark.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685.html
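A "Failed to load class for data source" error like the one above generally means the JVM could not find the class on its classpath. Before adjusting Spark settings, it can help to confirm that the jar file itself contains the expected class. The following is a rough sketch using only the Python standard library; `jar_contains_class` is a hypothetical helper, and the class name `com.databricks.spark.csv.DefaultSource` is an assumption about how the spark-csv package names its entry point.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar (a zip archive) holds the given fully qualified class."""
    # A class a.b.C is stored inside the jar as the entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

Calling `jar_contains_class(jar_path, 'com.databricks.spark.csv.DefaultSource')` with the path of the downloaded jar shows whether the file is the right one. If it returns True and the error persists, the jar is probably not reaching the driver's classpath.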