Re: How to add jars to standalone pyspark program

2015-05-06 Thread mj
Thank you for your response. However, I'm afraid I still can't get it to
work. This is my code:

from pyspark import SparkConf, SparkContext

jar_path = '/home/mj/apps/spark_jars/spark-csv_2.11-1.0.3.jar'
spark_config = (SparkConf().setMaster('local').setAppName('data_frame_test')
                .set('spark.jars', jar_path))
sc = SparkContext(conf=spark_config)

I'm still getting this error:

Failed to load class for data source: com.databricks.spark.csv
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:194)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:205)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685p22784.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to add jars to standalone pyspark program

2015-05-06 Thread mj
I've worked around this by dropping the jars into a directory (spark_jars)
and then creating a spark-defaults.conf file in conf containing this:

spark.driver.extraClassPath /home/mj/apps/spark_jars/*
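
For anyone who wants to script this workaround, here is a minimal Python sketch that renders the same setting and writes it to a spark-defaults.conf file. The helper names (render_extra_classpath, write_defaults) are made up for illustration, and the demo writes into a throwaway temp directory rather than a real Spark conf directory. Note that write_defaults overwrites any existing file at that location.

```python
import os
import tempfile


def render_extra_classpath(jar_dir):
    """Return a spark-defaults.conf line that puts every jar in jar_dir on the driver classpath."""
    # The key and the value are separated by whitespace; the trailing /* is a
    # JVM classpath wildcard that matches all jar files in the directory.
    return "spark.driver.extraClassPath {}/*".format(jar_dir.rstrip("/"))


def write_defaults(conf_dir, jar_dir):
    """Write (overwriting) spark-defaults.conf inside conf_dir and return its path."""
    path = os.path.join(conf_dir, "spark-defaults.conf")
    with open(path, "w") as handle:
        handle.write(render_extra_classpath(jar_dir) + "\n")
    return path


# Demonstration against a temporary directory (an assumption for this sketch),
# not against an actual Spark installation.
demo_conf_dir = tempfile.mkdtemp()
demo_path = write_defaults(demo_conf_dir, "/home/mj/apps/spark_jars")
with open(demo_path) as handle:
    demo_line = handle.read().strip()
print(demo_line)
```

In a real setup, conf_dir would be the conf directory of the Spark installation, and an existing spark-defaults.conf would more likely be edited by hand than overwritten.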



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685p22787.html



Re: How to add jars to standalone pyspark program

2015-04-28 Thread jamborta
Ah, I just noticed that you are using an external package. You can add it
like this:

conf = (SparkConf().set('spark.jars', jar_path))

Or, if it is a Python package:

sc.addPyFile()
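
A slightly fuller sketch of the two suggestions above, for illustration only. The helper names (spark_jars_value, missing_jars, build_context) are hypothetical; spark.jars takes a comma-separated list of jar paths. pyspark is imported lazily so the small helpers can be tried without Spark installed. Note that mj's follow-up in this thread reports that setting spark.jars alone did not resolve the error in that setup, so treat this as a sketch of the suggestion rather than a confirmed fix.

```python
import os


def spark_jars_value(jar_paths):
    """Join jar paths into the comma-separated form used by the spark.jars setting."""
    return ",".join(jar_paths)


def missing_jars(jar_paths):
    """Return the paths that do not point at an existing file (a quick sanity check)."""
    return [path for path in jar_paths if not os.path.isfile(path)]


def build_context(jar_paths, py_files=(), app_name="data_frame_test"):
    """Create a local SparkContext with extra jars and optional Python files."""
    # Imported here so the helpers above work even where pyspark is not installed.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("local")
            .setAppName(app_name)
            .set("spark.jars", spark_jars_value(jar_paths)))
    sc = SparkContext(conf=conf)
    for py_file in py_files:
        # addPyFile ships a .py, .zip or .egg file to the workers.
        sc.addPyFile(py_file)
    return sc


print(spark_jars_value(["/a/x.jar", "/b/y.jar"]))
```

Running missing_jars on the list first can rule out a mistyped jar path before looking for other causes.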



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685p22688.html



Re: How to add jars to standalone pyspark program

2015-04-28 Thread jamborta
Hi Mark,

That does not look like a Python path issue. The spark-assembly jar should
have those packaged and should make them available to the workers. Have you
built the jar yourself?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685p22687.html



Re: How to add jars to standalone pyspark program

2015-04-28 Thread Fabian Böhnlein
Can you specify what 'running via PyCharm' means? How are you executing the
script, with spark-submit?


In PySpark I guess you used --jars databricks-csv.jar. With spark-submit
you might need the additional --driver-class-path databricks-csv.jar.


Neither parameter can be set via the SparkConf object.

Cheers,
Fabian
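
To make the command line above concrete, here is a small Python sketch that assembles a spark-submit invocation with both flags. The script name my_script.py and the helper spark_submit_command are placeholders invented for this example; it also assumes spark-submit is on the PATH and that the jar sits in the working directory.

```python
import shlex


def spark_submit_command(script, jar, spark_submit="spark-submit"):
    """Build the argument list for spark-submit with the jar on both classpaths."""
    return [
        spark_submit,
        "--jars", jar,               # jar made available to the application
        "--driver-class-path", jar,  # jar added to the driver JVM's classpath
        script,
    ]


cmd = spark_submit_command("my_script.py", "databricks-csv.jar")
# Print a copy-pasteable command line; to launch it, pass cmd to subprocess.call.
print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the command as a list avoids quoting problems if the jar path contains spaces.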




Re: How to add jars to standalone pyspark program

2015-04-28 Thread ayan guha
It's a Windows thing. Please escape the forward slash in the string.
Basically, it is not able to find the file.
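
As a general illustration of Windows path pitfalls in Python string literals, here is a short sketch. The file name table.csv is made up to show the effect of an unescaped backslash. The path in the original post already uses a forward slash ('c:/cars.csv'), so this may not be what is happening there.

```python
from pathlib import PureWindowsPath

# An unescaped backslash followed by "t" is read as a TAB character,
# so this string is not the path it appears to be.
unescaped = "c:\table.csv"

# Two ways to keep the backslash literal.
escaped = "c:\\table.csv"
raw = r"c:\table.csv"

# Forward slashes avoid the escaping question altogether.
forward = PureWindowsPath(raw).as_posix()

print(repr(unescaped))
print(repr(escaped))
print(forward)
```

Printing with repr makes stray escape characters visible, which is a quick way to check a suspicious path.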




How to add jars to standalone pyspark program

2015-04-28 Thread mj
Hi,

I'm trying to figure out how to use a third party jar inside a python
program which I'm running via PyCharm in order to debug it. I am normally
able to run spark code in python such as this:

spark_conf = SparkConf().setMaster('local').setAppName('test')
sc = SparkContext(conf=spark_conf)
cars = sc.textFile('c:/cars.csv')
print cars.count()
sc.stop()

The code I'm trying to run is below - it uses the databricks spark csv jar.
I can get it working fine in pyspark shell using the packages argument, but
I can't figure out how to get it to work via PyCharm.

from pyspark.sql import SQLContext
from pyspark import SparkConf, SparkContext

spark_conf = SparkConf().setMaster('local').setAppName('test')
sc = SparkContext(conf=spark_conf)

sqlContext = SQLContext(sc)
df = sqlContext.load(source='com.databricks.spark.csv', header='true',
                     path='c:/cars.csv', delimiter='\t')
df.select('year')

The error message I'm getting is:
py4j.protocol.Py4JJavaError: An error occurred while calling o20.load.
: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:194)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:205)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)


I presume I need to set the spark classpath somehow but I'm not sure of the
right way to do it. Any advice/guidance would be appreciated.

Thanks,

Mark.
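
One approach sometimes used when a script is started from an IDE instead of the pyspark shell is to put the shell's arguments into the PYSPARK_SUBMIT_ARGS environment variable before the SparkContext is created. The sketch below is hedged: the helper pyspark_submit_args is hypothetical, the package coordinates are an assumption inferred from the jar file name mentioned elsewhere in this thread, and whether the trailing pyspark-shell token is needed depends on the Spark version, so it is exposed as a switch.

```python
import os


def pyspark_submit_args(packages=(), jars=(), driver_class_path=None,
                        append_shell_token=True):
    """Assemble a value for the PYSPARK_SUBMIT_ARGS environment variable."""
    parts = []
    if packages:
        parts += ["--packages", ",".join(packages)]
    if jars:
        parts += ["--jars", ",".join(jars)]
    if driver_class_path:
        parts += ["--driver-class-path", driver_class_path]
    if append_shell_token:
        # Assumption: some PySpark versions expect this literal token last.
        parts.append("pyspark-shell")
    return " ".join(parts)


# Assumed coordinates, guessed from the file name spark-csv_2.11-1.0.3.jar.
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args(
    packages=["com.databricks:spark-csv_2.11:1.0.3"])
print(os.environ["PYSPARK_SUBMIT_ARGS"])

# The variable has to be set before SparkContext(...) is constructed,
# because that is when PySpark launches the JVM.
```

In PyCharm the same value could instead be entered as an environment variable in the run configuration, which keeps it out of the script.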



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685.html