Re: How to add jars to standalone pyspark program

2015-05-06 Thread mj
Thank you for your response; however, I'm afraid I still can't get it to
work. This is my code:

jar_path = '/home/mj/apps/spark_jars/spark-csv_2.11-1.0.3.jar'
spark_config = SparkConf().setMaster('local').setAppName('data_frame_test').set('spark.jars', jar_path)
sc = SparkContext(conf=spark_config)

I'm still getting this error:

Failed to load class for data source: com.databricks.spark.csv
at scala.sys.package$.error(package.scala:27)
at
org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:194)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:205)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685p22784.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to add jars to standalone pyspark program

2015-05-06 Thread mj
I've worked around this by dropping the jars into a directory (spark_jars)
and then creating a spark-defaults.conf file in Spark's conf directory
containing this:

spark.driver.extraClassPath    /home/mj/apps/spark_jars/*
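
With that in place, a script along these lines should then be able to find the
CSV data source (a sketch only; the path below is just an example, and the
wildcard has to cover the spark-csv jar plus any jars it depends on):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# assumes conf/spark-defaults.conf sets spark.driver.extraClassPath as above
sc = SparkContext(conf=SparkConf().setMaster('local').setAppName('data_frame_test'))
sqlContext = SQLContext(sc)

# example path and options only
df = sqlContext.load(source='com.databricks.spark.csv', header='true',
                     path='/home/mj/data/cars.csv', delimiter='\t')
print(df.count())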



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685p22787.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




How to add jars to standalone pyspark program

2015-04-28 Thread mj
Hi,

I'm trying to figure out how to use a third-party jar inside a Python
program that I'm running via PyCharm in order to debug it. I am normally
able to run Spark code in Python, such as this:

spark_conf = SparkConf().setMaster('local').setAppName('test')
sc = SparkContext(conf=spark_conf)
cars = sc.textFile('c:/cars.csv')
print cars.count()
sc.stop()

The code I'm trying to run is below; it uses the Databricks spark-csv jar.
I can get it working fine in the pyspark shell using the --packages argument,
but I can't figure out how to get it to work via PyCharm.

from pyspark.sql import SQLContext
from pyspark import SparkConf, SparkContext

spark_conf = SparkConf().setMaster('local').setAppName('test')
sc = SparkContext(conf=spark_conf)

sqlContext = SQLContext(sc)
df = sqlContext.load(source='com.databricks.spark.csv', header='true',
                     path='c:/cars.csv', delimiter='\t')
df.select('year')

The error message I'm getting is:
py4j.protocol.Py4JJavaError: An error occurred while calling o20.load.
: java.lang.RuntimeException: Failed to load class for data source:
com.databricks.spark.csv
at scala.sys.package$.error(package.scala:27)
at
org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:194)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:205)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)


I presume I need to set the Spark classpath somehow, but I'm not sure of the
right way to do it. Any advice/guidance would be appreciated.
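
One route that sometimes carries the shell's --packages behaviour over to a
script launched from PyCharm (a sketch only; the coordinate below is an
example, and whether the trailing pyspark-shell token is needed depends on
the Spark version) is to set PYSPARK_SUBMIT_ARGS before the first
SparkConf/SparkContext is created, since that is what launches the JVM:

import os

# example coordinate; must be set before SparkConf/SparkContext is created
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-csv_2.10:1.0.3 pyspark-shell')

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(conf=SparkConf().setMaster('local').setAppName('test'))
sqlContext = SQLContext(sc)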

Thanks,

Mark.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Change ivy cache for spark on Windows

2015-04-27 Thread mj
Hi,

I'm having trouble using the --packages option for spark-shell.cmd. I have to
use Windows at work and have been issued a username with a space in it, which
means that when I use the --packages option it fails with this message:

Exception in thread "main" java.net.URISyntaxException: Illegal character
in path at index 13: C:/Users/My Name/.ivy2/jars/spark-csv_2.10.jar

The command I'm trying to run is:
.\spark-shell.cmd --packages com.databricks:spark-csv_2.10:1.0.3

I've tried creating an ivysettings.xml file with the content below in my
.ivy2 directory, but Spark doesn't seem to pick it up. Does anyone have any
ideas on how to get around this issue?

<ivysettings>
    <caches defaultCacheDir="c:\ivy_cache"/>
</ivysettings>
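
If the ivysettings.xml route keeps getting ignored, another option worth
checking (an untested suggestion; confirm that your Spark version actually
supports the spark.jars.ivy property) is to point the Ivy directory at a
path without a space directly on the command line:

.\spark-shell.cmd --conf spark.jars.ivy=c:\ivy_cache --packages com.databricks:spark-csv_2.10:1.0.3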




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Change-ivy-cache-for-spark-on-Windows-tp22675.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




pyspark 1.1.1 on windows saveAsTextFile - NullPointerException

2014-12-18 Thread mj
Hi,

I'm trying to use pyspark to save a simple RDD to a text file (code below),
but it keeps throwing an error.

- Python Code -
items = ['Hello', 'world']
items2 = sc.parallelize(items)
items2.coalesce(1).saveAsTextFile('c:/tmp/python_out.csv')


- Error -
C:\Python27\python.exe C:/Users/Mark
Jones/PycharmProjects/spark_test/spark_error_sample.py
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
14/12/18 13:00:53 INFO SecurityManager: Changing view acls to: Mark Jones,
14/12/18 13:00:53 INFO SecurityManager: Changing modify acls to: Mark Jones,
14/12/18 13:00:53 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(Mark Jones, );
users with modify permissions: Set(Mark Jones, )
14/12/18 13:00:53 INFO Slf4jLogger: Slf4jLogger started
14/12/18 13:00:53 INFO Remoting: Starting remoting
14/12/18 13:00:53 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@192.168.19.83:54548]
14/12/18 13:00:53 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkDriver@192.168.19.83:54548]
14/12/18 13:00:53 INFO Utils: Successfully started service 'sparkDriver' on
port 54548.
14/12/18 13:00:53 INFO SparkEnv: Registering MapOutputTracker
14/12/18 13:00:53 INFO SparkEnv: Registering BlockManagerMaster
14/12/18 13:00:53 INFO DiskBlockManager: Created local directory at
C:\Users\MARKJO~1\AppData\Local\Temp\spark-local-20141218130053-1ab9
14/12/18 13:00:53 INFO Utils: Successfully started service 'Connection
manager for block manager' on port 54551.
14/12/18 13:00:53 INFO ConnectionManager: Bound socket to port 54551 with id
= ConnectionManagerId(192.168.19.83,54551)
14/12/18 13:00:53 INFO MemoryStore: MemoryStore started with capacity 265.1
MB
14/12/18 13:00:53 INFO BlockManagerMaster: Trying to register BlockManager
14/12/18 13:00:53 INFO BlockManagerMasterActor: Registering block manager
192.168.19.83:54551 with 265.1 MB RAM
14/12/18 13:00:53 INFO BlockManagerMaster: Registered BlockManager
14/12/18 13:00:53 INFO HttpFileServer: HTTP File server directory is
C:\Users\MARKJO~1\AppData\Local\Temp\spark-a43340e8-2621-46b8-a44e-8874dd178393
14/12/18 13:00:53 INFO HttpServer: Starting HTTP Server
14/12/18 13:00:54 INFO Utils: Successfully started service 'HTTP file
server' on port 54552.
14/12/18 13:00:54 INFO Utils: Successfully started service 'SparkUI' on port
4040.
14/12/18 13:00:54 INFO SparkUI: Started SparkUI at http://192.168.19.83:4040
14/12/18 13:00:54 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/12/18 13:00:54 ERROR Shell: Failed to locate the winutils binary in the
hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in
the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
at
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at
org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
at
org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
at
org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
at
org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
at
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
14/12/18 13:00:54 INFO AkkaUtils: Connecting to HeartbeatReceiver:
akka.tcp://sparkDriver@192.168.19.83:54548/user/HeartbeatReceiver
14/12/18 13:00:55 INFO 
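
For what it's worth, the ERROR Shell entry above suggests the underlying
problem is Hadoop failing to find winutils.exe rather than saveAsTextFile
itself. A commonly suggested workaround (a sketch only, not verified here;
c:/hadoop is just a placeholder) is to point HADOOP_HOME at a directory whose
bin folder contains winutils.exe before the SparkContext is created:

import os

# placeholder path: %HADOOP_HOME%\bin must actually contain winutils.exe
os.environ['HADOOP_HOME'] = 'c:/hadoop'

from pyspark import SparkConf, SparkContext

# the variable has to be in the environment before the JVM is launched,
# i.e. before the first SparkConf/SparkContext is constructed
sc = SparkContext(conf=SparkConf().setMaster('local').setAppName('winutils_test'))
sc.parallelize(['Hello', 'world']).coalesce(1).saveAsTextFile('c:/tmp/python_out.csv')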

Pyspark 1.1.1 error with large number of records - serializer.dump_stream(func(split_index, iterator), outfile)

2014-12-16 Thread mj
I've got a simple pyspark program that generates two CSV files and then
carries out a leftOuterJoin (a fact RDD joined to a dimension RDD). The
program works fine for smaller volumes of records, but when it goes beyond 3
million records for the fact dataset, I get the error below. I'm running
PySpark via PyCharm and the information for my environment is:

OS: Windows 7
Python version: 2.7.9
Spark version: 1.1.1
Java version: 1.8

I've also included the py file I am using. I'd appreciate any help you can
give me, 

MJ.


ERROR MESSAGE
C:\Python27\python.exe C:/Users/Mark
Jones/PycharmProjects/spark_test/spark_error_sample.py
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
14/12/16 12:48:26 INFO SecurityManager: Changing view acls to: Mark Jones,
14/12/16 12:48:26 INFO SecurityManager: Changing modify acls to: Mark Jones,
14/12/16 12:48:26 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(Mark Jones, );
users with modify permissions: Set(Mark Jones, )
14/12/16 12:48:26 INFO Slf4jLogger: Slf4jLogger started
14/12/16 12:48:27 INFO Remoting: Starting remoting
14/12/16 12:48:27 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@192.168.19.83:51387]
14/12/16 12:48:27 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkDriver@192.168.19.83:51387]
14/12/16 12:48:27 INFO Utils: Successfully started service 'sparkDriver' on
port 51387.
14/12/16 12:48:27 INFO SparkEnv: Registering MapOutputTracker
14/12/16 12:48:27 INFO SparkEnv: Registering BlockManagerMaster
14/12/16 12:48:27 INFO DiskBlockManager: Created local directory at
C:\Users\MARKJO~1\AppData\Local\Temp\spark-local-20141216124827-11ef
14/12/16 12:48:27 INFO Utils: Successfully started service 'Connection
manager for block manager' on port 51390.
14/12/16 12:48:27 INFO ConnectionManager: Bound socket to port 51390 with id
= ConnectionManagerId(192.168.19.83,51390)
14/12/16 12:48:27 INFO MemoryStore: MemoryStore started with capacity 265.1
MB
14/12/16 12:48:27 INFO BlockManagerMaster: Trying to register BlockManager
14/12/16 12:48:27 INFO BlockManagerMasterActor: Registering block manager
192.168.19.83:51390 with 265.1 MB RAM
14/12/16 12:48:27 INFO BlockManagerMaster: Registered BlockManager
14/12/16 12:48:27 INFO HttpFileServer: HTTP File server directory is
C:\Users\MARKJO~1\AppData\Local\Temp\spark-3b772ca1-dbf7-4eaa-b62c-be5e73036f5d
14/12/16 12:48:27 INFO HttpServer: Starting HTTP Server
14/12/16 12:48:27 INFO Utils: Successfully started service 'HTTP file
server' on port 51391.
14/12/16 12:48:27 INFO Utils: Successfully started service 'SparkUI' on port
4040.
14/12/16 12:48:27 INFO SparkUI: Started SparkUI at http://192.168.19.83:4040
14/12/16 12:48:27 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/12/16 12:48:28 ERROR Shell: Failed to locate the winutils binary in the
hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in
the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
at
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at
org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
at
org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
at
org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
at
org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
at
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79

Re: Appending an incremental value to each RDD record

2014-12-16 Thread mj
You could try using zipWithIndex (links to the API docs are below). For
example, in Python:

items = ['a', 'b', 'c']
items2 = sc.parallelize(items)

print(items2.first())

items3 = items2.map(lambda x: (x, x + '!'))

print(items3.first())

items4 = items3.zipWithIndex()

print(items4.first())

items5 = items4.map(lambda x: (x[1], x[0]))
print(items5.first())


This will give you an output of (0, ('a', 'a!')), where the 0 is the index.
You could also use a map to shift the index up by a fixed amount (e.g. if you
wanted to count from 1); a small example of that follows.
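
Continuing from items4 above (items6 is just an illustrative name):

items6 = items4.map(lambda x: (x[1] + 1, x[0]))
print(items6.first())   # (1, ('a', 'a!'))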

Links
http://spark.apache.org/docs/latest/api/python/index.html
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Appending-an-incrental-value-to-each-RDD-record-tp20718p20720.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
