Re: How to add jars to standalone pyspark program
Thank you for your response; however, I'm afraid I still can't get it to work. This is my code:

    jar_path = '/home/mj/apps/spark_jars/spark-csv_2.11-1.0.3.jar'
    spark_config = SparkConf().setMaster('local').setAppName('data_frame_test').set('spark.jars', jar_path)
    sc = SparkContext(conf=spark_config)

I'm still getting this error:

    Failed to load class for data source: com.databricks.spark.csv
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:194)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:205)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
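For reference, a minimal sketch of how spark.jars is usually written as a plain string config key, with a comma-separated list of paths. The commons-csv path below is an assumption (spark-csv needs its own dependency jar as well), and note that jars added this way may still not reach the driver classpath in local mode, which is consistent with the spark-defaults.conf workaround reported in this thread:

    from pyspark import SparkConf, SparkContext

    # Hypothetical jar locations; spark.jars takes a comma-separated string of paths.
    jars = ','.join([
        '/home/mj/apps/spark_jars/spark-csv_2.11-1.0.3.jar',
        '/home/mj/apps/spark_jars/commons-csv-1.1.jar',  # assumed dependency of spark-csv
    ])

    spark_config = (SparkConf()
                    .setMaster('local')
                    .setAppName('data_frame_test')
                    .set('spark.jars', jars))
    sc = SparkContext(conf=spark_config)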
Re: How to add jars to standalone pyspark program
I've worked around this by dropping the jars into a directory (spark_jars) and then creating a spark-defaults.conf file in conf containing this:

    spark.driver.extraClassPath /home/mj/apps/spark_jars/*
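An alternative that keeps the setting inside the Python script is to pass the same classpath through spark-submit via the PYSPARK_SUBMIT_ARGS environment variable. This is only a sketch: it assumes a Spark version whose Py4J gateway honours that variable, and the trailing 'pyspark-shell' token is required on some releases and added automatically on others:

    import os

    # Must be set before the SparkContext is created, because PySpark launches
    # the driver JVM via spark-submit at that point and reads this variable.
    os.environ['PYSPARK_SUBMIT_ARGS'] = (
        '--driver-class-path /home/mj/apps/spark_jars/* pyspark-shell'
    )

    from pyspark import SparkConf, SparkContext
    sc = SparkContext(conf=SparkConf().setMaster('local').setAppName('data_frame_test'))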
How to add jars to standalone pyspark program
Hi,

I'm trying to figure out how to use a third-party jar inside a Python program which I'm running via PyCharm in order to debug it. I am normally able to run Spark code in Python such as this:

    spark_conf = SparkConf().setMaster('local').setAppName('test')
    sc = SparkContext(conf=spark_conf)
    cars = sc.textFile('c:/cars.csv')
    print cars.count()
    sc.stop()

The code I'm trying to run is below - it uses the Databricks spark-csv jar. I can get it working fine in the pyspark shell using the --packages argument, but I can't figure out how to get it to work via PyCharm.

    from pyspark.sql import SQLContext
    from pyspark import SparkConf, SparkContext

    spark_conf = SparkConf().setMaster('local').setAppName('test')
    sc = SparkContext(conf=spark_conf)
    sqlContext = SQLContext(sc)
    df = sqlContext.load(source='com.databricks.spark.csv', header='true', path='c:/cars.csv', delimiter='\t')
    df.select('year')

The error message I'm getting is:

    py4j.protocol.Py4JJavaError: An error occurred while calling o20.load.
    : java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:194)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:205)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

I presume I need to set the Spark classpath somehow, but I'm not sure of the right way to do it. Any advice/guidance would be appreciated.

Thanks,
Mark.
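One way to get the --packages behaviour from inside a PyCharm run is to set PYSPARK_SUBMIT_ARGS before the SparkContext is created. A sketch only, using the spark-csv coordinate mentioned elsewhere in this thread; whether the trailing 'pyspark-shell' token is needed depends on the Spark version:

    import os

    # Set before creating the SparkContext so spark-submit picks it up when
    # PySpark launches the driver JVM.
    os.environ['PYSPARK_SUBMIT_ARGS'] = (
        '--packages com.databricks:spark-csv_2.10:1.0.3 pyspark-shell'
    )

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(conf=SparkConf().setMaster('local').setAppName('test'))
    sqlContext = SQLContext(sc)
    df = sqlContext.load(source='com.databricks.spark.csv', header='true',
                         path='c:/cars.csv', delimiter='\t')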
Change ivy cache for spark on Windows
Hi,

I'm having trouble using the --packages option for spark-shell.cmd. I have to use Windows at work and have been issued a username with a space in it, which means that when I use the --packages option it fails with this message:

    Exception in thread "main" java.net.URISyntaxException: Illegal character in path at index 13: C:/Users/My Name/.ivy2/jars/spark-csv_2.10.jar

The command I'm trying to run is:

    .\spark-shell.cmd --packages com.databricks:spark-csv_2.10:1.0.3

I've tried creating an ivysettings.xml file with the content below in my .ivy2 directory, but Spark doesn't seem to pick it up. Does anyone have any ideas of how to get around this issue?

    <ivysettings>
      <caches defaultCacheDir="c:\ivy_cache"/>
    </ivysettings>
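If your Spark build exposes it, the ivy directory can also be relocated with the spark.jars.ivy property instead of an ivysettings.xml file. A sketch only; the property name and behaviour are assumptions worth checking against the docs for your version:

    .\spark-shell.cmd --conf spark.jars.ivy=c:\ivy_cache --packages com.databricks:spark-csv_2.10:1.0.3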
pyspark 1.1.1 on windows saveAsTextFile - NullPointerException
Hi,

I'm trying to use pyspark to save a simple RDD to a text file (code below), but it keeps throwing an error.

- Python Code -

    items = ['Hello', 'world']
    items2 = sc.parallelize(items)
    items2.coalesce(1).saveAsTextFile('c:/tmp/python_out.csv')

- Error -

    C:\Python27\python.exe C:/Users/Mark Jones/PycharmProjects/spark_test/spark_error_sample.py
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    14/12/18 13:00:53 INFO SecurityManager: Changing view acls to: Mark Jones,
    14/12/18 13:00:53 INFO SecurityManager: Changing modify acls to: Mark Jones,
    14/12/18 13:00:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Mark Jones, ); users with modify permissions: Set(Mark Jones, )
    14/12/18 13:00:53 INFO Slf4jLogger: Slf4jLogger started
    14/12/18 13:00:53 INFO Remoting: Starting remoting
    14/12/18 13:00:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.19.83:54548]
    14/12/18 13:00:53 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@192.168.19.83:54548]
    14/12/18 13:00:53 INFO Utils: Successfully started service 'sparkDriver' on port 54548.
    14/12/18 13:00:53 INFO SparkEnv: Registering MapOutputTracker
    14/12/18 13:00:53 INFO SparkEnv: Registering BlockManagerMaster
    14/12/18 13:00:53 INFO DiskBlockManager: Created local directory at C:\Users\MARKJO~1\AppData\Local\Temp\spark-local-20141218130053-1ab9
    14/12/18 13:00:53 INFO Utils: Successfully started service 'Connection manager for block manager' on port 54551.
    14/12/18 13:00:53 INFO ConnectionManager: Bound socket to port 54551 with id = ConnectionManagerId(192.168.19.83,54551)
    14/12/18 13:00:53 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
    14/12/18 13:00:53 INFO BlockManagerMaster: Trying to register BlockManager
    14/12/18 13:00:53 INFO BlockManagerMasterActor: Registering block manager 192.168.19.83:54551 with 265.1 MB RAM
    14/12/18 13:00:53 INFO BlockManagerMaster: Registered BlockManager
    14/12/18 13:00:53 INFO HttpFileServer: HTTP File server directory is C:\Users\MARKJO~1\AppData\Local\Temp\spark-a43340e8-2621-46b8-a44e-8874dd178393
    14/12/18 13:00:53 INFO HttpServer: Starting HTTP Server
    14/12/18 13:00:54 INFO Utils: Successfully started service 'HTTP file server' on port 54552.
    14/12/18 13:00:54 INFO Utils: Successfully started service 'SparkUI' on port 4040.
    14/12/18 13:00:54 INFO SparkUI: Started SparkUI at http://192.168.19.83:4040
    14/12/18 13:00:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    14/12/18 13:00:54 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
    java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
        at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
        at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
        at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
        at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
        at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
        at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
        at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
        at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
        at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
        at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:214)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
    14/12/18 13:00:54 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.19.83:54548/user/HeartbeatReceiver
    14/12/18 13:00:55 INFO
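The "Could not locate executable null\bin\winutils.exe" error usually indicates that HADOOP_HOME does not point to a directory containing bin\winutils.exe. A sketch of the commonly reported workaround; the c:/hadoop path is an assumption, and winutils.exe has to be obtained separately for a matching Hadoop build:

    import os

    # Assumed layout: c:\hadoop\bin\winutils.exe already exists on disk.
    # Set this before the SparkContext is created so the Hadoop libraries see it.
    os.environ['HADOOP_HOME'] = 'c:/hadoop'

    from pyspark import SparkConf, SparkContext
    sc = SparkContext(conf=SparkConf().setMaster('local').setAppName('winutils_test'))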
Pyspark 1.1.1 error with large number of records - serializer.dump_stream(func(split_index, iterator), outfile)
I've got a simple pyspark program that generates two CSV files and then carries out a leftOuterJoin (a fact RDD joined to a dimension RDD). The program works fine for smaller volumes of records, but when it goes beyond 3 million records for the fact dataset, I get the error below.

I'm running PySpark via PyCharm, and the information for my environment is:

OS: Windows 7
Python version: 2.7.9
Spark version: 1.1.1
Java version: 1.8

I've also included the py file I am using. I'd appreciate any help you can give me,

MJ.

ERROR MESSAGE

    C:\Python27\python.exe C:/Users/Mark Jones/PycharmProjects/spark_test/spark_error_sample.py
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    14/12/16 12:48:26 INFO SecurityManager: Changing view acls to: Mark Jones,
    14/12/16 12:48:26 INFO SecurityManager: Changing modify acls to: Mark Jones,
    14/12/16 12:48:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Mark Jones, ); users with modify permissions: Set(Mark Jones, )
    14/12/16 12:48:26 INFO Slf4jLogger: Slf4jLogger started
    14/12/16 12:48:27 INFO Remoting: Starting remoting
    14/12/16 12:48:27 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.19.83:51387]
    14/12/16 12:48:27 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@192.168.19.83:51387]
    14/12/16 12:48:27 INFO Utils: Successfully started service 'sparkDriver' on port 51387.
    14/12/16 12:48:27 INFO SparkEnv: Registering MapOutputTracker
    14/12/16 12:48:27 INFO SparkEnv: Registering BlockManagerMaster
    14/12/16 12:48:27 INFO DiskBlockManager: Created local directory at C:\Users\MARKJO~1\AppData\Local\Temp\spark-local-20141216124827-11ef
    14/12/16 12:48:27 INFO Utils: Successfully started service 'Connection manager for block manager' on port 51390.
    14/12/16 12:48:27 INFO ConnectionManager: Bound socket to port 51390 with id = ConnectionManagerId(192.168.19.83,51390)
    14/12/16 12:48:27 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
    14/12/16 12:48:27 INFO BlockManagerMaster: Trying to register BlockManager
    14/12/16 12:48:27 INFO BlockManagerMasterActor: Registering block manager 192.168.19.83:51390 with 265.1 MB RAM
    14/12/16 12:48:27 INFO BlockManagerMaster: Registered BlockManager
    14/12/16 12:48:27 INFO HttpFileServer: HTTP File server directory is C:\Users\MARKJO~1\AppData\Local\Temp\spark-3b772ca1-dbf7-4eaa-b62c-be5e73036f5d
    14/12/16 12:48:27 INFO HttpServer: Starting HTTP Server
    14/12/16 12:48:27 INFO Utils: Successfully started service 'HTTP file server' on port 51391.
    14/12/16 12:48:27 INFO Utils: Successfully started service 'SparkUI' on port 4040.
    14/12/16 12:48:27 INFO SparkUI: Started SparkUI at http://192.168.19.83:4040
    14/12/16 12:48:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    14/12/16 12:48:28 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
    java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
        at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
        at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
        at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
        at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
        at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
        at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
        at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
        at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
        at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
        at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:214)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
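For context, a minimal sketch of the kind of fact/dimension leftOuterJoin described above. This is not the attached file, just an illustration with made-up paths and column positions, assuming an existing SparkContext sc:

    # Hypothetical structure: both RDDs keyed by a shared dimension id.
    fact = sc.textFile('c:/tmp/fact.csv') \
             .map(lambda line: line.split(',')) \
             .map(lambda cols: (cols[0], cols[1:]))   # (dim_id, fact attributes)

    dim = sc.textFile('c:/tmp/dim.csv') \
            .map(lambda line: line.split(',')) \
            .map(lambda cols: (cols[0], cols[1:]))    # (dim_id, dimension attributes)

    # Every fact row is kept; unmatched dimension values come back as None.
    joined = fact.leftOuterJoin(dim)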
Re: Appending an incremental value to each RDD record
You could try using zipWithIndex (links below to the API docs). For example, in Python:

    items = ['a', 'b', 'c']
    items2 = sc.parallelize(items)
    print(items2.first())
    items3 = items2.map(lambda x: (x, x + '!'))
    print(items3.first())
    items4 = items3.zipWithIndex()
    print(items4.first())
    items5 = items4.map(lambda x: (x[1], x[0]))
    print(items5.first())

This will give you an output of (0, ('a', 'a!')) - where the 0 is the index. You could also use a map to shift the index up by a value (e.g. if you wanted to count from 1); a short sketch of that follows after the links.

Links
http://spark.apache.org/docs/latest/api/python/index.html
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
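A sketch of that follow-up map, continuing from items5 above, so the count starts at 1 instead of 0:

    # Shift the zero-based index produced by zipWithIndex to be one-based.
    items6 = items5.map(lambda kv: (kv[0] + 1, kv[1]))
    print(items6.first())   # (1, ('a', 'a!'))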