GraphFrames and IPython notebook issue - No module named graphframes
Hello,

I work locally on my laptop, not using the Databricks Community edition. I downloaded graphframes-0.1.0-spark1.6.jar from http://spark-packages.org/package/graphframes/graphframes and placed it in a folder named spark_extra_jars, where I keep other jars too.

After executing in a terminal:

    ipython notebook --profile=nbserver

I open http://127.0.0.1:/ in the browser, and in my IPython notebook I have, among others:

    jar_path = '/home/camelia/spark_extra_jars/spark-csv_2.11-1.2.0.jar,/home/camelia/spark_extra_jars/commons-csv-1.2.jar,/home/camelia/spark_extra_jars/graphframes-0.1.0-spark1.6.jar,/home/camelia/spark_extra_jars/spark-mongodb_2.10-0.11.0.jar'

    config = SparkConf().setAppName("graph_analytics").setMaster("local[4]").set("spark.jars", jar_path)

I can successfully import the other modules, but when I do:

    import graphframes

it gives the error:

    ImportError                               Traceback (most recent call last)
    in ()
    ----> 1 import graphframes

    ImportError: No module named graphframes

Thank you in advance for any hint on how to import graphframes successfully.

Best regards,
Camelia
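A small aside on the setup above: the comma-separated jar_path string can also be assembled from the folder instead of being typed by hand. A minimal sketch, with no Spark dependency (the helper name is mine, and the folder path is just the one from my setup):

```python
import glob
import os

def build_jar_path(folder):
    # Collect every .jar in `folder` into the comma-separated string
    # that SparkConf's "spark.jars" option expects.
    jars = sorted(glob.glob(os.path.join(folder, "*.jar")))
    return ",".join(jars)

# e.g. jar_path = build_jar_path("/home/camelia/spark_extra_jars")
```

Sorting is only there to make the result deterministic; Spark does not care about the order of the jars.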
What happens in the master or slave launch?
Hello,

I have the following question about two scenarios:

1) In the first scenario (when I'm connected on the target node), the master starts successfully. Its log contains:

Spark Command: /usr/opt/java/jdk1.7.0_07/jre/bin/java -cp /home/camelia/spark-1.4.1-bin-hadoop2.6/sbin/../conf/:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar -Dspark.deploy.defaultCores=2 -Xms512m -Xmx512m -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip ttitania-6 --port 7077 --webui-port 8080
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/10/07 13:23:55 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/10/07 13:23:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/07 13:23:56 INFO SecurityManager: Changing view acls to: camelia
15/10/07 13:23:56 INFO SecurityManager: Changing modify acls to: camelia
15/10/07 13:23:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(camelia); users with modify permissions: Set(camelia)
15/10/07 13:23:56 INFO Slf4jLogger: Slf4jLogger started
15/10/07 13:23:56 INFO Remoting: Starting remoting
15/10/07 13:23:57 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@ttitania-6:7077]
15/10/07 13:23:57 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
15/10/07 13:23:57 INFO Utils: Successfully started service on port 6066.
15/10/07 13:23:57 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
15/10/07 13:23:57 INFO Master: Starting Spark master at spark://ttitania-6:7077
15/10/07 13:23:57 INFO Master: Running Spark version 1.4.1
15/10/07 13:23:57 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/10/07 13:23:57 INFO MasterWebUI: Started MasterWebUI at http://129.16.20.156:8080
15/10/07 13:23:57 INFO Master: I have been elected leader! New state: ALIVE

2) In the second scenario (trying to start the Spark master remotely), the master only partially starts and then shuts down. Its log is:

Spark Command: /usr/opt/java/jdk1.7.0_07/jre/bin/java -cp /home/camelia/spark-1.4.1-bin-hadoop2.6/sbin/../conf/:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar -Dspark.deploy.defaultCores=2 -Xms512m -Xmx512m -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip ttitania-6 --port 7077 --webui-port 8080
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/10/07 12:56:31 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/10/07 12:56:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/07 12:56:32 INFO SecurityManager: Changing view acls to: camelia
15/10/07 12:56:32 INFO SecurityManager: Changing modify acls to: camelia
15/10/07 12:56:32 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(camelia); users with modify permissions: Set(camelia)

Can you please explain what might be the problem? I look forward to your reply.

Many thanks,
Camelia
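When the master dies early like this, one quick check from the remote machine is whether its ports ever come up (7077 for the master, 8080 for the web UI, 6066 for the REST server, as in the logs above). A minimal probe, with host name and ports purely illustrative:

```python
import socket

def port_open(host, port, timeout=2.0):
    # Attempt a plain TCP connect; True means something is listening.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. to probe the master's services:
# for p in (7077, 8080, 6066):
#     print(p, port_open("ttitania-6", p))
```

If none of the ports ever opens in the remote scenario, the failure is in the launch itself (environment, permissions, or the remote shell), not in anything that connects afterwards.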
Need for advice - performance improvement and out of memory resolution
Hello,

I am working on a machine learning project, currently using spark-1.4.1-bin-hadoop2.6 in local mode on a laptop (Ubuntu 14.04 on a Dell laptop with an i7-5600 @ 2.6 GHz * 4 cores and 15.6 GB RAM). I am working in Python from an IPython notebook.

I face the following problem: with a DataFrame created from a 2.7 GB CSV file with the schema inferred (1900 features), counting the 145231 rows takes Spark 3:30 minutes using 4 cores. Even longer times are recorded for computing one feature's statistics, for example:

START AT: 2015-09-21 08:56:41.136947
+-------+------------------+
|summary|          VAR_1933|
+-------+------------------+
|  count|            145231|
|   mean| 8849.839111484464|
| stddev|3175.7863998269395|
|    min|                 0|
|    max|                  |
+-------+------------------+
FINISH AT: 2015-09-21 09:02:49.452260

So, my first question is: which configuration parameters should I set to improve this performance? I tried some explicit configuration in the IPython notebook, but specifying resources explicitly when creating the Spark configuration made performance worse. Concretely, this:

    config = SparkConf().setAppName("cleankaggle").setMaster("local[4]").set("spark.jars", jar_path)

ran twice as fast as this:

    config = SparkConf().setAppName("cleankaggle").setMaster("local[4]").set("spark.jars", jar_path).set("spark.driver.memory", "2g").set("spark.python.worker.memory", "3g")

Secondly, when I do the one-hot encoding (I tried two different ways of keeping the results), I never get as far as showing the head(1) of the resulting DataFrame.
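The START AT / FINISH AT timestamps above can be produced with a small wrapper around the Spark action; a minimal, Spark-free sketch of that pattern (the helper name is mine, and the lambda below is only a placeholder standing in for df.count()):

```python
from datetime import datetime

def timed(action):
    # Run `action`, printing start/finish timestamps around it and
    # returning its result plus the elapsed seconds.
    start = datetime.now()
    print("START AT:", start)
    result = action()
    finish = datetime.now()
    print("FINISH AT:", finish)
    return result, (finish - start).total_seconds()

# In the notebook this would wrap the Spark action, e.g.
#   rows, seconds = timed(lambda: df.count())
# here a plain computation stands in for it:
rows, seconds = timed(lambda: sum(range(1000001)))
```

Returning the elapsed seconds as a number makes it easy to compare configuration variants rather than eyeballing timestamps.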
We have the function (StringIndexer and OneHotEncoder come from pyspark.ml.feature):

    from pyspark.ml.feature import OneHotEncoder, StringIndexer

    def OHE_transform(categ_feat, df_old):
        outputcolname = categ_feat + "_ohe_index"
        outputcolvect = categ_feat + "_ohe_vector"
        stringIndexer = StringIndexer(inputCol=categ_feat, outputCol=outputcolname)
        indexed = stringIndexer.fit(df_old).transform(df_old)
        encoder = OneHotEncoder(inputCol=outputcolname, outputCol=outputcolvect)
        encoded = encoder.transform(indexed)
        return encoded

The two ways of keeping the results are shown below:

1)

    result = OHE_transform(list_top_feat[0], df_top_categ)
    for item in list_top_feat[1:]:
        result = OHE_transform(item, result)
    result.head(1)

2)

    df_result = OHE_transform("VAR_A", df_top_categ)
    df_result_1 = OHE_transform("VAR_B", df_result)
    df_result_2 = OHE_transform("VAR_C", df_result_1)
    ...
    df_result_12 = OHE_transform("VAR_X", df_result_11)
    df_result_12.head(1)

In the first approach, at the third iteration of the for loop, when it was supposed to print head(1), the IPython notebook stayed in the "Kernel busy" state for several hours, after which I interrupted the kernel.
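For what it's worth, the two variants build the same chain of transformations: approach 1 is just a left fold of OHE_transform over the feature list, with approach 2 spelling out the same chain step by step, which is why both hit the same wall. A Spark-free sketch of that folding structure (the toy transform below only records which columns each step would add; it is not the real encoder):

```python
from functools import reduce

def toy_transform(feature, cols):
    # Stand-in for OHE_transform: each step adds the two output
    # columns that the real StringIndexer/OneHotEncoder would create.
    return cols + [feature + "_ohe_index", feature + "_ohe_vector"]

features = ["VAR_A", "VAR_B", "VAR_C"]

# Approach 1, written as an explicit loop ...
cols_loop = []
for feat in features:
    cols_loop = toy_transform(feat, cols_loop)

# ... is the same left fold as chaining the calls one by one (approach 2)
cols_fold = reduce(lambda acc, feat: toy_transform(feat, acc), features, [])

assert cols_loop == cols_fold
```

Since the chains are identical, any difference in behaviour between the two variants comes from what happens around them (the intermediary head(1) calls), not from the chaining style itself.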
The second approach managed to go through all the transformations (note that here I eliminated the intermediary head(1) prints), but it gave an "out of memory" error at the single final head(1), which I paste below:

    df_result_12.head(1)

    ---------------------------------------------------------------------------
    Py4JJavaError                             Traceback (most recent call last)
    in ()
    ----> 1 df_result_12.head(1)

    /home/camelia/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in head(self, n)
        649             rs = self.head(1)
        650             return rs[0] if rs else None
    --> 651         return self.take(n)
        652
        653     @ignore_unicode_prefix

    /home/camelia/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in take(self, num)
        305         [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
        306         """
    --> 307         return self.limit(num).collect()
        308
        309     @ignore_unicode_prefix

    /home/camelia/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in collect(self)
        279         """
        280         with SCCallSiteSync(self._sc) as css:
    --> 281             port = self._sc._jvm.PythonRDD.collectAndServe(self._jdf.javaToPython().rdd())
        282         rs = list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
        283         cls = _create_cls(self.schema)

    /home/camelia/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
        536         answer = self.gateway_client.send_command(command)
        537         return_value = get_return_value(answer, self.gateway_client,
    --> 538             self.target_id, self.name)
        539
        540         for temp_arg in temp_args:

    /home/camelia/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        298             raise Py4JJavaError(
        299                 'An error occurred while calling {0}{1}{2}.\n'.
    --> 300                 format(target_id, '.',