GraphFrames and IPython notebook issue - No module named graphframes

2016-04-25 Thread Camelia Elena Ciolac
Hello,


I work locally on my laptop, not using the Databricks Community edition.


I downloaded graphframes-0.1.0-spark1.6.jar from
http://spark-packages.org/package/graphframes/graphframes
and placed it in a folder named spark_extra_jars, where I have other jars too.


After executing in a terminal:


ipython notebook --profile=nbserver



I open http://127.0.0.1:/ in the browser, and in my IPython notebook I have, among
other things:


jar_path = '/home/camelia/spark_extra_jars/spark-csv_2.11-1.2.0.jar,/home/camelia/spark_extra_jars/commons-csv-1.2.jar,/home/camelia/spark_extra_jars/graphframes-0.1.0-spark1.6.jar,/home/camelia/spark_extra_jars/spark-mongodb_2.10-0.11.0.jar'

config = SparkConf().setAppName("graph_analytics").setMaster("local[4]").set("spark.jars", jar_path)
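
(For context, the config above is then used roughly as follows in the notebook; this is a
paraphrase, and the SQLContext part and the variable names are mine, not the verbatim cells:)

# Paraphrased: creating the contexts from the config above.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(conf=config)   # spark.jars only puts the jars on the JVM classpath
sqlContext = SQLContext(sc)      # it does not add the package's Python files to sys.path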

I can successfully import the other modules, but when I do

import graphframes

it gives the following error:


ImportError                               Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 import graphframes

ImportError: No module named graphframes



Thank you in advance for any hint on how to import graphframes successfully.

Best regards,
Camelia


What happens in the master or slave launch?

2015-10-07 Thread Camelia Elena Ciolac
Hello,

I have the following question about two scenarios:

1) In the first scenario (when I am logged in on the target node), the master starts
successfully. Its log contains:

Spark Command: /usr/opt/java/jdk1.7.0_07/jre/bin/java -cp 
/home/camelia/spark-1.4.1-bin-hadoop2.6/sbin/../conf/:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar
 -Dspark.deploy.defaultCores=2 -Xms512m -Xmx512m -XX:MaxPermSize=256m 
org.apache.spark.deploy.master.Master --ip ttitania-6 --port 7077 --webui-port 
8080

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/10/07 13:23:55 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/10/07 13:23:55 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/10/07 13:23:56 INFO SecurityManager: Changing view acls to: camelia
15/10/07 13:23:56 INFO SecurityManager: Changing modify acls to: camelia
15/10/07 13:23:56 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(camelia); users 
with modify permissions: Set(camelia)
15/10/07 13:23:56 INFO Slf4jLogger: Slf4jLogger started
15/10/07 13:23:56 INFO Remoting: Starting remoting
15/10/07 13:23:57 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkMaster@ttitania-6:7077]
15/10/07 13:23:57 INFO Utils: Successfully started service 'sparkMaster' on 
port 7077.
15/10/07 13:23:57 INFO Utils: Successfully started service on port 6066.
15/10/07 13:23:57 INFO StandaloneRestServer: Started REST server for submitting 
applications on port 6066
15/10/07 13:23:57 INFO Master: Starting Spark master at spark://ttitania-6:7077
15/10/07 13:23:57 INFO Master: Running Spark version 1.4.1
15/10/07 13:23:57 INFO Utils: Successfully started service 'MasterUI' on port 
8080.
15/10/07 13:23:57 INFO MasterWebUI: Started MasterWebUI at 
http://129.16.20.156:8080
15/10/07 13:23:57 INFO Master: I have been elected leader! New state: ALIVE



2) In the second scenario (trying to start the Spark master remotely), the master only
partially starts and then shuts down. Its log is:

Spark Command: /usr/opt/java/jdk1.7.0_07/jre/bin/java -cp 
/home/camelia/spark-1.4.1-bin-hadoop2.6/sbin/../conf/:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar
 -Dspark.deploy.defaultCores=2 -Xms512m -Xmx512m -XX:MaxPermSize=256m 
org.apache.spark.deploy.master.Master --ip ttitania-6 --port 7077 --webui-port 
8080

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/10/07 12:56:31 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/10/07 12:56:32 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/10/07 12:56:32 INFO SecurityManager: Changing view acls to: camelia
15/10/07 12:56:32 INFO SecurityManager: Changing modify acls to: camelia
15/10/07 12:56:32 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(camelia); users 
with modify permissions: Set(camelia)
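
(To clarify what "starting the spark master remotely" means in scenario 2: I am not logged
in on ttitania-6; the launch is driven from another machine, roughly equivalent to the
sketch below. The exact invocation is paraphrased; host name and install path are taken
from the logs above.)

# Paraphrased remote launch for scenario 2 (not the verbatim command).
import subprocess
subprocess.check_call([
    "ssh", "ttitania-6",
    "/home/camelia/spark-1.4.1-bin-hadoop2.6/sbin/start-master.sh",
])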



Can you please explain what the problem might be?

I look forward to your reply.
Many thanks,
Camelia


Need for advice - performance improvement and out of memory resolution

2015-09-30 Thread Camelia Elena Ciolac
Hello,

I am working on a machine learning project, currently using spark-1.4.1-bin-hadoop2.6
in local mode on a Dell laptop (Ubuntu 14.04, i7-5600 @ 2.6 GHz * 4 cores, 15.6 GB RAM).
I work in Python from an IPython notebook.


I face the following problem: when working with a DataFrame created from a CSV file
(2.7 GB, schema inferred, 1900 features), counting the 145231 rows takes 3 minutes 30
seconds using 4 cores. Even longer times are recorded when computing one feature's
statistics, for example:



START AT: 2015-09-21 08:56:41.136947

+-------+------------------+
|summary|          VAR_1933|
+-------+------------------+
|  count|            145231|
|   mean| 8849.839111484464|
| stddev|3175.7863998269395|
|    min|                 0|
|    max|                  |
+-------+------------------+


FINISH AT: 2015-09-21 09:02:49.452260
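
(For reference, the two operations timed above are essentially the following; the
DataFrame variable name df is my placeholder, not the exact cell:)

# Paraphrased: the two timed operations on the CSV-backed DataFrame.
df.count()                      # counting the 145231 rows (the ~3:30 min above)
df.describe("VAR_1933").show()  # summary statistics for one feature (the table above)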




So, my first question would be: what configuration parameters should I set in order to
improve this performance?

I tried some explicit configuration in the IPython notebook, but specifying resources
explicitly when creating the Spark configuration resulted in worse performance. I mean:

config = SparkConf().setAppName("cleankaggle").setMaster("local[4]").set("spark.jars", jar_path)

ran twice as fast as:

config = SparkConf().setAppName("cleankaggle").setMaster("local[4]").set("spark.jars", jar_path).set("spark.driver.memory", "2g").set("spark.python.worker.memory", "3g")





Secondly, when I do the one-hot encoding (I tried two different ways of keeping the
results), I never get to show the head(1) of the resulting DataFrame. I use the
following function:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

def OHE_transform(categ_feat, df_old):
    # Index the categorical column, then one-hot encode the index into a vector column.
    outputcolname = categ_feat + "_ohe_index"
    outputcolvect = categ_feat + "_ohe_vector"
    stringIndexer = StringIndexer(inputCol=categ_feat, outputCol=outputcolname)
    indexed = stringIndexer.fit(df_old).transform(df_old)
    encoder = OneHotEncoder(inputCol=outputcolname, outputCol=outputcolvect)
    encoded = encoder.transform(indexed)
    return encoded


The two ways of keeping the results are shown below:

1)

result = OHE_transform(list_top_feat[0], df_top_categ)
for item in list_top_feat[1:]:
    result = OHE_transform(item, result)
    result.head(1)


2)

df_result = OHE_transform("VAR_A", df_top_categ)
df_result_1 = OHE_transform("VAR_B", df_result)
df_result_2 = OHE_transform("VAR_C", df_result_1)
...
df_result_12 = OHE_transform("VAR_X", df_result_11)
df_result_12.head(1)

In the first approach, at the third iteration of the for loop, when it was supposed to
show the head(1), the IPython notebook remained in the "Kernel busy" state for several
hours, and I eventually interrupted the kernel.
The second approach went through all the transformations (note that here I eliminated
the intermediate head(1) calls), but it gave an "out of memory" error at the single
final head(1), which I paste below:

===


df_result_12.head(1)

---
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 df_result_12.head(1)

/home/camelia/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in 
head(self, n)
649 rs = self.head(1)
650 return rs[0] if rs else None
--> 651 return self.take(n)
652
653 @ignore_unicode_prefix

/home/camelia/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in 
take(self, num)
305 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
306 """
--> 307 return self.limit(num).collect()
308
309 @ignore_unicode_prefix

/home/camelia/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in 
collect(self)
279 """
280 with SCCallSiteSync(self._sc) as css:
--> 281 port = 
self._sc._jvm.PythonRDD.collectAndServe(self._jdf.javaToPython().rdd())
282 rs = list(_load_from_socket(port, 
BatchedSerializer(PickleSerializer(
283 cls = _create_cls(self.schema)

/home/camelia/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:

/home/camelia/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', 
