GraphFrames and IPython notebook issue - No module named graphframes

2016-04-25 Thread Camelia Elena Ciolac
Hello,


I work locally on my laptop, not using DataBricks Community edition.


I downloaded graphframes-0.1.0-spark1.6.jar from
http://spark-packages.org/package/graphframes/graphframes

and placed it in a folder named spark_extra_jars, where I have other jars too.


After executing in a terminal:

ipython notebook --profile=nbserver

I open http://127.0.0.1:/ in the browser, and in my IPython notebook I have, among others:


jar_path = '/home/camelia/spark_extra_jars/spark-csv_2.11-1.2.0.jar,/home/camelia/spark_extra_jars/commons-csv-1.2.jar,/home/camelia/spark_extra_jars/graphframes-0.1.0-spark1.6.jar,/home/camelia/spark_extra_jars/spark-mongodb_2.10-0.11.0.jar'

config = SparkConf().setAppName("graph_analytics").setMaster("local[4]").set("spark.jars", jar_path)

I can successfully import the other modules, but when I do

import graphframes

it gives the error:

ImportError                               Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 import graphframes

ImportError: No module named graphframes



Thank you in advance for any hint on how to import graphframes successfully.

Best regards,
Camelia


What happens in the master or slave launch?

2015-10-07 Thread Camelia Elena Ciolac
Hello,

I have the following question:

I have two scenarios:
1) In the first scenario (when I am logged in on the target node), the master starts successfully. Its log contains:

Spark Command: /usr/opt/java/jdk1.7.0_07/jre/bin/java -cp 
/home/camelia/spark-1.4.1-bin-hadoop2.6/sbin/../conf/:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar
 -Dspark.deploy.defaultCores=2 -Xms512m -Xmx512m -XX:MaxPermSize=256m 
org.apache.spark.deploy.master.Master --ip ttitania-6 --port 7077 --webui-port 
8080

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/10/07 13:23:55 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/10/07 13:23:55 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/10/07 13:23:56 INFO SecurityManager: Changing view acls to: camelia
15/10/07 13:23:56 INFO SecurityManager: Changing modify acls to: camelia
15/10/07 13:23:56 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(camelia); users 
with modify permissions: Set(camelia)
15/10/07 13:23:56 INFO Slf4jLogger: Slf4jLogger started
15/10/07 13:23:56 INFO Remoting: Starting remoting
15/10/07 13:23:57 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkMaster@ttitania-6:7077]
15/10/07 13:23:57 INFO Utils: Successfully started service 'sparkMaster' on 
port 7077.
15/10/07 13:23:57 INFO Utils: Successfully started service on port 6066.
15/10/07 13:23:57 INFO StandaloneRestServer: Started REST server for submitting 
applications on port 6066
15/10/07 13:23:57 INFO Master: Starting Spark master at spark://ttitania-6:7077
15/10/07 13:23:57 INFO Master: Running Spark version 1.4.1
15/10/07 13:23:57 INFO Utils: Successfully started service 'MasterUI' on port 
8080.
15/10/07 13:23:57 INFO MasterWebUI: Started MasterWebUI at 
http://129.16.20.156:8080
15/10/07 13:23:57 INFO Master: I have been elected leader! New state: ALIVE



2) In the second scenario (trying to start the Spark master remotely, using qrsh since ssh between cluster nodes is disallowed), the master only partially starts and then shuts down. Its log is:

Spark Command: /usr/opt/java/jdk1.7.0_07/jre/bin/java -cp 
/home/camelia/spark-1.4.1-bin-hadoop2.6/sbin/../conf/:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/camelia/spark-1.4.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar
 -Dspark.deploy.defaultCores=2 -Xms512m -Xmx512m -XX:MaxPermSize=256m 
org.apache.spark.deploy.master.Master --ip ttitania-6 --port 7077 --webui-port 
8080

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/10/07 12:56:31 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/10/07 12:56:32 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/10/07 12:56:32 INFO SecurityManager: Changing view acls to: camelia
15/10/07 12:56:32 INFO SecurityManager: Changing modify acls to: camelia
15/10/07 12:56:32 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(camelia); users 
with modify permissions: Set(camelia)



Can you please explain what might be the problem? Which step of the master's startup sequence determines this early shutdown?

I look forward to your reply.
Many thanks,
Camelia



Need for advice - performance improvement and out of memory resolution

2015-09-30 Thread Camelia Elena Ciolac
Hello,

I am working on a machine learning project, currently using spark-1.4.1-bin-hadoop2.6 in local mode on a laptop (Ubuntu 14.04 on a Dell laptop with an i7-5600 @ 2.6 GHz, 4 cores, 15.6 GB RAM). I work in Python from an IPython notebook.


I face the following problem: when working with a DataFrame created from a CSV file (2.7 GB) with schema inferred (1900 features), it takes Spark 3:30 minutes to count the 145231 rows using 4 cores. Longer times are recorded for computing one feature's statistics, for example:



START AT: 2015-09-21 08:56:41.136947

+-------+------------------+
|summary|          VAR_1933|
+-------+------------------+
|  count|            145231|
|   mean| 8849.839111484464|
| stddev|3175.7863998269395|
|    min|                 0|
|    max|                  |
+-------+------------------+


FINISH AT: 2015-09-21 09:02:49.452260




So, my first question is: what configuration parameters should I set in order to improve this performance?

I tried some explicit configuration in the IPython notebook, but specifying resources explicitly when creating the Spark configuration resulted in worse performance. I mean:

config = SparkConf().setAppName("cleankaggle").setMaster("local[4]").set("spark.jars", jar_path)

ran twice as fast as:

config = SparkConf().setAppName("cleankaggle").setMaster("local[4]").set("spark.jars", jar_path).set("spark.driver.memory", "2g").set("spark.python.worker.memory ", "3g")





Secondly, when I do the one-hot encoding (I tried two different ways of keeping the results), I never get as far as showing the head(1) of the resulting dataframe. We have the function:

def OHE_transform(categ_feat, df_old):
    outputcolname = categ_feat + "_ohe_index"
    outputcolvect = categ_feat + "_ohe_vector"
    stringIndexer = StringIndexer(inputCol=categ_feat, outputCol=outputcolname)
    indexed = stringIndexer.fit(df_old).transform(df_old)
    encoder = OneHotEncoder(inputCol=outputcolname, outputCol=outputcolvect)
    encoded = encoder.transform(indexed)
    return encoded


The two ways of keeping the results are shown below:

1)

result = OHE_transform(list_top_feat[0], df_top_categ)
for item in list_top_feat[1:]:
    result = OHE_transform(item, result)
result.head(1)


2)

df_result = OHE_transform("VAR_A", df_top_categ)
df_result_1 = OHE_transform("VAR_B", df_result)
df_result_2 = OHE_transform("VAR_C", df_result_1)
...
df_result_12 = OHE_transform("VAR_X", df_result_11)
df_result_12.head(1)

In the first approach, at the third iteration of the for loop, when it was supposed to print head(1), the IPython notebook remained in the "Kernel busy" state for several hours, after which I interrupted the kernel.
The second approach made it through all the transformations (note that here I eliminated the intermediary head(1) prints), but gave an "out of memory" error at the single final head(1), which I paste below:

===


df_result_12.head(1)

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 df_result_12.head(1)

/home/camelia/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in head(self, n)
    649             rs = self.head(1)
    650             return rs[0] if rs else None
--> 651         return self.take(n)
    652
    653     @ignore_unicode_prefix

/home/camelia/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in take(self, num)
    305         [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    306         """
--> 307         return self.limit(num).collect()
    308
    309     @ignore_unicode_prefix

/home/camelia/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in collect(self)
    279         """
    280         with SCCallSiteSync(self._sc) as css:
--> 281             port = self._sc._jvm.PythonRDD.collectAndServe(self._jdf.javaToPython().rdd())
    282             rs = list(_load_from_socket(port, BatchedSerializer(PickleSerializer((
    283         cls = _create_cls(self.schema)

/home/camelia/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538             self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/home/camelia/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, g

