number of partitions in join: Spark documentation misleading!

2015-06-15 Thread mrm
Hi all,

I was looking for an explanation on the number of partitions for a joined
rdd.

The documentation of Spark 1.3.1 says that:
For distributed shuffle operations like reduceByKey and join, the largest
number of partitions in a parent RDD.
https://spark.apache.org/docs/latest/configuration.html

And the Partitioner.scala comments (line 51) state that:
Unless spark.default.parallelism is set, the number of partitions will be
the same as the number of partitions in the largest upstream RDD, as this
should be least likely to cause out-of-memory errors.

But this is misleading for the Python API: if you do rddA.join(rddB), the
number of partitions of the output is the number of partitions of rddA plus
the number of partitions of rddB!
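
To make it concrete, here is a minimal sketch of what I am seeing (it assumes
a running SparkContext sc; the partition counts are just examples):

rddA = sc.parallelize([(i, i) for i in range(100)], 4)    # 4 partitions
rddB = sc.parallelize([(i, -i) for i in range(100)], 6)   # 6 partitions
joined = rddA.join(rddB)
print(joined.getNumPartitions())
# the docs suggest max(4, 6) = 6, but this prints 4 + 6 = 10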



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/number-of-partitions-in-join-Spark-documentation-misleading-tp23316.html



Re: cannot access port 4040

2015-06-11 Thread mrm
Hi,

Akhil Das suggested using SSH tunnelling (ssh -L 4040:127.0.0.1:4040
master-ip, then opening localhost:4040 in the browser), and this solved my
problem, which made me think the settings of my cluster were wrong.

So I checked the inbound rules for the security group of my cluster and
realised that port 4040 was missing!

Now I can log back into port 4040.
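
For anyone hitting the same thing, the fix boils down to adding an inbound
rule for TCP 4040 to the master's security group. A rough sketch of doing
that programmatically, assuming boto3 and placeholder values (untested):

import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')    # placeholder region
ec2.authorize_security_group_ingress(
    GroupId='sg-0123456789abcdef0',                   # placeholder: the cluster's security group
    IpProtocol='tcp',
    FromPort=4040,
    ToPort=4040,
    CidrIp='203.0.113.7/32',                          # placeholder: your own IP, not 0.0.0.0/0
)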



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/cannot-access-port-4040-tp23248p23271.html



Re: Running Spark in Local Mode

2015-06-11 Thread mrm
Hi, 

Did you resolve this? I have the same questions.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-Local-Mode-tp22279p23278.html



Re: cannot access port 4040

2015-06-10 Thread mrm
Hi Akhil,

(Your reply does not appear in the mailing list but I received an email so I
will reply here).

I already have an application running in the pyspark shell. I can see the
application listed on port 8080, but I cannot reach it through port 4040;
the connection times out after a while. I tried relaunching my cluster using
the spark-ec2 script, but still no success.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/cannot-access-port-4040-tp23248p23251.html



Re: cannot access port 4040

2015-06-10 Thread mrm
Hi Akhil,

Thanks for your reply! I still cannot see port 4040 in my machine when I
type master-ip-address:4040 in my browser.

I have tried this command: netstat -nat | grep 4040, and it returns:
tcp        0      0 :::4040            :::*            LISTEN

Logging into my master is not a problem since I can access port 8080 by
writing master-ip-address:8080 in my browser.

I have made sure that spark.ui.enabled was set to True by launching my
application using: ~/spark/bin/pyspark --conf spark.ui.enabled=True.

I don't know if this is a symptom of my problem, but it might be another
piece of useful information. When I look at Completed Applications on port
8080, I see my two previous applications: one of them says cores: 160, while
the most recent one says cores: 0. Could this be a clue?







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/cannot-access-port-4040-tp23248p23252.html



cannot access port 4040

2015-06-10 Thread mrm
Hi,

I am using Spark 1.3.1 standalone and I have a problem: my cluster is
working fine, I can access port 8080 and check that my EC2 instances are
fine, but I cannot access port 4040.

I have tried sbin/stop-all.sh, sbin/stop-master.sh, and exiting the Spark
context and restarting it, to no avail.

Any clues on what to try next?

Thanks,
Maria



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/cannot-access-port-4040-tp23248.html



SparkSQL nested dictionaries

2015-06-08 Thread mrm
Hi,

Is it possible to query a data structure that is a dictionary within a
dictionary?

I have a parquet file with a structure:
test
|key1: {key_string: val_int}
|key2: {key_string: val_int}

If I try to do:
parquetFile.test
-- Column<test>

parquetFile.test.key2
-- AttributeError: 'Column' object has no attribute 'key2'

Similarly, if I try to do a SQL query, it throws this error:

org.apache.spark.sql.AnalysisException: GetField is not valid on fields of
type MapType(StringType,MapType(StringType,IntegerType,true),true);

Is this at all possible with the Python API in Spark SQL?
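
For completeness, this is what I plan to try next (just a sketch; I am
assuming Column.getItem is the right accessor for MapType columns, and
'some_key_string' is a made-up key):

parquetFile.select(parquetFile.test.getItem('key2')).show()
parquetFile.select(parquetFile.test.getItem('key2').getItem('some_key_string')).show()   # one level deeper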

Thanks,
Maria



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-nested-dictionaries-tp23207.html



Re: Spark 1.2. loses often all executors

2015-03-23 Thread mrm
Hi, 

I have received three replies to my question via personal e-mail; why don't
they also show up on the mailing list? I would like to reply to the three
users in a single thread.

Thanks,
Maria



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-loses-often-all-executors-tp22162p22187.html



Spark 1.2. loses often all executors

2015-03-20 Thread mrm
Hi,

I recently changed from Spark 1.1 to Spark 1.2, and I noticed that it loses
all executors whenever my Python code has any bug (like looking up a key
that does not exist in a dictionary). In earlier versions, it would raise an
exception but would not lose all the executors.
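
A minimal sketch of the kind of bug that triggers it (the data and the
missing key are made up):

d = {'a': 1, 'b': 2}
rdd = sc.parallelize(['a', 'b', 'c'])
rdd.map(lambda k: d[k]).count()   # 'c' is not in d, so the task raises KeyError
# on 1.2 this kind of failure now takes down all the executors for me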

Anybody with a similar problem?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-loses-often-all-executors-tp22162.html



optimize multiple filter operations

2014-11-28 Thread mrm
Hi, 

My question is:

I have multiple filter operations where I split my initial rdd into two
different groups. The two groups cover the whole initial set. In code, it's
something like:

set1 = initial.filter(lambda x: x == something)
set2 = initial.filter(lambda x: x != something)

By doing this, I am doing two passes over the data. Is there any way to
optimise this to do it in a single pass?
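
To make it concrete, here is a sketch of the closest I have come so far
(same names as above; `something` stands for the real predicate):

# (a) cache the parent so the two filters at least reuse the in-memory data
initial.cache()
set1 = initial.filter(lambda x: x == something)
set2 = initial.filter(lambda x: x != something)

# (b) tag each record with the predicate in one map and branch downstream,
#     which works when only per-group aggregates are needed
tagged = initial.map(lambda x: (x == something, x))
counts = tagged.countByKey()   # single pass: {True: ..., False: ...}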

Note: I looked through the mailing list to see whether this question had
already been asked, but could not find it.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/optimize-multiple-filter-operations-tp20010.html



Re: advantages of SparkSQL?

2014-11-25 Thread mrm
Thank you for answering, this is all very helpful!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/advantages-of-SparkSQL-tp19661p19753.html



advantages of SparkSQL?

2014-11-24 Thread mrm
Hi,

Is there any advantage to storing data in Parquet format and loading it
using the SparkSQL context, but never registering it as a table or using SQL
on it? Something like:

data = sqc.parquetFile(path)
results = data.map(lambda x: applyfunc(x.field))

Is this faster/more optimised than having the data stored as a text file and
using Spark (non-SQL) to process it?
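
For comparison, the plain-text alternative I have in mind looks roughly like
this (assuming a CSV-like layout, purely for illustration):

raw = sc.textFile(text_path)                                    # text_path is a placeholder
results = raw.map(lambda line: applyfunc(line.split(',')[0]))   # field extracted by hand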



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/advantages-of-SparkSQL-tp19661.html



Re: spark-ec2 - HDFS doesn't start on AWS EC2 cluster

2014-10-08 Thread mrm
They reverted to a previous version of the spark-ec2 script and things are
working again!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-HDFS-doesn-t-start-on-AWS-EC2-cluster-tp15921p15945.html



Re: spark-ec2 script with Tachyon

2014-09-26 Thread mrm
Hi,

Did you manage to figure this out? I would appreciate it if you could share
the answer.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-script-with-Tachyon-tp9996p15249.html



Spark not installed + no access to web UI

2014-09-11 Thread mrm
Hi,

I have been launching Spark in the same way for the past few months, but I
have only recently started to have problems with it. I launch Spark using
the spark-ec2 script, but then I cannot access the web UI when I type
address:8080 into the browser (it doesn't work with lynx from the master
node either), and I cannot find pyspark in the usual spark/bin/pyspark
location. Any hints as to what might be happening?

Thanks in advance!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-not-installed-no-access-to-web-UI-tp13952.html



Re: Spark not installed + no access to web UI

2014-09-11 Thread mrm
I tried 1.0.0, 1.0.1 and 1.0.2. I also tried the latest github commit. 

After several hours of trying to launch it, it now seems to be working. This
is what I did (not sure which of these steps actually helped):
1/ clone the spark repo onto the master node
2/ run sbt/sbt assembly
3/ copy the spark and spark-ec2 directories to my slaves
4/ launch the cluster again with --resume

Now I can finally access the web UI and Spark is properly installed!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-not-installed-no-access-to-web-UI-tp13952p13957.html



Re: hadoop version

2014-07-23 Thread mrm
Thank you!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/hadoop-version-tp10405p10485.html


driver memory

2014-07-23 Thread mrm
Hi,

How do I increase the driver memory? These are my configs right now:

sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > ./ephemeral-hdfs/conf/log4j.properties
sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > spark/conf/log4j.properties
# Environment variables and Spark properties
export SPARK_WORKER_MEMORY=30g # whole memory per worker node, independent of
# application (default: total memory on worker node minus 1 GB)
# SPARK_WORKER_CORES = total number of cores an application can use on a machine
# SPARK_WORKER_INSTANCES = how many workers per machine? Limit the number of
# cores per worker if there is more than one worker on a machine
export SPARK_JAVA_OPTS="-Dspark.executor.memory=30g
-Dspark.speculation.quantile=0.5 -Dspark.speculation=true
-Dspark.cores.max=80 -Dspark.akka.frameSize=1000 -Dspark.rdd.compress=true"
# spark.executor.memory = memory taken by Spark on a machine
export SPARK_DAEMON_MEMORY=2g

In the application UI, it says my driver has 295 MB of memory. I am trying
to broadcast a variable that is about 0.15 GB and it is throwing OutOfMemory
errors, so I want to see whether increasing the driver memory fixes this.

Thanks!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/driver-memory-tp10486.html


Re: driver memory

2014-07-23 Thread mrm
Hi,

I figured out my problem, so I wanted to share my findings. I was basically
trying to broadcast an array with 4 million elements and a size of
approximately 150 MB. Every time I tried to broadcast it, I got an
OutOfMemory error. I fixed the problem by increasing the driver memory using:
export SPARK_MEM=2g

Using SPARK_DAEMON_MEMORY or spark.executor.memory did not help in this
case! I don't have a good understanding of all these settings, and I have
the feeling many people are in the same situation.
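
Roughly, this is the shape of what was failing before the change (the data
itself is made up; the sizes match what I describe above):

big_list = list(range(4 * 1000 * 1000))   # ~4 million elements, ~150 MB
bc = sc.broadcast(big_list)               # raised OutOfMemory until the driver memory was increased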



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/driver-memory-tp10486p10489.html


Re: gain access to persisted rdd

2014-07-22 Thread mrm
Ok, thanks for the answers. Unfortunately, there is no sc.getPersistentRDDs
for pyspark.
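
A possible workaround might be to reach the Scala-side method through py4j;
this is only a sketch, and it relies on the private _jsc handle, so treat it
as an assumption rather than a supported API:

scala_sc = sc._jsc.sc()                     # underlying Scala SparkContext
persistent = scala_sc.getPersistentRDDs()   # scala.collection.Map[Int, RDD[_]]
print(persistent.size())                    # number of persisted RDDs on the JVM side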



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/gain-access-to-persisted-rdd-tp10313p10393.html


hadoop version

2014-07-22 Thread mrm
Hi,

Where can I find the version of Hadoop my cluster is using? I launched my
EC2 cluster using the spark-ec2 script with the --hadoop-major-version=2
option. However, the hadoop-native/lib folder on the master node only
contains files that end in 1.0.0. Does that mean I have Hadoop version 1?
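
One way to check from within pyspark, as a sketch (it goes through the
private py4j gateway, so I am not sure it is the recommended route):

print(sc._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion())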

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/hadoop-version-tp10405.html


gain access to persisted rdd

2014-07-21 Thread mrm
Hi,

I am using pyspark and have persisted a list of RDDs within a function, but
I don't have a reference to them anymore. The RDDs are listed in the UI,
under the Storage tab, and they have names associated with them (e.g. 4). Is
it possible to access these RDDs in order to unpersist them?

Thanks!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/gain-access-to-persisted-rdd-tp10313.html


Re: LiveListenerBus throws exception and weird web UI bug

2014-07-21 Thread mrm
I have the same error! Did you manage to fix it?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/LiveListenerBus-throws-exception-and-weird-web-UI-bug-tp8330p10324.html


running scrapy (or any other scraper) on the cluster?

2014-07-10 Thread mrm
Hi all,

Has anybody tried to run scrapy on a cluster? If so, I would appreciate
hearing about the general approach that was taken (multiple spiders? a
single spider? how to distribute URLs across nodes? etc.). I would also be
interested in hearing about any experience running a different scraper on a
cluster; maybe scrapy is not the best one.

Thank you!

Maria



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/running-scrapy-or-any-other-scraper-on-the-cluster-tp9286.html


Getting different answers running same line of code

2014-06-19 Thread mrm
Hi,

I have had this issue for some time already, where I get different answers
when I run the same line of code twice. I have run some experiments to see
what is happening; please help me! Here is the code and the answers that I
get. I suspect I have a problem when reading large datasets from S3.

rd1 = sc.textFile('s3n://blabla')
rd1.persist()
rd2 = rd1.filter(lambda x: filter1(x)).map(lambda x: map1(x))

Note: both filter1() and map1() are deterministic

rd2.count()  == 294928559
rd2.count()  == 294928559

So far so good, I get the same counts. Now, when I unpersist rd1, that's
when I start getting problems!

rd1.unpersist()
rd2 = rd1.filter(lambda x: filter1(x)).map(lambda x: map1(x))
rd2.count()  == 294928559
rd2.count()  == 294509501
rd2.count()  == 294679795
...

I would appreciate it if you could help me!

Thanks,
Maria





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Getting-different-answers-running-same-line-of-code-tp7920.html


list of persisted rdds

2014-06-13 Thread mrm
Hi,

How do I check the rdds that I have persisted? I have some code that looks
like:
rd1.cache()

rd2.cache()
...

rdN.cache()

How can I unpersist all rdd's at once? And is it possible to get the names
of the rdd's that are currently persisted (list = rd1, rd2, ..., rdN)?
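
The only workaround I can think of is to keep my own list of references as I
cache, e.g. (using the placeholder names from above):

persisted = []
for rd in (rd1, rd2, rdN):
    rd.cache()
    persisted.append(rd)

# later, unpersist everything at once
for rd in persisted:
    rd.unpersist()

but I was hoping there is something built in.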

Thank you!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/list-of-persisted-rdds-tp7564.html


Re: list of persisted rdds

2014-06-13 Thread mrm
Hi Daniel,

Thank you for your help! This is the sort of thing I was looking for.
However, when I type sc.getPersistentRDDs, I get the error:
AttributeError: 'SparkContext' object has no attribute 'getPersistentRDDs'

I don't get any error when I type sc.defaultParallelism, for example.

I would appreciate it if you could help me with this; I have tried different
approaches and googled it! I suspect it might be a silly error, but I can't
figure it out.

Maria



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/list-of-persisted-rdds-tp7564p7569.html


Re: list of persisted rdds

2014-06-13 Thread mrm
Hi Nick,

Thank you for the reply, I forgot to mention I was using pyspark in my first
message.

Maria



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/list-of-persisted-rdds-tp7564p7581.html


Loading Python libraries into Spark

2014-06-05 Thread mrm
Hi,

I am new to Spark (and almost new to Python!). How can I download and
install a Python library on my cluster so that I can just import it later?

Any help would be much appreciated.

Thanks!





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Loading-Python-libraries-into-Spark-tp7059.html


Re: Loading Python libraries into Spark

2014-06-05 Thread mrm
Hi Andrei,

Thank you for your help! Just to make sure I understand: when I run the
command sc.addPyFile("/path/to/yourmodule.py"), I need to already be logged
into the master node and have my Python files somewhere on it, is that
correct?
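
For my own notes, this is how I currently understand the flow (the path and
function name are placeholders):

sc.addPyFile("/home/ec2-user/yourmodule.py")   # file must exist on the machine running the driver

def use_module(x):
    import yourmodule                          # imported inside the task, on the executors
    return yourmodule.some_function(x)         # hypothetical function

print(sc.parallelize(range(10)).map(use_module).collect())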



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Loading-Python-libraries-into-Spark-tp7059p7073.html