number of partitions in join: Spark documentation misleading!

2015-06-15 Thread mrm
Hi all, I was looking for an explanation of the number of partitions for a joined RDD. The documentation for Spark 1.3.1 says: "For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD."
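
A minimal PySpark sketch of the behaviour in question (assumes a running SparkContext `sc`; the partition counts are illustrative):

    a = sc.parallelize([(i, i) for i in range(100)], 4)  # parent with 4 partitions
    b = sc.parallelize([(i, i) for i in range(100)], 8)  # parent with 8 partitions
    joined = a.join(b)
    print(joined.getNumPartitions())      # expect 8: the larger parent's count
    joined = a.join(b, numPartitions=16)  # or request a partition count explicitly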

Re: cannot access port 4040

2015-06-11 Thread mrm
Hi, Akhil Das suggested using ssh tunnelling (ssh -L 4040:127.0.0.1:4040 master-ip, then opening localhost:4040 in a browser), and this solved my problem, so it made me think that the settings of my cluster were wrong. So I checked the inbound rules for the security group of my cluster and I…

Re: Running Spark in Local Mode

2015-06-11 Thread mrm
Hi, Did you resolve this? I have the same questions.

Re: cannot access port 4040

2015-06-10 Thread mrm
Hi Akhil, (Your reply does not appear in the mailing list, but I received an email, so I will reply here.) I have an application already running in the shell using pyspark. I can see the application running on port 8080, but I cannot log into it through port 4040. It says the connection timed out…

Re: cannot access port 4040

2015-06-10 Thread mrm
Hi Akhil, Thanks for your reply! I still cannot see port 4040 on my machine when I type master-ip-address:4040 in my browser. I have tried this command: netstat -nat | grep 4040 and it returns this: tcp 0 0 :::4040 :::* LISTEN. Logging into…

cannot access port 4040

2015-06-10 Thread mrm
Hi, I am using Spark 1.3.1 standalone and I have a problem: my cluster is working fine, I can see port 8080 and check that my ec2 instances are fine, but I cannot access port 4040. I have tried sbin/stop-all.sh, sbin/stop-master.sh, and exiting the Spark context and restarting it, to no avail.

SparkSQL nested dictionaries

2015-06-08 Thread mrm
Hi, Is it possible to query a data structure that is a dictionary within a dictionary? I have a parquet file with a structure: test |key1: {key_string: val_int} |key2: {key_string: val_int}. If I try to do: parquetFile.test -- Column<test>; parquetFile.test.key2 -- AttributeError:…
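
A hedged sketch of how nested fields can be reached with the Spark 1.3 DataFrame API (the column and key names come from the message above; the parquet path is hypothetical):

    df = sqc.parquetFile("/path/to/data.parquet")
    df.select(df.test.getItem("key2")).show()   # if test is a map column
    df.select(df.test.getField("key2")).show()  # if test is a struct column
    # or via SQL, after registering a temporary table:
    df.registerTempTable("t")
    sqc.sql("SELECT test['key2'] FROM t").show()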

Re: Spark 1.2. loses often all executors

2015-03-23 Thread mrm
Hi, I have received three replies to my question on my personal e-mail; why don't they also show up on the mailing list? I would like to reply to the 3 users through a thread. Thanks, Maria

Spark 1.2. loses often all executors

2015-03-20 Thread mrm
Hi, I recently changed from Spark 1.1 to Spark 1.2, and I noticed that it loses all executors whenever I have any Python code bug (like looking up a key in a dictionary that does not exist). In earlier versions, it would raise an exception but it would not lose all executors. Anybody with a…
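
The failure mode described is easy to reproduce; a minimal sketch with hypothetical data (assumes a running SparkContext `sc`):

    d = {"a": 1}
    rdd = sc.parallelize(["a", "b"])
    rdd.map(lambda k: d[k]).count()      # raises KeyError on the workers for "b"
    rdd.map(lambda k: d.get(k)).count()  # defensive variant: yields None instead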

optimize multiple filter operations

2014-11-28 Thread mrm
Hi, My question is: I have multiple filter operations where I split my initial rdd into two different groups, and the two groups cover the whole initial set. In code, it's something like: set1 = initial.filter(lambda x: x == something) set2 = initial.filter(lambda x: x != something) By doing…
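
The preview cuts off, but a common concern with this pattern is that `initial` is recomputed once per filter; persisting the parent first avoids that. A hedged sketch using the names from the message:

    initial = initial.cache()
    set1 = initial.filter(lambda x: x == something)
    set2 = initial.filter(lambda x: x != something)
    # both actions below now reuse the cached partitions instead of rebuilding them
    n1, n2 = set1.count(), set2.count()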

Re: advantages of SparkSQL?

2014-11-25 Thread mrm
Thank you for answering, this is all very helpful!

advantages of SparkSQL?

2014-11-24 Thread mrm
Hi, Is there any advantage to storing data in parquet format and loading it using the SparkSQL context, but never registering it as a table/using SQL on it? Something like: data = sqc.parquetFile(path) results = data.map(lambda x: applyfunc(x.field)) Is this faster/more optimised…
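
A hedged sketch of the trade-off being asked about (`sqc`, `path`, and `applyfunc` come from the message; the select line assumes the DataFrame API that arrived in Spark 1.3):

    data = sqc.parquetFile(path)
    # row-at-a-time lambda: every row is shipped to a Python worker process
    results = data.map(lambda x: applyfunc(x.field))
    # column projection stays in the JVM, and parquet reads only that column
    projected = data.select('field')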

Re: spark-ec2 - HDFS doesn't start on AWS EC2 cluster

2014-10-08 Thread mrm
They reverted to a previous version of the spark-ec2 script and things are working again!

Re: spark-ec2 script with Tachyon

2014-09-26 Thread mrm
Hi, Did you manage to figure this out? I would appreciate it if you could share the answer.

Spark not installed + no access to web UI

2014-09-11 Thread mrm
Hi, I have been launching Spark the same way for the past months, but I have only recently started to have problems with it. I launch Spark using the spark-ec2 script, but then I cannot access the web UI when I type address:8080 into the browser (it doesn't work with lynx either from the master…

Re: Spark not installed + no access to web UI

2014-09-11 Thread mrm
I tried 1.0.0, 1.0.1 and 1.0.2. I also tried the latest github commit. After several hours trying to launch it, it now seems to be working; this is what I did (not sure if any of these steps helped): 1/ clone the spark repo onto the master node 2/ run sbt/sbt assembly 3/ copy spark and spark-ec2…

Re: hadoop version

2014-07-23 Thread mrm
Thank you!

driver memory

2014-07-23 Thread mrm
Hi, How do I increase the driver memory? These are my configs right now: sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > ./ephemeral-hdfs/conf/log4j.properties sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > spark/conf/log4j.properties # Environment variables and Spark…

Re: driver memory

2014-07-23 Thread mrm
Hi, I figured out my problem, so I wanted to share my findings. I was basically trying to broadcast an array with 4 million elements and a size of approximately 150 MB. Every time I tried to broadcast, I got an OutOfMemory error. I fixed my problem by increasing the driver memory using:…
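
The archived message cuts off before the actual setting; for the Spark 1.0-era clusters discussed here, a typical way to raise driver memory was at launch time (the 4g value is illustrative, not from the original message):

    ./bin/pyspark --driver-memory 4g
    # or persistently, via conf/spark-defaults.conf:
    #   spark.driver.memory  4g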

Re: gain access to persisted rdd

2014-07-22 Thread mrm
Ok, thanks for the answers. Unfortunately, there is no sc.getPersistentRDDs for pyspark.

hadoop version

2014-07-22 Thread mrm
Hi, Where can I find the version of Hadoop my cluster is using? I launched my ec2 cluster using the spark-ec2 script with the --hadoop-major-version=2 option. However, the folder hadoop-native/lib on the master node only contains files that end in 1.0.0. Does that mean that I have Hadoop version…
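
One direct way to check, assuming the standard spark-ec2 layout referenced elsewhere in these messages:

    ephemeral-hdfs/bin/hadoop version   # prints the running Hadoop version string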

gain access to persisted rdd

2014-07-21 Thread mrm
Hi, I am using pyspark and have persisted a list of RDDs within a function, but I don't have a reference to them anymore. The RDDs are listed in the UI, under the Storage tab, and they have names associated with them (e.g. 4). Is it possible to access the RDDs to unpersist them? Thanks!
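
Since pyspark lacked sc.getPersistentRDDs at the time (see the reply above), a hedged workaround is to keep your own registry when persisting, e.g.:

    persisted = {}

    def persist_tracked(name, rdd):
        persisted[name] = rdd.cache()
        return persisted[name]

    # later: unpersist everything at once
    for rdd in persisted.values():
        rdd.unpersist()
    persisted.clear()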

Re: LiveListenerBus throws exception and weird web UI bug

2014-07-21 Thread mrm
I have the same error! Did you manage to fix it?

running scrapy (or any other scraper) on the cluster?

2014-07-10 Thread mrm
Hi all, Has anybody tried to run scrapy on a cluster? If yes, I would appreciate hearing about the general approach that was taken (multiple spiders? a single spider? how to distribute urls across nodes? etc.). I would also be interested in hearing about any experience running a different scraper…

Getting different answers running same line of code

2014-06-19 Thread mrm
Hi, I have had this issue for some time already, where I get different answers when I run the same line of code twice. I have run some experiments to see what is happening; please help me! Here is the code and the answers that I get. I suspect I have a problem when reading large datasets from S3.
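
A simple probe for the S3 suspicion, as a hedged sketch (the s3n bucket path is hypothetical; s3n:// was the usual scheme at the time):

    c1 = sc.textFile("s3n://my-bucket/my-data/*").count()
    c2 = sc.textFile("s3n://my-bucket/my-data/*").count()
    print(c1, c2)  # differing counts point at inconsistent reads, not logic bugs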

list of persisted rdds

2014-06-13 Thread mrm
Hi, How do I check the RDDs that I have persisted? I have some code that looks like: rd1.cache() rd2.cache() ... rdN.cache() How can I unpersist all RDDs at once? And is it possible to get the names of the RDDs that are currently persisted (list = rd1, rd2, ..., rdN)? Thank you!

Re: list of persisted rdds

2014-06-13 Thread mrm
Hi Daniel, Thank you for your help! This is the sort of thing I was looking for. However, when I type sc.getPersistentRDDs, I get the error AttributeError: 'SparkContext' object has no attribute 'getPersistentRDDs'. I don't get any error when I type sc.defaultParallelism, for example. I would…

Re: list of persisted rdds

2014-06-13 Thread mrm
Hi Nick, Thank you for the reply, I forgot to mention I was using pyspark in my first message. Maria

Loading Python libraries into Spark

2014-06-05 Thread mrm
Hi, I am new to Spark (and almost new to Python!). How can I download and install a Python library on my cluster so I can just import it later? Any help would be much appreciated. Thanks!

Re: Loading Python libraries into Spark

2014-06-05 Thread mrm
Hi Andrei, Thank you for your help! Just to make sure I understand: when I run the command sc.addPyFile("/path/to/yourmodule.py"), I need to be already logged into the master node and have my python files somewhere, is that correct?
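
For context, a hedged sketch of the addPyFile workflow under discussion (the path and the process function are hypothetical; assumes an existing RDD `rdd`):

    sc.addPyFile("/path/to/yourmodule.py")  # also accepts .zip and .egg archives
    import yourmodule                       # now importable inside shipped closures
    results = rdd.map(lambda x: yourmodule.process(x)).collect()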