number of partitions in join: Spark documentation misleading!

2015-06-15 Thread mrm
Hi all, I was looking for an explanation of the number of partitions for a joined RDD. The documentation for Spark 1.3.1 says: "For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD."
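
A minimal PySpark sketch of the behaviour in question (assumes a running SparkContext `sc`; the partition counts are illustrative):

    a = sc.parallelize([(i, i) for i in range(100)], 4)  # parent with 4 partitions
    b = sc.parallelize([(i, i) for i in range(100)], 8)  # parent with 8 partitions
    joined = a.join(b)
    print(joined.getNumPartitions())      # expect 8: the larger parent's count
    joined = a.join(b, numPartitions=16)  # or request a partition count explicitly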

Re: cannot access port 4040

2015-06-11 Thread mrm
Hi, Akhil Das suggested using ssh tunnelling (ssh -L 4040:127.0.0.1:4040 master-ip, then opening localhost:4040 in a browser), and this solved my problem, so it made me think that the settings of my cluster were wrong. So I checked the inbound rules for the security group of my cluster and I…

Re: Running Spark in Local Mode

2015-06-11 Thread mrm
Hi, Did you resolve this? I have the same questions.

Re: cannot access port 4040

2015-06-10 Thread mrm
Hi Akhil, (Your reply does not appear in the mailing list, but I received an email, so I will reply here.) I have an application already running in the shell using pyspark. I can see the application running on port 8080, but I cannot log into it through port 4040. It says the connection timed out…

Re: cannot access port 4040

2015-06-10 Thread mrm
Hi Akhil, Thanks for your reply! I still cannot see port 4040 on my machine when I type master-ip-address:4040 in my browser. I have tried this command: netstat -nat | grep 4040 and it returns this: tcp 0 0 :::4040 :::* LISTEN. Logging into…

cannot access port 4040

2015-06-10 Thread mrm
Hi, I am using Spark 1.3.1 standalone and I have a problem: my cluster is working fine, I can see port 8080 and check that my ec2 instances are fine, but I cannot access port 4040. I have tried sbin/stop-all.sh, sbin/stop-master.sh, and exiting the Spark context and restarting it, to no avail.

SparkSQL nested dictionaries

2015-06-08 Thread mrm
Hi, Is it possible to query a data structure that is a dictionary within a dictionary? I have a parquet file with a structure: test |key1: {key_string: val_int} |key2: {key_string: val_int}. If I try to do: parquetFile.test -- Column<test>; parquetFile.test.key2 -- AttributeError:…
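
A hedged sketch of how nested fields can be reached with the Spark 1.3 DataFrame API (the column and key names come from the message above; the parquet path is hypothetical):

    df = sqc.parquetFile("/path/to/data.parquet")
    df.select(df.test.getItem("key2")).show()   # if test is a map column
    df.select(df.test.getField("key2")).show()  # if test is a struct column
    # or via SQL, after registering a temporary table:
    df.registerTempTable("t")
    sqc.sql("SELECT test['key2'] FROM t").show()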

Re: Spark 1.2. loses often all executors

2015-03-23 Thread mrm
Hi, I have received three replies to my question on my personal e-mail; why don't they also show up on the mailing list? I would like to reply to the 3 users through a thread. Thanks, Maria

Spark 1.2. loses often all executors

2015-03-20 Thread mrm
Hi, I recently changed from Spark 1.1 to Spark 1.2, and I noticed that it loses all executors whenever I have any Python code bug (like looking up a key in a dictionary that does not exist). In earlier versions, it would raise an exception but it would not lose all executors. Anybody with a…
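
The failure mode described is easy to reproduce; a minimal sketch with hypothetical data (assumes a running SparkContext `sc`):

    d = {"a": 1}
    rdd = sc.parallelize(["a", "b"])
    rdd.map(lambda k: d[k]).count()      # raises KeyError on the workers for "b"
    rdd.map(lambda k: d.get(k)).count()  # defensive variant: yields None instead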

optimize multiple filter operations

2014-11-28 Thread mrm
Hi, My question is: I have multiple filter operations where I split my initial rdd into two different groups, and the two groups cover the whole initial set. In code, it's something like: set1 = initial.filter(lambda x: x == something) set2 = initial.filter(lambda x: x != something) By doing…
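
The preview cuts off, but a common concern with this pattern is that `initial` is recomputed once per filter; persisting the parent first avoids that. A hedged sketch using the names from the message:

    initial = initial.cache()
    set1 = initial.filter(lambda x: x == something)
    set2 = initial.filter(lambda x: x != something)
    # both actions below now reuse the cached partitions instead of rebuilding them
    n1, n2 = set1.count(), set2.count()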

Re: advantages of SparkSQL?

2014-11-25 Thread mrm
Thank you for answering, this is all very helpful!

advantages of SparkSQL?

2014-11-24 Thread mrm
Hi, Is there any advantage to storing data in parquet format and loading it using the SparkSQL context, but never registering it as a table/using SQL on it? Something like: data = sqc.parquetFile(path) results = data.map(lambda x: applyfunc(x.field)) Is this faster/more optimised…
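
A hedged sketch of the trade-off being asked about (`sqc`, `path`, and `applyfunc` come from the message; the select line assumes the DataFrame API that arrived in Spark 1.3):

    data = sqc.parquetFile(path)
    # row-at-a-time lambda: every row is shipped to a Python worker process
    results = data.map(lambda x: applyfunc(x.field))
    # column projection stays in the JVM, and parquet reads only that column
    projected = data.select('field')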

Re: spark-ec2 - HDFS doesn't start on AWS EC2 cluster

2014-10-08 Thread mrm
They reverted to a previous version of the spark-ec2 script and things are working again!

Re: spark-ec2 script with Tachyon

2014-09-26 Thread mrm
Hi, Did you manage to figure this out? I would appreciate it if you could share the answer.

Spark not installed + no access to web UI

2014-09-11 Thread mrm
Hi, I have been launching Spark the same way for the past months, but I have only recently started to have problems with it. I launch Spark using the spark-ec2 script, but then I cannot access the web UI when I type address:8080 into the browser (it doesn't work with lynx either from the master…

Re: Spark not installed + no access to web UI

2014-09-11 Thread mrm
I tried 1.0.0, 1.0.1 and 1.0.2. I also tried the latest github commit. After several hours trying to launch it, it now seems to be working; this is what I did (not sure if any of these steps helped): 1/ clone the spark repo onto the master node 2/ run sbt/sbt assembly 3/ copy spark and spark-ec2…

Re: hadoop version

2014-07-23 Thread mrm
Thank you!

driver memory

2014-07-23 Thread mrm
Hi, How do I increase the driver memory? These are my configs right now: sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > ./ephemeral-hdfs/conf/log4j.properties sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > spark/conf/log4j.properties # Environment variables and Spark…

Re: driver memory

2014-07-23 Thread mrm
Hi, I figured out my problem, so I wanted to share my findings. I was basically trying to broadcast an array with 4 million elements and a size of approximately 150 MB. Every time I tried to broadcast, I got an OutOfMemory error. I fixed my problem by increasing the driver memory using:…
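
The archived message cuts off before the actual setting; for the Spark 1.0-era clusters discussed here, a typical way to raise driver memory was at launch time (the 4g value is illustrative, not from the original message):

    ./bin/pyspark --driver-memory 4g
    # or persistently, via conf/spark-defaults.conf:
    #   spark.driver.memory  4g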

Re: gain access to persisted rdd

2014-07-22 Thread mrm
Ok, thanks for the answers. Unfortunately, there is no sc.getPersistentRDDs for pyspark.

hadoop version

2014-07-22 Thread mrm
Hi, Where can I find the version of Hadoop my cluster is using? I launched my ec2 cluster using the spark-ec2 script with the --hadoop-major-version=2 option. However, the folder hadoop-native/lib on the master node only contains files that end in 1.0.0. Does that mean that I have Hadoop version…
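
One direct way to check, assuming the standard spark-ec2 layout referenced elsewhere in these messages:

    ephemeral-hdfs/bin/hadoop version   # prints the running Hadoop version string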

gain access to persisted rdd

2014-07-21 Thread mrm
Hi, I am using pyspark and have persisted a list of RDDs within a function, but I don't have a reference to them anymore. The RDDs are listed in the UI, under the Storage tab, and they have names associated with them (e.g. 4). Is it possible to access the RDDs to unpersist them? Thanks!
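
Since pyspark lacked sc.getPersistentRDDs at the time (see the reply above), a hedged workaround is to keep your own registry when persisting, e.g.:

    persisted = {}

    def persist_tracked(name, rdd):
        persisted[name] = rdd.cache()
        return persisted[name]

    # later: unpersist everything at once
    for rdd in persisted.values():
        rdd.unpersist()
    persisted.clear()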

Re: LiveListenerBus throws exception and weird web UI bug

2014-07-21 Thread mrm
I have the same error! Did you manage to fix it?

running scrapy (or any other scraper) on the cluster?

2014-07-10 Thread mrm
Hi all, Has anybody tried to run scrapy on a cluster? If yes, I would appreciate hearing about the general approach that was taken (multiple spiders? a single spider? how to distribute urls across nodes? etc.). I would also be interested in hearing about any experience running a different scraper…

Getting different answers running same line of code

2014-06-19 Thread mrm
Hi, I have had this issue for some time already, where I get different answers when I run the same line of code twice. I have run some experiments to see what is happening; please help me! Here is the code and the answers that I get. I suspect I have a problem when reading large datasets from S3.
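
A simple probe for the S3 suspicion, as a hedged sketch (the s3n bucket path is hypothetical; s3n:// was the usual scheme at the time):

    c1 = sc.textFile("s3n://my-bucket/my-data/*").count()
    c2 = sc.textFile("s3n://my-bucket/my-data/*").count()
    print(c1, c2)  # differing counts point at inconsistent reads, not logic bugs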

list of persisted rdds

2014-06-13 Thread mrm
Hi, How do I check the RDDs that I have persisted? I have some code that looks like: rd1.cache() rd2.cache() ... rdN.cache() How can I unpersist all RDDs at once? And is it possible to get the names of the RDDs that are currently persisted (list = rd1, rd2, ..., rdN)? Thank you!

Re: list of persisted rdds

2014-06-13 Thread mrm
Hi Daniel, Thank you for your help! This is the sort of thing I was looking for. However, when I type sc.getPersistentRDDs, I get the error AttributeError: 'SparkContext' object has no attribute 'getPersistentRDDs'. I don't get any error when I type sc.defaultParallelism, for example. I would…

Re: list of persisted rdds

2014-06-13 Thread mrm
Hi Nick, Thank you for the reply, I forgot to mention I was using pyspark in my first message. Maria

Loading Python libraries into Spark

2014-06-05 Thread mrm
Hi, I am new to Spark (and almost new to Python!). How can I download and install a Python library on my cluster so I can just import it later? Any help would be much appreciated. Thanks!

Re: Loading Python libraries into Spark

2014-06-05 Thread mrm
Hi Andrei, Thank you for your help! Just to make sure I understand: when I run the command sc.addPyFile("/path/to/yourmodule.py"), I need to be already logged into the master node and have my python files somewhere, is that correct?
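
For context, a hedged sketch of the addPyFile workflow under discussion (the path and the process function are hypothetical; assumes an existing RDD `rdd`):

    sc.addPyFile("/path/to/yourmodule.py")  # also accepts .zip and .egg archives
    import yourmodule                       # now importable inside shipped closures
    results = rdd.map(lambda x: yourmodule.process(x)).collect()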