number of partitions in join: Spark documentation misleading!
Hi all,

I was looking for an explanation of the number of partitions of a joined RDD.

The Spark 1.3.1 documentation (https://spark.apache.org/docs/latest/configuration.html) says that, for distributed shuffle operations like reduceByKey and join, the default is the largest number of partitions in a parent RDD. And the comments in Partitioner.scala (line 51) state: "Unless spark.default.parallelism is set, the number of partitions will be the same as the number of partitions in the largest upstream RDD, as this should be least likely to cause out-of-memory errors."

But this is misleading for the Python API: if you do rddA.join(rddB), the number of partitions of the output is the number of partitions of A plus the number of partitions of B!
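For reference, here is a small way to check this from the pyspark shell (toy data, and assuming spark.default.parallelism is not set):

    rddA = sc.parallelize([(i, i) for i in range(100)], 4)
    rddB = sc.parallelize([(i, 2 * i) for i in range(100)], 6)
    joined = rddA.join(rddB)
    # With spark.default.parallelism unset, this prints 10 (4 + 6)
    # rather than 6, which is the behaviour described above.
    print(joined.getNumPartitions())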
Re: cannot access port 4040
Hi,

Akhil Das suggested ssh tunnelling (ssh -L 4040:127.0.0.1:4040 master-ip, and then open localhost:4040 in the browser), and that solved my problem, which made me think the settings of my cluster were wrong. So I checked the inbound rules for the security group of my cluster and realised that port 4040 was missing! Now I can reach port 4040 again.
Re: Running Spark in Local Mode
Hi,

Did you resolve this? I have the same questions.
Re: cannot access port 4040
Hi Akhil,

(Your reply does not appear in the mailing list, but I received an email, so I will reply here.)

I already have an application running in the pyspark shell. I can see the application on port 8080, but I cannot reach it through port 4040; it says the connection timed out after a while. I tried relaunching my cluster using the spark-ec2 script, but still no success.
Re: cannot access port 4040
Hi Akhil,

Thanks for your reply! I still cannot see port 4040 when I type master-ip-address:4040 in my browser. I have tried this command:

    netstat -nat | grep 4040

and it returns this:

    tcp        0      0 :::4040        :::*        LISTEN

Logging into my master is not a problem, since I can access port 8080 by writing master-ip-address:8080 in my browser. I have made sure that spark.ui.enabled was set to True by launching my application using:

    ~/spark/bin/pyspark --conf spark.ui.enabled=True

I don't know if this is a symptom of the problem that I have, but it might be another piece of useful information: when I look at Completed Applications on port 8080, I see my two previous applications. One of them says cores: 160, the last one says cores: 0. Could this be a clue?
cannot access port 4040
Hi,

I am using Spark 1.3.1 standalone, and my cluster is otherwise working fine: I can access port 8080 and check that my EC2 instances are fine, but I cannot access port 4040. I have tried sbin/stop-all.sh, sbin/stop-master.sh, and exiting the Spark context and restarting it, to no avail. Any clues on what to try next?

Thanks,
Maria
SparkSQL nested dictionaries
Hi,

Is it possible to query a data structure that is a dictionary within a dictionary? I have a parquet file with the structure:

    test
     |-- key1: {key_string: val_int}
     |-- key2: {key_string: val_int}

If I try to do:

    parquetFile.test       -> Column<test>
    parquetFile.test.key2  -> AttributeError: 'Column' object has no attribute 'key2'

Similarly, if I try a SQL query, it throws this error:

    org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type MapType(StringType,MapType(StringType,IntegerType,true),true);

Is this at all possible with the Python API in Spark SQL?

Thanks,
Maria
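For anyone who finds this thread later: since the field is a MapType rather than a struct, attribute access will not work, but looking keys up with Column.getItem (or bracket syntax in SQL) may. A rough sketch using the key names from my example, assuming sqlContext is the SQLContext the file was loaded with (not tested against the actual file):

    # getItem looks up a key in a MapType column; chaining handles the nested map
    parquetFile.select(parquetFile.test.getItem('key2').getItem('key_string')).show()

    # or via SQL, using bracket syntax for map keys
    parquetFile.registerTempTable('t')
    sqlContext.sql("SELECT test['key2']['key_string'] FROM t").show()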
Re: Spark 1.2. loses often all executors
Hi,

I have received three replies to my question on my personal e-mail; why don't they also show up on the mailing list? I would like to reply to the three users through a thread.

Thanks,
Maria
Spark 1.2. loses often all executors
Hi,

I recently moved from Spark 1.1 to Spark 1.2, and I noticed that it loses all executors whenever I have any Python bug in my code (like looking up a key that does not exist in a dictionary). In earlier versions it would raise an exception, but it would not lose all executors. Has anybody seen a similar problem?
optimize multiple filter operations
Hi,

My question is: I have multiple filter operations that split my initial RDD into two groups, and the two groups cover the whole initial set. In code, it is something like:

    set1 = initial.filter(lambda x: x == something)
    set2 = initial.filter(lambda x: x != something)

By doing this, I am making two passes over the data. Is there any way to optimise this to do it in a single pass?

Note: I tried to look in the mailing list to see if this question has been asked already, but could not find it.
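For reference, a sketch of the obvious workaround (not a true single-pass solution): Spark has no built-in way to split one RDD into two in a single transformation, but persisting the parent means the input is only read from its source once, and both filters then scan the cached partitions. Toy data below; `something` is just a placeholder value:

    initial = sc.parallelize(range(1000))   # stand-in for the real input RDD
    something = 42
    initial.persist()
    set1 = initial.filter(lambda x: x == something)
    set2 = initial.filter(lambda x: x != something)
    print(set1.count(), set2.count())       # two jobs, but one read of the input
    initial.unpersist()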
Re: advantages of SparkSQL?
Thank you for answering, this is all very helpful!
advantages of SparkSQL?
Hi,

Is there any advantage to storing data in Parquet format and loading it using the SparkSQL context, but never registering it as a table or using SQL on it? Something like:

    data = sqc.parquetFile(path)
    results = data.map(lambda x: applyfunc(x.field))

Is this faster/more optimised than having the data stored as a text file and using Spark (non-SQL) to process it?
Re: spark-ec2 - HDFS doesn't start on AWS EC2 cluster
They reverted to a previous version of the spark-ec2 script and things are working again!
Re: spark-ec2 script with Tachyon
Hi,

Did you manage to figure this out? I would appreciate it if you could share the answer.
Spark not installed + no access to web UI
Hi,

I have been launching Spark the same way for the past months, but I have only recently started to have problems with it. I launch Spark using the spark-ec2 script, but then I cannot access the web UI when I type address:8080 into the browser (it doesn't work with lynx from the master node either), and I cannot find pyspark in its usual spark/bin/ location. Any hints as to what might be happening?

Thanks in advance!
Re: Spark not installed + no access to web UI
I tried 1.0.0, 1.0.1 and 1.0.2. I also tried the latest GitHub commit. After several hours trying to launch it, it now seems to be working. This is what I did (not sure if any of these steps helped):

1. clone the spark repo onto the master node
2. run sbt/sbt assembly
3. copy the spark and spark-ec2 directories to my slaves
4. launch the cluster again with --resume

Now I can finally access the web UI and Spark is properly installed!
Re: hadoop version
Thank you!
driver memory
Hi,

How do I increase the driver memory? These are my configs right now:

    sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > ./ephemeral-hdfs/conf/log4j.properties
    sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > spark/conf/log4j.properties

    # Environment variables and Spark properties
    export SPARK_WORKER_MEMORY=30g  # total memory per worker node, independent of the application (default: total memory on the worker node minus 1 GB)
    # SPARK_WORKER_CORES = total number of cores an application can use on a machine
    # SPARK_WORKER_INSTANCES = how many workers per machine? Limit the number of cores per worker if there is more than one worker on a machine
    export SPARK_JAVA_OPTS="-Dspark.executor.memory=30g -Dspark.speculation.quantile=0.5 -Dspark.speculation=true -Dspark.cores.max=80 -Dspark.akka.frameSize=1000 -Dspark.rdd.compress=true"
    # spark.executor.memory = memory taken by Spark on a machine
    export SPARK_DAEMON_MEMORY=2g

In the application UI, it says my driver has 295 MB of memory. I am trying to broadcast a variable that is 0.15 GB and it is throwing OutOfMemory errors, so I am trying to see if increasing the driver memory fixes this.

Thanks!
Re: driver memory
Hi,

I figured out my problem, so I wanted to share my findings. I was basically trying to broadcast an array with 4 million elements and a size of approximately 150 MB. Every time I tried to broadcast, I got an OutOfMemory error. I fixed the problem by increasing the driver memory using:

    export SPARK_MEM=2g

Using SPARK_DAEMON_MEMORY or spark.executor.memory did not help in this case! I don't have a good understanding of all these settings and I have the feeling many people are in the same situation.
Re: gain access to persisted rdd
Ok, thanks for the answers. Unfortunately, there is no sc.getPersistentRDDs for pyspark.
hadoop version
Hi,

Where can I find the version of Hadoop my cluster is using? I launched my EC2 cluster using the spark-ec2 script with the --hadoop-major-version=2 option. However, the hadoop-native/lib folder on the master node only contains files whose names end in 1.0.0. Does that mean that I have Hadoop version 1?

Thanks!
gain access to persisted rdd
Hi,

I am using pyspark and have persisted a list of RDDs within a function, but I don't have a reference to them anymore. The RDDs are listed in the UI under the Storage tab, and they have names associated with them (e.g. 4). Is it possible to access the RDDs in order to unpersist them?

Thanks!
Re: LiveListenerBus throws exception and weird web UI bug
I have the same error! Did you manage to fix it?
running scrapy (or any other scraper) on the cluster?
Hi all,

Has anybody tried to run scrapy on a cluster? If yes, I would appreciate hearing about the general approach that was taken (multiple spiders? a single spider? how to distribute URLs across nodes? etc.). I would also be interested in hearing about any experience running a different scraper on a cluster; maybe scrapy is not the best one.

Thank you!
Maria
Getting different answers running same line of code
Hi,

I have had this issue for some time already, where I get different answers when I run the same line of code twice. I have run some experiments to see what is happening; please help me! Here is the code and the answers that I get. I suspect I have a problem when reading large datasets from S3.

    rd1 = sc.textFile('s3n://blabla')
    rd1.persist()
    rd2 = rd1.filter(lambda x: filter1(x)).map(lambda x: map1(x))

(Note: both filter1() and map1() are deterministic.)

    rd2.count() == 294928559
    rd2.count() == 294928559

So far so good, I get the same counts. Now when I unpersist rd1, that's when I start getting problems!

    rd1.unpersist()
    rd2 = rd1.filter(lambda x: filter1(x)).map(lambda x: map1(x))
    rd2.count() == 294928559
    rd2.count() == 294509501
    rd2.count() == 294679795
    ...

I would appreciate it if you could help me!

Thanks,
Maria
list of persisted rdds
Hi,

How do I check which RDDs I have persisted? I have some code that looks like:

    rd1.cache()
    rd2.cache()
    ...
    rdN.cache()

How can I unpersist all RDDs at once? And is it possible to get the names of the RDDs that are currently persisted (list = rd1, rd2, ..., rdN)?

Thank you!
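(From the replies below it sounds like the Python API does not expose the list of persisted RDDs, so a simple workaround sketch is to keep your own list of cached RDDs and loop over it; the parallelize calls are just toy stand-ins for rd1 ... rdN.)

    cached = [sc.parallelize(range(i, i + 100)) for i in range(3)]
    for rdd in cached:
        rdd.cache()
    # ... run the jobs that use the cached RDDs ...
    for rdd in cached:
        rdd.unpersist()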
Re: list of persisted rdds
Hi Daniel,

Thank you for your help! This is the sort of thing I was looking for. However, when I type sc.getPersistentRDDs, I get the error AttributeError: 'SparkContext' object has no attribute 'getPersistentRDDs'. I don't get any error when I type sc.defaultParallelism, for example. I would appreciate it if you could help me with this; I have tried different things and googled it! I suspect it might be a silly error, but I can't figure it out.

Maria
Re: list of persisted rdds
Hi Nick,

Thank you for the reply, I forgot to mention I was using pyspark in my first message.

Maria
Loading Python libraries into Spark
Hi,

I am new to Spark (and almost new to Python!). How can I download and install a Python library on my cluster so I can just import it later? Any help would be much appreciated.

Thanks!
Re: Loading Python libraries into Spark
Hi Andrei,

Thank you for your help! Just to make sure I understand: when I run the command sc.addPyFile('/path/to/yourmodule.py'), I need to already be logged into the master node and have my Python files somewhere on it, is that correct?
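(For later readers, a minimal sketch of the pattern, assuming the module file is on the machine where the pyspark shell/driver runs; yourmodule and its transform() function are made-up names:)

    sc.addPyFile('/path/to/yourmodule.py')    # ship the module to every executor

    def process(record):
        import yourmodule                      # import inside the task, after addPyFile
        return yourmodule.transform(record)    # hypothetical function in that module

    print(sc.parallelize(range(10)).map(process).collect())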