Hi all,
I was looking for an explanation of the number of partitions of a joined RDD.
The documentation of Spark 1.3.1 says that, for distributed shuffle operations like reduceByKey and join, the default number of partitions is the largest number of partitions in a parent RDD.
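A quick illustration with toy RDDs (not from the original question): the parent with more partitions usually determines the result, unless spark.default.parallelism or an explicit numPartitions overrides it.
a = sc.parallelize([(i, i) for i in range(100)], 4)    # 4 partitions
b = sc.parallelize([(i, -i) for i in range(100)], 8)   # 8 partitions
print(a.join(b).getNumPartitions())                    # typically 8, the larger parent
print(a.join(b, numPartitions=16).getNumPartitions())  # 16 when set explicitly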
Hi,
Akhil Das suggested using ssh tunnelling (ssh -L 4040:127.0.0.1:4040 master-ip, then opening localhost:4040 in the browser), and this solved my problem, so it made me think that the settings of my cluster were wrong.
So I checked the inbound rules for the security group of my cluster and I
Hi,
Did you resolve this? I have the same questions.
Hi Akhil,
(Your reply does not appear on the mailing list, but I received an email, so I will reply here.)
I have an application already running in the shell using pyspark. I can see the application running on port 8080, but I cannot log into it through port 4040; it says the connection timed out.
Hi Akhil,
Thanks for your reply! I still cannot reach port 4040 from my machine when I type master-ip-address:4040 in my browser.
I have tried this command: netstat -nat | grep 4040 and it returns this:
tcp        0      0 :::4040                 :::*                    LISTEN
Logging into
Hi,
I am using Spark 1.3.1 standalone and I have a problem: my cluster is working fine (I can see port 8080 and check that my EC2 instances are fine), but I cannot access port 4040.
I have tried sbin/stop-all.sh, sbin/stop-master.sh, and exiting the Spark context and restarting it, to no avail.
Hi,
Is it possible to query a data structure that is a dictionary within a
dictionary?
I have a parquet file with a structure:
test
|key1: {key_string: val_int}
|key2: {key_string: val_int}
if I try to do:
parquetFile.test
-- Column<test>
parquetFile.test.key2
-- AttributeError:
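A hedged sketch of two ways to reach a nested key (assuming test is stored as a map of maps, a SQLContext named sqc, and a placeholder path):
parquetFile = sqc.parquetFile('/path/to/file.parquet')
# Column.getItem works on map columns and can be chained for nested maps
parquetFile.select(parquetFile.test.getItem('key2').getItem('key_string')).show()
# or register the DataFrame and index the maps in SQL
parquetFile.registerTempTable('t')
sqc.sql("SELECT test['key2']['key_string'] FROM t").show()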
Hi,
I have received three replies to my question at my personal e-mail address; why don't they also show up on the mailing list? I would like to reply to the three users in a thread.
Thanks,
Maria
Hi,
I recently changed from Spark 1.1 to Spark 1.2, and I noticed that it loses all executors whenever I have any Python code bug (like looking up a key in a dictionary that does not exist). In earlier versions, it would raise an exception but would not lose all executors.
Anybody with a
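Independently of the executor issue, a hedged sketch (rdd and lookup are hypothetical names) of the defensive pattern that stops the KeyError from being raised inside tasks in the first place:
def safe_lookup(key):
    return lookup.get(key)  # dict.get returns None instead of raising KeyError
results = rdd.map(safe_lookup).filter(lambda v: v is not None)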
Hi,
My question is:
I have multiple filter operations where I split my initial rdd into two
different groups. The two groups cover the whole initial set. In code, it's
something like:
set1 = initial.filter(lambda x: x == something)
set2 = initial.filter(lambda x: x != something)
By doing
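For illustration, the usual way to keep the second filter from recomputing the parent (names as in the snippet above; the source is a placeholder):
initial = sc.textFile('/path/to/data')  # placeholder source
initial.cache()                         # computed once, reused by both filters
set1 = initial.filter(lambda x: x == something)
set2 = initial.filter(lambda x: x != something)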
Thank you for answering, this is all very helpful!
Hi,
Is there any advantage to storing data in Parquet format and loading it using the SparkSQL context, but never registering it as a table / using SQL on it?
Something like:
data = sqc.parquetFile(path)
results = data.map(lambda x: applyfunc(x.field))
Is this faster/more optimised
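A hedged sketch of the trade-off, using the same names as above ('field' is the author's column name): selecting the column on the DataFrame first lets Parquet read only that column, which is where much of the format's benefit comes from, while the plain map deserialises whole rows into Python:
data = sqc.parquetFile(path)
results = data.map(lambda x: applyfunc(x.field))                   # whole rows into Python
results2 = data.select('field').map(lambda r: applyfunc(r.field))  # column-pruned read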
They reverted to a previous version of the spark-ec2 script and things are
working again!
Hi,
Did you manage to figure this out? I would appreciate if you could share the
answer.
Hi,
I have been launching Spark in the same way for the past few months, but have only recently started to have problems with it. I launch Spark using the
spark-ec2 script, but then I cannot access the web UI when I type
address:8080 into the browser (it doesn't work with lynx either from the
master
I tried 1.0.0, 1.0.1 and 1.0.2. I also tried the latest github commit.
After several hours of trying to launch it, it now seems to be working; this is what I did (not sure if any of these steps helped):
1/ clone the spark repo into the master node
2/ run sbt/sbt assembly
3/ copy spark and spark-ec2
Thank you!
Hi,
How do I increase the driver memory? These are my configs right now:
sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > ./ephemeral-hdfs/conf/log4j.properties
sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > spark/conf/log4j.properties
# Environment variables and Spark
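For reference, driver memory has to be fixed before the driver JVM starts, so it is usually raised outside the running shell (the 4g below is only illustrative):
./bin/spark-submit --driver-memory 4g your_script.py
or, set once in spark/conf/spark-defaults.conf:
spark.driver.memory 4g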
Hi,
I figured out my problem so I wanted to share my findings. I was basically
trying to broadcast an array with 4 million elements and a size of approximately 150 MB. Every time I tried to broadcast it, I got an
OutOfMemory error. I fixed my problem by increasing the driver memory using:
Ok, thanks for the answers. Unfortunately, there is no sc.getPersistentRDDs
for pyspark.
Hi,
Where can I find the version of Hadoop my cluster is using? I launched my
ec2 cluster using the spark-ec2 script with the --hadoop-major-version=2
option. However, the folder hadoop-native/lib in the master node only
contains files that end in 1.0.0. Does that mean that I have Hadoop version
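A hedged way to check from PySpark (this goes through the private _gateway handle, so treat it as a workaround; running hadoop version on the master gives the same answer):
print(sc._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion())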
Hi,
I am using pyspark and have persisted a list of RDDs within a function, but I don't have a reference to them any more. The RDDs are listed in the UI, under the Storage tab, and they have names associated with them (e.g. 4). Is it possible to access the RDDs to unpersist them?
Thanks!
I have the same error! Did you manage to fix it?
Hi all,
Has anybody tried to run scrapy on a cluster? If yes, I would appreciate
hearing about the general approach that was taken (multiple spiders? single
spider? how to distribute urls across nodes?...etc). I would also be
interested in hearing about any experience running a different scraper
Hi,
I have had this issue for some time already, where I get different answers
when I run the same line of code twice. I have run some experiments to see
what is happening, please help me! Here is the code and the answers that I
get. I suspect I have a problem when reading large datasets from S3.
Hi,
How do I check the RDDs that I have persisted? I have some code that looks like:
rd1.cache()
rd2.cache()
...
rdN.cache()
How can I unpersist all RDDs at once? And is it possible to get the names of the RDDs that are currently persisted (list = rd1, rd2, ..., rdN)?
Thank you!
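A hedged sketch of the keep-your-own-references approach, since PySpark's SparkContext in these versions has no getPersistentRDDs:
cached = {'rd1': rd1, 'rd2': rd2, 'rdN': rdN}  # build this up as you cache
for name, rdd in cached.items():
    rdd.cache()
# ... later, both the names and the handles are still available:
for name, rdd in cached.items():
    rdd.unpersist()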
Hi Daniel,
Thank you for your help! This is the sort of thing I was looking for.
However, when I type sc.getPersistentRDDs, I get the error
AttributeError: 'SparkContext' object has no attribute 'getPersistentRDDs'.
I don't get any error when I type sc.defaultParallelism for example.
I would
Hi Nick,
Thank you for the reply, I forgot to mention I was using pyspark in my first
message.
Maria
Hi,
I am new to Spark (and almost new to Python!). How can I download and install a Python library on my cluster so I can just import it later?
Any help would be much appreciated.
Thanks!
Hi Andrei,
Thank you for your help! Just to make sure I understand: when I run the command sc.addPyFile('/path/to/yourmodule.py'), I need to be already logged into the master node and have my Python files somewhere on it, is that correct?
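For illustration (module, function, rdd, and paths are placeholders): the file only needs to be readable by the driver process, which for a shell started on the master means a path on the master node:
sc.addPyFile('/path/on/the/driver/yourmodule.py')  # shipped to every executor
def use_it(x):
    import yourmodule          # import inside the task, after the file has shipped
    return yourmodule.some_function(x)
result = rdd.map(use_it).collect()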