Hi,
I would like to ask whether it is currently possible to use the spark-ec2
script with credentials that consist not only of aws_access_key_id and
aws_secret_access_key, but also contain an aws_security_token.
When I try to run the script I get the following error message:
Hi,
I am trying to do some logging in my PySpark jobs, particularly in a map that
is performed on the workers. Unfortunately I am not able to find these logs.
Based on the documentation it seems that the logs should be on the master, in
the work directory under SPARK_HOME.
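A minimal sketch of one way to get worker-side logs, assuming the usual sc
from the PySpark shell and Python's standard logging module (all names here
are illustrative):

    import logging

    def process(record):
        # Runs on an executor; WARNING and above go to the worker's stderr,
        # which ends up in the executor's stderr log (on YARN, retrievable
        # with: yarn logs -applicationId <appId>).
        logging.getLogger("my_map").warning("processing %r", record)
        return record

    sc.parallelize(["a", "b"]).map(process).count()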
So it seems that this problem was related to
http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html
and increasing the executor memory worked for me.
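For reference, a minimal sketch of one way to raise it (the 4g value is
illustrative, not from the thread):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().set("spark.executor.memory", "4g")
    sc = SparkContext(conf=conf)

The same can be passed as --executor-memory to spark-submit.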
__
Hi,
I am getting
I have tried merging the files into one, and Spark now works with RAM as I
expected.
Unfortunately, after doing this another problem appears. Now Spark running on
YARN schedules all the work to only one worker node, as one big job. Is there
some way to force Spark and
OK, so the problem is solved; it was that the file was gzipped, and it looks
like Spark does not support splitting a .gz file across workers.
Thank you very much for the suggestion to merge the files.
Best regards,
Jan
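For anyone hitting the same thing, a sketch of the resulting pattern (path
and partition count are illustrative): a gzipped file is not splittable, so
textFile() reads it in a single task, but repartition() afterwards spreads
the decompressed records across the workers:

    rdd = sc.textFile("s3n://mybucket/input.json.gz").repartition(16)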
__
I have
Could you please give me an example, or send me a link, of how to use Hadoop's
CombineFileInputFormat? It sounds very interesting to me and it would probably
save several hours of my pipeline's computation. Merging the files is
currently the bottleneck in my system.
I have 3 datasets; in all of them the average file size is 10-12 KB.
I am able to run my code on the dataset with 70K files, but I am not able to
run it on the datasets with 1.1M and 3.8M files.
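Not an authoritative answer, but a minimal PySpark sketch assuming Hadoop
2.x's CombineTextInputFormat (the concrete text subclass of
CombineFileInputFormat; the path and the 128 MB split cap are illustrative):

    packed = sc.newAPIHadoopFile(
        "s3n://mybucket/files/*/*/*.json",
        "org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf={"mapreduce.input.fileinputformat.split.maxsize": "134217728"})
    lines = packed.map(lambda kv: kv[1])  # drop the byte-offset keys

This packs many small files into a few large splits instead of one partition
per file.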
__
Hi,
I am using Spark on YARN, particularly Spark in Python. I am trying to run:
myrdd = sc.textFile("s3n://mybucket/files/*/*/*.json")
myrdd.getNumPartitions()
Unfortunately it seems that Spark tries to load everything into RAM, or at
least after a while of running this everything slows down and
Thank you, I would expect it to work as you write, but I am probably
experiencing it working the other way. It now seems that Spark is generally
trying to fit everything into RAM. I run Spark on YARN and I have wrapped
this into another question:
Now I am running into problems using:
distData = sc.textFile(sys.argv[2]).coalesce(10)
The problem is that Spark seems to try to put all the data into RAM first and
only then perform the coalesce. Do you know if there is something that would
do the coalesce on the fly, with for example a fixed size of
Hi,
I have input data consisting of many very small files, each containing one
.json. For performance reasons (I use PySpark) I have to do repartitioning;
currently I do:
sc.textFile(files).coalesce(100)
The problem is that I have to guess the number of partitions in such a way
that it's as fast as
Hi Ilya,
This seems to me quite a complicated solution; I'm thinking that an easier
(though not optimal) solution might be, for example, to heuristically use
something like RDD.coalesce(RDD.getNumPartitions() / N), but it makes me
wonder that Spark does not have something like
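Spelled out, that heuristic would look like this (N = 10 is an assumed shrink
factor, not a value from the thread):

    rdd = sc.textFile(files)
    rdd = rdd.coalesce(max(1, rdd.getNumPartitions() // 10))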
Yes, I would expect it as you say, that setting executor-cores to 1 would
work, but it seems to me that when I use executor-cores=1 it actually
performs more than one task on each of the machines at a time (at least
based on what top says).
Hi,
I am currently struggling with how to properly set Spark to perform only one
map, flatMap, etc. at a time. In other words, my map uses a multi-core
algorithm, so I would like to have only one map running per machine, to be
able to use all the machine's cores.
Thank you in advance for advice and replies.
Jan
But I guess that this makes only one task over all the cluster's nodes. I
would like to run several tasks, but I would like Spark not to run more than
one map on each of my nodes at a time. That is, I would like to have, let's
say, 4 different tasks and 2 nodes where each node has 2 cores.
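One knob that might fit here (an assumption on my part, not something
confirmed in this thread) is spark.task.cpus: setting it equal to the cores
per executor should make each executor run a single task at a time, while
that task is free to use all the cores. A sketch for the 2-nodes / 2-cores
case:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.executor.cores", "2")  # cores per executor
            .set("spark.task.cpus", "2"))      # each task claims both cores
    sc = SparkContext(conf=conf)

With 4 tasks queued, the two executors then process them two at a time, one
per node.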
So the problem was that Spark internally sets HOME to /home. A hack to make
this work in Python is to add the following line before the call to textblob:
os.environ['HOME'] = '/home/hadoop'
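In context, the hack looks roughly like this (the /home/hadoop path is
specific to this cluster, and textblob stands in for whatever triggers the
NLTK data lookup):

    import os
    os.environ['HOME'] = '/home/hadoop'  # must run before NLTK looks for its data
    from textblob import TextBlob        # now resolves corpora under $HOME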
__
Maybe I'll add one more question. I think that the problem is with the user,
so I would like to ask under which user Spark jobs are run on the slaves?
__
Hi,
I am trying to implement a function for text preprocessing in PySpark. I have
Amazon
Hi,
I would like to ask under which user the Spark program is run on the slaves?
My Spark is running on top of YARN.
The reason I am asking is that I need to download data for the NLTK library;
these data are downloaded for a specific Python user, and I am currently
struggling with this.
Thank you very much; the lot of very small json files was exactly the
performance problem. Using coalesce, my Spark program running on a single
node is only about twice as slow (even counting Spark startup) as the
single-node Python program, which is acceptable.
Jan
Hi,
I have a program written for execution on a single computer (in Python) and I
have also implemented the same for Spark. This program basically only reads
.json files, takes one field from each, and saves them back. Using Spark, my
program runs approximately 100 times slower on 1 master and 1 slave. So I
Hi,
I have seen in a video from the Spark Summit that usually (when I use HDFS)
the data are distributed across the whole cluster and the computation usually
goes to the data.
My question is how does it work when I read the data from Amazon S3? Is the
whole input dataset read by the master node
Hi,
I have a cluster that has several nodes, and every node has several cores.
I'd like to run a multi-core algorithm within every map, so I'd like to
ensure that only one map is performed per cluster node. Is there some way to
ensure this? It seems to me that it should be possible
I've tried to add / at the end of the path, but the result was exactly the
same. I also guess that there will be some problem at the level of the
Hadoop-S3 communication. Do you know if there is some possibility of running
scripts from Spark on, for example, a different Hadoop version from the
Thank you, this seems to be the way to go, but unfortunately, when I'm trying
to use sc.wholeTextFiles() on a file that is stored on Amazon S3, I'm getting
the following error:
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
14/10/08 06:09:50 INFO input.FileInputFormat:
My additional question is whether this problem could be caused by the fact
that my file is bigger than the RAM across the whole cluster?
__
Hi,
I'm trying to use sc.wholeTextFiles() on a file that is stored on Amazon S3,
and I'm getting
One more update: I've realized that this problem is not only Python-related.
I've tried it also in Scala, but I'm still getting the same error. My Scala
code:
val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first()
__
My
Hi,
I have already unsuccessfully asked quite a similar question on Stack
Overflow, particularly here:
http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim.
I've also tried a workaround, but unsuccessfully; the workaround problem
can be
Hi,
I would like to ask if it is possible to use a generator that generates data
bigger than the size of RAM across all the machines as the input for
sc.parallelize(generator), where sc = SparkContext(). I would like to create
an RDD this way. When I am trying to create an RDD by sc.textFile(file),
where file
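A sketch of one possible workaround (my suggestion, not something from the
thread): feed the generator to the driver in bounded batches and union the
pieces, so that parallelize() never sees one giant list (batch_size is
illustrative):

    from itertools import islice

    def rdd_from_generator(sc, gen, batch_size=100000):
        rdds = []
        while True:
            batch = list(islice(gen, batch_size))  # pull one bounded chunk
            if not batch:
                break
            rdds.append(sc.parallelize(batch))
        return sc.union(rdds)

Note that the batches still pass through the driver, so this loosens the
memory pressure of a single huge parallelize() call rather than removing the
bound entirely.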
Hi,
Thank you for your advice. It really might work, but to specify my problem a
bit more, think of my data more like one generated item is one parsed wikipedia
page. I am getting this generator from the parser and I don't want to save it
to the storage, but directly apply parallelize and
@Davies
I know that gensim.corpora.wikicorpus.extract_pages will for sure be the
bottleneck on the master node.
Unfortunately I am using Spark on EC2 and I don't have enough space on my
nodes to store the whole data that needs to be parsed by extract_pages. I
have my data on S3 and I kind