spark spark-ec2 credentials using aws_security_token

2015-07-27 Thread jan.zikes
Hi, I would like to ask if it is currently possible to use the spark-ec2 script together with credentials that consist not only of aws_access_key_id and aws_secret_access_key, but also contain aws_security_token.   When I try to run the script I am getting the following error message:  
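A minimal sketch of how such credentials might be passed, assuming spark-ec2 (via boto) picks up temporary credentials from environment variables; whether AWS_SECURITY_TOKEN is honored depends on the boto version, and all values and flags below are placeholders:

    # Hypothetical sketch: exporting temporary STS credentials before invoking
    # spark-ec2. Whether boto (used by spark-ec2) reads AWS_SECURITY_TOKEN
    # depends on its version, so treat this as an assumption to verify.
    import os
    import subprocess

    os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"
    os.environ["AWS_SECURITY_TOKEN"] = "<session-token>"   # temporary token

    subprocess.check_call([
        "./spark-ec2", "-k", "<keypair>", "-i", "<keyfile.pem>",
        "-s", "2", "launch", "my-cluster",
    ])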

Where can I find logs from workers PySpark

2014-11-11 Thread jan.zikes
Hi,  I am trying to do some logging in my PySpark jobs, particularly in a map that is performed on the workers. Unfortunately I am not able to find these logs. Based on the documentation it seems that the logs should be on the master in SPARK_HOME, directory work
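A minimal sketch of worker-side logging, assuming an existing SparkContext `sc` and a placeholder input path; output from code like this ends up in each executor's stderr (under the worker's work/ directory in standalone mode, or via the YARN logs/UI on YARN), not on the driver console:

    # Sketch: logging from inside a map running on the workers.
    import logging

    def parse_record(line):
        logging.basicConfig(level=logging.INFO)      # configured on the executor
        log = logging.getLogger("my_job")
        log.info("processing record of length %d", len(line))
        return line.strip()

    rdd = sc.textFile("hdfs:///data/input.txt").map(parse_record)
    rdd.count()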

Re: Spark on YARN, ExecutorLostFailure for long running computations in map

2014-11-08 Thread jan.zikes
So it seems that this problem was related to  http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html and increasing the executor memory worked for me. __ Hi, I am getting 
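For reference, a hedged sketch of raising executor memory from the PySpark configuration (the values are placeholders, not recommendations; on YARN the memory overhead setting can also be relevant):

    # Sketch: raising executor memory for long-running maps on YARN.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("long-running-map")
            .set("spark.executor.memory", "4g")
            .set("spark.yarn.executor.memoryOverhead", "512"))  # in MB, YARN-specific
    sc = SparkContext(conf=conf)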

Re: Spark on Yarn probably trying to load all the data to RAM

2014-11-05 Thread jan.zikes
I have tried merging the files into one, and Spark is now working with RAM as I expected. Unfortunately after doing this another problem appears. Now Spark running on YARN schedules all the work to only one worker node as one big job. Is there some way to force Spark and

Re: Spark on Yarn probably trying to load all the data to RAM

2014-11-05 Thread jan.zikes
Ok, so the problem is solved; it was that the file was gzipped, and it looks like Spark does not support direct distribution of .gz files to workers. Thank you very much for the suggestion to merge the files. Best regards, Jan  __ I have
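Since a .gz file is not splittable, the whole file lands in a single partition; a common workaround (sketch only, path and partition count are placeholders) is to repartition right after reading so that downstream stages are spread across the cluster:

    # Sketch: a gzipped file is read as one partition, so repartition
    # immediately after the read to spread later work across the nodes.
    rdd = sc.textFile("s3n://mybucket/merged-input.json.gz")
    rdd = rdd.repartition(100)        # number of partitions is a guess
    print(rdd.getNumPartitions())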

Re: Spark on Yarn probably trying to load all the data to RAM

2014-11-05 Thread jan.zikes
Could you please give me an example or send me a link on how to use Hadoop's CombineFileInputFormat? It sounds very interesting to me and it would probably save several hours of my pipeline's computation. Merging the files is currently the bottleneck in my system.
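One possible shape of this, as a sketch assuming Hadoop 2.x's CombineTextInputFormat and PySpark's newAPIHadoopFile; the path and the split size are illustrative:

    # Sketch: packing many small files into fewer splits with
    # CombineTextInputFormat (Hadoop 2.x). Split size below is a guess.
    rdd = sc.newAPIHadoopFile(
        "s3n://mybucket/files/*/*/*.json",
        "org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf={"mapreduce.input.fileinputformat.split.maxsize": "134217728"},  # 128 MB
    ).map(lambda kv: kv[1])   # keep only the text value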

Re: Spark on Yarn probably trying to load all the data to RAM

2014-11-03 Thread jan.zikes
I have 3 datasets; in all of them the average file size is 10-12 KB. I am able to run my code on the dataset with 70K files, but I am not able to run it on the datasets with 1.1M and 3.8M files.  __ On Sun, Nov 2, 2014 at 1:35 AM,  

Spark on Yarn probably trying to load all the data to RAM

2014-11-02 Thread jan.zikes
Hi, I am using Spark on YARN, particularly Spark in Python. I am trying to run: myrdd = sc.textFile("s3n://mybucket/files/*/*/*.json") myrdd.getNumPartitions() Unfortunately it seems that Spark tries to load everything into RAM, or at least after a while of running this everything slows down and

Re: Spark speed performance

2014-11-02 Thread jan.zikes
Thank you, I would expect it to work as you write, but I am apparently experiencing it working the other way. It now seems that Spark is generally trying to fit everything into RAM. I run Spark on YARN and I have wrapped this into another question: 

Re: Spark speed performance

2014-11-01 Thread jan.zikes
Now I am running into problems using: distData = sc.textFile(sys.argv[2]).coalesce(10)   The problem is that Spark seems to try to put all the data into RAM first and only then perform the coalesce. Do you know if there is something that would do the coalesce on the fly with, for example, a fixed size of

Repartitioning by partition size, not by number of partitions.

2014-10-31 Thread jan.zikes
Hi, my input data consist of many very small files, each containing one .json. For performance reasons (I use PySpark) I have to repartition; currently I do: sc.textFile(files).coalesce(100)   The problem is that I have to guess the number of partitions in such a way that it's as fast as

RE: Repartitioning by partition size, not by number of partitions.

2014-10-31 Thread jan.zikes
Hi Ilya, this seems to me like quite a complicated solution. I'm thinking that an easier (though not optimal) solution might be, for example, to heuristically use something like RDD.coalesce(RDD.getNumPartitions() / N), but it makes me wonder that Spark does not have something like
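The heuristic mentioned in the message, written out as a small sketch (the path and N, the guessed number of input files per resulting partition, are placeholders):

    # Sketch of the heuristic above: collapse roughly N input files per partition.
    N = 100                                   # guessed files-per-partition
    rdd = sc.textFile("s3n://mybucket/files/*/*/*.json")
    target = max(1, rdd.getNumPartitions() // N)
    rdd = rdd.coalesce(target)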

Re: How to set Spark to perform only one map at once at each cluster node

2014-10-31 Thread jan.zikes
Yes, I would expect it as you say: setting executor-cores to 1 should work, but it seems to me that when I do use executor-cores=1 it actually performs more than one job on each of the machines at the same moment (at least based on what top says).

How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread jan.zikes
Hi, I am currently struggling with how to set Spark to perform only one map, flatMap, etc. at once. In other words, my map uses a multi-core algorithm, so I would like to have only one map running per node in order to use all the machine's cores. Thank you in advance for advice and replies.  Jan 
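A hedged configuration sketch of two ways this is commonly approached (values are placeholders): either give each executor a single core so it runs one task at a time, or keep all cores but set spark.task.cpus to the machine's core count so a single task reserves the whole executor:

    # Sketch: config-level ways to get one running map task per node.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("one-map-per-node")
            # Option A: one core per executor (one running task per executor)
            .set("spark.executor.cores", "1"))
            # Option B (alternative): let one task reserve all executor cores
            # .set("spark.executor.cores", "8")
            # .set("spark.task.cpus", "8")
    sc = SparkContext(conf=conf)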

Re: How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread jan.zikes
But I guess that this makes only one task across all the cluster's nodes. I would like to run several tasks, but I would like Spark not to run more than one map on each of my nodes at a time. That means I would like to have, let's say, 4 different tasks and 2 nodes where each node has 2 cores.

Re: PySpark problem with textblob from NLTK used in map

2014-10-27 Thread jan.zikes
So the problem was that Spark has internally set HOME to /home. A hack to make this work with Python is to add, before the call to textblob, the line: os.environ['HOME'] = '/home/hadoop'  __ Maybe I'll add one more question. I think that the
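The hack from the message, sketched inside the map function so it runs on the worker before textblob/NLTK look for their data (the input path is a placeholder; the HOME value is the one from the thread):

    # Sketch of the workaround: fix HOME on the worker before importing textblob,
    # so NLTK/textblob resolve their data directory under /home/hadoop.
    def preprocess(text):
        import os
        os.environ['HOME'] = '/home/hadoop'   # Spark had HOME set to /home
        from textblob import TextBlob
        return list(TextBlob(text).words)

    words = sc.textFile("s3n://mybucket/docs/*.txt").map(preprocess)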

Re: PySpark problem with textblob from NLTK used in map

2014-10-24 Thread jan.zikes
Maybe I'll add one more question. I think that the problem is with the user, so I would like to ask under which user Spark jobs are run on the slaves? __ Hi, I am trying to implement a function for text preprocessing in PySpark. I have amazon

Under which user is the program run on slaves?

2014-10-24 Thread jan.zikes
Hi, I would like to ask under which user the Spark program is run on the slaves? My Spark is running on top of YARN. The reason I am asking is that I need to download data for the NLTK library; these data are downloaded for a specific Python user and I am currently struggling with this. 
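A sketch of one way to side-step the per-user question: download the NLTK data to an explicit directory on the workers and point nltk.data.path at it, so the executing user no longer matters (the directory and input path are placeholders):

    # Sketch: download NLTK data to a fixed directory and tell NLTK where it is,
    # once per partition rather than once per record.
    def tokenize_partition(lines):
        import nltk
        nltk_dir = "/home/hadoop/nltk_data"          # placeholder location
        nltk.data.path.append(nltk_dir)
        nltk.download("punkt", download_dir=nltk_dir, quiet=True)
        for line in lines:
            yield nltk.word_tokenize(line)

    tokens = sc.textFile("s3n://mybucket/docs/*.txt").mapPartitions(tokenize_partition)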

Re: Spark speed performance

2014-10-19 Thread jan.zikes
Thank you very much; a lot of very small json files was exactly the performance problem. Using coalesce, my Spark program on a single node is now only twice as slow (even with starting Spark) as the single-node Python program, which is acceptable. Jan 

Spark speed performance

2014-10-18 Thread jan.zikes
Hi, I have a program for single-computer execution (in Python) and I have also implemented the same for Spark. This program basically only reads .json, from which it takes one field and saves it back. Using Spark, my program runs approximately 100 times slower on 1 master and 1 slave. So I

Re: How does reading the data from Amazon S3 works?

2014-10-17 Thread jan.zikes
Hi,  I have seen in the video from the Spark Summit that usually (when I use HDFS) data are distributed across the whole cluster and the computation usually goes to the data. My question is how does it work when I read the data from Amazon S3? Is the whole input dataset read by the master node

How to assure that there will be run only one map per cluster node?

2014-10-17 Thread jan.zikes
Hi, I have a cluster that has several nodes and every node has several cores. I'd like to run a multi-core algorithm within every map, so I'd like to ensure that only one map is performed per cluster node. Is there some way to ensure this? It seems to me that it should be possible

Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:

2014-10-09 Thread jan.zikes
I've tried to add / at the end of the path, but the result was exactly the same. I also guess that there will be some problem on the level of Hadoop-S3 communication. Do you know if there is some possibility of how to run scripts from Spark on, for example, a different Hadoop version from the

Re: Parsing one big multiple line .xml loaded in RDD using Python

2014-10-08 Thread jan.zikes
Thank you, this seems to be the way to go, but unfortunately, when I'm trying to use sc.wholeTextFiles() on a file that is stored on Amazon S3 I'm getting the following error:   14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1 14/10/08 06:09:50 INFO input.FileInputFormat:
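For context, the wholeTextFiles approach referred to above looks roughly like this (the bucket path is the one from the thread; my_xml_parser is a hypothetical placeholder for whatever parser is used); each element is a (filename, full file content) pair, so a multi-line XML document arrives as one string:

    # Sketch: wholeTextFiles yields (path, whole-file-content) pairs, so each
    # multi-line XML document can be parsed as a single string.
    pages = sc.wholeTextFiles("s3n://wiki-dump/wikiinput")
    parsed = pages.map(lambda kv: my_xml_parser(kv[1]))   # my_xml_parser is hypothetical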

Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:

2014-10-08 Thread jan.zikes
My additional question is whether this problem can possibly be caused by the fact that my file is bigger than the RAM across the whole cluster?   __ Hi, I'm trying to use sc.wholeTextFiles() on a file that is stored on Amazon S3 and I'm getting

Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:

2014-10-08 Thread jan.zikes
One more update: I've realized that this problem is not only Python related. I've tried it also in Scala, but I'm still getting the same error. My Scala code: val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first() __ My

Parsing one big multiple line .xml loaded in RDD using Python

2014-10-07 Thread jan.zikes
Hi, I have already unsuccessfully asked a quite similar question on Stack Overflow, particularly here:  http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim. I've also tried some workarounds, unsuccessfully; the workaround problem can be

Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
Hi, I would like to ask if it is possible to use a generator that generates data bigger than the RAM across all the machines as the input for sc = SparkContext(); sc.parallelize(generator). I would like to create an RDD this way. When I am trying to create an RDD by sc.textFile(file), where file

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
Hi, thank you for your advice. It really might work, but to specify my problem a bit more: think of my data more as one generated item being one parsed Wikipedia page. I am getting this generator from the parser and I don't want to save it to storage, but directly apply parallelize and
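A sketch of one possible workaround, assuming the generator can be consumed in chunks on the driver: parallelize each chunk and union the pieces, so the whole stream is never materialized at once (though every chunk still passes through the driver). The batch size and wiki_page_generator are hypothetical:

    # Sketch: feed a large generator to Spark in driver-side chunks instead of
    # one huge sc.parallelize() call.
    from itertools import islice

    def chunks(gen, size):
        while True:
            batch = list(islice(gen, size))
            if not batch:
                break
            yield batch

    pages = wiki_page_generator()               # hypothetical generator of parsed pages
    rdd = sc.union([sc.parallelize(batch) for batch in chunks(pages, 10000)])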

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread jan.zikes
@Davies I know that gensim.corpora.wikicorpus.extract_pages will for sure be the bottleneck on the master node. Unfortunately I am using Spark on EC2 and I don't have enough space on my nodes to store the whole data that needs to be parsed by extract_pages. I have my data on S3 and I kind