Why does an executor throw OutOfMemoryException: Java heap space?

2015-03-26 Thread sergunok
Hi all, sometimes you can see OutOfMemoryException: Java heap space from an executor in Spark. There are many ideas about workarounds. My question is: how does an executor execute tasks from the point of view of memory usage and parallelism? The picture in my mind is: an executor is a JVM instance. Number
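
A minimal sketch of the knobs behind that picture, assuming configuration through SparkConf in PySpark (the same settings can be passed to spark-submit); the values below are placeholders, not recommendations:

    from pyspark import SparkConf, SparkContext

    # Each executor is one JVM. spark.executor.memory bounds its heap, and
    # spark.executor.cores sets how many tasks run in that JVM concurrently.
    # All concurrent tasks share the same heap, so more cores per executor
    # leaves less memory for each task.
    conf = (SparkConf()
            .setAppName("executor-memory-demo")
            .set("spark.executor.memory", "4g")    # JVM heap per executor
            .set("spark.executor.cores", "2")      # concurrent tasks per executor
            .set("spark.yarn.executor.memoryOverhead", "512"))  # off-heap headroom on YARN, in MB

    sc = SparkContext(conf=conf)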

Which RDD operations preserve ordering?

2015-03-26 Thread sergunok
Hi guys, I don't have a clear picture of how the ordering of RDD elements is preserved across operations. Which operations preserve it? 1) map (yes?) 2) zipWithIndex (yes, or only sometimes?) Serg.
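
A quick way to check this empirically, assuming a local PySpark shell: map transforms elements within their partitions without moving data, and zipWithIndex assigns indices following partition order, so both keep the original ordering (shuffling operations such as repartition or groupByKey do not):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "ordering-demo")
    rdd = sc.parallelize(range(10), 4)

    # map transforms elements in place within their partitions: order is kept.
    print(rdd.map(lambda x: x * 10).collect())   # [0, 10, 20, ..., 90]

    # zipWithIndex numbers elements by partition order, so the indices
    # follow the original ordering as well.
    print(rdd.zipWithIndex().collect())          # [(0, 0), (1, 1), ..., (9, 9)]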

Spark UI tunneling

2015-03-23 Thread sergunok
Is there a way to tunnel the Spark UI? I tried to tunnel client-node:4040 but my browser was redirected from localhost to some domain name that is only resolvable inside the cluster.. Maybe there is some startup option to make the Spark UI fully accessible through a single endpoint (address:port)? Serg.

log files of failed task

2015-03-23 Thread sergunok
Hi, I executed a task on Spark on YARN and it failed. I see just an executor lost message from YARNClientScheduler, no further details.. (I read this error can be connected to the spark.yarn.executor.memoryOverhead setting and have already played with this param) How to dig more deeply into the details in the log files

calculating TF-IDF for large 100GB dataset problems

2015-03-19 Thread sergunok
Hi, I am trying to vectorize a corpus of texts (about 500K texts in 13 files, 100GB in total) located in HDFS on a YARN cluster. This process has already taken about 20 hours on a 3-node cluster with 6 cores and 20GB RAM on each node. In my opinion that is too long :-) I started the task with the following command:
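
For reference, a minimal TF-IDF sketch with MLlib's HashingTF and IDF; the HDFS path, tokenization, and feature count below are placeholders, since the original command and pipeline are not shown:

    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF, IDF

    sc = SparkContext(appName="tfidf-demo")

    # Placeholder path; one document per line, tokenized by whitespace.
    docs = sc.textFile("hdfs:///corpus/*.txt").map(lambda line: line.split())

    # Hash terms into a fixed-size sparse vector space.
    tf = HashingTF(numFeatures=1 << 20).transform(docs)
    tf.cache()   # both fit() and transform() pass over these vectors

    idf = IDF().fit(tf)
    tfidf = idf.transform(tf)

    print(tfidf.first())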

RDD ordering after map

2015-03-18 Thread sergunok
Does map(...) preserve the ordering of the original RDD?

MEMORY_ONLY vs MEMORY_AND_DISK

2015-03-18 Thread sergunok
Which persistence level is better if the RDD to be cached is expensive to recalculate? Am I right that it is MEMORY_AND_DISK?
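
A short sketch of the difference, under the assumption that the RDD is costly to recompute: with MEMORY_AND_DISK, partitions that do not fit in memory are spilled to local disk and read back later, instead of being dropped and recomputed from the lineage as with MEMORY_ONLY:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="persist-demo")

    # Stand-in for an RDD that is expensive to recompute (heavy parsing, joins, ...).
    expensive = sc.textFile("hdfs:///input/*.txt").map(lambda line: line.upper())

    # MEMORY_ONLY: partitions that don't fit in memory are dropped and recomputed on access.
    # MEMORY_AND_DISK: partitions that don't fit are written to local disk instead.
    expensive.persist(StorageLevel.MEMORY_AND_DISK)

    print(expensive.count())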

Processing of text file in large gzip archive

2015-03-16 Thread sergunok
I have a 30GB gzip file (originally a text file where each line represents a text document) in HDFS, and Spark 1.2.0 on a YARN cluster with 3 worker nodes, 64GB RAM and 4 cores on each node. The replication factor for my file is 3. I tried to implement a simple pyspark script to parse this file
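
One detail that often dominates this workload: gzip is not a splittable format, so a single .gz file is read by one task as one partition no matter how big the cluster is. A common workaround, sketched below with a placeholder path, is to repartition immediately after reading so the rest of the job runs in parallel (or to store the data as many smaller gzip files or in a splittable format):

    from pyspark import SparkContext

    sc = SparkContext(appName="gzip-demo")

    # A single .gz file is not splittable: textFile() reads it in one task.
    docs = sc.textFile("hdfs:///data/corpus.txt.gz")

    # Redistribute lines so later stages can use all cores in the cluster.
    docs = docs.repartition(3 * 4)   # e.g. one partition per core (3 nodes x 4 cores)

    print(docs.count())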

SVD transform of large matrix with MLlib

2015-03-11 Thread sergunok
Has anybody used SVD from MLlib for a very large (like 10^6 x 10^7) sparse matrix? How long did it take? Which implementation of SVD is used in MLlib?
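
For reference, a minimal sketch of distributed SVD with RowMatrix.computeSVD. Note that the Python binding shown here only exists in later Spark releases (2.2+); in Spark 1.x this API was available from Scala/Java only. For large matrices MLlib computes the top singular values via an ARPACK-based eigendecomposition of the Gramian. The tiny matrix and k below are placeholders:

    from pyspark import SparkContext
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.linalg.distributed import RowMatrix

    sc = SparkContext(appName="svd-demo")

    # Toy stand-in for a large sparse term-document matrix: an RDD of sparse rows.
    rows = sc.parallelize([
        Vectors.sparse(5, {0: 1.0, 3: 2.0}),
        Vectors.sparse(5, {1: 3.0, 4: 1.0}),
        Vectors.sparse(5, {2: 4.0}),
    ])
    mat = RowMatrix(rows)

    # Keep only the top-k singular values/vectors; computeU=False saves work
    # when only the singular values and V are needed.
    svd = mat.computeSVD(k=2, computeU=False)
    print(svd.s)   # singular values
    print(svd.V)   # right singular vectors (local matrix)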

Cannot submit job to Spark on Windows

2015-02-26 Thread sergunok
Hi! I downloaded and extracted Spark to a local folder under Windows 7 and have successfully played with it in the pyspark interactive shell. BUT when I try to use spark-submit (for example: spark-submit pi.py) I get: C:\spark-1.2.1-bin-hadoop2.4\bin>spark-submit.cmd pi.py Using Spark's default log4j