Re: spark 1.2 writing on parquet after a join never ends - GC problems

2015-02-08 Thread Paolo Platter
Could anyone figure out what is going on in my Spark cluster? Thanks in advance, Paolo. Sent from my Windows Phone. From: Paolo Platter <paolo.plat...@agilelab.it> Sent: 06/02/2015 10:48 To: user@spark.apache.org Subject: spark

Re: Installing a python library along with ec2 cluster

2015-02-08 Thread gen tang
Hi, You can make an image of EC2 with all the Python libraries installed and create a bash script that exports the Python path in the /etc/init.d/ directory. Then you can launch the cluster with this image and ec2.py. Hope this can be helpful. Cheers, Gen. On Sun, Feb 8, 2015 at 9:46 AM, Chengi Liu
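A minimal sketch of that approach, assuming you bake the libraries into a custom image and drop a small startup script on it; the script location, library list, and AMI id are hypothetical:

    # run once on the instance you will snapshot into the image
    sudo pip install numpy pandas

    # hypothetical startup script, e.g. /etc/init.d/pyspark-env, exporting the Python path
    export PYTHONPATH=/usr/lib/python2.6/site-packages:$PYTHONPATH

    # then launch the cluster from that image with the spark-ec2 script
    ./ec2/spark-ec2 -k mykey -i mykey.pem --ami=ami-xxxxxxx launch mycluster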

Re: no space left at worker node

2015-02-08 Thread gen tang
Hi, In fact, I ran into this problem before; it is a bug on the AWS side. Which type of machine do you use? If my guess is right, you can check the file /etc/fstab: there will likely be a double mount of /dev/xvdb. If so, you should 1. stop HDFS, 2. umount /dev/xvdb at /, 3. restart HDFS. Hope this could be helpful.
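A sketch of those three steps as shell commands on the affected worker; the ephemeral-hdfs script paths are an assumption based on the usual spark-ec2 layout and may differ on your install:

    # 1. stop HDFS (path assumed)
    /root/ephemeral-hdfs/bin/stop-dfs.sh
    # 2. confirm the duplicate entry, then unmount the extra mount of /dev/xvdb
    grep xvdb /etc/fstab /proc/mounts
    umount /dev/xvdb
    # 3. restart HDFS
    /root/ephemeral-hdfs/bin/start-dfs.sh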

Mesos coarse mode not working (fine grained does)

2015-02-08 Thread Hans van den Bogert
://upperpaste.com/spark-1.2.0-bin-hadoop2.4.tgz' I0208 12:57:45.415575 25720 fetcher.cpp:126] Downloading 'http://upperpaste.com/spark-1.2.0-bin-hadoop2.4.tgz' to '/local/vdbogert/var/lib/mesos//slaves/20150206-110658-16813322-5050-5515-S1/frameworks/20150208-125721-906005770-5050-32371-

RE: no space left at worker node

2015-02-08 Thread ey-chih chow
Gen, Thanks for your information. The content of /etc/fstab at the worker node (r3.large) is:

    #LABEL=/  /         ext4    defaults,noatime  1 1
    tmpfs     /dev/shm  tmpfs   defaults          0 0
    devpts    /dev/pts  devpts  gid=5,mode=620    0 0
    sysfs     /sys      sysfs

Re: no space left at worker node

2015-02-08 Thread gen tang
Hi, In fact, /dev/sdb is /dev/xvdb. It seems that there is no double-mount problem. However, there is no information about /mnt2. You should check whether /dev/sdc is properly mounted or not. Michael's reply is a good solution for this type of problem. You can check his site. Cheers Gen
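A quick way to verify the mounts, as a sketch (device names vary by instance type):

    df -h                                  # are /mnt and /mnt2 listed with the expected sizes?
    grep -E 'xvd|sd' /proc/mounts          # which devices are actually mounted, and where
    grep -E 'mnt2|xvd|sd' /etc/fstab       # what the boot-time configuration expects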

RE: no space left at worker node

2015-02-08 Thread ey-chih chow
Thanks Gen. How can I check whether /dev/sdc is properly mounted or not? In general, the problem shows up when I submit the second or third job; the first job I submit will most likely succeed. Ey-Chih Chow

RE: no space left at worker node

2015-02-08 Thread ey-chih chow
Thanks Michael. I didn't edit core-site.xml; we use the default one. I only saw hadoop.tmp.dir in core-site.xml, pointing to /mnt/ephemeral-hdfs. How can I edit the config file? Best regards, Ey-Chih

Re: Mesos coarse mode not working (fine grained does)

2015-02-08 Thread Hans van den Bogert
] Downloading 'http://upperpaste.com/spark-1.2.0-bin-hadoop2.4.tgz' to '/local/vdbogert/var/lib/mesos//slaves/20150206-110658-16813322-5050-5515-S1/frameworks/20150208-125721-906005770-5050-32371-/executors/0/runs/cb525b32-387c-4698-a27e-8d4213080151/spark-1.2.0-bin-hadoop2.4.tgz' I0208 12:58

Re: Mesos coarse mode not working (fine grained does)

2015-02-08 Thread Tim Chen
/20150206-110658-16813322-5050-5515-S1/frameworks/20150208-125721-906005770-5050-32371-/executors/0/runs/cb525b32-387c-4698-a27e-8d4213080151/spark-1.2.0-bin-hadoop2.4.tgz' I0208 12:58:09.146960 25720 fetcher.cpp:64] Extracted resource '/local/vdbogert/var/lib/mesos//slaves/20150206-110658

Re: Spark concurrency question

2015-02-08 Thread Sean Owen
I think I have this right: You will run one executor per application per worker. Generally there is one worker per machine, and it manages all of the machine's resources. So if you want one app to use this whole machine you need to ask for 48G and 24 cores. That's better than splitting up the
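For example, a sketch of asking for the whole box from spark-shell against a standalone master (flag and property names as in the Spark 1.x submit options; the master URL is a placeholder):

    ./bin/spark-shell --master spark://master-host:7077 \
        --executor-memory 48g \
        --conf spark.cores.max=24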

Spark concurrency question

2015-02-08 Thread java8964
Hi, I have some questions about how Spark runs jobs concurrently. For example, I set up Spark on one standalone test box, which has 24 cores and 64G of memory. I set the worker memory to 48G and the executor memory to 4G, and use spark-shell to run some jobs. Here is something confusing

Re: [GraphX] Excessive value recalculations during aggregateMessages cycles

2015-02-08 Thread Kyle Ellrott
I changed the curGraph = curGraph.outerJoinVertices(curMessages)( (vid, vertex, message) => vertex.process(message.getOrElse(List[Message]()), ti) ).cache() to curGraph = curGraph.outerJoinVertices(curMessages)( (vid, vertex, message) => (vertex,

Re: no space left at worker node

2015-02-08 Thread gen tang
Hi, I am sorry that I made a mistake. r3.large has only one SSD, which is mounted at /mnt; therefore there is no /dev/sdc. In fact, the problem is that there is no space left under the / directory. So you should check whether your application writes data under this directory (for instance, save
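A quick sketch of how to see what is filling the root filesystem:

    df -h /                                              # how full is / ?
    du -xh --max-depth=2 / 2>/dev/null | sort -h | tail  # largest directories on that filesystem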

Re: Spark concurrency question

2015-02-08 Thread Sean Owen
On Sun, Feb 8, 2015 at 10:26 PM, java8964 java8...@hotmail.com wrote: standalone one box environment, if I want to use all 48G memory allocated to worker for my application, I should ask 48G memory for the executor in the spark shell, right? Because 48G is too big for a JVM heap in normal case,

RE: no space left at worker node

2015-02-08 Thread ey-chih chow
Hi Gen, Thanks. I save my logs in a file under /var/log. This is the only place to save data. Will the problem go away if I use a better machine? Best regards, Ey-Chih Chow

Re: Can't access remote Hive table from spark

2015-02-08 Thread guxiaobo1982
Hi Lian, Will the latest 0.14.0 version of Hive, which is installed by Ambari 1.7.0 by default, be supported by the next release of Spark? Regards

Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-08 Thread fightf...@163.com
Hi, The problem still exists. Would any experts take a look at this? Thanks, Sun. fightf...@163.com From: fightf...@163.com Date: 2015-02-06 17:54 To: user; dev Subject: Sort Shuffle performance issues about using AppendOnlyMap for large data sets Hi, all Recently we caught performance

Error when running example (pi.py)

2015-02-08 Thread Ashish Kumar
Traceback (most recent call last):
  File "pi.py", line 29, in <module>
    sc = SparkContext(appName="PythonPi")
  File "/home/ashish/Downloads/spark-1.1.0-bin-hadoop2.4/python/pyspark/context.py", line 104, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File

Re: WebUI on yarn through ssh tunnel affected by AmIpfilter

2015-02-08 Thread Akhil Das
Just to add why tunneling is sometimes not a good practice: there could be other ports/apps depending on other processes running on different ports. Let's say a web app running on port 8080 pulls info from other processes through a REST API; that will fail here, since you only tunnel port 8080
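For illustration, a sketch of forwarding several ports in one tunnel, or opening a dynamic SOCKS proxy instead (hosts and ports are placeholders):

    # forward each UI/API port you actually need, not just one
    ssh -L 8080:localhost:8080 -L 8088:localhost:8088 -L 4040:localhost:4040 user@gateway-host
    # or open a SOCKS proxy and point the browser at it, so any remote port resolves
    ssh -D 1080 user@gateway-host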

[MLlib] Performance issues when building GBM models

2015-02-08 Thread Christopher Thom
Hi All, I wonder if anyone else has some experience building a Gradient Boosted Trees model using spark/mllib? I have noticed when building decent-size models that the process slows down over time. We observe that the time to build tree n is approximately a constant time longer than the time

RE: no space left at worker node

2015-02-08 Thread ey-chih chow
Is there any way we can disable Spark copying the jar file to the corresponding directory? I have a fat jar, and it is already copied to the worker nodes using the copydir command. Why does Spark need to save the jar to ./spark/work/appid each time a job is started? Ey-Chih Chow

Re: Installing a python library along with ec2 cluster

2015-02-08 Thread Chengi Liu
Hi, I am very new to both Spark and AWS. Say I want to install pandas on EC2 (pip install pandas). How do I create the image, and how would the above library then be used from PySpark? Thanks. On Sun, Feb 8, 2015 at 3:03 AM, gen tang gen.tan...@gmail.com wrote: Hi, You can make an image of

Re: Installing a python library along with ec2 cluster

2015-02-08 Thread Akhil Das
You can basically add one function call to install the stuff you want. If you look at the spark-ec2 script, there's a function which does all the setup, named setup_cluster(..): https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L625. Now, if you want to install a python library (
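If the cluster is already up, a manual alternative is to install the library on every node from the master; a sketch, assuming a spark-ec2 cluster where the slave list lives at /root/spark-ec2/slaves and pip is available on the nodes (both assumptions):

    pip install pandas                          # on the master itself
    for host in $(cat /root/spark-ec2/slaves); do
        ssh -o StrictHostKeyChecking=no root@$host "pip install pandas"
    done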

Re: no space left at worker node

2015-02-08 Thread Kelvin Chu
Maybe try the local: scheme described under the heading Advanced Dependency Management here: https://spark.apache.org/docs/1.1.0/submitting-applications.html. It seems this is what you want. Hope this helps. Kelvin On Sun, Feb 8, 2015 at 9:13 PM, ey-chih chow eyc...@hotmail.com wrote: Is there any way we
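A sketch of what that looks like with spark-submit, assuming the fat jar has already been pushed to the same path on every worker (class name and paths are placeholders):

    # a local:/ URI tells Spark the file already exists on each node, so it is not re-copied into the work dir
    ./bin/spark-submit --master spark://master-host:7077 \
        --class com.example.MyJob \
        local:/root/jobs/my-fat-assembly.jar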

RE: no space left at worker node

2015-02-08 Thread ey-chih chow
I found the problem: for each application, the Spark worker node saves the corresponding stdout and stderr under ./spark/work/appid, where appid is the id of the application. If I run several applications in a row, it runs out of space. In my case, the disk usage under ./spark/work/
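One mitigation is the standalone worker's periodic cleanup of old application directories; a sketch for conf/spark-env.sh (property names from the standalone-mode docs, the values here are only examples, and only stopped applications are cleaned up):

    export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
      -Dspark.worker.cleanup.interval=1800 \
      -Dspark.worker.cleanup.appDataTtl=86400"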

RE: no space left at worker node

2015-02-08 Thread ey-chih chow
By the way, the input and output paths of the job are all in S3; I did not use HDFS paths as input or output. Best regards, Ey-Chih Chow