Is Spark History Server supported for Mesos?

2015-12-09 Thread Kelvin Chu
Spark on YARN can use the History Server by setting the configuration spark.yarn.historyServer.address. But I can't find a similar config for Mesos. Is the History Server supported by Spark on Mesos? Thanks. Kelvin
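
For context, the History Server itself reads event logs from a shared directory and is not tied to a particular cluster manager; spark.yarn.historyServer.address only controls the link shown in the YARN UI. A minimal sketch of the usual event-log wiring (paths illustrative):

    # spark-defaults.conf on the submitting side (path illustrative)
    spark.eventLog.enabled  true
    spark.eventLog.dir      hdfs:///spark-event-logs

    # on the History Server host, point it at the same directory
    spark.history.fs.logDirectory  hdfs:///spark-event-logs
    # then start it with: ./sbin/start-history-server.sh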

Re: Combining Many RDDs

2015-03-26 Thread Kelvin Chu
Hi, I have used union() before and yes, it can be slow sometimes. I _guess_ your variable 'data' is a Scala collection and compute() returns an RDD. Right? If yes, I tried the approach below to operate on one RDD only during the whole computation (yes, I also saw that too many RDDs hurt performance).
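
For illustration only, a minimal Scala sketch of the single-call idea; the names 'data' and 'compute' are assumptions based on the question, not the original code:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Chaining a.union(b).union(c)... builds a long lineage of pairwise
    // unions; SparkContext.union combines all the RDDs in one step.
    def combineAll[A](sc: SparkContext, data: Seq[A], compute: A => RDD[String]): RDD[String] =
      sc.union(data.map(compute))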

Re: Running out of space (when there's no shortage)

2015-02-27 Thread Kelvin Chu
Hi Joe, you might increase spark.yarn.executor.memoryOverhead to see if it fixes the problem. Please take a look at this report: https://issues.apache.org/jira/browse/SPARK-4996 Hope this helps. On Tue, Feb 24, 2015 at 2:05 PM, Yiannis Gkoufas johngou...@gmail.com wrote: No problem, Joe. There
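
For example, the overhead can be raised at submit time (the 1024 MB value is only illustrative; tune it to your job):

    spark-submit --conf spark.yarn.executor.memoryOverhead=1024 ...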

Re: job keeps failing with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1

2015-02-27 Thread Kelvin Chu
Hi Darin, you might increase spark.yarn.executor.memoryOverhead to see if it fixes the problem. Please take a look at this report: https://issues.apache.org/jira/browse/SPARK-4996 On Fri, Feb 27, 2015 at 12:38 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Can you share what error you
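
If you want it applied to every job rather than one submission, the same setting can go in the defaults file (value illustrative):

    # conf/spark-defaults.conf
    spark.yarn.executor.memoryOverhead  1024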

Re: Spark Performance on Yarn

2015-02-20 Thread Kelvin Chu
Spark bookkeeping and anything the user does inside UDFs. -Sandy On Fri, Feb 20, 2015 at 11:44 AM, Kelvin Chu 2dot7kel...@gmail.com wrote: Hi Sandy, I am also doing memory tuning on YARN. Just want to confirm, is it correct to say: spark.executor.memory

Re: Setting the number of executors in standalone mode

2015-02-20 Thread Kelvin Chu
Hi, Currently there is only one executor per worker. There is a JIRA ticket to relax this: https://issues.apache.org/jira/browse/SPARK-1706 But if you want to use more cores, maybe you can try increasing SPARK_WORKER_INSTANCES. It increases the number of workers per machine. Take a look here:
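
A minimal sketch of the relevant standalone-mode settings in conf/spark-env.sh on each worker machine (values illustrative):

    # conf/spark-env.sh
    export SPARK_WORKER_INSTANCES=2   # run two worker daemons per machine
    export SPARK_WORKER_CORES=8       # cores each worker can give to executors
    export SPARK_WORKER_MEMORY=16g    # memory each worker can give to executors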

Re: Spark Performance on Yarn

2015-02-20 Thread Kelvin Chu
Hi Sandy, I am also doing memory tuning on YARN. Just want to confirm, is it correct to say: spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory I can actually use in my JVM application? If it is not, what is the correct relationship? Any other variables or config parameters
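
A worked example of how these two settings typically interact on YARN (numbers illustrative): the overhead is added on top of the heap when YARN sizes the container, rather than carved out of it:

    spark.executor.memory               8g     # JVM heap (-Xmx) of each executor
    spark.yarn.executor.memoryOverhead  1024   # MB, for VM and native overheads
    # YARN container request per executor ~ 8g + 1024m = 9g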

Re: using a database connection pool to write data into an RDBMS from a Spark application

2015-02-19 Thread Kelvin Chu
Hi Mohammed, Did you use --jars to specify your JDBC driver when you submitted your job? Take a look at this link: http://spark.apache.org/docs/1.2.0/submitting-applications.html Hope this helps! Kelvin On Thu, Feb 19, 2015 at 7:24 PM, Mohammed Guller moham...@glassbeam.com wrote: Hi – I
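
For example (the class, jar, and driver names are made up for illustration):

    spark-submit --class com.example.LoaderApp \
      --jars /path/to/jdbc-driver.jar \
      loader-app.jar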

Re: OutofMemoryError: Java heap space

2015-02-10 Thread Kelvin Chu
Since the stacktrace shows Kryo is being used, maybe you could also try increasing spark.kryoserializer.buffer.max.mb. Hope this helps. Kelvin On Tue, Feb 10, 2015 at 1:26 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You could try increasing the driver memory. Also, can you be more specific
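
For example (the 512 value is illustrative; in Spark 1.x this property takes an integer number of megabytes):

    spark-submit --conf spark.kryoserializer.buffer.max.mb=512 ...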

Re: Spark on very small files, appropriate use case?

2015-02-10 Thread Kelvin Chu
I had a similar use case before. I found: 1. textFile() produced one partition per file. It can result in many partitions. I found that calling coalesce() without shuffle helped. 2. If you use persist(), count() will do the I/O and put the result into the cache. Transformations later did computation out
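
A minimal Scala sketch of both points, assuming sc is an existing SparkContext (path and partition count illustrative):

    val lines = sc.textFile("s3n://bucket/many-small-files/")  // ~one partition per file
    val compact = lines.coalesce(64)  // merge partitions without a shuffle
    compact.persist()
    compact.count()                   // does the I/O once and fills the cache
    // later transformations compute out of the cache instead of re-reading files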

Re: Can spark job server be used to visualize streaming data?

2015-02-10 Thread Kelvin Chu
Hi Su, Out of the box, no. But I know people who integrate it with Spark Streaming to do real-time visualization. It will take some work, though. Kelvin On Mon, Feb 9, 2015 at 5:04 PM, Su She suhsheka...@gmail.com wrote: Hello Everyone, I was reading this blog post:

Re: no space left at worker node

2015-02-08 Thread Kelvin Chu
Maybe try the local: scheme, under the heading Advanced Dependency Management here: https://spark.apache.org/docs/1.1.0/submitting-applications.html It seems this is what you want. Hope this helps. Kelvin On Sun, Feb 8, 2015 at 9:13 PM, ey-chih chow eyc...@hotmail.com wrote: Is there any way we
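
For example (path illustrative), local: tells Spark the jar already exists at the same path on every node, so nothing is copied to the workers' scratch space:

    spark-submit --jars local:/opt/libs/my-dep.jar ...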

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-04 Thread Kelvin Chu
Joe, I also use S3 and gzip. So far the I/O is not a problem. In my case, the operation is sqlContext.jsonFile() and I can see from Ganglia that the whole cluster is CPU bound (99% saturated). I have 160 cores and I can see the network can sustain about 150 Mbit/s. Kelvin On Wed, Feb 4, 2015 at
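
For reference, a sketch of the call in question (bucket path illustrative); gzip decompression plus JSON parsing and schema inference plausibly explain why the cluster is CPU bound rather than I/O bound:

    // Spark 1.x API; assuming sqlContext is an org.apache.spark.sql.SQLContext
    val events = sqlContext.jsonFile("s3n://my-bucket/events/*.json.gz")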

Re: Interactive interface tool for spark

2014-10-08 Thread Kelvin Chu
Hi Andy, It sounds great! Quick questions: I have been using IPython + PySpark. I crunch the data with PySpark and then visualize it with Python libraries like matplotlib and basemap. Could I still use these Python libraries in the Scala Notebook? If not, what are the suggested approaches for