Re: Contributed to spark

2017-04-08 Thread Shuai Lin
Links that were helpful to me while learning about the Spark source code: - Articles with the "spark" tag on this blog: http://hydronitrogen.com/tag/spark.html - Jacek's "Mastering Apache Spark" GitBook: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/ Hope those can help. On Sat, Apr 8,

Re: Cached table details

2017-01-28 Thread Shuai Lin
+1 for Jacek's suggestion. FWIW, another possible *hacky* way is to write a package in the org.apache.spark.sql namespace so it can access sparkSession.sharedState.cacheManager, then use Scala reflection to read the cache manager's `cachedData` field, which can provide the list of cached

Re: Dynamic resource allocation to Spark on Mesos

2017-01-28 Thread Shuai Lin
> An alternative behavior is to launch the job with the best resource offer Mesos is able to give. Michael has just given an excellent explanation of dynamic allocation support on Mesos. But IIUC, what you want to achieve is something like this (using RAM as an example): "Launch each executor
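
For reference, enabling dynamic allocation for Spark on Mesos usually involves settings along these lines. This is a spark-defaults.conf sketch with assumed values; in particular, Mesos requires the external shuffle service to be running on each agent, and exact requirements depend on your Spark version:

```
# spark-defaults.conf sketch (values are illustrative)
spark.dynamicAllocation.enabled          true
# Mesos needs the external shuffle service running on every agent
spark.shuffle.service.enabled            true
spark.dynamicAllocation.minExecutors     1
spark.dynamicAllocation.maxExecutors     20
```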

Re: mesos or kubernetes ?

2016-08-13 Thread Shuai Lin
Good summary! One more advantage of running Spark on Mesos: community support. There is quite a big user base running Spark on Mesos, so if you encounter a problem with your deployment, it's very likely you can find the answer with a simple Google search or by asking on the Spark/Mesos user list. By

Re: Saving a pyspark.ml.feature.PCA model

2016-07-19 Thread Shuai Lin
It was added in the not-yet-released 2.0.0 version. https://issues.apache.org/jira/browse/SPARK-13036 https://github.com/apache/spark/commit/83302c3b So I guess you need to wait for the 2.0 release (or use the current RC4). On Wed, Jul 20, 2016 at 6:54 AM, Ajinkya Kale wrote: >

Re: Dependencies with runing Spark Streaming on Mesos cluster using Python

2016-07-13 Thread Shuai Lin
I think there are two options for you: First, you can set `--conf spark.mesos.executor.docker.image=adolphlwq/mesos-for-spark-exector-image:1.6.0.beta2` in your spark-submit args, so Mesos would launch the executor with your custom image. Or you can remove the `local:` prefix in the --jars flag,
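
The first option would look roughly like this spark-submit invocation. Sketch only: the master URL, jar location, and script name are placeholders I made up, and it is not runnable outside an actual Mesos deployment:

```
spark-submit \
  --master mesos://<mesos-master>:5050 \
  --conf spark.mesos.executor.docker.image=adolphlwq/mesos-for-spark-exector-image:1.6.0.beta2 \
  --jars http://<some-host>/deps.jar \
  your_streaming_job.py
```

The second option amounts to referencing the jar with a scheme the executors can fetch (e.g. http:// or hdfs://) instead of `local:`, which assumes the file exists only on the submitting machine.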

Re: StreamingKmeans Spark doesn't work at all

2016-07-10 Thread Shuai Lin
I would suggest you run the Scala version of the example first, so you can tell whether it's a problem with the data you provided or a problem with the Java code. On Mon, Jul 11, 2016 at 2:37 AM, Biplob Biswas wrote: > Hi, > > I know i am asking again, but I tried running

Re: KEYS file?

2016-07-10 Thread Shuai Lin
> at least links to the keys used to sign releases on the download page +1 for that. On Mon, Jul 11, 2016 at 3:35 AM, Phil Steitz <phil.ste...@gmail.com> wrote: > On 7/10/16 10:57 AM, Shuai Lin wrote: > > Not sure where you see "0x7C6C105FFC8ED089". I

Re: KEYS file?

2016-07-10 Thread Shuai Lin
Not sure where you see "0x7C6C105FFC8ED089". I think the release is signed with the key https://people.apache.org/keys/committer/pwendell.asc. This tutorial can be helpful: http://www.apache.org/info/verification.html On Mon, Jul 11, 2016 at 12:57 AM, Phil Steitz
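
Following the verification tutorial linked above, the check boils down to something like this. The file names below are illustrative, not taken from the thread; substitute the artifact you actually downloaded:

```
# Import the signer's key linked above
wget https://people.apache.org/keys/committer/pwendell.asc
gpg --import pwendell.asc

# Verify the detached .asc signature that ships next to the release tarball
# (file names are illustrative)
gpg --verify spark-1.6.2-bin-hadoop2.6.tgz.asc spark-1.6.2-bin-hadoop2.6.tgz
```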

Poor performance of using spark sql over gzipped json files

2016-06-24 Thread Shuai Lin
Hi, We have tried to use Spark SQL to process some gzipped JSON-format log files stored on S3 or HDFS, but the performance is very poor. For example, here is the code that I run over 20 gzipped files (total size 4GB compressed, ~40GB decompressed): gzfile =
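
One likely cause (an assumption on my part, since the code in the preview is truncated): gzip is not a splittable format, so Spark cannot parallelize within a single .gz file; with 20 files the job gets at most 20 partitions regardless of cluster size. A quick stdlib demonstration of why a reader cannot start mid-stream:

```python
import gzip
import zlib

# A gzip stream must be decompressed sequentially from byte 0; there is no
# sync marker that lets a reader jump into the middle. This is why Spark
# treats each .gz file as a single partition.
payload = b"some json log line\n" * 50_000
blob = gzip.compress(payload)

# Decompressing the whole stream works...
assert gzip.decompress(blob) == payload

# ...but starting from an arbitrary midpoint fails.
try:
    zlib.decompress(blob[len(blob) // 2:], 16 + zlib.MAX_WBITS)
    splittable = True
except zlib.error:
    splittable = False

print("gzip decodable from mid-stream:", splittable)
```

This is why re-compressing logs into a splittable format (e.g. uncompressed, or a block-compressed container) often helps such jobs.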

Questions about Spark On Mesos

2016-03-15 Thread Shuai Lin
Hi list, We (Scrapinghub) are planning to deploy Spark on a 10+ node cluster, mainly for processing data in HDFS and Kafka streaming. We are thinking of using Mesos instead of YARN as the cluster resource manager, so we can use Docker containers as executors and make deployment easier. But