Re: Contributed to spark

2017-04-08 Thread Shuai Lin
Links that were helpful to me while learning about the Spark source code:

- Articles tagged "spark" on this blog:
http://hydronitrogen.com/tag/spark.html
- Jacek's "Mastering Apache Spark" GitBook:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/

Hope those can help.

On Sat, Apr 8, 2017 at 1:31 AM, Stephen Fletcher  wrote:

> I'd like to eventually contribute to Spark, but I'm noticing that since Spark 2
> the query planner is heavily used throughout the Dataset code base. Are there
> any sites I can go to that explain the technical details, beyond just
> a high-level perspective?
>


Re: Cached table details

2017-01-28 Thread Shuai Lin
+1 for Jacek's suggestion

FWIW: another possible *hacky* way is to write a package in the
org.apache.spark.sql namespace so it can access
sparkSession.sharedState.cacheManager, and then use Scala reflection to read
the cache manager's `cachedData` field, which provides the list of cached
relations.

https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L47

But this relies on Spark internals, so it would be subject to change between releases.
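
For the record, a rough PySpark analogue of that hack (going through py4j and
Java reflection instead of a Scala package) might look like the sketch below.
It is untested and assumes the Spark 2.1.x layout of these internals
(sharedState, cacheManager and the private cachedData field), so it may break
on any other version:

# spark is an existing SparkSession; everything below is Spark-internal.
jsession = spark._jsparkSession                   # JVM-side SparkSession
jcache = jsession.sharedState().cacheManager()    # internal CacheManager
field = jcache.getClass().getDeclaredField("cachedData")
field.setAccessible(True)                         # cachedData is private
cached = field.get(jcache)                        # collection of CachedData entries
it = cached.iterator()
while it.hasNext():
    entry = it.next()
    print(entry.plan().toString())                # logical plan of a cached relation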

On Fri, Jan 27, 2017 at 7:00 AM, Jacek Laskowski  wrote:

> Hi,
>
> I think that the only way to get the information about a cached RDD is to
> use SparkListener and intercept respective events about cached blocks on
> BlockManagers.
>
> Jacek
>
> On 25 Jan 2017 5:54 a.m., "kumar r"  wrote:
>
> Hi,
>
> I have cached some tables in the Spark Thrift Server. I want to get
> information about all the cached tables. I can see it in the web UI on port
> 4040.
>
> Is there any command or other way to get the cached table details
> programmatically?
>
> Thanks,
> Kumar
>
>
>


Re: Dynamic resource allocation to Spark on Mesos

2017-01-28 Thread Shuai Lin
>
> An alternative behavior is to launch the job with the best resource offer
> Mesos is able to give


Michael has just given an excellent explanation of dynamic allocation
support on Mesos. But IIUC, what you want to achieve is something like this
(using RAM as an example): "Launch each executor with at least 1GB of RAM,
but if Mesos happens to offer 2GB at that moment, launch the executor with
2GB instead."

I wonder what the benefit of that would be. To reduce "resource
fragmentation"?

Anyway, that is not supported at the moment. With all of the cluster
managers Spark supports (Mesos, YARN, standalone, and the upcoming Spark on
Kubernetes), you have to specify the cores and memory of each executor up
front.
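
For example, with PySpark a fixed executor shape is declared up front,
something like the sketch below (the values are only illustrative, and how
spark.executor.cores is interpreted depends on the Mesos mode you run in):

from pyspark import SparkConf, SparkContext

# Each executor is requested with a fixed size; a Mesos offer is only used
# if an executor of exactly this shape fits into it.
conf = (SparkConf()
        .setAppName("fixed-size-executors")
        .set("spark.executor.memory", "2g")   # memory per executor
        .set("spark.executor.cores", "2")     # cores per executor
        .set("spark.cores.max", "8"))         # cap on total cores for the app
sc = SparkContext(conf=conf)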

It may never be supported, because only Mesos has the concept of resource
offers, a consequence of its two-level scheduling model.


On Sat, Jan 28, 2017 at 1:35 AM, Ji Yan  wrote:

> Dear Spark Users,
>
> Currently is there a way to dynamically allocate resources to Spark on
> Mesos? Within Spark we can specify the CPU cores and memory before running a
> job. The way I understand it, the Spark job will not run if the CPU/memory
> requirement is not met. This may lead to a decrease in overall utilization of
> the cluster. An alternative behavior is to launch the job with the best
> resource offer Mesos is able to give. Is this possible with the current
> implementation?
>
> Thanks
> Ji
>


Re: mesos or kubernetes ?

2016-08-13 Thread Shuai Lin
Good summary! One more advantage of running Spark on Mesos: community
support. There is quite a big user base running Spark on Mesos, so if you
encounter a problem with your deployment, it's very likely you can find the
answer with a simple Google search, or by asking on the Spark/Mesos user
lists. By contrast, you're less likely to get that with Spark on K8s today.

On Sun, Aug 14, 2016 at 2:40 AM, Michael Gummelt 
wrote:

> Spark has a first-class scheduler for Mesos, whereas it doesn't for
> Kubernetes.  Running Spark on Kubernetes means running Spark in standalone
> mode, wrapped in a Kubernetes service: https://github.com/kubernetes/
> kubernetes/tree/master/examples/spark
>
> So you're effectively comparing standalone vs. Mesos.  For basic purposes,
> standalone works fine.  Mesos adds support for things like docker images,
> security, resource reservations via roles, targeting specific nodes via
> attributes, etc.
>
> The main benefit of Mesos, however, is that you can share the same
> infrastructure with other, non-Spark services.  We have users, for example,
> running Spark on the same cluster as HDFS, Cassandra, Kafka, web apps,
> Jenkins, etc.  You can do this with Kubernetes to some extent, but running
> in standalone means that the Spark "partition" isn't elastic.  You must
> statically partition to exclusively run Spark.
>
> On Sat, Aug 13, 2016 at 11:24 AM, guyoh  wrote:
>
>> My company is trying to decide whether to use kubernetes or mesos. Since
>> we
>> are planning to use Spark in the near future, I was wondering what the
>> best choice for us would be.
>> Thanks,
>> Guy
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


Re: Saving a pyspark.ml.feature.PCA model

2016-07-19 Thread Shuai Lin
It was added in the not-yet-released 2.0.0 version.

https://issues.apache.org/jira/browse/SPARK-13036
https://github.com/apache/spark/commit/83302c3b

So I guess you need to wait for the 2.0 release (or use the current RC4).
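
For example, with 2.0 the pyspark.ml persistence API should let you do
roughly the following (paths and column names are only illustrative):

from pyspark.ml.feature import PCA, PCAModel

pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(df)                          # df: a DataFrame with a vector "features" column
model.save("/tmp/pca_model")                 # persist the fitted model
restored = PCAModel.load("/tmp/pca_model")   # reload it later for inference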

On Wed, Jul 20, 2016 at 6:54 AM, Ajinkya Kale  wrote:

> Is there a way to save a pyspark.ml.feature.PCA model? I know mllib has
> model saving, but mllib does not have PCA afaik. How do people do model
> persistence for inference using the pyspark ml models? I did not find any
> documentation on model persistence for ml.
>
> --ajinkya
>


Re: Dependencies with runing Spark Streaming on Mesos cluster using Python

2016-07-13 Thread Shuai Lin
I think there are two options for you:

First, you can add `--conf spark.mesos.executor.docker.image=
adolphlwq/mesos-for-spark-exector-image:1.6.0.beta2` to your spark-submit
args, so Mesos will launch the executors with your custom image.

Or you can remove the `local:` prefix from the --jars flag; this way the
executors will download the jar from the Spark driver.

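For the first option, the submit command from your earlier mail would become
something along these lines (untested, just to show where the conf goes; the
jar can stay on the `local:` path because it is baked into your image):

dcos spark run \
  --submit-args='--conf spark.mesos.executor.docker.image=adolphlwq/mesos-for-spark-exector-image:1.6.0.beta2 --jars local:/linker/jars/spark-streaming-kafka_2.10-1.6.0.jar spark2cassandra.py 10.140.0.14:2181 wlu_spark2cassandra'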


On Wed, Jul 13, 2016 at 9:08 PM, Luke Adolph  wrote:

> Update:
> I rebuilt my mesos-exector-image and downloaded
> *spark-streaming-kafka_2.10-1.6.0.jar* into *`/linker/jars`*.
>
> I change my submit command:
>
> dcos spark run \ --submit-args='--jars
>> local:/linker/jars/spark-streaming-kafka_2.10-1.6.0.jar  spark2cassandra.py
>> 10.140.0.14:2181 wlu_spark2cassandra' --docker-image
>> adolphlwq/mesos-for-spark-exector-image:1.6.0.beta2
>
>
> Where I get new stderr output on mesos (screenshot omitted):
>
> My only problem is submitting the dependency
> spark-streaming-kafka_2.10-1.6.0.jar to the workers.
>
> Thanks.
>
>
> 2016-07-13 18:57 GMT+08:00 Luke Adolph :
>
>> Hi all:
>> My Spark runs on Mesos. I wrote a Spark Streaming app using Python; the
>> code is on GitHub.
>>
>> The app has the dependency
>> *org.apache.spark:spark-streaming-kafka_2.10:1.6.1*.
>>
>> Spark on Mesos has two important concepts: the Spark framework and the
>> Spark executor.
>>
>> I set my executor to run in a Docker image. The image's Dockerfile is
>> below:
>>
>> # refer '
>>> http://spark.apache.org/docs/latest/running-on-mesos.html#spark-properties'
>>> on 'spark.mesos.executor.docker.image' section
>>
>> FROM ubuntu:14.04
>>> WORKDIR /linker
>>> RUN ln -f -s /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
>>> #download mesos
>>> RUN echo "deb http://repos.mesosphere.io/ubuntu/ trusty main" >
>>> /etc/apt/sources.list.d/mesosphere.list && \
>>> apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF && \
>>> apt-get update && \
>>> apt-get -y install mesos=0.28.1-2.0.20.ubuntu1404 openjdk-7-jre
>>> python-pip git vim curl
>>> RUN git clone https://github.com/adolphlwq/linkerProcessorSample.git &&
>>> \
>>> pip install -r linkerProcessorSample/docker/requirements.txt
>>> RUN curl -fL
>>> http://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
>>> | tar xzf - -C /usr/local && \
>>> apt-get clean
>>> ENV MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so \
>>> JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 \
>>> SPARK_HOME=/usr/local/spark-1.6.0-bin-hadoop2.6
>>> ENV PATH=$JAVA_HOME/bin:$PATH
>>> WORKDIR $SPARK_HOME
>>
>>
>> When I use the command below to submit my app:
>>
>> dcos spark run --submit-args='--packages
>>> org.apache.spark:spark-streaming-kafka_2.10:1.6.1 \
>>>spark2cassandra.py zk topic' \
>>> -docker-image=adolphlwq/mesos-for-spark-exector-image:1.6.0.beta
>>
>>
>> The executor docker container runs successfully, but it has no package for
>> *org.apache.spark:spark-streaming-kafka_2.10:1.6.1*.
>>
>> The *stderr* on mesos is:
>>
>> I0713 09:34:52.715551 18124 logging.cpp:188] INFO level logging started!
>>> I0713 09:34:52.717797 18124 fetcher.cpp:424] Fetcher Info:
>>> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/6097419e-c2d0-4e5f-9a91-e5815de640c4-S4","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/home\/ubuntu\/spark2cassandra.py"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.apache.spark_spark-streaming-kafka_2.10-1.6.0.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/com.101tec_zkclient-0.3.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.apache.kafka_kafka_2.10-0.8.2.1.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.slf4j_slf4j-api-1.7.10.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.spark-project.spark_unused-1.0.0.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/net.jpountz.lz4_lz4-1.3.0.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/log4j_log4j-1.2.17.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/com.yammer.metrics_metrics-core-2.2.0.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.apache.kafka_kafka-clients-0.8.2.1.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/root\/.ivy2\/jars\/org.xerial.snappy_snappy-java-1.1.2.jar"}}],"sandbox_directory":"\/tmp\/mesos\/slaves\/6097419e-c2d0-4e5f-9a91-e5815de640c4-S4\/frameworks\/7399b6f7-5dcd-4a9b-9846-e7948d5ffd11-0024\/executors\/driver-20160713093451-0015\/runs\/84419372-9482-4c58-8f87-4ba528b6885c"}
>>> I0713 09:34:52.719846 18124 fetcher.cpp:379] Fetching URI
>>> '/home/ubuntu/spark2cassandra.py'
>>> I0713 

Re: StreamingKmeans Spark doesn't work at all

2016-07-10 Thread Shuai Lin
I would suggest you run the Scala version of the example first, so you can
tell whether the problem is with the data you provided or with your Java
code.
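
If I remember correctly, the Scala example can be launched with
bin/run-example, roughly like this (check the usage string in the example
source for the exact argument order; here the arguments are meant as training
dir, test dir, batch duration, number of clusters and number of dimensions):

bin/run-example mllib.StreamingKMeansExample \
    /path/to/trainingDir /path/to/testDir 5 3 2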

On Mon, Jul 11, 2016 at 2:37 AM, Biplob Biswas 
wrote:

> Hi,
>
> I know I am asking again, but I tried running the same thing on a Mac as
> well, since some answers on the internet suggested it could be an issue with
> the Windows environment, but still nothing works.
>
> Can anyone at least suggest whether it's a bug in Spark or something
> else?
>
> Would be really grateful! Thanks a lot.
>
> Thanks & Regards
> Biplob Biswas
>
> On Thu, Jul 7, 2016 at 5:21 PM, Biplob Biswas 
> wrote:
>
>> Hi,
>>
>> Can anyone care to please look into this issue?  I would really love some
>> assistance here.
>>
>> Thanks a lot.
>>
>> Thanks & Regards
>> Biplob Biswas
>>
>> On Tue, Jul 5, 2016 at 1:00 PM, Biplob Biswas 
>> wrote:
>>
>>>
>>> Hi,
>>>
>>> I implemented the streamingKmeans example provided in the spark website
>>> but
>>> in Java.
>>> The full implementation is here,
>>>
>>> http://pastebin.com/CJQfWNvk
>>>
>>> But I am not getting anything in the output except occasional timestamps
>>> like one below:
>>>
>>> ---
>>> Time: 1466176935000 ms
>>> ---
>>>
>>> Also, i have 2 directories:
>>> "D:\spark\streaming example\Data Sets\training"
>>> "D:\spark\streaming example\Data Sets\test"
>>>
>>> and inside these directories i have 1 file each "samplegpsdata_train.txt"
>>> and "samplegpsdata_test.txt" with training data having 500 datapoints and
>>> test data with 60 datapoints.
>>>
>>> I am very new to the spark systems and any help is highly appreciated.
>>>
>>>
>>> //---//
>>>
>>> Now, I also have now tried using the scala implementation available here:
>>>
>>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala
>>>
>>>
>>> and even had the training and test file provided in the format specified
>>> in
>>> that file as follows:
>>>
>>>  * The rows of the training text files must be vector data in the form
>>>  * `[x1,x2,x3,...,xn]`
>>>  * Where n is the number of dimensions.
>>>  *
>>>  * The rows of the test text files must be labeled data in the form
>>>  * `(y,[x1,x2,x3,...,xn])`
>>>  * Where y is some identifier. n must be the same for train and test.
>>>
>>>
>>> But I still get no output on my eclipse window ... just the Time!
>>>
>>> Can anyone seriously help me with this?
>>>
>>> Thank you so much
>>> Biplob Biswas
>>>
>>>
>>
>


Re: KEYS file?

2016-07-10 Thread Shuai Lin
>
> at least links to the keys used to sign releases on the
> download page


+1 for that.

On Mon, Jul 11, 2016 at 3:35 AM, Phil Steitz <phil.ste...@gmail.com> wrote:

> On 7/10/16 10:57 AM, Shuai Lin wrote:
> > Not sure where you see " 0x7C6C105FFC8ED089". I
>
> That's the key ID for the key below.
> > think the release is signed with the
> > key https://people.apache.org/keys/committer/pwendell.asc .
>
> Thanks!  That key matches.  The project should publish a KEYS file
> [1] or at least links to the keys used to sign releases on the
> download page.  Could be there is one somewhere and I just can't
> find it.
>
> Phil
>
> [1] http://www.apache.org/dev/release-signing.html#keys-policy
> >
> > I think this tutorial can be
> > helpful: http://www.apache.org/info/verification.html
> >
> > On Mon, Jul 11, 2016 at 12:57 AM, Phil Steitz
> > <phil.ste...@gmail.com> wrote:
> >
> > I can't seem to find a link to the Spark KEYS file.  I am
> > trying to
> > validate the sigs on the 1.6.2 release artifacts and I need to
> > import 0x7C6C105FFC8ED089.  Is there a KEYS file available for
> > download somewhere?  Apologies if I am just missing an obvious
> > link.
> >
> > Phil
> >
> >
> >
> >
>
>
>


Re: KEYS file?

2016-07-10 Thread Shuai Lin
Not sure where you see "0x7C6C105FFC8ED089". I think the release is signed
with the key https://people.apache.org/keys/committer/pwendell.asc.

I think this tutorial can be helpful:
http://www.apache.org/info/verification.html
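
Roughly, the verification looks like this (adjust the file names to the
artifact and signature file you actually downloaded):

gpg --import pwendell.asc
gpg --verify spark-1.6.2-bin-hadoop2.6.tgz.asc spark-1.6.2-bin-hadoop2.6.tgz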

On Mon, Jul 11, 2016 at 12:57 AM, Phil Steitz  wrote:

> I can't seem to find a link to the Spark KEYS file.  I am trying to
> validate the sigs on the 1.6.2 release artifacts and I need to
> import 0x7C6C105FFC8ED089.  Is there a KEYS file available for
> download somewhere?  Apologies if I am just missing an obvious link.
>
> Phil
>
>
>
>


Poor performance of using spark sql over gzipped json files

2016-06-24 Thread Shuai Lin
Hi,

We have tried to use Spark SQL to process some gzipped JSON-format log
files stored on S3 or HDFS, but the performance is very poor.

For example, here is the code I run over 20 gzipped files (their total size
is 4GB compressed and ~40GB decompressed):

gzfile = 's3n://my-logs-bucket/*.gz' # or 'hdfs://nameservice1/tmp/*.gz'
sc_file = sc.textFile(gzfile)
sc_file.cache()
df = sqlContext.jsonRDD(sc_file)
df.select('*').limit(1).show()

With 6 executors launched, each with 2 CPU cores and 5GB of RAM, the
"df.select" operation always takes more than 150 seconds to finish,
regardless of whether the files are stored on S3 or HDFS.

BTW, we are running Spark 1.6.1 on Mesos in fine-grained mode.

Downloading from S3 is fast: in another test within the same environment,
it takes no more than 2 minutes to finish a simple "sc_file.count()" over
500 similar files whose total size is 15GB compressed and 400GB
decompressed.

I thought the bottleneck might be the JSON schema auto-inference. However,
I have tried specifying the schema explicitly instead of letting Spark infer
it, and that made no notable difference.

Things I plan to try soon:

* Decompress the gz files and save them to HDFS, construct a data frame on
the decompressed files, then run SQL over it.
* Or save the JSON data in Parquet format on HDFS and then run SQL over
that (see the sketch below).
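
A rough sketch of the second option (paths are illustrative, same Spark 1.6
APIs as in the snippet above):

# Convert the gzipped JSON logs to Parquet once, then query the Parquet copy.
df = sqlContext.read.json('hdfs://nameservice1/tmp/*.gz')
df.write.parquet('hdfs://nameservice1/tmp/logs_parquet')

logs = sqlContext.read.parquet('hdfs://nameservice1/tmp/logs_parquet')
logs.registerTempTable('logs')
sqlContext.sql('SELECT * FROM logs LIMIT 1').show()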

Do you have any suggestions? Thanks!

Regards,
Shuai


Questions about Spark On Mesos

2016-03-15 Thread Shuai Lin
Hi list,

We (Scrapinghub) are planning to deploy Spark on a 10+ node cluster, mainly
for processing data in HDFS and for Kafka streaming. We are thinking of using
Mesos instead of YARN as the cluster resource manager so we can use Docker
containers as executors and make deployment easier. But there is one
important thing to settle before making the decision: data locality.

If we run Spark on Mesos, can it achieve good data locality when processing
HDFS data? I think Spark on YARN achieves that out of the box, but I'm not
sure whether Spark on Mesos can do the same.

I've searched through the list archive but haven't found a helpful answer
yet. Any reply is appreciated.

Regards,
Shuai