Re: weightCol doesn't seem to be handled properly in PySpark

2016-09-12 Thread Evan Zamir
Yep, done. https://issues.apache.org/jira/browse/SPARK-17508 On Mon, Sep 12, 2016 at 9:06 AM Nick Pentreath wrote: > Could you create a JIRA ticket for it? > > https://issues.apache.org/jira/browse/SPARK > > On Thu, 8 Sep 2016 at 07:50 evanzamir

Re: I noticed LinearRegression sometimes produces negative R^2 values

2016-09-07 Thread Evan Zamir
On Tue, Sep 6, 2016 at 11:15 PM, Evan Zamir <zamir.e...@gmail.com> wrote: > > I am using the default setting for *fitIntercept*, which *should* be > > TRUE, right? > > > > On Tue, Sep 6, 2016 at 1:38 PM Sean Owen <so...@cloudera.com> wrote: >

Re: I noticed LinearRegression sometimes produces negative R^2 values

2016-09-06 Thread Evan Zamir
I am using the default setting for *fitIntercept*, which *should* be TRUE, right? On Tue, Sep 6, 2016 at 1:38 PM Sean Owen wrote: > Are you not fitting an intercept / regressing through the origin? With > that constraint it's no longer true that R^2 is necessarily >
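
For reference, the default is indeed true in the spark.ml API; a minimal sketch of setting it explicitly:

    import org.apache.spark.ml.regression.LinearRegression

    // fitIntercept defaults to true; setting it explicitly removes any doubt
    val lr = new LinearRegression().setFitIntercept(true)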

[Community] Python support added to Spark Job Server

2016-08-17 Thread Evan Chan
Hi folks, Just a friendly message that we have added Python support to the REST Spark Job Server project. If you are a Python user looking for a RESTful way to manage your Spark jobs, please come have a look at our project! https://github.com/spark-jobserver/spark-jobserver -Evan

Re: How to add custom steps to Pipeline models?

2016-08-14 Thread Evan Zamir
Thanks, but I should have been more clear that I'm trying to do this in PySpark, not Scala. Using an example I found on SO, I was able to implement a Pipeline step in Python, but it seems it is more difficult (perhaps currently impossible) to make it persist to disk (I tried implementing _to_java

Re: Can we use spark inside a web service?

2016-03-14 Thread Evan Chan
and memory between queries. >>> Note that Mark is running a slightly-modified version of stock Spark. >>> (He's mentioned this in prior posts, as well.) >>> And I have to say that I'm, personally, seeing more and more >>> slightly-mo

Re: Can we use spark inside a web service?

2016-03-14 Thread Evan Chan
>> this may not be what people want to hear, but it's a trend that I'm seeing >> lately as more and more people customize Spark to their specific use cases. >> Anyway, thanks for the good discussion, everyone! This is why we have >> these lists, right! :)

Re: Can we use spark inside a web service?

2016-03-10 Thread Evan Chan
A 1000-core cluster can run at most >> 1000 simultaneous Tasks, but that doesn't really tell you anything about how >> many Jobs are or can be concurrently tracked by the DAGScheduler, which will >> be apportioning the Tasks from those concurrent Jobs across the available >>

Achieving 700 Spark SQL Queries Per Second

2016-03-10 Thread Evan Chan
700 queries per second in Spark: http://velvia.github.io/Spark-Concurrent-Fast-Queries/ Would love your feedback. thanks, Evan

Re: Construct model matrix from SchemaRDD automatically

2015-03-05 Thread Evan R. Sparks
in a DataFrame might be a welcome addition. - Evan On Thu, Mar 5, 2015 at 8:43 PM, Wush Wu w...@bridgewell.com wrote: Dear all, I am a new spark user from R. After exploring the schemaRDD, I notice that it is similar to data.frame. Is there a feature like `model.matrix` in R to convert

Re: Spark on teradata?

2015-01-08 Thread Evan R. Sparks
Have you taken a look at the TeradataDBInputFormat? Spark is compatible with arbitrary hadoop input formats - so this might work for you: http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw On Thu, Jan 8, 2015 at 10:53 AM, gen tang gen.tan...@gmail.com
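
For reference, the generic wiring for any mapred InputFormat looks like this (a sketch with TextInputFormat standing in; the Teradata connector's InputFormat class and its key/value Writable types would be substituted per its own docs):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

    // any configuration the InputFormat needs (JDBC URL, query, etc. for the
    // Teradata connector) is set on the JobConf before creating the RDD
    val jobConf = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(jobConf, "hdfs:///some/input")

    // swap TextInputFormat / LongWritable / Text for TeradataDBInputFormat and
    // its key/value types to read from Teradata instead
    val records = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], 4)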

Re: Spark and Stanford CoreNLP

2014-11-25 Thread Evan R. Sparks
will face this issue. HTH, Evan On Tue, Nov 25, 2014 at 8:05 AM, Christopher Manning mann...@stanford.edu wrote: I’m not (yet!) an active Spark user, but saw this thread on twitter … and am involved with Stanford CoreNLP. Could someone explain how things need to be to work better with Spark — since

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan Sparks
We have gotten this to work, but it requires instantiating the CoreNLP object on the worker side. Because of the initialization time it makes a lot of sense to do this inside of a .mapPartitions instead of a .map, for example. As an aside, if you're using it from Scala, have a look at
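
A minimal sketch of that mapPartitions pattern (assumes Stanford CoreNLP is on the classpath; the annotator list and the `docs` RDD[String] are illustrative):

    import java.util.Properties
    import scala.collection.JavaConverters._
    import edu.stanford.nlp.ling.CoreAnnotations
    import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

    // the pipeline is built once per partition, not once per document
    val tokens = docs.mapPartitions { iter =>
      val props = new Properties()
      props.setProperty("annotators", "tokenize, ssplit")
      val pipeline = new StanfordCoreNLP(props)   // expensive: models load here
      iter.map { text =>
        val doc = new Annotation(text)
        pipeline.annotate(doc)
        // extract plain strings so nothing non-serializable leaves the partition
        doc.get(classOf[CoreAnnotations.TokensAnnotation]).asScala.map(_.word())
      }
    }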

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan R. Sparks
and then refer to it from your map/reduce/mapPartitions and it should be fine (presuming it's thread-safe); it will only be initialized once per classloader per JVM. On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks evan.spa...@gmail.com wrote: We have gotten this to work, but it requires
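
The once-per-JVM trick being described, roughly (a sketch, presuming, as the post does, that the pipeline is thread-safe):

    import edu.stanford.nlp.ling.CoreAnnotations
    import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

    object NLP {
      // lazy val in an object: initialized at most once per classloader per JVM,
      // on each executor the first time a task touches it
      lazy val pipeline: StanfordCoreNLP = {
        val props = new java.util.Properties()
        props.setProperty("annotators", "tokenize, ssplit")
        new StanfordCoreNLP(props)
      }
    }

    // tasks just refer to the shared instance
    val sentenceCounts = docs.map { text =>
      val doc = new Annotation(text)
      NLP.pipeline.annotate(doc)
      doc.get(classOf[CoreAnnotations.SentencesAnnotation]).size()
    }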

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Evan R. Sparks
Additionally - I strongly recommend using OpenBLAS over the Atlas build from the default Ubuntu repositories. Alternatively, you can build ATLAS on the hardware you're actually going to be running the matrix ops on (the master/workers), but we've seen modest performance gains doing this vs.

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Evan R. Sparks
You can try recompiling spark with that option, and doing an sbt/sbt publish-local, then change your spark version from 1.1.0 to 1.2.0-SNAPSHOT (assuming you're building from the 1.1 branch) - sbt or maven (whichever you're compiling your app with) will pick up the version of spark that you just
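
Concretely, that dependency change is a one-liner in build.sbt (a sketch; the artifact here assumes you only need the MLlib module):

    // build.sbt -- after sbt/sbt publish-local, point at the local snapshot
    libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.2.0-SNAPSHOT"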

Re: SparkSQL - can we add new column(s) to parquet files

2014-11-21 Thread Evan Chan
I would expect an SQL query on c would fail because c would not be known in the schema of the older Parquet file. What I'd be very interested in is how to add a new column as an incremental new parquet file, and be able to somehow join the existing and new file, in an efficient way. I.e., somehow

Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Evan R. Sparks
For sharing RDDs across multiple jobs - you could also have a look at Tachyon. It provides an HDFS compatible in-memory storage layer that keeps data in memory across multiple jobs/frameworks - http://tachyon-project.org/ . - On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal sonalgoy...@gmail.com

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Evan R. Sparks
, save). and at some point during run time these sub-models merge into the master model, which also loads, trains, and saves at the master level. much appreciated. On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com wrote: There's some work going on to support PMML

Re: why decision trees do binary split?

2014-11-06 Thread Evan R. Sparks
You can imagine this same logic applying to the continuous case. E.g., what if all the quartiles or deciles of a particular value have different behavior - this could capture that too. Or what if some combination of features was highly discriminative, but only into n buckets, rather than two... you

Re: word2vec: how to save an mllib model and reload it?

2014-11-06 Thread Evan R. Sparks
Plain old Java serialization is one straightforward approach if you're in Java/Scala. On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote: what is the best way to save an mllib model that you just trained and reload it in the future? specifically, i'm using the mllib word2vec
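
A minimal sketch of that approach (the helper names are invented; this works for any model object that implements Serializable):

    import java.io._

    // save: any Serializable model can be written this way
    def saveModel(model: Serializable, path: String): Unit = {
      val oos = new ObjectOutputStream(new FileOutputStream(path))
      try oos.writeObject(model) finally oos.close()
    }

    // load: cast back to the concrete model type on the way in
    def loadModel[T](path: String): T = {
      val ois = new ObjectInputStream(new FileInputStream(path))
      try ois.readObject().asInstanceOf[T] finally ois.close()
    }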

Re: word2vec: how to save an mllib model and reload it?

2014-11-06 Thread Evan R. Sparks
, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com wrote: that works. is there a better way in spark? this seems like the most common feature for any machine learning work - to be able to save your model after training it and load it later. On Fri, Nov 7, 2014 at 2:30 AM, Evan R

Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Evan Chan
/ rebuild the RDD (it tries to only rebuild the missing part, but sometimes it must rebuild everything). Job server can help with 1 or 2, 2 in particular. If you have any questions about job server, feel free to ask at the spark-jobserver google group. I am the maintainer. -Evan On Thu, Oct 23

Re: MLlib linking error Mac OS X

2014-10-20 Thread Evan Sparks
up your program. - Evan On Oct 20, 2014, at 3:54 AM, npomfret nick-nab...@snowmonkey.co.uk wrote: I'm getting the same warning on my mac. Accompanied by what appears to be pretty low CPU usage (http://apache-spark-user-list.1001560.n3.nabble.com/mlib-model-build-and-low-CPU-usage-td16777

Re: Spark speed performance

2014-10-18 Thread Evan Sparks
How many files do you have and how big is each JSON object? Spark works better with a few big files vs many smaller ones. So you could try cat'ing your files together and rerunning the same experiment. - Evan On Oct 18, 2014, at 12:07 PM, jan.zi...@centrum.cz jan.zi...@centrum.cz wrote

Re: where are my python lambda functions run in yarn-client mode?

2014-10-11 Thread Evan Samanas
be to backport 'spark.localExecution.enabled' to the 1.0 line. Thanks for all your help! Evan On Fri, Oct 10, 2014 at 10:40 PM, Davies Liu dav...@databricks.com wrote: This is some kind of implementation details, so not documented :-( If you think this is a blocker for you, you could create a JIRA

Re: where are my python lambda functions run in yarn-client mode?

2014-10-10 Thread Evan
Thank you! I was looking for a config variable to that end, but I was looking in Spark 1.0.2 documentation, since that was the version I had the problem with. Is this behavior documented in 1.0.2's documentation? Evan On 10/09/2014 04:12 PM, Davies Liu wrote: When you call rdd.take

Re: How to run kmeans after pca?

2014-09-30 Thread Evan R. Sparks
Caching after doing the multiply is a good idea. Keep in mind that during the first iteration of KMeans, the cached rows haven't yet been materialized - so it is both doing the multiply and the first pass of KMeans all at once. To isolate which part is slow you can run cachedRows.numRows() to
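
In code, that isolation step looks roughly like this (a sketch; dataMatrix and pcaComponents are stand-in names):

    val projected = dataMatrix.multiply(pcaComponents)  // still lazy: nothing runs yet
    projected.rows.cache()                              // cache the projected rows
    projected.numRows()                                 // action: forces the multiply now,
                                                        // so later KMeans timing is clean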

Re: spark1.0 principal component analysis

2014-09-23 Thread Evan R. Sparks
, you can simply run step 1 yourself on your RowMatrix via the (experimental) computeCovariance() method, and then run SVD on the result using a library like breeze. - Evan On Tue, Sep 23, 2014 at 12:49 PM, st553 sthompson...@gmail.com wrote: sowen wrote it seems that the singular values from
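
A rough sketch of that two-step recipe (assumes the data fits the RowMatrix API and the covariance matrix is small enough to handle locally):

    import breeze.linalg.{svd, DenseMatrix => BDM}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(rows)          // rows: RDD[Vector], one observation each
    val cov = mat.computeCovariance()      // step 1: d x d covariance, as a local Matrix

    // step 2: SVD of the small local matrix via breeze (toArray is column-major,
    // matching breeze's DenseMatrix layout)
    val brzCov = new BDM[Double](cov.numRows, cov.numCols, cov.toArray)
    val svd.SVD(u, s, vt) = svd(brzCov)    // s: variances along the principal axes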

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-19 Thread Evan Chan
at 10:40 PM, Evan Chan velvia.git...@gmail.com wrote: SPARK-1671 looks really promising. Note that even right now, you don't need to un-cache the existing table. You can do something like this: newAdditionRdd.registerTempTable(table2) sqlContext.cacheTable(table2) val unionedRdd

Re: Example of Geoprocessing with Spark

2014-09-19 Thread Evan Chan
Hi Abel, Pretty interesting. May I ask how big is your point CSV dataset? It seems you are relying on searching through the FeatureCollection of polygons for which one intersects your point. This is going to be extremely slow. I highly recommend using a SpatialIndex, such as the many that
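
For example, with the JTS STRtree (one possible spatial index; a sketch assuming the polygons are a local collection of JTS Geometry objects):

    import com.vividsolutions.jts.geom.{Geometry, Point}
    import com.vividsolutions.jts.index.strtree.STRtree
    import scala.collection.JavaConverters._

    // build the index once over all polygons...
    val index = new STRtree()
    polygons.foreach(p => index.insert(p.getEnvelopeInternal, p))

    // ...then each point only tests the few polygons whose bounding boxes
    // overlap it, instead of scanning the whole FeatureCollection
    def polygonContaining(pt: Point): Option[Geometry] =
      index.query(pt.getEnvelopeInternal).asScala
        .collect { case g: Geometry => g }
        .find(_.intersects(pt))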

Re: Better way to process large image data set ?

2014-09-19 Thread Evan Chan
What Sean said. You should also definitely turn on Kryo serialization. The default Java serialization is really, really slow if you're gonna move around lots of data. Also make sure you use a cluster with high network bandwidth. On Thu, Sep 18, 2014 at 3:06 AM, Sean Owen so...@cloudera.com
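
Enabling Kryo is a small config change (a sketch; the registrator line is optional and the class name there is hypothetical):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("image-pipeline")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // optional: register the classes you shuffle most, via a custom registrator
    // .set("spark.kryo.registrator", "com.example.MyRegistrator")
    val sc = new org.apache.spark.SparkContext(conf)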

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-14 Thread Evan Chan
SPARK-1671 looks really promising. Note that even right now, you don't need to un-cache the existing table. You can do something like this: newAdditionRdd.registerTempTable("table2") sqlContext.cacheTable("table2") val unionedRdd = sqlContext.table("table1").unionAll(sqlContext.table("table2")) When

Re: Message Passing among workers

2014-09-03 Thread Evan R. Sparks
Asynchrony is not supported directly - spark's programming model is naturally BSP. I have seen cases where people have instantiated actors with akka on worker nodes to enable message passing, or even used spark's own ActorSystem to do this. But, I do not recommend this, since you lose a bunch of

Re: mllib performance on cluster

2014-09-03 Thread Evan R. Sparks
I spoke with SK offline about this, it looks like the difference in timings came from the fact that he was training 100 models for 100 iterations and taking the total time (vs. my example which trains a single model for 100 iterations). I'm posting my response here, though, because I think it's

Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
How many iterations are you running? Can you provide the exact details about the size of the dataset? (how many data points, how many features) Is this sparse or dense - and for the sparse case, how many non-zeroes? How many partitions is your data RDD? For very small datasets the scheduling

Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
Hmm... something is fishy here. That's a *really* small dataset for a spark job, so almost all your time will be spent in these overheads, but still you should be able to train a logistic regression model with the default options and 100 iterations in 1s on a single machine. Are you caching your
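
For a baseline along those lines (a sketch; rawPoints stands in for your RDD[LabeledPoint]):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // cache the input, or every one of the 100 iterations re-reads it
    val points: RDD[LabeledPoint] = rawPoints.cache()
    val model = LogisticRegressionWithSGD.train(points, 100) // 100 iterations, defaults otherwise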

Re: Finding previous and next element in a sorted RDD

2014-08-22 Thread Evan Chan
There's no way to avoid a shuffle due to the first and last elements of each partition needing to be computed with the others, but I wonder if there is a way to do a minimal shuffle. On Thu, Aug 21, 2014 at 6:13 PM, cjwang c...@cjwang.us wrote: One way is to do zipWithIndex on the RDD. Then use
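
One way to realize that minimal shuffle: only the head of each partition moves, via a small map collected on the driver (a sketch; it assumes the sorted RDD has no empty partitions):

    // sorted: an already-sorted RDD, e.g. rdd.sortBy(identity)
    // 1) pull the head of every partition to the driver (a tiny amount of data)
    val heads = sorted.mapPartitionsWithIndex { (i, it) =>
      if (it.hasNext) Iterator((i, it.next())) else Iterator.empty
    }.collect().toMap

    // 2) each partition borrows its successor's head, then pairs neighbors locally
    val neighbors = sorted.mapPartitionsWithIndex { (i, it) =>
      val borrowed = heads.get(i + 1).iterator
      (it ++ borrowed).sliding(2).collect { case Seq(a, b) => (a, b) }
    }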

Merging two Spark SQL tables?

2014-08-21 Thread Evan Chan
cached too. thanks, Evan

Re: Merging two Spark SQL tables?

2014-08-21 Thread Evan Chan
...@databricks.com wrote: I believe this should work if you run srdd1.unionAll(srdd2). Both RDDs must have the same schema. On Wed, Aug 20, 2014 at 11:30 PM, Evan Chan velvia.git...@gmail.com wrote: Is it possible to merge two cached Spark SQL tables into a single table so it can queried with one

Writeup on Spark SQL with GDELT

2014-08-21 Thread Evan Chan
I just put up a repo with a write-up on how to import the GDELT public dataset into Spark SQL and play around. Has a lot of notes on different import methods and observations about Spark SQL. Feel free to have a look and comment. http://www.github.com/velvia/spark-sql-gdelt

[Tachyon] Error reading from Parquet files in HDFS

2014-08-21 Thread Evan Chan
Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config) scala val gdeltT = sqlContext.parquetFile(tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005/) 14/08/21 19:07:14 INFO : initialize(tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005, Configuration: core-default.xml, core-site.xml,

Re: [Tachyon] Error reading from Parquet files in HDFS

2014-08-21 Thread Evan Chan
The underFS is HDFS btw. On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan velvia.git...@gmail.com wrote: Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config) scala val gdeltT = sqlContext.parquetFile(tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005/) 14/08/21 19:07:14 INFO

Re: [Tachyon] Error reading from Parquet files in HDFS

2014-08-21 Thread Evan Chan
And it worked earlier with a non-parquet directory. On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan velvia.git...@gmail.com wrote: The underFS is HDFS btw. On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan velvia.git...@gmail.com wrote: Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config

Spark-JobServer moving to a new location

2014-08-21 Thread Evan Chan
-jobserver The git commit history is still there, but unfortunately the pull requests don't migrate over. I'll be contacting each of you with open PRs to move them over to the new location. Happy Hacking! Evan (@velvia) Kelvin (@kelvinchu) Daniel (@dan-null

Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Evan Chan
That might not be enough. Reflection is used to determine what the fields are, thus your class might actually need to have members corresponding to the fields in the table. I heard that a more generic method of inputting stuff is coming. On Tue, Aug 19, 2014 at 6:43 PM, Tobias Pfeiffer
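
The reflection path being described looks like this in the 1.x API (a sketch; the class and file are invented, and registerTempTable was named registerAsTable in Spark 1.0):

    // the fields of the case class become the columns of the table
    case class Person(name: String, age: Int)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD conversion

    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(a => Person(a(0), a(1).trim.toInt))
    people.registerTempTable("people")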

Re: reduceByKey to get all associated values

2014-08-07 Thread Evan R. Sparks
Specifically, reduceByKey expects a commutative/associative reduce operation, and will automatically do this locally before a shuffle, which means it acts like a combiner in MapReduce terms - http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions On Thu,
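
A quick sketch of the contrast (pairs: RDD[(String, Int)]):

    // associative + commutative: partial sums are computed map-side, then shuffled
    val sums = pairs.reduceByKey(_ + _)

    // by contrast, groupByKey ships every value across the network --
    // only use it when you genuinely need all values per key
    val grouped = pairs.groupByKey()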

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Evan R. Sparks
Reza Zadeh has contributed the distributed implementation of (Tall/Skinny) SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html), which is in MLlib (Spark 1.0) and a distributed sparse SVD coming in Spark 1.1. (https://issues.apache.org/jira/browse/SPARK-1782). If your data
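
Usage on a RowMatrix is a one-liner (a sketch; k = 20 and the `rows` input are arbitrary):

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(rows)                   // rows: RDD[Vector]
    val svd = mat.computeSVD(20, computeU = true)   // top-20 singular values
    val (u, s, v) = (svd.U, svd.s, svd.V)           // U distributed; s and V local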

Re: [MLLib]:choosing the Loss function

2014-08-07 Thread Evan R. Sparks
The loss functions are represented in the various names of the model families. SVM is hinge loss, LogisticRegression is logistic loss, LinearRegression is squared (least squares) loss. These are used internally as arguments to the SGD and L-BFGS optimizers. On Thu, Aug 7, 2014 at 6:31 PM, SK

Re: Problem reading from S3 in standalone application

2014-08-06 Thread Evan Sparks
Try s3n:// On Aug 6, 2014, at 12:22 AM, sparkuser2345 hm.spark.u...@gmail.com wrote: I'm getting the same Input path does not exist error also after setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using the format s3://bucket-name/test_data.txt for the

Re: Computing mean and standard deviation by key

2014-08-01 Thread Evan R. Sparks
Computing the variance is similar to this example; you just need to keep around the sum of squares as well. The formula for variance is (sumsq/n) - (sum/n)^2. But with big datasets or large values, you can quickly run into overflow issues - MLlib handles this by maintaining the average sum of
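
Putting that together for the by-key case with the naive formula above (a sketch; pairs: RDD[(K, Double)], and see the overflow caveat for very large values):

    // keep (count, sum, sum of squares) per key, then finish locally
    val stats = pairs
      .mapValues(v => (1L, v, v * v))
      .reduceByKey { case ((n1, s1, q1), (n2, s2, q2)) =>
        (n1 + n2, s1 + s2, q1 + q2)
      }
      .mapValues { case (n, s, q) =>
        val mean = s / n
        (mean, math.sqrt(q / n - mean * mean))      // (mean, stdev)
      }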

Re: Computing mean and standard deviation by key

2014-08-01 Thread Evan R. Sparks
it, but I think it's a nice example of how to think about using spark at a higher level of abstraction. - Evan On Fri, Aug 1, 2014 at 2:00 PM, Sean Owen so...@cloudera.com wrote: Here's the more functional programming-friendly take on the computation (but yeah this is the naive formula

Re: Decision tree classifier in MLlib

2014-07-25 Thread Evan R. Sparks
Can you share the dataset via a gist or something and we can take a look at what's going on? On Fri, Jul 25, 2014 at 10:51 AM, SK skrishna...@gmail.com wrote: yes, the output is continuous. So I used a threshold to get binary labels. If prediction < threshold, then class is 0, else 1. I use

Re: Getting the number of slaves

2014-07-24 Thread Evan R. Sparks
Try sc.getExecutorStorageStatus.length - SparkContext's getExecutorMemoryStatus or getExecutorStorageStatus will give you back an object per executor - the StorageStatus objects are what drive a lot of the Spark Web UI.

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread Evan R. Sparks
of your dataset, may or may not be a good idea. There are some tricks you can do to make training multiple models on the same dataset faster, which we're hoping to expose to users in an upcoming release. - Evan On Sat, Jul 5, 2014 at 1:50 AM, Sean Owen so...@cloudera.com wrote: If you call .par

Re: How to use K-fold validation in spark-1.0?

2014-06-24 Thread Evan R. Sparks
There is a method in org.apache.spark.mllib.util.MLUtils called kFold which will automatically partition your dataset for you into k train/test splits at which point you can build k different models and aggregate the results. For example (a very rough sketch - assuming I want to do 10-fold cross
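
Roughly like this (a sketch; the model family and error metric are placeholders, and data is an RDD[LabeledPoint]):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.util.MLUtils

    val folds = MLUtils.kFold(data, numFolds = 10, seed = 42)   // Array[(train, test)]
    val errors = folds.map { case (train, test) =>
      val model = LogisticRegressionWithSGD.train(train.cache(), 100)
      test.filter(p => model.predict(p.features) != p.label).count().toDouble / test.count()
    }
    println(s"mean 10-fold error: ${errors.sum / errors.length}")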

Re: MLLib sample data format

2014-06-22 Thread Evan Sparks
but can double (or more) storage requirements for dense data. - Evan On Jun 22, 2014, at 3:35 PM, Justin Yip yipjus...@gmail.com wrote: Hello, I am looking into a couple of MLLib data files in https://github.com/apache/spark/tree/master/data/mllib. But I cannot find any explanation

Re: Performance problems on SQL JOIN

2014-06-20 Thread Evan R. Sparks
Also - you could consider caching your data after the first split (before the first filter), this will prevent you from retrieving the data from s3 twice. On Fri, Jun 20, 2014 at 8:32 AM, Xiangrui Meng men...@gmail.com wrote: Your data source is S3 and data is used twice. m1.large does not

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-19 Thread Evan R. Sparks
programming model, with Spark you can achieve performance that is comparable to a tuned C++/MPI codebase by leveraging the right libraries locally and thinking carefully about what and when you have to communicate. - Evan On Thu, Jun 19, 2014 at 8:48 AM, ldmtwo larry.d.moore...@intel.com wrote

Re: How do you run your spark app?

2014-06-19 Thread Evan R. Sparks
I use SBT, create an assembly, and then add the assembly jars when I create my spark context. The main executor I run with something like java -cp ... MyDriver. That said - as of spark 1.0 the preferred way to run spark applications is via spark-submit -

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Evan R. Sparks
This looks like a job for SparkSQL! val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class MyRecord(country: String, name: String, age: Int, hits: Long) val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234), MyRecord("USA", "Bob", 55, 108),
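
The snippet is cut off here; presumably it went on to register the table and issue one grouped query, along these lines (a hedged completion; registerTempTable was named registerAsTable in Spark 1.0):

    data.registerTempTable("records")
    // several aggregations computed in a single pass over the data
    sqlContext.sql("""
      SELECT country, COUNT(*), AVG(age), SUM(hits)
      FROM records
      GROUP BY country
    """).collect().foreach(println)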

Re: pmml with augustus

2014-06-10 Thread Evan R. Sparks
I should point out that if you don't want to take a polyglot approach to languages and reside solely in the JVM, then you can just use plain old java serialization on the Model objects that come out of MLlib's APIs from Java or Scala and load them up in another process and call the relevant

Re: Random Forest on Spark

2014-04-18 Thread Evan R. Sparks
estimate a couple of gigs necessary for heap space for the worker to compute/store the histograms, and I guess 2x that on the master to do the reduce. Again 2GB per worker is pretty tight, because there are overheads of just starting the jvm, launching a worker, loading libraries, etc. - Evan

Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
Sorry - I meant to say that Multiclass classification, Gradient Boosting, and Random Forest support based on the recent Decision Tree implementation in MLlib is planned and coming soon. On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Multiclass classification

Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
With a huge amount of data (millions or even billions of rows), we found that the depth of 10 is simply not adequate to build high-accuracy models. On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Hmm... can you provide some pointers to examples where deep trees

Re: Status of MLI?

2014-04-07 Thread Evan R. Sparks
Hi, Evan, Just noticed this thread, do you mind sharing more details regarding algorithms targeted at hyperparameter tuning/model selection? or a link to the dev git repo for that work. thanks, yi On Wed, Apr 2, 2014 at 6:03 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Targeting 0.9.0

Re: Status of MLI?

2014-04-02 Thread Evan R. Sparks
Targeting 0.9.0 should work out of the box (just a change to the build.sbt) - I'll push some changes I've been sitting on to the public repo in the next couple of days. On Wed, Apr 2, 2014 at 4:05 AM, Krakna H shankark+...@gmail.com wrote: Thanks for the update Evan! In terms of using MLI, I

Re: Status of MLI?

2014-04-01 Thread Evan R. Sparks
Hi there, MLlib is the first component of MLbase - MLI and the higher levels of the stack are still being developed. Look for updates in terms of our progress on the hyperparameter tuning/model selection problem in the next month or so! - Evan On Tue, Apr 1, 2014 at 8:05 PM, Krakna H shankark

Re: [HELP] ask for some information about public data set

2014-02-25 Thread Evan R. Sparks
/datasets-released-by-google - Evan On Tue, Feb 25, 2014 at 6:33 PM, 黄远强 hyq...@163.com wrote: Hi all: I am a freshman in the Spark community. I dream of being an expert in the field of big data. But I have no idea where to start after I have gone through the published documents on the Spark website