Re: Spark performance testing

2016-07-08 Thread charles li
Hi, Andrew, I found lots of material by googling "*spark performance test*": - https://github.com/databricks/spark-perf - https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf -

Is there a way to dynamically load files [parquet or csv] in the map function?

2016-07-08 Thread charles li
hi, guys, is there a way to dynamically load files within the map function? i.e. can I code as below: [inline code screenshot] thanks a lot. -- *___* Quant | Engineer | Boy *___* *blog*: http://litaotao.github.io *github*: www.github.com/litaotao
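The snippet in the original mail is an inline image and does not survive in the archive, but one common answer is to ship file paths through the RDD and open them inside the function itself. A minimal sketch, assuming a live SparkContext `sc` (as in the pyspark shell) and pandas (with a parquet engine) installed on every worker; the paths and helper name are illustrative, not from the original post:

```
import pandas as pd

def read_one_file(path):
    # Runs on the worker: the path must be reachable from the executors,
    # e.g. on a shared or distributed filesystem.
    if path.endswith(".parquet"):
        df = pd.read_parquet(path)
    else:
        df = pd.read_csv(path)
    return df.to_dict("records")  # one dict per row

paths = sc.parallelize(["/data/part-0.csv", "/data/part-1.parquet"])
rows = paths.flatMap(read_one_file)  # each file row becomes one RDD element
```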

Preview release of Spark 2.0

2016-05-29 Thread charles li
Here is the link: http://spark.apache.org/news/spark-2.0.0-preview.html congrats, haha, looking forward to 2.0.1, awesome project. -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

question about Reynold's talk: "The Future of Real Time"

2016-04-22 Thread charles li
hi, there, the talk *The Future of Real Time in Spark* here https://www.youtube.com/watch?v=oXkxXDG0gNk says that there will be "BI app integration" at 24:28 in the video. what does he mean by *BI app integration* in that talk? does that mean they will develop a BI tool like zeppelin,

Re: confused about Spark SQL json format

2016-03-31 Thread charles li
-- On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY <umesh9...@gmail.com> wrote: > Hi, > Look at the image below, which is from json.org: > > [image: Inline image 1] > > The above image describes the object formulation of the JSON below: > > Object 1=> {"nam

confused about Spark SQL json format

2016-03-31 Thread charles li
as the post below says, in spark we can load a json file in the following way: *post*: https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html ---
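The usual source of confusion with this reader: it expects one complete JSON object per line (JSON Lines), not a pretty-printed document or a top-level array. A minimal sketch, assuming a Spark 1.x `sqlContext` and an illustrative people.json:

```
# people.json must hold one object per line, e.g.
# {"name": "alice", "age": 25}
# {"name": "bob", "age": 30}
df = sqlContext.read.json("people.json")
df.printSchema()
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 20").show()
```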

Re: since spark cannot parallelize/serialize functions, how to distribute algorithms on the same data?

2016-03-28 Thread charles li
arau <hol...@pigscanfly.ca> > wrote: > >> You probably want to look at the map transformation, and the many more >> defined on RDDs. The function you pass in to map is serialized and the >> computation is distributed. >> >> >> On Monday, March 28, 2016, ch

since spark cannot parallelize/serialize functions, how to distribute algorithms on the same data?

2016-03-28 Thread charles li
use case: I have a dataset and want to run different algorithms on it and fetch the results. to do this, I think I should distribute my algorithms and run them on the dataset at the same time, am I right? but it seems that spark cannot parallelize/serialize
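As the reply above notes, the functions you pass to transformations are serialized and shipped automatically, so there is no need for an "RDD of algorithms": cache the dataset once and run each algorithm against it. A minimal sketch, assuming a live SparkContext `sc`; the algorithm bodies are illustrative, not from the original post:

```
data = sc.textFile("hdfs:///dataset").cache()  # cache once, reuse for every algorithm
data.count()                                   # materialize the cache

def algo_a(rdd):
    return rdd.count()

def algo_b(rdd):
    return rdd.flatMap(lambda line: line.split()) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b) \
              .take(10)

results = [algo(data) for algo in (algo_a, algo_b)]
# to run the jobs concurrently instead of one after another, submit each
# algo(data) call from its own driver thread; they share the cached blocks
```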

Re: what happens if you cache an RDD multiple times?

2016-03-24 Thread charles li
y and/or disk storage > */ > private[spark] def persistRDD(rdd: RDD[_]) { > persistentRdds(rdd.id) = rdd > } > > Hope this helps. > > Best > Yash > > On Thu, Mar 24, 2016 at 1:58 PM, charles li <charles.up...@gmail.com> > wrote: > >> >
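Since persistRDD just re-registers the RDD under its id, calling cache() twice is harmless. A minimal sketch of the observable behavior, assuming a live SparkContext `sc`:

```
rdd = sc.parallelize(range(100))
rdd.cache()
rdd.cache()   # no-op: the RDD is simply re-registered under the same rdd.id
rdd.count()   # the first action is what actually materializes the cache
# note: changing the level of an already-persisted RDD is an error, e.g.
# rdd.persist(StorageLevel.MEMORY_AND_DISK) would raise
# "Cannot change storage level of an RDD after it was already assigned a level"
```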

what happens if you cache an RDD multiple times?

2016-03-24 Thread charles li
I happened to see this problem on stackoverflow: http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812 I think it's very interesting, and the answer posted by Aaron sounds promising, but I'm not sure, and I can't find the details

Re: best practices: running multi user jupyter notebook server

2016-03-20 Thread charles li
Hi, Andy, I think you can do that with some open-source packages/libs built for IPython and Spark. here is one: https://github.com/litaotao/IPython-Dashboard On Thu, Mar 17, 2016 at 1:36 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > We are considering deploying a notebook

Re: best way to do deep learning on spark?

2016-03-19 Thread charles li
rs, etc. are > currently under development. Please refer to > https://issues.apache.org/jira/browse/SPARK-5575 > > > > Best regards, Alexander > > > > *From:* charles li [mailto:charles.up...@gmail.com] > *Sent:* Wednesday, March 16, 2016 7:01 PM > *To:* u

best way to do deep learning on spark?

2016-03-19 Thread charles li
Hi, guys, I'm new to MLlib on spark. after reading the documentation, it seems that MLlib does not support deep learning. I want to know: is there any way to implement deep learning on spark? *Must I use a third-party package like caffe or tensorflow?* or *is a deep learning module listed in the MLlib
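For reference, Spark ML does ship a feed-forward network, the multilayer perceptron (a shallow network, not a full deep-learning framework). A minimal sketch, assuming Spark 1.6 and DataFrames train_df/test_df with the usual features/label columns; the layer sizes are illustrative:

```
from pyspark.ml.classification import MultilayerPerceptronClassifier

# layers: 4 input features, two hidden layers of 8 units, 3 output classes
mlp = MultilayerPerceptronClassifier(layers=[4, 8, 8, 3], maxIter=100, seed=42)
model = mlp.fit(train_df)            # train on the labeled DataFrame
predictions = model.transform(test_df)
```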

the "DAG Visualiztion" in 1.6 not works fine here

2016-03-15 Thread charles li
sometimes it just shows several *black dots*, and sometimes it cannot show the entire graph. has anyone seen this before, and how did you fix it? -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

is there any way to make the web UI auto-refresh?

2016-03-15 Thread charles li
currently I can only get the latest info by refreshing the page, which is a little tedious. so is there any way to make the web UI auto-refresh? great thanks -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao
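The UI itself has no refresh setting that I know of, but the monitoring REST API (available since Spark 1.4) makes it easy to poll from a small script instead. A minimal sketch, assuming a driver on localhost:4040 and the Python 2 of that era:

```
import json
import time
import urllib2  # use urllib.request on Python 3

while True:
    body = urllib2.urlopen("http://localhost:4040/api/v1/applications").read()
    print(json.dumps(json.loads(body), indent=2))  # latest application status
    time.sleep(5)                                  # poll instead of hitting F5
```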

Re: rdd cache name

2016-03-02 Thread charles li
your in-memory cache size or cache off-heap or to disk. > > Xinh > > On Wed, Mar 2, 2016 at 1:48 AM, charles li <charles.up...@gmail.com> > wrote: > >> hi, there, I feel a little confused about the *cache* in spark. >> >> first, is there any way to *customize the cach

rdd cache name

2016-03-02 Thread charles li
hi, there, I feel a little confused about *cache* in spark. first, is there any way to *customize the cached RDD name*? it's not convenient when looking at the storage page: the RDD Name column shows the kind of RDD, and I'd like it to show my own customized name instead of names like 'rdd 1',
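On the first point, RDD.setName() is what the storage page picks up. A minimal sketch, assuming a live SparkContext `sc`:

```
from pyspark import StorageLevel

rdd = sc.parallelize(range(1000)).setName("my_customized_rdd")
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # or .cache() for the default level
rdd.count()  # materialize it so it appears as "my_customized_rdd" in the UI
```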

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-29 Thread charles li
since spark is under active development, any book is somewhat outdated to some degree. I would suggest learning it in several ways, as below: - the spark official documentation; trust me, you will go through it several times if you want to learn it well:

how to interview spark developers

2016-02-23 Thread charles li
hi, there, we are going to recruit several spark developers. can someone give some ideas on interviewing candidates, say, spark-related problems? great thanks. -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

rdd cache priority

2016-02-04 Thread charles li
say I have 2 RDDs, RDD1 and RDD2, both 20g in memory, and I cache both of them using RDD1.cache() and RDD2.cache(). in the later steps of my app, I never use RDD1 but use RDD2 a lot. here is my question: if there is only 40G memory in my cluster, and here
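Cached blocks are evicted in LRU fashion, but if RDD1 is known to be dead weight it is cleaner to release it explicitly. A minimal sketch, assuming rdd1/rdd2 are the two 20g RDDs from the question:

```
rdd1.cache(); rdd1.count()   # ~20g cached
rdd2.cache(); rdd2.count()   # ~20g cached
# ... once rdd1 is no longer needed:
rdd1.unpersist()             # frees its blocks immediately, instead of waiting
                             # for LRU eviction under memory pressure
```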

spark.executor.memory: is it used just for cached RDDs, or for both cached RDDs and the runtime of the cores on the worker?

2016-02-04 Thread charles li
if I set spark.executor.memory = 2G for each worker [10 in total], does it mean I can cache 20G of RDDs in memory? if so, what about the memory for code running in each process on each worker? thanks. -- and are there any materials about memory management or resource management in spark? I want to
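Short answer: no, the 2G heap per executor is shared between caching and execution. A minimal sketch of the relevant knobs under the Spark 1.6 unified memory model; the values shown are the 1.6 defaults:

```
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.executor.memory", "2g")           # whole heap per executor
        .set("spark.memory.fraction", "0.75")         # execution + storage share
        .set("spark.memory.storageFraction", "0.5"))  # storage's protected part
# roughly 2g * 0.75 ~= 1.5g per executor is shared by caching AND shuffles,
# so 10 executors give well under 20g of guaranteed cache space
```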

questions about progress bar status [stuck]?

2016-02-01 Thread charles li
code: --- total = int(1e8) local_collection = range(1, total) rdd = sc.parallelize(local_collection) res = rdd.collect() --- web ui status --- [inline screenshot] problems: --- 1. from the status bar, it seems that about half the tasks should be done, but it just says there is
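One plausible reading of the stuck bar: the tasks themselves finish, but collect() then has to ship roughly 1e8 elements back to the driver, and that fetch is not reflected in the task counter. A minimal sketch that keeps the aggregation on the cluster instead:

```
total = int(1e8)
rdd = sc.parallelize(range(1, total))
print(rdd.count())  # or rdd.sum(): only a scalar travels back to the driver,
                    # rather than ~1e8 elements through collect()
```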

how to introduce spark to your colleague if he has no background in *** spark related

2016-01-31 Thread charles li
*Apache Spark™* is a fast and general engine for large-scale data processing. it's a good one-line profile of spark, but it's really too short for people who have little background in this field. ok, frankly, I'll give a tech talk about spark later this week, and now I'm writing slides about

confused about starting an ipython notebook with spark between 1.3.x and 1.6.x

2016-01-31 Thread charles li
I used spark 1.3.x before and explored my data in an ipython [3.2] notebook, which was very stable. but now I get the error "Java gateway process exited before sending the driver its port number". my code is as below: ``` import pyspark from pyspark import SparkConf sc_conf =
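One common cause of that gateway error is the notebook process launching the wrong (or no) Spark install when two versions coexist. A minimal sketch of a workaround, assuming the 1.6 install lives at the illustrative path below and the pyspark package is already importable:

```
import os

# point the Java gateway at the intended install before creating a context
os.environ["SPARK_HOME"] = "/opt/spark-1.6.0"  # illustrative path
# Spark 1.4+ requires the trailing "pyspark-shell" token here
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("notebook"))
```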

best practice: how to manage your Spark cluster?

2016-01-20 Thread charles li
I posted a thread before: pre-installing third-party Python packages on a spark cluster. currently I use *Fabric* to manage my cluster, but it's not enough for me, and I believe there is a much better way to *manage and monitor* the cluster. I believe there really exist some open-source management tools

rdd.foreach return value

2016-01-18 Thread charles li
code snippet [inline image] the 'print' actually prints info on the worker node, but I'm confused about where the 'return' value goes, since I get nothing on the driver node. -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao
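For the record, the answer the thread converges on: foreach() runs purely for its side effects on the workers and always returns None on the driver; use map() plus collect() when you want values back. A minimal sketch, assuming a live SparkContext `sc`:

```
rdd = sc.parallelize([1, 2, 3])

result = rdd.foreach(lambda x: x * 10)        # result is None; products discarded
values = rdd.map(lambda x: x * 10).collect()  # [10, 20, 30] back on the driver
```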

Re: rdd.foreach return value

2016-01-18 Thread charles li
ou can return element in the way shown in the snippet. > > On Mon, Jan 18, 2016 at 7:34 PM, charles li <charles.up...@gmail.com> > wrote: > >> code snippet >> >> >> ​ >> the 'print' actually print info on the worker node, but I feel confused >>

Re: rdd.foreach return value

2016-01-18 Thread charles li
s and calls the function being > passed. That's it. It doesn't collect the values and doesn't return any new > modified RDD. > > On Mon, Jan 18, 2016 at 11:10 PM, charles li <charles.up...@gmail.com> > wrote: > >> >> hi, great thanks to david and ted, I know that t

Re: [KafkaRDD]: rdd.cache() does not seem to work

2016-01-11 Thread charles li
cache() is persist() with the default storage level, and it is lazy [nothing is actually cached] until the first time the RDD is computed. On Tue, Jan 12, 2016 at 5:13 AM, ponkin wrote: > Hi, > > Here is my use case : > I have kafka topic. The job is fairly simple - it reads topic and
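A minimal sketch of the same point, assuming `rdd` is the KafkaRDD in question:

```
rdd.cache()   # shorthand for persist() with the default storage level
rdd.count()   # cache() is lazy: this first action is what fills the cache
rdd.count()   # later actions read the cached blocks instead of re-reading Kafka
```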

Re: Questions about disk IOs

2014-07-25 Thread Charles Li
: How many partitions did you use and how many CPU cores in total? The former shouldn't be much larger than the latter. Could you also check the shuffle size from the WebUI? -Xiangrui On Fri, Jul 25, 2014 at 4:10 AM, Charles Li littlee1...@gmail.com wrote: Hi Xiangrui, Thanks for your

Questions about disk IOs

2014-07-01 Thread Charles Li
Hi Spark, I am running LBFGS on our user data. The data size with Kryo serialisation is about 210G. The weight size is around 1,300,000. I am quite confused because the performance is very close whether or not the data is cached. The program is simple: points = sc.hadoopFile(int,
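One plausible explanation for cached and uncached running alike: if the 210G does not fit the cluster's memory, MEMORY_ONLY blocks that overflow are silently dropped and recomputed on every LBFGS iteration. A minimal sketch of what could be tried instead, reusing the `points` name from the snippet above:

```
from pyspark import StorageLevel

# spill what doesn't fit to local disk instead of recomputing each iteration
points.persist(StorageLevel.MEMORY_AND_DISK_SER)
points.count()  # materialize once, before handing the RDD to LBFGS
```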