Re: Spark performance testing

2016-07-08 Thread charles li
Hi, Andrew, I found plenty of material by googling "*spark performance test*": - https://github.com/databricks/spark-perf - https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf - http://people.cs.vt.edu/~butt

Is there a way to dynamically load files [ parquet or csv ] in the map function?

2016-07-08 Thread charles li
hi, guys, is there a way to dynamically load files within the map function? i.e. can I code as below: thanks a lot. -- *___* Quant | Engineer | Boy *___* *blog*: http://litaotao.github.io *github*: www.github.com/litaotao
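A pattern that fits this question: instead of loading files on the driver, ship the *paths* to the workers and open each file inside the task. A minimal sketch in plain Python; the Spark calls are commented out, assume a live SparkContext `sc`, and the paths are made-up examples:

```python
import csv

def load_csv_rows(path):
    """Runs on the worker that receives `path`, not on the driver."""
    with open(path, newline="") as f:
        return list(csv.reader(f))

# Hypothetical usage on a cluster where every worker can reach the paths:
# paths = sc.parallelize(["/data/part-a.csv", "/data/part-b.csv"])
# rows = paths.flatMap(load_csv_rows)   # each task opens its own file
```

The same shape works for parquet if a parquet reader is installed on the workers; the key point is that the function passed to `flatMap`/`map` is serialized and executed cluster-side.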

Preview release of Spark 2.0

2016-05-29 Thread charles li
Here is the link: http://spark.apache.org/news/spark-2.0.0-preview.html congrats, haha, looking forward to 2.0.1, awesome project. -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

question about Reynold's talk: " The Future of Real Time"

2016-04-22 Thread charles li
hi, there, the talk *The Future of Real Time in Spark*, here https://www.youtube.com/watch?v=oXkxXDG0gNk, says there will be "BI app integration" at 24:28 of the video. what does he mean by *BI app integration* in that talk? does that mean they will develop a BI tool like zeppelin, hue

Re: confusing about Spark SQL json format

2016-03-31 Thread charles li
--- On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY wrote: > Hi, > Look at below image which is from json.org : > > [image: Inline image 1] > > The above image describes the object formulation of below JSON: > > Object 1=> {"name":"Yin", &

confusing about Spark SQL json format

2016-03-31 Thread charles li
as this post says, in spark we can load a json file in the way below: *post* : https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html --- sqlContext.jsonFile(fil
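For reference, the usual stumbling block with that blog post's API: `jsonFile` (and the later `read.json`) expects JSON Lines input, one object per line, rather than a single pretty-printed document. A small sketch; the Spark call is commented out and assumes a `sqlContext`:

```python
import json

records = [{"name": "Yin", "age": 30}, {"name": "Michael", "age": 25}]

# One JSON object per line ("JSON Lines"), which is what jsonFile expects:
json_lines = "\n".join(json.dumps(r) for r in records)

# with open("people.json", "w") as f:
#     f.write(json_lines)
# df = sqlContext.jsonFile("people.json")   # or sqlContext.read.json(...)
```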

Re: since spark can not parallelize/serialize functions, how to distribute algorithms on the same data?

2016-03-28 Thread charles li
robably want to look at the map transformation, and the many more >> defined on RDDs. The function you pass in to map is serialized and the >> computation is distributed. >> >> >> On Monday, March 28, 2016, charles li wrote: >> >>> >>> use case: h

since spark can not parallelize/serialize functions, how to distribute algorithms on the same data?

2016-03-28 Thread charles li
use case: I have a dataset and want to run different algorithms on it and fetch the results. to do this, I think I should distribute my algorithms and run them on the dataset at the same time, am I right? but it seems that spark can not parallelize/serialize algorithms/function
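One way to read the replies in this thread: PySpark does serialize the function you pass to `map` (via cloudpickle), so the problem can be inverted by parallelizing the *algorithms* and broadcasting the shared dataset. A hedged sketch; the Spark lines are commented and assume a SparkContext `sc`:

```python
def mean(xs):
    return sum(xs) / len(xs)

def spread(xs):
    return max(xs) - min(xs)

algorithms = [mean, spread]
data = [1.0, 2.0, 3.0, 4.0]

# bc = sc.broadcast(data)                    # ship the data to workers once
# results = (sc.parallelize(algorithms)
#              .map(lambda f: f(bc.value))   # one task per algorithm
#              .collect())

# Driver-side equivalent of what the cluster would compute:
results = [f(data) for f in algorithms]
```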

Re: what happened if cache a RDD for multiple time?

2016-03-24 Thread charles li
age > */ > private[spark] def persistRDD(rdd: RDD[_]) { > persistentRdds(rdd.id) = rdd > } > > Hope this helps. > > Best > Yash > > On Thu, Mar 24, 2016 at 1:58 PM, charles li > wrote: > >> >> happened to see this problem on stackoverflow: >&g

what happened if cache a RDD for multiple time?

2016-03-24 Thread charles li
happened to see this problem on stackoverflow: http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812 I think it's very interesting; the answer posted by Aaron sounds promising, but I'm not sure, and I can't find the details o
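The gist of that answer: the driver tracks persisted RDDs in a map keyed by `rdd.id`, so a second `cache()` call just overwrites the same entry. A toy model of that bookkeeping (not Spark's actual classes):

```python
class ToyContext:
    """Toy stand-in for the driver-side persistence bookkeeping."""
    def __init__(self):
        self.persistent_rdds = {}

    def persist_rdd(self, rdd_id, rdd):
        self.persistent_rdds[rdd_id] = rdd   # same id -> same slot

ctx = ToyContext()
ctx.persist_rdd(1, "my-rdd")
ctx.persist_rdd(1, "my-rdd")   # "caching twice": a harmless no-op
```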

Re: best practices: running multi user jupyter notebook server

2016-03-20 Thread charles li
Hi, Andy, I think you can do that with some open source packages/libs built for IPython and Spark. here is one: https://github.com/litaotao/IPython-Dashboard On Thu, Mar 17, 2016 at 1:36 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > We are considering deploying a notebook serve

Re: best way to do deep learning on spark ?

2016-03-19 Thread charles li
layers, etc. are > currently under development. Please refer to > https://issues.apache.org/jira/browse/SPARK-5575 > > > > Best regards, Alexander > > > > *From:* charles li [mailto:charles.up...@gmail.com] > *Sent:* Wednesday, March 16, 2016 7:01 PM > *To:* user

best way to do deep learning on spark ?

2016-03-19 Thread charles li
Hi, guys, I'm new to MLlib on spark. after reading the documentation, it seems that MLlib does not support deep learning. I want to know whether there is any way to implement deep learning on spark. *Do I have to use a 3rd-party package like caffe or tensorflow?* or *Does deep learning module list in the MLlib de
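For context at the time of this thread: spark.ml did ship a plain feed-forward network, `MultilayerPerceptronClassifier` (added in 1.5), while deeper architectures were still tracked under SPARK-5575. The `layers` parameter is just the list of layer sizes; the sketch below comments out the Spark call (the DataFrame is hypothetical) and checks the parameter count that a given `layers` setting implies:

```python
# from pyspark.ml.classification import MultilayerPerceptronClassifier
# mlp = MultilayerPerceptronClassifier(layers=[4, 8, 3], maxIter=100)
# model = mlp.fit(training_df)   # training_df: hypothetical features+label DF

layers = [4, 8, 3]   # input features, one hidden layer, output classes

# Fully connected with one bias unit per layer:
n_weights = sum((a + 1) * b for a, b in zip(layers, layers[1:]))
```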

the "DAG Visualization" in 1.6 does not work properly here

2016-03-15 Thread charles li
sometimes it just shows several *black dots*, and sometimes it cannot show the entire graph. did anyone meet this before, and how did you fix it? -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

is there any way to make WEB UI auto-refresh?

2016-03-15 Thread charles li
every time I can only get the latest info by refreshing the page, which is a little tedious. so is there any way to make the Web UI auto-refresh? great thanks -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

Re: rdd cache name

2016-03-02 Thread charles li
y cache size or cache off-heap or to disk. > > Xinh > > On Wed, Mar 2, 2016 at 1:48 AM, charles li > wrote: > >> hi, there, I feel a little confused about the *cache* in spark. >> >> first, is there any way to *customize the cached RDD name*, it's not >

rdd cache name

2016-03-02 Thread charles li
hi, there, I feel a little confused about *cache* in spark. first, is there any way to *customize the cached RDD name*? it's not convenient for me when looking at the storage page: the RDD Name column shows the kind of RDD, and I hope to make it my customized name instead of names like 'rdd 1', 'rrd
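On the naming question: PySpark's `RDD.setName` does exactly this, and it returns the RDD, so it chains with `cache()`. The stub below only demonstrates the chaining shape; the commented line is the real call, assuming a SparkContext `sc`:

```python
class StubRDD:
    """Stand-in mimicking the chaining style of the real RDD API."""
    def __init__(self):
        self.name = None

    def setName(self, name):
        self.name = name
        return self        # returns self, so it chains

    def cache(self):
        return self

rdd = StubRDD().setName("user_events").cache()

# Real usage: sc.parallelize(range(100)).setName("user_events").cache()
# After an action materializes it, the Storage page lists "user_events".
```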

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-29 Thread charles li
since spark is under active development, any book about it is somewhat outdated to some degree. I would suggest learning it in several ways, as below: - the spark official documentation; trust me, you will go through it several times if you want to learn it well: http://spark.

how to interview spark developers

2016-02-23 Thread charles li
hi, there, we are going to recruit several spark developers. can someone give some ideas on interviewing candidates, say, spark-related problems? great thanks. -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

spark.executor.memory: is it used just for caching RDDs, or for both cached RDDs and the runtime of the cores on the worker?

2016-02-04 Thread charles li
if I set spark.executor.memory = 2G for each worker [ 10 in total ], does that mean I can cache 20G of RDDs in memory? if so, how about the memory for the code running in each process on each worker? thanks. -- and is there any material about memory management or resource management in spark? I want to p
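A rough sketch of the answer under Spark 1.6's unified memory model (defaults as of 1.6; treat the numbers as approximations and check your version's docs): executor memory is not reserved wholly for caching; after a fixed reserve, execution and storage share one pool.

```python
# Approximate Spark 1.6 defaults (hedged; verify against your version):
heap_mb = 2048                    # spark.executor.memory = 2g
reserved_mb = 300                 # fixed reserved memory in 1.6
memory_fraction = 0.75            # spark.memory.fraction default
storage_fraction = 0.5            # spark.memory.storageFraction default

unified_mb = (heap_mb - reserved_mb) * memory_fraction
storage_mb = unified_mb * storage_fraction   # cache can grow past this by
                                             # borrowing from execution
```

So with 2G executors, well under 2G per executor is available for cached blocks; the rest covers task execution and JVM overhead.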

rdd cache priority

2016-02-04 Thread charles li
say I have 2 RDDs, RDD1 and RDD2, both 20g in memory, and I cache both of them using RDD1.cache() and RDD2.cache(). then in the further steps of my app, I never use RDD1 but use RDD2 lots of times. then here is my question: if there is only 40G memory in my cluster, and here I
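For what it's worth, Spark's storage layer drops cached blocks in roughly least-recently-used order when memory runs out, so an unused RDD1 is the natural eviction victim; you can also free it explicitly with `rdd1.unpersist()`. A toy model of LRU eviction (not Spark's actual code):

```python
from collections import OrderedDict

class LRUStore:
    """Toy LRU block store: evict least-recently-used entries when full."""
    def __init__(self, capacity_gb):
        self.capacity = capacity_gb
        self.blocks = OrderedDict()

    def put(self, name, size_gb):
        while self.blocks and sum(self.blocks.values()) + size_gb > self.capacity:
            self.blocks.popitem(last=False)    # drop least recently used
        self.blocks[name] = size_gb

    def touch(self, name):
        self.blocks.move_to_end(name)          # mark as recently used

store = LRUStore(40)
store.put("RDD1", 20)
store.put("RDD2", 20)
store.touch("RDD2")       # app keeps using RDD2
store.put("RDD3", 20)     # over capacity: RDD1 is evicted, RDD2 survives
```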

questions about progress bar status [stuck]?

2016-02-01 Thread charles li
code:
---
total = int(1e8)
local_collection = range(1, total)
rdd = sc.parallelize(local_collection)
res = rdd.collect()
---
web ui status
---
[screenshot omitted]
problems:
---
1. from the status bar, it seems that about half the tasks should be done, but it just says there is no

how to introduce spark to your colleague if he has no background in spark-related fields

2016-01-31 Thread charles li
*Apache Spark™* is a fast and general engine for large-scale data processing. it's a good profile of spark, but it's really too short for lots of people if they have little background in this field. ok, frankly, I'll give a tech-talk about spark later this week, and now I'm writing a slide about

confusion about starting an ipython notebook with spark, between 1.3.x and 1.6.x

2016-01-31 Thread charles li
I used spark 1.3.x before and explored my data in an ipython [3.2] notebook, which was very stable. but now I come across the error " Java gateway process exited before sending the driver its port number ". my code is as below: ```
import pyspark
from pyspark import SparkConf
sc_conf = SparkCon
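In my experience, the "Java gateway process exited" error usually means pyspark could not launch `spark-submit` at all, commonly because of a wrong `SPARK_HOME` or Java setup. A hedged bootstrap sketch for a 1.6-era notebook; the path is only an example and the Spark lines are commented:

```python
import os

# Example path only; point this at your actual Spark installation.
os.environ.setdefault("SPARK_HOME", "/opt/spark-1.6.0")

# from pyspark import SparkConf, SparkContext
# conf = SparkConf().setAppName("notebook").setMaster("local[*]")
# sc = SparkContext(conf=conf)   # fails fast if the gateway cannot start
```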

best practice : how to manage your Spark cluster ?

2016-01-20 Thread charles li
I posted a thread before: pre-installing 3rd-party Python packages on a spark cluster. currently I use *Fabric* to manage my cluster, but it's not enough for me, and I believe there is a much better way to *manage and monitor* the cluster. I believe there really exist some open source management tools whic

Re: rdd.foreach return value

2016-01-18 Thread charles li
s and calls the function being > passed. That's it. It doesn't collect the values and don't return any new > modified RDD. > > > On Mon, Jan 18, 2016 at 11:10 PM, charles li > wrote: > >> >> hi, great thanks to david and ted, I know that the content o

Re: rdd.foreach return value

2016-01-18 Thread charles li
Unit = withScope { > > I don't think you can return element in the way shown in the snippet. > > On Mon, Jan 18, 2016 at 7:34 PM, charles li > wrote: > >> code snippet >> >> >> ​ >> the 'p

Re: rdd.foreach return value

2016-01-18 Thread charles li
the way shown in the snippet. > > On Mon, Jan 18, 2016 at 7:34 PM, charles li > wrote: > >> code snippet >> >> >> ​ >> the 'print' actually print info on the worker node, but I feel confused >

rdd.foreach return value

2016-01-18 Thread charles li
code snippet: the 'print' actually prints info on the worker node, but I feel confused about where the 'return' value goes, for I get nothing on the driver node. -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao
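The short answer given in this thread, as a sketch: `foreach` is an action run purely for its side effects on the workers and returns nothing to the driver; to get values back, use `map` plus `collect`. The Spark lines are commented (they assume a SparkContext `sc`); the plain-Python lines mirror the distinction:

```python
def double(x):
    return x * 2

# rdd = sc.parallelize([1, 2, 3])
# rdd.foreach(print)            # prints in worker logs; returns None
# rdd.map(double).collect()     # values come back to the driver

# The same distinction in plain Python:
values = [1, 2, 3]
side_effect_result = [print(v) for v in values]   # like foreach: all None
mapped_result = [double(v) for v in values]       # like map: real values
```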

Re: [KafkaRDD]: rdd.cache() does not seem to work

2016-01-11 Thread charles li
cache is persist with the default storage level, and it is lazy [ nothing is cached indeed ] until the first time the RDD is computed. On Tue, Jan 12, 2016 at 5:13 AM, ponkin wrote: > Hi, > > Here is my use case : > I have kafka topic. The job is fairly simple - it reads topic and save > data to several hd
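A toy model of the laziness being described (not Spark's classes): `cache()` only marks the dataset, and the first action actually computes and stores it.

```python
class LazyCached:
    """Toy model: cache() is a marker; an action triggers computation."""
    def __init__(self, compute):
        self.compute = compute
        self.value = None
        self.computed = False

    def cache(self):
        return self                # just a marker; nothing computed yet

    def count(self):               # an "action" triggers computation
        if not self.computed:
            self.value = self.compute()
            self.computed = True
        return len(self.value)

rdd = LazyCached(lambda: [1, 2, 3]).cache()
assert rdd.computed is False       # cache() alone stored nothing
rdd.count()                        # first action materializes the data
```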

Re: Snappy error when driver is running in JBoss

2015-01-06 Thread Charles Li
Hi, thanks for the reply! I did an echo $CLASSPATH, but got nothing. Since we are running inside JBoss, I guess the classpath is not set? People did mention that JBoss loads snappy-java multiple times, but I cannot find a way to solve that problem. Cheers On Jan 6, 2015, at 5:3

Re: Questions about disk IOs

2014-07-25 Thread Charles Li
any partitions did you use and how many CPU cores in total? The > former shouldn't be much larger than the latter. Could you also check > the shuffle size from the WebUI? -Xiangrui > > On Fri, Jul 25, 2014 at 4:10 AM, Charles Li wrote: >> Hi Xiangrui, >> >> Thanks fo

Re: Questions about disk IOs

2014-07-25 Thread Charles Li
own On Jul 2, 2014, at 0:08, Xiangrui Meng wrote: > Try to reduce number of partitions to match the number of cores. We > will add treeAggregate to reduce the communication cost. > > PR: https://github.com/apache/spark/pull/1110 > > -Xiangrui > > On Tue, Jul 1, 2014 at

Questions about disk IOs

2014-07-01 Thread Charles Li
Hi Spark, I am running LBFGS on our user data. The data size with Kryo serialisation is about 210G. The weight size is around 1,300,000. I am quite confused that the performance is very close whether or not the data is cached. The program is simple: points = sc.hadoopFile(int, SequenceFileInput
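A back-of-the-envelope check suggested by the replies in this thread: if the partition count far exceeds the total core count, every LBFGS iteration pays for many scheduling "waves" plus aggregation overhead, which can swamp any caching benefit. The numbers below are illustrative only, and the remedy lines are commented sketches assuming a SparkContext `sc`:

```python
import math

total_cores = 64          # hypothetical cluster total
num_partitions = 2000     # hypothetical input partitioning

# Tasks run in waves of `total_cores`; each LBFGS iteration pays this:
waves_per_iteration = math.ceil(num_partitions / total_cores)

# Remedy sketch: match partitions to cores, then materialize the cache once.
# points = points.coalesce(total_cores).cache()
# points.count()
```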