Hi, Andrew, I found lots of materials when searching Google for "*spark
performance test*":
- https://github.com/databricks/spark-perf
- https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf
- http://people.cs.vt.edu/~butt
Hi, guys, is there a way to dynamically load files within the map function?
I.e., can I code as below:
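(The code snippet was cut off in the archive; below is a hypothetical sketch of the kind of thing being asked, with made-up paths: opening a file inside the function passed to map, which only works if that path exists on every worker node.)
```
# Hypothetical sketch: load a lookup file lazily inside the map function.
# '/data/lookup.txt' is a made-up path and must exist on every executor.
def tag_line(line):
    with open('/data/lookup.txt') as f:
        lookup = set(f.read().split())
    return (line, line in lookup)

tagged = sc.textFile('hdfs:///input/events').map(tag_line)
```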
thanks a lot.
--
*___*
Quant | Engineer | Boy
*___*
*blog*: http://litaotao.github.io
*github*: www.github.com/litaotao
Here is the link: http://spark.apache.org/news/spark-2.0.0-preview.html
Congrats, haha, looking forward to 2.0.1. Awesome project.
--
*--*
a spark lover, a quant, a developer and a good man.
http://github.com/litaotao
Hi, there, the talk *The Future of Real Time in Spark* here
https://www.youtube.com/watch?v=oXkxXDG0gNk mentions "BI
app integration" at 24:28 of the video.
What does he mean by *BI app integration* in that talk? Does that mean
they will develop a BI tool like Zeppelin or Hue?
---
On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY
wrote:
> Hi,
> Look at the image below, which is from json.org:
>
> [image: Inline image 1]
>
> The above image describes the object structure of the JSON below:
>
> Object 1=> {"name":"Yin", &
As this post says, in Spark we can load a JSON file in the way shown
below:
*post* :
https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
---
sqlContext.jsonFile(fil
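(The line above was truncated by the archive; for completeness, a minimal runnable sketch following that blog post's Spark 1.x API, where "people.json" is a hypothetical path and each line must hold one JSON object:)
```
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
people = sqlContext.jsonFile("people.json")  # schema is inferred automatically
people.printSchema()
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people").show()
```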
probably want to look at the map transformation, and the many more
>> defined on RDDs. The function you pass in to map is serialized and the
>> computation is distributed.
>>
>>
>> On Monday, March 28, 2016, charles li wrote:
Use case: I have a dataset and want to use different algorithms on it, then
fetch the results.
To make this happen, I think I should distribute my algorithms and run them
on the dataset at the same time, am I right?
But it seems that Spark cannot parallelize/serialize algorithms/function
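For what it's worth, a minimal sketch of one way to read that (assuming plain PySpark, with made-up algorithm functions): the functions you pass to transformations are pickled and shipped to the executors, so the simplest approach is to apply each algorithm to the same cached RDD in turn.
```
# Hypothetical sketch: apply several algorithms to one cached dataset.
data = sc.parallelize(range(100)).cache()  # shared dataset, cached once

def algo_a(x):
    return x * 2

def algo_b(x):
    return x ** 2

# each function is serialized and run on the executors
results = {f.__name__: data.map(f).sum() for f in (algo_a, algo_b)}
print(results)  # {'algo_a': 9900, 'algo_b': 328350}
```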
> */
> private[spark] def persistRDD(rdd: RDD[_]) {
>   persistentRdds(rdd.id) = rdd
> }
>
> Hope this helps.
>
> Best
> Yash
>
> On Thu, Mar 24, 2016 at 1:58 PM, charles li
> wrote:
>
I happened to see this problem on Stack Overflow:
http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812
I think it's very interesting, and the answer posted by Aaron sounds
promising, but I'm not sure, and I can't find the details o
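A minimal sketch of what the persistRDD snippet quoted above implies (my reading, not a definitive answer): calling cache() twice on the same RDD is effectively a no-op, since the same id is just re-registered.
```
rdd = sc.parallelize(range(100))

rdd.cache()   # registers the RDD with storage level MEMORY_ONLY
rdd.cache()   # no-op: same RDD id and same storage level, nothing changes

rdd.count()   # blocks are actually materialized once, on the first action
```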
Hi, Andy, I think you can do that with some open-source packages/libs
built for IPython and Spark.
Here is one: https://github.com/litaotao/IPython-Dashboard
On Thu, Mar 17, 2016 at 1:36 AM, Andy Davidson <
a...@santacruzintegration.com> wrote:
> We are considering deploying a notebook serve
layers, etc. are
> currently under development. Please refer to
> https://issues.apache.org/jira/browse/SPARK-5575
>
> Best regards, Alexander
>
> *From:* charles li [mailto:charles.up...@gmail.com]
> *Sent:* Wednesday, March 16, 2016 7:01 PM
> *To:* user
Hi, guys, I'm new to MLlib on Spark. After reading the documentation, it seems
that MLlib does not support deep learning. I want to know: is there any way
to implement deep learning on Spark?
*Must I use a third-party package like Caffe or TensorFlow?*
or
*Is a deep learning module on the MLlib de
Sometimes it just shows several *black dots*, and sometimes it cannot show
the entire graph.
Did anyone meet this before, and how did you fix it?
--
*--*
a spark lover, a quant, a developer and a good man.
http://github.com/litaotao
Every time I can only get the latest info by refreshing the page, which is a
little tedious.
So is there any way to make the web UI auto-refresh?
Great thanks
--
*--*
a spark lover, a quant, a developer and a good man.
http://github.com/litaotao
y cache size or cache off-heap or to disk.
>
> Xinh
>
> On Wed, Mar 2, 2016 at 1:48 AM, charles li
> wrote:
Hi, there, I feel a little confused about *cache* in Spark.
First, is there any way to *customize the cached RDD name*? It's not
convenient when looking at the storage page: the RDD Name column just shows
the kind of RDD, and I'd like it to show my own name instead of things like
'rdd 1', 'rrd
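On the naming question, a minimal sketch (assuming the standard RDD API): RDD.setName lets you label an RDD before caching, and that label is what the storage page shows.
```
rdd = sc.parallelize(range(1000))
rdd.setName("my_customized_rdd")  # appears in the RDD Name column of the Storage tab
rdd.cache()
rdd.count()  # the first action materializes the cache so it shows up in the UI
```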
Since Spark is under active development, learning it from a book is
somewhat outdated to some degree.
I would suggest learning it in several ways, as below:
- the official Spark documentation; trust me, you will go through it
several times if you want to learn it well: http://spark.
Hi, there, we are going to recruit several Spark developers. Can someone
give some ideas on interviewing candidates, say, Spark-related problems?
Great thanks.
--
*--*
a spark lover, a quant, a developer and a good man.
http://github.com/litaotao
If I set spark.executor.memory = 2G for each worker [10 in total],
does it mean I can cache a 20G RDD in memory? If so, how about the memory
for code running in each process on each worker?
Thanks.
--
And are there any materials about memory management or resource management
in Spark? I want to p
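A back-of-the-envelope sketch, assuming Spark 1.x legacy memory-management defaults (spark.storage.memoryFraction = 0.6, safety fraction 0.9; your version and config may differ): only part of each executor's heap is available for cached blocks.
```
# Hypothetical arithmetic, not a definitive answer:
executor_memory_gb = 2
num_executors = 10
storage_fraction = 0.6 * 0.9   # cache fraction under Spark 1.x defaults

cache_capacity_gb = executor_memory_gb * storage_fraction * num_executors
print(cache_capacity_gb)  # ~10.8 GB for cached RDDs, not the full 20 GB;
                          # the rest is left for task execution and user code
```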
Say I have 2 RDDs, RDD1 and RDD2.
Both are 20G in memory,
and I cache both of them using RDD1.cache() and RDD2.cache().
Then in the further steps of my app, I never use RDD1 but use RDD2 lots of
times.
Here is my question:
if there is only 40G memory in my cluster, and here I
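A minimal sketch of the usual remedy (assuming the standard RDD API): unpersist the RDD you no longer need instead of waiting for LRU eviction to kick it out.
```
rdd1 = sc.parallelize(range(10 ** 6)).cache()
rdd2 = sc.parallelize(range(10 ** 6)).cache()

rdd1.count()      # materialize both caches
rdd2.count()

rdd1.unpersist()  # free rdd1's blocks explicitly; otherwise Spark evicts
                  # cached partitions LRU-style when new blocks need the space
```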
code:
---
total = int(1e8)
local_collection = range(1, total)      # ~1e8 elements built on the driver
rdd = sc.parallelize(local_collection)  # shipped out to the executors
res = rdd.collect()                     # pulls everything back to the driver
---
web ui status
---
problems:
---
1. From the status bar, it seems that about half the tasks should be
done, but it just says there is no
*Apache Spark™* is a fast and general engine for large-scale data
processing.
It's a good profile of Spark, but it's really too short for people
who have little background in this field.
OK, frankly, I'll give a tech talk about Spark later this week, and now I'm
writing a slide about
I used to use Spark 1.3.x and explore my data in an IPython [3.2]
notebook, which was very stable, but I came across an error:
" Java gateway process exited before sending the driver its port number "
My code is as below:
```
import pyspark
from pyspark import SparkConf

sc_conf = SparkConf()
```
I posted a thread before: pre-installing third-party Python packages on a
Spark cluster.
Currently I use *Fabric* to manage my cluster, but it's not enough for me,
and I believe there is a much better way to *manage and monitor* the
cluster.
I believe there really exist some open-source management tools whic
s and calls the function being
> passed. That's it. It doesn't collect the values and doesn't return any new
> modified RDD.
>
>
> On Mon, Jan 18, 2016 at 11:10 PM, charles li
> wrote:
>
>>
>> hi, great thanks to david and ted, I know that the content o
Unit = withScope {
>
> I don't think you can return elements in the way shown in the snippet.
>
> On Mon, Jan 18, 2016 at 7:34 PM, charles li
> wrote:
code snippet
the 'print' actually prints info on the worker node, but I feel confused
about where the 'return' value goes, for I get nothing on the driver node.
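A minimal sketch of the behaviour being discussed (the original snippet was cut off, so this is a hypothetical stand-in): a function passed to foreach runs on the executors for its side effects only, and its return value is discarded, while map keeps the returned values in a new RDD.
```
rdd = sc.parallelize(range(5))

def show(x):
    print(x)        # goes to the executor's stdout, not the driver's
    return x * 10   # discarded by foreach

rdd.foreach(show)                # returns None on the driver
print(rdd.map(show).collect())   # map keeps the values: [0, 10, 20, 30, 40]
```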
--
*--*
a spark lover, a quant, a developer and a good man.
http://github.com/litaotao
cache() uses the default storage level of persist() [MEMORY_ONLY], and it is
lazy [nothing is cached yet] until the first time the RDD is computed.
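In code form (a small sketch, assuming the standard API):
```
from pyspark import StorageLevel

rdd = sc.parallelize(range(10))
rdd.persist(StorageLevel.MEMORY_ONLY)  # exactly what rdd.cache() does
rdd.count()  # nothing is stored until this first action computes the RDD
```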
On Tue, Jan 12, 2016 at 5:13 AM, ponkin wrote:
> Hi,
>
> Here is my use case :
> I have a Kafka topic. The job is fairly simple: it reads the topic and saves
> data to several hd
Hi
Thanks for the reply! I did do an echo $CLASSPATH, but I got nothing. Since
we are running inside JBoss, I guess the classpath is not set?
People did mention that JBoss loads snappy-java multiple times, but I
cannot find a way to solve that problem.
Cheers
On Jan 6, 2015, at 5:3
many partitions did you use and how many CPU cores in total? The
> former shouldn't be much larger than the latter. Could you also check
> the shuffle size from the WebUI? -Xiangrui
>
> On Fri, Jul 25, 2014 at 4:10 AM, Charles Li wrote:
>> Hi Xiangrui,
>>
>> Thanks fo
On Jul 2, 2014, at 0:08, Xiangrui Meng wrote:
> Try to reduce number of partitions to match the number of cores. We
> will add treeAggregate to reduce the communication cost.
>
> PR: https://github.com/apache/spark/pull/1110
>
> -Xiangrui
>
> On Tue, Jul 1, 2014 at
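A sketch of Xiangrui's partition advice above in PySpark (the points RDD and input path are hypothetical; coalesce shrinks the partition count without a full shuffle):
```
num_cores = sc.defaultParallelism           # total cores available to the app
points = sc.textFile("hdfs:///points")      # hypothetical input
points = points.coalesce(num_cores).cache() # avoid far more partitions than cores
```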
Hi Spark,
I am running LBFGS on our user data. The data size with Kryo serialisation is
about 210G. The weight size is around 1,300,000. I am quite confused that the
performance is very close whether or not the data is cached.
The program is simple:
points = sc.hadoopFile(int, SequenceFileInput