Hi, Andrew, I've found lots of material by asking Google for "*spark
performance test*":
- https://github.com/databricks/spark-perf
- https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf
-
hi, guys, is there a way to dynamically load files within the map function?
i.e.
can I code as in the sketch below?
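something like this (a minimal sketch; the path, file format, and parsing
are just placeholders, and sc.addFile + SparkFiles.get is one way I imagine
doing it):
```
from pyspark import SparkFiles

sc.addFile("hdfs:///data/lookup.txt")  # ship the file to every worker (placeholder path)

def parse(record):
    # SparkFiles.get resolves the local copy on whichever worker runs this map
    with open(SparkFiles.get("lookup.txt")) as f:
        table = dict(line.strip().split(",") for line in f)
    return table.get(record)

result = sc.parallelize(["a", "b"]).map(parse).collect()
```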
thanks a lot.
--
*___*
Quant | Engineer | Boy
*___*
*blog*:http://litaotao.github.io
*github*: www.github.com/litaotao
Here is the link: http://spark.apache.org/news/spark-2.0.0-preview.html
congrats, haha, looking forward to 2.0.1, awesome project.
--
*--*
a spark lover, a quant, a developer and a good man.
http://github.com/litaotao
hi, there, the talk *The Future of Real Time in Spark* here
https://www.youtube.com/watch?v=oXkxXDG0gNk says at 24:28 of the video that
there will be "BI app integration".
what does he mean by *BI app integration* in that talk? does that mean
that they will develop a BI tool like zeppelin,
--
On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY <umesh9...@gmail.com>
wrote:
> Hi,
> Look at below image which is from json.org :
>
> [image: Inline image 1]
>
> The above image describes the object formulation of below JSON:
>
> Object 1=> {"nam
as this post says, in spark we can load a json file in the way below:
*post* :
https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
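for example, a minimal sketch based on that post (people.json is a
placeholder path; each line should hold one JSON object):
```
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.json("people.json")  # infers the schema from the data
df.printSchema()
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people").show()
```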
---
arau <hol...@pigscanfly.ca>
> wrote:
>
>> You probably want to look at the map transformation, and the many more
>> defined on RDDs. The function you pass in to map is serialized and the
>> computation is distributed.
>>
>>
>> On Monday, March 28, 2016, ch
use case: I have a dataset and want to run different algorithms on it and
fetch the results.
to make this work, I think I should distribute my algorithms and run them
on the dataset at the same time, am I right?
but it seems that spark can not parallelize/serialize
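what I have in mind is roughly this sketch (the two algorithms are trivial
placeholders for my real ones): cache the dataset once, then submit each
algorithm as an independent job from its own driver thread:
```
import threading

data = sc.parallelize(range(1000)).cache()  # shared dataset, cached once

def algo_sum(rdd):
    return rdd.sum()

def algo_count(rdd):
    return rdd.count()

results = {}

def run(name, fn):
    # each thread submits an independent Spark job over the shared RDD
    results[name] = fn(data)

threads = [threading.Thread(target=run, args=(n, f))
           for n, f in (("sum", algo_sum), ("count", algo_count))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```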
> Register an RDD to be persisted in memory and/or disk storage
> */
> private[spark] def persistRDD(rdd: RDD[_]) {
> persistentRdds(rdd.id) = rdd
> }
>
> Hope this helps.
>
> Best
> Yash
>
> On Thu, Mar 24, 2016 at 1:58 PM, charles li <charles.up...@gmail.com>
> wrote:
>
>>
>
I happened to see this problem on stackoverflow:
http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812
I think it's very interesting, and the answer posted by Aaron sounds
promising, but I'm not sure, and I haven't found the details
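to make it concrete, the situation is this sketch (my reading is that the
second cache() is a no-op because the RDD is already marked with the same
storage level, but that is exactly the part I'm unsure about):
```
rdd = sc.parallelize(range(100))
rdd.cache()
rdd.cache()  # presumably a no-op: same RDD, same storage level
rdd.count()  # the first action is what actually materializes the cache
```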
Hi, andy, I think you can do that with some open source packages/libs
built for IPython and Spark.
here is one: https://github.com/litaotao/IPython-Dashboard
On Thu, Mar 17, 2016 at 1:36 AM, Andy Davidson <
a...@santacruzintegration.com> wrote:
> We are considering deploying a notebook
rs, etc. are
> currently under development. Please refer to
> https://issues.apache.org/jira/browse/SPARK-5575
>
>
>
> Best regards, Alexander
>
>
>
> *From:* charles li [mailto:charles.up...@gmail.com]
> *Sent:* Wednesday, March 16, 2016 7:01 PM
> *To:* u
Hi, guys, I'm new to MLlib on spark. after reading the documentation, it
seems that MLlib does not support deep learning, and I want to know whether
there is any way to implement deep learning on spark.
*Must I use a 3rd-party package like caffe or tensorflow ?*
or
*Is there a deep learning module listed in the MLlib
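for what it's worth, I did find a multilayer perceptron classifier in the
ml package; a minimal sketch of it (the toy data is made up, and the exact
imports may differ by version: in 1.x, Vectors lives under
pyspark.mllib.linalg):
```
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame(
    [(0.0, Vectors.dense([0.0, 0.0])),
     (1.0, Vectors.dense([1.0, 1.0]))],
    ["label", "features"])

# layers: 2 inputs, one hidden layer of 4 units, 2 output classes
mlp = MultilayerPerceptronClassifier(layers=[2, 4, 2], seed=42)
model = mlp.fit(df)
```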
sometimes it just shows several *black dots*, and sometimes it cannot show
the entire graph.
has anyone met this before, and how did you fix it?
--
*--*
a spark lover, a quant, a developer and a good man.
http://github.com/litaotao
every time I can only get the latest info by refreshing the page, which is
a little boring.
so is there any way to make the web UI auto-refresh?
great thanks
--
*--*
a spark lover, a quant, a developer and a good man.
http://github.com/litaotao
your in-memory cache size or cache off-heap or to disk.
>
> Xinh
>
> On Wed, Mar 2, 2016 at 1:48 AM, charles li <charles.up...@gmail.com>
> wrote:
>
>> hi, there, I feel a little confused about the *cache* in spark.
>>
>> first, is there any way to *customize the cach
hi, there, I feel a little confused about the *cache* in spark.
first, is there any way to *customize the cached RDD name*? it's not
convenient for me when looking at the storage page: the RDD Name column
only shows the kind of RDD, and I'd like it to show my own customized name
instead of names like 'rdd 1',
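ideally something like this sketch (assuming pyspark's RDD.setName labels
the entry on the Storage page the way I hope):
```
rdd = sc.parallelize(range(1000)).setName("my_training_set").cache()
rdd.count()  # after this action, the Storage page should list "my_training_set"
```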
since spark is under active development, a book on it will be somewhat
outdated to some degree.
I would suggest learning it in several ways, as below:
- the spark official documentation; trust me, you will go through it
several times if you want to learn it well :
hi, there, we are going to recruit several spark developers. can someone
give some ideas on interviewing candidates, say, spark related problems?
great thanks.
--
*--*
a spark lover, a quant, a developer and a good man.
http://github.com/litaotao
say I have 2 RDDs, RDD1 and RDD2.
both are 20g in memory.
and I cache both of them in memory using RDD1.cache() and RDD2.cache().
then in the further steps of my app, I never use RDD1 but use RDD2 lots of
times.
then here is my question:
if there is only 40G memory in my cluster, and here
if I set spark.executor.memory = 2G for each worker [ 10 in total ],
does it mean I can cache a 20G RDD in memory? if so, how about the memory
for the code running in each process on each worker?
thanks.
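for reference, the setup I mean is roughly this sketch (note: my
understanding is that only a fraction of each executor heap, controlled by
spark.storage.memoryFraction in 1.x with a default around 60%, can hold
cached blocks, so 10 x 2G would not give a full 20G of cache):
```
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("cache-sizing")
        .set("spark.executor.memory", "2g"))  # per executor; 10 executors
sc = SparkContext(conf=conf)
# total heap = 10 * 2g = 20g, but only the storage fraction of each heap
# holds cached RDD partitions; the rest serves execution and user code
```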
--
and are there any materials about memory management or resource management
in spark? I want to
code:
---
total = int(1e8)
local_collection = range(1, total)      # ~1e8 elements built on the driver
rdd = sc.parallelize(local_collection)  # distribute them across the cluster
res = rdd.collect()                     # pull all ~1e8 elements back to the driver
---
web ui status
---
problems:
---
1. from the status bar, it seems that about half the tasks should be done,
but it just says there is
*Apache Spark™* is a fast and general engine for large-scale data
processing.
it's a good profile of spark, but it's really too short for lots of people
if they have little background in this field.
ok, frankly, I'll give a tech talk about spark later this week, and now I'm
writing a slide about
I used spark 1.3.x before and explored my data in an ipython [3.2]
notebook, which was very stable. but I came across an error:
" Java gateway process exited before sending the driver its port number "
my code is as below:
```
import pyspark
from pyspark import SparkConf

# my snippet ended at "sc_conf ="; a plausible continuation
# (app name and master below are placeholders):
sc_conf = SparkConf().setAppName("notebook").setMaster("local[*]")
sc = pyspark.SparkContext(conf=sc_conf)
```
I posted a thread before: pre-installing 3rd-party Python packages on a
spark cluster.
currently I use *Fabric* to manage my cluster, but it's not enough for me,
and I believe there is a much better way to *manage and monitor* the
cluster.
I believe there really exist some open source management tools
code snippet
the 'print' actually prints info on the worker node, but I'm confused about
where the 'return' value goes, since I get nothing on the driver node.
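the snippet was essentially like this sketch (toy data; a reconstruction,
not the original):
```
rdd = sc.parallelize([1, 2, 3])

def f(x):
    print(x)       # shows up in the *worker* stdout/logs, not on the driver
    return x * 10  # foreach simply discards this return value

rdd.foreach(f)               # nothing comes back to the driver
print(rdd.map(f).collect())  # map + collect does bring values back
```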
--
*--*
a spark lover, a quant, a developer and a good man.
http://github.com/litaotao
You can return elements in the way shown in the snippet.
>
> On Mon, Jan 18, 2016 at 7:34 PM, charles li <charles.up...@gmail.com>
> wrote:
>
>> code snippet
>>
>>
>>
>> the 'print' actually print info on the worker node, but I feel confused
>>
s and calls the function being
> passed. That's it. It doesn't collect the values and doesn't return any new
> modified RDD.
>
>
> On Mon, Jan 18, 2016 at 11:10 PM, charles li <charles.up...@gmail.com>
> wrote:
>
>>
>> hi, great thanks to david and ted, I know that t
cache is persist with the default storage level, and it is lazy [ nothing
is actually cached ] until the first time the RDD is computed.
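a minimal sketch of that point:
```
rdd = sc.parallelize(range(1000)).cache()  # lazy: nothing is stored yet
rdd.count()  # first action computes the RDD and materializes the cache
rdd.count()  # now served from the cache
```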
On Tue, Jan 12, 2016 at 5:13 AM, ponkin wrote:
> Hi,
>
> Here is my use case :
> I have a kafka topic. The job is fairly simple - it reads the topic and
How many partitions did you use and how many CPU cores in total? The
former shouldn't be much larger than the latter. Could you also check
the shuffle size from the WebUI? -Xiangrui
On Fri, Jul 25, 2014 at 4:10 AM, Charles Li littlee1...@gmail.com wrote:
Hi Xiangrui,
Thanks for your
Hi Spark,
I am running LBFGS on our user data. The data size with Kryo serialisation is
about 210G. The weight size is around 1,300,000. I am quite confused that the
performance is very close whether the data is cached or not.
The program is simple:
points = sc.hadoopFile(int,