Re: Is memory-only no-disk Spark possible?

2021-08-20 Thread Jack Kolokasis

Hello Jacek,

On 20/8/21 2:49 μ.μ., Jacek Laskowski wrote:

Hi,

I've been exploring BlockManager and the stores for a while now and am 
tempted to say that a memory-only Spark setup would be possible 
(except shuffle blocks). Is this correct?

Correct.


What about shuffle blocks? Do they have to be stored on disk (in 
DiskStore)?

Well, by default Spark stores shuffle blocks on disk.
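
For illustration, a minimal sketch of memory-only caching (assuming a spark-shell session where a SparkSession named spark is available); note that the shuffle at the end still writes files to the executors' local disks regardless of the caching storage level:

    import org.apache.spark.storage.StorageLevel
    import spark.implicits._

    // Keep the cached data strictly in memory: partitions that don't fit are
    // recomputed from lineage, never spilled to disk.
    val df = spark.range(0, 1000000).toDF("id")
    df.persist(StorageLevel.MEMORY_ONLY)
    df.count()                                   // materializes the cache

    // A wide transformation still writes shuffle files to local disk,
    // independently of the storage level chosen for caching.
    df.groupBy($"id" % 10).count().collect()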


I think broadcast variables are in-memory first, so unless an on-disk 
storage level is explicitly used (by Spark devs), there's no reason not 
to have Spark in-memory only.


(I was told that one of the differences between Trino/Presto and Spark 
SQL is that Trino keeps all processing in memory only and will blow up 
under memory pressure, while Spark uses disk to avoid OOMEs.)


Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski 
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski 





Best,
Iacovos


Re: About how to read spark source code with a good way

2020-08-19 Thread Jack Kolokasis

Hi Joyan,

check this link: https://github.com/jackkolokasis/SparkInternals

Thanks
Iacovos

On 19/8/20 9:09 π.μ., joyan sil wrote:

Hi Jack and Spark experts,

Further to the question asked in this thread, what are some 
recommended resources (blogs/videos) that have helped you deep dive 
into the Spark source code?

Thanks

Regards
Joyan

On Wed, Aug 19, 2020 at 11:06 AM Jack Kolokasis <koloka...@ics.forth.gr> wrote:


Hi,

From my experience, I suggest reading both blogs and the source code.
Blogs will give you the high-level knowledge for the different parts
of the source code.

Iacovos

On 18/8/20 3:53 μ.μ., 2400 wrote:
> Hi everyone,
>
> I am an engineer. I have been using Spark, and I want to try to make
> Spark better. I want to become a committer. Before that, I want to know
> Spark thoroughly, so who knows how best to read the Spark source code,
> or can recommend related blogs for me to learn from?
> Please, can somebody help me; let's make Spark better.




Re: About how to read spark source code with a good way

2020-08-18 Thread Jack Kolokasis

Hi,

From my experience, I suggest reading both blogs and the source code. Blogs 
will give you the high-level knowledge for the different parts of the 
source code.


Iacovos

On 18/8/20 3:53 μ.μ., 2400 wrote:

Hi everyone,

I am an engineer. I have been using Spark, and I want to try to make 
Spark better. I want to become a committer. Before that, I want to know 
Spark thoroughly, so who knows how best to read the Spark source code, 
or can recommend related blogs for me to learn from?

Please, can somebody help me; let's make Spark better.





Re: Exact meaning of spark.memory.storageFraction in spark 2.3.x

2020-03-20 Thread Jack Kolokasis
This is just a counter showing the size of cached RDDs. If it is 
zero, it means that no caching has occurred. Also, even when storage 
memory is used for computation, the counter will still show zero.
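
For illustration, a small sketch of how the counter behaves (assuming a spark-shell session where sc is available):

    import org.apache.spark.storage.StorageLevel

    // Before the count below, "Storage Memory" in the Executors tab reads 0.
    val rdd = sc.parallelize(1 to 1000000)
    rdd.persist(StorageLevel.MEMORY_ONLY)
    rdd.count()   // caching happens on the first action

    // After the action completes, the Storage tab and the "Storage Memory"
    // column should both report a non-zero cached size for this RDD.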


Iacovos

On 20/3/20 4:51 μ.μ., Michel Sumbul wrote:

Hi,

Thanks for the very quick reply!
If I see the metric "storage memory" always at 0, does that mean 
that the memory is used neither for caching nor for computing?


Thanks,
Michel





On Fri, Mar 20, 2020 at 14:45, Jack Kolokasis <koloka...@ics.forth.gr> wrote:


Hello Michel,

Spark separates executor memory using an adaptive boundary between
storage and execution memory. If there is no caching and execution
memory needs more space, then it will use a portion of the storage
memory.

If your program does not use caching, then you can reduce the storage
memory.

Iacovos

On 20/3/20 4:40 μ.μ., msumbul wrote:
> Hello,
>
> I'm asking myself about the exact meaning of the setting
> spark.memory.storageFraction.
> The documentation mentions:
>
> "Amount of storage memory immune to eviction, expressed as a
fraction of the
> size of the region set aside by spark.memory.fraction. The
higher this is,
> the less working memory may be available to execution and tasks
may spill to
> disk more often"
>
> Does that mean that, if there is no caching, that part of the memory
> will not be used at all?
> In the Spark UI, in the "Executors" tab, I can see that the "storage memory"
> is always zero. Does that mean that that part of the memory is never used
> at all and I can reduce it, or that it is never used for storage specifically?
>
> Thanks in advance for your help,
> Michel
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
>




Re: Exact meaning of spark.memory.storageFraction in spark 2.3.x

2020-03-20 Thread Jack Kolokasis

Hello Michel,

Spark separates executor memory using an adaptive boundary between 
storage and execution memory. If there is no caching and execution 
memory needs more space, then it will use a portion of the storage memory.


If your program does not use caching, then you can reduce the storage memory.
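
For completeness, a sketch of where the two knobs are usually set (the values are illustrative; in practice they are often passed via spark-submit --conf or spark-defaults.conf):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("memory-tuning-example")
      // fraction of (heap - reserved) shared by execution and storage
      .config("spark.memory.fraction", "0.6")
      // part of that region protected from eviction when caching is used
      .config("spark.memory.storageFraction", "0.3")
      .getOrCreate()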

Iacovos

On 20/3/20 4:40 μ.μ., msumbul wrote:

Hello,

I'm asking myself about the exact meaning of the setting
spark.memory.storageFraction.
The documentation mentions:

"Amount of storage memory immune to eviction, expressed as a fraction of the
size of the region set aside by spark.memory.fraction. The higher this is,
the less working memory may be available to execution and tasks may spill to
disk more often"

Does that mean that, if there is no caching, that part of the memory will not
be used at all?
In the Spark UI, in the "Executors" tab, I can see that the "storage memory"
is always zero. Does that mean that that part of the memory is never used at
all and I can reduce it, or that it is never used for storage specifically?

Thanks in advance for your help,
Michel



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/







Calculate Task Memory Usage

2019-10-11 Thread Jack Kolokasis

Hello to all,

I am trying to calculate how much memory each task in Spark consumes. Is 
there any way to measure this?
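
One option, sketched below under the assumption of a spark-shell session, is a SparkListener that reads the per-task peak execution memory from the task metrics. This is an approximation of per-task memory use (shuffles, aggregations, joins), not a full accounting of every allocation:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Logs, for each finished task, its peak execution memory.
    class TaskMemoryListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
          println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
            s"peakExecutionMemory=${metrics.peakExecutionMemory} bytes")
        }
      }
    }

    sc.addSparkListener(new TaskMemoryListener)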


Thanks,
Iacovos




Shuffle Spill to Disk

2019-09-28 Thread Jack Kolokasis

Hello,

I am trying to measure how many bytes are spilled to disk during the shuffle 
operation, and I always get zero. This does not seem correct, because the 
Spark local disk is being utilized.


Can anyone explain why the spill counter is zero?
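
If it helps: my understanding is that the regular shuffle files written during an exchange are not counted as "spill"; the spill counters only grow when in-memory sorters/aggregators run out of execution memory, so local-disk usage with zero spill is plausible. A sketch (assuming a spark-shell session) for reading the per-task spill counters:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Reports per-task spill, if any occurred.
    class SpillListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null && (m.memoryBytesSpilled > 0 || m.diskBytesSpilled > 0)) {
          println(s"task=${taskEnd.taskInfo.taskId} " +
            s"memoryBytesSpilled=${m.memoryBytesSpilled} " +
            s"diskBytesSpilled=${m.diskBytesSpilled}")
        }
      }
    }

    sc.addSparkListener(new SpillListener)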

Thanks,
Iacovos




Spark and Java10

2019-07-06 Thread Jack Kolokasis

Hello,

    I am trying to use Apache Spark v2.3.1 with Java 10, but I cannot. 
The Spark documentation states that Spark works with Java 8+. So, has 
anyone tried to use Apache Spark with Java 10?


Thanks for your help,
Iacovos




Re: installation of spark

2019-06-04 Thread Jack Kolokasis

Hello,

    At first you will need to make sure that Java is installed, or 
install it otherwise. Then install Scala and a build tool (sbt or 
Maven). In my view, IntelliJ IDEA is a good option for creating 
your Spark applications. Finally, you may also need to install a 
distributed file system, e.g. HDFS.


    I don't think there is an all-in-one configuration, but there are 
examples of how to configure your Spark cluster (e.g. 
https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-standalone-example-2-workers-on-1-node-cluster.adoc).
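
As a quick sanity check once Java, Scala and a build tool are in place, a minimal sketch that runs Spark in local mode, with no cluster or HDFS required (the object and application names are just placeholders):

    import org.apache.spark.sql.SparkSession

    // Runs Spark entirely inside this JVM: no cluster, no Hadoop/HDFS needed.
    object HelloSpark {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hello-spark")
          .master("local[*]")     // use all local cores
          .getOrCreate()

        println(spark.range(1, 100).count())   // should print 99
        spark.stop()
      }
    }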


Best,
--Iacovos
On 5/6/19 5:50 π.μ., ya wrote:

Dear list,

I am very new to Spark, and I am having trouble installing it on my 
Mac. I have the following questions; please give me some guidance. Thank 
you very much.


1. How many and which pieces of software should I install before installing 
Spark? I have been searching online, and people discussing their 
experiences on this topic have different opinions: some say there is 
no need to install Hadoop before installing Spark, while others say Hadoop 
has to be installed before Spark. Some people say Scala has to be 
installed, whereas others say Scala is included in Spark and is 
installed automatically once Spark is installed. So I am confused about 
what to install for a start.


2. Is there a simple way to configure these software packages? For instance, 
an all-in-one configuration file? It takes forever for me to configure 
things before I can really use them for data analysis.


I hope my questions make sense. Thank you very much.

Best regards,

YA


Re: Difference between Checkpointing and Persist

2019-04-18 Thread Jack Kolokasis

Hi,

    In my view, a good approach is to first persist your data with 
StorageLevel.MEMORY_AND_DISK and then perform the join. This will accelerate 
your computation because the data will be present in memory and on your 
local intermediate storage device.
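
To make the comparison concrete, a small sketch (assuming a spark-shell session; bigRdd and otherRdd are hypothetical pair RDDs standing in for the data in the question):

    import org.apache.spark.storage.StorageLevel

    // Hypothetical pair RDDs standing in for the "huge" RDD in the question.
    val bigRdd   = sc.parallelize(1 to 1000000).map(i => (i % 1000, i))
    val otherRdd = sc.parallelize(1 to 1000).map(i => (i, s"row-$i"))

    // Option 2 from the question: persist, materialize, then join.
    bigRdd.persist(StorageLevel.MEMORY_AND_DISK)
    bigRdd.count()                       // materializes the cache
    val joined = bigRdd.join(otherRdd)   // reuses the cached partitions

    // Option 1 for comparison: checkpoint() also truncates the lineage, but
    // it writes the RDD to the checkpoint directory (often HDFS), which is
    // usually slower than a local memory-and-disk cache.
    // sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // path is illustrative
    // bigRdd.checkpoint()
    // bigRdd.count()                    // checkpointing happens on the next action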


--Iacovos

On 4/18/19 8:49 PM, Subash Prabakar wrote:

Hi All,

I have a question about checkpointing and persisting/saving.

Say we have one RDD containing huge data:
1. We checkpoint and perform a join
2. We persist as StorageLevel.MEMORY_AND_DISK and perform a join
3. We save that intermediate RDD and perform a join (using the same RDD - 
saving is just to persist the intermediate result before joining)



Which of the above is faster, and what's the difference?


Thanks,
Subash





Load Time from HDFS

2019-04-02 Thread Jack Kolokasis

Hello,

    I want to ask if there is any way to measure the HDFS data loading time at 
the start of my program. I tried to add an action, e.g. count(), after the 
val data = sc.textFile() call, but I notice that my program takes more time 
to finish than before adding the count call. Is there any other way to do it?
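
The extra running time is expected, since count() adds a full pass over the data that the lazy textFile() call alone never triggers. A sketch of measuring that first pass explicitly (assuming a spark-shell session; the path is illustrative):

    // textFile() is lazy: nothing is read from HDFS until an action runs, so
    // the "load time" is really the time of the first action that scans the data.
    val t0 = System.nanoTime()
    val data = sc.textFile("hdfs:///path/to/input")   // path is illustrative
    val lines = data.count()                          // forces the full read
    val loadSeconds = (System.nanoTime() - t0) / 1e9
    println(s"Read $lines lines in $loadSeconds s")

    // To avoid paying for the scan twice, cache after the first read:
    // data.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK)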


Thanks,
--Iacovos




Re: Spark Profiler

2019-03-27 Thread Jack Kolokasis
Thanks for your reply. Your help is very valuable, and all these links 
are helpful (especially your example).


Best Regards

--Iacovos

On 3/27/19 10:42 PM, Luca Canali wrote:


I find the Spark metrics system quite useful for gathering 
resource utilization metrics of Spark applications, including CPU, 
memory and I/O.


If you are interested, there is an example of how this works for us at: 
https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark

If instead you are looking for ways to instrument your Spark 
code with performance metrics, Spark task metrics and event listeners 
are quite useful for that. See also 
https://github.com/apache/spark/blob/master/docs/monitoring.md and 
https://github.com/LucaCanali/sparkMeasure
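
As a quick illustration, usage is roughly like the following sketch, based on the sparkMeasure README (the package coordinates, Scala version and release number are assumptions; check them against the version you use):

    // spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23
    val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
    stageMetrics.runAndMeasure {
      spark.sql("select count(*) from range(1000) cross join range(1000)").show()
    }
    // Prints aggregated stage metrics (executor CPU time, shuffle and I/O
    // counters, etc.) after the measured block completes.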


Regards,

Luca

From: manish ranjan
Sent: Tuesday, March 26, 2019 15:24
To: Jack Kolokasis
Cc: user
Subject: Re: Spark Profiler

I have found Ganglia very helpful in understanding network I/O, CPU 
and memory usage for a given Spark cluster.

I have not used it, but I have heard good things about Dr. Elephant (which 
I think was contributed by LinkedIn, but I am not 100% sure).


On Tue, Mar 26, 2019, 5:59 AM Jack Kolokasis <koloka...@ics.forth.gr> wrote:


Hello all,

I am looking for a Spark profiler to trace my application to find
the bottlenecks. I need to trace CPU usage, memory usage and I/O usage.

I am looking forward to your reply.

--Iacovos





Spark Profiler

2019-03-26 Thread Jack Kolokasis

Hello all,

    I am looking for a Spark profiler to trace my application to find 
the bottlenecks. I need to trace CPU usage, memory usage and I/O usage.


I am looking forward to your reply.

--Iacovos





Measure Serialization / De-serialization Time

2018-11-15 Thread Jack Kolokasis

Hello all,

    I am running a simple Word Count application using storage level 
MEMORY_ONLY in one case and OFF_HEAP in the other. I see that the 
execution time when I run my application off-heap is higher than 
on-heap, so I am looking into where this time goes.


My first thought is that this time goes to the serialization and 
deserialization of the objects.


    - I want to measure that; is there any way?

    - Also, do you have any opinion on where this time goes?

My experimental setup is: 1 driver and 1 executor (on the same 
machine). The executor has 1 GB of memory and 1 core. The off-heap size is 1 GB.
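
For reference, a sketch of the off-heap configuration such a run typically needs (the values mirror the setup described above; everything else is an assumption). One plausible source of the extra time is that off-heap blocks are always stored serialized, so reads pay a deserialization cost that deserialized MEMORY_ONLY caching avoids:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("wordcount-offheap")
      .config("spark.executor.memory", "1g")
      .config("spark.executor.cores", "1")
      // Off-heap execution/storage memory must be enabled explicitly and is
      // allocated in addition to the 1 GB on-heap executor memory.
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "1g")
      .getOrCreate()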


Thanks for your reply,
--Iacovos





StorageLevel: OffHeap

2018-11-08 Thread Jack Kolokasis

Hello everyone,
    I am running a simple word count in Spark and I persist my RDDs 
using StorageLevel.OFF_HEAP. While the application is running, I see 
through the Spark Web UI that the RDDs are persisted to disk. Why does this happen?

Can anyone tell me how the off-heap storage level works?
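
If it helps, and with the caveat that this reflects my reading of the Spark 2.x sources rather than anything stated in the thread: StorageLevel.OFF_HEAP also sets useDisk = true, which would explain blocks showing up on disk when the configured off-heap space is too small. A sketch of building a level without the disk fallback (assuming a spark-shell session):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000)

    // Arguments: useDisk, useMemory, useOffHeap, deserialized, replication.
    // Unlike StorageLevel.OFF_HEAP (which also enables useDisk and can
    // therefore fall back to disk), this level has no disk fallback.
    val offHeapNoDisk = StorageLevel(false, true, true, false, 1)

    // Off-heap block storage also needs spark.memory.offHeap.enabled=true and
    // a non-zero spark.memory.offHeap.size configured for the executors.
    rdd.persist(offHeapNoDisk)
    rdd.count()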

Thanks for your help,
--Iacovos Kolokasis
