Re: Is memory-only no-disk Spark possible?

2021-08-20 Thread Jack Kolokasis

Hello Jacek,

On 20/8/21 2:49 μ.μ., Jacek Laskowski wrote:

Hi,

I've been exploring BlockManager and the stores for a while now and am 
tempted to say that a memory-only Spark setup would be possible 
(except shuffle blocks). Is this correct?

Correct.


What about shuffle blocks? Do they have to be stored on disk (in 
DiskStore)?

Well, by default Spark stores shuffle blocks on disk.
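
For illustration, a minimal sketch of memory-only caching (assuming a spark-shell session where a SparkSession named spark is available); note that the shuffle at the end still writes files to the executors' local disks regardless of the caching storage level:

    import org.apache.spark.storage.StorageLevel
    import spark.implicits._

    // Keep the cached data strictly in memory: partitions that don't fit are
    // recomputed from lineage, never spilled to disk.
    val df = spark.range(0, 1000000).toDF("id")
    df.persist(StorageLevel.MEMORY_ONLY)
    df.count()                                   // materializes the cache

    // A wide transformation still writes shuffle files to local disk,
    // independently of the storage level chosen for caching.
    df.groupBy($"id" % 10).count().collect()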


I think broadcast variables are in-memory first, so unless an on-disk 
storage level is explicitly used (by Spark devs), there's no reason not 
to have Spark in-memory only.


(I was told that one of the differences between Trino/Presto and Spark 
SQL is that Trino keeps all processing in memory only and will blow up 
under memory pressure, while Spark uses disk to avoid OOMEs.)


Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski 
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski 





Best,
Iacovos


Re: About how to read spark source code with a good way

2020-08-19 Thread Jack Kolokasis

Hi Joyan,

check this link: https://github.com/jackkolokasis/SparkInternals

Thanks
Iacovos

On 19/8/20 9:09 π.μ., joyan sil wrote:

Hi Jack and Spark experts,

Further to the question asked in this thread, what are some 
recommended resources (blogs/videos) that have helped you deep dive 
into the Spark source code?

Thanks

Regards
Joyan

On Wed, Aug 19, 2020 at 11:06 AM Jack Kolokasis <koloka...@ics.forth.gr> wrote:


Hi,

From my experience, I suggest reading both blogs and the source code.
Blogs will give you the high-level knowledge for the different parts
of the source code.

Iacovos

On 18/8/20 3:53 μ.μ., 2400 wrote:
> Hi everyone,
>
> I am an engineer. I have been using Spark, and I want to try to make
> Spark better. I want to become a committer. Before that, I want to know
> Spark thoroughly, so who knows how best to read the Spark source code,
> or can recommend related blogs for me to learn from?
> Please, can somebody help me; let's make Spark better.




Re: About how to read spark source code with a good way

2020-08-18 Thread Jack Kolokasis

Hi,

From my experience, I suggest reading both blogs and the source code. Blogs 
will give you the high-level knowledge for the different parts of the 
source code.


Iacovos

On 18/8/20 3:53 μ.μ., 2400 wrote:

Hi everyone,

I am an engineer. I have been using Spark, and I want to try to make 
Spark better. I want to become a committer. Before that, I want to know 
Spark thoroughly, so who knows how best to read the Spark source code, 
or can recommend related blogs for me to learn from?

Please, can somebody help me; let's make Spark better.





Re: Exact meaning of spark.memory.storageFraction in spark 2.3.x

2020-03-20 Thread Jack Kolokasis
This is just a counter showing the size of cached RDDs. If it is 
zero, it means that no caching has occurred. Also, even when storage 
memory is used for computation, the counter will still show zero.
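
For illustration, a small sketch of how the counter behaves (assuming a spark-shell session where sc is available):

    import org.apache.spark.storage.StorageLevel

    // Before the count below, "Storage Memory" in the Executors tab reads 0.
    val rdd = sc.parallelize(1 to 1000000)
    rdd.persist(StorageLevel.MEMORY_ONLY)
    rdd.count()   // caching happens on the first action

    // After the action completes, the Storage tab and the "Storage Memory"
    // column should both report a non-zero cached size for this RDD.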


Iacovos

On 20/3/20 4:51 μ.μ., Michel Sumbul wrote:

Hi,

Thanks for the very quick reply!
If I see the metric "storage memory" always at 0, does that mean 
that the memory is used neither for caching nor for computing?


Thanks,
Michel





On Fri, Mar 20, 2020 at 14:45, Jack Kolokasis <koloka...@ics.forth.gr> wrote:


Hello Michel,

Spark separates executor memory using an adaptive boundary between
storage and execution memory. If there is no caching and execution
memory needs more space, then it will use a portion of the storage
memory.

If your program does not use caching, then you can reduce the storage
memory.

Iacovos

On 20/3/20 4:40 μ.μ., msumbul wrote:
> Hello,
>
> I'm asking myself about the exact meaning of the setting
> spark.memory.storageFraction.
> The documentation mentions:
>
> "Amount of storage memory immune to eviction, expressed as a
fraction of the
> size of the region set aside by spark.memory.fraction. The
higher this is,
> the less working memory may be available to execution and tasks
may spill to
> disk more often"
>
> Does that mean that, if there is no caching, that part of the memory
> will not be used at all?
> In the Spark UI, in the "Executors" tab, I can see that the "storage memory"
> is always zero. Does that mean that that part of the memory is never used
> at all and I can reduce it, or that it is never used for storage specifically?
>
> Thanks in advance for your help,
> Michel
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
>




Re: Exact meaning of spark.memory.storageFraction in spark 2.3.x

2020-03-20 Thread Jack Kolokasis

Hello Michel,

Spark separates executor memory using an adaptive boundary between 
storage and execution memory. If there is no caching and execution 
memory needs more space, then it will use a portion of the storage memory.


If your program does not use caching, then you can reduce the storage memory.
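
For completeness, a sketch of where the two knobs are usually set (the values are illustrative; in practice they are often passed via spark-submit --conf or spark-defaults.conf):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("memory-tuning-example")
      // fraction of (heap - reserved) shared by execution and storage
      .config("spark.memory.fraction", "0.6")
      // part of that region protected from eviction when caching is used
      .config("spark.memory.storageFraction", "0.3")
      .getOrCreate()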

Iacovos

On 20/3/20 4:40 μ.μ., msumbul wrote:

Hello,

I'm asking myself about the exact meaning of the setting
spark.memory.storageFraction.
The documentation mentions:

"Amount of storage memory immune to eviction, expressed as a fraction of the
size of the region set aside by spark.memory.fraction. The higher this is,
the less working memory may be available to execution and tasks may spill to
disk more often"

Does that mean that, if there is no caching, that part of the memory will not
be used at all?
In the Spark UI, in the "Executors" tab, I can see that the "storage memory"
is always zero. Does that mean that that part of the memory is never used at
all and I can reduce it, or that it is never used for storage specifically?

Thanks in advance for your help,
Michel



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/







Calculate Task Memory Usage

2019-10-11 Thread Jack Kolokasis

Hello to all,

I am trying to calculate how much memory each task in Spark consumes. Is 
there any way to measure this?
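
One option, sketched below under the assumption of a spark-shell session, is a SparkListener that reads the per-task peak execution memory from the task metrics. This is an approximation of per-task memory use (shuffles, aggregations, joins), not a full accounting of every allocation:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Logs, for each finished task, its peak execution memory.
    class TaskMemoryListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
          println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
            s"peakExecutionMemory=${metrics.peakExecutionMemory} bytes")
        }
      }
    }

    sc.addSparkListener(new TaskMemoryListener)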


Thanks,
Iacovos




Shuffle Spill to Disk

2019-09-28 Thread Jack Kolokasis

Hello,

I am trying to measure how many bytes are spilled to disk during the shuffle 
operation, and I always get zero. This does not seem correct, because the 
Spark local disk is being utilized.


Can anyone explain why the spill counter is zero?
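
If it helps: my understanding is that the regular shuffle files written during an exchange are not counted as "spill"; the spill counters only grow when in-memory sorters/aggregators run out of execution memory, so local-disk usage with zero spill is plausible. A sketch (assuming a spark-shell session) for reading the per-task spill counters:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Reports per-task spill, if any occurred.
    class SpillListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null && (m.memoryBytesSpilled > 0 || m.diskBytesSpilled > 0)) {
          println(s"task=${taskEnd.taskInfo.taskId} " +
            s"memoryBytesSpilled=${m.memoryBytesSpilled} " +
            s"diskBytesSpilled=${m.diskBytesSpilled}")
        }
      }
    }

    sc.addSparkListener(new SpillListener)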

Thanks,
Iacovos




Spark and Java10

2019-07-06 Thread Jack Kolokasis

Hello,

    I am trying to use Apache Spark v2.3.1 with Java 10, but I cannot. 
The Spark documentation states that Spark works with Java 8+. So, has 
anyone tried to use Apache Spark with Java 10?


Thanks for your help,
Iacovos




Re: installation of spark

2019-06-04 Thread Jack Kolokasis

Hello,

    At first you will need to make sure that Java is installed, or 
install it otherwise. Then install Scala and a build tool (sbt or 
Maven). In my view, IntelliJ IDEA is a good option for creating 
your Spark applications. Finally, you may also need to install a 
distributed file system, e.g. HDFS.


    I don't think there is an all-in-one configuration, but there are 
examples of how to configure your Spark cluster (e.g. 
https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-standalone-example-2-workers-on-1-node-cluster.adoc).
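
As a quick sanity check once Java, Scala and a build tool are in place, a minimal sketch that runs Spark in local mode, with no cluster or HDFS required (the object and application names are just placeholders):

    import org.apache.spark.sql.SparkSession

    // Runs Spark entirely inside this JVM: no cluster, no Hadoop/HDFS needed.
    object HelloSpark {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hello-spark")
          .master("local[*]")     // use all local cores
          .getOrCreate()

        println(spark.range(1, 100).count())   // should print 99
        spark.stop()
      }
    }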


Best,
--Iacovos
On 5/6/19 5:50 π.μ., ya wrote:

Dear list,

I am very new to Spark, and I am having trouble installing it on my 
Mac. I have the following questions; please give me some guidance. Thank 
you very much.


1. How many and which pieces of software should I install before installing 
Spark? I have been searching online, and people discussing their 
experiences on this topic have different opinions: some say there is 
no need to install Hadoop before installing Spark, while others say Hadoop 
has to be installed before Spark. Some people say Scala has to be 
installed, whereas others say Scala is included in Spark and is 
installed automatically once Spark is installed. So I am confused about 
what to install for a start.


2. Is there a simple way to configure these software packages? For instance, 
an all-in-one configuration file? It takes forever for me to configure 
things before I can really use them for data analysis.


I hope my questions make sense. Thank you very much.

Best regards,

YA


Re: Difference between Checkpointing and Persist

2019-04-18 Thread Jack Kolokasis

Hi,

    In my view, a good approach is to first persist your data with 
StorageLevel.MEMORY_AND_DISK and then perform the join. This will accelerate 
your computation because the data will be present in memory and on your 
local intermediate storage device.
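
To make the comparison concrete, a small sketch (assuming a spark-shell session; bigRdd and otherRdd are hypothetical pair RDDs standing in for the data in the question):

    import org.apache.spark.storage.StorageLevel

    // Hypothetical pair RDDs standing in for the "huge" RDD in the question.
    val bigRdd   = sc.parallelize(1 to 1000000).map(i => (i % 1000, i))
    val otherRdd = sc.parallelize(1 to 1000).map(i => (i, s"row-$i"))

    // Option 2 from the question: persist, materialize, then join.
    bigRdd.persist(StorageLevel.MEMORY_AND_DISK)
    bigRdd.count()                       // materializes the cache
    val joined = bigRdd.join(otherRdd)   // reuses the cached partitions

    // Option 1 for comparison: checkpoint() also truncates the lineage, but
    // it writes the RDD to the checkpoint directory (often HDFS), which is
    // usually slower than a local memory-and-disk cache.
    // sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // path is illustrative
    // bigRdd.checkpoint()
    // bigRdd.count()                    // checkpointing happens on the next action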


--Iacovos

On 4/18/19 8:49 PM, Subash Prabakar wrote:

Hi All,

I have a question about checkpointing and persisting/saving.

Say we have one RDD containing huge data:
1. We checkpoint and perform a join
2. We persist as StorageLevel.MEMORY_AND_DISK and perform a join
3. We save that intermediate RDD and perform a join (using the same RDD - 
saving is just to persist the intermediate result before joining)



Which of the above is faster, and what's the difference?


Thanks,
Subash





Load Time from HDFS

2019-04-02 Thread Jack Kolokasis

Hello,

    I want to ask if there is any way to measure the HDFS data loading time at 
the start of my program. I tried to add an action, e.g. count(), after the 
val data = sc.textFile() call, but I notice that my program takes more time 
to finish than before adding the count call. Is there any other way to do it?
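
The extra running time is expected, since count() adds a full pass over the data that the lazy textFile() call alone never triggers. A sketch of measuring that first pass explicitly (assuming a spark-shell session; the path is illustrative):

    // textFile() is lazy: nothing is read from HDFS until an action runs, so
    // the "load time" is really the time of the first action that scans the data.
    val t0 = System.nanoTime()
    val data = sc.textFile("hdfs:///path/to/input")   // path is illustrative
    val lines = data.count()                          // forces the full read
    val loadSeconds = (System.nanoTime() - t0) / 1e9
    println(s"Read $lines lines in $loadSeconds s")

    // To avoid paying for the scan twice, cache after the first read:
    // data.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK)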


Thanks,
--Iacovos




Re: Spark Profiler

2019-03-27 Thread Jack Kolokasis
Thanks for your reply. Your help is very valuable, and all these links 
are helpful (especially your example).


Best Regards

--Iacovos

On 3/27/19 10:42 PM, Luca Canali wrote:


I find the Spark metrics system quite useful for gathering 
resource utilization metrics of Spark applications, including CPU, 
memory and I/O.


If you are interested, there is an example of how this works for us at: 
https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark

If instead you are looking for ways to instrument your Spark 
code with performance metrics, Spark task metrics and event listeners 
are quite useful for that. See also 
https://github.com/apache/spark/blob/master/docs/monitoring.md and 
https://github.com/LucaCanali/sparkMeasure
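
As a quick illustration, usage is roughly like the following sketch, based on the sparkMeasure README (the package coordinates, Scala version and release number are assumptions; check them against the version you use):

    // spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23
    val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
    stageMetrics.runAndMeasure {
      spark.sql("select count(*) from range(1000) cross join range(1000)").show()
    }
    // Prints aggregated stage metrics (executor CPU time, shuffle and I/O
    // counters, etc.) after the measured block completes.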


Regards,

Luca

From: manish ranjan
Sent: Tuesday, March 26, 2019 15:24
To: Jack Kolokasis
Cc: user
Subject: Re: Spark Profiler

I have found Ganglia very helpful in understanding network I/O, CPU 
and memory usage for a given Spark cluster.

I have not used it, but I have heard good things about Dr. Elephant (which 
I think was contributed by LinkedIn, but I am not 100% sure).


On Tue, Mar 26, 2019, 5:59 AM Jack Kolokasis <koloka...@ics.forth.gr> wrote:


Hello all,

I am looking for a Spark profiler to trace my application to find
the bottlenecks. I need to trace CPU usage, memory usage and I/O usage.

I am looking forward to your reply.

--Iacovos





Spark Profiler

2019-03-26 Thread Jack Kolokasis

Hello all,

    I am looking for a Spark profiler to trace my application to find 
the bottlenecks. I need to trace CPU usage, memory usage and I/O usage.


I am looking forward to your reply.

--Iacovos





Measure Serialization / De-serialization Time

2018-11-15 Thread Jack Kolokasis

Hello all,

    I am running a simple Word Count application using storage level 
MEMORY_ONLY in one case and OFF_HEAP in the other. I see that the 
execution time when I run my application off-heap is higher than 
on-heap, so I am looking into where this time goes.


My first thought is that this time goes to the serialization and 
deserialization of the objects.


    - I want to measure that; is there any way?

    - Also, do you have any opinion on where this time goes?

My experimental setup is: 1 driver and 1 executor (on the same 
machine). The executor has 1 GB of memory and 1 core. The off-heap size is 1 GB.
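
For reference, a sketch of the off-heap configuration such a run typically needs (the values mirror the setup described above; everything else is an assumption). One plausible source of the extra time is that off-heap blocks are always stored serialized, so reads pay a deserialization cost that deserialized MEMORY_ONLY caching avoids:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("wordcount-offheap")
      .config("spark.executor.memory", "1g")
      .config("spark.executor.cores", "1")
      // Off-heap execution/storage memory must be enabled explicitly and is
      // allocated in addition to the 1 GB on-heap executor memory.
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "1g")
      .getOrCreate()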


Thanks for your reply,
--Iacovos





StorageLevel: OffHeap

2018-11-08 Thread Jack Kolokasis

Hello everyone,
    I am running a simple word count in Spark and I persist my RDDs 
using StorageLevel.OFF_HEAP. While the application is running, I see 
through the Spark Web UI that the RDDs are persisted to disk. Why does this happen?

Can anyone tell me how the off-heap storage level works?
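
If it helps, and with the caveat that this reflects my reading of the Spark 2.x sources rather than anything stated in the thread: StorageLevel.OFF_HEAP also sets useDisk = true, which would explain blocks showing up on disk when the configured off-heap space is too small. A sketch of building a level without the disk fallback (assuming a spark-shell session):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000)

    // Arguments: useDisk, useMemory, useOffHeap, deserialized, replication.
    // Unlike StorageLevel.OFF_HEAP (which also enables useDisk and can
    // therefore fall back to disk), this level has no disk fallback.
    val offHeapNoDisk = StorageLevel(false, true, true, false, 1)

    // Off-heap block storage also needs spark.memory.offHeap.enabled=true and
    // a non-zero spark.memory.offHeap.size configured for the executors.
    rdd.persist(offHeapNoDisk)
    rdd.count()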

Thanks for your help,
--Iacovos Kolokasis
