Re: JavaSerializerInstance is slow

2021-09-02 Thread Antonin Delpeuch (lists)
Hi Kohki,

Serialization of tasks happens in local mode too, and as far as I am
aware there is no way to disable it (although that would definitely be
useful, in my opinion).

You can see local mode as a testing mode, in which you would want to
catch any serialization errors before they appear in production.
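
For instance, a closure that accidentally captures a non-serializable
object fails in local mode just as it would on a cluster. A minimal
sketch (ConnectionPool is a made-up stand-in for any non-serializable
helper):

    import org.apache.spark.{SparkConf, SparkContext}

    // Stand-in for any helper that does not implement java.io.Serializable.
    class ConnectionPool

    object SerializationCheck {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[*]").setAppName("check")
        val sc = new SparkContext(conf)

        val pool = new ConnectionPool
        // The closure below captures `pool`. Spark checks closure
        // serializability eagerly, so this fails with
        // "org.apache.spark.SparkException: Task not serializable"
        // at the `map` call, even though we are running in local mode.
        sc.parallelize(1 to 10).map(x => (pool.hashCode, x)).count()

        sc.stop()
      }
    }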

There are also some important bugs that manifest themselves only in
local mode and are not deemed worth fixing, because local mode is not
intended to be used in production
(https://issues.apache.org/jira/browse/SPARK-5300).

I think there would definitely be interest in having a reliable and
efficient local mode in Spark, but it is a pretty different use case
from what Spark originally focused on.

Antonin

On 03/09/2021 05:56, Kohki Nishio wrote:
> I'm seeing many threads doing deserialization of tasks. I understand
> that since lambdas are involved, we can't use Kryo for those purposes.
> However, I'm running in local mode, so this serialization is not really
> necessary, is it?
>
> Is there any trick I can apply to get rid of this thread contention?
> I'm seeing many of the below threads in thread dumps ...
>
>
> "Executor task launch worker for task 11.0 in stage 15472514.0 (TID
> 19788863)" #732821 daemon prio=5 os_prio=0 tid=0x7f02581b2800
> nid=0x355d waiting for monitor entry [0x7effd1e3f000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:400)
> - waiting to lock <0x7f0f7246edf8> (a java.lang.Object)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
> at
> scala.runtime.LambdaDeserializer$.deserializeLambda(LambdaDeserializer.scala:51)
> at
> scala.runtime.LambdaDeserialize.deserializeLambda(LambdaDeserialize.java:38) 
>
>
> Thanks
> -Kohki




Async API to save RDDs?

2020-08-05 Thread Antonin Delpeuch (lists)
Hi,

The RDD API provides async variants of a few RDD methods, which let the
user execute the corresponding jobs asynchronously. This makes it
possible, for instance, to cancel the jobs:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/AsyncRDDActions.html
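
A minimal sketch of that pattern (assuming an existing SparkContext
named `sc`):

    import org.apache.spark.FutureAction

    // countAsync comes from AsyncRDDActions, available on any RDD via
    // an implicit conversion; it returns a handle on the submitted job.
    val action: FutureAction[Long] = sc.parallelize(1 to 1000000).countAsync()

    // The handle can be cancelled; FutureAction also extends
    // scala.concurrent.Future, so it can be awaited or given callbacks.
    action.cancel()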

There do not seem to be async versions of the save methods such as
`saveAsTextFile`:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#saveAsTextFile-java.lang.String-

Is there another way to start such jobs and get a handle on them (such
as the job id)? Specifically, I would like to be able to stop save jobs
on user request.
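
The closest workaround I can think of is to run the save on a separate
thread under a job group, and to cancel the group on user request. A
sketch (untested; the group id and output path are placeholders):

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global

    val save: Future[Unit] = Future {
      // Job groups are tracked per submitting thread, so the group is
      // set inside the thread that actually runs the save job.
      sc.setJobGroup("save-job", "saveAsTextFile requested by the user",
        interruptOnCancel = true)
      rdd.saveAsTextFile("/tmp/output")
    }

    // Later, from any thread, on user request:
    sc.cancelJobGroup("save-job")

Is this the intended pattern, or is there a more direct way to get a
handle on save jobs?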

Thank you,
Antonin




Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Antonin Delpeuch (lists)
Hi Juan,

Of course! My prototype is here:
https://github.com/OpenRefine/OpenRefine/tree/spark-prototype

I suspect it can be quite hard for you to jump into the code at this
stage of the project, but here are some concise pointers:

The or-spark module contains the Spark-based implementation of our data
model. The tasks themselves are generated by the application code
(in the "main" module).

You can try the prototype as a user (clone the repo, check out the
branch, and run ./refine). If you import a small CSV file via the Clipboard
pane, you can then run a few operations on it and observe the tasks in
Spark's web UI.

I would be happy to give you any additional pointers (perhaps off-list?)
if you want to have a close look.

One general question I have for the list is: do you have a good way to
inspect and optimize the serialization of tasks?
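
The only trick I know of is to serialize the closure by hand with plain
Java serialization and look at the byte count, as a rough proxy for what
Spark's JavaSerializer ships with each task. A sketch:

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    // Serialize an object with plain Java serialization and return the
    // payload size in bytes.
    def serializedSize(obj: AnyRef): Int = {
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(obj)
      out.close()
      bytes.size()
    }

    // Example: this closure accidentally captures a large map, which
    // shows up directly in its serialized size.
    val lookup = (1 to 100000).map(i => i -> i.toString).toMap
    val f = (x: Int) => lookup.getOrElse(x, "")
    println(s"closure size: ${serializedSize(f)} bytes")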

Thank you so much for all your help so far!
Antonin


On 04/07/2020 19:19, Juan Martín Guillén wrote:
> Would you be able to send the code you are running?
> It would be great if you could include some sample data.
> Is that possible?
> 
> 
> On Saturday, July 4, 2020 at 13:09:23 ART, Antonin Delpeuch (lists)
> wrote:
> 
> 
> Hi Stephen and Juan,
> 
> Thanks to both of you for your replies - you are right, I used the wrong
> terminology! Local mode is what fits our needs best (and what I have
> been benchmarking so far).
> 
> That being said, the problems I mention still apply in this
> context. There is still a serialization overhead, observable in the
> web UI, and it is really noticeable as a user.
> 
> For instance, to display the paginated grid in the tool's UI, I need to
> run a simple job (filterByRange), and Spark's own overheads account for
> about half of the overall execution time.
> 
> Intuitively, when running in local mode there should not be any need for
> serializing tasks to pass them between threads, so that is what I am
> trying to eliminate.
> 
> Regards,
> Antonin
> 
> On 04/07/2020 17:49, Juan Martín Guillén wrote:
>> Hi Antonin.
>>
>> It seems you are confusing Standalone with Local mode. They are two
>> different modes.
>>
>> From the Spark in Action book: "In local mode, there is only one
>> executor in the same client JVM as the driver, but this executor can
>> spawn several threads to run tasks. In local mode, Spark uses your
>> client process as the single executor in the cluster, and the number
>> of threads specified determines how many tasks can be executed in
>> parallel."
>>
>> I am pretty sure this is the mode your use case is best suited to.
>>
>> What you are referring to, I think, is running a Standalone cluster
>> locally, something that does not make much sense resource-wise and
>> should be considered only for testing purposes.
>>
>> Running Spark in Local mode is totally fine and supported for
>> non-cluster (local) environments.
>>
>> Here are the options you have for connecting your Spark application:
>>
>> https://spark.apache.org/docs/latest/submitting-applications.html#master-urls
>>
>> Regards,
>> Juan Martín.
>>
>>
>>
>>
>> On Saturday, July 4, 2020 at 12:17:01 ART, Antonin Delpeuch (lists)
>> wrote:
>>
>>
>> Hi,
>>
>> I am working on revamping the architecture of OpenRefine, an ETL tool,
>> to execute workflows on datasets which do not fit in RAM.
>>
>> Spark's RDD API is a great fit for the tool's operations, and provides
>> everything we need: partitioning and lazy evaluation.
>>
>> However, OpenRefine is a lightweight tool that runs locally, on the
>> users' machine, and we want to preserve this use case. Running Spark in
>> standalone mode works, but I have read in a couple of places that the
>> standalone mode is only intended for development and testing. This is
>> confirmed by my experience with it so far:
>> - the overhead added by task serialization and scheduling is significant
>> even in standalone mode. This makes sense for testing, since you want to
>> test serialization as well, but to run Spark in production locally, we
>> would need to bypass serialization, which is not possible as far as I know;
>> - some bugs that manifest themselves only in local mode are not getting
>> a lot of attention (https://issues.apache.org/jira/browse/SPARK-5300) so
>> it seems dangerous to base a production system on standalone Spark.
>>
>> So, we cannot use Spark as the default runner in the tool. Do you know of
>> any alternative designed for local use? A library which would provide
>> something similar to the RDD API, but for parallelization with threads
>> in the same JVM, not machines in a cluster?

Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Antonin Delpeuch (lists)
Hi Stephen and Juan,

Thanks to both of you for your replies - you are right, I used the wrong
terminology! Local mode is what fits our needs best (and what I have
been benchmarking so far).

That being said, the problems I mention still apply in this
context. There is still a serialization overhead, observable in the
web UI, and it is really noticeable as a user.

For instance, to display the paginated grid in the tool's UI, I need to
run a simple job (filterByRange), and Spark's own overheads account for
about half of the overall execution time.
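
For reference, the job is essentially the following (a sketch; the key
type, sample data and page bounds are illustrative, and `sc` is an
existing SparkContext):

    import org.apache.spark.rdd.RDD

    // The grid is a pair RDD keyed by row index and sorted by key, so
    // filterByRange (from OrderedRDDFunctions) can skip partitions that
    // lie entirely outside the requested page.
    val grid: RDD[(Long, String)] =
      sc.parallelize((0L until 100000L).map(i => (i, s"row $i"))).sortByKey()

    def fetchPage(start: Long, size: Long): Array[(Long, String)] =
      grid.filterByRange(start, start + size - 1).collect()

    val page = fetchPage(2000L, 50L)  // rows 2000 to 2049 inclusive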

Intuitively, when running in local mode there should not be any need for
serializing tasks to pass them between threads, so that is what I am
trying to eliminate.

Regards,
Antonin

On 04/07/2020 17:49, Juan Martín Guillén wrote:
> Hi Antonin.
> 
> It seems you are confusing Standalone with Local mode. They are two
> different modes.
> 
> From the Spark in Action book: "In local mode, there is only one
> executor in the same client JVM as the driver, but this executor can
> spawn several threads to run tasks. In local mode, Spark uses your
> client process as the single executor in the cluster, and the number
> of threads specified determines how many tasks can be executed in
> parallel."
> 
> I am pretty sure this is the mode your use case is best suited to.
> 
> What you are referring to, I think, is running a Standalone cluster
> locally, something that does not make much sense resource-wise and
> should be considered only for testing purposes.
> 
> Running Spark in Local mode is totally fine and supported for
> non-cluster (local) environments.
> 
> Here are the options you have for connecting your Spark application:
> https://spark.apache.org/docs/latest/submitting-applications.html#master-urls
> 
> Regards,
> Juan Martín.
> 
> 
> 
> 
> On Saturday, July 4, 2020 at 12:17:01 ART, Antonin Delpeuch (lists)
> wrote:
> 
> 
> Hi,
> 
> I am working on revamping the architecture of OpenRefine, an ETL tool,
> to execute workflows on datasets which do not fit in RAM.
> 
> Spark's RDD API is a great fit for the tool's operations, and provides
> everything we need: partitioning and lazy evaluation.
> 
> However, OpenRefine is a lightweight tool that runs locally, on the
> users' machine, and we want to preserve this use case. Running Spark in
> standalone mode works, but I have read in a couple of places that the
> standalone mode is only intended for development and testing. This is
> confirmed by my experience with it so far:
> - the overhead added by task serialization and scheduling is significant
> even in standalone mode. This makes sense for testing, since you want to
> test serialization as well, but to run Spark in production locally, we
> would need to bypass serialization, which is not possible as far as I know;
> - some bugs that manifest themselves only in local mode are not getting
> a lot of attention (https://issues.apache.org/jira/browse/SPARK-5300) so
> it seems dangerous to base a production system on standalone Spark.
> 
> So, we cannot use Spark as the default runner in the tool. Do you know of
> any alternative designed for local use? A library which would provide
> something similar to the RDD API, but for parallelization with threads
> in the same JVM, not machines in a cluster?
> 
> If there is no such thing, it should not be too hard to write our
> homegrown implementation, which would basically be Java streams with
> partitioning. I have looked at Apache Beam's direct runner, but it is
> also designed for testing, so it does not fit the bill for the same reasons.
> 
> We plan to offer a Spark-based runner in any case - but I do not think
> it can be used as the default runner.
> 
> Cheers,
> Antonin





RDD-like API for entirely local workflows?

2020-07-04 Thread Antonin Delpeuch (lists)
Hi,

I am working on revamping the architecture of OpenRefine, an ETL tool,
to execute workflows on datasets which do not fit in RAM.

Spark's RDD API is a great fit for the tool's operations, and provides
everything we need: partitioning and lazy evaluation.

However, OpenRefine is a lightweight tool that runs locally, on the
users' machine, and we want to preserve this use case. Running Spark in
standalone mode works, but I have read in a couple of places that the
standalone mode is only intended for development and testing. This is
confirmed by my experience with it so far:
- the overhead added by task serialization and scheduling is significant
even in standalone mode. This makes sense for testing, since you want to
test serialization as well, but to run Spark in production locally, we
would need to bypass serialization, which is not possible as far as I know;
- some bugs that manifest themselves only in local mode are not getting
a lot of attention (https://issues.apache.org/jira/browse/SPARK-5300) so
it seems dangerous to base a production system on standalone Spark.

So, we cannot use Spark as the default runner in the tool. Do you know of
any alternative designed for local use? A library which would provide
something similar to the RDD API, but for parallelization with threads
in the same JVM, not machines in a cluster?

If there is no such thing, it should not be too hard to write our
homegrown implementation, which would basically be Java streams with
partitioning. I have looked at Apache Beam's direct runner, but it is
also designed for testing, so it does not fit the bill for the same reasons.
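
To make that concrete, here is a very rough sketch of what such a
homegrown partitioned collection could look like (illustrative only;
Scala collections stand in for Java streams):

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    // Each partition is an iterator factory; map and filter compose
    // lazily; actions evaluate partitions in parallel on a thread pool.
    class LocalRDD[T](partitions: Seq[() => Iterator[T]]) {
      def map[U](f: T => U): LocalRDD[U] =
        new LocalRDD(partitions.map(p => () => p().map(f)))

      def filter(pred: T => Boolean): LocalRDD[T] =
        new LocalRDD(partitions.map(p => () => p().filter(pred)))

      def collect()(implicit ec: ExecutionContext): Seq[T] = {
        val futures = partitions.map(p => Future(p().toVector))
        Await.result(Future.sequence(futures), Duration.Inf).flatten
      }
    }

    object LocalRDD {
      def fromSeq[T](data: Seq[T], numPartitions: Int): LocalRDD[T] = {
        val chunk = math.max(1, (data.size + numPartitions - 1) / numPartitions)
        new LocalRDD(data.grouped(chunk).toSeq.map(c => () => c.iterator))
      }
    }

    // Usage: no serialization anywhere, just threads in one JVM.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))
    val result = LocalRDD.fromSeq(1 to 1000, numPartitions = 4)
      .map(_ * 2)
      .filter(_ % 3 == 0)
      .collect()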

We plan to offer a Spark-based runner in any case - but I do not think
it can be used as the default runner.

Cheers,
Antonin




