Unsubscribe

2023-08-01 Thread Alex Landa
Unsubscribe


Re: Monitor Spark Applications

2019-09-15 Thread Alex Landa
Hi Raman,

The banzaicloud jar can also cover the JMX exports.
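For the fixed-port question below, a rough sketch of how one could expose JMX on a fixed port for the driver (the port number and the disabled auth/SSL flags are illustrative assumptions; pinning executor ports is trickier because several executors can land on the same host):

--conf "spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.rmi.port=9010 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"

Spark's own metrics can also be exposed over JMX by enabling the JMX sink in metrics.properties (*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink).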

Thanks,
Alex

On Fri, Sep 13, 2019 at 8:46 AM raman gugnani 
wrote:

> Hi Alex,
>
> Thanks, will check this out.
>
> Can it be done directly, since Spark also exposes the metrics over JMX? My
> one doubt here is how to assign fixed JMX ports to the driver and executors.
>
> @Alex,
> Is there any difference between fetching the data via JMX and using the
> banzaicloud jar?
>
>
> On Fri, 13 Sep 2019 at 10:47, Alex Landa  wrote:
>
>> Hi,
>> We are starting to use https://github.com/banzaicloud/spark-metrics .
>> Keep in mind that their solution targets Spark on K8s; to make it work
>> for Spark on YARN you have to copy the spark-metrics dependencies into
>> the Spark jars folder on all the Spark machines (it took me a while to
>> figure that out).
>>
>> Thanks,
>> Alex
>>
>> On Fri, Sep 13, 2019 at 7:58 AM raman gugnani 
>> wrote:
>>
>>> Hi Team,
>>>
>>> I am new to Spark. I am using Spark on Hortonworks Data Platform with
>>> Amazon EC2 machines. I am running Spark in cluster mode with YARN.
>>>
>>> I need to monitor the individual JVMs and other Spark metrics with
>>> Prometheus.
>>>
>>> Can anyone suggest a solution for this?
>>>
>>> --
>>> Raman Gugnani
>>>
>>
>
> --
> Raman Gugnani
>


Re: Monitor Spark Applications

2019-09-12 Thread Alex Landa
Hi,
We are starting to use https://github.com/banzaicloud/spark-metrics .
Keep in mind that their solution targets Spark on K8s; to make it work for
Spark on YARN you have to copy the spark-metrics dependencies into the Spark
jars folder on all the Spark machines (it took me a while to figure that out).
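
For reference, a minimal sketch of the metrics configuration side of this
(the sink class and property names below are recalled from the spark-metrics
README, so treat them as assumptions and check them against the version you
deploy; the Pushgateway address is a placeholder):

# conf/metrics.properties on the driver and executors
*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
*.sink.prometheus.pushgateway-address-protocol=http
*.sink.prometheus.pushgateway-address=<pushgateway-host>:9091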

Thanks,
Alex

On Fri, Sep 13, 2019 at 7:58 AM raman gugnani 
wrote:

> Hi Team,
>
> I am new to Spark. I am using Spark on Hortonworks Data Platform with
> Amazon EC2 machines. I am running Spark in cluster mode with YARN.
>
> I need to monitor the individual JVMs and other Spark metrics with
> Prometheus.
>
> Can anyone suggest a solution for this?
>
> --
> Raman Gugnani
>


Re: How to combine all rows into a single row in DataFrame

2019-08-19 Thread Alex Landa
Hi,

This sounds similar to what we do in our application.
We don't serialize every row; instead we first group the rows into the
desired representation and then apply the protobuf serialization using map
and a lambda.
I suggest not serializing the entire DataFrame into a single protobuf
message, since that may cause OOM errors.
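
A minimal sketch of what that grouping-then-serializing can look like (the
key/payload columns and the serializeGroup helper are hypothetical stand-ins
for your schema and your generated protobuf builders):

import org.apache.spark.sql.SparkSession

object GroupedSerializeSketch {

  // Hypothetical stand-in for a generated protobuf builder; in practice this
  // would build one message for the whole group and call toByteArray on it.
  def serializeGroup(key: String, payloads: Seq[String]): Array[Byte] =
    (key + ":" + payloads.mkString("|")).getBytes("UTF-8")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupedSerialize")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("u1", "a"), ("u1", "b"), ("u2", "c")).toDF("userId", "payload")

    // Group the rows into the wanted representation first, then serialize each
    // group in a lambda: one message per group rather than one per row, and
    // not one message for the whole DataFrame (which risks OOM).
    val serialized = df.as[(String, String)]
      .groupByKey(_._1)
      .mapGroups { (key, rows) => (key, serializeGroup(key, rows.map(_._2).toList)) }

    serialized.collect().foreach { case (key, bytes) =>
      println(s"$key -> ${bytes.length} bytes")
    }
    spark.stop()
  }
}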

Thanks,
Alex

On Mon, Aug 19, 2019 at 11:24 PM Rishikesh Gawade 
wrote:

> Hi All,
> I have been trying to serialize a DataFrame in protobuf format. So far, I
> have been able to serialize every row of the DataFrame by using the map
> function with the serialization logic inside it (within the lambda
> function). The resulting DataFrame consists of rows in serialized format
> (1 row = 1 serialized message).
> I wish to form a single protobuf-serialized message for this DataFrame, and
> in order to do that I need to combine all the serialized rows using some
> custom logic, very similar to the one used in the map operation.
> I am assuming that this would be possible by using the reduce operation on
> the DataFrame; however, I am unaware of how to go about it.
> Any suggestions/approaches would be much appreciated.
>
> Thanks,
> Rishikesh
>


Re: Spark Standalone - Failing to pass extra java options to the driver in cluster mode

2019-08-19 Thread Alex Landa
Thanks, Jungtaek Lim,
I upgraded the cluster to 2.4.3 and it worked fine.

Thanks,
Alex

On Mon, Aug 19, 2019 at 10:01 PM Jungtaek Lim  wrote:

> Hi Alex,
>
> You seem to have hit SPARK-26606 [1], which has been fixed in 2.4.1. Could
> you try it out with the latest version?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-26606
>
> On Tue, Aug 20, 2019 at 3:43 AM Alex Landa  wrote:
>
>> Hi,
>>
>> We are using Spark Standalone 2.4.0 in production and deploying our
>> Scala app in cluster mode.
>> I noticed that extra Java options passed for the driver don't actually
>> reach the driver JVM.
>> A submit example:
>> spark-submit --deploy-mode cluster --master spark://:7077 \
>>   --driver-memory 512mb \
>>   --conf "spark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError" \
>>   --class App app.jar
>>
>> It doesn't pass -XX:+HeapDumpOnOutOfMemoryError as a JVM argument, but
>> instead passes
>> -Dspark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError. I
>> created a test app for it:
>>
import java.lang.management.ManagementFactory

import scala.collection.JavaConverters._

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("testApp").getOrCreate()

// get a RuntimeMXBean reference for the running JVM
val runtimeMxBean = ManagementFactory.getRuntimeMXBean

// get the JVM's input arguments as a list of strings
val listOfArguments = runtimeMxBean.getInputArguments

// print the arguments
listOfArguments.asScala.foreach(a => println(s"ARG: $a"))
>>
>>
>> I see that in client mode I get:
>> ARG: -XX:+HeapDumpOnOutOfMemoryError
>> while in cluster mode I get:
>> ARG: -Dspark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError
>>
>> I would appreciate your help in working around this issue.
>> Thanks,
>> Alex
>>
>>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>


Spark Standalone - Failing to pass extra java options to the driver in cluster mode

2019-08-19 Thread Alex Landa
Hi,

We are using Spark Standalone 2.4.0 in production and deploying our Scala
app in cluster mode.
I noticed that extra Java options passed for the driver don't actually reach
the driver JVM.
A submit example:
spark-submit --deploy-mode cluster --master spark://:7077 \
  --driver-memory 512mb \
  --conf "spark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError" \
  --class App app.jar

It doesn't pass -XX:+HeapDumpOnOutOfMemoryError as a JVM argument, but instead
passes
-Dspark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError. I created
a test app for it:

import java.lang.management.ManagementFactory

import scala.collection.JavaConverters._

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("testApp").getOrCreate()

// get a RuntimeMXBean reference for the running JVM
val runtimeMxBean = ManagementFactory.getRuntimeMXBean

// get the JVM's input arguments as a list of strings
val listOfArguments = runtimeMxBean.getInputArguments

// print the arguments
listOfArguments.asScala.foreach(a => println(s"ARG: $a"))


I see that in client mode I get:
ARG: -XX:+HeapDumpOnOutOfMemoryError
while in cluster mode I get:
ARG: -Dspark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError

I would appreciate your help in working around this issue.
Thanks,
Alex


Re: Long-Running Spark application doesn't clean old shuffle data correctly

2019-07-23 Thread Alex Landa
Hi Keith,

I don't think that we keep such references.
But we do experience exceptions during the job execution that we catch and
retry (timeouts/network issues from different data sources).
Could these affect the RDD cleanup?

Thanks,
Alex

On Sun, Jul 21, 2019 at 10:49 PM Keith Chapman 
wrote:

> Hi Alex,
>
> Shuffle files in Spark are deleted when the object holding a reference to
> the shuffle file on disk goes out of scope (is garbage collected by the
> JVM). Could it be the case that you are keeping these objects alive?
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
>
> On Sun, Jul 21, 2019 at 12:19 AM Alex Landa  wrote:
>
>> Thanks,
>> I looked into these options; the cleaner periodic interval is set to 30
>> minutes by default.
>> The blocking option for shuffle -
>> spark.cleaner.referenceTracking.blocking.shuffle - is set to false by
>> default.
>> What are the implications of setting it to true?
>> Will it make the driver slower?
>>
>> Thanks,
>> Alex
>>
>> On Sun, Jul 21, 2019 at 9:06 AM Prathmesh Ranaut Gmail <
>> prathmesh.ran...@gmail.com> wrote:
>>
>>> This is the job of the ContextCleaner. There are a few properties that
>>> you can tweak to see if that helps:
>>> spark.cleaner.periodicGC.interval
>>> spark.cleaner.referenceTracking
>>> spark.cleaner.referenceTracking.blocking.shuffle
>>>
>>> Regards
>>> Prathmesh Ranaut
>>>
>>> On Jul 21, 2019, at 11:31 AM, Alex Landa  wrote:
>>>
>>> Hi,
>>>
>>> We are running a long-running Spark application (which executes lots of
>>> quick jobs using our scheduler) on a Spark standalone cluster, version 2.4.0.
>>> We see that old shuffle files (a week old, for example) are not deleted
>>> during the execution of the application, which leads to out-of-disk-space
>>> errors on the executors.
>>> If we re-deploy the application, the Spark cluster takes care of the
>>> cleaning and deletes the old shuffle data (since we have
>>> -Dspark.worker.cleanup.enabled=true in the worker config).
>>> I don't want to re-deploy our app every week or two, but rather to be able
>>> to configure Spark to clean old shuffle data (as it should).
>>>
>>> How can I configure Spark to delete old shuffle data during the lifetime
>>> of the application (not only after it ends)?
>>>
>>>
>>> Thanks,
>>> Alex
>>>
>>>


Re: Long-Running Spark application doesn't clean old shuffle data correctly

2019-07-21 Thread Alex Landa
Thanks,
I looked into these options; the cleaner periodic interval is set to 30
minutes by default.
The blocking option for shuffle -
spark.cleaner.referenceTracking.blocking.shuffle - is set to false by
default.
What are the implications of setting it to true?
Will it make the driver slower?
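
For reference, a sketch of how these settings could be passed at submit time
(the values are only illustrative, not recommendations):

spark-submit ... \
  --conf spark.cleaner.periodicGC.interval=15min \
  --conf spark.cleaner.referenceTracking.blocking.shuffle=true \
  ...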

Thanks,
Alex

On Sun, Jul 21, 2019 at 9:06 AM Prathmesh Ranaut Gmail <
prathmesh.ran...@gmail.com> wrote:

> This is the job of the ContextCleaner. There are a few properties that
> you can tweak to see if that helps:
> spark.cleaner.periodicGC.interval
> spark.cleaner.referenceTracking
> spark.cleaner.referenceTracking.blocking.shuffle
>
> Regards
> Prathmesh Ranaut
>
> On Jul 21, 2019, at 11:31 AM, Alex Landa  wrote:
>
> Hi,
>
> We are running a long-running Spark application (which executes lots of
> quick jobs using our scheduler) on a Spark standalone cluster, version 2.4.0.
> We see that old shuffle files (a week old, for example) are not deleted
> during the execution of the application, which leads to out-of-disk-space
> errors on the executors.
> If we re-deploy the application, the Spark cluster takes care of the
> cleaning and deletes the old shuffle data (since we have
> -Dspark.worker.cleanup.enabled=true in the worker config).
> I don't want to re-deploy our app every week or two, but rather to be able
> to configure Spark to clean old shuffle data (as it should).
>
> How can I configure Spark to delete old shuffle data during the lifetime
> of the application (not only after it ends)?
>
>
> Thanks,
> Alex
>
>


Long-Running Spark application doesn't clean old shuffle data correctly

2019-07-21 Thread Alex Landa
Hi,

We are running a long-running Spark application (which executes lots of
quick jobs using our scheduler) on a Spark standalone cluster, version 2.4.0.
We see that old shuffle files (a week old, for example) are not deleted
during the execution of the application, which leads to out-of-disk-space
errors on the executors.
If we re-deploy the application, the Spark cluster takes care of the cleaning
and deletes the old shuffle data (since we have
-Dspark.worker.cleanup.enabled=true in the worker config).
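For context, a sketch of the standalone worker cleanup settings referred to
above (this cleanup only purges directories of applications that have already
stopped, which is why it only kicks in after a re-deploy; the interval and
TTL values here are illustrative):

# spark-env.sh on each worker
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=1800 \
  -Dspark.worker.cleanup.appDataTtl=604800"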
I don't want to re-deploy our app every week or two, but rather to be able to
configure Spark to clean old shuffle data (as it should).

How can I configure Spark to delete old shuffle data during the lifetime of
the application (not only after it ends)?


Thanks,
Alex