Re: [SPARK on MESOS] Avoid re-fetching Spark binary

2018-07-06 Thread Mark Hamstra
The latency to start a Spark Job is nowhere close to 2-4 seconds under
typical conditions. You appear to be creating a new Spark Application
every time instead of running multiple Jobs in one Application.
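
For reference, a minimal sketch of the "multiple Jobs in one Application" pattern described above: one long-lived SparkSession submitting several actions, so only the first submission pays the application start-up cost. The names and toy actions below are illustrative, not from the thread.

import org.apache.spark.sql.SparkSession

object LongLivedApp {
  def main(args: Array[String]): Unit = {
    // One application: executors are acquired once and reused across jobs.
    val spark = SparkSession.builder()
      .appName("long-lived-app")
      .getOrCreate()

    // Each action below is a separate Spark *job* inside the same application,
    // so none of them pays the application start-up latency again.
    val df = spark.range(0, 1000000)
    println(df.count())                       // job 1
    println(df.filter("id % 2 = 0").count())  // job 2

    spark.stop()
  }
}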

On Fri, Jul 6, 2018 at 3:12 AM Tien Dat  wrote:

> Dear Timothy,
>
> It works like a charm now.
>
> BTW (don't judge me if I am too greedy :-)), the latency to start a Spark
> job is around 2-4 seconds, unless there is some awesome optimization on
> Spark that I am not aware of. Do you know if the Spark community is working
> on reducing this latency?
>
> Best
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Unable to see the table created using saveAsTable From Beeline. Please help!

2018-07-06 Thread anna stax
I am running Spark 2.1.0 on AWS EMR.

In my Zeppelin note I am creating a table:

df.write
  .format("parquet")
  .saveAsTable("default.1test")

and I see the table when I run

spark.catalog.listTables().show()
+-----+--------+-----------+---------+-----------+
| name|database|description|tableType|isTemporary|
+-----+--------+-----------+---------+-----------+
|1test| default|       null|  MANAGED|      false|
+-----+--------+-----------+---------+-----------+


From the Beeline client, I don’t see the table:

0: jdbc:hive2://localhost:10001/> show tables;
+-----------+------------+--------------+
| database  | tableName  | isTemporary  |
+-----------+------------+--------------+
+-----------+------------+--------------+
No rows selected (0.115 seconds)
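
One thing worth checking (a sketch of a possible cause, not a confirmed diagnosis): the session doing the saveAsTable must have Hive support enabled and point at the same metastore that the Thrift server behind Beeline reads; otherwise the table only exists in Spark's default local catalog. The metastore URI and table name below are assumptions.

import org.apache.spark.sql.SparkSession

// Sketch only -- the metastore URI is an assumption; use whatever your
// cluster's hive-site.xml specifies so Spark and HiveServer2 share a catalog.
val spark = SparkSession.builder()
  .appName("write-to-shared-metastore")
  .config("hive.metastore.uris", "thrift://<metastore-host>:9083")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.range(0, 10).toDF("id")

// Table names that start with a digit usually need backtick quoting in SQL
// clients, so a letter-first name is used here.
df.write.format("parquet").saveAsTable("default.test1")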


Re: [SPARK on MESOS] Avoid re-fetching Spark binary

2018-07-06 Thread Timothy Chen
I know there are some community efforts shown at Spark Summits before,
mostly around reusing the same Spark context for multiple “jobs”.

As far as I know, reducing Spark job startup time is not a community priority.

Tim
On Fri, Jul 6, 2018 at 7:12 PM Tien Dat  wrote:

> Dear Timothy,
>
> It works like a charm now.
>
> BTW (don't judge me if I am too greedy :-)), the latency to start a Spark
> job is around 2-4 seconds, unless there is some awesome optimization on
> Spark that I am not aware of. Do you know if the Spark community is working
> on reducing this latency?
>
> Best
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Retry option and range resource configuration for Spark job on Mesos

2018-07-06 Thread Timothy Chen
Hi Tien,

There is no retry at the job level, as we expect the user to retry; and, as
you mention, task retries are already tolerated.

There is no request/limit-style resource configuration like the one you
described in Mesos (yet).

So for 2) that’s not possible at the moment.
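
Since the retry has to live on the user side, here is a minimal driver-side retry sketch; the helper name, attempt count, and the commented usage are illustrative assumptions, not an existing Spark API.

import scala.util.{Failure, Success, Try}

// A minimal driver-side retry sketch: Spark/Mesos retries individual tasks,
// not whole jobs, so whole-job retries have to live in the application.
def withRetries[T](maxAttempts: Int)(job: => T): T = {
  var attempt = 0
  var result: Option[T] = None
  var lastError: Throwable = null
  while (result.isEmpty && attempt < maxAttempts) {
    attempt += 1
    Try(job) match {
      case Success(r) => result = Some(r)
      case Failure(e) =>
        lastError = e
        println(s"Attempt $attempt of $maxAttempts failed: ${e.getMessage}")
    }
  }
  result.getOrElse(throw lastError)
}

// Usage (illustrative):
// val count = withRetries(3) { spark.read.parquet("/data/input").count() }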

Tim


On Fri, Jul 6, 2018 at 11:42 PM Tien Dat  wrote:

> Dear all,
>
> We are running Spark with Mesos as the resource manager. We are interested
> in some aspects, such as:
>
> 1, Is it possible to configure a specific job with a maximum number of
> retries?
> What I mean here is retry at the job level, NOT /spark.task.maxFailures/,
> which is for tasks within a job.
>
> 2, Is it possible to set a job with a range of resources, such as: at least
> 20 CPU cores, at most 30 CPU cores, and at least 20GB of mem, at most 40GB?
>
> Thank you in advance.
>
> Best
> Tien Dat
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: How to avoid duplicate column names after join with multiple conditions

2018-07-06 Thread Gokula Krishnan D
Nirav,

The withColumnRenamed() API might help, but it does not differentiate between
columns and renames all occurrences of the given column. Alternatively, use the
select() API and rename the columns as you want.
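
As an example of the select() route, a small self-contained sketch against toy stand-ins for df1 and df2 (the column names a, b, c are taken from the question; everything else is illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("dedup-join-columns").getOrCreate()
import spark.implicits._

// Toy frames standing in for the poster's df1/df2 (columns a, b and a, c).
val df1 = Seq((1, "x"), (2, "y")).toDF("a", "b")
val df2 = Seq((1, "x"), (2, "z")).toDF("a", "c")

// Alias both sides so the join columns stay addressable, then select/rename
// explicitly so only one copy of "a" survives in the result.
val joined = df1.alias("l")
  .join(df2.alias("r"), col("l.a") === col("r.a") && col("l.b") === col("r.c"))
  .select(col("l.a").as("a"), col("l.b").as("b"), col("r.c").as("c"))

joined.show()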



Thanks & Regards,
Gokula Krishnan (Gokul)

On Mon, Jul 2, 2018 at 5:52 PM, Nirav Patel  wrote:

> The expr is `df1(a) === df2(a) and df1(b) === df2(c)`.
>
> How to avoid the duplicate column 'a' in the result? I don't see any API
> that combines both. Rename manually?
>
>
>


Retry option and range resource configuration for Spark job on Mesos

2018-07-06 Thread Tien Dat
Dear all,

We are running Spark with Mesos as the resource manager. We are interested
in some aspects, such as:

1, Is it possible to configure a specific job with a maximum number of
retries?
What I mean here is retry at the job level, NOT /spark.task.maxFailures/,
which is for tasks within a job.

2, Is it possible to set a job with a range of resources, such as: at least
20 CPU cores, at most 30 CPU cores, and at least 20GB of mem, at most 40GB?

Thank you in advance.

Best 
Tien Dat



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark 2.3 Kubernetes error

2018-07-06 Thread purna pradeep
> Hello,
>
>
>
> When I try to set the below options on the spark-submit command with a k8s
> master, I get the below error in the spark-driver pod logs:
>
>
>
> --conf spark.executor.extraJavaOptions=" -Dhttps.proxyHost=myhost
> -Dhttps.proxyPort=8099 -Dhttp.useproxy=true -Dhttps.protocols=TLSv1.2" \
>
> --conf spark.driver.extraJavaOptions="--Dhttps.proxyHost=myhost
> -Dhttps.proxyPort=8099 -Dhttp.useproxy=true -Dhttps.protocols=TLSv1.2" \
>
>
>
> But when I tried to set these extraJavaOptions as system properties in the
> Spark application jar, everything worked fine.
>
>
>
> 2018-06-11 21:26:28 ERROR SparkContext:91 - Error initializing
> SparkContext.
>
> org.apache.spark.SparkException: External scheduler cannot be instantiated
>
> at
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2747)
>
> at
> org.apache.spark.SparkContext.<init>(SparkContext.scala:492)
>
> at
> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486)
>
> at
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
>
> at
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
>
> at scala.Option.getOrElse(Option.scala:121)
>
> at
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
>
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException:
> Operation: [get]  for kind: [Pod]  with name:
> [test-657e2f715ada3f91ae32c588aa178f63-driver]  in namespace: [test]
> failed.
>
> at
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
>
> at
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71)
>
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:228)
>
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:184)
>
> at
> org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.<init>(KubernetesClusterSchedulerBackend.scala:70)
>
> at
> org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:120)
>
> at
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>
> ... 12 more
>
> Caused by: javax.net.ssl.SSLHandshakeException:
> sun.security.validator.ValidatorException: PKIX path building failed:
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find
> valid certification path to requested target
>
> at
> sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
>
> at
> sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1959)
>
> at
> sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
>
> at
> sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
>
> at
> sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514)
>
> at
> sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
>
> at
> sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
>
> at
> sun.security.ssl.Handshaker.process_record(Handshaker.java:961)
>
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1072)
>
> at
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1385)
>
> at
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1413)
>
> at
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1397)
>
> at
> okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:281)
>
> at
> okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:251)
>
> at
> okhttp3.internal.connection.RealConnection.connect(RealConnection.java:151)
>
> at
> okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:195)
>
> at
> okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
>
> at
> okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
>
> at
> okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
>
> at
> okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>
> at
> okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>
> at
> okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
>
> at
> 
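
For reference, a minimal sketch of the workaround described above (setting the proxy properties programmatically before the SparkContext is created). The host and port come from the original post; the object name and the property subset are assumptions.

import org.apache.spark.sql.SparkSession

object ProxyAwareApp {
  def main(args: Array[String]): Unit = {
    // Set the JVM proxy/TLS properties before the SparkContext is created,
    // which is what the poster reports as working.
    System.setProperty("https.proxyHost", "myhost")
    System.setProperty("https.proxyPort", "8099")
    System.setProperty("https.protocols", "TLSv1.2")

    val spark = SparkSession.builder().appName("proxy-aware-app").getOrCreate()
    // ... application logic ...
    spark.stop()
  }
}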

Re: [SPARK on MESOS] Avoid re-fetching Spark binary

2018-07-06 Thread Tien Dat
Dear Timothy,

It works like a charm now.

BTW (don't judge me if I am too greedy :-)), the latency to start a Spark job
is around 2-4 seconds, unless there is some awesome optimization on Spark that
I am not aware of. Do you know if the Spark community is working on reducing
this latency?

Best



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [SPARK on MESOS] Avoid re-fetching Spark binary

2018-07-06 Thread Timothy Chen
Got it; then you can have an extracted Spark directory on each host at the
same location, and don’t specify SPARK_EXECUTOR_URI. Instead, set
spark.mesos.executor.home to that directory.

This should effectively do what you want: it avoids the fetching and
extracting and just executes the command.
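
A minimal sketch of that setting when building the session; the /opt/spark path is an assumption and stands for wherever the pre-extracted distribution lives on every agent.

import org.apache.spark.sql.SparkSession

// Sketch: point executors at a Spark distribution that is already extracted
// at the same path on every Mesos agent, instead of letting Mesos fetch and
// extract a tarball for each job. /opt/spark is an assumed path; do not set
// SPARK_EXECUTOR_URI / spark.executor.uri at the same time.
val spark = SparkSession.builder()
  .appName("no-fetch-on-mesos")
  .config("spark.mesos.executor.home", "/opt/spark")
  .getOrCreate()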

Tim
On Fri, Jul 6, 2018 at 5:57 PM Tien Dat  wrote:

> Thank you for your answer.
>
> The thing is, I actually pointed to a local binary file. Mesos locally
> copied the binary file to a specific folder in /var/lib/mesos/... and
> extracted it every time it launched a Spark executor. With the fetcher
> cache, the copy time is reduced, but the reduction is not much since the
> file is stored locally anyway.
> The step that takes the most time is the extraction.
> Finally, since Mesos makes a new folder for extracting the Spark binary each
> time a new Spark job runs, the disk usage increases gradually.
>
> Therefore, our expectation is to have Spark running on Mesos without this
> binary extraction, as well as without storing the same binary every time a
> new Spark job runs.
>
> Does that make sense to you? And do you have any idea how to deal with
> this?
>
> Best
>
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: How to branch a Stream / have multiple Sinks / do multiple Queries on one Stream

2018-07-06 Thread Amiya Mishra
Hi Tathagata,

Is there any limitation of the below code when writing to multiple files?

val inputdf: DataFrame = sparkSession.readStream
  .schema(schema)
  .format("csv")
  .option("delimiter", ",")
  .csv("src/main/streamingInput")

val query1 = inputdf.writeStream
  .option("path", "first_output")
  .option("checkpointLocation", "checkpointloc")
  .format("csv")
  .start()

val query2 = inputdf.writeStream
  .option("path", "second_output")
  .option("checkpointLocation", "checkpoint2")
  .format("csv")
  .start()

sparkSession.streams.awaitAnyTermination()


And what will be the release date of Spark 2.4.0?

Thanks
Amiya







--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [SPARK on MESOS] Avoid re-fetching Spark binary

2018-07-06 Thread Tien Dat
Thank you for your answer.

The thing is, I actually pointed to a local binary file. Mesos locally
copied the binary file to a specific folder in /var/lib/mesos/... and
extracted it every time it launched a Spark executor. With the fetcher
cache, the copy time is reduced, but the reduction is not much since the
file is stored locally anyway.
The step that takes the most time is the extraction.
Finally, since Mesos makes a new folder for extracting the Spark binary each
time a new Spark job runs, the disk usage increases gradually.

Therefore, our expectation is to have Spark running on Mesos without this
binary extraction, as well as without storing the same binary every time a new
Spark job runs.

Does that make sense to you? And do you have any idea how to deal with this?

Best





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [SPARK on MESOS] Avoid re-fetching Spark binary

2018-07-06 Thread Timothy Chen
If it’s available locally on each host, then don’t specify a remote URL but
a local file URI instead.

We added a fetcher cache to Mesos a while ago, and I believe there is
integration with it in the Spark framework if you look at the documentation
as well. With the fetcher cache enabled, the Mesos agent will cache the same
remote binary as well.
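
A sketch of the two options above, assuming spark.executor.uri and spark.mesos.fetcherCache.enable are the relevant properties for your Spark version (worth double-checking against the Mesos section of the docs); the values are placeholders.

import org.apache.spark.sql.SparkSession

// Sketch of the two options mentioned above; the values are placeholders.
val spark = SparkSession.builder()
  .appName("mesos-binary-fetch-options")
  // Option A: a file:// URI that already exists on every agent, so nothing is downloaded.
  .config("spark.executor.uri", "file:///opt/spark/spark-2.3.1-bin-hadoop2.7.tgz")
  // Option B: let the Mesos agent cache a remote URI instead of re-fetching it per job.
  .config("spark.mesos.fetcherCache.enable", "true")
  .getOrCreate()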

Tim
On Fri, Jul 6, 2018 at 5:00 PM Tien Dat  wrote:

> Dear all,
>
> We are running Spark with Mesos as the master for resource management.
> In our cluster, there are jobs that require very short response times (near
> real-time applications), usually around 3-5 seconds.
>
> In order for Spark to execute with Mesos, one has to specify the
> SPARK_EXECUTOR_URI configuration, which defines the location from which
> Mesos fetches the Spark binary every time it launches a new job.
> We noticed that the fetching and extraction of the Spark binary repeats
> every time we run, even though the binary is basically the same. More
> importantly, fetching and extracting this file can add 2-3 seconds of
> latency, which is fatal for our near real-time application. Besides, after
> running many Spark jobs, the Spark binary tars accumulate and occupy a
> large amount of disk space.
>
> As a result, we wonder if there is a workaround to avoid this fetching and
> extracting process, given that the Spark binary is available locally at each
> of the Mesos agents?
>
> Please don't hesitate to ask if you need any further information.
> Thank you in advance.
>
> Best regards
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


[SPARK on MESOS] Avoid re-fetching Spark binary

2018-07-06 Thread Tien Dat
Dear all,

We are running Spark with Mesos as the master for resource management.
In our cluster, there are jobs that require very short response times (near
real-time applications), usually around 3-5 seconds.

In order for Spark to execute with Mesos, one has to specify the
SPARK_EXECUTOR_URI configuration, which defines the location from which Mesos
fetches the Spark binary every time it launches a new job.
We noticed that the fetching and extraction of the Spark binary repeats
every time we run, even though the binary is basically the same. More
importantly, fetching and extracting this file can add 2-3 seconds of
latency, which is fatal for our near real-time application. Besides, after
running many Spark jobs, the Spark binary tars accumulate and occupy a
large amount of disk space.

As a result, we wonder if there is a workaround to avoid this fetching and
extracting process, given that the Spark binary is available locally at each
of the Mesos agents?

Please don't hesitate to ask if you need any further information.
Thank you in advance.

Best regards



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


