Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-21 Thread Fawze Abujaber
It's super amazing! I see it was tested on Spark 2.0.0 and above; what
about Spark 1.6, which is still part of Cloudera's main versions?

We have a vast number of Spark applications on version 1.6.0.

On Thu, Mar 22, 2018 at 6:38 AM, Holden Karau  wrote:

> Super exciting! I look forward to digging through it this weekend.
>
> On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) <
> ravishankar.n...@gmail.com> wrote:
>
>> Excellent. You've filled in a missing link.
>>
>> Best,
>> Passion
>>
>> On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia 
>> wrote:
>>
>>> Hi,
>>>
>>> Happy to announce the availability of Sparklens as an open source
>>> project. It helps in understanding the scalability limits of Spark
>>> applications and can be a useful guide on the path towards tuning
>>> applications for lower runtime or cost.
>>>
>>> Please clone from here: https://github.com/qubole/sparklens
>>> Old blog post: https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/
>>>
>>> thanks,
>>> rohitk
>>>
>>> PS: Thanks for the patience. It took a couple of months to get back on
>>> this.
> --
> Twitter: https://twitter.com/holdenkarau


Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-21 Thread Holden Karau
Super exciting! I look forward to digging through it this weekend.

On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) <
ravishankar.n...@gmail.com> wrote:

> Excellent. You've filled in a missing link.
>
> Best,
> Passion
>
> On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia 
> wrote:
>
>> Hi,
>>
>> Happy to announce the availability of Sparklens as an open source
>> project. It helps in understanding the scalability limits of Spark
>> applications and can be a useful guide on the path towards tuning
>> applications for lower runtime or cost.
>>
>> Please clone from here: https://github.com/qubole/sparklens
>> Old blog post: https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/
>>
>> thanks,
>> rohitk
>>
>> PS: Thanks for the patience. It took a couple of months to get back on
>> this.
--
Twitter: https://twitter.com/holdenkarau


Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-21 Thread रविशंकर नायर
Excellent. You've filled in a missing link.

Best,
Passion

On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia  wrote:

> Hi,
>
> Happy to announce the availability of Sparklens as an open source project.
> It helps in understanding the scalability limits of Spark applications and
> can be a useful guide on the path towards tuning applications for lower
> runtime or cost.
>
> Please clone from here: https://github.com/qubole/sparklens
> Old blog post: https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/
>
> thanks,
> rohitk
>
> PS: Thanks for the patience. It took a couple of months to get back on this.


Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-21 Thread Rohit Karlupia
Hi,

Happy to announce the availability of Sparklens as an open source project.
It helps in understanding the scalability limits of Spark applications and
can be a useful guide on the path towards tuning applications for lower
runtime or cost.

Please clone from here: https://github.com/qubole/sparklens
Old blog post: https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/

thanks,
rohitk

PS: Thanks for the patience. It took a couple of months to get back on this.
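
For anyone who wants to try it right away: per the project README (please
verify at the repo above, since the package coordinates and listener class
below are quoted from memory and should be treated as assumptions),
Sparklens attaches to a job as an extra Spark listener at submit time,
roughly:

spark-submit \
  --packages qubole:sparklens:0.1.2-s_2.11 \
  --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
  your_app.py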


Is there a mutable dataframe spark structured streaming 2.3.0?

2018-03-21 Thread kant kodali
Hi All,

Is there a mutable DataFrame in Spark Structured Streaming 2.3.0? I am
currently reading from Kafka, and if I cannot parse the messages that I get
from Kafka I want to write them to, say, some "dead_queue" topic.

I wonder what the best way to do this is.

Thanks!
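
There is no mutable DataFrame; a common way to get the same effect is to
fork the stream into parseable and unparseable rows and route the latter to
a dead-letter topic through the Kafka sink. A minimal sketch, assuming
Spark 2.3; the topic names, bootstrap servers, checkpoint path, and the toy
"three comma-separated fields" parse rule are all hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.appName("dead_letter_sketch").getOrCreate()

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical
       .option("subscribe", "events")                        # hypothetical
       .load()
       .withColumn("value", col("value").cast("string")))

# Toy parse rule: a row is parseable if it splits into three fields;
# getItem() past the end of the split array yields null.
third_field = split(col("value"), ",").getItem(2)
good = raw.filter(third_field.isNotNull())  # continues to the normal sink
bad = raw.filter(third_field.isNull())

# The Kafka sink needs a string/binary "value" column, a target topic,
# and its own checkpoint directory.
dead_letter = (bad.select("value")
               .writeStream.format("kafka")
               .option("kafka.bootstrap.servers", "localhost:9092")
               .option("topic", "dead_queue")
               .option("checkpointLocation", "/tmp/ckpt/dead_queue")  # hypothetical
               .start())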


Re: [Structured Streaming] Application Updates in Production

2018-03-21 Thread Tathagata Das
Why do you want to start the new code in parallel with the old one? Why not
stop the old one, and then start the new one? Structured Streaming ensures
that all checkpoint information (offsets and state) is future-compatible
(as long as the state schema is unchanged), hence the new code should be
able to pick up exactly where the old code left off.

TD
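
A minimal sketch of the restart path described above (the paths, servers,
and topic are hypothetical; the one load-bearing detail is reusing the same
checkpointLocation across deployments):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("upgrade_sketch").getOrCreate()

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical
      .option("subscribe", "events")                        # hypothetical
      .load())

# The checkpoint directory is the contract between old and new versions.
query = (df.writeStream.format("parquet")
         .option("path", "/data/out")                   # hypothetical
         .option("checkpointLocation", "/ckpt/my_app")  # keep this stable
         .start())

# At upgrade time: stop the old query cleanly, deploy the new code, and
# start it against the SAME checkpointLocation; it resumes from the
# committed offsets and state, provided the state schema is unchanged.
query.stop()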

On Wed, Mar 21, 2018 at 11:56 AM, Priyank Shrivastava <
priy...@asperasoft.com> wrote:

> I am using Structured Streaming with Spark 2.2.  We are using Kafka as our
> source and are using checkpoints for failure recovery and e2e exactly-once
> guarantees.  I would like to get some more information on how to handle
> updates to the application when there is a change in stateful operations
> and/or output schema.
>
> As some of the sources suggest, I can start the updated application in
> parallel with the old application until it catches up with the old
> application in terms of data, and then kill the old one.  But then the new
> application will have to re-read/re-process all the data in Kafka, which
> could take a long time.
>
> I want to AVOID this re-processing of the data in the newly deployed
> updated application.
>
> One way I can think of is for the application to keep writing the offsets
> into something in addition to the checkpoint directory, for example into
> ZooKeeper/HDFS.  Then, on an update of the application, I would direct the
> Kafka readStream() to start reading from the offsets stored in this new
> location (ZooKeeper/HDFS), since the updated application can't read from
> the checkpoint directory, which is now deemed incompatible.
>
> So a couple of questions:
> 1.  Is the above-stated solution a valid solution?
> 2.  If yes, how can I automate the detection of whether the application is
> being restarted because of failure/maintenance or because of code changes
> to stateful operations and/or output schema?
>
> Any guidance, example, or information source is appreciated.
>
> Thanks,
> Priyank
>


Re: Rest API for Spark2.3 submit on kubernetes(version 1.8.*) cluster

2018-03-21 Thread Gourav Sengupta
Hi Lucas,

Thanks a ton for responding.

Have you used Livy and Spark on EMR? I am genuinely not sure how adding a
spark-submit step in EMR is hard; it is just one line of code.

I must be missing something here.


Regards,
Gourav Sengupta

On Wed, Mar 21, 2018 at 2:37 PM, lucas.g...@gmail.com 
wrote:

> Speaking from experience: if you're already operating a Kubernetes
> cluster, getting a Spark workload running there is nearly an order of
> magnitude simpler than working with / around EMR.
>
> That's not to say EMR is excessively hard, just that Kubernetes is easier:
> all the steps to getting your application deployed are well documented, and
> ultimately the whole process is more visible.
>
> Also, thanks for the link Yinan!  I'll be investigating that project!
>
> We have in the past used EMR for larger workloads, and as soon as we
> announced that our users could run those workloads on our k8s cluster,
> everyone immediately moved over.  This was despite Spark on K8s still
> being in an infant state.  No one has expressed interest in moving back.
>
> G
>
> On 21 March 2018 at 07:32, Gourav Sengupta 
> wrote:
>
>> Hi,
>>
>> Just out of curiosity: since it is in AWS, is there any specific reason
>> not to use EMR? Or any particular reason to use Kubernetes?
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Wed, Mar 21, 2018 at 2:47 AM, purna pradeep 
>> wrote:
>>
>>> I'm using a Kubernetes cluster on AWS to run Spark jobs, and I'm using
>>> Spark 2.3. Now I want to run spark-submit from an AWS Lambda function to
>>> the k8s master, and would like to know if there is any REST interface to
>>> run spark-submit on the k8s master.


Re: Rest API for Spark2.3 submit on kubernetes(version 1.8.*) cluster

2018-03-21 Thread Josh Goldsborough
Purna,

It's a bit tangential to your original question, but heads up that Amazon
EKS is in preview right now:
https://aws.amazon.com/eks/

I don't know if it actually allows a nice interface between k8s-hosted
Spark & Lambda functions (my suspicion is it won't fix your problem), but
it might be something worth taking a peek at, since you're already doing
AWS-hosted K8s stuff.

On Wed, Mar 21, 2018 at 9:37 AM, lucas.g...@gmail.com 
wrote:

> Speaking from experience: if you're already operating a Kubernetes
> cluster, getting a Spark workload running there is nearly an order of
> magnitude simpler than working with / around EMR.
>
> That's not to say EMR is excessively hard, just that Kubernetes is easier:
> all the steps to getting your application deployed are well documented, and
> ultimately the whole process is more visible.
>
> Also, thanks for the link Yinan!  I'll be investigating that project!
>
> We have in the past used EMR for larger workloads, and as soon as we
> announced that our users could run those workloads on our k8s cluster,
> everyone immediately moved over.  This was despite Spark on K8s still
> being in an infant state.  No one has expressed interest in moving back.
>
> G
>
> On 21 March 2018 at 07:32, Gourav Sengupta 
> wrote:
>
>> Hi,
>>
>> Just out of curiosity: since it is in AWS, is there any specific reason
>> not to use EMR? Or any particular reason to use Kubernetes?
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Wed, Mar 21, 2018 at 2:47 AM, purna pradeep 
>> wrote:
>>
>>> I'm using a Kubernetes cluster on AWS to run Spark jobs, and I'm using
>>> Spark 2.3. Now I want to run spark-submit from an AWS Lambda function to
>>> the k8s master, and would like to know if there is any REST interface to
>>> run spark-submit on the k8s master.


[Structured Streaming] Application Updates in Production

2018-03-21 Thread Priyank Shrivastava
I am using Structured Streaming with Spark 2.2.  We are using Kafka as our
source and are using checkpoints for failure recovery and e2e exactly-once
guarantees.  I would like to get some more information on how to handle
updates to the application when there is a change in stateful operations
and/or output schema.

As some of the sources suggest, I can start the updated application in
parallel with the old application until it catches up with the old
application in terms of data, and then kill the old one.  But then the new
application will have to re-read/re-process all the data in Kafka, which
could take a long time.

I want to AVOID this re-processing of the data in the newly deployed
updated application.

One way I can think of is for the application to keep writing the offsets
into something in addition to the checkpoint directory, for example into
ZooKeeper/HDFS.  Then, on an update of the application, I would direct the
Kafka readStream() to start reading from the offsets stored in this new
location (ZooKeeper/HDFS), since the updated application can't read from
the checkpoint directory, which is now deemed incompatible.

So a couple of questions:
1.  Is the above-stated solution a valid solution?
2.  If yes, how can I automate the detection of whether the application is
being restarted because of failure/maintenance or because of code changes
to stateful operations and/or output schema?

Any guidance, example, or information source is appreciated.

Thanks,
Priyank
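
For what it's worth, the mechanism that would make the proposed solution
expressible is that the Kafka source's startingOffsets option accepts a
per-partition JSON map, which is honored when the query starts with a fresh
checkpoint directory. A hedged sketch; the topic, partitions, offset
numbers, and servers are hypothetical, and the offsets would really come
from the external ZooKeeper/HDFS store:

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("resume_sketch").getOrCreate()

# Offsets the old application persisted externally (e.g. to ZooKeeper/HDFS);
# the topic, partitions, and numbers here are made up.
saved_offsets = {"events": {"0": 1234, "1": 5678}}

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical
      .option("subscribe", "events")
      # Only consulted when there is no existing checkpoint, i.e. when the
      # updated application starts with a fresh checkpoint directory.
      .option("startingOffsets", json.dumps(saved_offsets))
      .load())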


Re: HadoopDelegationTokenProvider

2018-03-21 Thread Marcelo Vanzin
They should be available via the current user's credentials:

UserGroupInformation.getCurrentUser().getCredentials()

On Wed, Mar 21, 2018 at 7:32 AM, Jorge Machado  wrote:
> Hey Spark group,
>
> I want to create a delegation token provider for Accumulo, and I have one
> question:
>
> How can I get the token that I added to the credentials from the executor
> side? The SecurityManager class is private…
>
> Thanks
>
>
> Jorge Machado



-- 
Marcelo




Re: Rest API for Spark2.3 submit on kubernetes(version 1.8.*) cluster

2018-03-21 Thread lucas.g...@gmail.com
Speaking from experience: if you're already operating a Kubernetes cluster,
getting a Spark workload running there is nearly an order of magnitude
simpler than working with / around EMR.

That's not to say EMR is excessively hard, just that Kubernetes is easier:
all the steps to getting your application deployed are well documented, and
ultimately the whole process is more visible.

Also, thanks for the link Yinan!  I'll be investigating that project!

We have in the past used EMR for larger workloads, and as soon as we
announced that our users could run those workloads on our k8s cluster,
everyone immediately moved over.  This was despite Spark on K8s still being
in an infant state.  No one has expressed interest in moving back.

G

On 21 March 2018 at 07:32, Gourav Sengupta 
wrote:

> Hi,
>
> Just out of curiosity: since it is in AWS, is there any specific reason
> not to use EMR? Or any particular reason to use Kubernetes?
>
>
> Regards,
> Gourav Sengupta
>
> On Wed, Mar 21, 2018 at 2:47 AM, purna pradeep 
> wrote:
>
>> I'm using a Kubernetes cluster on AWS to run Spark jobs, and I'm using
>> Spark 2.3. Now I want to run spark-submit from an AWS Lambda function to
>> the k8s master, and would like to know if there is any REST interface to
>> run spark-submit on the k8s master.


HadoopDelegationTokenProvider

2018-03-21 Thread Jorge Machado
Hey Spark group,

I want to create a delegation token provider for Accumulo, and I have one
question:

How can I get the token that I added to the credentials from the executor
side? The SecurityManager class is private…

Thanks


Jorge Machado







Re: Rest API for Spark2.3 submit on kubernetes(version 1.8.*) cluster

2018-03-21 Thread Gourav Sengupta
Hi,

Just out of curiosity: since it is in AWS, is there any specific reason
not to use EMR? Or any particular reason to use Kubernetes?


Regards,
Gourav Sengupta

On Wed, Mar 21, 2018 at 2:47 AM, purna pradeep 
wrote:

> I'm using a Kubernetes cluster on AWS to run Spark jobs, and I'm using
> Spark 2.3. Now I want to run spark-submit from an AWS Lambda function to
> the k8s master, and would like to know if there is any REST interface to
> run spark-submit on the k8s master.


Re: Rest API for Spark2.3 submit on kubernetes(version 1.8.*) cluster

2018-03-21 Thread purna pradeep
Thanks Yinan,

Looks like this is still in alpha.

I would like to know if there is any REST interface for Spark 2.3 job
submission, similar to Spark 2.2, as I need to submit Spark applications to
the k8s master based on different events (cron or S3 file-based triggers).

On Tue, Mar 20, 2018 at 11:50 PM Yinan Li  wrote:

> One option is the Spark Operator. It allows specifying and running Spark
> applications on Kubernetes using Kubernetes custom resource objects. It
> takes SparkApplication CRD objects and automatically submits the
> applications to run on a Kubernetes cluster.
>
> Yinan
>
> On Tue, Mar 20, 2018 at 7:47 PM, purna pradeep 
> wrote:
>
>> I'm using a Kubernetes cluster on AWS to run Spark jobs, and I'm using
>> Spark 2.3. Now I want to run spark-submit from an AWS Lambda function to
>> the k8s master, and would like to know if there is any REST interface to
>> run spark-submit on the k8s master.


Wait for 30 seconds before terminating Spark Streaming

2018-03-21 Thread Aakash Basu
Hi,

Using: *Spark 2.3 + Kafka 0.10*


How do I wait for 30 seconds after the latest stream and, if there's no
more streaming data, exit gracefully?

Is it done by running -

query.awaitTermination(30)


Or is it something else?

I tried with this, keeping -

option("startingOffsets", "latest")

for both of my input streams to be joined.

I am first running the Spark job and then pushing the two CSV files' data
into the two respective Kafka topics, but I am getting the following error -


ERROR MicroBatchExecution:91 - Query [id =
21c96e5c-770d-4d59-8893-4401217120b6, runId =
e03d7ac9-97b6-442e-adf4-dd232f9ed616] terminated with error
org.apache.spark.SparkException: Writing job aborted.


When I keep -

option("startingOffsets", "earliest")

The first batch output works perfectly and then terminates after the given
time.

Please help!

Thanks,
Aakash.
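
On the first question: in PySpark, query.awaitTermination(30) simply bounds
the wait at 30 seconds and returns whether or not any data arrived. One way
to express "exit 30 seconds after the last data" is to poll the query's
status and reset a deadline whenever data is available. A hedged sketch,
assuming an already-started StreamingQuery:

import time
from pyspark.sql.streaming import StreamingQuery

def stop_when_idle(query: StreamingQuery, idle_seconds: int = 30) -> None:
    """Stop `query` once no data has been available for `idle_seconds`."""
    deadline = time.time() + idle_seconds
    while time.time() < deadline:
        if query.status["isDataAvailable"]:        # a batch is pending/running
            deadline = time.time() + idle_seconds  # reset the idle window
        time.sleep(1)
    query.stop()  # graceful exit instead of a timed-out awaitTermination

Calling stop_when_idle(query) after start() would replace the fixed
awaitTermination(30) wait.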


Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-21 Thread Aakash Basu
Thanks Chris!

On Fri, Mar 16, 2018 at 10:13 PM, Bowden, Chris  wrote:

> 2. You must decide. If multiple streaming queries are launched in a single
> / simple application, only you can dictate whether a single failure should
> cause the application to exit. If you use spark.streams.awaitAnyTermination,
> be aware it returns / throws if _any_ streaming query terminates. A more
> complex application may keep track of many streaming queries and attempt to
> relaunch them with lower latency for certain types of failures.
>
>
> 3a. I'm not very familiar with py, but I doubt you need the sleep
>
> 3b. Kafka consumer tuning is simply a matter of passing appropriate config
> keys to the source's options if desired
>
> 3c. I would argue the most obvious improvement would be a more structured
> and compact data format if CSV isn't required.
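
(To illustrate 3b: consumer properties pass straight through to the Kafka
source when the option key carries a "kafka." prefix. A sketch; the server,
topic, and the particular tuning value are hypothetical, and note that a
few consumer configs, e.g. group.id and auto.offset.reset, are reserved by
the source:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka_tuning_sketch").getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          # "kafka."-prefixed keys are handed to the underlying consumer:
          .option("kafka.max.partition.fetch.bytes", "1048576")
          .option("subscribe", "test1")
          .load())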
>
> --
> *From:* Aakash Basu 
> *Sent:* Friday, March 16, 2018 9:12:39 AM
> *To:* sagar grover
> *Cc:* Bowden, Chris; Tathagata Das; Dylan Guedes; Georg Heiler; user;
> jagrati.go...@myntra.com
>
> *Subject:* Re: Multiple Kafka Spark Streaming Dataframe Join query
>
> Hi all,
>
> Of the queries at the bottom of my last mail, query 1's doubt has been
> resolved. As I had guessed, I had resent the same columns from the Kafka
> producer multiple times, hence the join gave duplicates.
>
> Retested with a fresh Kafka feed, and the problem was solved.
>
> But the other queries still persist; would anyone like to reply? :)
>
> Thanks,
> Aakash.
>
> On 16-Mar-2018 3:57 PM, "Aakash Basu"  wrote:
>
> Hi all,
>
> The code was perfectly alright; just the package I was submitting had to
> be the updated one (marked green below). The join happened, but the output
> has many duplicates (even though the *how* parameter is by default *inner*)
> -
>
> Spark Submit:
>
> /home/kafka/Downloads/spark-2.3.0-bin-hadoop2.7/bin/spark-submit \
>     --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 \
>     /home/aakashbasu/PycharmProjects/AllMyRnD/Kafka_Spark/Stream_Stream_Join.py
>
>
>
> Code:
>
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import split, col
> import time
>
> spark = SparkSession.builder \
>     .appName("DirectKafka_Spark_Stream_Stream_Join") \
>     .getOrCreate()
>
> table1_stream = (spark.readStream.format("kafka")
>                  .option("startingOffsets", "earliest")
>                  .option("kafka.bootstrap.servers", "localhost:9092")
>                  .option("subscribe", "test1")
>                  .load())
>
> table2_stream = (spark.readStream.format("kafka")
>                  .option("startingOffsets", "earliest")
>                  .option("kafka.bootstrap.servers", "localhost:9092")
>                  .option("subscribe", "test2")
>                  .load())
>
> # Split the CSV payloads into typed columns.
> query1 = table1_stream.select('value') \
>     .withColumn('value', table1_stream.value.cast("string")) \
>     .withColumn("ID", split(col("value"), ",").getItem(0)) \
>     .withColumn("First_Name", split(col("value"), ",").getItem(1)) \
>     .withColumn("Last_Name", split(col("value"), ",").getItem(2)) \
>     .drop('value')
>
> query2 = table2_stream.select('value') \
>     .withColumn('value', table2_stream.value.cast("string")) \
>     .withColumn("ID", split(col("value"), ",").getItem(0)) \
>     .withColumn("Department", split(col("value"), ",").getItem(1)) \
>     .withColumn("Date_joined", split(col("value"), ",").getItem(2)) \
>     .drop('value')
>
> joined_Stream = query1.join(query2, "ID")
>
> a = query1.writeStream.format("console").start()
> b = query2.writeStream.format("console").start()
> c = joined_Stream.writeStream.format("console").start()
>
> time.sleep(10)
>
> a.awaitTermination()
> b.awaitTermination()
> c.awaitTermination()
>
>
> Output -
>
> +---+--+-+---+---+
> | ID|First_Name|Last_Name| Department|Date_joined|
> +---+--+-+---+---+
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3| Tobit|Robardley| Accounting|   8/3/2006|
> |  3|