Re: Performance Issue

2019-01-08 Thread Gourav Sengupta
Hi,

Can you please let us know the Spark version, the query, whether the data is
in Parquet format, and where it is stored?

Regards,
Gourav Sengupta

On Wed, Jan 9, 2019 at 1:53 AM 大啊  wrote:

> What is your performance issue?
>
>
>
>
>
> At 2019-01-08 22:09:24, "Tzahi File"  wrote:
>
> Hello,
>
> I have some performance issue running SQL query on Spark.
>
> The query involves one Parquet table partitioned by date, where each
> partition is about 200 GB, and a simple table with about 100 records. The
> Spark cluster is of type m5.2xlarge (8 cores). I'm using the Qubole
> interface to run the SQL query.
>
> After searching for ways to improve my query, I added the following
> settings to the configuration:
> spark.sql.shuffle.partitions=1000
> spark.dynamicAllocation.maxExecutors=200
>
> There wasn't any significant improvement. I'm looking for any ideas
> to improve my running time.
>
>
> Thanks!
> Tzahi
>
>
>
>
>


[Spark SQL] Failure Scenarios involving JDBC and SQL databases

2019-01-08 Thread Ramon Tuason
Hi all,

I'm writing a data source that shares similarities with Spark's own JDBC 
implementation, and I'd like to ask a question about how Spark handles failure 
scenarios involving JDBC and SQL databases. To my understanding, if an executor 
dies while it's running a task, Spark will revive the executor and try to 
re-run that task. However, how does this play out in the context of data 
integrity and Spark's JDBC data source API (e.g. 
df.write.format("jdbc").option(...).save())?
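
(For concreteness, a hypothetical version of the write call I mean; the
connection URL, credentials, and table name below are made up:)

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-write-example").getOrCreate()
    import spark.implicits._

    // Toy frame; in the real job this comes from the actual pipeline.
    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

    df.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")  // hypothetical database
      .option("dbtable", "public.target_table")
      .option("user", "writer")
      .option("password", "secret")
      .mode("append")
      .save()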

In the savePartition function of JdbcUtils.scala, we see Spark calling the
commit and rollback functions of the Java connection
object generated from the database url/credentials provided by the user 
(screenshot below). Can someone provide some guidance on what exactly happens 
under certain failure scenarios? For example, if an executor dies right after 
commit() finishes or before rollback() is called, does Spark try to re-run the 
task and write the same data partition again, essentially creating duplicate 
committed rows in the database? What happens if the executor dies in the middle 
of calling commit() or rollback()?

Thanks for your help!

[screenshot of the commit/rollback calls in JdbcUtils.savePartition omitted]
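
For context, a simplified sketch of the per-partition commit/rollback pattern
the screenshot shows. This is not the actual Spark source (the real
savePartition also handles batch sizes, isolation levels, and error
translation), and the table and column names are made up:

    import java.sql.{Connection, DriverManager}

    // Simplified, hypothetical take on a per-partition JDBC write with
    // commit/rollback, in the spirit of JdbcUtils.savePartition.
    def savePartitionSketch(url: String, rows: Iterator[(Int, String)]): Unit = {
      val conn: Connection = DriverManager.getConnection(url)
      var committed = false
      try {
        conn.setAutoCommit(false)  // one transaction for the whole partition
        val stmt = conn.prepareStatement(
          "INSERT INTO target_table (id, name) VALUES (?, ?)")
        try {
          rows.foreach { case (id, name) =>
            stmt.setInt(1, id)
            stmt.setString(2, name)
            stmt.addBatch()
          }
          stmt.executeBatch()
        } finally {
          stmt.close()
        }
        conn.commit()      // rows become durable here
        committed = true
      } finally {
        if (!committed) {
          conn.rollback()  // failure before commit() leaves no committed rows
        }
        conn.close()
      }
    }

Reading it this way, an executor lost after commit() returns but before the
task is reported as finished would mean a retried task runs the insert again
for the same partition (the duplicate-row scenario asked about), while a
failure before commit() is covered by the rollback or by the database aborting
the abandoned transaction.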


Re: Performance Issue

2019-01-08 Thread 大啊
What is your performance issue?






At 2019-01-08 22:09:24, "Tzahi File"  wrote:

Hello, 


I have some performance issue running SQL query on Spark. 


The query involves one Parquet table partitioned by date, where each partition
is about 200 GB, and a simple table with about 100 records. The Spark cluster
is of type m5.2xlarge (8 cores). I'm using the Qubole interface to run the SQL
query.


After searching for ways to improve my query, I added the following settings
to the configuration:
spark.sql.shuffle.partitions=1000
spark.dynamicAllocation.maxExecutors=200


There wasn't any significant improvement. I'm looking for any ideas to improve 
my running time.




Thanks! 
Tzahi 




Is it possible to rate limit a UDF?

2019-01-08 Thread email
I have a DataFrame to which I apply a UDF that calls a REST web service. This
web service is deployed on only a few nodes and won't be able to handle a
massive load from Spark.

 

Is it possible to rate limit this UDF? For example, something like 100
ops/sec.

 

If not, what are the options? Is splitting the DataFrame an option?

 

I've read a similar question on Stack Overflow [1], and the solution suggests
Spark Streaming, but my application does not involve streaming. Do I need to
turn the operations into a streaming workflow to achieve something like that?

 

Current workflow: Hive -> Spark -> Service
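
For what it's worth, one non-streaming sketch for that Hive -> Spark -> Service
flow: cap how many tasks hit the service at once (coalesce/repartition) and
throttle calls inside each partition with something like Guava's RateLimiter,
moving the per-row call into mapPartitions instead of a plain UDF. The table
name, column, and callService stub below are placeholders, Guava has to be on
the classpath, and the effective global rate is only roughly
(concurrent partitions) x (per-partition rate):

    import com.google.common.util.concurrent.RateLimiter
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("rate-limited-enrichment")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Stub standing in for the real REST call made inside the UDF.
    def callService(id: String): String = s"response-for-$id"

    val perPartitionRate = 10.0  // permits per second in each task
    val maxConcurrentTasks = 10  // so roughly 100 requests/sec overall

    val enriched = spark.table("source_hive_table")          // hypothetical Hive table
      .select($"id".cast("string").as[String])
      .coalesce(maxConcurrentTasks)                          // limit tasks calling the service at once
      .mapPartitions { ids =>
        val limiter = RateLimiter.create(perPartitionRate)   // one limiter per partition (not serializable)
        ids.map { id =>
          limiter.acquire()                                  // block until a permit is available
          (id, callService(id))
        }
      }
      .toDF("id", "response")

The bound is approximate (stragglers and task retries shift it), but it avoids
turning the job into a streaming workflow.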

 

Thank you

 

[1] https://stackoverflow.com/questions/43953882/how-to-rate-limit-a-spark-map-operation



Performance Issue

2019-01-08 Thread Tzahi File
Hello,

I have some performance issue running SQL query on Spark.

The query involves one Parquet table partitioned by date, where each partition
is about 200 GB, and a simple table with about 100 records. The Spark cluster
is of type m5.2xlarge (8 cores). I'm using the Qubole interface to run the SQL
query.

After searching for ways to improve my query, I added the following settings
to the configuration:
spark.sql.shuffle.partitions=1000
spark.dynamicAllocation.maxExecutors=200

There wasn't any significant improvement. I'm looking for any ideas
to improve my running time.
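
For reference, a minimal sketch of how the two settings above can be applied
when building the session (assuming direct access to the SparkSession; Qubole
may expose these through its own cluster or interpreter configuration instead):

    import org.apache.spark.sql.SparkSession

    // Hypothetical session setup applying the two settings mentioned above.
    val spark = SparkSession.builder()
      .appName("partitioned-parquet-query")                    // placeholder app name
      .config("spark.sql.shuffle.partitions", "1000")
      .config("spark.dynamicAllocation.maxExecutors", "200")   // takes effect at application start
      .getOrCreate()

    // spark.sql.shuffle.partitions can also be changed per query at runtime:
    spark.conf.set("spark.sql.shuffle.partitions", "1000")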


Thanks!
Tzahi