Re: OOM Error

2019-09-06 Thread Upasana Sharma
Is it a streaming job?

On Sat, Sep 7, 2019, 5:04 AM Ankit Khettry wrote:
> I have a Spark job that consists of a large number of Window operations
> and hence involves large shuffles. I have roughly 900 GiBs of data,
> although I am using a large enough cluster (10 * m5.4xlarge instances). I

OOM Error

2019-09-06 Thread Ankit Khettry
I have a Spark job that consists of a large number of Window operations and hence involves large shuffles. I have roughly 900 GiBs of data, although I am using a large enough cluster (10 * m5.4xlarge instances). I am using the following configurations for the job, although I have tried various
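
A minimal sketch of a starting configuration for a shuffle-heavy Window job on this cluster shape (all values are illustrative, not from the thread; the property names are standard Spark configs):

    // Illustrative settings only; the right values depend on data skew and
    // the cluster manager. m5.4xlarge is 16 vCPUs / 64 GiB per node.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("window-heavy-job")
      .config("spark.executor.memory", "40g")           // headroom below the 64 GiB node limit
      .config("spark.executor.memoryOverhead", "8g")    // off-heap room for shuffle buffers
      .config("spark.sql.shuffle.partitions", "2000")   // smaller partitions spill less per task
      .config("spark.memory.fraction", "0.8")
      .getOrCreate()

Raising spark.sql.shuffle.partitions is usually the first lever for shuffle OOMs, since it bounds how much data each task holds at once.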

Re: how to refresh the loaded non-streaming dataframe for each streaming batch?

2019-09-06 Thread Shyam P
The difficult things in Spark are debugging and tuning.

Re: how to refresh the loaded non-streaming dataframe for each streaming batch?

2019-09-06 Thread David Zhou
Not yet. Learning Spark.

On Fri, Sep 6, 2019 at 2:17 PM Shyam P wrote:
> Cool, but did you find a way or any help or clue?
>
> On Fri, Sep 6, 2019 at 11:40 PM David Zhou wrote:
>
>> I have the same question as you.
>>
>> On Thu, Sep 5, 2019 at 9:18 PM Shyam P wrote:
>>
>>> Hi,
>>>
>>> I am

Re: how to refresh the loaded non-streaming dataframe for each streaming batch?

2019-09-06 Thread Shyam P
Cool, but did you find a way or any help or clue?

On Fri, Sep 6, 2019 at 11:40 PM David Zhou wrote:
> I have the same question as you.
>
> On Thu, Sep 5, 2019 at 9:18 PM Shyam P wrote:
>
>> Hi,
>>
>> I am using spark-sql-2.4.1v for streaming in my PoC.
>>
>> How do I refresh the loaded

Question on streaming job wait and re-run

2019-09-06 Thread David Zhou
Hi,

My streaming job consumes data from Kafka and writes it into Cassandra. Current status: Cassandra is not stable, and the streaming job crashes when it can't write data into Cassandra. The streaming job has a checkpoint. Usually, the Cassandra cluster comes back within 4 hours. Finally, I start the
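
Assuming Structured Streaming (the thread does not say which API) and the DataStax Spark Cassandra connector, a minimal sketch of the resume-from-checkpoint pattern; the broker address, path, keyspace, and table names are all made up:

    // Restarting the same query with the same checkpointLocation resumes
    // from the last committed Kafka offsets once Cassandra is back.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // made-up address
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    df.writeStream
      .option("checkpointLocation", "/checkpoints/kafka-to-cassandra")  // made-up path
      .foreachBatch { (batch: org.apache.spark.sql.DataFrame, _: Long) =>
        batch.write
          .format("org.apache.spark.sql.cassandra")   // DataStax connector assumed
          .option("keyspace", "ks")
          .option("table", "events")
          .mode("append")
          .save()
      }
      .start()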

Re: how to refresh the loaded non-streaming dataframe for each streaming batch?

2019-09-06 Thread David Zhou
I have the same question as you.

On Thu, Sep 5, 2019 at 9:18 PM Shyam P wrote:
> Hi,
>
> I am using spark-sql-2.4.1v for streaming in my PoC.
>
> How do I refresh the loaded dataframe from an HDFS/Cassandra table every time
> a new batch of the stream is processed? What is the practice followed in general
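
One common pattern for this (a sketch under assumed names, not from the thread) is to re-load the static table inside foreachBatch, so every micro-batch joins against fresh data:

    streamDf.writeStream
      .option("checkpointLocation", "/chk/refresh-demo")   // made-up path
      .foreachBatch { (batch: org.apache.spark.sql.DataFrame, _: Long) =>
        // Re-read the lookup table on every micro-batch instead of caching it once.
        val lookup = batch.sparkSession.read
          .format("org.apache.spark.sql.cassandra")        // DataStax connector assumed
          .option("keyspace", "ks")
          .option("table", "lookup")
          .load()
        batch.join(lookup, "key")
          .write.mode("append").parquet("/out/joined")     // made-up sink
      }
      .start()

foreachBatch is available from Spark 2.4, so it applies to the spark-sql-2.4.1v setup described above.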

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-06 Thread Sethupathi T
Gabor, thanks for the clarification.

Thanks

On Fri, Sep 6, 2019 at 12:38 AM Gabor Somogyi wrote:
> Sethupathi,
>
> Let me extract the important part of what I've shared:
>
> 1. "This ensures that each Kafka source has its own consumer group that
> does not face interference from any other

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-06 Thread Gabor Somogyi
Sethupathi,

Let me extract the important part of what I've shared:

1. "This ensures that each Kafka source has its own consumer group that does not face interference from any other consumer"
2. Consumers may eat the data from each other, and offset calculation may give back a wrong result (that's
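
For reference, a minimal sketch of where group.id enters the DStream Kafka 0-10 API (broker and topic names are made up): the driver consumer uses the configured id, while the executor consumers get the "spark-executor-" prefix, so the two never collide in one consumer group:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",                 // made-up address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-app"   // executors consume as "spark-executor-my-app"
    )

    // ssc is an existing StreamingContext
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("topic"), kafkaParams)
    )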

Re: DataSourceV2: pushFilters() is not invoked for each read call - spark 2.3.2

2019-09-06 Thread Hyukjin Kwon
I believe this issue was fixed in Spark 2.4. Spark DataSource V2 is still under radical development and is not complete yet. So I think the feasible options to get through at the current moment are:

1. upgrade to a higher Spark version
2. disable filter push down at your
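
For option 2, a sketch against the Spark 2.3 DSV2 API (the class name and stubs are made up): a reader can effectively disable push down by reporting every filter as un-pushed, so Spark evaluates them itself:

    import java.util.Collections
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.Filter
    import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory, DataSourceReader, SupportsPushDownFilters}
    import org.apache.spark.sql.types.StructType

    class MyReader extends DataSourceReader with SupportsPushDownFilters {
      // Return all filters unchanged: Spark treats them as not pushed
      // and applies them itself after the scan.
      override def pushFilters(filters: Array[Filter]): Array[Filter] = filters
      override def pushedFilters(): Array[Filter] = Array.empty

      // Stubs so the sketch compiles; a real reader returns its schema and partitions.
      override def readSchema(): StructType = new StructType()
      override def createDataReaderFactories(): java.util.List[DataReaderFactory[Row]] =
        Collections.emptyList()
    }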

Anonymous functions cannot be found

2019-09-06 Thread Yuta Morisawa
Hi,

I'm trying to use the sparkContext.addJar method to add new jar files at runtime, like Tomcat does. But in some cases it does not work well. The error message says an Executor cannot load an anonymous function. Why can't anonymous functions be loaded despite adding the jar to all Executors? This is
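
For context, a minimal sketch of the call in question (the jar path and class names are illustrative): addJar ships the jar to every executor, but classes in it resolve only for tasks submitted after the call, and that includes compiled anonymous functions such as com.example.Job$$anonfun$1:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("addJar-demo").getOrCreate()
    val sc = spark.sparkContext

    // Ships the jar to every executor for all future tasks; anonymous
    // function classes compiled into the jar (e.g. Foo$$anonfun$1) are
    // only resolvable in tasks that run after this point.
    sc.addJar("/path/to/extra.jar")   // illustrative path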

DataSourceV2: pushFilters() is not invoked for each read call - spark 2.3.2

2019-09-06 Thread Shubham Chaurasia
Hi,

I am using Spark v2.3.2. I have an implementation of DSV2. Here is what is happening:

1) Obtained a dataframe using MyDataSource

scala> val df1 = spark.read.format("com.shubham.MyDataSource").load
MyDataSource.MyDataSource
MyDataSource.createReader: Going to create a new
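
A sketch of how the reported behaviour could be checked (the column name is hypothetical): two independent filtered reads should each trigger pushFilters, which is what the Spark 2.4 fix restores:

    import org.apache.spark.sql.functions.col

    val df1 = spark.read.format("com.shubham.MyDataSource").load().filter(col("i") > 1)
    df1.show()   // pushFilters expected to fire for this read

    val df2 = spark.read.format("com.shubham.MyDataSource").load().filter(col("i") > 5)
    df2.show()   // per the report, 2.3.2 may skip pushFilters on this second read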