Spark stage stuck

2020-06-02 Thread Manjunath Shetty H
Hi, I am running multiple jobs in the same driver with FAIR scheduling enabled. Intermittently one of the stages gets stuck and does not complete even after a long time. Each job flow is something like this: * Create a JDBC RDD to load data from SQL Server * Create a temporary table * Query
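
A minimal sketch of this setup, assuming concurrent jobs submitted from separate driver threads under the FAIR scheduler; the pool names and the placeholder job are illustrative, not from the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// FAIR scheduling must be set before the context is created.
val conf = new SparkConf()
  .setAppName("multi-job-driver")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// Each job runs on its own driver thread; the scheduler pool is a
// per-thread local property.
val jobs = (1 to 3).map { i =>
  new Thread(s"job-$i") {
    override def run(): Unit = {
      sc.setLocalProperty("spark.scheduler.pool", s"pool$i")
      sc.parallelize(1 to 1000000).map(_ * 2).count() // placeholder job
    }
  }
}
jobs.foreach(_.start())
jobs.foreach(_.join())
```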

Re: Parallelising JDBC reads in spark

2020-05-25 Thread Manjunath Shetty H
Thanks Dhaval for the suggestion, but in the case I mentioned in the previous mail data can still be missed, as the row numbers will change. - Manjunath From: Dhaval Patel Sent: Monday, May 25, 2020 3:01 PM To: Manjunath Shetty H Subject: Re: Parallelising JDBC

Re: Parallelising JDBC reads in spark

2020-05-25 Thread Manjunath Shetty H
Thanks Georg for the suggestion, but at this point changing the design is not really an option. Any other pointers would be helpful. Thanks Manjunath From: Georg Heiler Sent: Monday, May 25, 2020 11:52 AM To: Manjunath Shetty H Cc: Mike Artz ; user Subject

Re: Parallelising JDBC reads in spark

2020-05-25 Thread Manjunath Shetty H
Hi Georg, Thanks for the response. Can you please elaborate on what you mean by change data capture? Thanks Manjunath From: Georg Heiler Sent: Monday, May 25, 2020 11:14 AM To: Manjunath Shetty H Cc: Mike Artz ; user Subject: Re: Parallelising JDBC reads in spark

Re: Parallelising JDBC reads in spark

2020-05-24 Thread Manjunath Shetty H
Shetty H Cc: user Subject: Re: Parallelising JDBC reads in spark Does anything different happen when you set the isolationLevel to do dirty reads, i.e. "READ_UNCOMMITTED"? On Sun, May 24, 2020 at 7:50 PM Manjunath Shetty H <manjunathshe...@live.com> wrote: Hi, We a
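
A hedged sketch of the dirty-read suggestion: Spark's own `isolationLevel` option applies to JDBC writes, so for a SQL Server read the equivalent is a `NOLOCK` hint pushed into the `dbtable` subquery (URL, table, and credentials are placeholders):

```scala
// Dirty reads from SQL Server through the Spark JDBC reader via a
// NOLOCK table hint inside the dbtable subquery.
val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://host:1433;databaseName=db")          // placeholder
  .option("dbtable", "(SELECT * FROM dbo.some_table WITH (NOLOCK)) t")  // placeholder
  .option("user", "user")
  .option("password", "password")
  .load()
```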

Parallelising JDBC reads in spark

2020-05-24 Thread Manjunath Shetty H
Hi, We are writing an ETL pipeline using Spark that fetches data from SQL Server in batch mode (every 15 mins). The problem we are facing is how to parallelise single-table reads into multiple tasks without missing any data. We have tried this: * Use the `ROW_NUMBER` window function in
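
For context, the stock alternative to a `ROW_NUMBER`-based split is Spark's built-in partitioned JDBC read over a stable numeric column; a minimal sketch (URL, table, column, and bounds are illustrative):

```scala
// Spark issues numPartitions parallel queries, each fetching a disjoint
// range of partitionColumn; the column should be stable between the
// range computation and the read so no rows are missed.
val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://host:1433;databaseName=db") // placeholder
  .option("dbtable", "dbo.some_table")                         // placeholder
  .option("partitionColumn", "id") // assumed: a stable numeric key
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .option("user", "user")
  .option("password", "password")
  .load()
```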

How to change Dataframe schema

2020-05-16 Thread Manjunath Shetty H
Hi, I have a dataframe with some columns and data that is fetched from JDBC. As I have to keep the schema consistent in the ORC file, I have to apply a different schema to that dataframe. Column names will be the same, but the data or schema may contain some extra columns. Is there any way I can
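
One way to coerce a dataframe onto a fixed target schema before the ORC write, as a sketch; the target schema and column names here are invented for illustration:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types._

// Illustrative target schema the ORC files must stay consistent with.
val target = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("updated_at", TimestampType)
))

// Cast columns that exist, fill missing ones with typed nulls, and
// drop extras by selecting exactly the target columns.
def conform(df: DataFrame): DataFrame = {
  val cols = target.fields.map { f =>
    if (df.columns.contains(f.name)) col(f.name).cast(f.dataType).alias(f.name)
    else lit(null).cast(f.dataType).alias(f.name)
  }
  df.select(cols: _*)
}
```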

Spark ORC store written timestamp as column

2020-04-15 Thread Manjunath Shetty H
Hi All, Is there any way to store the exact written timestamp in the ORC file through Spark? The use case is something like the `current_timestamp()` function in SQL. Generating it in the program will not be equal to the actual write time of the ORC/HDFS file. Any suggestions would be helpful. Thanks Manjunath
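
The closest stock approximation, with the caveat the mail itself raises: `current_timestamp()` is evaluated during query execution, not at the instant the file lands on HDFS. A sketch, where `df` and the output path are placeholders:

```scala
import org.apache.spark.sql.functions.current_timestamp

// Stamp each row with the execution-time timestamp before writing.
// This precedes the actual HDFS write by however long the write takes.
val stamped = df.withColumn("written_at", current_timestamp())
stamped.write.orc("/warehouse/table_orc") // placeholder path
```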

Spark 1.6 and ORC bucketed queries

2020-04-01 Thread Manjunath Shetty H
Hi, Is it possible to do ORC bucketed queries in Spark 1.6? The folder structure is like this: / bucket1.orc bucket2.orc bucket3.orc And the Spark SQL query will be like `select * from where partition = partition1 and bucket =
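
Spark 1.6 has no native bucket pruning, so one hedged workaround is to point the reader directly at the bucket file under the resolved partition directory; the paths below are placeholders patterned on the folder layout above, and `sqlContext` is assumed to be a HiveContext:

```scala
// Bypass bucket pruning by reading the single bucket file by path.
val oneBucket = sqlContext.read.orc("/warehouse/table/partition=partition1/bucket2.orc")
oneBucket.registerTempTable("one_bucket") // Spark 1.6 temp-table API
val result = sqlContext.sql("select * from one_bucket")
```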

Spark SQL join ORC and non ORC tables in hive

2020-03-24 Thread Manjunath Shetty H
Hi, I am on Spark 1.6. I am getting an error if I try to run a Hive query in Spark that involves joining ORC and non-ORC tables in Hive. Find the error below; any help would be appreciated. org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: TungstenExchange

Re: Saving Spark run stats and run watermark

2020-03-18 Thread Manjunath Shetty H
Thanks for the suggestion, Netanel. Sorry for the lack of information; I am specifically looking for something inside the Hadoop ecosystem. - Manjunath From: Netanel Malka Sent: Wednesday, March 18, 2020 5:26 PM To: Manjunath Shetty H Subject: Re: Saving Spark run stats

Saving Spark run stats and run watermark

2020-03-18 Thread Manjunath Shetty H
Hi All, We want to save each Spark batch run's stats (start, end, ID, etc.) and a watermark (the last processed timestamp from the external data source). We have tried Hive JDBC, but it is very slow due to the MR jobs it triggers. We can't save to normal Hive tables as that will create lots of small files in HDFS.
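
A hedged sketch of one lightweight option inside the Hadoop ecosystem: keep the watermark in a single, overwritten HDFS state file via the FileSystem API, so nothing accumulates as small files in warehouse tables (the path is a placeholder):

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val watermarkPath = new Path("/etl/state/watermark") // placeholder

// Overwrite one small state file per run instead of appending Hive rows.
def saveWatermark(ts: String): Unit = {
  val out = fs.create(watermarkPath, true) // true = overwrite
  try out.write(ts.getBytes(StandardCharsets.UTF_8)) finally out.close()
}

def loadWatermark(): Option[String] =
  if (fs.exists(watermarkPath)) {
    val in = fs.open(watermarkPath)
    try Some(scala.io.Source.fromInputStream(in).mkString) finally in.close()
  } else None
```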

Re: Optimising multiple hive table join and query in spark

2020-03-16 Thread Manjunath Shetty H
Mon., 16 March 2020 at 04:27, Manjunath Shetty H <manjunathshe...@live.com> wrote: Hi Georg, Thanks for the suggestion. Can you please explain a bit more about what you meant exactly? Btw, I am on Spark 1.6 - Manjunath From: Georg Heiler m

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread Manjunath Shetty H
Only partitioned; the join keys are not sorted because those are written incrementally with batch jobs. From: Georg Heiler Sent: Sunday, March 15, 2020 8:30:53 PM To: Manjunath Shetty H Cc: ayan guha ; Magnus Nilsson ; user Subject: Re: Optimising multiple hive

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread Manjunath Shetty H
ugh. I really would like a native way to tell Catalyst not to reshuffle just because you use more columns in the join condition. On Sun, Mar 15, 2020 at 6:04 AM Manjunath Shetty H <manjunathshe...@live.com> wrote: Hi All, We have 10 tables in the data warehouse (hdfs/hive) written using OR

Optimising multiple hive table join and query in spark

2020-03-14 Thread Manjunath Shetty H
Hi All, We have 10 tables in the data warehouse (HDFS/Hive) written using the ORC format. We are serving a use case on top of that by joining 4-5 tables, using Hive as of now. But it is not as fast as we want it to be, so we are thinking of using Spark for this use case. Any suggestions on this? Is it
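
One Spark 1.6-compatible lever for this kind of multi-table join is broadcasting the smaller tables so the large one is never shuffled; a sketch in which `hiveContext` and the table and key names are placeholders:

```scala
import org.apache.spark.sql.functions.broadcast

val facts = hiveContext.table("facts") // placeholder table names
val dim1  = hiveContext.table("dim1")
val dim2  = hiveContext.table("dim2")

// Broadcasting the small dimension tables turns these into map-side
// joins, avoiding a shuffle of the large fact table.
val joined = facts
  .join(broadcast(dim1), Seq("key1"))
  .join(broadcast(dim2), Seq("key2"))
```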

Re: Way to get the file name of the output when doing ORC write from dataframe

2020-03-04 Thread Manjunath Shetty H
Or is there any way to provide a unique file name to the ORC write function itself? Any suggestions would be helpful. Regards Manjunath Shetty From: Manjunath Shetty H Sent: Wednesday, March 4, 2020 2:28 PM To: user Subject: Way to get the file name

Re: How to collect Spark dataframe write metrics

2020-03-04 Thread Manjunath Shetty H
Thanks Zohar, will try that. - Manjunath From: Zohar Stiro Sent: Tuesday, March 3, 2020 1:49 PM To: Manjunath Shetty H Cc: user Subject: Re: How to collect Spark dataframe write metrics Hi, to get DataFrame-level write metrics you can take a look

Way to get the file name of the output when doing ORC write from dataframe

2020-03-04 Thread Manjunath Shetty H
Hi, I wanted to know if there is any way to get the output file name that `Dataframe.orc()` will write to. This is needed to track which file was written by which job during incremental batch jobs. Thanks Manjunath
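
A common workaround, sketched under the assumption that each batch can own its output directory: write to a per-job path and then list the part files Spark produced there (job ID and paths are placeholders):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val jobId  = java.util.UUID.randomUUID().toString
val outDir = s"/warehouse/table_orc/job=$jobId" // placeholder layout

df.write.orc(outDir)

// List what this job actually wrote; Spark names its data files part-*.
val fs = FileSystem.get(new Configuration())
val writtenFiles = fs.listStatus(new Path(outDir))
  .map(_.getPath.toString)
  .filter(_.contains("part-"))
writtenFiles.foreach(println)
```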

How to collect Spark dataframe write metrics

2020-03-01 Thread Manjunath Shetty H
Hi all, Basically my use case is to validate the DataFrame row count before and after writing to HDFS. Is this even good practice? Or should we rely on Spark for guaranteed writes? If it is a good practice to follow, then how do we get DataFrame-level write metrics? Any pointers would
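
A simple validation sketch in the spirit of the question: count before the write, then count what actually landed by reading the output back; the path is a placeholder and the extra passes over the data are the cost of the check:

```scala
val expected = df.count()

df.write.orc("/warehouse/table_orc/batch=42") // placeholder path

// Re-read the output and compare; a cheap correctness check at the
// price of two extra scans.
val actual = sqlContext.read.orc("/warehouse/table_orc/batch=42").count()
require(expected == actual, s"write validation failed: $expected != $actual")
```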

Re: Convert each partition of RDD to Dataframe

2020-02-28 Thread Manjunath Shetty H
Minack Sent: Thursday, February 27, 2020 8:51 PM To: Manjunath Shetty H ; user@spark.apache.org Subject: Re: Convert each partition of RDD to Dataframe Manjunath, You can define your DataFrames in parallel in a multi-threaded driver. Enrico On 27.02.20 at 15:50, Manjunath Shetty H wrote: Hi

Re: Convert each partition of RDD to Dataframe

2020-02-27 Thread Manjunath Shetty H
the different tables in the first place? Enrico On 27.02.20 at 14:53, Manjunath Shetty H wrote: Hi Vinodh, Thanks for the quick response. I didn't get what you meant exactly; any reference or snippet would be helpful. To explain the problem more: * I have 10 partitions; each partition loads

Re: Convert each partition of RDD to Dataframe

2020-02-27 Thread Manjunath Shetty H
.. On Thu, Feb 27, 2020, 7:29 AM Manjunath Shetty H <manjunathshe...@live.com> wrote: Hello All, In Spark I am creating custom partitions with a custom RDD; each partition will have a different schema. Now in the transformation step we need to get the schema and run some Datafra

Convert each partition of RDD to Dataframe

2020-02-27 Thread Manjunath Shetty H
Hello All, In Spark I am creating custom partitions with a custom RDD; each partition will have a different schema. Now in the transformation step we need to get the schema and run some Dataframe SQL queries per partition, because each partition's data has a different schema. How to get the
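
Enrico's multi-threaded-driver suggestion from this thread, restated as a sketch: instead of one RDD with heterogeneous partitions, define one DataFrame per source table concurrently, each with its own schema (`hiveContext`, the table names, and the per-table query are placeholders):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.DataFrame

val tables = Seq("table_a", "table_b", "table_c") // placeholder sources

// Job submission on a SparkContext is thread-safe, so each table's
// DataFrame (with its own schema) can be defined on its own thread.
val futures: Seq[Future[DataFrame]] = tables.map { t =>
  Future {
    val df = hiveContext.table(t)
    df.registerTempTable(s"${t}_view")
    hiveContext.sql(s"SELECT * FROM ${t}_view") // placeholder per-schema query
  }
}
val frames: Seq[DataFrame] = futures.map(f => Await.result(f, Duration.Inf))
```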