Re: Query around Spark Checkpoints

2020-09-28 Thread Debabrata Ghosh
… in case of structured streaming it should be a file location. >> But the main question is: why do you want to checkpoint in >> NoSQL, as it's eventually consistent? >> Regards >> Amit >> On Sunday, September 27, 2020, Debabrata Ghosh >> wrote:

Query around Spark Checkpoints

2020-09-27 Thread Debabrata Ghosh
Hi, I had a query around Spark checkpoints - can I store the checkpoints in NoSQL or Kafka instead of the filesystem? Regards, Debu

Spark : Very simple query failing [Needed help please]

2020-09-18 Thread Debabrata Ghosh
Hi, I needed some help from you with the attached Spark problem, please. I am running the following query: >>> df_location = spark.sql("""select dt from ql_raw_zone.ext_ql_location where ( lat between 41.67 and 45.82) and (lon between -86.74 and -82.42 ) and year=2020 and month=9 and
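The predicate in that truncated query can be sanity-checked outside Spark. Below is a plain-Python sketch of the same filter; the column names (dt, lat, lon, year, month) come from the post, while the sample rows are made up for illustration.

```python
# Plain-Python version of the WHERE clause from the Spark SQL query,
# useful for verifying the lat/lon bounds independently of the cluster.
rows = [
    {"dt": "2020-09-01", "lat": 42.33, "lon": -83.05, "year": 2020, "month": 9},
    {"dt": "2020-08-15", "lat": 40.71, "lon": -74.01, "year": 2020, "month": 8},
]

def matches(r):
    # BETWEEN in SQL is inclusive on both ends
    return (41.67 <= r["lat"] <= 45.82
            and -86.74 <= r["lon"] <= -82.42
            and r["year"] == 2020 and r["month"] == 9)

selected = [r["dt"] for r in rows if matches(r)]
print(selected)  # ['2020-09-01']
```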

Refreshing static data with streaming data at regular Intervals

2020-07-21 Thread Debabrata Ghosh
Hi All, We have a static DataFrame as follows: id|time_stamp: 1|1540527851, 2|1540525602, 3|1530529187, 4|1520529185, 5|1510529182, 6|1578945709. We also have a live stream of events, a Streaming DataFrame which contains id and updated
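The refresh logic being asked about can be sketched in plain Python: keep the static data keyed by id and overwrite an entry only when the streamed timestamp is newer. In Spark this would typically be done by joining each streaming micro-batch with the static DataFrame inside foreachBatch; the id/timestamp values below follow the post, and the update tuples are invented for illustration.

```python
# Minimal sketch: merge streamed (id, time_stamp) updates into static data,
# keeping the newer timestamp per id.
static = {1: 1540527851, 2: 1540525602, 3: 1530529187,
          4: 1520529185, 5: 1510529182, 6: 1578945709}

def apply_updates(table, updates):
    """Replace an id's timestamp only when the streamed value is newer."""
    for _id, ts in updates:
        if ts > table.get(_id, 0):
            table[_id] = ts
    return table

apply_updates(static, [(2, 1600000000), (3, 1000)])
print(static[2], static[3])  # 1600000000 1530529187  (stale update ignored)
```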

Re: Spark Streaming not working

2020-04-10 Thread Debabrata Ghosh
Any solution please? On Fri, Apr 10, 2020 at 11:04 PM Debabrata Ghosh wrote: > Hi, > I have a Spark Streaming application where Kafka is producing > records but unfortunately Spark Streaming isn't able to consume them. > I am hitting the following error: > 20

Re: Spark Streaming not working

2020-04-10 Thread Debabrata Ghosh
On Fri, Apr 10, 2020 at 11:14 PM Srinivas V wrote: > Check if your broker details are correct, verify if you have network > connectivity to your client box and Kafka broker server host. > > On Fri, Apr 10, 2020 at 11:04 PM Debabrata Ghosh > wrote: > >> Hi, >>

Spark Streaming not working

2020-04-10 Thread Debabrata Ghosh
Hi, I have a Spark Streaming application where Kafka is producing records but unfortunately Spark Streaming isn't able to consume them. I am hitting the following error: 20/04/10 17:28:04 ERROR Executor: Exception in task 0.5 in stage 0.0 (TID 24) java.lang.AssertionError: assertion

Environment variable for deleting .sparkStaging

2020-02-13 Thread Debabrata Ghosh
Greetings All ! I have plenty of application directories lying around under .sparkStaging, such as .sparkStaging/application_1580703507814_0074. Could you please advise which variable I need to set in spark-env.sh so that the sparkStaging application directories aren't preserved after the
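For context, the switch here is a Spark configuration property rather than a spark-env.sh variable: spark.yarn.preserve.staging.files defaults to false, which already lets YARN delete .sparkStaging/&lt;appId&gt; when the application ends cleanly, so lingering directories usually point to applications that were killed before cleanup ran, or to this flag having been set. A minimal sketch:

```
# spark-defaults.conf (or pass via --conf on spark-submit)
# false is the default: staged files (Spark jar, app jar, distributed
# cache files) are deleted when the application finishes.
spark.yarn.preserve.staging.files  false
```

Directories left behind by killed applications still have to be removed manually from HDFS.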

Need help regarding logging / log4j.properties

2019-10-30 Thread Debabrata Ghosh
Greetings All ! I needed some help in obtaining the application logs, but I am really confused about where they are currently located. Please allow me to explain my problem: 1. I am running the Spark application (written in Java) on a Hortonworks Data Platform Hadoop cluster 2. My spark-submit command is
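For reference, a common setup on Spark 2.x (which ships log4j 1.x, as on HDP) is to pass a custom log4j.properties to both driver and executors; on YARN, the aggregated logs are then retrieved afterwards with `yarn logs -applicationId <appId>`. A minimal sketch, based on Spark's own log4j.properties template:

```
# log4j.properties, shipped via:
#   spark-submit --files log4j.properties \
#     --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
#     --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" ...
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```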

Unable to write data from Spark into a Hive Managed table

2019-08-09 Thread Debabrata Ghosh
Hi, I am using Hortonworks Data Platform 3.1. I am unable to write data from Spark into a Hive managed table, but am able to do so into a Hive external table. Would you please help me with a resolution. Thanks, Debu

Best Practice for Writing data into a Hive table

2019-04-13 Thread Debabrata Ghosh
Hi, Please can you let me know which of the following options would be best practice for writing data into a Hive table: Option 1: outputDataFrame.write .mode(SaveMode.Overwrite) .format("csv") .save("hdfs_path") Option 2: Get the data from a dataframe and

Help Required - Unable to run spark-submit on YARN client mode

2018-05-08 Thread Debabrata Ghosh
Hi Everyone, I have been trying to run spark-shell in YARN client mode, but am getting a lot of ClosedChannelException errors; the program works fine in local mode. I am using Spark 2.2.0 built for Hadoop 2.7.3. If you are familiar with this error, please can you help with the possible

Re: Calling Pyspark functions in parallel

2018-03-19 Thread Debabrata Ghosh
…/10/30/introducing-vectorized-udfs-for-pyspark.html > Sent from my iPhone > Pardon the dumb thumb typos :) > On Mar 18, 2018, at 10:54 PM, Debabrata Ghosh <mailford...@gmail.com> > wrote: > Hi, > My dataframe is having 2000 row

Calling Pyspark functions in parallel

2018-03-18 Thread Debabrata Ghosh
Hi, My dataframe has 2000 rows. Processing each row takes 3 seconds, so sequentially it takes 2000 * 3 = 6000 seconds, which is a very long time. I am therefore contemplating running the function in parallel. For example, I would like to divide the
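The reply in the thread above points at vectorized (pandas) UDFs, which is the idiomatic PySpark answer; another is simply to repartition so executors process rows concurrently. The underlying chunk-and-map idea can be tested locally with the stdlib, as in this sketch (process_row is a made-up stand-in for the 3-second-per-row function):

```python
# Local sketch of row-parallel processing; in PySpark the equivalent work
# distribution comes from repartitioning plus a (pandas_)udf.
from concurrent.futures import ThreadPoolExecutor

def process_row(row):
    # stand-in for the expensive per-row function from the post
    return row * 2

rows = list(range(2000))
with ThreadPoolExecutor(max_workers=8) as pool:
    # map preserves input order, so results line up with rows
    results = list(pool.map(process_row, rows))

print(len(results), results[:3])  # 2000 [0, 2, 4]
```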

Pyspark Error: Unable to read a hive table with transactional property set as 'True'

2018-03-02 Thread Debabrata Ghosh
Hi All, Greetings ! I needed some help to read a Hive table via PySpark for which the transactional property is set to 'True' (in other words, the ACID property is enabled). Following is the entire stacktrace and the description of the Hive table. Would you please be able to help
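Background that usually applies here: plain Spark (without the Hive Warehouse Connector used on HDP 3.x) generally cannot read ACID tables directly, because the rows live in delta files that non-ACID readers do not merge. One commonly suggested workaround is to run a major compaction in Hive first, which rewrites the deltas into base files. A hedged sketch (the table name is hypothetical, not from the post):

```sql
-- Run in Hive, not Spark: a major compaction rewrites delta files into
-- base files that plain Spark readers can see.
ALTER TABLE mydb.my_acid_table COMPACT 'major';
```

This only snapshots the table at compaction time; for ongoing reads of ACID tables on HDP 3.x, the Hive Warehouse Connector is the supported path.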

Re: Efficient way to compare the current row with previous row contents

2018-02-12 Thread Debabrata Ghosh
Georg - Thanks ! Will you be able to help me with a few examples please. Thanks in advance again ! Cheers, D On Mon, Feb 12, 2018 at 6:03 PM, Georg Heiler <georg.kf.hei...@gmail.com> wrote: > You should look into window functions for spark sql. > Debabrata Ghosh <mailford...@gma

Efficient way to compare the current row with previous row contents

2018-02-12 Thread Debabrata Ghosh
Hi, Greetings ! I needed an efficient way in PySpark to execute a comparison (on all the attributes) between the current row and the previous row. My intent here is to leverage the distributed framework of Spark to the best extent so that I can achieve a good
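As the reply in the related thread suggests, the distributed version of this is a window function: pyspark.sql.functions.lag over a Window ordered by some column, comparing each attribute against its lagged value. The comparison itself can be sketched in plain Python (the rows and the "status" attribute are invented for illustration):

```python
# Current-vs-previous comparison; in PySpark this maps to
# Window.orderBy(...) + F.lag(col) for each attribute.
rows = [
    {"id": 1, "status": "A"},
    {"id": 2, "status": "A"},
    {"id": 3, "status": "B"},
]

def changed(prev, cur):
    """True when any attribute (other than the key) differs from the previous row."""
    return any(prev[k] != cur[k] for k in cur if k != "id")

# First row has no predecessor, hence None (lag yields NULL there too)
flags = [None] + [changed(p, c) for p, c in zip(rows, rows[1:])]
print(flags)  # [None, False, True]
```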

Need help with String Concat Operation

2017-10-18 Thread Debabrata Ghosh
Hi, I have a dataframe column (the name of the column is CTOFF) and I intend to prefix it with '0' in case the length of the value is 3. Unfortunately, I am unable to achieve my goal and wonder whether you can help me here. The command I am executing: ctoff_dedup_prep_temp =
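The usual PySpark answer for this is F.lpad(col, 4, '0') from pyspark.sql.functions, which pads any value shorter than 4 to the left with zeros. The padding rule as literally stated in the post (pad only when the length is exactly 3) looks like this in plain Python; the sample values are made up:

```python
# Zero-pad a 3-character CTOFF value to 4 characters; leave others alone.
# (F.lpad('CTOFF', 4, '0') in PySpark pads anything shorter than 4.)
def pad_ctoff(v):
    return v.rjust(4, "0") if len(v) == 3 else v

print([pad_ctoff(v) for v in ["730", "1215", "45"]])  # ['0730', '1215', '45']
```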

Re: How to flatten a row in PySpark

2017-10-13 Thread Debabrata Ghosh
nicholas.hakobian@rallyhealth.com wrote: > Using explode on the 4th column, followed by an explode on the 5th column > would produce what you want (you might need to use split on the columns > first if they are not already an array).

How to flatten a row in PySpark

2017-10-12 Thread Debabrata Ghosh
Hi, Greetings ! I have data in the format of the following row: ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730 I want to convert it into several rows in the format below: ABZ|ABZ|AF|2|1|730 ABZ|ABZ|AF|3+1|730 . . . ABZ|ABZ|AF|3|1|730 ABZ|ABZ|AF|3|2|730
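The reply in the thread above (split the 4th and 5th columns into arrays, then explode each) produces exactly the cross product of the two comma-separated lists. The same transformation on the sample row, sketched with the stdlib:

```python
# Flatten one pipe-delimited row into the cross product of its two
# comma-separated columns; in PySpark this is split() + two explode()s.
from itertools import product

row = "ABZ|ABZ|AF|2,3,7,8,B,C,D,E,J,K,L,M,P,Q,T,U,X,Y|1,2,3,4,5|730"
head1, head2, head3, col4, col5, tail = row.split("|")

flattened = [
    f"{head1}|{head2}|{head3}|{a}|{b}|{tail}"
    for a, b in product(col4.split(","), col5.split(","))
]
print(flattened[0], len(flattened))  # ABZ|ABZ|AF|2|1|730 90
```

18 values in the 4th column times 5 in the 5th gives 90 output rows.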

Unable to run Spark Jobs in yarn cluster mode

2017-10-10 Thread Debabrata Ghosh
Hi All, I am constantly hitting an error: "ApplicationMaster: SparkContext did not initialize after waiting for 100 ms" while running my Spark code in yarn cluster mode. Here is the command I am using: *spark-submit --master yarn --deploy-mode cluster spark_code.py*

Re: HDP 2.5 - Python - Spark-On-Hbase

2017-09-30 Thread Debabrata Ghosh
Ayan, Did you get the HBase connection working through PySpark as well? I have got the Spark-HBase connection working with Scala (via HBaseContext). However, I eventually want to get this working within PySpark code - would you have some suitable code snippets or

NullPointerException error while saving Scala Dataframe to HBase

2017-09-30 Thread Debabrata Ghosh
Dear All, Greetings ! I am repeatedly hitting a NullPointerException error while saving a Scala Dataframe to HBase. Please can you help me resolve this. Here is the code snippet: scala> def catalog = s"""{ ||"table":{"namespace":"default", "name":"table1"},
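The snippet is the catalog definition used by the Spark-HBase connector (shc), where a malformed catalog (most often a missing "rowkey" entry or a column family mapping that doesn't match the table) is a common NullPointerException trigger. For comparison, a well-formed catalog follows this shape; the "table1"/"key" names match the post, while the cf1/col1 mapping is illustrative:

```json
{
  "table":   {"namespace": "default", "name": "table1"},
  "rowkey":  "key",
  "columns": {
    "col0": {"cf": "rowkey", "col": "key",  "type": "string"},
    "col1": {"cf": "cf1",    "col": "col1", "type": "string"}
  }
}
```

Every DataFrame column must appear under "columns", and exactly one of them must map to the special "rowkey" column family.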

Needed some best practices to integrate Spark with HBase

2017-09-29 Thread Debabrata Ghosh
Dear All, Greetings ! I needed some best practices for integrating Spark with HBase. Would you be able to point me to some useful resources / URLs at your convenience please. Thanks, Debu

Need some help around a Spark Error

2017-07-25 Thread Debabrata Ghosh
Hi, While executing a SparkSQL query, I am hitting the following error. I wonder if you can please help me with a possible cause and resolution. Here is the stacktrace: 07/25/2017 02:41:58 PM - DataPrep.py 323 - __main__ - ERROR - An error occurred while calling

Re: Need Spark(Scala) Performance Tuning tips

2017-06-09 Thread Debabrata Ghosh
Roman >> <https://about.me/alonso.isidoro.roman> >> 2017-06-09 14:50 GMT+02:00 Debabrata Ghosh <mailford...@gmail.com>: >>> Hi, >>> I need some help / guidance in performance tuning >>> Spark code written in Scala. Can you please help. >>> Thanks

Need Spark(Scala) Performance Tuning tips

2017-06-09 Thread Debabrata Ghosh
Hi, I need some help / guidance in performance tuning Spark code written in Scala. Can you please help. Thanks