Spark SQL check timestamp with other table and update a column.

2020-11-18 Thread anbutech
Hi Team, I want to update col3 in table 1 if col1 from table 2 is less than col1 in table 1, updating each record in table 1. I am not getting the correct output.

Table 1:
col1                      | col2 | col3
2020-11-17T20:50:57.777+  | 1    | null

Table 2:
col1                      | col2 | col3
2020-11-17T21:19:06.508+  | 1    | win
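A minimal PySpark sketch of one way to express this as a join-based update (the join key col2 and the left-join choice are assumptions; the comparison follows the description above):

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.getOrCreate()

  t1 = spark.table("table1")
  t2 = spark.table("table2")

  # Overwrite col3 in table1 with the value from table2 whenever
  # table2.col1 (timestamp) is earlier than table1.col1.
  updated = (
      t1.alias("t1")
        .join(t2.alias("t2"), F.col("t1.col2") == F.col("t2.col2"), "left")
        .select(
            F.col("t1.col1").alias("col1"),
            F.col("t1.col2").alias("col2"),
            F.when(F.col("t2.col1") < F.col("t1.col1"), F.col("t2.col3"))
             .otherwise(F.col("t1.col3"))
             .alias("col3"),
        )
  )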

regexp_extract regex for extracting the columns from string

2020-08-09 Thread anbutech
Hi All, I have the following info in the data column: <1000> date=2020-08-01 time=20:50:04 name=processing id=123 session=new packt=20 orgin=null address=null dest=fgjglgl. Here I want to create a separate column for each of the above key=value pairs that follow the integer <1000> and are separated by spaces. Is there …
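A sketch of one approach: strip the leading <...> token, parse the remaining space-separated key=value pairs with str_to_map, then promote selected keys to columns (the column name data follows the description above):

  from pyspark.sql import functions as F

  # Strip the leading "<1000> " token, then parse the rest into a map column.
  parsed = df.withColumn(
      "kv",
      F.expr("str_to_map(regexp_replace(data, '^<[0-9]+> *', ''), ' ', '=')"),
  )

  # Promote selected keys to top-level columns.
  result = parsed.select(
      "data",
      F.col("kv")["date"].alias("date"),
      F.col("kv")["time"].alias("time"),
      F.col("kv")["name"].alias("name"),
      F.col("kv")["id"].alias("id"),
  )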

Re: Overwrite Mode not Working Correctly in spark 3.0.0

2020-07-19 Thread anbutech
Hi, When I'm using option 1, it completely overwrites the whole table; this is not expected here. I'm running this for multiple tables with different hours. When I'm using option 2, I'm getting the following error: Predicate references non-partition column 'json_feeds_flatten_data'. Only the partition …
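For reference, the Delta replaceWhere option in that version only accepts predicates on partition columns, so a sketch along these lines (the table path and the date/hour partition columns are assumptions):

  # Overwrite only the matching partitions; the replaceWhere predicate must
  # reference partition columns only.
  (df.write
     .format("delta")
     .mode("overwrite")
     .option("replaceWhere", "date = '2020-07-19' AND hour = '10'")
     .save("/mnt/delta/target_table"))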

Schedule/Orchestrate spark structured streaming job

2020-07-19 Thread anbutech
Hi Team, I'm very new to Spark Structured Streaming. Could you please guide me on how to schedule/orchestrate a Spark Structured Streaming job? Is there any scheduler similar to Airflow? I know Airflow doesn't support streaming jobs. Thanks, Anbu
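One common pattern is to run the streaming query with a one-shot trigger so each run processes only newly arrived data and then stops, which lets a batch scheduler such as Airflow kick it off on a schedule. A minimal sketch (paths, format, and input_schema are placeholders):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # The checkpoint tracks what has already been processed, so each scheduled
  # run only picks up data that arrived since the previous run.
  stream_df = (spark.readStream
               .format("json")
               .schema(input_schema)          # assumed to be defined elsewhere
               .load("s3://bucket/input/"))

  query = (stream_df.writeStream
           .format("delta")
           .option("checkpointLocation", "s3://bucket/checkpoints/job1/")
           .trigger(once=True)                # run one incremental batch, then stop
           .start("s3://bucket/output/"))

  query.awaitTermination()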

Overwrite Mode not Working Correctly in spark 3.0.0

2020-07-19 Thread anbutech
Hi Team, I'm facing weird behavior with a PySpark DataFrame (Databricks Delta, Spark 3.0.0 supported). I have tried the below two options to write the processed DataFrame data into a Delta table with respect to the partition columns in the table. Overwrite mode actually completely overwrites the whole …
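For comparison, a sketch of dynamic partition overwrite for file-based (e.g. Parquet) tables; the path and partition columns are assumptions, and Delta tables of that era generally needed the replaceWhere option shown in the reply above instead:

  # With the default "static" mode, overwrite replaces the entire table.
  # In "dynamic" mode, only partitions present in the incoming DataFrame are replaced.
  spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

  (df.write
     .mode("overwrite")
     .partitionBy("date", "hour")
     .parquet("s3://bucket/logs_table/"))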

Pyspark and snowflake Column Mapping

2020-05-05 Thread anbutech
Hi Team, While working on the JSON data, we flattened the unstructured data into a structured format, so here we have Spark data types like Array<Struct> fields and Array data type columns in the Databricks Delta table. While loading the data from the Databricks Spark connector to Snowflake we …
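One approach that is often suggested is to serialize the complex columns to JSON strings before handing them to the Snowflake connector, so they land as VARIANT/VARCHAR on the Snowflake side. A sketch (column names, the target table, and the sf_options dictionary are placeholders):

  from pyspark.sql import functions as F

  # Serialize array/struct columns to JSON strings so they map cleanly
  # to Snowflake columns; "events" and "tags" are assumed column names.
  prepared = (df
      .withColumn("events", F.to_json("events"))
      .withColumn("tags", F.to_json("tags")))

  (prepared.write
      .format("net.snowflake.spark.snowflake")
      .options(**sf_options)              # sfURL, sfUser, sfDatabase, ... assumed defined
      .option("dbtable", "TARGET_TABLE")
      .mode("append")
      .save())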

Get Size of a column in Bytes Pyspark Dataframe

2020-04-16 Thread anbutech
Hello All, I have a column in a DataFrame which is a struct type. I want to find the size of the column in bytes; it is failing while loading into Snowflake. I can see size functions available to get the length. How do I calculate the size in bytes of a column in a PySpark DataFrame?
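One rough way to estimate a per-row byte size is to serialize the struct to JSON and measure the byte length of that string; this approximates the serialized size, not Spark's internal storage size. A sketch ("payload" is an assumed column name):

  from pyspark.sql import functions as F

  sized = df.withColumn(
      "payload_bytes",
      F.expr("octet_length(to_json(payload))"),
  )

  sized.agg(F.max("payload_bytes"), F.avg("payload_bytes")).show()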

How to handle Null values in Array of struct elements in pyspark

2020-04-08 Thread anbutech
Hello All, We have data in a column of a PySpark DataFrame that is an array of struct type with multiple nested fields. If the value is not blank, it saves the data in the same array of struct type in the Spark Delta table. Please advise on the below case: if the same column comes in blank …
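A minimal sketch of one way to keep the Delta table schema consistent when the column arrives null: replace it with an empty array of the same element type ("items" is an assumed column name):

  from pyspark.sql import functions as F

  cleaned = df.withColumn(
      "items",
      F.coalesce(
          F.col("items"),
          # empty array literal parsed with the column's own array-of-struct type
          F.from_json(F.lit("[]"), df.schema["items"].dataType),
      ),
  )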

Pyspark Convert Struct Type to Map Type

2020-02-28 Thread anbutech
Hello Sir, Could you please advise on the below scenario in PySpark 2.4.3 on Databricks for loading the data into a Delta table? I want to load the DataFrame with this column "data" into the table as a Map type in the Databricks Spark Delta table. Could you please advise on this scenario? How to …
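A sketch of one way to turn the struct column into a map column by interleaving each field name (as a literal key) with its value; the cast to string assumes a map<string,string> target:

  from itertools import chain
  from pyspark.sql import functions as F

  struct_fields = df.schema["data"].dataType.fieldNames()

  df_map = df.withColumn(
      "data",
      F.create_map(*chain.from_iterable(
          (F.lit(name), F.col("data." + name).cast("string"))
          for name in struct_fields
      )),
  )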

Performance tuning on the Databricks pyspark 2.4.4

2020-01-21 Thread anbutech
Hi Sir, Could you please help me with the below two cases of Databricks PySpark data processing over terabytes of JSON data read from an AWS S3 bucket? Case 1: currently I'm reading multiple tables sequentially to get the day count from each table. For example, table_list.csv has one column with …
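For case 1, one commonly used pattern is to submit the per-table counts from a thread pool so the cluster can run them concurrently instead of one after another. A sketch (the table list and the date filter are placeholders):

  from concurrent.futures import ThreadPoolExecutor

  tables = ["db.table_a", "db.table_b", "db.table_c"]   # assumed: loaded from table_list.csv

  def day_count(table_name):
      # Each call just submits a Spark job; the driver threads overlap the jobs.
      return table_name, (spark.table(table_name)
                               .where("date = '2020-01-20'")
                               .count())

  with ThreadPoolExecutor(max_workers=8) as pool:
      counts = dict(pool.map(day_count, tables))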

Re: Record count query parallel processing in databricks spark delta lake

2020-01-20 Thread anbutech
Thank you so much Farhan for the help. Please advise on the design approach for this problem: what is the best way to structure this code to get better results? I have some clarification questions on the code. I want to take a daily record count of the ingestion source vs. the Databricks Delta Lake table vs. …

Record count query parallel processing in databricks spark delta lake

2020-01-17 Thread anbutech
Hi, I have a question on the design of a monitoring PySpark script over a large amount of source JSON data coming from more than 100 Kafka topics. These topics are stored under separate buckets in AWS S3; each of the Kafka topics has terabytes of JSON data with respect to the …
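As an alternative to looping over the topics, the per-topic daily counts can be computed in a single pass by reading all topic prefixes at once and deriving the topic name from the file path. A sketch (the bucket layout s3://logs/<topic>/<date>/ is an assumption):

  from pyspark.sql import functions as F

  raw = spark.read.json("s3://logs/*/2020-01-17/*.json")

  counts = (raw
      .withColumn("topic",
                  F.regexp_extract(F.input_file_name(), "logs/([^/]+)/", 1))
      .groupBy("topic")
      .count())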

Merge multiple different s3 logs using pyspark 2.4.3

2020-01-08 Thread anbutech
Hello, version = Spark 2.4.3. I have 3 different sources of JSON log data which have the same schema (same column order) in the raw data, and I want to add one new column, "src_category", to all 3 different sources to distinguish the source category and merge all 3 different sources into …
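A minimal sketch of tagging each source with a literal category and then merging; the paths and category values are placeholders, and unionByName matches columns by name rather than position:

  from pyspark.sql import functions as F

  src_a = spark.read.json("s3://bucket/source_a/").withColumn("src_category", F.lit("source_a"))
  src_b = spark.read.json("s3://bucket/source_b/").withColumn("src_category", F.lit("source_b"))
  src_c = spark.read.json("s3://bucket/source_c/").withColumn("src_category", F.lit("source_c"))

  merged = src_a.unionByName(src_b).unionByName(src_c)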

Flatten log data Using Pyspark

2019-11-29 Thread anbutech
Hi, I have a raw source DataFrame with 2 columns as below:

timestamp    2019-11-29 9:30:45
message_log  <123>NOV 29 10:20:35 ips01 sfids: connection: tcp,bytes:104,user:unknown,url:unknown,host:127.0.0.1

How do we break each of the key:value pairs above out into separate columns?
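A sketch of one approach: pull out the comma-separated key:value section after the "connection: " prefix, parse it into a map with str_to_map, and lift selected keys into columns (the prefix pattern is an assumption based on the sample row):

  from pyspark.sql import functions as F

  parsed = df.withColumn(
      "kv",
      F.expr("str_to_map(regexp_extract(message_log, 'connection: (.*)$', 1), ',', ':')"),
  )

  result = parsed.select(
      "timestamp",
      F.col("kv")["bytes"].alias("bytes"),
      F.col("kv")["user"].alias("user"),
      F.col("kv")["url"].alias("url"),
      F.col("kv")["host"].alias("host"),
  )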

Re: Explode/Flatten Map type Data Using Pyspark

2019-11-14 Thread anbutech
Hello Guha, The number of keys will be different for each event id. For example, if event id 005 has 10 keys, then I have to flatten all 10 of those keys in the final output; there is no fixed number of keys per event id. 001 -> 2 keys, 002 -> 4 keys, 003 -> 5 keys. The above event id …

Explode/Flatten Map type Data Using Pyspark

2019-11-14 Thread anbutech
Hello Sir, I have a scenario to flatten the different combinations of a map type (key/value) in a column called eve_data, like below. How do we flatten the map type into proper columns using PySpark? 1) Source DataFrame having 2 columns (event id, data): eve_id,eve_data 001, "k1":"abc", …
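A sketch of one way to flatten the map: explode it into (key, value) rows and pivot the keys into columns. This also handles a variable number of keys per eve_id, as discussed in the reply above; events missing a key get null in that column:

  from pyspark.sql import functions as F

  flat = (df
      .select("eve_id", F.explode("eve_data").alias("key", "value"))
      .groupBy("eve_id")
      .pivot("key")
      .agg(F.first("value")))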

Spark scala/Hive scenario

2019-08-07 Thread anbutech
Hi All, I have a scenario in Spark Scala/Hive. Day 1: I have a file with 5 columns which needs to be processed and loaded into Hive tables. Day 2: the next day the same feed (file) has 8 columns (additional fields) which need to be processed and loaded into Hive tables. How do we approach this …
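One approach, sketched here in PySpark (the thread is Scala, but the idea carries over): evolve the Hive table first, then append, so day-1 rows simply read the new columns as null. The table and column names are placeholders:

  # Add the three new day-2 columns to the existing table.
  spark.sql("ALTER TABLE db.feed_table ADD COLUMNS (col6 STRING, col7 STRING, col8 STRING)")

  # insertInto resolves columns by position, so align the DataFrame's column
  # order with the table before appending.
  (day2_df
      .select(*spark.table("db.feed_table").columns)
      .write
      .insertInto("db.feed_table"))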

Spark 2.4 scala 2.12 Regular Expressions Approach

2019-07-15 Thread anbutech
Hi All, Could you please help me fix the below issue using Spark 2.4, Scala 2.12? How do we extract multiple values from the given file name pattern using a Spark/Scala regular expression? Please give me some ideas on the below approach. object Driver { private val filePattern = …
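A sketch of the general idea using regexp_extract with one capture group per value, shown in PySpark (the same function exists in the Scala API). The file-name shape source_20190715_batch01.csv is an assumption, since the actual filePattern is not shown above:

  from pyspark.sql import functions as F

  # input_file_name() returns the full path; the unanchored pattern matches
  # the trailing file name.
  pattern = r"(\w+)_(\d{8})_batch(\d+)\.csv$"

  parsed = (df
      .withColumn("file_name", F.input_file_name())
      .withColumn("source",    F.regexp_extract("file_name", pattern, 1))
      .withColumn("file_date", F.regexp_extract("file_name", pattern, 2))
      .withColumn("batch_no",  F.regexp_extract("file_name", pattern, 3)))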

Spark Write method not ignoring double quotes in the csv file

2019-07-11 Thread anbutech
Hello All, Could you please help me fix the below questions? Question 1: I have tried the below options while writing the final data to a CSV file to ignore double quotes in the same CSV file; nothing worked. I'm using Spark version 2.2 and Scala version 2.11. option("quote", "\"") …
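A commonly suggested workaround, sketched here in PySpark (the same options exist in the Scala API), is to set the quote character to the NUL character so the writer does not wrap values in double quotes; note that values containing the delimiter are then no longer protected by quoting, so verify against your data:

  (df.write
     .option("quote", "\u0000")   # effectively disables quoting on write
     .option("header", "true")
     .csv("s3://bucket/output_csv/"))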

Re: Spark 2.2 With Column usage

2019-06-08 Thread anbutech
Thanks Jacek Laskowski, Sir, but I didn't get the point here. Please advise whether the below is what you are expecting: dataset1.as("t1") .join(dataset3.as("t2"), col("t1.col1") === col("t2.col1"), JOINTYPE.Inner) .join(dataset4.as("t3"), col("t3.col1") === col("t1.col1"), JOINTYPE.Inner) …

Spark 2.2 With Column usage

2019-06-07 Thread anbutech
Hi Sir, Could you please advise how to fix the below issue with withColumn in Spark 2.2 / Scala 2.11 joins: def processing(spark: SparkSession, dataset1: Dataset[Reference], dataset2: Dataset[DataCore], dataset3: Dataset[ThirdPartyData], dataset4: Dataset[OtherData] …
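For reference, a sketch of the general pattern being discussed, shown in PySpark (the thread uses the Scala Dataset API): alias each dataset, chain the joins on the aliased column names, then derive additional columns with withColumn. The join keys and the derived column are placeholders:

  from pyspark.sql import functions as F

  joined = (dataset1.alias("t1")
      .join(dataset3.alias("t2"), F.col("t1.col1") == F.col("t2.col1"), "inner")
      .join(dataset4.alias("t3"), F.col("t3.col1") == F.col("t1.col1"), "inner")
      .withColumn("derived_flag", F.col("t2.col1").isNotNull()))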