Re: How to generate unique incrementing identifier in a structured streaming dataframe

2021-07-15 Thread Felix Kizhakkel Jose
>>> print("DataFrame md is empty")
>>>
>>> That value batchId can be used for each Batch.
>>>
>>> Otherwise you can do this
>>>
>>> startval = 1
>>> df = df.withColumn('id', monotonica

How to generate unique incrementing identifier in a structured streaming dataframe

2021-07-13 Thread Felix Kizhakkel Jose
Hello, I am using Spark Structured Streaming to sink data from Kafka to AWS S3. I am wondering if it's possible to introduce a unique, incrementing identifier for each record, as we do in an RDBMS (an incrementing long id). This would greatly help with range pruning when reading by this ID.
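One possible approach, sketched here under assumptions rather than taken from the thread: inside a foreachBatch sink, pack the micro-batch id into the high bits of a long and a per-batch row number into the low bits. The names `ROW_BITS`, `compose_id`, `add_ids`, and the ordering columns are hypothetical; `partition` and `offset` are assumed to still be present from the Kafka source. The resulting ids are unique and increase with the batch id, but they are not gap-free.

```python
# Hypothetical layout: high bits carry the micro-batch id, low bits carry the
# row's position inside that batch. ROW_BITS is an assumption; size it so the
# largest expected micro-batch fits.
ROW_BITS = 32


def compose_id(batch_id: int, row_number: int) -> int:
    """Pack (batch_id, row_number) into one integer that grows with both."""
    if not 0 <= row_number < (1 << ROW_BITS):
        raise ValueError("row_number does not fit in ROW_BITS")
    return (batch_id << ROW_BITS) | row_number


def add_ids(batch_df, batch_id, order_cols=("partition", "offset")):
    """foreachBatch-style helper: tag every row of one micro-batch with an id.

    order_cols are assumed column names used only to order rows within the
    batch before numbering them.
    """
    # Imported lazily so compose_id stays usable without a Spark install.
    from pyspark.sql import Window, functions as F

    # A window without partitionBy moves the whole batch to one partition,
    # which is acceptable for small micro-batches but does not scale.
    row_num = F.row_number().over(Window.orderBy(*order_cols)) - 1
    return batch_df.withColumn(
        "id", F.lit(batch_id << ROW_BITS).cast("long") + row_num
    )
```

A helper like `add_ids` would be called from the function passed to `writeStream.foreachBatch`, which receives the batch DataFrame and its batch id. `compose_id` mirrors the same arithmetic in plain Python so the id layout can be checked without a cluster.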

In built Optimizer on Spark

2021-03-21 Thread Felix Kizhakkel Jose
Hello, Is there an in-built optimizer in Spark, as there is in Flink, that avoids manual configuration tuning to get better performance from a structured streaming pipeline? Or is any work under way to achieve this? Regards, Felix K Jose

Re: How to modify a field in a nested struct using pyspark

2021-01-29 Thread Felix Kizhakkel Jose
94/mse
>
> On Fri, Jan 29, 2021 at 12:15 PM Adam Binford wrote:
>
>> I think they're voting on the next release candidate starting sometime
>> next week. So hopefully barring any other major hurdles within the next few
>> weeks.
>>
>> On Fri, Jan 29, 2021, 1:01

Re: How to modify a field in a nested struct using pyspark

2021-01-29 Thread Felix Kizhakkel Jose
nesting which is nice.
>
> On Fri, Jan 29, 2021 at 11:32 AM Felix Kizhakkel Jose <
> felixkizhakkelj...@gmail.com> wrote:
>
>> Hello All,
>>
>> I am using pyspark structured streaming and I am getting timestamp fields
>> as plain long (milliseconds),

How to modify a field in a nested struct using pyspark

2021-01-29 Thread Felix Kizhakkel Jose
Hello All, I am using pyspark structured streaming and I am getting timestamp fields as plain long (milliseconds), so I have to convert these fields to a timestamp type. A sample JSON object: { "id":{ "value": "f40b2e22-4003-4d90-afd3-557bc013b05e", "type": "UUID",
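A minimal sketch of one way to do this conversion, assuming Spark 3.1 or later, where `Column.withField` is available. The struct and field names (`event`, `created`) are hypothetical, since the sample object above is cut off before any timestamp field appears. The plain-Python helper is included only to sanity-check values outside Spark.

```python
from datetime import datetime, timezone


def millis_to_datetime(ms: int) -> datetime:
    """Plain-Python equivalent of the conversion, handy for checking values."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def convert_nested_millis(df, struct_col="event", field="created"):
    """Replace struct_col.field (epoch milliseconds) with a timestamp.

    struct_col and field are hypothetical names; Column.withField needs
    Spark 3.1 or later.
    """
    # Imported lazily so millis_to_datetime works without a Spark install.
    from pyspark.sql import functions as F

    # Casting a numeric value to timestamp treats it as seconds since the
    # epoch, so the millisecond value is divided by 1000 first.
    as_ts = (F.col(f"{struct_col}.{field}") / 1000).cast("timestamp")
    # withField returns a copy of the struct with only that field replaced.
    return df.withColumn(struct_col, F.col(struct_col).withField(field, as_ts))
```

On Spark versions without `withField`, the usual alternative is to rebuild the whole struct with `F.struct(...)`, listing every field and swapping in the converted one.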