>>> print("DataFrame md is empty")
>>>
>>> That batchId value can be used for each batch.
>>>
>>>
>>> Otherwise you can do this
>>>
>>>
>>> startval = 1
>>> df = df.withColumn('id', monotonically_increasing_id() + startval)
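A minimal sketch of that suggestion, assuming a foreachBatch sink: monotonically_increasing_id() is the real pyspark.sql.functions API, while stream_df, the write path, and the column names are illustrative. Note the generated ids are unique and increasing within each partition, but not consecutive.

    from pyspark.sql.functions import lit, monotonically_increasing_id

    startval = 1

    def write_batch(df, batch_id):
        # Guard against empty micro-batches, as in the snippet above.
        if df.rdd.isEmpty():
            print("DataFrame md is empty")
            return
        out = (df
               .withColumn("batch_id", lit(batch_id))  # the per-batch batchId
               .withColumn("id", monotonically_increasing_id() + startval))
        out.write.mode("append").parquet("s3://my-bucket/sink/")  # illustrative path

    query = stream_df.writeStream.foreachBatch(write_batch).start()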
Hello,
I am using Spark Structured Streaming to sink data from Kafka to AWS S3. I
am wondering if it's possible to introduce a uniquely incrementing
identifier for each record, as we do in an RDBMS (an auto-incrementing long id)?
This would greatly help with range pruning at read time based on this ID.
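For the range-pruning goal, a hedged sketch of the read side, assuming records were written with such an id column (the path and bounds are illustrative); with formats that keep min/max column statistics, Spark can skip files whose id range falls entirely outside the filter:

    from pyspark.sql.functions import col

    # Push down a range predicate on the id column at read time.
    df = (spark.read.parquet("s3://my-bucket/sink/")  # illustrative path
          .where(col("id").between(1_000_000, 2_000_000)))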
Hello,
Is there any built-in optimizer in Spark, as in Flink, to avoid manual
configuration tuning when trying to achieve better performance for a
structured streaming pipeline?
Or is there any work happening to achieve this?
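To make concrete what "manual configuration tuning" usually means for such a pipeline, a sketch with a few commonly tuned knobs (all values are illustrative, not recommendations):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("streaming-pipeline")
             .config("spark.sql.shuffle.partitions", "64")  # shuffle parallelism
             .getOrCreate())

    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")  # illustrative
              .option("subscribe", "events")                     # illustrative topic
              .option("maxOffsetsPerTrigger", "100000")  # cap records per micro-batch
              .load())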
Regards,
Felix K Jose
>
> On Fri, Jan 29, 2021 at 12:15 PM Adam Binford wrote:
>
>> I think they're voting on the next release candidate starting sometime
>> next week. So hopefully barring any other major hurdles within the next few
>> weeks.
>>
>> On Fri, Jan 29, 2021, 1:01
nesting which is nice.
>
> On Fri, Jan 29, 2021 at 11:32 AM Felix Kizhakkel Jose <
> felixkizhakkelj...@gmail.com> wrote:
>
>> Hello All,
>>
>> I am using pyspark structured streaming and I am getting timestamp fields
>> as plain long (milliseconds),
Hello All,
I am using pyspark structured streaming and I am getting timestamp fields
as plain long (milliseconds), so I have to convert these fields into a
timestamp type.
A sample JSON object:
{
  "id": {
    "value": "f40b2e22-4003-4d90-afd3-557bc013b05e",
    "type": "UUID",