Re: Python UDF to convert timestamps (performance question)

2017-08-30 Thread Brian Wylie
Tathagata,

Thanks, your explanation was great. The suggestion worked well; the only minor wrinkle was that I needed to bring the TS field in as a DoubleType() or the time got truncated.

Thanks again,
-Brian

On Wed, Aug 30, 2017 at 1:34 PM, Tathagata Das
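[Editor's note: a minimal sketch of the DoubleType() point Brian describes, for readers landing on this thread. The column name "ts" and the sample row are illustrative assumptions, not Brian's actual schema. Reading the epoch float as a double preserves the fractional seconds; routing it through an integer type truncates them:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("ts-demo").getOrCreate()

    # Hypothetical sample row standing in for the Kafka payload.
    # A Python float becomes a DoubleType column.
    df = spark.createDataFrame([(1379288667.631940,)], ["ts"])

    # Double -> timestamp keeps the sub-second precision.
    df.select(col("ts").cast("timestamp").alias("event_time")).show(truncate=False)

    # For contrast: going through a long first drops the .631940.
    df.select(col("ts").cast("long").cast("timestamp").alias("truncated")).show(truncate=False)
]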

Re: Python UDF to convert timestamps (performance question)

2017-08-30 Thread Tathagata Das
1. Generally, adding columns, etc. will not affect performance, because Spark's optimizer will automatically figure out which columns are not needed and eliminate them in the optimization step. So that should never be a concern.

2. Again, this is generally not a concern, as the optimizer will take
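[Editor's note: a quick way to see the column pruning Tathagata describes, as a sketch with made-up column names. Build a DataFrame with extra columns, select only one, and inspect the optimized plan; Catalyst prunes the unused columns:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pruning-demo").getOrCreate()

    # Three columns, but the query below only needs one of them.
    df = spark.createDataFrame([(1, "a", 2.0)], ["id", "label", "score"])

    # The optimized logical plan shows only 'id' is carried through;
    # the unused 'label' and 'score' columns are eliminated.
    df.select("id").explain(True)
]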

Python UDF to convert timestamps (performance question)

2017-08-30 Thread Brian Wylie
Hi All,

I'm using structured streaming in Spark 2.2. I'm using PySpark, and I have data (from a Kafka publisher) where the timestamp is a float that looks like this: 1379288667.631940

So here's my code (which is working fine):

    # SUBSCRIBE: Setup connection to Kafka Stream
    raw_data =
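[Editor's note: Brian's code is cut off in the archive. Below is a minimal sketch of what such a setup typically looks like, not his actual code; the broker address, topic name, and schema are all assumptions. It also shows the built-in cast that the rest of the thread converges on, which avoids the JVM-to-Python serialization cost of a Python UDF:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, DoubleType

    spark = SparkSession.builder.appName("kafka-ts-demo").getOrCreate()

    # SUBSCRIBE: Setup connection to Kafka Stream (broker and topic are
    # placeholders; requires the spark-sql-kafka-0-10 package).
    raw_data = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "my_topic")
                .load())

    # Assumed JSON payload with an epoch-seconds float field "ts".
    schema = StructType([StructField("ts", DoubleType(), True)])

    # Parse the payload and convert the float to a proper timestamp
    # with the built-in cast (no Python UDF round trip needed).
    events = (raw_data
              .select(from_json(col("value").cast("string"), schema).alias("j"))
              .select(col("j.ts").cast("timestamp").alias("event_time")))
]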