Spark continually adds IS NOT NULL onto my MySQL query even though I don't request it

2020-03-31 Thread Hamish Whittal
Hi folks,

1) First problem: I'm querying MySQL. I submit a query like this:

out = wam.select('message_id', 'business_id', 'info', 'entered_system_date', 'auto_update_time') \
    .filter("auto_update_time >= '2020-04-01 05:27'") \
    .dropDuplicates(['message_id', 'auto_update_time'])

But what I see in the
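For context, a hypothetical pure-Python sketch (not Spark itself, and the row data is invented): a comparison filter such as `auto_update_time >= X` already excludes NULLs, which is why Spark's JDBC pushdown adds an explicit IS NOT NULL guard alongside the comparison — the two predicates are semantically equivalent for non-null data.

```python
# Hypothetical illustration: a >= filter implicitly excludes NULL/None values,
# which is why a pushed-down query carries an explicit IS NOT NULL guard.
rows = [
    {"message_id": 1, "auto_update_time": "2020-04-01 06:00"},
    {"message_id": 2, "auto_update_time": None},
    {"message_id": 3, "auto_update_time": "2020-03-31 23:00"},
]
threshold = "2020-04-01 05:27"

# The explicit not-None check mirrors the injected IS NOT NULL: comparing
# None >= str would raise a TypeError in Python, just as NULL >= 'x' is
# not TRUE in SQL.
kept = [r for r in rows
        if r["auto_update_time"] is not None
        and r["auto_update_time"] >= threshold]
print([r["message_id"] for r in kept])  # [1]
```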

Re: HDFS file

2020-03-31 Thread Som Lima
Hi Jane,

Try this example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala

Som

HDFS file

2020-03-31 Thread jane thorpe
hi,

Are there setup instructions on the website for spark-3.0.0-preview2-bin-hadoop2.7? I can run the same program for HDFS format:

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
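The snippet above is the classic word count. As a minimal pure-Python sketch of the same flatMap / map / reduceByKey logic (no Spark or HDFS required; the input line is invented):

```python
from collections import Counter

# Stand-in for sc.textFile(...): an in-memory list of lines.
lines = ["to be or not to be"]

# flatMap(line => line.split(" ")) -> map(word => (word, 1)) -> reduceByKey(_ + _)
counts = Counter(word for line in lines for word in line.split(" "))
print(counts["to"])  # 2
```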

Re: Unable to get to_timestamp with Timezone Information

2020-03-31 Thread Chetan Khatri
Sorry, I misrepresented the question as well. Thanks for your great help. What I want is the time zone information kept as-is, i.e. 2020-04-11T20:40:00-05:00, in a timestamp datatype, so I can write it to the downstream application unchanged. I can correct the missing colon in the UTC offset.

Re: Unable to get to_timestamp with Timezone Information

2020-03-31 Thread Magnus Nilsson
And to answer your question (sorry, I read too fast): the string is not proper ISO 8601. The extended form must be used throughout, i.e. 2020-04-11T20:40:00-05:00; a colon (:) is missing in the UTC offset. br, Magnus
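As an illustration of the two offset forms in plain Python (an analogy, not Spark's to_timestamp): strptime's %z accepts both the basic -0500 and the extended -05:00 spellings, and both preserve the offset.

```python
from datetime import datetime

fmt = "%Y-%m-%dT%H:%M:%S%z"
basic = datetime.strptime("2020-04-11T20:40:00-0500", fmt)      # no colon in offset
extended = datetime.strptime("2020-04-11T20:40:00-05:00", fmt)  # extended ISO 8601 form

print(basic == extended)   # True: same instant, same offset
print(basic.utcoffset())   # -1 day, 19:00:00  (i.e. -05:00)
```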

Re: Unable to get to_timestamp with Timezone Information

2020-03-31 Thread Magnus Nilsson
Timestamps aren't timezoned. If you parse ISO 8601 strings, they will be converted to UTC automatically. If you parse timestamps without a timezone, they will be converted to the timezone the server Spark is running on uses. You can change the timezone Spark uses with
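To illustrate the UTC normalization in plain Python (an analogy only, not Spark's internal code path): once the offset has been parsed, the instant can be converted to UTC, and the original zone is no longer needed to identify it.

```python
from datetime import datetime, timezone

# Parse a timestamp carrying a -05:00 offset, then normalize to UTC --
# analogous to how parsed timestamps lose their original zone.
ts = datetime.strptime("2020-04-11T20:40:00-0500", "%Y-%m-%dT%H:%M:%S%z")
utc = ts.astimezone(timezone.utc)
print(utc.isoformat())  # 2020-04-12T01:40:00+00:00
```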

Unable to get to_timestamp with Timezone Information

2020-03-31 Thread Chetan Khatri
Hi Spark Users,

I am losing the timezone value from the format below. I tried a couple of formats but was not able to make it work. Can someone shed some light?

scala> val sampleDF = Seq("2020-04-11T20:40:00-0500").toDF("value")
sampleDF: org.apache.spark.sql.DataFrame = [value: string]
scala>

Design pattern to invert a large map

2020-03-31 Thread Patrick McCarthy
I'm not a software engineer by training and I hope that there's an existing best practice for the problem I'm trying to solve. I'm using Spark 2.4.5, Hadoop 2.7, Hive 1.2. I have a large table (terabytes) from an external source (which is beyond my control) where the data is stored in a key-value
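One common pattern for this kind of inversion (a hypothetical plain-Python sketch with invented data; in Spark it would typically be an explode/flatMap followed by a group-by): emit one (value, key) pair per entry, then group the pairs by value.

```python
from collections import defaultdict

# Hypothetical key -> list-of-values table.
table = {
    "k1": ["a", "b"],
    "k2": ["b", "c"],
}

# Step 1: flatten to (value, key) pairs -- the flatMap/explode step.
pairs = [(v, k) for k, values in table.items() for v in values]

# Step 2: group by value -- the groupByKey/aggregation step.
inverted = defaultdict(list)
for v, k in pairs:
    inverted[v].append(k)

print(dict(inverted))  # {'a': ['k1'], 'b': ['k1', 'k2'], 'c': ['k2']}
```

The same two steps distribute well because each step is either embarrassingly parallel (the flatten) or a standard shuffle (the group-by).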

Re: spark structured streaming GroupState returns weird values from state

2020-03-31 Thread Jungtaek Lim
That seems to come from the difference in how Spark infers the schema and creates the serializer/deserializer for Java beans to construct a bean encoder. When inferring the schema for Java beans, all properties that have getter methods are considered. When creating the serializer/deserializer, only properties
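A rough Python analogy of that introspection mismatch (the class body here is hypothetical; Spark's actual encoder logic lives in Scala/Java): schema inference that scans every getter also picks up derived getters that have no backing field.

```python
import inspect

class ProductSessionInformation:
    # Hypothetical bean-like class: two stored fields plus one derived getter.
    def __init__(self, start, end):
        self._start = start
        self._end = end

    def get_start(self):
        return self._start

    def get_end(self):
        return self._end

    def get_duration(self):
        # Derived value with no backing field -- getter-based schema
        # inference would still count it as a property.
        return self._end - self._start

getters = sorted(
    name
    for name, _ in inspect.getmembers(ProductSessionInformation, inspect.isfunction)
    if name.startswith("get_")
)
# Three inferred "properties", but only two backed by real fields -- the
# source of the schema/serializer mismatch described above.
print(getters)  # ['get_duration', 'get_end', 'get_start']
```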

Re: spark structured streaming GroupState returns weird values from state

2020-03-31 Thread Srinivas V
Never mind. It got resolved after I removed the two extra getter methods (to calculate duration) I had created in my state-specific Java bean (ProductSessionInformation). But I am surprised it caused so many problems. I guess when this bean is converted to a Scala class it may not be taking care of