Re: Spark MySQL Invalid DateTime value killing job

2019-06-05 Thread Anthony May
Murphy's Law struck right after I asked the question: I just discovered the solution. The JDBC URL should set the zeroDateTimeBehavior option. https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-configuration-properties.html
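
For anyone hitting the same thing, a minimal sketch of what that looks like (host, database and table names are placeholders; convertToNull is one of the documented values for the option):

    // zeroDateTimeBehavior=convertToNull maps invalid zero datetimes to NULL
    val url = "jdbc:mysql://db-host:3306/legacy_db?zeroDateTimeBehavior=convertToNull"
    // pass it as the "url" option of the JDBC reader, as in the original post below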

Spark MySQL Invalid DateTime value killing job

2019-06-05 Thread Anthony May
Hi, We have a legacy process that scrapes a MySQL database. The Spark job uses the DataFrame API and the MySQL JDBC driver to read the tables and save them as JSON files. One table has DateTime columns containing values that are invalid for java.sql.Timestamp, so it's throwing the exception:
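
For context, a rough Scala sketch of the kind of job described (table, path and connection details are made-up placeholders, not the actual job):

    // read one table over JDBC and dump it as JSON
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/legacy_db")
      .option("dbtable", "some_table")
      .option("user", "user")
      .option("password", "secret")
      .load()
    df.write.mode("overwrite").json("/data/json/some_table")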

Blog post: DataFrame.transform -- Spark function composition

2019-06-05 Thread Daniel Mateus Pires
Hi everyone! I just published this blog post on how custom Spark Scala transformations can be rearranged to compose better and be used within .transform: https://medium.com/@dmateusp/dataframe-transform-spark-function-composition-eb8ec296c108 I found the discussions in this group to be
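
As a quick illustration of the pattern the post is about (the column and function names here are made up, not taken from the article; rawDf stands for any existing DataFrame):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // each step is a DataFrame => DataFrame, so steps chain cleanly with .transform
    def withYear(df: DataFrame): DataFrame =
      df.withColumn("year", year(col("event_date")))
    def onlyRecent(df: DataFrame): DataFrame =
      df.filter(col("year") >= 2018)

    val result = rawDf.transform(withYear).transform(onlyRecent)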

Spark on K8S - --packages not working for cluster mode?

2019-06-05 Thread pacuna
I'm trying to run some sample code that reads a file from S3, so I need the AWS SDK and Hadoop AWS dependencies. If I assemble these deps into the main jar everything works fine. But when I try using --packages, the deps are not seen by the pods. This is my submit command: spark-submit --master
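
For reference, a submit along these lines (package coordinates, image and class names are examples, not the actual command from the post) would look like:

    spark-submit --master k8s://https://<k8s-apiserver>:443 \
      --deploy-mode cluster \
      --conf spark.kubernetes.container.image=<spark-image> \
      --packages org.apache.hadoop:hadoop-aws:2.7.3 \
      --class com.example.Main \
      local:///opt/app/main.jar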

Re: Spark structured streaming leftOuter join not working as I expect

2019-06-05 Thread Jungtaek Lim
Nice to hear you're investigating the issue deeply. By the way, if attaching code is not easy, maybe you could share the logical/physical plan of any batch: the "detail" section in the SQL tab shows the plan as a string. Plans from sequential batches would be much more helpful - and the streaming query status in these batch
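
If it helps, a rough sketch of pulling the same information out programmatically (query is a placeholder for the running StreamingQuery handle; the "detail" section of the SQL tab shows the same plan text):

    // print the logical/physical plans and the current streaming query status
    query.explain(true)
    println(query.status)
    println(query.lastProgress)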

Re: Spark structured streaming leftOuter join not working as I expect

2019-06-05 Thread Joe Ammann
Hi Jungtaek, Thanks for your response! I have actually set watermarks on all the streams A/B/C with the respective event time columns A/B/C_LAST_MOD, so I think this should not be the reason. Of course, the event time on the C stream (the "optional" one) progresses much slower than on the other
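
For context, the watermark setup described is roughly the following (streamA/B/C stand for the three input DataFrames; the 10 minute delay is an example value, not the actual configuration):

    val a = streamA.withWatermark("A_LAST_MOD", "10 minutes")
    val b = streamB.withWatermark("B_LAST_MOD", "10 minutes")
    val c = streamC.withWatermark("C_LAST_MOD", "10 minutes")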

[Pyspark 2.4] Best way to define activity within different time window

2019-06-05 Thread Rishi Shah
Hi All, Is there a best practice for calculating daily, weekly, monthly, quarterly and yearly active users? One approach is to create a daily bitmap window and aggregate it by period later. However, I was wondering if anyone has a better approach to tackling this problem. -- Regards,
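
One possible sketch of the daily bucket (written in Scala here, though the same API exists in PySpark 2.4; events, user_id and event_time are assumed names):

    import org.apache.spark.sql.functions._

    // distinct users per 1-day event-time window
    val dau = events
      .groupBy(window(col("event_time"), "1 day"))
      .agg(approx_count_distinct("user_id").as("active_users"))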

Re: installation of spark

2019-06-05 Thread Alonso Isidoro Roman
When using macOS, it is recommended to install Java, Scala and Spark using brew. Run these commands in a terminal:
    brew update
    brew install scala
    brew install sbt
    brew cask install java
    brew install spark
There is no need to install HDFS, you can use your local file system without a

spark ./build/mvn test failed on aarch64

2019-06-05 Thread Tianhua huang
Hi all, Recently I ran './build/mvn test' for Spark on aarch64, and both master and branch-2.4 failed; the log pieces are below: .. [INFO] T E S T S [INFO] --- [INFO] Running org.apache.spark.util.kvstore.LevelDBTypeInfoSuite [INFO] Tests

Re: Spark structured streaming leftOuter join not working as I expect

2019-06-05 Thread Jungtaek Lim
I would suspect that rows are never evicted from state in the second join. To determine whether a row is NOT matched on the other side, Spark has to check whether the row was ever matched before it is evicted. You need to set a watermark on either B_LAST_MOD or C_LAST_MOD. If you already did but it's not shown here,
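
For reference, the documented shape of a stream-stream left outer join needs watermarks plus an event-time range condition in the join itself; a sketch using the column names from this thread (streamB/streamC, the join keys and the interval are made-up):

    import org.apache.spark.sql.functions.expr

    val b = streamB.withWatermark("B_LAST_MOD", "10 minutes")
    val c = streamC.withWatermark("C_LAST_MOD", "10 minutes")

    // the range condition lets Spark know when a row can no longer match and be evicted
    val joined = b.join(
      c,
      expr("b_id = c_id AND C_LAST_MOD >= B_LAST_MOD AND C_LAST_MOD <= B_LAST_MOD + interval 1 hour"),
      "leftOuter")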
