Re: CVEs

2021-07-12 Thread Eric Richardson
Hi Sean and Holden, I decided it was best to send an email so I could share all my findings with the team. I think it should be relatively easy to fix with updates, but I am not that good at working on the repo. I tried but ended up with some roadblocks that were going to take some time to figure

Performance Improvement with Hive/Thrift Server

2021-07-12 Thread Artemis User
We are trying to switch from Postgres to Spark's built-in Hive with the Thrift server as the data sink for persisting the ML result data, in the hope that Hive would improve the ML pipeline's performance. However, it turned out that it took significantly longer for Hive to persist dataframes (via

Why planInputPartitions is called multiple times in a micro-batch?

2021-07-12 Thread kineret M
Hi, I'm developing a new Spark connector using the Data Source V2 API (Spark 3.1.1). I noticed that the planInputPartitions method (in MicroBatchStream) is called twice in every micro-batch. What is the motivation/reason for this? Thanks, Kineret