Re: Use our own metastore with Spark SQL

2019-10-14 Thread Zhu, Luke
I had a similar issue this summer while prototyping Spark on K8s. I ended up sticking with Hive Metastore 2 to meet time goals. Not sure if I was using it correctly, but I only needed Hadoop + Hive JARs; I did not need to run HDFS, YARN, etc. Using the metastore with an s3a warehouse.dir path
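For anyone trying a similar setup, here is a rough sketch of the Spark settings that point Spark SQL at a standalone Hive Metastore with an s3a warehouse directory. The metastore URI and bucket name are hypothetical placeholders, not values from this thread, and the helper functions are illustrative only. The configuration keys themselves are standard Spark settings.

```python
# Hypothetical endpoint and bucket; replace with your own values.
METASTORE_URI = "thrift://hive-metastore.example.svc:9083"
WAREHOUSE_DIR = "s3a://example-bucket/warehouse"


def metastore_conf(metastore_uri, warehouse_dir):
    """Return Spark settings for an external Hive Metastore (no HDFS or YARN)."""
    return {
        # Use the Hive catalog instead of the in-memory one.
        "spark.sql.catalogImplementation": "hive",
        # Forwarded to the Hadoop/Hive configuration as hive.metastore.uris.
        "spark.hadoop.hive.metastore.uris": metastore_uri,
        # Default location for managed tables.
        "spark.sql.warehouse.dir": warehouse_dir,
    }


def to_submit_args(conf):
    """Render a settings dict as spark-submit `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(metastore_conf(METASTORE_URI, WAREHOUSE_DIR))))
```

The Hadoop and Hive JARs (including the s3a connector) still need to be on the classpath, as described above.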

Semantics of Manual Offset Commit for Kafka Spark Streaming

2019-10-14 Thread Andre Piwoni
When using manual Kafka offset commit in a Spark Streaming job, if the application fails to process the current batch and does not commit the offset in the executor, is it expected behavior that the next batch will be processed and the offset will be moved to the next batch regardless of the application's failure to commit? It
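The scenario in the question can be modeled with a toy sketch. It assumes the failure is caught and the job keeps running, and that the stream tracks its read position in memory, separately from the committed offset, which is only consulted on restart. The class and function names are hypothetical; this is not Spark's or Kafka's code.

```python
class DirectStreamModel:
    """Toy model of one partition: in-memory position vs. committed offset."""

    def __init__(self, committed=0):
        self.committed = committed  # what the consumer group has stored
        self.position = committed   # where the running stream reads next

    def next_batch(self, size):
        """Hand out the next offset range; the position advances regardless of commits."""
        start, end = self.position, self.position + size
        self.position = end
        return start, end

    def commit(self, offset):
        """Record an offset as committed (manual commit after successful processing)."""
        self.committed = offset

    def restart(self):
        """On restart, reading resumes from the last committed offset."""
        self.position = self.committed


def run(stream, batch_size, outcomes):
    """Process one batch per outcome; commit only when processing succeeded."""
    log = []
    for ok in outcomes:
        start, end = stream.next_batch(batch_size)
        log.append((start, end, ok))
        if ok:
            stream.commit(end)
    return log


if __name__ == "__main__":
    stream = DirectStreamModel()
    for entry in run(stream, 10, [True, False, True]):
        print(entry)
    print("committed:", stream.committed)
```

In this model the failed range (10, 20) is never retried while the job runs, and the later successful commit moves the committed offset past it. Replaying the failed range would require stopping before the next commit and restarting from the committed offset.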

Re: Spark 2.4.3 - Structured Streaming - high on Storage Memory

2019-10-14 Thread puneetloya
Hi the amazing Spark team, I was closely following this issue, https://issues.apache.org/jira/browse/SPARK-27648 and then, more recently, this one: https://issues.apache.org/jira/browse/SPARK-29055. Looks like all of it is fixed in this pull request: https://github.com/apache/spark/pull/25973 and it was

Use our own metastore with Spark SQL

2019-10-14 Thread xweb
Is it possible to use our own metastore instead of Hive Metastore with Spark SQL? Can you please point me to some docs or code I can look at to get it done? We are moving away from everything Hadoop.

Spark pipeRDD vs ML

2019-10-14 Thread pratik4891
What I was wondering after reading about Spark's pipe RDD is that we can execute any Python code (including machine learning). The code is going to execute in a distributed manner as well. So if we can run machine learning code in a distributed manner with pipeRDD, what's the usefulness of Spark ML?
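One way to think about the question is a small local sketch of what piping does: each partition's elements are written as lines to a separate external process, and that process's output lines become the result for that partition. The sketch below imitates this with plain subprocesses rather than Spark, and the helper name and the doubling script are made up for illustration.

```python
import subprocess
import sys


def pipe_partition(lines, command):
    """Send one partition's elements to an external command, one per line."""
    stdin_text = "".join(line + "\n" for line in lines)
    proc = subprocess.run(
        command, input=stdin_text, capture_output=True, text=True, check=True
    )
    return proc.stdout.splitlines()


# Stand-in for "any Python code": doubles each number it reads from stdin.
SCRIPT = "import sys\nfor line in sys.stdin:\n    print(float(line) * 2)"

if __name__ == "__main__":
    partitions = [["1", "2"], ["3"]]
    # Each partition gets its own process and never sees the other's data.
    results = [pipe_partition(p, [sys.executable, "-c", SCRIPT]) for p in partitions]
    print(results)
```

Because every process sees only its own partition, a model trained inside the piped script is trained on that partition alone. Spark ML's algorithms are written to combine information across partitions, which is the main thing piping does not provide by itself.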