Join pushdown on two external tables from the same external source?
I'm trying to figure out how to join multiple tables from a single external source directly in Spark SQL. Say I do the following in Spark SQL:

    CREATE OR REPLACE TEMPORARY VIEW t1 USING jdbc OPTIONS (dbtable 't1' ...);
    CREATE OR REPLACE TEMPORARY VIEW t2 USING jdbc OPTIONS (dbtable 't2' ...);
    SELECT * FROM t1 JOIN t2 ON t1.id = t2.id LIMIT 10;

This query results in a full table scan of both t1 and t2 in my JDBC source, which isn't great, but is understandable given how I have defined the tables. An optimized query would perhaps only need to select 10 rows from the underlying database. That optimization would work using the Scala API (I'm not sure exactly what the Spark SQL equivalent is, or whether there is one):

    spark.read.jdbc("jdbc:...", "(SELECT * FROM t1 JOIN t2 ON t1.id = t2.id LIMIT 10) AS t", new java.util.Properties)

However, this method seems cumbersome for every query I might want to run against my remote JDBC database: it requires writing the query in a string, and it doesn't use Spark SQL. Ideally, I would like to define an entire database using the JDBC source, so that queries touching only tables from that source could be pushed down entirely to the underlying database.

Does anyone know a better approach to this problem, or, more generally, a nicer way to integrate Spark SQL with a remote database using some other approach or tool?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Join-pushdown-on-two-external-tables-from-the-same-external-source-tp28759.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
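One way to stay in Spark SQL is to push the entire join into the view definition, since the JDBC source's dbtable option accepts a parenthesized subquery in place of a table name. This is only a per-query workaround, not the whole-database pushdown asked about above, and the url value and view name below are placeholders, not from the original post:

```sql
-- Sketch: the remote database executes the join and the LIMIT;
-- Spark only fetches the 10 resulting rows.
CREATE OR REPLACE TEMPORARY VIEW t1_t2_joined
USING jdbc
OPTIONS (
  url 'jdbc:postgresql://host/db',
  dbtable '(SELECT * FROM t1 JOIN t2 ON t1.id = t2.id LIMIT 10) AS t'
);

SELECT * FROM t1_t2_joined;
```

The trade-off is the same as with spark.read.jdbc: the pushed-down query lives in a string, and each such query needs its own view.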
_SUCCESS file validation on read
When writing a DataFrame, a _SUCCESS file is created to mark that the entire DataFrame has been written. However, the existence of this _SUCCESS file does not appear to be validated by default on reads, which in some cases would allow a partially written DataFrame to be read back. Is this behavior configurable? Is the lack of validation intentional? Thanks!

Here is an example from a Spark 2.1.0 shell. I would expect the read step to fail because I've manually removed the _SUCCESS file:

    scala> spark.range(10).write.save("/tmp/test")

    $ rm /tmp/test/_SUCCESS

    scala> spark.read.parquet("/tmp/test").show()
    +---+
    | id|
    +---+
    |  8|
    |  9|
    |  3|
    |  4|
    |  5|
    |  0|
    |  6|
    |  7|
    |  2|
    |  1|
    +---+

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SUCCESS-file-validation-on-read-tp28564.html
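Since Spark doesn't appear to validate the marker on read (the _SUCCESS file comes from Hadoop's FileOutputCommitter at write time), one option is a manual guard before loading the dataset. A minimal sketch for local filesystem paths — the function name is mine, and for HDFS or S3 you would check through the Hadoop FileSystem API rather than os.path:

```python
import os

def has_success_marker(path):
    """Return True if the directory contains the _SUCCESS marker
    that Hadoop's FileOutputCommitter writes after a complete job."""
    return os.path.isfile(os.path.join(path, "_SUCCESS"))

# Guard the read: only load the dataset if the write completed, e.g.
#   if has_success_marker("/tmp/test"):
#       df = spark.read.parquet("/tmp/test")
#   else:
#       raise IOError("refusing to read a possibly partial write")
```

This would have caught the example above: after `rm /tmp/test/_SUCCESS`, the guard raises instead of silently reading the directory.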
Re: No way to set mesos cluster driver memory overhead?
It seems like this is a real issue, so I've opened a JIRA ticket: https://issues.apache.org/jira/browse/SPARK-17928

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-way-to-set-mesos-cluster-driver-memory-overhead-tp27897p27901.html
No way to set mesos cluster driver memory overhead?
When using Spark on Mesos and deploying a job in cluster mode through the dispatcher, there appears to be no memory overhead configuration for the launched driver process ("--driver-memory" sets the JVM's Xmx, which is the same as the Mesos memory quota). This makes it almost a guarantee that a long-running driver will eventually be OOM-killed by Mesos. YARN cluster mode has an equivalent option, spark.yarn.driver.memoryOverhead. Is there some way to configure driver memory overhead that I'm missing?

Bigger-picture question: is it even best practice to deploy long-running Spark Streaming jobs through the dispatcher? I could alternatively launch the driver by itself, for example via Marathon, where it would be trivial to grant the process additional memory. Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-way-to-set-mesos-cluster-driver-memory-overhead-tp27897.html
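For comparison, the overhead knobs that do exist at the time of this thread can be set in spark-defaults.conf. The values below are illustrative (MiB), and the absence of a Mesos driver-side equivalent is exactly the gap this post describes:

```
# YARN cluster mode: off-heap headroom added on top of the driver's
# heap when sizing the driver container (MiB)
spark.yarn.driver.memoryOverhead     512

# Mesos: an executor-side overhead option exists, but there is no
# documented driver-side equivalent for dispatcher-launched drivers
spark.mesos.executor.memoryOverhead  512
```

In both cases the overhead covers non-heap memory (off-heap allocations, thread stacks, interned strings), which is why sizing the container to exactly Xmx invites the OOM kills described above.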