Join pushdown on two external tables from the same external source?

2017-06-13 Thread drewrobb
I'm trying to figure out how to join multiple tables from a single external source
directly in Spark SQL. Say I do the following in Spark SQL:

CREATE OR REPLACE TEMPORARY VIEW t1 USING jdbc OPTIONS (dbtable 't1', ...)
CREATE OR REPLACE TEMPORARY VIEW t2 USING jdbc OPTIONS (dbtable 't2', ...)

SELECT * FROM t1 JOIN t2 ON t1.id = t2.id LIMIT 10;

This query results in full table scans of t1 and t2 on my JDBC source, which
isn't great, but is understandable given how I have defined the tables. An
optimized query would perhaps only need to select 10 rows from the underlying
database.
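
For reference, the physical plan makes the full scans visible (a quick check, assuming the two views above are registered in the same session):

// With the plain per-table views, the plan shows a Spark-side join over two
// separate JDBC scans of t1 and t2, not a single pushed-down query.
spark.sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id LIMIT 10").explain()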

This would work using the Scala API (I'm not sure exactly what the Spark SQL
equivalent is, or if there is one):

spark.read.jdbc("jdbc:...", "(SELECT * from t1 join t2 on t1.id = t2.id
limit 10) as t", new java.util.Properties)

However, this method seems cumbersome to use for every query I might want to
run against my remote JDBC database (it requires writing the query as a string,
and doesn't use Spark SQL). Ideally, I would want something like defining an
entire database using the JDBC source, so that queries touching only tables
from that source could be pushed down entirely to the underlying database. Does
anyone know a better approach to this problem, or more generally how to get a
nicer integration between Spark SQL and a remote database using some other
approach or tool?
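
One Spark SQL-level workaround that seems close is to define a view over the pre-joined subquery, since the JDBC source accepts a parenthesized query as its dbtable option. A rough sketch (the view name is made up and the connection options are elided as above); it still has the write-the-query-as-a-string drawback:

// The join and LIMIT run inside the remote database; Spark only reads the result.
spark.sql("""
  CREATE OR REPLACE TEMPORARY VIEW t1_join_t2
  USING jdbc
  OPTIONS (url 'jdbc:...', dbtable '(SELECT * FROM t1 JOIN t2 ON t1.id = t2.id LIMIT 10) as t')
""")
spark.table("t1_join_t2").show()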







_SUCCESS file validation on read

2017-04-03 Thread drewrobb
When writing a DataFrame, a _SUCCESS file is created to mark that the entire
DataFrame has been written. However, the existence of this _SUCCESS file does
not seem to be validated on reads by default, which in some cases would allow a
partially written DataFrame to be read back. Is this behavior configurable? Is
the lack of validation intentional?

Thanks!

Here is an example from the Spark 2.1.0 shell. I would expect the read step to
fail because I've manually removed the _SUCCESS file:

scala> spark.range(10).write.save("/tmp/test")

$ rm /tmp/test/_SUCCESS

scala> spark.read.parquet("/tmp/test").show()
+---+
| id|
+---+
|  8|
|  9|
|  3|
|  4|
|  5|
|  0|
|  6|
|  7|
|  2|
|  1|
+---+
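
As a manual workaround (not a built-in check; the path and error handling are only illustrative), the marker can be tested with the Hadoop FileSystem API before reading:

import org.apache.hadoop.fs.{FileSystem, Path}

val dir = new Path("/tmp/test")
val fs = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Only read the output if the committer's _SUCCESS marker is present.
if (fs.exists(new Path(dir, "_SUCCESS"))) {
  spark.read.parquet(dir.toString).show()
} else {
  sys.error(s"$dir is missing _SUCCESS; the write may not have completed")
}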






Re: No way to set mesos cluster driver memory overhead?

2016-10-13 Thread drewrobb
It seems like this is a real issue, so I've opened a JIRA ticket:
https://issues.apache.org/jira/browse/SPARK-17928






No way to set mesos cluster driver memory overhead?

2016-10-13 Thread drewrobb
When using Spark on Mesos and deploying a job in cluster mode via the
dispatcher, there appears to be no memory overhead configuration for the
launched driver process ("--driver-memory" sets both the JVM -Xmx and the Mesos
memory quota). This makes it almost a guarantee that a long-running driver will
eventually be OOM-killed by Mesos. YARN cluster mode has an equivalent option,
spark.yarn.driver.memoryOverhead. Is there some way to configure driver memory
overhead that I'm missing?
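
To make the gap concrete, a rough sketch (assuming YARN's default overhead formula of roughly max(384 MB, 10% of driver memory); the numbers are only illustrative):

// On YARN, the container gets the driver memory plus an overhead cushion for
// off-heap and native usage; on Mesos cluster mode the quota is just the heap size.
val driverMemoryMb = 4096
val yarnStyleOverheadMb = math.max(384, (driverMemoryMb * 0.10).toInt)  // extra headroom YARN would reserve
val mesosQuotaMb = driverMemoryMb                                       // no cushion at all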

Bigger picture question: is it even best practice to deploy long-running Spark
Streaming jobs using the dispatcher? I could alternatively launch the driver by
itself using Marathon, for example, where it would be trivial to grant the
process additional memory.

Thanks!


