DataFrames initial jdbc loading - will it be utilizing a filter predicate?

2015-11-18 Thread Eran Medan
I understand that the following are equivalent: df.filter('account === "acct1") and sql("select * from tempTableName where account = 'acct1'"). But is Spark SQL "smart" enough to also push filter predicates down for the initial load? e.g. sqlContext.read.jdbc(…).filter('account === "acct1")

Joining HDFS and JDBC data sources - benchmarks

2015-11-13 Thread Eran Medan
Hi, I'm looking for some benchmarks on joining data frames where most of the data is in HDFS (e.g. in Parquet) and some "reference" or "metadata" is still in an RDBMS. I am only looking at the very first join before any caching happens, and I assume there will be loss of parallelization because

Spark not working on windows 7 64 bit

2015-06-10 Thread Eran Medan
I've hit a roadblock trying to understand why Spark doesn't work for a colleague of mine on his Windows 7 laptop. I have pretty much the same setup and everything works fine. I googled the error message and didn't find anything that resolved it. Here is the exception message (after running spark

Understanding Spark's caching

2015-04-27 Thread Eran Medan
Hi Everyone! I'm trying to understand how Spark's cache works. Here is my naive understanding, please let me know if I'm missing something: val rdd1 = sc.textFile("some data") rdd1.cache() //marks rdd1 as cached val rdd2 = rdd1.filter(...) val rdd3 = rdd1.map(...) rdd2.saveAsTextFile(...)

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-29 Thread Eran Medan
. Speed is an important issue but by no means everything in the real world, and these are rarely mutually exclusive options in the OSS world. This is a great piece of work, but I don't think it's some kind of argument against distributed computing. On Fri, Mar 27, 2015 at 6:32 PM, Eran Medan

Re: [spark-sql] What is the right way to represent an “Any” type in Spark SQL?

2015-03-29 Thread Eran Medan
, 2015 at 2:31 PM, Eran Medan ehrann.meh...@gmail.com wrote: Hi everyone, I had a lot of questions today, sorry if I'm spamming the list, but I thought it's better than posting all questions in one thread. Let me know if I should throttle my posts ;) Here is my question: When I try

Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-27 Thread Eran Medan
Remember that article that went viral on HN, where a guy showed how GraphX / Giraph / GraphLab / Spark have worse performance on a 128-node cluster than on a single-threaded machine? If not, here is the article: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html Well as you may

[spark-sql] What is the right way to represent an “Any” type in Spark SQL?

2015-03-27 Thread Eran Medan
Hi everyone, I had a lot of questions today, sorry if I'm spamming the list, but I thought it's better than posting all questions in one thread. Let me know if I should throttle my posts ;) Here is my question: When I try to have a case class that has Any in it (e.g. I have a property map and