sqlContext.cacheTable("tableName") vs dataFrame.cache()

2016-01-15 Thread George Sigletos
According to the documentation they are exactly the same, but in my queries dataFrame.cache() results in much faster execution times than doing sqlContext.cacheTable("tableName"). Is there any explanation for this? I am not caching the RDD prior to creating the dataframe. Using Pyspark on Spark
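For reference, a minimal PySpark sketch of the two caching paths being compared (a sketch assuming Spark 1.x with an existing sqlContext and DataFrame df; it is not run here, since it needs a live SparkContext):

```python
# Path 1: cache via the catalog. The DataFrame must be registered
# as a (temp) table first so cacheTable can find it by name.
df.registerTempTable("tableName")
sqlContext.cacheTable("tableName")

# Path 2, documented as equivalent: cache the DataFrame directly.
df.cache()

# In both cases the cache is generally materialized lazily: the data
# is only pulled into memory when an action first scans it, e.g.:
df.count()
```

If the two really differ in observed query times, one thing worth checking is when each cache actually gets materialized (i.e., which action triggered it), since timing a query that also pays the materialization cost will look much slower.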

How to use all available memory per worker?

2015-12-07 Thread George Sigletos
Hello, In a 2-worker cluster (6 cores/30 GB RAM and 24 cores/60 GB RAM), how can I tell my executors to use all 90 GB of available memory? In the configuration you can set e.g. "spark.cores.max" to 30 (24+6), but you cannot set "spark.executor.memory" to 90g (30+60). Kind regards, George
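For what it's worth, spark.executor.memory is a per-executor setting, not a cluster-wide total, and in Spark 1.x standalone mode an application gets at most one executor per worker. A hedged sketch of the resulting constraint, with purely illustrative numbers:

```shell
# Illustrative only: spark.executor.memory is PER executor, so a uniform
# setting is capped by what the smallest worker can offer (minus OS and
# daemon overhead), not by the cluster total.
spark-submit \
  --conf spark.cores.max=30 \
  --conf spark.executor.memory=25g \
  my_app.py

# With 25g per executor and one executor per worker, the app uses about
# 50 GB, not 90. A common workaround is to run extra worker instances on
# the larger machine so it hosts more executors (set in spark-env.sh):
# export SPARK_WORKER_INSTANCES=2
# export SPARK_WORKER_MEMORY=28g
```

In other words, fully using a heterogeneous cluster usually means shaping the workers (instances and SPARK_WORKER_MEMORY) rather than inflating spark.executor.memory.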

Migrate a cassandra table from one cluster to another

2015-12-01 Thread George Sigletos
Hello, Does anybody know how to copy a cassandra table (or an entire keyspace) from one cluster to another using Spark? I haven't found anything very specific about this so far. Thank you, George
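One commonly suggested route is the spark-cassandra-connector: read the table from the source cluster into a DataFrame and write it to the target cluster. A hedged sketch only (the host names, keyspace, and table are placeholders, and whether per-operation "spark.cassandra.connection.host" overrides are supported depends on the connector version, so treat this as an outline rather than a recipe):

```python
# Read the table from the source cluster (host/keyspace/table are
# hypothetical placeholders).
df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="source_ks", table="source_table")
      .option("spark.cassandra.connection.host", "cluster-a-host")
      .load())

# Write the same rows into the target cluster.
(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(keyspace="target_ks", table="target_table")
   .option("spark.cassandra.connection.host", "cluster-b-host")
   .mode("append")
   .save())
```

The target table (and keyspace) must already exist with a compatible schema; the connector writes rows, it does not create schema.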

sc.textFile() does not count lines properly?

2015-11-25 Thread George Sigletos
Hello, I have a text file consisting of 483150 lines (wc -l "my_file.txt"). However, when I read it using textFile: %pyspark rdd = sc.textFile("my_file.txt") print rdd.count() it returns 554420 lines. Any idea why this is happening? Is it using a different newline delimiter, and how this can be

Re: sc.textFile() does not count lines properly?

2015-11-25 Thread George Sigletos
Found the problem: Control-M characters. Please ignore the post. On Wed, Nov 25, 2015 at 6:06 PM, George Sigletos <sigle...@textkernel.nl> wrote: > Hello, > > I have a text file consisting of 483150 lines (wc -l "my_file.txt"). > > However when I read it usi
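For the archive: the mismatch is consistent with embedded carriage returns, because Hadoop's text line reader (used under the hood by sc.textFile) generally treats \r, \n, and \r\n all as line terminators, while wc -l counts only \n bytes. A pure-Python illustration of the same effect (no Spark needed):

```python
# A string with a stray carriage return ("Control-M", \r) looks like
# one extra line to readers that treat \r as a terminator, but not
# to wc -l, which only counts newline (\n) bytes.
data = "alpha\rbeta\ngamma\r\ndelta\n"

wc_l_count = data.count("\n")             # what wc -l would report
universal_count = len(data.splitlines())  # splits on \r, \n, and \r\n

print(wc_l_count)        # 3
print(universal_count)   # 4
```

So every stray \r in the file inflates the textFile count by one relative to wc -l, which matches the 554420-vs-483150 discrepancy in kind, if not in exact arithmetic.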