According to the documentation they are exactly the same, but in my queries
dataFrame.cache()
results in much faster execution times than
sqlContext.cacheTable("tableName")
Is there any explanation for this? I am not caching the RDD prior to
creating the DataFrame. I am using PySpark on Spark.
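For reference, a minimal sketch of the two caching paths being compared, assuming the Spark 1.x PySpark API; the input file and table name below are made up:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="cache-comparison")
sqlContext = SQLContext(sc)

df = sqlContext.read.json("people.json")  # hypothetical input

# Path 1: cache the DataFrame handle directly
df.cache()
df.count()  # the first action materializes the cache

# Path 2: register a temp table and cache it by name
df.registerTempTable("tableName")
sqlContext.cacheTable("tableName")
sqlContext.sql("SELECT COUNT(*) FROM tableName").collect()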
Hello,
In a 2-worker cluster (one worker with 6 cores/30 GB RAM, the other with
24 cores/60 GB RAM), how can I tell my executors to use all 90 GB of
available memory?
In the configuration you can set e.g. "spark.cores.max" to 30 (24+6),
but you cannot set "spark.executor.memory" to 90g (30+60), since that
setting is requested per executor rather than as a cluster-wide total.
Kind regards,
George
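For context, this is roughly the configuration being described, as a PySpark sketch; the values mirror the post, except the 25g figure, which is only an illustrative per-worker amount:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("memory-config")
        .set("spark.cores.max", "30")          # total cores across the cluster
        .set("spark.executor.memory", "25g"))  # requested per executor, not a cluster total
sc = SparkContext(conf=conf)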
Hello,
Does anybody know how to copy a Cassandra table (or an entire keyspace)
from one cluster to another using Spark? I haven't found anything very
specific about this so far.
Thank you,
George
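For what it's worth, a hedged sketch of one approach, assuming the DataStax spark-cassandra-connector and its DataFrame "cluster" option for addressing two clusters; host names, keyspace, and table are hypothetical, and the option names may differ across connector versions:

# Point two named cluster configurations at the two Cassandra clusters
sqlContext.setConf("SourceCluster/spark.cassandra.connection.host", "source-host")
sqlContext.setConf("DestCluster/spark.cassandra.connection.host", "dest-host")

# Read the table from the source cluster ...
df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table", cluster="SourceCluster")
      .load())

# ... and write it out to the (pre-created) table on the destination cluster
(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(keyspace="my_keyspace", table="my_table", cluster="DestCluster")
   .save())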
Hello,
I have a text file consisting of 483150 lines (wc -l "my_file.txt").
However, when I read it using textFile:
%pyspark
rdd = sc.textFile("my_file.txt")
print rdd.count()
it returns 554420 lines. Any idea why this is happening? Is it using a
different newline delimiter, and how can this be fixed?
Found the problem: Control-M (carriage return) characters. Please ignore the post.
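For the record, a quick Python 2 sketch of why the counts diverge, assuming stray carriage returns: Hadoop's text input (which textFile uses) splits records on \r, \n, or \r\n, while wc -l counts only \n:

# Count the two kinds of line terminators in the raw bytes
with open("my_file.txt", "rb") as f:
    data = f.read()

print data.count("\n")  # newlines: what `wc -l` reports
print data.count("\r")  # Control-M characters: the source of the extra records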