Re: SparkSQL 1.3.0 (RC3) failed to read parquet file generated by 1.1.1

2015-03-12 Thread giive chen
Hi all, My team has the same issue. It looks like Spark 1.3's Spark SQL cannot read Parquet files generated by Spark 1.1. This will cost us a lot of migration work when we want to upgrade to Spark 1.3. Can anyone help? Thanks, Wisely Chen. On Tue, Mar 10, 2015 at 5:06 PM, Pei-Lun Lee ...
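
For reference, a minimal sketch of the kind of read involved, assuming an existing SparkContext named sc and a hypothetical file path; the file would have been written by Spark 1.1's saveAsParquetFile:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc: existing SparkContext
    // Hypothetical path; in Spark 1.3 this returns a DataFrame.
    val df = sqlContext.parquetFile("/data/events.parquet")
    df.printSchema()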

Profiling Spark: MemoryStore

2015-03-12 Thread Ulanov, Alexander
Hi, I am working on artificial neural networks for Spark. The model is trained with gradient descent: at each step the data is read, the sum of gradients is computed for each data partition (on each worker), aggregated (on the driver), and broadcast back. I noticed that the gradient computation time is ...
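
A minimal sketch of one step of the loop described above, assuming a simple least-squares gradient over (label, features) pairs; the names and the gradient itself are hypothetical stand-ins, not the actual ANN code:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // One gradient-descent step: broadcast weights, sum gradients per
    // partition on the workers, aggregate on the driver, update.
    def step(sc: SparkContext,
             data: RDD[(Double, Array[Double])],
             weights: Array[Double],
             stepSize: Double): Array[Double] = {
      val bcW = sc.broadcast(weights)               // ship weights to workers
      val dim = weights.length
      val gradSum = data.mapPartitions { iter =>    // per-partition gradient sum
        val acc = new Array[Double](dim)
        iter.foreach { case (y, x) =>
          val err = (0 until dim).map(i => bcW.value(i) * x(i)).sum - y
          var i = 0
          while (i < dim) { acc(i) += err * x(i); i += 1 }
        }
        Iterator(acc)
      }.reduce { (a, b) =>                          // aggregated on the driver
        var i = 0
        while (i < dim) { a(i) += b(i); i += 1 }
        a
      }
      weights.zip(gradSum).map { case (w, g) => w - stepSize * g }
    }

Note that MLlib's own GradientDescent combines partition results with treeAggregate rather than a flat reduce, which lessens the aggregation load on the driver.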

Spilling when not expected

2015-03-12 Thread Tom Hubregtsen
Hi all, I'm running the teraSort benchmark with a relatively small input set: 5GB. During profiling, I can see I am using a total of 68GB. I have a terabyte of memory in my system, and set spark.executor.memory to 900g and spark.driver.memory to 900g. I use the default for spark.shuffle.memoryFraction ...
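
For reference, a sketch of those settings expressed through SparkConf; the values are the ones from the question, not a recommendation. Note that spark.driver.memory only takes effect if set before the driver JVM starts (e.g. via spark-submit or spark-defaults.conf):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("TeraSort")
      .set("spark.executor.memory", "900g")
      .set("spark.driver.memory", "900g")   // only effective pre-JVM-launch
      // Spark 1.x default is 0.2, i.e. roughly 20% of the executor heap is
      // available to shuffle before tasks start spilling to disk.
      .set("spark.shuffle.memoryFraction", "0.2")

With the default memoryFraction, only a fraction of even a very large heap is usable by the shuffle, which is one way spilling can occur despite abundant memory.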

Re: Is this a bug in MLlib.stat.test ? About the mapPartitions API used in Chi-Squared test

2015-03-12 Thread Joseph Bradley
The checks against maxCategories are not for statistical purposes; they are to make sure communication does not blow up. There are currently no checks to ensure there are enough entries for statistically significant results; that is up to the user. I do like the idea of adding a ...
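
For context, a minimal sketch of the MLlib API under discussion, with toy counts chosen only to show that a small sample runs without any significance check:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    // Tiny observed counts: chiSqTest runs without complaint, but with so
    // few entries the result is not statistically meaningful; as noted
    // above, that judgment is left to the user.
    val observed = Vectors.dense(4.0, 6.0, 5.0)
    val result = Statistics.chiSqTest(observed)
    println(result)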