SparkSubmit with Ivy jars is very slow to load with no internet access

2015-06-18 Thread Nathan McCarthy
Hey, Spark Submit adds Maven Central and the Spark Bintray repository to the ChainResolver before it adds any external resolvers. https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L821 When running on a cluster without internet access, this means the

Spark SQL DATE_ADD function - Spark 1.3.1 1.4.0

2015-06-17 Thread Nathan McCarthy
Hi guys, Running with a Parquet-backed table in Hive, 'dim_promo_date_curr_p', which has the following data: scala> sqlContext.sql("select * from pz.dim_promo_date_curr_p").show(3) 15/06/18 00:53:21 INFO ParseDriver: Parsing command: select * from pz.dim_promo_date_curr_p 15/06/18 00:53:21 INFO

Re: Spark 1.4 DataFrame Parquet file writing - missing random rows/partitions

2015-06-17 Thread Nathan McCarthy
filed https://issues.apache.org/jira/browse/SPARK-8406 to track this. Will deliver a fix ASAP and this will be included in 1.4.1. Best, Cheng On 6/16/15 12:30 AM, Nathan McCarthy wrote: Hi all, Looks like data frame parquet writing is very broken in Spark 1.4.0. We had no problems with Spark

Spark 1.4 DataFrame Parquet file writing - missing random rows/partitions

2015-06-16 Thread Nathan McCarthy
Hi all, Looks like data frame parquet writing is very broken in Spark 1.4.0. We had no problems with Spark 1.3. When trying to save a data frame with 569610608 rows: dfc.write.format("parquet").save("/data/map_parquet_file") We get random results between runs. Caching the data frame in memory

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-16 Thread Nathan McCarthy
and full command you ran Spark with? On Wed, Apr 15, 2015 at 11:27 AM, Nathan McCarthy <nathan.mccar...@quantium.com.au> wrote: Just an update, tried with the old JdbcRDD and that worked fine. From: Nathan <nathan.mccar...@quantium.com.au>

RE: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-15 Thread Nathan McCarthy
Tried with the 1.3.0 release (built myself) and the most recent 1.3.1 snapshot off the 1.3 branch. Haven't tried with 1.4/master. From: Wang, Daoyuan [daoyuan.w...@intel.com] Sent: Wednesday, April 15, 2015 5:22 PM To: Nathan McCarthy; user@spark.apache.org Subject: RE

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-15 Thread Nathan McCarthy
: Wednesday, April 15, 2015 5:22 PM To: Nathan McCarthy; user@spark.apache.org Subject: RE: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0 Can you provide your spark version? Thanks, Daoyuan From: Nathan McCarthy [mailto:nathan.mccar...@quantium.com.au] Sent

SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-14 Thread Nathan McCarthy
Hi guys, Trying to use a Spark SQL context’s .load("jdbc", …) method to create a DF from a JDBC data source. All seems to work well locally (master = local[*]), however as soon as we try to run on YARN we have problems. We seem to be running into problems with the class path and loading up the

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-14 Thread Nathan McCarthy
Just an update, tried with the old JdbcRDD and that worked fine. From: Nathan <nathan.mccar...@quantium.com.au> Date: Wednesday, 15 April 2015 1:57 pm To: user@spark.apache.org <user@spark.apache.org>

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-15 Thread Nathan McCarthy
@spark.apache.org <user@spark.apache.org> Subject: Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats? On 1/11/15 1:40 PM, Nathan McCarthy wrote: Thanks Cheng and Michael! Makes sense. Appreciate the tips! Idiomatic Scala isn't performant. I’ll

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-10 Thread Nathan McCarthy
are inlined below. Cheng On 1/7/15 11:53 AM, Nathan McCarthy wrote: Hi, I’m trying to use a combination of SparkSQL and 'normal' Spark/Scala via rdd.mapPartitions(…). Using the latest release 1.2.0. Simple example: load up some sample data from parquet on HDFS (about 380m rows, 10 columns) on a 7 node

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-08 Thread Nathan McCarthy
performance on MapPartitions on SchemaRDDs? Is there some unwrapping going on in the background that catalyst does in a smart way that I’m missing? Cheers, ~N Nathan McCarthy QUANTIUM Level 25, 8 Chifley, 8-12 Chifley Square Sydney NSW 2000 T: +61 2 8224 8922 F: +61 2 9292 6444 W

SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-06 Thread Nathan McCarthy
on SchemaRDDs? Is there some unwrapping going on in the background that catalyst does in a smart way that I’m missing? Cheers, ~N Nathan McCarthy QUANTIUM Level 25, 8 Chifley, 8-12 Chifley Square Sydney NSW 2000 T: +61 2 8224 8922 F: +61 2 9292 6444 W: www.quantium.com.au