Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
Hi all, I am writing this email to both user-group and dev-group since this is applicable to both. I am now working on Spark XML datasource (https://github.com/databricks/spark-xml). This uses an InputFormat implementation which I downgraded to Hadoop 1.x for version compatibility. However, I

Re: A proposal for Spark 2.0

2015-12-09 Thread kostas papageorgopoylos
Hi Kostas, with regard to your *second* point: I believe that requiring user apps to explicitly declare their dependencies is the clearest API approach when it comes to classpath and classloading. However, what about the following API: *SparkContext.addJar(String pathToJar)*. *Is this

Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Fengdong Yu
I don’t think there is a performance difference between the 1.x API and the 2.x API, but it’s not a big issue for your change, only com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java

Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
Thank you for your reply! I have already made the change locally, so changing it would be fine. I just wanted to be sure which way is correct. On 9 Dec 2015 18:20, "Fengdong Yu" wrote: > I don’t think there is performance difference between 1.x API and 2.x API. >

SQL language vs DataFrame API

2015-12-09 Thread Cristian O
Hi, I was wondering what the "official" view is on feature parity between the SQL and DF APIs. Docs are pretty sparse on the SQL front, and it seems that some features are, at various times, supported in only one of the Spark SQL dialect, the HiveQL dialect, and the DF API. DF.cube(), DISTRIBUTE BY, CACHE LAZY

Re: [build system] jenkins downtime, thursday 12/10/15 7am PDT

2015-12-09 Thread shane knapp
reminder! this is happening tomorrow morning. On Wed, Dec 2, 2015 at 7:20 PM, shane knapp wrote: > there's Yet Another Jenkins Security Advisory[tm], and a big release > to patch it all coming out next wednesday. > > to that end i will be performing a jenkins update, as

Re: SQL language vs DataFrame API

2015-12-09 Thread Michael Armbrust
I don't plan to abandon HiveQL compatibility, but I'd like to see us move towards something with more SQL compliance (perhaps just newer versions of the HiveQL parser). Exactly which parser will do that for us is under investigation. On Wed, Dec 9, 2015 at 11:02 AM, Xiao Li

Re: Fastest way to build Spark from scratch

2015-12-09 Thread Josh Rosen
Yeah, this is the same idea behind having Travis cache the ivy2 folder to speed up builds. In Amplab Jenkins each individual build workspace has its own individual Ivy cache which is preserved across build runs but which is only used by one active run at a time in order to avoid SBT ivy lock

Re: SQL language vs DataFrame API

2015-12-09 Thread Xiao Li
Hi Michael, does that mean SQLContext will be built on HiveQL in the near future? Thanks, Xiao Li 2015-12-09 10:36 GMT-08:00 Michael Armbrust : > I think that it is generally good to have parity when the functionality is > useful. However, in some cases various

DStream not initialized SparkException

2015-12-09 Thread Renyi Xiong
Hi, I hit the following exception when the driver program tried to recover from a checkpoint. It looks like the logic relies on zeroTime being set, which doesn't seem to happen here. Am I missing anything, or is it a bug in 1.4.1? org.apache.spark.SparkException:

RE: Specifying Scala types when calling methods from SparkR

2015-12-09 Thread Sun, Rui
Hi, just use "objectFile" instead of "objectFile[PipelineModel]" for callJMethod. You can take objectFile() in context.R as an example. Since the SparkContext created in SparkR is actually a JavaSparkContext, there is no need to pass the implicit ClassTag. -Original Message- From:

Re: SQL language vs DataFrame API

2015-12-09 Thread Stephen Boesch
Is this a candidate for the version 1.X/2.0 split? 2015-12-09 16:29 GMT-08:00 Michael Armbrust : > Yeah, I would like to address any actual gaps in functionality that are > present. > > On Wed, Dec 9, 2015 at 4:24 PM, Cristian Opris > wrote:

Re: DStream not initialized SparkException

2015-12-09 Thread Renyi Xiong
Never mind, one of my peers corrected the driver program for me: all DStream operations need to be within the scope of the getOrCreate API. On Wed, Dec 9, 2015 at 3:32 PM, Renyi Xiong wrote: > following scala program throws same exception, I know people are running > streaming

Re: Specifying Scala types when calling methods from SparkR

2015-12-09 Thread Shivaram Venkataraman
The SparkR callJMethod can only invoke methods as they show up in the Java byte code. So in this case you'll need to check the SparkContext byte code (with javap or something like that) to see how that method looks. My guess is the type is passed in as a class tag argument, so you'll need to do

Re: SQL language vs DataFrame API

2015-12-09 Thread Michael Armbrust
Yeah, I would like to address any actual gaps in functionality that are present. On Wed, Dec 9, 2015 at 4:24 PM, Cristian Opris wrote: > The reason I'm asking is because it's important in larger projects to be > able to stick to a particular programming style. Some

Re: DStream not initialized SparkException

2015-12-09 Thread Renyi Xiong
The following Scala program throws the same exception. I know people are running streaming jobs against Kafka, so I must be missing something. Any idea why? package org.apache.spark.streaming.api.csharp import java.util.HashMap import kafka.serializer.{DefaultDecoder, Decoder, StringDecoder} import

Re: SQL language vs DataFrame API

2015-12-09 Thread Xiao Li
That sounds great! When it is decided, please let us know and we can add more features and make it ANSI SQL compliant. Thank you! Xiao Li 2015-12-09 11:31 GMT-08:00 Michael Armbrust : > I don't plan to abandon HiveQL compatibility, but I'd like to see us move > towards

Re: [build system] jenkins downtime, thursday 12/10/15 7am PDT

2015-12-09 Thread shane knapp
here's the security advisory for the update: https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2015-12-09 On Wed, Dec 9, 2015 at 9:55 AM, shane knapp wrote: > reminder! this is happening tomorrow morning. > > On Wed, Dec 2, 2015 at 7:20 PM, shane knapp