Re: aggregateByKey on PairRDD

2016-03-30 Thread write2sivakumar@gmail
Hi, we can use combineByKey to achieve this.

val finalRDD = tempRDD.combineByKey(
  (x: (Any, Any)) => (x),
  (acc: (Any, Any), x) => (acc, x),
  (acc1: (Any, Any), acc2: (Any, Any)) => (acc1, acc2))

finalRDD.collect.foreach(println)

(amazon,((book1, tech),(book2,tech)))
(barns, (book,tech))
(eBay,
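A typed variant of the combineByKey call above, sketched for clarity: instead of nesting tuples it accumulates each brand's (product, key) pairs into a List. The sample rows and the SparkContext sc (as in spark-shell) are assumptions for illustration, not part of the original message.

import org.apache.spark.rdd.RDD

// Sample (brand, (product, key)) pairs; assumes an existing SparkContext sc.
val tempRDD: RDD[(String, (String, String))] = sc.parallelize(Seq(
  ("amazon", ("book1", "tech")),
  ("amazon", ("book2", "tech")),
  ("barns",  ("book",  "tech"))
))

val finalRDD: RDD[(String, List[(String, String)])] = tempRDD.combineByKey(
  (v: (String, String)) => List(v),                                   // createCombiner: start a list for a new key
  (acc: List[(String, String)], v: (String, String)) => v :: acc,     // mergeValue: add a value within a partition
  (a: List[(String, String)], b: List[(String, String)]) => a ::: b   // mergeCombiners: merge lists across partitions
)

finalRDD.collect.foreach(println)
// e.g. (amazon,List((book2,tech), (book1,tech)))
//      (barns,List((book,tech)))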

Re: Any documentation on Spark's security model beyond YARN?

2016-03-30 Thread Sean Busbey
On Wed, Mar 30, 2016 at 4:33 AM, Steve Loughran wrote: > >> On 29 Mar 2016, at 22:19, Michael Segel wrote: >> >> Hi, >> >> So yeah, I know that Spark jobs running on a Hadoop cluster will inherit its >> security from the underlying YARN job. >>

Re: Spark SQL UDF Returning Rows

2016-03-30 Thread Michael Armbrust
Some answers and more questions inline - UDFs can pretty much only take in Primitives, Seqs, Maps and Row objects > as parameters. I cannot take in a case class object in place of the > corresponding Row object, even if the schema matches because the Row object > will always be passed in at

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Michael Armbrust
+1 to Matei's reasoning. On Wed, Mar 30, 2016 at 9:21 AM, Matei Zaharia wrote: > I agree that putting it in 2.0 doesn't mean keeping Scala 2.10 for the > entire 2.x line. My vote is to keep Scala 2.10 in Spark 2.0, because it's > the default version we built with in

Re: Understanding PySpark Internals

2016-03-30 Thread Josh Rosen
One clarification: there *are* Python interpreters running on executors so that Python UDFs and RDD API code can be executed. Some slightly-outdated but mostly-correct reference material for this can be found at https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. See also: search

Re: Spark SQL UDF Returning Rows

2016-03-30 Thread Hamel Kothari
Just to clarify, this is possible via UDF1/2/3 etc. and registering those with the desired return schema. It just felt wrong that the only way to do this in Scala was to use these classes, which are in the Java package. Maybe the relevant question is: why are these in a Java package? On Wed, Mar
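A minimal sketch of the UDF1 route described above: a Java-style UDF registered with an explicit return schema. The nested column names, the extracted field, and the table name books are illustrative assumptions, not details from the original thread.

import org.apache.spark.sql.Row
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.types._

// Hypothetical nested input: a struct column "book" with fields "title" and "category".
// The UDF pulls out a sub-Row containing only the title.
val extractTitle = new UDF1[Row, Row] {
  override def call(book: Row): Row = Row(book.getAs[String]("title"))
}

// Registering with the desired return schema tells Catalyst the output type.
sqlContext.udf.register(
  "extract_title",
  extractTitle,
  StructType(Seq(StructField("title", StringType)))
)

// Usage, assuming a registered table "books" with a struct column "book":
// sqlContext.sql("SELECT extract_title(book) AS t FROM books")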

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Matei Zaharia
I agree that putting it in 2.0 doesn't mean keeping Scala 2.10 for the entire 2.x line. My vote is to keep Scala 2.10 in Spark 2.0, because it's the default version we built with in 1.x. We want to make the transition from 1.x to 2.0 as easy as possible. In 2.0, we'll have the default downloads

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Koert Kuipers
oh wow, had no idea it got ripped out On Wed, Mar 30, 2016 at 11:50 AM, Mark Hamstra wrote: > No, with 2.0 Spark really doesn't use Akka: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L744 > > On Wed, Mar 30, 2016 at

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Mark Hamstra
My concern is that for some of those stuck using 2.10 because of some library dependency, three months isn't sufficient time to refactor their infrastructure to be compatible with Spark 2.0.0 if that requires Scala 2.11. The additional 3-6 months would make it much more feasible for those users

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Mark Hamstra
No, with 2.0 Spark really doesn't use Akka: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L744 On Wed, Mar 30, 2016 at 9:10 AM, Koert Kuipers wrote: > Spark still runs on akka. So if you want the benefits of the latest akka >

Spark SQL UDF Returning Rows

2016-03-30 Thread Hamel Kothari
Hi all, I've been trying for the last couple of days to define a UDF which takes in a deeply nested Row object and performs some extraction to pull out a portion of the Row and return it. This Row object is nested not just with StructTypes but with a bunch of ArrayTypes and MapTypes. From this

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Koert Kuipers
Spark still runs on Akka. So if you want the benefits of the latest Akka (not saying we do, it was just an example) then you need to drop Scala 2.10. On Mar 30, 2016 10:44 AM, "Cody Koeninger" wrote: > I agree with Mark in that I don't see how supporting scala 2.10 for > spark

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Sean Owen
Yeah, it is not crazy to drop support for something foundational like this in a feature release, but it is something ideally coupled to a major release. You could at least say it is probably a decision to keep supporting it through the end of the year, given how releases are likely to go. Given the

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Cody Koeninger
I agree with Mark in that I don't see how supporting Scala 2.10 for Spark 2.0 implies supporting it for all of Spark 2.x. Regarding Koert's comment on Akka, I thought all Akka dependencies had been removed from Spark after SPARK-7997 and the recent removal of external/akka. On Wed, Mar 30, 2016

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Mark Hamstra
Dropping Scala 2.10 support has to happen at some point, so I'm not fundamentally opposed to the idea; but I've got questions about how we go about making the change and what degree of negative consequences we are willing to accept. Until now, we have been saying that 2.10 support will be

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Koert Kuipers
About that pro, I think it's more the opposite: many libraries have stopped maintaining Scala 2.10 versions. Bugs will no longer be fixed for Scala 2.10 and new libraries will not be available for Scala 2.10 at all, making them unusable in Spark. Take for example Akka, a distributed messaging

RE: [discuss] ending support for Java 7 in Spark 2.0

2016-03-30 Thread Raymond Honderdors
Maybe the question should be: how far back should Spark be compatible? There is nothing stopping people from running Spark 1.6.x with JDK 7, Scala 2.10, or Hadoop <2.6. But if they want Spark 2.x they should consider a migration to JDK 8 and Scala 2.11. Or am I getting it all wrong? Raymond

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-30 Thread Tom Graves
Steve, those are good points, I had forgotten Hadoop had those issues. We run with JDK 8, Hadoop is built for JDK 7 compatibility, we are running Hadoop 2.7 on our clusters, and by the time Spark 2.0 is out I would expect a mix of Hadoop 2.7 and 2.8. We also don't use SPNEGO. I didn't quite

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-30 Thread Steve Loughran
Can I note that if Spark 2.0 is going to be Java 8+ only, then that means Hadoop 2.6.x should be the minimum Hadoop version (https://issues.apache.org/jira/browse/HADOOP-11090). Where things get complicated is the situation of Hadoop services on Java 7 and Spark on Java 8 in its own JVM. I'm

Re: Any documentation on Spark's security model beyond YARN?

2016-03-30 Thread Steve Loughran
> On 29 Mar 2016, at 22:19, Michael Segel wrote: > > Hi, > > So yeah, I know that Spark jobs running on a Hadoop cluster will inherit its > security from the underlying YARN job. > However… that’s not really saying much when you think about some use cases. > >

Re: Null pointer exception when using com.databricks.spark.csv

2016-03-30 Thread Steve Loughran
On 30 Mar 2016, at 04:44, Selvam Raman wrote: Hi, I am using Spark 1.6.0 prebuilt for Hadoop 2.6.0 on my Windows machine. I was trying to use the Databricks CSV format to read a CSV file. I used the below command and got a null pointer exception. Any

Re: Master options Cluster/Client descrepencies.

2016-03-30 Thread Akhil Das
Have a look at http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 Thanks Best Regards On Wed, Mar 30, 2016 at 12:09 AM, satyajit vegesna < satyajit.apas...@gmail.com> wrote: > > Hi All, > > I have written a spark program on my dev box , >IDE:Intellij >

Re: aggregateByKey on PairRDD

2016-03-30 Thread Akhil Das
Isn't that what tempRDD.groupByKey does? Thanks Best Regards On Wed, Mar 30, 2016 at 7:36 AM, Suniti Singh wrote: > Hi All, > > I have an RDD having the data in the following form : > > tempRDD: RDD[(String, (String, String))] > > (brand , (product, key)) > >
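A minimal sketch contrasting the two operations mentioned in this thread, assuming the (brand, (product, key)) pair RDD from the question and a SparkContext sc as in spark-shell; the sample rows are illustrative only.

import org.apache.spark.rdd.RDD

val tempRDD: RDD[(String, (String, String))] = sc.parallelize(Seq(
  ("amazon", ("book1", "tech")),
  ("amazon", ("book2", "tech")),
  ("barns",  ("book",  "tech"))
))

// groupByKey: collects every (product, key) pair per brand in a single shuffle.
val grouped: RDD[(String, Iterable[(String, String)])] = tempRDD.groupByKey()

// aggregateByKey: builds the same grouping incrementally, combining values on
// each partition before the shuffle.
val aggregated: RDD[(String, List[(String, String)])] =
  tempRDD.aggregateByKey(List.empty[(String, String)])(
    (acc, v) => v :: acc,          // fold a value into the per-partition accumulator
    (acc1, acc2) => acc1 ::: acc2  // merge accumulators across partitions
  )

aggregated.collect.foreach(println)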

Re: Null pointer exception when using com.databricks.spark.csv

2016-03-30 Thread Akhil Das
Looks like winutils.exe is missing from the environment; see https://issues.apache.org/jira/browse/SPARK-2356 Thanks Best Regards On Wed, Mar 30, 2016 at 10:44 AM, Selvam Raman wrote: > Hi, > > i am using spark 1.6.0 prebuilt hadoop 2.6.0 version in my windows machine. >
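A hedged sketch of the Windows workaround implied by that JIRA: spark-csv itself is not at fault, but on Windows the Hadoop client code looks for winutils.exe under hadoop.home.dir. The paths below are illustrative assumptions; point them at a real winutils.exe location and CSV file.

// Set before the SparkContext/Hadoop code first touches the local filesystem.
// C:\hadoop is a hypothetical directory whose bin\ folder contains winutils.exe.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first line is a header row
  .option("inferSchema", "true")  // infer column types from the data
  .load("C:\\data\\sample.csv")   // hypothetical input file

df.show()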