Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Reynold Xin
join and joinWith are just two different join semantics; the distinction is not about Dataset vs DataFrame. join is the relational join, where fields are flattened; joinWith is more like a tuple join, where the output has two nested fields. So you can do Dataset[A] joinWith Dataset[B] =
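For readers following along, here is a rough Scala sketch of the two shapes being described; the case classes, sample data, and SparkSession setup are illustrative only and assume the Spark 2.x API:

    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

    case class A(id: Long, a: String)
    case class B(id: Long, b: String)

    val spark = SparkSession.builder().master("local[*]").appName("join-vs-joinWith").getOrCreate()
    import spark.implicits._

    val dsA: Dataset[A] = Seq(A(1L, "x")).toDS()
    val dsB: Dataset[B] = Seq(B(1L, "y")).toDS()

    // Relational join: the columns of both sides are flattened into one row.
    val flat: DataFrame = dsA.join(dsB, "id")                         // columns: id, a, b

    // Tuple-style join: each output row nests the two typed values.
    val nested: Dataset[(A, B)] = dsA.joinWith(dsB, dsA("id") === dsB("id"))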

RE: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Sun, Rui
Vote for option 2. Source compatibility and binary compatibility are very important from the user's perspective. It's unfair to Java developers that they don't have the DataFrame abstraction. As you said, sometimes it is more natural to think about DataFrame. I am wondering if conceptually there is

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Reynold Xin
Yes - and that's why source compatibility is broken. Note that it is not just a "convenience" thing. Conceptually DataFrame is a Dataset[Row], and for some developers it is more natural to think about "DataFrame" rather than "Dataset[Row]". If we were in C++, DataFrame would've been a type alias
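Concretely, the alias being discussed is Scala-only: Spark 2.0 declares roughly type DataFrame = Dataset[Row] in the org.apache.spark.sql package object, and Java has no way to express such an alias. A minimal sketch (the SparkSession setup and toy data are just for illustration):

    import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Because DataFrame is only a type alias for Dataset[Row], these two
    // declarations are identical to the Scala compiler; Java callers, having
    // no type aliases, can only write Dataset<Row>.
    val df1: DataFrame = spark.range(3).toDF("id")
    val df2: Dataset[Row] = df1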

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Koert Kuipers
Since a type alias is purely a convenience for the Scala compiler, does option 1 mean that the concept of DataFrame ceases to exist from a Java perspective, and Java users will have to refer to Dataset? On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin wrote: > When we first

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Reynold Xin
It might make sense, but this option seems to carry all the cons of Option 2, and yet doesn't provide compatibility for Java? On Thu, Feb 25, 2016 at 3:31 PM, Michael Malak wrote: > Would it make sense (in terms of feasibility, code organization, and > politically) to

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Michael Malak
Would it make sense (in terms of feasibility, code organization, and politics) to have a JavaDataFrame, as a way to isolate the 1000+ extra lines to a Java compatibility layer/class?

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Chester Chen
Vote for Option 1. 1) Since 2.0 is a major API release, we are expecting some API changes. 2) It helps long-term code base maintenance, with short-term pain on the Java side. 3) Not quite sure how large the code base using the Java DataFrame APIs is. On Thu, Feb 25, 2016 at 3:23 PM, Reynold Xin

[discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Reynold Xin
When we first introduced Dataset in 1.6 as an experimental API, we wanted to merge Dataset/DataFrame but couldn't, because we didn't want to break the pre-existing DataFrame API (e.g. the map function should return a Dataset rather than an RDD). In Spark 2.0, one of the main API changes is to merge
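A hedged illustration of the signature change being described (the getLong(0) access and SparkSession setup are only for illustration): in the 1.x DataFrame API, map dropped out of the SQL layer into an RDD, whereas once DataFrame becomes Dataset[Row], map stays inside the Dataset world.

    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df: DataFrame = spark.range(3).toDF("id")

    // Spark 1.x DataFrame API: map returned an RDD, leaving the SQL layer:
    //   val out: RDD[Long] = df.map(_.getLong(0))
    // Spark 2.0, with DataFrame merged into Dataset[Row]: map stays a Dataset:
    val out: Dataset[Long] = df.map(_.getLong(0))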

Re: Eclipse: Wrong project dependencies in generated by "sbt eclipse"

2016-02-25 Thread Łukasz Gieroń
Thank you, your version of the mvn invocation (as opposed to my bare "mvn eclipse:eclipse") worked perfectly. On Thu, Feb 25, 2016 at 3:22 PM, Yin Yang wrote: > In yarn/.classpath, I see: > > > Here is the command I used: > > build/mvn clean -Phive -Phive-thriftserver

Re: [build system] additional jenkins downtime next thursday

2016-02-25 Thread shane knapp
alright, the update is done and worker-08 rebooted. we're back up and building already! On Thu, Feb 25, 2016 at 8:15 AM, shane knapp wrote: > this is happening now. > > On Wed, Feb 24, 2016 at 6:08 PM, shane knapp wrote: >> the security update has been

Re: [build system] additional jenkins downtime next thursday

2016-02-25 Thread shane knapp
this is happening now. On Wed, Feb 24, 2016 at 6:08 PM, shane knapp wrote: > the security update has been released, and it's a doozy! > > https://wiki.jenkins-ci.org/display/SECURITY/Security+Advisory+2016-02-24 > > i will be putting jenkins in to quiet mode ~7am PST

Re: Eclipse: Wrong project dependencies in generated by "sbt eclipse"

2016-02-25 Thread Yin Yang
In yarn/.classpath, I see: Here is the command I used: build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.0 package -DskipTests eclipse:eclipse FYI On Thu, Feb 25, 2016 at 6:13 AM, Łukasz Gieroń wrote: > I've just checked, and "mvn

Re: Eclipse: Wrong project dependencies in generated by "sbt eclipse"

2016-02-25 Thread Allen Zhang
well, I am using IDEA to import the code base. At 2016-02-25 22:13:11, "Łukasz Gieroń" wrote: I've just checked, and "mvn eclipse:eclipse" generates incorrect projects as well. On Thu, Feb 25, 2016 at 3:04 PM, Allen Zhang wrote: why not use

Re: Eclipse: Wrong project dependencies in generated by "sbt eclipse"

2016-02-25 Thread Łukasz Gieroń
I've just checked, and "mvn eclipse:eclipse" generates incorrect projects as well. On Thu, Feb 25, 2016 at 3:04 PM, Allen Zhang wrote: > why not use maven > > > > > > > At 2016-02-25 21:55:49, "lgieron" wrote: > >The Spark projects generated by sbt

Re: Eclipse: Wrong project dependencies in generated by "sbt eclipse"

2016-02-25 Thread Allen Zhang
dev/change-scala-version 2.10 may help you? At 2016-02-25 21:55:49, "lgieron" wrote: >The Spark projects generated by sbt eclipse plugin have incorrect dependent >projects (as visible on Properties -> Java Build Path -> Projects tab). All >dependent project are missing

Re: Eclipse: Wrong project dependencies in generated by "sbt eclipse"

2016-02-25 Thread Allen Zhang
why not use maven At 2016-02-25 21:55:49, "lgieron" wrote: >The Spark projects generated by sbt eclipse plugin have incorrect dependent >projects (as visible on Properties -> Java Build Path -> Projects tab). All >dependent project are missing the "_2.11" suffix (for

Eclipse: Wrong project dependencies in generated by "sbt eclipse"

2016-02-25 Thread lgieron
The Spark projects generated by the sbt eclipse plugin have incorrect dependent projects (as visible on Properties -> Java Build Path -> Projects tab). All dependent projects are missing the "_2.11" suffix (for example, it's "spark-core" instead of the correct "spark-core_2.11"). This of course causes the

Bug in DiskBlockManager subDirs logic?

2016-02-25 Thread Zee Chen
Hi, I am debugging a situation where SortShuffleWriter sometimes fails to create a file, with the following stack trace: 16/02/23 11:48:46 ERROR Executor: Exception in task 13.0 in stage 47827.0 (TID 1367089) java.io.FileNotFoundException:
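For context, the "subDirs logic" referenced in the subject is the hashing scheme DiskBlockManager uses to spread block files across the spark.local.dir subdirectories before a writer creates the file. A simplified paraphrase of that scheme (not the exact Spark source; names, caching, and error handling are condensed):

    import java.io.{File, IOException}

    // localDirs: the executor's spark.local.dir directories;
    // subDirsPerLocalDir: number of hashed subdirectories per local dir.
    def getFile(localDirs: Array[File], subDirsPerLocalDir: Int, filename: String): File = {
      val hash = filename.hashCode & Int.MaxValue                      // non-negative hash of the block name
      val dirId = hash % localDirs.length                              // pick a top-level local dir
      val subDirId = (hash / localDirs.length) % subDirsPerLocalDir    // pick a subdir within it
      val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
      // The real code caches and synchronizes subdirectory creation; if the mkdir
      // fails or the directory disappears, the subsequent file create surfaces as
      // a FileNotFoundException like the one in the stack trace above.
      if (!subDir.exists() && !subDir.mkdirs()) {
        throw new IOException("Failed to create directory " + subDir)
      }
      new File(subDir, filename)
    }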