Re: [discuss] ending support for Java 7 in Spark 2.0

2016-04-03 Thread Reynold Xin
Since my original email, I've talked to a lot more users and looked at what
various environments support. It is true that a lot of enterprises, and
even some technology companies, are still using Java 7. Notably, even today
users still can't install OpenJDK 8 on Ubuntu by default. I see that as an
indication that it is too early to drop Java 7.

Looking at the timeline, the JDK gets a major new version roughly every 3
years. We dropped Java 6 support one year ago, so from a timeline point of
view we would be very aggressive here if we were to drop Java 7 support in
Spark 2.0.

Note that not dropping Java 7 support now doesn't mean we have to support
Java 7 throughout Spark 2.x. We dropped Java 6 support in Spark 1.5, even
though Spark 1.0 started with Java 6.

In terms of testing, Josh has improved our test infra so that we now also run
the Java 8 tests: https://github.com/apache/spark/pull/12073




On Thu, Mar 24, 2016 at 8:51 PM, Liwei Lin  wrote:

> Arguments are really convincing; the new Dataset API as well as the
> performance improvements are exciting, so I'm personally +1 on moving onto
> Java 8.
>
> However, I'm afraid Tencent is one of "the organizations stuck with Java 7"
> -- our IT Infra division wouldn't upgrade to Java 7 until Java 8 is out, and
> wouldn't upgrade to Java 8 until Java 9 is out.
>
> So:
>
> (non-binding) +1 on dropping Scala 2.10 support
>
> (non-binding) -1 on dropping Java 7 support
>   * as long as we figure out a practical way to run Spark with JDK 8 on
>     JDK 7 clusters, this -1 would then definitely be +1
>
> Thanks!
>
> On Fri, Mar 25, 2016 at 10:28 AM, Koert Kuipers  wrote:
>
>> I think that logic is reasonable, but then the same should also apply to
>> Scala 2.10, which is also unmaintained/unsupported at this point (basically
>> has been since March 2015, except for one hotfix due to a license
>> incompatibility).
>>
>> Who wants to support Scala 2.10 three years after its last maintenance
>> release?
>>
>>
>> On Thu, Mar 24, 2016 at 9:59 PM, Mridul Muralidharan 
>> wrote:
>>
>>> Removing compatibility (with the JDK, etc.) can be done with a major
>>> release. Given that Java 7 was EOLed a while back and is now unsupported, we
>>> have to decide whether we drop support for it in 2.0 or in 3.0 (2+ years from
>>> now).
>>>
>>> Given the functionality and performance benefits of going to JDK 8, the
>>> future enhancements relevant in the 2.x timeframe (Scala, dependencies) that
>>> require it, and the simplicity w.r.t. code, testing & support, this looks
>>> like a good checkpoint to drop JDK 7 support.
>>>
>>> As already mentioned in the thread, existing YARN clusters are unaffected
>>> if they want to continue running JDK 7 and yet use Spark 2 (install JDK 8 on
>>> all nodes and point to it via JAVA_HOME, or in the worst case distribute
>>> JDK 8 as an archive - suboptimal).
>>> I am unsure about Mesos (standalone might be an easier upgrade, I guess?).
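
(Illustrative sketch of the JAVA_HOME approach described above, assuming the
standard Spark-on-YARN per-container environment settings; the JDK path is a
placeholder, not a real install location:)

  import org.apache.spark.SparkConf

  // Point the YARN containers at a JDK 8 install while the cluster default stays on JDK 7.
  // "/opt/jdk1.8.0" is a placeholder for wherever JDK 8 is installed on the nodes.
  val conf = new SparkConf()
    .set("spark.yarn.appMasterEnv.JAVA_HOME", "/opt/jdk1.8.0")  // ApplicationMaster container
    .set("spark.executorEnv.JAVA_HOME", "/opt/jdk1.8.0")        // executor containers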
>>>
>>>
>>> The proposal is for the 1.6.x line to continue to be supported with
>>> critical fixes; newer features will require 2.x and hence JDK 8.
>>>
>>> Regards
>>> Mridul
>>>
>>>
>>> On Thursday, March 24, 2016, Marcelo Vanzin  wrote:
>>>
 On Thu, Mar 24, 2016 at 4:50 PM, Reynold Xin  wrote:
 > If you want to go down that route, you should also ask somebody who has had
 > experience managing a large organization's applications and has tried to
 > update the Scala version.

 I understand both sides. But if you look at what I've been asking since the
 beginning, it's all about the costs and benefits of dropping support for
 Java 1.7.

 The biggest argument in your original e-mail is about testing. And the
 testing cost is much bigger for supporting Scala 2.10 than it is for
 supporting Java 1.7. If you read one of my earlier replies, it should even be
 possible to do everything in a single job - compile for Java 7 and still be
 able to test things on 1.8, including lambdas, which seems to be the main
 thing you were worried about.
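
(A minimal sketch, assuming an sbt-style build, of the "compile for Java 7 but
run tests on JDK 8" idea described above; Spark's actual build wiring differs:)

  // Emit Java 7 bytecode even though the build itself runs on a JDK 8 toolchain,
  // so Java-8-only test sources (lambdas, etc.) can still be compiled and run separately.
  javacOptions in Compile ++= Seq("-source", "1.7", "-target", "1.7")
  scalacOptions in Compile += "-target:jvm-1.7"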


 > On Thu, Mar 24, 2016 at 4:48 PM, Marcelo Vanzin  wrote:
 >>
 >> On Thu, Mar 24, 2016 at 4:46 PM, Reynold Xin  wrote:
 >> > Actually it's *way* harder to upgrade Scala from 2.10 to 2.11 than
 >> > upgrading the JVM runtime from 7 to 8, because Scala 2.10 and 2.11 are not
 >> > binary compatible, whereas JVM 7 and 8 are binary compatible except for
 >> > certain esoteric cases.
 >>
 >> True, but ask anyone who manages a large cluster how long it would
 >> take them to upgrade the JDK across their cluster and validate all
 >> their applications and everything... binary compatibility is a tiny
 >> drop in that bucket.
 >>
 >> --
 >> Marcelo
 >
 >



 --
 Marcelo

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.ap

Re: explain codegen

2016-04-03 Thread Reynold Xin
Works for me on latest master.



scala> sql("explain codegen select 'a' as a group by 1").head
res3: org.apache.spark.sql.Row =
[Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 ==
WholeStageCodegen
:  +- TungstenAggregate(key=[], functions=[], output=[a#10])
:     +- INPUT
+- Exchange SinglePartition, None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[], functions=[], output=[])
      :     +- INPUT
      +- Scan OneRowRelation[]

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ /** Codegened pipeline for:
/* 006 */ * TungstenAggregate(key=[], functions=[], output=[a#10])
/* 007 */ +- INPUT
/* 008 */ */
/* 009 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 010 */   private Object[] references;
/* 011 */   ...
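
(For reference, the generated code can also be dumped programmatically; this
assumes the debug helpers in current master expose debugCodegen:)

scala> import org.apache.spark.sql.execution.debug._
scala> sql("select 'a' as a group by 1").debugCodegen()  // prints the WholeStageCodegen subtrees and generated Java source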


On Sun, Apr 3, 2016 at 9:38 PM, Jacek Laskowski  wrote:

> Hi,
>
> Looks related to the recent commit...
>
> Repository: spark
> Updated Branches:
>   refs/heads/master 2262a9335 -> 1f0c5dceb
>
> [SPARK-14350][SQL] EXPLAIN output should be in a single cell
>
> Jacek
> 03.04.2016 7:00 PM "Ted Yu"  wrote:
>
>> Hi,
>> Based on master branch refreshed today, I issued 'git clean -fdx' first.
>>
>> Then this command:
>> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
>> -Dhadoop.version=2.7.0 package -DskipTests
>>
>> I got the following error:
>>
>> scala>  sql("explain codegen select 'a' as a group by 1").head
>> org.apache.spark.sql.catalyst.parser.ParseException:
>> extraneous input 'codegen' expecting {'(', 'SELECT', 'FROM', 'ADD',
>> 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE',
>> 'DESCRIBE', 'EXPLAIN', 'LOGICAL', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP',
>> 'SET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'EXTENDED', 'REFRESH',
>> 'CLEAR', 'CACHE', 'UNCACHE', 'FORMATTED', 'DFS', 'TRUNCATE', 'ANALYZE',
>> 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT',
>> 'LOAD'}(line 1, pos 8)
>>
>> == SQL ==
>> explain codegen select 'a' as a group by 1
>> ^^^
>>
>> Can someone shed some light?
>>
>> Thanks
>>
>


Re: explain codegen

2016-04-03 Thread Jacek Laskowski
Hi,

Looks related to the recent commit...

Repository: spark
Updated Branches:
  refs/heads/master 2262a9335 -> 1f0c5dceb

[SPARK-14350][SQL] EXPLAIN output should be in a single cell

Jacek
03.04.2016 7:00 PM "Ted Yu"  wrote:

> Hi,
> Based on master branch refreshed today, I issued 'git clean -fdx' first.
>
> Then this command:
> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.7.0 package -DskipTests
>
> I got the following error:
>
> scala>  sql("explain codegen select 'a' as a group by 1").head
> org.apache.spark.sql.catalyst.parser.ParseException:
> extraneous input 'codegen' expecting {'(', 'SELECT', 'FROM', 'ADD',
> 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE',
> 'DESCRIBE', 'EXPLAIN', 'LOGICAL', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP',
> 'SET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'EXTENDED', 'REFRESH',
> 'CLEAR', 'CACHE', 'UNCACHE', 'FORMATTED', 'DFS', 'TRUNCATE', 'ANALYZE',
> 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT',
> 'LOAD'}(line 1, pos 8)
>
> == SQL ==
> explain codegen select 'a' as a group by 1
> ^^^
>
> Can someone shed some light?
>
> Thanks
>


explain codegen

2016-04-03 Thread Ted Yu
Hi,
Based on master branch refreshed today, I issued 'git clean -fdx' first.

Then this command:
build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
-Dhadoop.version=2.7.0 package -DskipTests

I got the following error:

scala>  sql("explain codegen select 'a' as a group by 1").head
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input 'codegen' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC',
'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE',
'EXPLAIN', 'LOGICAL', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP', 'SET',
'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'EXTENDED', 'REFRESH', 'CLEAR',
'CACHE', 'UNCACHE', 'FORMATTED', 'DFS', 'TRUNCATE', 'ANALYZE', 'REVOKE',
'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 'LOAD'}(line 1, pos
8)

== SQL ==
explain codegen select 'a' as a group by 1
^^^

Can someone shed some light?

Thanks


[SQL] Dataset.map gives error: missing parameter type for expanded function?

2016-04-03 Thread Jacek Laskowski
Hi,

(since this concerns 2.0.0-SNAPSHOT, it's more for dev@ than user@)

With today's master I'm getting the following:

scala> ds
res14: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

// WHY?!
scala> ds.groupBy(_._1)
<console>:26: error: missing parameter type for expanded function ((x$1) => x$1._1)
       ds.groupBy(_._1)
                  ^

scala> ds.filter(_._1.size > 10)
res23: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

It's even on Michael's slide in
https://youtu.be/i7l3JQRx7Qw?t=7m38s from Spark Summit East?! Am I
doing something wrong? Please guide.
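
(For what it's worth, the error looks like plain Scala overload resolution
rather than anything Dataset-specific. A self-contained sketch with
hypothetical method names, not the actual Dataset API -- paste into any Scala
REPL:)

object OverloadDemo {
  // Two overloads, roughly like a typed groupBy next to a column-name groupBy:
  def groupBy[K](func: ((String, Int)) => K): String = "typed"
  def groupBy(col1: String, cols: String*): String = "untyped"
}

// OverloadDemo.groupBy(_._1)                        // fails: missing parameter type for expanded function
OverloadDemo.groupBy((kv: (String, Int)) => kv._1)   // compiles once the parameter type is explicit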

Regards,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[no subject]

2016-04-03 Thread Avinash Dongre