Re: [discuss] ending support for Java 7 in Spark 2.0

2016-04-03 Thread Reynold Xin
Since my original email, I've talked to a lot more users and looked at what
various environments support. It is true that a lot of enterprises, and
even some technology companies, are still on Java 7. One telling data point:
to this day, users still can't install OpenJDK 8 on Ubuntu by default. I see
that as an indication that it is too early to drop Java 7.

Looking at the timeline, a new major JDK version comes out roughly every 3
years. We dropped Java 6 support one year ago, so from a timeline point of
view we would be very aggressive here if we were to drop Java 7 support in
Spark 2.0.

Note that keeping Java 7 support now doesn't mean we have to support it
throughout Spark 2.x. We dropped Java 6 support in Spark 1.5, even though
Spark 1.0 started with Java 6.

In terms of testing, Josh has improved our test infra so that we now run the
Java 8 tests: https://github.com/apache/spark/pull/12073




On Thu, Mar 24, 2016 at 8:51 PM, Liwei Lin  wrote:

> The arguments are really convincing; the new Dataset API as well as the
> performance improvements are exciting, so I'm personally +1 on moving onto
> Java 8.
>
> However, I'm afraid Tencent is one of "the organizations stuck with Java 7"
> -- our IT Infra division wouldn't upgrade to Java 7 until Java 8 was out,
> and won't upgrade to Java 8 until Java 9 is out.
>
> So:
>
> (non-binding) +1 on dropping Scala 2.10 support
>
> (non-binding) -1 on dropping Java 7 support
>   * as long as we figure out a practical way to run Spark with JDK 8 on
>     JDK 7 clusters, this -1 would then definitely be +1
>
> Thanks!
>
> On Fri, Mar 25, 2016 at 10:28 AM, Koert Kuipers  wrote:
>
>> I think that logic is reasonable, but then the same should also apply to
>> Scala 2.10, which is also unmaintained/unsupported at this point (and
>> basically has been since March 2015, except for one hotfix due to a
>> license incompatibility).
>>
>> Who wants to support Scala 2.10 three years after its last maintenance
>> release?
>>
>>
>> On Thu, Mar 24, 2016 at 9:59 PM, Mridul Muralidharan  wrote:
>>
>>> Removing compatibility (with a JDK, etc.) can be done in a major release.
>>> Given that Java 7 was EOLed a while back and is now unsupported, we have to
>>> decide whether we drop support for it in 2.0 or in 3.0 (2+ years from now).
>>>
>>> Given the functionality and performance benefits of going to JDK 8, the
>>> future enhancements in the 2.x timeframe (Scala, dependencies) that will
>>> require it, and the simplicity it brings to code, testing, and support, 2.0
>>> looks like a good checkpoint at which to drop JDK 7 support.
>>>
>>> As already mentioned in the thread, existing YARN clusters are unaffected
>>> if they want to keep running JDK 7 and still use Spark 2 (install JDK 8 on
>>> all nodes and point to it via JAVA_HOME, or in the worst case distribute
>>> JDK 8 as an archive, which is suboptimal).
>>> I am unsure about Mesos (standalone would probably be an easier upgrade, I
>>> guess?).
>>>
>>>
>>> The proposal is for the 1.6.x line to continue to be supported with
>>> critical fixes; newer features will require 2.x and therefore JDK 8.
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Thursday, March 24, 2016, Marcelo Vanzin  wrote:
>>>
 On Thu, Mar 24, 2016 at 4:50 PM, Reynold Xin  wrote:
 > If you want to go down that route, you should also ask somebody who has
 > had experience managing a large organization's applications and try to
 > update the Scala version.

 I understand both sides. But if you look at what I've been asking
 since the beginning, it's all about the costs and benefits of dropping
 support for Java 1.7.

 The biggest argument in your original e-mail is about testing. And the
 testing cost is much bigger for supporting Scala 2.10 than it is for
 supporting Java 1.7. If you read one of my earlier replies, it should
 even be possible to do everything in a single job: compile for Java 7
 and still be able to test things on 1.8, including lambdas, which seems
 to be the main thing you were worried about.


 > On Thu, Mar 24, 2016 at 4:48 PM, Marcelo Vanzin  wrote:
 >>
 >> On Thu, Mar 24, 2016 at 4:46 PM, Reynold Xin  wrote:
 >> > Actually it's *way* harder to upgrade Scala from 2.10 to 2.11 than to
 >> > upgrade the JVM runtime from 7 to 8, because Scala 2.10 and 2.11 are
 >> > not binary compatible, whereas JVM 7 and 8 are binary compatible
 >> > except in certain esoteric cases.
 >>
 >> True, but ask anyone who manages a large cluster how long it would
 >> take them to upgrade the JDK across their cluster and validate all
 >> their applications and everything... binary compatibility is a tiny
 >> drop in that bucket.
 >>
 >> --
 >> Marcelo
 >
 >



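On the workaround discussed above for running a JDK 8 build of Spark on a
cluster whose system JDK is still 7: one commonly used approach on YARN is to
point the application master and executor containers at a separate JDK 8
install. A minimal sketch, assuming JDK 8 is already installed at the same
(illustrative) path on every node; this is not code from the thread:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative path; assumes JDK 8 is present here on every NodeManager.
val jdk8Home = "/usr/lib/jvm/java-8-openjdk-amd64"

val conf = new SparkConf()
  .setAppName("jdk8-on-a-jdk7-cluster")
  // Run the YARN application master container under JDK 8.
  .set("spark.yarn.appMasterEnv.JAVA_HOME", jdk8Home)
  // Run the executor containers under JDK 8 as well.
  .set("spark.executorEnv.JAVA_HOME", jdk8Home)

val sc = new SparkContext(conf)

Note that in client mode the driver itself still runs under whatever JDK
launched spark-submit, so JAVA_HOME would need to point at JDK 8 there too.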

Re: explain codegen

2016-04-03 Thread Reynold Xin
Works for me on latest master.



scala> sql("explain codegen select 'a' as a group by 1").head
res3: org.apache.spark.sql.Row =
[Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 ==
WholeStageCodegen
:  +- TungstenAggregate(key=[], functions=[], output=[a#10])
:     +- INPUT
+- Exchange SinglePartition, None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[], functions=[], output=[])
      :     +- INPUT
      +- Scan OneRowRelation[]

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ /** Codegened pipeline for:
/* 006 */ * TungstenAggregate(key=[], functions=[], output=[a#10])
/* 007 */ +- INPUT
/* 008 */ */
/* 009 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 010 */   private Object[] references;
/* 011 */   ...


On Sun, Apr 3, 2016 at 9:38 PM, Jacek Laskowski  wrote:

> Hi,
>
> Looks related to the recent commit...
>
> Repository: spark
> Updated Branches:
>   refs/heads/master 2262a9335 -> 1f0c5dceb
>
> [SPARK-14350][SQL] EXPLAIN output should be in a single cell
>
> Jacek
> 03.04.2016 7:00 PM "Ted Yu"  wrote:
>
>> Hi,
>> Based on master branch refreshed today, I issued 'git clean -fdx' first.
>>
>> Then this command:
>> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
>> -Dhadoop.version=2.7.0 package -DskipTests
>>
>> I got the following error:
>>
>> scala>  sql("explain codegen select 'a' as a group by 1").head
>> org.apache.spark.sql.catalyst.parser.ParseException:
>> extraneous input 'codegen' expecting {'(', 'SELECT', 'FROM', 'ADD',
>> 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE',
>> 'DESCRIBE', 'EXPLAIN', 'LOGICAL', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP',
>> 'SET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'EXTENDED', 'REFRESH',
>> 'CLEAR', 'CACHE', 'UNCACHE', 'FORMATTED', 'DFS', 'TRUNCATE', 'ANALYZE',
>> 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT',
>> 'LOAD'}(line 1, pos 8)
>>
>> == SQL ==
>> explain codegen select 'a' as a group by 1
>> ^^^
>>
>> Can someone shed some light?
>>
>> Thanks
>>
>


Re: explain codegen

2016-04-03 Thread Jacek Laskowski
Hi,

Looks related to the recent commit...

Repository: spark
Updated Branches:
  refs/heads/master 2262a9335 -> 1f0c5dceb

[SPARK-14350][SQL] EXPLAIN output should be in a single cell

Jacek
03.04.2016 7:00 PM "Ted Yu"  wrote:

> Hi,
> Based on master branch refreshed today, I issued 'git clean -fdx' first.
>
> Then this command:
> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.7.0 package -DskipTests
>
> I got the following error:
>
> scala>  sql("explain codegen select 'a' as a group by 1").head
> org.apache.spark.sql.catalyst.parser.ParseException:
> extraneous input 'codegen' expecting {'(', 'SELECT', 'FROM', 'ADD',
> 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE',
> 'DESCRIBE', 'EXPLAIN', 'LOGICAL', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP',
> 'SET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'EXTENDED', 'REFRESH',
> 'CLEAR', 'CACHE', 'UNCACHE', 'FORMATTED', 'DFS', 'TRUNCATE', 'ANALYZE',
> 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT',
> 'LOAD'}(line 1, pos 8)
>
> == SQL ==
> explain codegen select 'a' as a group by 1
> ^^^
>
> Can someone shed some light?
>
> Thanks
>


explain codegen

2016-04-03 Thread Ted Yu
Hi,
Based on master branch refreshed today, I issued 'git clean -fdx' first.

Then this command:
build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
-Dhadoop.version=2.7.0 package -DskipTests

I got the following error:

scala>  sql("explain codegen select 'a' as a group by 1").head
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input 'codegen' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC',
'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE',
'EXPLAIN', 'LOGICAL', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP', 'SET',
'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'EXTENDED', 'REFRESH', 'CLEAR',
'CACHE', 'UNCACHE', 'FORMATTED', 'DFS', 'TRUNCATE', 'ANALYZE', 'REVOKE',
'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 'LOAD'}(line 1, pos
8)

== SQL ==
explain codegen select 'a' as a group by 1
^^^

Can someone shed some light?

Thanks


[SQL] Dataset.map gives error: missing parameter type for expanded function?

2016-04-03 Thread Jacek Laskowski
Hi,

(since this is about 2.0.0-SNAPSHOT, it's more of a dev question than a user one)

With today's master I'm getting the following:

scala> ds
res14: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

// WHY?!
scala> ds.groupBy(_._1)
<console>:26: error: missing parameter type for expanded function
((x$1) => x$1._1)
       ds.groupBy(_._1)
                  ^

scala> ds.filter(_._1.size > 10)
res23: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

It's even on Michael's slide in
https://youtu.be/i7l3JQRx7Qw?t=7m38s from Spark Summit East?! Am I
doing something wrong? Please guide.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
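
A hedged note on the error above, assuming master already matched the
eventual 2.0 API at this point: Dataset.groupBy takes columns (the untyped,
relational form), so for a bare lambda like _._1 the compiler has no expected
function type from which to infer the parameter type, while filter still has
a function-taking overload, which is why it compiles. The typed,
function-taking grouping is groupByKey. A minimal sketch, reusing the
ds: Dataset[(String, Int)] from the message above:

scala> ds.groupByKey(_._1)   // typed grouping; the key type is inferred from the function
scala> ds.groupBy("_1")      // untyped (relational) grouping by column name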



Re: how about a custom coalesce() policy?

2016-04-03 Thread Nezih Yigitbasi
Sure, here is the JIRA and this is the PR.

Nezih

On Sat, Apr 2, 2016 at 10:40 PM Hemant Bhanawat 
wrote:

> correcting email id for Nezih
>
> Hemant Bhanawat 
> www.snappydata.io
>
> On Sun, Apr 3, 2016 at 11:09 AM, Hemant Bhanawat 
> wrote:
>
>> Hi Nezih,
>>
>> Can you share JIRA and PR numbers?
>>
>> This partial de-coupling of the data partitioning strategy from Spark
>> parallelism would be a useful feature for any data store.
>>
>> Hemant
>>
>> Hemant Bhanawat 
>> www.snappydata.io
>>
>> On Fri, Apr 1, 2016 at 10:33 PM, Nezih Yigitbasi <
>> nyigitb...@netflix.com.invalid> wrote:
>>
>>> Hey Reynold,
>>> Created an issue (and a PR) for this change to get discussions started.
>>>
>>> Thanks,
>>> Nezih
>>>
>>> On Fri, Feb 26, 2016 at 12:03 AM Reynold Xin 
>>> wrote:
>>>
 Using the right email for Nezih


 On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin 
 wrote:

> I think this can be useful.
>
> The only thing is that we are slowly migrating to the Dataset/DataFrame
> API and leaving RDD mostly as-is as a lower-level API. Maybe we should do
> both? In either case it would be great to discuss the API on a pull
> request. Cheers.
>
> On Wed, Feb 24, 2016 at 2:08 PM, Nezih Yigitbasi <
> nyigitb...@netflix.com.invalid> wrote:
>
>> Hi Spark devs,
>>
>> I sent an email about my problem some time ago: I want to merge a large
>> number of small files with Spark. Currently I am using Hive with the
>> CombineHiveInputFormat, and I can control the size of the output files
>> with the max split size parameter (which the CombineHiveInputFormat uses
>> to coalesce the input splits). My first attempt was to use coalesce(),
>> but since coalesce only considers the target number of partitions, the
>> output file sizes varied wildly.
>>
>> What I think would be useful is an optional PartitionCoalescer parameter
>> (a new interface) in the coalesce() method (or maybe a new method?) that
>> callers can implement for custom coalescing strategies. For my use case I
>> have already implemented a SizeBasedPartitionCoalescer that coalesces
>> partitions by looking at their sizes and using a max split size
>> parameter, similar to the CombineHiveInputFormat (I also had to expose
>> HadoopRDD to get access to the individual split sizes etc.).
>>
>> What do you guys think about such a change? Could it be useful to other
>> users as well, or is there an easier way to accomplish the same merge
>> logic? If you think it may be useful, I already have an implementation
>> and will be happy to work with the community to contribute it.
>>
>> Thanks,
>> Nezih
>>
>>
>
>

>>
>
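
For reference on the proposal above, a minimal sketch of what a pluggable
coalescing hook could look like. The trait name, signature, and toy strategy
below are illustrative only and are not taken from the linked JIRA/PR:

import org.apache.spark.Partition
import org.apache.spark.rdd.RDD

// Illustrative hook: a strategy object decides how the parent RDD's
// partitions are grouped into the coalesced partitions.
trait PartitionCoalescer extends Serializable {
  def coalesce(maxPartitions: Int, parent: RDD[_]): Array[Seq[Partition]]
}

// Toy strategy: pack parent partitions into evenly sized groups, capped at
// maxPartitions. A size-based strategy, as described in the proposal, would
// instead accumulate partitions until a max split size in bytes is reached.
class EvenGroupingCoalescer extends PartitionCoalescer {
  override def coalesce(maxPartitions: Int, parent: RDD[_]): Array[Seq[Partition]] = {
    val parts = parent.partitions
    val groupSize = math.max(1, math.ceil(parts.length.toDouble / maxPartitions).toInt)
    parts.grouped(groupSize).map(_.toSeq).toArray
  }
}

The idea would then be for coalesce() to accept such a strategy as an
optional argument; the exact parameter shape is whatever the PR settles on.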