Re: Ask for ARM CI for spark

2019-07-09 Thread Tianhua huang
Hi all,

I am glad to tell you there is new progress on building and testing Spark on
an aarch64 server: the tests are now running. The detailed build/test log is at
https://logs.openlabtesting.org/logs/1/1/419fcb11764048d5a3cda186ea76dd43249e1f97/check/spark-build-arm64/75cc6f5/job-output.txt.gz
and the aarch64 instance info is at
https://logs.openlabtesting.org/logs/1/1/419fcb11764048d5a3cda186ea76dd43249e1f97/check/spark-build-arm64/75cc6f5/zuul-info/zuul-info.ubuntu-xenial-arm64.txt
To enable the tests I made some modifications; the major one was building a
local leveldbjni package. I forked the fusesource/leveldbjni and
chirino/leveldb repos and modified them so that the local package builds, see
https://github.com/huangtianhua/leveldbjni/pull/1 and
https://github.com/huangtianhua/leveldbjni/pull/2 , and then used the result
in Spark; the details are in https://github.com/theopenlab/spark/pull/1
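
For anyone who wants to reproduce this locally, the flow is roughly as
follows (a sketch only; the exact Maven goals and profiles are the ones in
the PRs linked above, and the paths here are illustrative):

    # build the aarch64 leveldbjni artifact and install it into ~/.m2
    git clone https://github.com/huangtianhua/leveldbjni.git
    cd leveldbjni
    mvn clean install -DskipTests

    # then build Spark, which resolves leveldbjni from the local repo
    cd /path/to/spark
    ./build/mvn -DskipTests clean package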


Not all of the tests pass yet; I will try to fix them, and any suggestions
are welcome. Thank you all.

On Mon, Jul 1, 2019 at 5:25 PM Tianhua huang 
wrote:

> We are focusing on cloud ARM instances. At the moment I use an ARM
> instance from the vexxhost cloud to run the build job mentioned above; the
> specification of the instance is 8 VCPUs and 8 GB of RAM,
> and we can use a bigger flavor to create the ARM instance for the job, if
> need be.
>
> On Fri, Jun 28, 2019 at 6:55 PM Steve Loughran 
> wrote:
>
>>
>> It'd be interesting to see how well a Pi 4 works; with only 4 GB of RAM
>> you wouldn't compile on it, but you could try installing the Spark jar
>> bundle and then running against some NFS-mounted disks:
>> https://www.raspberrypi.org/magpi/raspberry-pi-4-specs-benchmarks/ ;
>> unlikely to be fast, but it'd be an efficient kind of slow.
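>>
>> For example, something like this might be the minimal experiment (a
>> sketch; it assumes the prebuilt Spark distribution unpacked on the Pi and
>> an NFS share already mounted at /mnt/nfs, both illustrative):
>>
>>     ./bin/spark-shell --master local[4]
>>     scala> spark.read.textFile("file:///mnt/nfs/data.txt").count()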
>>
>> On Fri, Jun 28, 2019 at 3:08 AM Rui Chen  wrote:
>>
>>> > I think any AArch64 work is going to have to define very clearly what
>>> > "works" means
>>>
>>> +1
>>> It's very valuable to define a clear scope of these projects'
>>> functionality for the ARM platform in the upstream community; it brings
>>> confidence to end users and customers when they plan to deploy these
>>> projects on ARM.
>>>
>>> This is definitely long-term work; let's take it step by step: CI,
>>> testing, finding issues, and resolving them.
>>>
>>> On Thu, Jun 27, 2019 at 9:22 PM, Steve Loughran wrote:
>>>
LevelDB and native codecs are invariably a problem here, as is
anything else doing misaligned IO. Protobuf has also had "issues" in the
past.

 see https://issues.apache.org/jira/browse/HADOOP-16100

I think any AArch64 work is going to have to define very clearly what
"works" means; Spark standalone with a specific set of codecs is
probably the first thing to aim for: no Snappy or LZ4.
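
As a first pass, that could be as simple as pinning a pure-Java codec in
conf/spark-defaults.conf, along these lines (a sketch; whether lzf's
pure-Java implementation behaves well on aarch64 is an assumption to
verify):

    # avoid the JNI-backed codecs (snappy, and lz4 prefers its native
    # implementation) while bringing the port up
    spark.io.compression.codec    lzf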

Anything which goes near protobuf, checksums, native code, etc. is in
trouble. Don't try to deploy with HDFS as the cluster FS, would be my
recommendation.

If you want a cluster FS, use NFS or one of the cloud stores (Google GCS,
Azure WASB). And before trying either of those cloud stores, run the
filesystem connector test suites (hadoop-azure; the Google GCS GitHub repo)
to see that they work. If the foundational FS test suites fail, nothing else
will work.
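
For the Azure side, the connector test suites live in the Hadoop source
tree and can be run along these lines (a sketch; the suite needs Azure
credentials configured first, per the hadoop-azure testing docs):

    cd hadoop-tools/hadoop-azure
    mvn test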



 On Thu, Jun 27, 2019 at 3:09 AM Tianhua huang <
 huangtianhua...@gmail.com> wrote:

> I ran the unit tests on my ARM instance earlier and reported an issue in
> https://issues.apache.org/jira/browse/SPARK-27721. It seems there is no
> leveldbjni native package for aarch64 in leveldbjni-all.jar (1.8), see
> https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8
> . The PR https://github.com/fusesource/leveldbjni/pull/82 added aarch64
> support and was merged on 2 Nov 2017, but the latest release of the repo
> is from 17 Oct 2013, so unfortunately it doesn't include the aarch64
> support.
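>
> A quick way to confirm which native libraries a given jar actually ships
> is to list its contents, for example (illustrative):
>
>     unzip -l leveldbjni-all-1.8.jar | grep -iE 'linux|aarch64'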
>
> I will run the tests in the job mentioned above and try to fix the
> issue; if anyone has any ideas about it, please reply. Thank you.
>
>
> On Wed, Jun 26, 2019 at 8:11 PM Sean Owen  wrote:
>
>> Can you begin by testing yourself? I think the first step is to make
>> sure the build and tests work on ARM. If you find problems you can
>> isolate them and try to fix them, or at least report them. It's only
>> worth getting CI in place when we think builds will work.
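>>
>> For reference, a minimal local check is roughly the following (assuming
>> a JDK and enough memory on the ARM box; ./dev/run-tests drives the full
>> build-and-test matrix):
>>
>>     ./build/mvn -DskipTests clean package
>>     ./dev/run-tests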
>>
>> On Tue, Jun 25, 2019 at 9:26 PM Tianhua huang <
>> huangtianhua...@gmail.com> wrote:
>> >
>> > Thanks Shane :)
>> >
>> > This sounds good, and yes, I agree that it's best to keep the
>> > test/build infrastructure in one place. If you can't find the ARM
>> > resources, we are willing to provide the ARM instance :)  Our goal is
>> > to make more open source software compatible with the aarch64
>> > platform, so let's do it. I will be happy if I can give some 

Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-09 Thread Dongjoon Hyun
Thank you for the reply, Sean. Sure, 2.4.x should be an LTS version.

The main reason for a 2.4.4 release (before 3.0.0) is to have a better basis
for comparison with 3.0.0.
For example, SPARK-27798 is an old bug, but its correctness issue is only
exposed in Spark 2.4.3.
It would be great to have that better basis.

Bests,
Dongjoon.


On Tue, Jul 9, 2019 at 9:52 AM Sean Owen  wrote:

> We will certainly want a 2.4.4 release eventually. In fact I'd expect
> 2.4.x gets maintained for longer than the usual 18 months, as it's the
> last 2.x branch.
> It doesn't need to happen before 3.0, but could. Usually maintenance
> releases happen 3-4 months apart and the last one was 2 months ago. If
> these are significant issues, sure. It'll probably be August before
> it's out anyway.
>
> On Tue, Jul 9, 2019 at 11:15 AM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > Spark 2.4.3 was released two months ago (8th May).
> >
> > As of today (9th July), there are 45 fixes in `branch-2.4`, including
> > the following correctness or blocker issues.
> >
> > - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
> >   decimals not fitting in long
> > - SPARK-26045 Error in the spark 2.4 release package with the
> >   spark-avro_2.11 dependency
> > - SPARK-27798 from_avro can modify variables in other rows in local
> >   mode
> > - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
> > - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist
> >   entries
> > - SPARK-28308 CalendarInterval sub-second part should be padded
> >   before parsing
> >
> > It would be great if we could have Spark 2.4.4 before we get busier
> > with 3.0.0.
> > If it's okay, I'd like to volunteer as the 2.4.4 release manager and
> > roll it next Monday (15th July).
> > What do you think?
> >
> > Bests,
> > Dongjoon.
>


Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-09 Thread Sean Owen
We will certainly want a 2.4.4 release eventually. In fact I'd expect
2.4.x gets maintained for longer than the usual 18 months, as it's the
last 2.x branch.
It doesn't need to happen before 3.0, but could. Usually maintenance
releases happen 3-4 months apart and the last one was 2 months ago. If
these are significant issues, sure. It'll probably be August before
it's out anyway.

On Tue, Jul 9, 2019 at 11:15 AM Dongjoon Hyun  wrote:
>
> Hi, All.
>
> Spark 2.4.3 was released two months ago (8th May).
>
> As of today (9th July), there are 45 fixes in `branch-2.4`, including the
> following correctness or blocker issues.
>
> - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for 
> decimals not fitting in long
> - SPARK-26045 Error in the spark 2.4 release package with the 
> spark-avro_2.11 dependency
> - SPARK-27798 from_avro can modify variables in other rows in local mode
> - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
> - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist entries
> - SPARK-28308 CalendarInterval sub-second part should be padded before 
> parsing
>
> It would be great if we could have Spark 2.4.4 before we get busier with
> 3.0.0.
> If it's okay, I'd like to volunteer as the 2.4.4 release manager and roll
> it next Monday (15th July).
> What do you think?
>
> Bests,
> Dongjoon.




Release Apache Spark 2.4.4 before 3.0.0

2019-07-09 Thread Dongjoon Hyun
Hi, All.

Spark 2.4.3 was released two months ago (8th May).

As of today (9th July), there are 45 fixes in `branch-2.4`, including the
following correctness or blocker issues.

- SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
decimals not fitting in long
- SPARK-26045 Error in the spark 2.4 release package with the
spark-avro_2.11 dependency
- SPARK-27798 from_avro can modify variables in other rows in local mode
- SPARK-27907 HiveUDAF should return NULL in case of 0 rows
- SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist entries
- SPARK-28308 CalendarInterval sub-second part should be padded before
parsing
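
To make the first item concrete: in the affected versions, converting a
decimal wider than a Long overflows. A sketch (illustrative only; the exact
repro is in the JIRA):

    // in spark-shell
    import org.apache.spark.sql.types.Decimal
    val d = Decimal("12345678901234567890")  // 20 digits, exceeds Long range
    d.toScalaBigInt                          // overflowed before the fix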

It would be great if we could have Spark 2.4.4 before we get busier with
3.0.0.
If it's okay, I'd like to volunteer as the 2.4.4 release manager and roll it
next Monday (15th July).
What do you think?

Bests,
Dongjoon.


Re: Contribution help needed for sub-tasks of an umbrella JIRA - port *.sql tests to improve coverage of Python, Pandas, Scala UDF cases

2019-07-09 Thread Hyukjin Kwon
It's alright - thanks for that.
Anyone can take a look. This is an open source project :D.

On Tue, Jul 9, 2019 at 8:18 PM, Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> I can try one and see how it goes, although I'm not familiar with the area.
>
> Stavros
>
> On Tue, Jul 9, 2019 at 6:17 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I am currently aiming to improve the Python, Pandas UDF, and Scala UDF
>> test cases by integrating our existing *.sql files at
>> https://issues.apache.org/jira/browse/SPARK-27921
>>
>> I would appreciate it if anyone interested in contributing to Spark could
>> take some of the sub-tasks. There are too many for me to do alone :-). I
>> am doing them one by one for now.
>>
>> I wrote some guides for this umbrella JIRA specifically, so if you follow
>> them closely, one by one, the process itself isn't that difficult.
>>
>> The most important guide, which should be followed carefully, is:
>> > 7. If there are diffs, analyze them, file or find the JIRA, and skip
>> the tests with comments.
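>>
>> For those new to this area: the ported files go under
>> sql/core/src/test/resources/sql-tests/inputs/udf/ and are run by
>> SQLQueryTestSuite, roughly like this (sbt syntax may vary, and the file
>> name below is only an example):
>>
>>     build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-inner-join.sql"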
>>
>> Thanks!
>>
>
>
>


Re: Contribution help needed for sub-tasks of an umbrella JIRA - port *.sql tests to improve coverage of Python, Pandas, Scala UDF cases

2019-07-09 Thread Stavros Kontopoulos
I can try one and see how it goes, although I'm not familiar with the area.

Stavros

On Tue, Jul 9, 2019 at 6:17 AM Hyukjin Kwon  wrote:

> Hi all,
>
> I am currently aiming to improve the Python, Pandas UDF, and Scala UDF
> test cases by integrating our existing *.sql files at
> https://issues.apache.org/jira/browse/SPARK-27921
>
> I would appreciate it if anyone interested in contributing to Spark could
> take some of the sub-tasks. There are too many for me to do alone :-). I
> am doing them one by one for now.
>
> I wrote some guides for this umbrella JIRA specifically, so if you follow
> them closely, one by one, the process itself isn't that difficult.
>
> The most important guide, which should be followed carefully, is:
> > 7. If there are diffs, analyze them, file or find the JIRA, and skip
> the tests with comments.
>
> Thanks!
>


Re: Opinions wanted: how much to match PostgreSQL semantics?

2019-07-09 Thread Dongjoon Hyun
Thank you, Sean and all.

One decision was made swiftly today.

I believe that we can move forward case-by-case for the others until the
feature freeze (3.0 branch cut).

Bests,
Dongjoon.

On Mon, Jul 8, 2019 at 13:03 Marco Gaido  wrote:

> Hi Sean,
>
> Thanks for bringing this up. Honestly, my opinion is that Spark should be
> fully ANSI SQL compliant. Where ANSI SQL compliance is not an issue, I am
> fine with following any other DB. IMHO, we won't get 100% compliance with
> any DB anyway - postgres in this case (e.g. for decimal operations, we are
> following SQLServer, and postgres behaviour would be very hard to match) -
> so I think it is fine that PMC members decide for each feature whether it
> is worth supporting or not.
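>
> To illustrate the decimal point with a hedged example: a division such as
>
>     SELECT CAST(1 AS DECIMAL(38,18)) / CAST(3 AS DECIMAL(38,18))
>
> rounds to a bounded result scale under Spark's SQLServer-style precision
> rules, while postgres returns a much higher-precision result, so exact
> output parity is out of reach there.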
>
> Thanks,
> Marco
>
> On Mon, 8 Jul 2019, 20:09 Sean Owen,  wrote:
>
>> See the particular issue / question at
>> https://github.com/apache/spark/pull/24872#issuecomment-509108532 and
>> the larger umbrella at
>> https://issues.apache.org/jira/browse/SPARK-27764 -- Dongjoon rightly
>> suggests this is a broader question.
>>