Hi all,

We ran all the unit tests for Spark on an arm64 platform; after some effort, four tests still FAIL, see
https://logs.openlabtesting.org/logs/4/4/ae5ebaddd6ba6eba5a525b2bf757043ebbe78432/check/spark-build-arm64/9ecccad/job-output.txt.gz
Two of them failed because "Can't find 1 executors before 10000 milliseconds elapsed" (see below). When we increase the timeout, the tests pass, so could we increase the timeout? I also have another question about
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/TestUtils.scala#L285 :
judging by the comment on the function, shouldn't the comparison be >= instead of >?

- test driver discovery under local-cluster mode *** FAILED ***
  java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
  at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:293)
  at org.apache.spark.SparkContextSuite.$anonfun$new$78(SparkContextSuite.scala:753)
  at org.apache.spark.SparkContextSuite.$anonfun$new$78$adapted(SparkContextSuite.scala:741)
  at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
  at org.apache.spark.SparkContextSuite.$anonfun$new$77(SparkContextSuite.scala:741)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)

- test gpu driver resource files and discovery under local-cluster mode *** FAILED ***
  java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
  at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:293)
  at org.apache.spark.SparkContextSuite.$anonfun$new$80(SparkContextSuite.scala:781)
  at org.apache.spark.SparkContextSuite.$anonfun$new$80$adapted(SparkContextSuite.scala:761)
  at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
  at org.apache.spark.SparkContextSuite.$anonfun$new$79(SparkContextSuite.scala:761)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)

The other two failed because "2143289344 equaled 2143289344". This is because the value of floatToRawIntBits(0.0f/0.0f) on the aarch64 platform is 2143289344, which equals floatToRawIntBits(Float.NaN). About this I sent an email to jdk-dev and proposed a topic to the Scala community,
https://users.scala-lang.org/t/the-value-of-floattorawintbits-0-0f-0-0f-is-different-on-x86-64-and-aarch64-platforms/4845 and
https://github.com/scala/bug/issues/11632 .
I thought it was a JDK or Scala issue, but after discussion it appears to be platform-dependent behavior, so the following asserts seem inappropriate:
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFunctionsSuite.scala#L704-L705 and
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala#L732-L733

- SPARK-26021: NaN and -0.0 in grouping expressions *** FAILED ***
  2143289344 equaled 2143289344 (DataFrameAggregateSuite.scala:732)

- NaN and -0.0 in window partition keys *** FAILED ***
  2143289344 equaled 2143289344 (DataFrameWindowFunctionsSuite.scala:704)

We are waiting for your suggestions on fixing these failed tests, thank you very much.
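For context on the TestUtils question, here is a minimal sketch (in Java, with hypothetical names; not Spark's actual implementation) of the kind of poll-until-timeout wait that waitUntilExecutorsUp performs, and where the > vs >= choice shows up. Note that if the polled count also includes the driver, a strict > would already amount to "at least the expected number of executors", which may explain the comparison; that reading is worth confirming against the real code.

```java
import java.util.function.IntSupplier;

public class WaitUtil {
    // Hypothetical helper illustrating the question about
    // TestUtils.waitUntilExecutorsUp; not Spark's actual code.
    // Polls currentCount until it is at least `expected`, or throws
    // once the timeout elapses.
    public static void waitUntilUp(IntSupplier currentCount, int expected,
                                   long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            // >= succeeds as soon as at least `expected` have registered;
            // an exact == (or a strict >) changes which counts satisfy
            // the wait, which is the crux of the question above.
            if (currentCount.getAsInt() >= expected) {
                return;
            }
            Thread.sleep(10);  // poll interval
        }
        throw new IllegalStateException("Can't find " + expected
            + " executors before " + timeoutMs + " milliseconds elapsed");
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a counter that reaches the target on the third poll.
        int[] polls = {0};
        waitUntilUp(() -> ++polls[0], 3, 1000);
        System.out.println("saw " + polls[0] + " polls");
    }
}
```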
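The NaN bit-pattern difference can be reproduced outside Spark with plain Java. A sketch: the exact raw bit pattern produced by 0.0f/0.0f depends on the hardware (and on whether the expression gets constant-folded at compile time), but floatToIntBits is specified to canonicalize every NaN to 0x7fc00000, so it is stable across platforms.

```java
public class NaNBits {
    public static void main(String[] args) {
        // Compute 0.0f / 0.0f at runtime; a pure literal expression would be
        // constant-folded on the build machine, which can hide (or bake in)
        // the platform difference.
        float zero = Float.parseFloat("0.0");
        float nan = zero / zero;

        // Raw bits are platform-dependent: x86-64 hardware produces its own
        // default quiet-NaN pattern for 0/0, while aarch64 produces
        // 0x7fc00000, which happens to equal the bits of Float.NaN.
        System.out.printf("raw bits:       0x%08x%n", Float.floatToRawIntBits(nan));

        // floatToIntBits collapses every NaN to the canonical 0x7fc00000,
        // so it is the stable choice when a test compares NaN bit patterns.
        System.out.printf("canonical bits: 0x%08x%n", Float.floatToIntBits(nan));
    }
}
```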
On Wed, Jul 10, 2019 at 10:07 AM Tianhua huang <huangtianhua...@gmail.com> wrote:
> Hi all,
>
> I am glad to tell you that there is new progress on building/testing Spark
> on an aarch64 server: the tests are running, see the build/test log at
> https://logs.openlabtesting.org/logs/1/1/419fcb11764048d5a3cda186ea76dd43249e1f97/check/spark-build-arm64/75cc6f5/job-output.txt.gz
> and the aarch64 instance info at
> https://logs.openlabtesting.org/logs/1/1/419fcb11764048d5a3cda186ea76dd43249e1f97/check/spark-build-arm64/75cc6f5/zuul-info/zuul-info.ubuntu-xenial-arm64.txt
> In order to enable the tests I made some modifications; the major one is
> building a local leveldbjni package. I forked the fusesource/leveldbjni
> and chirino/leveldb repos and modified them so the local package builds,
> see https://github.com/huangtianhua/leveldbjni/pull/1 and
> https://github.com/huangtianhua/leveldbjni/pull/2 , then used it in Spark;
> the details are in https://github.com/theopenlab/spark/pull/1
>
> Not all tests pass yet; I will try to fix them, and any suggestion is
> welcome, thank you all.
>
> On Mon, Jul 1, 2019 at 5:25 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>
>> We are focused on ARM cloud instances, and I now use an ARM instance
>> from the vexxhost cloud to run the build job mentioned above. Its
>> specification is 8 VCPUs and 8GB of RAM, and we can use a bigger flavor
>> for the ARM instance to run the job, if need be.
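Since the leveldbjni work above hinges on whether a given jar bundles an aarch64 native library, here is a small self-contained sketch that scans a jar's entries for an architecture string. The entry layout (e.g. META-INF/native/...) and the jar name in the usage comment are assumptions, not verified against the real leveldbjni-all artifact.

```java
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class JarNativeScan {
    /** Returns jar entries whose path mentions the given architecture string. */
    public static List<String> nativeEntries(String jarPath, String arch)
            throws Exception {
        List<String> matches = new ArrayList<>();
        try (ZipFile zip = new ZipFile(jarPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.contains(arch)) {
                    matches.add(name);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws Exception {
        // e.g. java JarNativeScan leveldbjni-all-1.8.jar aarch64
        if (args.length < 2) {
            System.out.println("usage: java JarNativeScan <jar> <arch>");
            return;
        }
        List<String> found = nativeEntries(args[0], args[1]);
        if (found.isEmpty()) {
            System.out.println("no " + args[1] + " native entries in " + args[0]);
        } else {
            found.forEach(System.out::println);
        }
    }
}
```

An empty result for "aarch64" against leveldbjni-all-1.8.jar would be consistent with the missing-native-package failure described in SPARK-27721.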
>>
>> On Fri, Jun 28, 2019 at 6:55 PM Steve Loughran <ste...@cloudera.com.invalid> wrote:
>>
>>> Be interesting to see how well a Pi4 works; with only 4GB of RAM you
>>> wouldn't compile with it, but you could try installing the spark jar bundle
>>> and then run against some NFS mounted disks:
>>> https://www.raspberrypi.org/magpi/raspberry-pi-4-specs-benchmarks/ ;
>>> unlikely to be fast, but it'd be an efficient kind of slow
>>>
>>> On Fri, Jun 28, 2019 at 3:08 AM Rui Chen <chenrui.m...@gmail.com> wrote:
>>>
>>>> > I think any AA64 work is going to have to define very clearly what
>>>> > "works" is defined as
>>>>
>>>> +1
>>>> It's very valuable to build a clearly defined scope of these projects'
>>>> functionality on the ARM platform in the upstream community; it brings
>>>> confidence to end users and customers when they plan to deploy these
>>>> projects on ARM.
>>>>
>>>> This is definitely long-term work; let's take it step by step: CI,
>>>> testing, issue reporting, and resolution.
>>>>
>>>> On Thu, Jun 27, 2019 at 9:22 PM Steve Loughran <ste...@cloudera.com.invalid> wrote:
>>>>
>>>>> leveldb and native codecs are invariably a problem here, as is
>>>>> anything else doing misaligned IO. Protobuf has also had "issues" in the
>>>>> past.
>>>>>
>>>>> see https://issues.apache.org/jira/browse/HADOOP-16100
>>>>>
>>>>> I think any AA64 work is going to have to define very clearly what
>>>>> "works" is defined as; Spark standalone with a specific set of codecs is
>>>>> probably the first thing to aim for: no Snappy or lz4.
>>>>>
>>>>> Anything which goes near protobuf, checksums, native code, etc. is in
>>>>> trouble. Don't try to deploy with HDFS as the cluster FS, would be my
>>>>> recommendation.
>>>>>
>>>>> If you want a cluster, use NFS or one of Google GCS or Azure WASB as
>>>>> the cluster FS. And before trying either of those cloud stores, run the
>>>>> filesystem connector test suites (hadoop-azure; google gcs github) to see
>>>>> that they work. If the foundational FS test suites fail, nothing else
>>>>> will work.
>>>>>
>>>>> On Thu, Jun 27, 2019 at 3:09 AM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>>
>>>>>> I ran the unit tests on my ARM instance before and reported an issue,
>>>>>> https://issues.apache.org/jira/browse/SPARK-27721; it seems there is no
>>>>>> leveldbjni native package for aarch64 in leveldbjni-all.jar (1.8),
>>>>>> https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8 .
>>>>>> https://github.com/fusesource/leveldbjni/pull/82 added aarch64 support
>>>>>> and was merged on 2 Nov 2017, but the latest release of the repo dates
>>>>>> from 17 Oct 2013, so unfortunately it doesn't include the aarch64
>>>>>> support.
>>>>>>
>>>>>> I will run the tests in the job mentioned above and will try to fix
>>>>>> this issue; if anyone has any ideas about it, please reply, thank you.
>>>>>>
>>>>>> On Wed, Jun 26, 2019 at 8:11 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> Can you begin by testing yourself? I think the first step is to make
>>>>>>> sure the build and tests work on ARM. If you find problems you can
>>>>>>> isolate them and try to fix them, or at least report them. It's only
>>>>>>> worth getting CI in place when we think builds will work.
>>>>>>>
>>>>>>> On Tue, Jun 25, 2019 at 9:26 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Thanks Shane :)
>>>>>>> >
>>>>>>> > This sounds good, and yes I agree that it's best to keep the
>>>>>>> > test/build infrastructure in one place. If you can't find ARM
>>>>>>> > resources, we are willing to provide the ARM instance :) Our goal
>>>>>>> > is to make more open source software compatible with the aarch64
>>>>>>> > platform, so let's do it. I will be happy to help with that goal.
>>>>>>> >
>>>>>>> > Waiting for your good news :)
>>>>>>> >
>>>>>>> > On Wed, Jun 26, 2019 at 9:47 AM shane knapp <skn...@berkeley.edu> wrote:
>>>>>>> >>
>>>>>>> >> ...or via VM as you mentioned earlier. :)
>>>>>>> >>
>>>>>>> >> shane (who will file a JIRA tomorrow)
>>>>>>> >>
>>>>>>> >> On Tue, Jun 25, 2019 at 6:44 PM shane knapp <skn...@berkeley.edu> wrote:
>>>>>>> >>>
>>>>>>> >>> i'd much prefer that we keep the test/build infrastructure in
>>>>>>> >>> one place.
>>>>>>> >>>
>>>>>>> >>> we don't have ARM hardware, but there's a slim possibility i can
>>>>>>> >>> scare something up in our older research stock...
>>>>>>> >>>
>>>>>>> >>> another option would be to run the build in an arm-based docker
>>>>>>> >>> container, which (according to the intarwebs) is possible.
>>>>>>> >>>
>>>>>>> >>> shane
>>>>>>> >>>
>>>>>>> >>> On Tue, Jun 25, 2019 at 6:35 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>>>> >>>>
>>>>>>> >>>> I forked the apache/spark project and proposed a job
>>>>>>> >>>> (https://github.com/theopenlab/spark/pull/1) for building Spark
>>>>>>> >>>> on an OpenLab ARM instance; this is the first step of building
>>>>>>> >>>> Spark on ARM. I can enable a periodic ARM build job for
>>>>>>> >>>> apache/spark master if you like, and later I will run the tests
>>>>>>> >>>> for Spark. I am also willing to be the maintainer of the ARM CI
>>>>>>> >>>> for Spark.
>>>>>>> >>>>
>>>>>>> >>>> Thanks for your attention.
>>>>>>> >>>>
>>>>>>> >>>> On Thu, Jun 20, 2019 at 10:17 AM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>> Thanks Sean.
>>>>>>> >>>>>
>>>>>>> >>>>> I am very happy to hear that the community will put effort
>>>>>>> >>>>> into fixing the ARM-related issues. I'd be happy to help if
>>>>>>> >>>>> you like. And could you give the tracking link for this issue,
>>>>>>> >>>>> so I can check whether it is fixed, thank you.
>>>>>>> >>>>> As far as I know, old versions of Spark supported ARM and the
>>>>>>> >>>>> new versions don't; this shows that we need a CI to check
>>>>>>> >>>>> whether Spark supports ARM and whether some modification
>>>>>>> >>>>> breaks it.
>>>>>>> >>>>> I will add a demo job in OpenLab to build Spark on ARM and run
>>>>>>> >>>>> a simple unit test. Later I will give the job link.
>>>>>>> >>>>>
>>>>>>> >>>>> Let me know what you think.
>>>>>>> >>>>>
>>>>>>> >>>>> Thank you all!
>>>>>>> >>>>>
>>>>>>> >>>>> On Wed, Jun 19, 2019 at 8:47 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>> I'd begin by reporting and fixing ARM-related issues in the
>>>>>>> >>>>>> build. If they're small, of course we should do them. If it
>>>>>>> >>>>>> requires significant modifications, we can discuss how much
>>>>>>> >>>>>> Spark can support ARM. I don't think it's yet necessary for
>>>>>>> >>>>>> the Spark project to run these CI builds until that point,
>>>>>>> >>>>>> but it's always welcome if people are testing that
>>>>>>> >>>>>> separately.
>>>>>>> >>>>>>
>>>>>>> >>>>>> On Wed, Jun 19, 2019 at 7:41 AM Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>>>> >>>>>> >
>>>>>>> >>>>>> > Moving to dev@ for increased visibility among the developers.
>>>>>>> >>>>>> >
>>>>>>> >>>>>> > On Wed, Jun 19, 2019 at 1:24 AM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>>>> >>>>>> >>
>>>>>>> >>>>>> >> Thanks for your reply.
>>>>>>> >>>>>> >>
>>>>>>> >>>>>> >> As I said before, I met some problems building and testing
>>>>>>> >>>>>> >> Spark on an aarch64 server, so it would be better to have
>>>>>>> >>>>>> >> an ARM CI to make sure Spark is compatible with aarch64
>>>>>>> >>>>>> >> platforms.
>>>>>>> >>>>>> >>
>>>>>>> >>>>>> >> I'm from the OpenLab team (https://openlabtesting.org/ , a
>>>>>>> >>>>>> >> community that does open source project testing). We can
>>>>>>> >>>>>> >> provide some ARM virtual machines to AMPLab Jenkins, and
>>>>>>> >>>>>> >> we also have a developer team willing to work on this: we
>>>>>>> >>>>>> >> are willing to maintain the CI build jobs and address the
>>>>>>> >>>>>> >> CI issues. What do you think?
>>>>>>> >>>>>> >>
>>>>>>> >>>>>> >> Thanks for your attention.
>>>>>>> >>>>>> >>
>>>>>>> >>>>>> >> On Wed, Jun 19, 2019 at 6:39 AM shane knapp <skn...@berkeley.edu> wrote:
>>>>>>> >>>>>> >>>
>>>>>>> >>>>>> >>> yeah, we don't have any aarch64 systems for testing...
>>>>>>> >>>>>> >>> this has been asked before but is currently pretty low on
>>>>>>> >>>>>> >>> our priority list as we don't have the hardware.
>>>>>>> >>>>>> >>>
>>>>>>> >>>>>> >>> sorry,
>>>>>>> >>>>>> >>>
>>>>>>> >>>>>> >>> shane
>>>>>>> >>>>>> >>>
>>>>>>> >>>>>> >>> On Mon, Jun 10, 2019 at 7:08 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>>>> >>>>>> >>>>
>>>>>>> >>>>>> >>>> Hi, sorry to disturb you.
>>>>>>> >>>>>> >>>> The CI testing for Apache Spark is supported by AMPLab
>>>>>>> >>>>>> >>>> Jenkins, and I see there are some machines (most of them
>>>>>>> >>>>>> >>>> Linux amd64) for CI, but there seems to be no aarch64
>>>>>>> >>>>>> >>>> machine for Spark CI testing. Recently I built and ran
>>>>>>> >>>>>> >>>> the tests for Spark (master and branch-2.4) on my ARM
>>>>>>> >>>>>> >>>> server, and unfortunately there are some problems; for
>>>>>>> >>>>>> >>>> example, a unit test fails due to the leveldbjni native
>>>>>>> >>>>>> >>>> package. The details of the Java tests are at
>>>>>>> >>>>>> >>>> http://paste.openstack.org/show/752063/ and of the
>>>>>>> >>>>>> >>>> Python tests at http://paste.openstack.org/show/752709/
>>>>>>> >>>>>> >>>> So I have a question about ARM CI testing for Spark: is
>>>>>>> >>>>>> >>>> there any plan to support it? Thank you very much and I
>>>>>>> >>>>>> >>>> will wait for your reply!
>>>>>>> >>>>>> >>>
>>>>>>> >>>>>> >>> --
>>>>>>> >>>>>> >>> Shane Knapp
>>>>>>> >>>>>> >>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>>> >>>>>> >>> https://rise.cs.berkeley.edu
>>>>>>> >>>>>> >
>>>>>>> >>>>>> > --
>>>>>>> >>>>>> > Twitter: https://twitter.com/holdenkarau
>>>>>>> >>>>>> > Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>>>>> >>>>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>> >>>
>>>>>>> >>> --
>>>>>>> >>> Shane Knapp
>>>>>>> >>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>>> >>> https://rise.cs.berkeley.edu
>>>>>>> >>
>>>>>>> >> --
>>>>>>> >> Shane Knapp
>>>>>>> >> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>>> >> https://rise.cs.berkeley.edu