Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-12-09 Thread Arwin Tio
Hello,

I have a ticket/PR out for this issue:

https://issues.apache.org/jira/browse/SPARK-29089
https://github.com/apache/spark/pull/25899

Can somebody please take a look? Is there anything else I can do to get this 
through the door?

Thanks,

Arwin


From: Steve Loughran 
Sent: September 7, 2019 9:22 AM
To: Arwin Tio 
Cc: Sean Owen ; dev@spark.apache.org 
Subject: Re: DataFrameReader bottleneck in 
DataSource#checkAndGlobPathIfNecessary when reading S3 files



On Fri, Sep 6, 2019 at 10:56 PM Arwin Tio <arwin@hotmail.com> wrote:
I think the problem is calling globStatus to expand all 300K files.
In my particular case I did not use any glob patterns, so my bottleneck came 
from the FileSystem#exists specifically. I do concur that the globStatus 
expansion could also be problematic.

But you might
consider, if possible, running a lot of .csv jobs in parallel to query
subsets of all the files, and union the results. At least there you
parallelize the reading from the object store.
That is a great solution! I think that's what I will do as a workaround for the 
moment. Right now I'm thinking that a potential improvement here is to 
parallelize the SparkHadoopUtil#globPathIfNecessary and FileSystem#exists calls 
whenever possible (i.e. when multiple paths are specified), so that the client 
doesn't have to.
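
A minimal sketch of that workaround (assuming a plain SparkSession named
`spark` and a Seq[String] of S3 paths; the batch size is arbitrary):

import org.apache.spark.sql.{DataFrame, SparkSession}

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Read the paths in parallel batches and union the results, so that the
// driver-side existence/glob checks for each batch overlap instead of
// running sequentially over one huge path list.
def readInBatches(spark: SparkSession, paths: Seq[String], batchSize: Int = 1000): DataFrame = {
  implicit val ec: ExecutionContext = ExecutionContext.global
  val batches = paths.grouped(batchSize).toSeq.map { batch =>
    Future(spark.read.csv(batch: _*))
  }
  Await.result(Future.sequence(batches), Duration.Inf).reduce(_ union _)
}

The same union trick works for other formats too; the point is only that each
read resolves its own subset of paths on a separate thread.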


The other tactic, though it'd go through a lot more of the code, would be to 
postpone the exists check until the work is scheduled, which happens implicitly 
in open() on the workers, or explicitly when the RDD does the split calculation 
and calls getFileBlockLocations(). If you are confident that that always happens 
(and you will have to trace back from those calls in things like 
org.apache.spark.streaming.util.HdfsUtils and ParallelizedWithLocalityRDD) then 
you can take those scans out of the driver ... but I fear regression handling 
there gets harder.

* have SparkHadoopUtil differentiate between files returned by globStatus(), 
which therefore exist, and those which it didn't glob for; it will only need to 
check the latter.
* then worry about parallel execution of the scan, again
Okay sounds good, I will take a crack at this and open a ticket. Any thoughts 
on the parallelism; should it be configurable?
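
For concreteness, a rough sketch of the kind of parallel check being discussed
(the helper and its fixed-size thread pool are purely illustrative, not the
actual patch; whether the parallelism is configurable is exactly the open
question):

import java.util.concurrent.Executors

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Check many paths for existence concurrently instead of issuing one
// blocking FileSystem#exists call at a time from the driver.
def nonExistentPaths(paths: Seq[Path], conf: Configuration, parallelism: Int = 40): Seq[Path] = {
  val pool = Executors.newFixedThreadPool(parallelism)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val checks = paths.map { p =>
      Future {
        val fs = p.getFileSystem(conf)
        if (fs.exists(p)) None else Some(p)   // keep only the missing paths
      }
    }
    Await.result(Future.sequence(checks), Duration.Inf).flatten
  } finally {
    pool.shutdown()
  }
}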

For file input formats (parquet, orc, ...) there is an option, default == 8, 
though it's also off by default... maybe I should change that.
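
For anyone looking for it, the option presumably being referred to (an
assumption on my part) is the Hadoop FileInputFormat listing thread count,
which can be passed through Spark's Hadoop configuration:

import org.apache.spark.sql.SparkSession

// Assumption: the option in question is Hadoop's
// mapreduce.input.fileinputformat.list-status.num-threads, which controls how
// many threads FileInputFormat/LocatedFileStatusFetcher use when listing input
// paths. It can be set via Spark's spark.hadoop.* passthrough.
val spark = SparkSession.builder()
  .appName("parallel-listing-example")
  .config("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "8")
  .getOrCreate()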


Another possible QoL improvement here is to show progress log messages: 
something that indicates to the user why the cluster appears stuck while the 
driver is listing S3 files, maybe even including the FS getStorageStatistics?
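
A rough sketch of the kind of signal such a progress message could draw on
(assuming Hadoop 2.8+, where FileSystem#getStorageStatistics is available; the
bucket is made up):

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

import scala.collection.JavaConverters._

// Dump the per-filesystem counters (e.g. S3A operation counts) so a periodic
// log line can show that listing is still making progress.
val fs = FileSystem.get(new URI("s3a://some-bucket/"), new Configuration())
fs.getStorageStatistics.getLongStatistics.asScala.foreach { stat =>
  println(s"listing progress: ${stat.getName} = ${stat.getValue}")
}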

Yeah. If you want some examples of this, take a look at 
https://github.com/steveloughran/cloudstore. The locatedfilestatus command 
replicates what happens during FileInputFormat scans, so that is how I'm going 
to tune IOPS there. It might also be good to have those bits of the Hadoop MR 
classes which Spark uses log internally at debug, so everything gets this 
logging if they ask for it.

Happy to take contribs there as Hadoop JIRAs & PRs

Thanks,

Arwin

From: Steve Loughran <ste...@cloudera.com>
Sent: September 6, 2019 4:15 PM
To: Sean Owen <sro...@gmail.com>
Cc: Arwin Tio <arwin@hotmail.com>; dev@spark.apache.org
Subject: Re: DataFrameReader bottleneck in 
DataSource#checkAndGlobPathIfNecessary when reading S3 files



On Fri, Sep 6, 2019 at 2:50 PM Sean Owen <sro...@gmail.com> wrote:
I think the problem is calling globStatus to expand all 300K files.
This is a general problem for object stores and huge numbers of files.
Steve L. may have better thoughts on real solutions. But you might
consider, if possible, running a lot of .csv jobs in parallel to query
subsets of all the files, and union the results. At least there you
parallelize the reading from the object store.

Yeah, avoid globs and small files, especially small files in deep trees.

I think it's hard to optimize this case from the Spark side as it's
not clear how big a glob like s3://foo/* is going to be. I think it
would take reimplementing some logic to expand the glob incrementally
or something. Or maybe I am overlooking optimizations that have gone
into Spark 3.

A long time ago I actually tried to move FileSystem.globFiles off its own 
recursive treewalk into supporting the option of flat-list-children + filter. 
But while you can get some great speedups in some layouts, you can get 
pathological collapses in perf elsewhere, which makes the people running those 
queries very sad. So I gave up.
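
For a single-level pattern, the flat-list-children + filter approach looks
roughly like this (a sketch only; the bucket and pattern are made up):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, GlobFilter, Path}

// Instead of letting globStatus() walk the tree, list the parent directory
// once and filter the children client-side against the glob pattern.
val parent = new Path("s3a://some-bucket/data/")
val fs: FileSystem = parent.getFileSystem(new Configuration())
val filter = new GlobFilter("part-2019-12-*")
val matched = fs.listStatus(parent).filter(status => filter.accept(status.getPath))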

Parallelized scans can deliver a speedup; look at the code in 
org.apache.hadoop.mapred.LocatedFileStatusFetcher to see what it does there. 
I've only just started exploring what we can do to tune that, with 
HADOOP-16458 and HADOOP-16465, which should speed up ORC/Parquet scans.

Re: Is it feasible to build and run Spark on Windows?

2019-12-09 Thread Ping Liu
Super.  Thanks Deepak!

On Mon, Dec 9, 2019 at 6:58 PM Deepak Vohra  wrote:

> Please install Apache Spark on Windows as discussed in "Apache Spark on
> Windows - DZone Open Source". The article explains and provides solutions
> for some of the most common errors developers come across when installing.
>
>
> On Monday, December 9, 2019, 11:27:53 p.m. UTC, Ping Liu <
> pingpinga...@gmail.com> wrote:
>
>
> Thanks Deepak!  Yes, I want to try it with Docker.  But my AWS account ran
> out of its free period.  Is there a shared EC2 instance for Spark that we can use for
> free?
>
> Ping
>
>
> On Monday, December 9, 2019, Deepak Vohra  wrote:
> > Haven't tested, but the general procedure is to exclude all guava
> dependencies that are not needed. The hadoop-common dependency does not have
> a dependency on guava according to Maven Repository: org.apache.hadoop »
> hadoop-common.
> >
> > Apache Spark 2.4 has a dependency on guava 14.
> > If a Docker image for Cloudera Hadoop is used, Spark may be installed
> on Docker for Windows.
> > For Docker on Windows on EC2, refer to Getting Started with Docker for
> Windows - Developer.com (Docker for Windows makes it feasible to run a
> Docker daemon on Windows Server 2016).
> >
> >
> > Conflicting versions are not an issue if Docker is used.
> > "Apache Spark applications usually have a complex set of required
> software dependencies. Spark applications may require specific versions of
> these dependencies (such as Pyspark and R) on the Spark executor hosts,
> sometimes with conflicting versions."
> > Running Spark in Docker Containers on YARN
> >
> >
> > On Monday, December 9, 2019, 08:37:47 p.m. UTC, Ping Liu <
> pingpinga...@gmail.com> wrote:
> >
> > Hi Deepak,
> > I tried it.  Unfortunately, it still doesn't work.  28.1-jre isn't
> downloaded for some reason.  I'll try something else.  Thank you very much for
> your help!
> > Ping
> >
> > On Fri, Dec 6, 2019 at 5:28 PM Deepak Vohra  wrote:
> >
> >  As multiple guava versions are found exclude guava from all the
> dependecies it could have been downloaded with, and explicitly add a recent
> guava version.
> > <dependency>
> >   <groupId>org.apache.hadoop</groupId>
> >   <artifactId>hadoop-common</artifactId>
> >   <version>3.2.1</version>
> >   <exclusions>
> >     <exclusion>
> >       <groupId>com.google.guava</groupId>
> >       <artifactId>guava</artifactId>
> >     </exclusion>
> >   </exclusions>
> > </dependency>
> >
> > <dependency>
> >   <groupId>com.google.guava</groupId>
> >   <artifactId>guava</artifactId>
> >   <version>28.1-jre</version>
> > </dependency>
> >
> > On Friday, December 6, 2019, 10:12:55 p.m. UTC, Ping Liu <
> pingpinga...@gmail.com> wrote:
> >
> > Hi Deepak,
> > Following your suggestion, I put exclusion of guava in topmost POM
> (under Spark home directly) as follows.
> > 2227-
> > 2228-  <dependency>
> > 2229-    <groupId>org.apache.hadoop</groupId>
> > 2230:    <artifactId>hadoop-common</artifactId>
> > 2231-    <version>3.2.1</version>
> > 2232-    <exclusions>
> > 2233-      <exclusion>
> > 2234-        <groupId>com.google.guava</groupId>
> > 2235-        <artifactId>guava</artifactId>
> > 2236-      </exclusion>
> > 2237-    </exclusions>
> > 2238-  </dependency>
> > 2239-
> > 2240-
> > I also set properties for spark.executor.userClassPathFirst=true and
> spark.driver.userClassPathFirst=true
> > D:\apache\spark>mvn -Pyarn -Phadoop-3.2 -Dhadoop-version=3.2.1
> -Dspark.executor.userClassPathFirst=true
> -Dspark.driver.userClassPathFirst=true -DskipTests clean package
> > and rebuilt spark.
> > But I got the same error when running spark-shell.
> >
> > [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
> > [INFO]
> > [INFO] Spark Project Parent POM ... SUCCESS [
> 25.092 s]
> > [INFO] Spark Project Tags . SUCCESS [
> 22.093 s]
> > [INFO] Spark Project Sketch ... SUCCESS [
> 19.546 s]
> > [INFO] Spark Project Local DB . SUCCESS [
> 10.468 s]
> > [INFO] Spark Project Networking ... SUCCESS [
> 17.733 s]
> > [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>  6.531 s]
> > [INFO] Spark Project Unsafe ... SUCCESS [
> 25.327 s]
> > [INFO] Spark Project Launcher . SUCCESS [
> 27.264 s]
> > [INFO] Spark Project Core . SUCCESS
> [07:59 min]
> > [INFO] Spark Project ML Local Library . SUCCESS
> [01:39 min]
> > [INFO] Spark Project GraphX ... SUCCESS
> [02:08 min]
> > [INFO] Spark Project Streaming  SUCCESS
> [02:56 min]
> > [INFO] Spark Project Catalyst . SUCCESS
> [08:55 min]
> > [INFO] Spark Project SQL .. SUCCESS
> [12:33 min]
> > 

Re: Spark 3.0 preview release 2?

2019-12-09 Thread Matei Zaharia
Yup, it would be great to release these more often.

> On Dec 9, 2019, at 4:25 PM, Takeshi Yamamuro  wrote:
> 
> +1; looks great if we can, in terms of users' feedback.
> 
> Bests,
> Takeshi
> 
> On Tue, Dec 10, 2019 at 3:14 AM Dongjoon Hyun wrote:
> Thank you, All.
> 
> +1 for another `3.0-preview`.
> 
> Also, thank you Yuming for volunteering for that!
> 
> Bests,
> Dongjoon.
> 
> 
> On Mon, Dec 9, 2019 at 9:39 AM Xiao Li wrote:
> When entering the official release candidates, the new features have to be 
> disabled or even reverted [if the conf is not available] if the fixes are not 
> trivial; otherwise, we might need 10+ RCs to make the final release. The new 
> features should not block the release based on the previous discussions. 
> 
> I agree we should have code freeze at the beginning of 2020. The preview 
> releases should not block the official releases. The preview is just to 
> collect more feedback about these new features or behavior changes.
> 
> Also, for the release of Spark 3.0, we still need the Hive community to do us 
> a favor to release 2.3.7 for having HIVE-22190. Before asking Hive 
> community to do 2.3.7 release, if possible, we want our Spark community to 
> have more tries, especially the support of JDK 11 on Hadoop 2.7 and 3.2, 
> which is based on Hive 2.3 execution JAR. During the preview stage, we might 
> find more issues that are not covered by our test cases.
> 
>  
> 
> On Mon, Dec 9, 2019 at 4:55 AM Sean Owen wrote:
> Seems fine to me of course. Honestly that wouldn't be a bad result for
> a release candidate, though we would probably roll another one now.
> How about simply moving to a release candidate? If not now then at
> least move to code freeze from the start of 2020. There is also some
> downside in pushing out the 3.0 release further with previews.
> 
> On Mon, Dec 9, 2019 at 12:32 AM Xiao Li wrote:
> >
> > I got a lot of great feedback from the community about the recent 3.0 preview 
> > release. Since the last 3.0 preview release, we already have 353 commits 
> > [https://github.com/apache/spark/compare/v3.0.0-preview...master]. There 
> > are various important features and behavior changes we want the community 
> > to try before entering the official release candidates of Spark 3.0.
> >
> >
> > Below are the items I selected that are not part of the last 3.0 preview but 
> > already available in the upstream master branch:
> >
> > Support JDK 11 with Hadoop 2.7
> > Spark SQL will respect its own default format (i.e., parquet) when users do 
> > CREATE TABLE without USING or STORED AS clauses
> > Enable Parquet nested schema pruning and nested pruning on expressions by 
> > default
> > Add observable Metrics for Streaming queries
> > Column pruning through nondeterministic expressions
> > RecordBinaryComparator should check endianness when compared by long
> > Improve parallelism for local shuffle reader in adaptive query execution
> > Upgrade Apache Arrow to version 0.15.1
> > Various interval-related SQL support
> > Add a mode to pin Python thread into JVM's
> > Provide option to clean up completed files in streaming query
> >
> > I am wondering if we can have another preview release for Spark 3.0? This 
> > can help us find the design/API defects as early as possible and avoid the 
> > significant delay of the upcoming Spark 3.0 release
> >
> >
> > Also, is any committer willing to volunteer as the release manager of the 
> > next preview release of Spark 3.0, if we have such a release?
> >
> >
> > Cheers,
> >
> >
> > Xiao
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> 
> 
> 
> 
> -- 
>   
> 
> 
> -- 
> ---
> Takeshi Yamamuro



Re: Spark 3.0 preview release 2?

2019-12-09 Thread Takeshi Yamamuro
+1; looks great if we can, in terms of users' feedback.

Bests,
Takeshi

On Tue, Dec 10, 2019 at 3:14 AM Dongjoon Hyun 
wrote:

> Thank you, All.
>
> +1 for another `3.0-preview`.
>
> Also, thank you Yuming for volunteering for that!
>
> Bests,
> Dongjoon.
>
>
> On Mon, Dec 9, 2019 at 9:39 AM Xiao Li  wrote:
>
>> When entering the official release candidates, the new features have to
>> be disabled or even reverted [if the conf is not available] if the fixes
>> are not trivial; otherwise, we might need 10+ RCs to make the final
>> release. The new features should not block the release based on the
>> previous discussions.
>>
>> I agree we should have code freeze at the beginning of 2020. The preview
>> releases should not block the official releases. The preview is just to
>> collect more feedback about these new features or behavior changes.
>>
>> Also, for the release of Spark 3.0, we still need the Hive community to
>> do us a favor to release 2.3.7 for having HIVE-22190. Before asking Hive
>> community to do 2.3.7 release, if possible, we want our Spark community to
>> have more tries, especially the support of JDK 11 on Hadoop 2.7 and 3.2,
>> which is based on Hive 2.3 execution JAR. During the preview stage, we
>> might find more issues that are not covered by our test cases.
>>
>>
>>
>> On Mon, Dec 9, 2019 at 4:55 AM Sean Owen  wrote:
>>
>>> Seems fine to me of course. Honestly that wouldn't be a bad result for
>>> a release candidate, though we would probably roll another one now.
>>> How about simply moving to a release candidate? If not now then at
>>> least move to code freeze from the start of 2020. There is also some
>>> downside in pushing out the 3.0 release further with previews.
>>>
>>> On Mon, Dec 9, 2019 at 12:32 AM Xiao Li  wrote:
>>> >
>>> > I got a lot of great feedback from the community about the recent 3.0
>>> preview release. Since the last 3.0 preview release, we already have 353
>>> commits [https://github.com/apache/spark/compare/v3.0.0-preview...master].
>>> There are various important features and behavior changes we want the
>>> community to try before entering the official release candidates of Spark
>>> 3.0.
>>> >
>>> >
>>> > Below are the items I selected that are not part of the last 3.0 preview
>>> but already available in the upstream master branch:
>>> >
>>> > Support JDK 11 with Hadoop 2.7
>>> > Spark SQL will respect its own default format (i.e., parquet) when
>>> users do CREATE TABLE without USING or STORED AS clauses
>>> > Enable Parquet nested schema pruning and nested pruning on expressions
>>> by default
>>> > Add observable Metrics for Streaming queries
>>> > Column pruning through nondeterministic expressions
>>> > RecordBinaryComparator should check endianness when compared by long
>>> > Improve parallelism for local shuffle reader in adaptive query
>>> execution
>>> > Upgrade Apache Arrow to version 0.15.1
>>> > Various interval-related SQL support
>>> > Add a mode to pin Python thread into JVM's
>>> > Provide option to clean up completed files in streaming query
>>> >
>>> > I am wondering if we can have another preview release for Spark 3.0?
>>> This can help us find the design/API defects as early as possible and avoid
>>> the significant delay of the upcoming Spark 3.0 release
>>> >
>>> >
>>> > Also, is any committer willing to volunteer as the release manager of
>>> the next preview release of Spark 3.0, if we have such a release?
>>> >
>>> >
>>> > Cheers,
>>> >
>>> >
>>> > Xiao
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> [image: Databricks Summit - Watch the talks]
>> 
>>
>

-- 
---
Takeshi Yamamuro


Re: Is it feasible to build and run Spark on Windows?

2019-12-09 Thread Ping Liu
Thanks Deepak!  Yes, I want to try it with Docker.  But my AWS account ran
out of its free period.  Is there a shared EC2 instance for Spark that we can use for
free?

Ping


On Monday, December 9, 2019, Deepak Vohra  wrote:
> Haven't tested, but the general procedure is to exclude all guava
dependencies that are not needed. The hadoop-common dependency does not have
a dependency on guava according to Maven Repository: org.apache.hadoop »
hadoop-common.
>
>
> Apache Spark 2.4 has a dependency on guava 14.
> If a Docker image for Cloudera Hadoop is used, Spark may be installed
on Docker for Windows.
> For Docker on Windows on EC2, refer to Getting Started with Docker for
Windows - Developer.com (Docker for Windows makes it feasible to run a
Docker daemon on Windows Server 2016).
>
>
> Conflicting versions are not an issue if Docker is used.
> "Apache Spark applications usually have a complex set of required
software dependencies. Spark applications may require specific versions of
these dependencies (such as Pyspark and R) on the Spark executor hosts,
sometimes with conflicting versions."
> Running Spark in Docker Containers on YARN
>
>
> On Monday, December 9, 2019, 08:37:47 p.m. UTC, Ping Liu <
pingpinga...@gmail.com> wrote:
>
> Hi Deepak,
> I tried it.  Unfortunately, it still doesn't work.  28.1-jre isn't
downloaded for some reason.  I'll try something else.  Thank you very much for
your help!
> Ping
>
> On Fri, Dec 6, 2019 at 5:28 PM Deepak Vohra  wrote:
>
>  As multiple guava versions are found exclude guava from all the
dependencies it could have been downloaded with, and explicitly add a recent
guava version.
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-common</artifactId>
>   <version>3.2.1</version>
>   <exclusions>
>     <exclusion>
>       <groupId>com.google.guava</groupId>
>       <artifactId>guava</artifactId>
>     </exclusion>
>   </exclusions>
> </dependency>
>
> <dependency>
>   <groupId>com.google.guava</groupId>
>   <artifactId>guava</artifactId>
>   <version>28.1-jre</version>
> </dependency>
>
> On Friday, December 6, 2019, 10:12:55 p.m. UTC, Ping Liu <
pingpinga...@gmail.com> wrote:
>
> Hi Deepak,
> Following your suggestion, I put exclusion of guava in topmost POM (under
Spark home directly) as follows.
> 2227-
> 2228-  <dependency>
> 2229-    <groupId>org.apache.hadoop</groupId>
> 2230:    <artifactId>hadoop-common</artifactId>
> 2231-    <version>3.2.1</version>
> 2232-    <exclusions>
> 2233-      <exclusion>
> 2234-        <groupId>com.google.guava</groupId>
> 2235-        <artifactId>guava</artifactId>
> 2236-      </exclusion>
> 2237-    </exclusions>
> 2238-  </dependency>
> 2239-
> 2240-
> I also set properties for spark.executor.userClassPathFirst=true and
spark.driver.userClassPathFirst=true
> D:\apache\spark>mvn -Pyarn -Phadoop-3.2 -Dhadoop-version=3.2.1
-Dspark.executor.userClassPathFirst=true
-Dspark.driver.userClassPathFirst=true -DskipTests clean package
> and rebuilt spark.
> But I got the same error when running spark-shell.
>
> [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [
25.092 s]
> [INFO] Spark Project Tags . SUCCESS [
22.093 s]
> [INFO] Spark Project Sketch ... SUCCESS [
19.546 s]
> [INFO] Spark Project Local DB . SUCCESS [
10.468 s]
> [INFO] Spark Project Networking ... SUCCESS [
17.733 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
 6.531 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
25.327 s]
> [INFO] Spark Project Launcher . SUCCESS [
27.264 s]
> [INFO] Spark Project Core . SUCCESS
[07:59 min]
> [INFO] Spark Project ML Local Library . SUCCESS
[01:39 min]
> [INFO] Spark Project GraphX ... SUCCESS
[02:08 min]
> [INFO] Spark Project Streaming  SUCCESS
[02:56 min]
> [INFO] Spark Project Catalyst . SUCCESS
[08:55 min]
> [INFO] Spark Project SQL .. SUCCESS
[12:33 min]
> [INFO] Spark Project ML Library ... SUCCESS
[08:49 min]
> [INFO] Spark Project Tools  SUCCESS [
16.967 s]
> [INFO] Spark Project Hive . SUCCESS
[06:15 min]
> [INFO] Spark Project Graph API  SUCCESS [
10.219 s]
> [INFO] Spark Project Cypher ... SUCCESS [
11.952 s]
> [INFO] Spark Project Graph  SUCCESS [
11.171 s]
> [INFO] Spark Project REPL . SUCCESS [
55.029 s]
> [INFO] Spark Project YARN Shuffle Service . SUCCESS
[01:07 min]
> [INFO] Spark Project YARN . SUCCESS
[02:22 min]
> [INFO] Spark Project Assembly . SUCCESS [
21.483 s]
> [INFO] Kafka 0.10+ Token 

Re: SQL test failures in PR builder?

2019-12-09 Thread Shane Knapp
Yeah, totally weird.

I'm actually going to take this moment and clean up the build scripts
for both of these jobs.  There's a lot of years-old cruft that I'll
delete to make things more readable.

On Sun, Dec 8, 2019 at 7:50 PM Sean Owen  wrote:
>
> Hm, so they look pretty similar except for minor differences in the
> actual script run. Is there any reason this should be different? Would
> it be reasonable to try making the 'new' one work like the 'old' one
> if the former isn't working?
>
> But I still can't figure out why it causes the same odd error every
> time on this one PR, which is a minor change to tooltips in the UI. I
> haven't seen other manually-triggered PR builds fail this way. Really
> mysterious so far!
>
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4964/testReport/
>
>
> Old:
>
> #!/bin/bash
>
> set -e  # fail on any non-zero exit code
> set -x
>
> export AMPLAB_JENKINS=1
> export PATH="$PATH:/home/anaconda/envs/py3k/bin"
>
> # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT
> incremental compiler seems to
> # ignore our JAVA_HOME and use the system javac instead.
> export PATH="$JAVA_HOME/bin:$PATH"
>
> # Add a pre-downloaded version of Maven to the path so that we avoid
> the flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
>
> echo "fixing target dir permissions"
> chmod -R +w target/* || true  # stupid hack by sknapp to ensure that
> the chmod always exits w/0 and doesn't bork the script
>
> echo "running git clean -fdx"
> git clean -fdx
>
> # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention
> export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER"
> mkdir -p "$HOME"
> export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2"
> export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2"
>
>
> ./dev/run-tests-jenkins
>
>
> # Hack to ensure that at least one JVM suite always runs in order to
> prevent spurious errors from the
> # Jenkins JUnit test reporter plugin
> ./build/sbt unsafe/test > /dev/null 2>&1
>
>
>
> New:
>
> #!/bin/bash
>
> set -e
> export AMPLAB_JENKINS=1
> export PATH="$PATH:/home/anaconda/envs/py3k/bin"
> git clean -fdx
>
> # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT
> incremental compiler seems to
> # ignore our JAVA_HOME and use the system javac instead.
> export PATH="$JAVA_HOME/bin:$PATH"
>
> # Add a pre-downloaded version of Maven to the path so that we avoid
> the flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
>
> # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention
> export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER"
> mkdir -p "$HOME"
> export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2"
> export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2"
>
> # This is required for tests of backport patches.
> # We need to download the run-tests-codes.sh file because it's
> imported by run-tests-jenkins.
> # When running tests on branch-1.0 (and earlier), the older version of
> run-tests won't set CURRENT_BLOCK, so
> # the Jenkins scripts will report all failures as "some tests failed"
> rather than a more specific
> # error message.
> if [ ! -f "dev/run-tests-jenkins" ]; then
>   wget 
> https://raw.githubusercontent.com/apache/spark/master/dev/run-tests-jenkins
>   wget 
> https://raw.githubusercontent.com/apache/spark/master/dev/run-tests-codes.sh
>   mv run-tests-jenkins dev/
>   mv run-tests-codes.sh dev/
>   chmod 755 dev/run-tests-jenkins
>   chmod 755 dev/run-tests-codes.sh
> fi
>
> ./dev/run-tests-jenkins
>
>
> On Wed, Dec 4, 2019 at 5:53 PM Shane Knapp  wrote:
> >
> > ++yin huai for more insight in to the NewSparkPullRequestBuilder job...
> >
> > tbh, i never (or still) really understand the exact use for that job,
> > except that it's triggered by https://spark-prs.appspot.com/
> >
> > shane
> >
> >
> > On Wed, Dec 4, 2019 at 3:34 PM Sean Owen  wrote:
> > >
> > > BTW does anyone know why there are two PR builder jobs? I'm confused
> > > about why different ones would execute.
> > >
> > > Yes I see NewSparkPullRequestBuilder failing on a variety of PRs.
> > > I don't think it has anything to do with Hive; these PRs touch
> > > different parts of code but all not related to this failure.
> > >
> > > On Wed, Dec 4, 2019 at 12:40 PM Dongjoon Hyun  
> > > wrote:
> > > >
> > > > Hi, Sean.
> > > >
> > > > It seems that there is no failure on your other SQL PR.
> > > >
> > > > https://github.com/apache/spark/pull/26748
> > > >
> > > > Does the sequential failure happen only at `NewSparkPullRequestBuilder`?
> > > > Since `NewSparkPullRequestBuilder` is not the same with 
> > > > `SparkPullRequestBuilder`,
> > > > there might be a root cause inside it if it happens only at 
> > > > `NewSparkPullRequestBuilder`.
> > > >
> > > > For `org.apache.hive.service.ServiceException: Failed to Start 
> > 

Re: Is it feasible to build and run Spark on Windows?

2019-12-09 Thread Ping Liu
Hi Deepak,

I tried it.  Unfortunately, it still doesn't work.  28.1-jre isn't
downloaded for some reason.  I'll try something else.  Thank you very much for
your help!

Ping


On Fri, Dec 6, 2019 at 5:28 PM Deepak Vohra  wrote:

>  As multiple guava versions are found exclude guava from all the
> dependecies it could have been downloaded with, and explicitly add a recent
> guava version.
>
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-common</artifactId>
>   <version>3.2.1</version>
>   <exclusions>
>     <exclusion>
>       <groupId>com.google.guava</groupId>
>       <artifactId>guava</artifactId>
>     </exclusion>
>   </exclusions>
> </dependency>
>
> <dependency>
>   <groupId>com.google.guava</groupId>
>   <artifactId>guava</artifactId>
>   <version>28.1-jre</version>
> </dependency>
>
>
> On Friday, December 6, 2019, 10:12:55 p.m. UTC, Ping Liu <
> pingpinga...@gmail.com> wrote:
>
>
> Hi Deepak,
>
> Following your suggestion, I put exclusion of guava in topmost POM (under
> Spark home directly) as follows.
>
> 2227-
> 2228-  <dependency>
> 2229-    <groupId>org.apache.hadoop</groupId>
> 2230:    <artifactId>hadoop-common</artifactId>
> 2231-    <version>3.2.1</version>
> 2232-    <exclusions>
> 2233-      <exclusion>
> 2234-        <groupId>com.google.guava</groupId>
> 2235-        <artifactId>guava</artifactId>
> 2236-      </exclusion>
> 2237-    </exclusions>
> 2238-  </dependency>
> 2239-
> 2240-
>
> I also set properties for spark.executor.userClassPathFirst=true and
> spark.driver.userClassPathFirst=true
>
> D:\apache\spark>mvn -Pyarn -Phadoop-3.2 -Dhadoop-version=3.2.1
> -Dspark.executor.userClassPathFirst=true
> -Dspark.driver.userClassPathFirst=true -DskipTests clean package
>
> and rebuilt spark.
>
> But I got the same error when running spark-shell.
>
> [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [
> 25.092 s]
> [INFO] Spark Project Tags . SUCCESS [
> 22.093 s]
> [INFO] Spark Project Sketch ... SUCCESS [
> 19.546 s]
> [INFO] Spark Project Local DB . SUCCESS [
> 10.468 s]
> [INFO] Spark Project Networking ... SUCCESS [
> 17.733 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>  6.531 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
> 25.327 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 27.264 s]
> [INFO] Spark Project Core . SUCCESS [07:59
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [01:39
> min]
> [INFO] Spark Project GraphX ... SUCCESS [02:08
> min]
> [INFO] Spark Project Streaming  SUCCESS [02:56
> min]
> [INFO] Spark Project Catalyst . SUCCESS [08:55
> min]
> [INFO] Spark Project SQL .. SUCCESS [12:33
> min]
> [INFO] Spark Project ML Library ... SUCCESS [08:49
> min]
> [INFO] Spark Project Tools  SUCCESS [
> 16.967 s]
> [INFO] Spark Project Hive . SUCCESS [06:15
> min]
> [INFO] Spark Project Graph API  SUCCESS [
> 10.219 s]
> [INFO] Spark Project Cypher ... SUCCESS [
> 11.952 s]
> [INFO] Spark Project Graph  SUCCESS [
> 11.171 s]
> [INFO] Spark Project REPL . SUCCESS [
> 55.029 s]
> [INFO] Spark Project YARN Shuffle Service . SUCCESS [01:07
> min]
> [INFO] Spark Project YARN . SUCCESS [02:22
> min]
> [INFO] Spark Project Assembly . SUCCESS [
> 21.483 s]
> [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [
> 56.450 s]
> [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:21
> min]
> [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS [02:33
> min]
> [INFO] Spark Project Examples . SUCCESS [02:05
> min]
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [
> 30.780 s]
> [INFO] Spark Avro . SUCCESS [01:43
> min]
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time:  01:08 h
> [INFO] Finished at: 2019-12-06T11:43:08-08:00
> [INFO]
> 
>
> D:\apache\spark>spark-shell
> 'spark-shell' is not recognized as an internal or external command,
> operable program or batch file.
>
> D:\apache\spark>cd bin
>
> D:\apache\spark\bin>spark-shell
> Exception in thread "main" java.lang.NoSuchMethodError:
> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
> at
> org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
> at
> org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
> at
> 

Re: Release Apache Spark 2.4.5 and 2.4.6

2019-12-09 Thread Sean Owen
Sure, seems fine. The release cadence slows down in a branch over time
as there is probably less to fix, so Jan-Feb 2020 for 2.4.5 and
something like mid-year or Q3 2020 for 2.4.6 is a reasonable
expectation. It might plausibly be the last 2.4.x release but who
knows.

On Mon, Dec 9, 2019 at 12:29 PM Dongjoon Hyun  wrote:
>
> Hi, All.
>
> Along with the discussion on 3.0.0, I'd like to discuss the next 
> releases on `branch-2.4`.
>
> As we know, `branch-2.4` is our LTS branch, and there are also some 
> questions about the release plans. More releases are important not only for the 
> latest K8s version support, but also for delivering important bug fixes 
> regularly (at least until 3.x becomes dominant.)
>
> In short, I'd like to propose the followings.
>
> 1. Apache Spark 2.4.5 release (2020 January)
> 2. Apache Spark 2.4.6 release (2020 July)
>
> Of course, we can adjust the schedule.
> This aims to have a pre-defined cadence in order to give release managers time to 
> prepare.
>
> Bests,
> Dongjoon.
>
> PS. As of now, `branch-2.4` has 135 additional patches after `2.4.4`.
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Release Apache Spark 2.4.5 and 2.4.6

2019-12-09 Thread Dongjoon Hyun
Hi, All.

Along with the discussion on 3.0.0, I'd like to discuss the next
releases on `branch-2.4`.

As we know, `branch-2.4` is our LTS branch, and there are also some
questions about the release plans. More releases are important not only for
the latest K8s version support, but also for delivering important bug fixes
regularly (at least until 3.x becomes dominant.)

In short, I'd like to propose the followings.

1. Apache Spark 2.4.5 release (2020 January)
2. Apache Spark 2.4.6 release (2020 July)

Of course, we can adjust the schedule.
This aims to have a pre-defined cadence in order to give release managers
time to prepare.

Bests,
Dongjoon.

PS. As of now, `branch-2.4` has 135 additional patches after `2.4.4`.


Re: Spark 3.0 preview release 2?

2019-12-09 Thread Dongjoon Hyun
Thank you, All.

+1 for another `3.0-preview`.

Also, thank you Yuming for volunteering for that!

Bests,
Dongjoon.


On Mon, Dec 9, 2019 at 9:39 AM Xiao Li  wrote:

> When entering the official release candidates, the new features have to be
> disabled or even reverted [if the conf is not available] if the fixes are
> not trivial; otherwise, we might need 10+ RCs to make the final release.
> The new features should not block the release based on the previous
> discussions.
>
> I agree we should have code freeze at the beginning of 2020. The preview
> releases should not block the official releases. The preview is just to
> collect more feedback about these new features or behavior changes.
>
> Also, for the release of Spark 3.0, we still need the Hive community to do
> us a favor to release 2.3.7 for having HIVE-22190. Before asking Hive
> community to do 2.3.7 release, if possible, we want our Spark community to
> have more tries, especially the support of JDK 11 on Hadoop 2.7 and 3.2,
> which is based on Hive 2.3 execution JAR. During the preview stage, we
> might find more issues that are not covered by our test cases.
>
>
>
> On Mon, Dec 9, 2019 at 4:55 AM Sean Owen  wrote:
>
>> Seems fine to me of course. Honestly that wouldn't be a bad result for
>> a release candidate, though we would probably roll another one now.
>> How about simply moving to a release candidate? If not now then at
>> least move to code freeze from the start of 2020. There is also some
>> downside in pushing out the 3.0 release further with previews.
>>
>> On Mon, Dec 9, 2019 at 12:32 AM Xiao Li  wrote:
>> >
>> > I got a lot of great feedback from the community about the recent 3.0
>> preview release. Since the last 3.0 preview release, we already have 353
>> commits [https://github.com/apache/spark/compare/v3.0.0-preview...master].
>> There are various important features and behavior changes we want the
>> community to try before entering the official release candidates of Spark
>> 3.0.
>> >
>> >
>> > Below are the items I selected that are not part of the last 3.0 preview
>> but already available in the upstream master branch:
>> >
>> > Support JDK 11 with Hadoop 2.7
>> > Spark SQL will respect its own default format (i.e., parquet) when
>> users do CREATE TABLE without USING or STORED AS clauses
>> > Enable Parquet nested schema pruning and nested pruning on expressions
>> by default
>> > Add observable Metrics for Streaming queries
>> > Column pruning through nondeterministic expressions
>> > RecordBinaryComparator should check endianness when compared by long
>> > Improve parallelism for local shuffle reader in adaptive query execution
>> > Upgrade Apache Arrow to version 0.15.1
>> > Various interval-related SQL support
>> > Add a mode to pin Python thread into JVM's
>> > Provide option to clean up completed files in streaming query
>> >
>> > I am wondering if we can have another preview release for Spark 3.0?
>> This can help us find the design/API defects as early as possible and avoid
>> the significant delay of the upcoming Spark 3.0 release
>> >
>> >
>> > Also, is any committer willing to volunteer as the release manager of
>> the next preview release of Spark 3.0, if we have such a release?
>> >
>> >
>> > Cheers,
>> >
>> >
>> > Xiao
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> [image: Databricks Summit - Watch the talks]
> 
>


Re: Spark 3.0 preview release 2?

2019-12-09 Thread Xiao Li
When entering the official release candidates, the new features have to be
disabled or even reverted [if the conf is not available] if the fixes are
not trivial; otherwise, we might need 10+ RCs to make the final release.
The new features should not block the release based on the previous
discussions.

I agree we should have code freeze at the beginning of 2020. The preview
releases should not block the official releases. The preview is just to
collect more feedback about these new features or behavior changes.

Also, for the release of Spark 3.0, we still need the Hive community to do
us a favor to release 2.3.7 for having HIVE-22190. Before asking Hive
community to do 2.3.7 release, if possible, we want our Spark community to
have more tries, especially the support of JDK 11 on Hadoop 2.7 and 3.2,
which is based on Hive 2.3 execution JAR. During the preview stage, we
might find more issues that are not covered by our test cases.



On Mon, Dec 9, 2019 at 4:55 AM Sean Owen  wrote:

> Seems fine to me of course. Honestly that wouldn't be a bad result for
> a release candidate, though we would probably roll another one now.
> How about simply moving to a release candidate? If not now then at
> least move to code freeze from the start of 2020. There is also some
> downside in pushing out the 3.0 release further with previews.
>
> On Mon, Dec 9, 2019 at 12:32 AM Xiao Li  wrote:
> >
> > I got a lot of great feedback from the community about the recent 3.0
> preview release. Since the last 3.0 preview release, we already have 353
> commits [https://github.com/apache/spark/compare/v3.0.0-preview...master].
> There are various important features and behavior changes we want the
> community to try before entering the official release candidates of Spark
> 3.0.
> >
> >
> > Below are the items I selected that are not part of the last 3.0 preview but
> already available in the upstream master branch:
> >
> > Support JDK 11 with Hadoop 2.7
> > Spark SQL will respect its own default format (i.e., parquet) when users
> do CREATE TABLE without USING or STORED AS clauses
> > Enable Parquet nested schema pruning and nested pruning on expressions
> by default
> > Add observable Metrics for Streaming queries
> > Column pruning through nondeterministic expressions
> > RecordBinaryComparator should check endianness when compared by long
> > Improve parallelism for local shuffle reader in adaptive query execution
> > Upgrade Apache Arrow to version 0.15.1
> > Various interval-related SQL support
> > Add a mode to pin Python thread into JVM's
> > Provide option to clean up completed files in streaming query
> >
> > I am wondering if we can have another preview release for Spark 3.0?
> This can help us find the design/API defects as early as possible and avoid
> the significant delay of the upcoming Spark 3.0 release
> >
> >
> > Also, is any committer willing to volunteer as the release manager of
> the next preview release of Spark 3.0, if we have such a release?
> >
> >
> > Cheers,
> >
> >
> > Xiao
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
[image: Databricks Summit - Watch the talks]



Re: Next DSv2 sync date

2019-12-09 Thread Ryan Blue
Actually, my conflict was cancelled so I'll send out the usual invite for
Wednesday. Sorry for the noise.

On Sun, Dec 8, 2019 at 3:15 PM Ryan Blue  wrote:

> Hi everyone,
>
> I have a conflict with the normal DSv2 sync time this Wednesday and I'd
> like to attend to talk about the TableProvider API.
>
> Would it work for everyone to have the sync at 6PM PST on Tuesday, 10
> December instead? I could also make it at the normal time on Thursday.
>
> Thanks,
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: Spark 3.0 preview release 2?

2019-12-09 Thread Sean Owen
Seems fine to me of course. Honestly that wouldn't be a bad result for
a release candidate, though we would probably roll another one now.
How about simply moving to a release candidate? If not now then at
least move to code freeze from the start of 2020. There is also some
downside in pushing out the 3.0 release further with previews.

On Mon, Dec 9, 2019 at 12:32 AM Xiao Li  wrote:
>
> I got a lot of great feedback from the community about the recent 3.0 preview 
> release. Since the last 3.0 preview release, we already have 353 commits 
> [https://github.com/apache/spark/compare/v3.0.0-preview...master]. There are 
> various important features and behavior changes we want the community to try 
> before entering the official release candidates of Spark 3.0.
>
>
> Below are the items I selected that are not part of the last 3.0 preview but 
> already available in the upstream master branch:
>
> Support JDK 11 with Hadoop 2.7
> Spark SQL will respect its own default format (i.e., parquet) when users do 
> CREATE TABLE without USING or STORED AS clauses
> Enable Parquet nested schema pruning and nested pruning on expressions by 
> default
> Add observable Metrics for Streaming queries
> Column pruning through nondeterministic expressions
> RecordBinaryComparator should check endianness when compared by long
> Improve parallelism for local shuffle reader in adaptive query execution
> Upgrade Apache Arrow to version 0.15.1
> Various interval-related SQL support
> Add a mode to pin Python thread into JVM's
> Provide option to clean up completed files in streaming query
>
> I am wondering if we can have another preview release for Spark 3.0? This can 
> help us find the design/API defects as early as possible and avoid the 
> significant delay of the upcoming Spark 3.0 release
>
>
> > Also, is any committer willing to volunteer as the release manager of the 
> next preview release of Spark 3.0, if we have such a release?
>
>
> Cheers,
>
>
> Xiao

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org