Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread Steve Loughran
There's a new parquet RC up this week which would be good to pull in. On Thu, 2 May 2024 at 03:20, Jungtaek Lim wrote: > +1 love to see it! > > On Thu, May 2, 2024 at 10:08 AM Holden Karau > wrote: > >> +1 :) yay previews >> >> On Wed, May 1, 2024 at 5:36 PM Chao Sun wrote: >> >>> +1 >>> >>>

Re: Which version of spark version supports parquet version 2 ?

2024-04-19 Thread Steve Loughran
Those are some quite good improvements, but committing to storing all your data in an unstable format is, well, "bold". For temporary data as part of a workflow, though, it could be appealing. Now, assuming you are going to be working with s3, you might want to start with merging PARQUET-2117 into
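For the "temporary data" use case, a minimal sketch of scoping the v2 page format to one intermediate dataset. The `parquet.writer.version` key and its `v1`/`v2` values are parquet-mr's knob, not anything stated in this thread; verify both against the Parquet release on your classpath before relying on them.

```scala
// Sketch (assumptions flagged above): opt only throwaway intermediate data
// into the Parquet v2 writer, keeping long-lived datasets on the stable v1.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-v2-temp").getOrCreate()

// parquet-mr reads this from the Hadoop configuration of the write job.
spark.sparkContext.hadoopConfiguration.set("parquet.writer.version", "v2")

spark.range(1000).toDF("id")
  .write
  .mode("overwrite")
  .parquet("/tmp/workflow-stage-1") // temporary workflow output only
```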

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Steve Loughran
ASF will be unhappy about this. And Stack Overflow exists. Otherwise: Apache Confluence and LinkedIn exist; LinkedIn is the option I'd point at. On Mon, 18 Mar 2024 at 10:59, Mich Talebzadeh wrote: > Some of you may be aware that Databricks community Home | Databricks > have just launched a knowledge

Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-11 Thread Steve Loughran
I consider the context info as more important than just logging; at the hadoop level we do it to attach things like task/job IDs, kerberos principals etc. to all store requests. https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/auditing.html So worrying about how to pass and manage that at

Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-23 Thread Steve Loughran
Now, if you were ruthless, it'd make sense to randomise the order of results when someone leaves out the ORDER BY, to stop complacency. Like that time Sun changed the order in which methods were returned by a Class.getMethods() call, and everyone's JUnit test cases failed if they'd assumed that ordering
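The underlying point is easy to demonstrate: nothing in the DataFrame API promises an order unless you ask for one. A minimal sketch (session setup assumed), where only the second query has a defined output order:

```scala
// Without an explicit sort, row order after a shuffle is an implementation
// detail and may vary between runs, partitions, and Spark versions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("ordering").getOrCreate()
val df = spark.range(100).toDF("id").repartition(8) // shuffle scrambles order

df.collect()                    // order here is unspecified
df.orderBy(col("id")).collect() // order here is guaranteed by the sort
```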

Re: What else could be removed in Spark 4?

2023-08-24 Thread Steve Loughran
I would recommend cutting them. + historically they've fixed the version of aws-sdk jar used in spark releases, meaning s3a connector through spark rarely used the same sdk release as that qualified through the hadoop sdk update process, so if there were incompatibilities, it'd be up to the spark

Re: Spark writing API

2023-08-07 Thread Steve Loughran
On Thu, 1 Jun 2023 at 00:58, Andrew Melo wrote: > Hi all > > I've been developing for some time a Spark DSv2 plugin "Laurelin" ( > https://github.com/spark-root/laurelin > ) to read the ROOT (https://root.cern) file format (which is used in high > energy physics). I've recently presented my work

Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-21 Thread Steve Loughran
I'd say everyone should, *and* the HTTP user-agent in all the clients which make calls to object stores should too, as it helps field issues there. The s3a and abfs clients do provide the ability to add params there -- please set them in your deployments. On Fri, 16 Jun 2023 at 21:53, Dongjoon Hyun wrote: > Please vote

Re: Remove protobuf 2.5.0 from Spark dependencies

2023-06-01 Thread Steve Loughran
On Sat, 27 May 2023 at 14:39, 张铎 (Duo Zhang) wrote: > For hbase, the problem is we rely on protobuf 2.5 for our coprocessors. > > See HBASE-27436. > > Cheng Pan wrote on Wed, 24 May 2023 at 10:00: > >> +CC dev@hbase >> >> Thanks, >> Cheng Pan >> >> On

Re: Remove protobuf 2.5.0 from Spark dependencies

2023-05-18 Thread Steve Loughran
+1. The shaded one which is in use also needs upgrading. > Thanks, > Cheng Pan > > > On May 17, 2023 at 04:10:43, Dongjoon Hyun wrote: > >> Thank you for sharing, Steve. >> >> Dongjoon >> >> On Tue, May 16, 2023 at 11:44 AM Steve Loughr

Re: Remove protobuf 2.5.0 from Spark dependencies

2023-05-16 Thread Steve Loughran
I have some bad news here: even though hadoop cut protobuf 2.5 support, the hbase team put it back in (HADOOP-17046). I don't know if the shaded hadoop client has removed that dependency on protobuf 2.5. In HADOOP-18487 I want to allow hadoop to cut that dependency, with hbase having to add

Re: hadoop-2 profile to be removed in 3.5.0

2023-04-18 Thread Steve Loughran
This is truly wonderful. 1. I have an internal patch related to committer stuff I could submit now. 2. If someone wants to look at where FileSystem.open() is used *and you have the file length, file path, or simply know whether you plan to do random or sequential IO*, switch to openFile(). on
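A sketch of what that `openFile()` switch looks like in Hadoop 3.3+. The option key `fs.option.openfile.read.policy` and the `withFileStatus()` short-circuit are taken from recent Hadoop releases, not from this message; check them against the version actually on your classpath.

```scala
// FileSystem.openFile() builder: declare the read policy and pass the
// already-known FileStatus so S3A can skip its HEAD request.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("s3a://bucket/data/file.parquet") // illustrative path
val fs = path.getFileSystem(new Configuration())
val status = fs.getFileStatus(path) // if you already have it, pass it down

val in = fs.openFile(path)
  .withFileStatus(status)                              // skips a HEAD on S3A
  .opt("fs.option.openfile.read.policy", "random")     // columnar: random IO
  .build()                                             // CompletableFuture
  .get()                                               // FSDataInputStream
try {
  val header = new Array[Byte](8)
  in.readFully(0, header) // positioned read, as a columnar reader would do
} finally {
  in.close()
}
```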

Re: maven build failing in spark sql w/BouncyCastleProvider CNFE

2022-12-08 Thread Steve Loughran
t; > test > > > > ``` > > > > Yang Jie > > > > *From:* "Yang,Jie(INF)" > *Date:* Tuesday, 6 December 2022 at 18:27 > *To:* Steve Loughran > *Cc:* Hyukjin Kwon , Apache Spark Dev < > dev@spark.apache.org> > *Subject:* Re: ma

Re: maven build failing in spark sql w/BouncyCastleProvider CNFE

2022-12-06 Thread Steve Loughran
BouncyCastleProvider CNFE > > > > Steve, does the lower version of scala plugin work for you? If that > solves it, we could temporarily downgrade for now. > > > > On Mon, 5 Dec 2022 at 22:23, Steve Loughran > wrote: > > trying to build spark master w/ hadoop trunk and the maven s

maven build failing in spark sql w/BouncyCastleProvider CNFE

2022-12-05 Thread Steve Loughran
trying to build spark master w/ hadoop trunk and the maven sbt plugin is failing. This doesn't happen with the 3.3.5 RC0; I note that the only mention of this anywhere was me in March. Clearly something in hadoop trunk has changed in a way which is incompatible. Has anyone else tried such a

Re: CVE-2022-42889

2022-10-27 Thread Steve Loughran
the api doesn't get used in the hadoop libraries; not sure about other dependencies. probably makes sense to say on the jira that there's no need to panic here; I've had to start doing that as some of the security scanners appear to overreact https://issues.apache.org/jira/browse/HDFS-16766 On

Re: Missing data in spark output

2022-10-25 Thread Steve Loughran
v1 on gcs isn't safe either, as promotion from task attempt to successful task is a dir rename: fast and atomic on hdfs, O(files) and non-atomic on GCS. If I can get that hadoop 3.3.5 RC out soon, the manifest committer will be there to test https://issues.apache.org/jira/browse/MAPREDUCE-7341
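Once that committer ships, binding it for `gs://` output might look like the sketch below. The factory class and property name are assumptions taken from the MAPREDUCE-7341 work, not from this message; confirm them against the released Hadoop documentation.

```scala
// Hypothetical sketch: route gs:// output through the manifest committer
// factory (Hadoop 3.3.5+). Property and class names are assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("manifest-committer")
  .config(
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs",
    "org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory")
  .getOrCreate()
```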

Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-06 Thread Steve Loughran
On Wed, 5 Oct 2022 at 21:59, Chao Sun wrote: > +1 > > > and specifically may allow us to finally move off of the ancient version > of Guava (?) > > I think the Guava issue comes from the Hive 2.3 dependency, not Hadoop. > hadoop branch-2 has guava dependencies; not sure which one. A key lesson

Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-04 Thread Steve Loughran
that sounds suspiciously like something I'd write :) the move to java8 happened in HADOOP-11858; 3.0.0 HADOOP-16219, "[JDK8] Set minimum version of Hadoop 2 to JDK 8" has been open since 2019 and I just closed as WONTFIX. Most of the big production hadoop 2 clusters use java7, because that is

Re: Spark32 + Java 11 . Reading parquet java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'

2022-06-14 Thread Steve Loughran
Docker file which works > fine in our environment (with Hadoop 3.1) > > > Regards > Pralabh Kumar > > > > On Mon, Jun 13, 2022 at 3:25 PM Steve Loughran > wrote: > >> >> >> On Mon, 13 Jun 2022 at 08:52, Pralabh Kumar >> wrote: >> >>&

Re: Spark32 + Java 11 . Reading parquet java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'

2022-06-13 Thread Steve Loughran
On Mon, 13 Jun 2022 at 08:52, Pralabh Kumar wrote: > Hi Dev team > > I have a spark32 image with Java 11 (Running Spark on K8s) . While > reading a huge parquet file via spark.read.parquet("") . I am getting > the following error . The same error is mentioned in Spark docs >

Re: [DISCUSS] SPIP: Spark Connect - A client and server interface for Apache Spark.

2022-06-07 Thread Steve Loughran
On Fri, 3 Jun 2022 at 18:46, Martin Grund wrote: > Hi Everyone, > > We would like to start a discussion on the "Spark Connect" proposal. > Please find the links below: > > *JIRA* - https://issues.apache.org/jira/browse/SPARK-39375 > *SPIP Document* - >

Re: Issue on Spark on K8s with Proxy user on Kerberized HDFS : Spark-25355

2022-05-03 Thread Steve Loughran
Pralabh, did you follow the URL provided in the exception message? I put a lot of effort into improving the diagnostics, where the wiki articles are part of the troubleshooting process https://issues.apache.org/jira/browse/HADOOP-7469 It's really disappointing when people escalate the problem to

Re: Spark client for Hadoop 2.x

2022-04-12 Thread Steve Loughran
I should back up Dongjoon's comments with the observation that hadoop 2.10.x is the only branch-2 release which gets any security updates; on branch-3 it is 3.2.x and 3.3.x which do. Dongjoon's colleague Chao Sun was the release manager of the 3.3.2 release, so it got thoroughly tested with Spark.

Re: Log4j 1.2.17 spark CVE

2021-12-14 Thread Steve Loughran
log4j 1.2.17 is not vulnerable. There is an existing CVE there from a log aggregation servlet; Cloudera products ship a patched release with that servlet stripped...asf projects are not allowed to do that. But: some recent Cloudera Products do include log4j 2.x, so colleagues of mine are busy

Re: HiveThrift2 ACID Transactions?

2021-11-23 Thread Steve Loughran
without commenting on any other part of this, note that it was in some hive commit operations where a race condition in rename surfaced https://issues.apache.org/jira/browse/HADOOP-16721 If you get odd errors about parent dirs not existing during renames, that'll be it. Upgrade to Hadoop 3.3.1

Re: Handle FileAlreadyExistsException for .spark-staging files

2021-09-10 Thread Steve Loughran
one of the issues here is that Parquet creates files with overwrite=false; other output formats do not do this, so implicitly overwrite the output of previous attempts. Which is fine if you are confident that each task attempt (henceforth: TA) is writing to an isolated path. the next iteration of

Re: Observer Namenode and Committer Algorithm V1

2021-09-07 Thread Steve Loughran
I haven't looked into much. I'd expect to only > see that once since it seems to properly reuse a single FileContext > instance. > > Adam > > On Fri, Aug 20, 2021 at 2:22 PM Steve Loughran > wrote: > >> >> ooh, this is fun, >> >> v2 isn't safe to use

Re: Observer Namenode and Committer Algorithm V1

2021-08-20 Thread Steve Loughran
ooh, this is fun. v2 isn't safe to use unless every task attempt generates files with exactly the same names and it is okay to intermingle the output of two task attempts. This is because task commit can fail partway through (or worse, the process can pause for a full GC), and a second attempt

MAPREDUCE-7341. Intermediate Manifest Committer for Azure + GCS

2021-07-07 Thread Steve Loughran
My little committer project, an intermediate manifest committer for Azure and GCS, is reaching the stage where it's ready for others to look at: https://github.com/apache/hadoop/pull/2971 Goals 1. Correctness even on GCS, which doesn't have atomic dir rename (so v1 isn't safe). It does use

Re: Spark ACID compatibility

2021-06-22 Thread Steve Loughran
On Mon, 14 Jun 2021 at 19:07, Mich Talebzadeh wrote: > > > Now I am trying to read it in Hive > > 0: jdbc:hive2://rhes75:10099/default> desc test.randomDataDelta; > ++--+--+ > |col_name| data_type | comment | >

Re: Missing module spark-hadoop-cloud in Maven central

2021-06-01 Thread Steve Loughran
(can't reply to user@, so pulling in @dev instead; sorry) There is no fundamental reason why the hadoop-cloud POM and artifact isn't built/released by the ASF spark project; I think the effort it took to get the spark-hadoop-cloud module in at all

Re: [DISCUSS] Add error IDs

2021-04-15 Thread Steve Loughran
Machine-readable logs are always good, especially if you can read the entire logs into an SQL query. It might be good to use some specific differentiation between hint/warn/fatal errors in the numbering, so that any automated analysis of the logs can identify the class of an error even if it's an

Re: UserGroupInformation.doAS is working well in Spark Executors?

2021-04-15 Thread Steve Loughran
If you are using kerberized HDFS, the spark principal (or whoever is running the cluster) has to be declared as a proxy user. https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html Once done, you call val ugi = UserGroupInformation.createProxyUser("joe",
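The snippet breaks off mid-call; a sketch of the full proxy-user pattern it begins is below. The username, path, and use of the login user as the real user are illustrative only, not taken from the original message.

```scala
// Proxy-user pattern: create a UGI for the target user backed by the
// kerberos-authenticated login user, then do all filesystem work inside
// doAs so HDFS sees the request as coming from "joe".
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

val ugi = UserGroupInformation.createProxyUser(
  "joe", UserGroupInformation.getLoginUser)

val listing: Array[FileStatus] = ugi.doAs(
  new PrivilegedExceptionAction[Array[FileStatus]] {
    override def run(): Array[FileStatus] = {
      // FileSystem created *inside* doAs is owned by the proxy user.
      val fs = FileSystem.get(new Configuration())
      fs.listStatus(new Path("/user/joe"))
    }
  })
```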

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-04-06 Thread Steve Loughran
On Thu, 11 Mar 2021 at 19:58, Attila Zsolt Piros < piros.attila.zs...@gmail.com> wrote: > I agree with you to extend the documentation around this. Moreover I > support to have specific unit tests for this. > > > There is clearly some demand for Spark to automatically clean up > checkpoints on

Re: AWS Consistent S3 & Apache Hadoop's S3A connector

2020-12-10 Thread Steve Loughran
On Mon, 7 Dec 2020 at 07:36, Chang Chen wrote: > Since S3A now works perfectly with S3Guard turned off, Could Magic > Committor work with S3Guard is off? If Yes, will performance degenerate? Or > if HADOOP-17400 is fixed, then it will have comparable performance? > Yes, works really well. * It

AWS Consistent S3 & Apache Hadoop's S3A connector

2020-12-04 Thread Steve Loughran
as sent to hadoop-general. TL;DR. S3 is consistent; S3A now works perfectly with S3Guard turned off, if not, file a JIRA. rename still isn't real, so don't rely on that or create(path, overwrite=false) for atomic operations --- If you've missed the announcement, AWS S3 storage is now
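"S3Guard turned off" has a concrete spelling in configuration. A sketch is below; the `NullMetadataStore` class and property key are the Hadoop-side switch as I understand it, so verify the exact names against your hadoop-aws release.

```scala
// Sketch: run S3A with S3Guard explicitly disabled, now that S3 itself
// is consistent. Property and class names should be checked against the
// hadoop-aws documentation for your release.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-no-s3guard")
  .config("spark.hadoop.fs.s3a.metadatastore.impl",
    "org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore")
  .getOrCreate()
```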

Hive isolation and context classloaders

2020-11-10 Thread Steve Loughran
I'm staring at https://issues.apache.org/jira/browse/HADOOP-17372 and a stack trace which claims that a com.amazonaws class doesn't implement an interface which it very much does 2020-11-10 05:27:33,517 [ScalaTest-main-running-S3DataFrameExampleSuite] WARN fs.FileSystem

Re: -Phadoop-provided still includes hadoop jars

2020-11-09 Thread Steve Loughran
On Mon, 12 Oct 2020 at 19:06, Sean Owen wrote: > I don't have a good answer, Steve may know more, but from looking at > dependency:tree, it looks mostly like it's hadoop-common that's at issue. > Without -Phive it remains 'provided' in the assembly/ module, but -Phive > causes it to come back

Re: A common naming policy for third-party packages/modules under org.apache.spark?

2020-09-23 Thread Steve Loughran
s. Could you give us some > examples specifically? > > > Can I suggest some common prefix for third-party-classes put into the > spark package tree, just to make clear that they are external contributions? > > Bests, > Dongjoon. > > > On Mon, Sep 21, 2020 at 6:29 AM St

A common naming policy for third-party packages/modules under org.apache.spark?

2020-09-21 Thread Steve Loughran
I've just been stack-trace-chasing the 404-in-task-commit code: https://issues.apache.org/jira/browse/HADOOP-17216 And although it's got an org.apache.spark. prefix, it's actually org.apache.spark.sql.delta, which lives in github, so the code/issue tracker lives elsewhere. I understand why

Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-23 Thread Steve Loughran
On Wed, 22 Jul 2020 at 18:50, Holden Karau wrote: > Wonderful. To be clear the patch is more to start the discussion about how > we want to do it and less what I think is the right way. > > be happy to give a quick online tour of ongoing work on S3A enhancements some time next week, get feedback

Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Steve Loughran
On Wed, 22 Jul 2020 at 00:51, Holden Karau wrote: > Hi Folks, > > In Spark SQL there is the ability to have Spark do it's partition > discovery/file listing in parallel on the worker nodes and also avoid > locality lookups. I'd like to expose this in core, but given the Hadoop > APIs it's a bit

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-07-21 Thread Steve Loughran
On Sun, 12 Jul 2020 at 01:45, gpongracz wrote: > As someone who mainly operates in AWS it would be very welcome to have the > option to use an updated version of hadoop using pyspark sourced from pypi. > > Acknowledging the issues of backwards compatability... > > The most vexing issue is the

Re: java.lang.ClassNotFoundException for s3a comitter

2020-07-21 Thread Steve Loughran
al.io.cloud.PathOutputCommitProtocol"); > hadoopConfiguration.set("spark.sql.parquet.output.committer.class", > "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"); > hadoopConfiguration.set("fs.s3a.connection.maximum", > Integer.toString(coreCount

Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-07-01 Thread Steve Loughran
https://issues.apache.org/jira/browse/MAPREDUCE-7282 "MR v2 commit algorithm is dangerous, should be deprecated and not the default" someone do a PR to change the default & if it doesn't break too much I'll merge it On Mon, 29 Jun 2020 at 13:20, Steve Loughran wrote: > v2 do
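Until any default changes, jobs can pin the v1 algorithm themselves using the property from this thread's title, rather than trusting whatever the cluster ships with:

```scala
// Explicitly select the v1 file output commit algorithm (safe but slower),
// using the spark.hadoop.* prefix to push it into the Hadoop configuration.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("v1-committer")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  .getOrCreate()
```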

Re: preferredlocations for hadoopfsrelations based baseRelations

2020-06-29 Thread Steve Loughran
Here's a class which lets you provide a function on a row-by-row basis to declare location https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/cloudera/ParallelizedWithLocalityRDD.scala needs to be in o.a.spark as something

Re: java.lang.ClassNotFoundException for s3a comitter

2020-06-29 Thread Steve Loughran
you are going to need hadoop-3.1 on your classpath, with hadoop-aws and the same aws-sdk it was built with (1.11.something). Mixing hadoop JARs is doomed. Using a different aws sdk jar is a bit risky, though more recent upgrades have all been fairly low-stress. On Fri, 19 Jun 2020 at 05:39, murat

Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-06-29 Thread Steve Loughran
https://issues.apache.org/jira/browse/SPARK-20107?focusedCommentId=15945177&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15945177 > > I also did discuss this a bit with Steve Loughran and his opinion was that > v2 should just be deprecated altogether. I believe he was going to bring > that up

Re: RDD order guarantees

2020-05-06 Thread Steve Loughran
On Tue, 7 Apr 2020 at 15:26, Antonin Delpeuch wrote: > Hi, > > Sorry to dig out this thread but this bug is still present. > > The fix proposed in this thread (creating a new FileSystem implementation > which sorts listed files) was rejected, with the suggestion that it is the >

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-22 Thread Steve Loughran
On Tue, Nov 19, 2019 at 10:40 PM Cheng Lian wrote: > Hey Steve, > > In terms of Maven artifact, I don't think the default Hadoop version > matters except for the spark-hadoop-cloud module, which is only meaningful > under the hadoop-3.2 profile. All the other spark-* artifacts published to >

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-22 Thread Steve Loughran
On Thu, Nov 21, 2019 at 12:53 AM Dongjoon Hyun wrote: > Thank you for much thoughtful clarification. I agree with your all options. > > Especially, for Hive Metastore connection, `Hive isolated client loader` > is also important with Hive 2.3 because Hive 2.3 client cannot talk with > Hive 2.1

Re: Ask for ARM CI for spark

2019-11-17 Thread Steve Loughran
The ASF PR team would like something like "Spark now supports ARM" in press releases. And don't forget: they do like to be involved in the launch of the final release. On Fri, Nov 15, 2019 at 9:46 AM bo zhaobo wrote: > Hi @Sean Owen , > > Thanks for your idea. > > We may use the bad

Re: Adding JIRA ID as the prefix for the test case name

2019-11-17 Thread Steve Loughran
need some more > efforts to investigate as well. > > On Fri, 15 Nov 2019, 20:56 Steve Loughran, > wrote: > >> Junit5: Display names. >> >> Goes all the way to the XML. >> >> >> https://junit.org/junit5/docs/current/user-guide/#writ

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-17 Thread Steve Loughran
Can I take this moment to remind everyone that the version of hive which spark has historically bundled (the org.spark-project one) is an orphan project, put together to deal with Hive's shading issues, and a source of unhappiness in the Hive project. Whatever gets shipped should do its best to

Re: Adding JIRA ID as the prefix for the test case name

2019-11-15 Thread Steve Loughran
Junit5: Display names. Goes all the way to the XML. https://junit.org/junit5/docs/current/user-guide/#writing-tests-display-names On Thu, Nov 14, 2019 at 6:13 PM Shixiong(Ryan) Zhu wrote: > Should we also add a guideline for non Scala tests? Other languages (Java, > Python, R) don't support

Re: [DISCUSS] writing structured streaming dataframe to custom S3 buckets?

2019-11-08 Thread Steve Loughran
> spark.sparkContext.hadoopConfiguration.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") This is some superstition which seems to get carried through Stack Overflow articles. You do not need to declare the implementation class for s3a:// any more than you have to do for
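The point above in code: with hadoop-aws on the classpath, the `s3a://` scheme resolves by itself and no `fs.s3a.impl` declaration is needed. Bucket names and the credential source here are illustrative.

```scala
// No fs.s3a.impl anywhere: the scheme is discovered from hadoop-aws.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3a-plain").getOrCreate()

// Credentials, only if they are not already coming from the environment,
// instance profile, or core-site.xml (keys shown are the standard S3A ones).
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val df = spark.read.parquet("s3a://my-bucket/events/") // just works
```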

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-04 Thread Steve Loughran
On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas wrote: > On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran > wrote: > >> It would be really good if the spark distributions shipped with later >> versions of the hadoop artifacts. >> > > I second this. If we need to

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-04 Thread Steve Loughran
I'd move spark's branch-2 line to 2.9.x as (a) spark's version of httpclient hits a bug in the AWS SDK used in hadoop-2.8 unless you revert that patch https://issues.apache.org/jira/browse/SPARK-22919 and (b) there's only one future version of 2.8.x planned, which is expected once myself or someone

Re: Spark 3.0 and S3A

2019-11-01 Thread Steve Loughran
On Mon, Oct 28, 2019 at 3:40 PM Sean Owen wrote: > There will be a "Hadoop 3.x" version of 3.0, as it's essential to get > a JDK 11-compatible build. you can see the hadoop-3.2 profile. > hadoop-aws is pulled in in the hadoop-cloud module I believe, so bears > checking whether the profile

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-01 Thread Steve Loughran
What is the current default value? As the 2.x releases are becoming EOL: 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2 release getting attention. 2.10.0 shipped yesterday, but the ".0" means there will inevitably be surprises. One issue with using an older version is that any

Re: Minimum JDK8 version

2019-10-25 Thread Steve Loughran
On Fri, Oct 25, 2019 at 2:56 AM Dongjoon Hyun wrote: > > All versions of JDK8 are not the same naturally. For example, Hadoop > community also have the following document although they are not specifying > the minimum versions. > > - >

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-07 Thread Steve Loughran
happens during FileInputFormat scans, so is how I'm going to tune IOPs there. It might also be good to have those bits of the hadoop MR classes which spark uses log internally @ debug, so everything gets this logging if they ask for it. Happy to take contribs there as Hadoop JIRAs &

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-06 Thread Steve Loughran
On Fri, Sep 6, 2019 at 2:50 PM Sean Owen wrote: > I think the problem is calling globStatus to expand all 300K files. > This is a general problem for object stores and huge numbers of files. > Steve L. may have better thoughts on real solutions. But you might > consider, if possible, running a
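One hedged mitigation for the 300K-file glob problem, not stated in the thread but consistent with its diagnosis: enumerate the partition directories yourself with cheap shallow listings, then hand Spark explicit paths, so no huge recursive `globStatus` runs on the driver. Bucket layout below is illustrative.

```scala
// Replace a deep glob with one shallow listing plus explicit paths.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("no-glob").getOrCreate()

val root = new Path("s3a://bucket/table") // hypothetical layout: table/day=*/
val fs = root.getFileSystem(new Configuration())

// One-level listing of partition directories, not of every file under them.
val dayDirs = fs.listStatus(root)
  .filter(_.isDirectory)
  .map(_.getPath.toString)

val df = spark.read.parquet(dayDirs: _*) // Spark lists each dir in parallel
```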

Re: Why two netty libs?

2019-09-04 Thread Steve Loughran
Zookeeper client is/was netty 3, AFAIK, so if you want to use it for anything, it ends up on the CP On Tue, Sep 3, 2019 at 5:18 PM Shixiong(Ryan) Zhu wrote: > Yep, historical reasons. And Netty 4 is under another namespace, so we can > use Netty 3 and Netty 4 in the same JVM. > > On Tue, Sep 3,

Re: IPv6 support

2019-07-17 Thread Steve Loughran
Fairly neglected hadoop patch, FWIW; https://issues.apache.org/jira/browse/HADOOP-11890 FB have been running HDFS on IPv6 for a while, but their codebase has diverged; getting the stuff into trunk is going to take effort. At least the JDK has moved on and should be better On Wed, Jul 17, 2019

Re: Option for silent failure while reading a list of files.

2019-07-01 Thread Steve Loughran
Where is this list of files coming from? If you made the list, then yes, the expectation is generally "supply a list of files which are present", on the basis that the general convention is "missing files are considered bad". Though you could try setting spark.sql.files.ignoreCorruptFiles=true to see
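The setting suggested above, in context, along with its sibling for absent paths. Both are silent-skip switches, so only enable them when missing or damaged files genuinely are expected; check `ignoreMissingFiles` is present in your Spark version.

```scala
// Lenient read: silently skip corrupt and (optionally) missing files.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lenient-read").getOrCreate()
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true") // sibling option

val df = spark.read.parquet("s3a://bucket/maybe-there/*.parquet")
```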

Re: Ask for ARM CI for spark

2019-06-28 Thread Steve Loughran
in the upstream community, it brings confidence to end users and > customers when they plan to deploy these projects on ARM. > > This is absolutely long-term work; let's make it step by step: CI, > testing, issues and resolving them. > > Steve Loughran wrote on Thu, 27 Jun 2019 at 9:22 PM: > >>

Re: Ask for ARM CI for spark

2019-06-27 Thread Steve Loughran
LevelDB and native codecs are invariably a problem here, as is anything else doing misaligned IO. Protobuf has also had "issues" in the past; see https://issues.apache.org/jira/browse/HADOOP-16100 I think any AArch64 work is going to have to define very clearly what "works" means; spark

Re: Detect executor core count

2019-06-18 Thread Steve Loughran
be aware that older java 8 versions count the #of cores in the host, not those allocated for the container they run in https://bugs.openjdk.java.net/browse/JDK-8140793 On Tue, Jun 18, 2019 at 8:13 PM Ilya Matiach wrote: > Hi Andrew, > > I tried to do something similar to that in the LightGBM >
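What the JDK actually reports is trivial to check. On older Java 8 builds this returns the host's core count even inside a CPU-limited container; builds with the container-awareness backport (around 8u191, per the linked JDK bug's fix line, which you should verify for your exact JDK) honour the cgroup limits.

```scala
// Print what the JVM believes the core count is; compare against the
// container's CPU quota to see whether your JDK is container-aware.
object CoreCount {
  def main(args: Array[String]): Unit = {
    val cores = Runtime.getRuntime.availableProcessors()
    println(s"JVM sees $cores cores")
  }
}
```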

Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

2019-05-22 Thread Steve Loughran
hadoop is still on the Avro 1.7.7 branch. A move to 1.9 would probably be as painful as a move to 1.8.x, so submit a patch for hadoop trunk. The last PR there wasn't quite ready and I didn't get any follow-up to the "what is this going to break" question https://issues.apache.org/jira/browse/HADOOP-13386

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-04-03 Thread Steve Loughran
On Tue, Apr 2, 2019 at 9:39 PM Ankur Gupta wrote: > Hi Steve, > > Thanks for your feedback. From your email, I could gather the following > two important points: > >1. Report failures to something (cluster manager) which can opt to >destroy the node and request a new one >2.

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-04-02 Thread Steve Loughran
On Fri, Mar 29, 2019 at 6:18 PM Reynold Xin wrote: > We tried enabling blacklisting for some customers and in the cloud, very > quickly they end up having 0 executors due to various transient errors. So > unfortunately I think the current implementation is terrible for cloud > deployments, and

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-19 Thread Steve Loughran
you might want to look at the work on FPGA resources; again it should just be a resource available by a scheduler. Key thing is probably just to keep the docs generic https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingFPGA.html I don't know where you get those FPGAs to play

Re: proposal for expanded & consistent timestamp types

2019-01-02 Thread Steve Loughran
e to care about the persistence format *or which app created the data* What does Arrow do in this world, incidentally? On 2 Jan 2019, at 11:48, Steve Loughran <ste...@hortonworks.com> wrote: On 17 Dec 2018, at 17:44, Zoltan Ivanfi <z...@cloudera.com.INVALID> wrote:

Re: proposal for expanded & consistent timestamp types

2019-01-02 Thread Steve Loughran
On 17 Dec 2018, at 17:44, Zoltan Ivanfi <z...@cloudera.com.INVALID> wrote: Hi, On Sun, Dec 16, 2018 at 4:43 AM Wenchen Fan <cloud0...@gmail.com> wrote: Shall we include Parquet and ORC? If they don't support it, it's hard for general query engines like Spark to support it.

Re: Hadoop 3 support

2018-10-23 Thread Steve Loughran
> On 16 Oct 2018, at 22:06, t4 wrote: > > has anyone got spark jars working with hadoop3.1 that they can share? i am > looking to be able to use the latest hadoop-aws fixes from v3.1 We do, but we do it with * a patched hive JAR * building spark with

Re: Random sampling in tests

2018-10-09 Thread Steve Loughran
Randomized testing can, in theory, help you explore a far larger area of an app's environment than you could explicitly cover, such as "does everything work in the Turkish locale, where "I".toLowerCase() != "i"", etc. Good: faster tests, especially on an essentially non-finite set of options

Re: [Discuss] Datasource v2 support for Kerberos

2018-10-02 Thread Steve Loughran
On 2 Oct 2018, at 04:44, tigerquoll <tigerqu...@outlook.com> wrote: Hi Steve, I think that passing a kerberos keytab around is one of those bad ideas that is entirely appropriate to re-question every single time you come across it. It has been used already in spark when interacting with

Re: saveAsTable in 2.3.2 throws IOException while 2.3.1 works fine?

2018-10-01 Thread Steve Loughran
On 30 Sep 2018, at 19:37, Jacek Laskowski <ja...@japila.pl> wrote: scala> spark.range(1).write.saveAsTable("demo") 2018-09-30 17:44:27 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException 2018-09-30 17:44:28 ERROR FileOutputCommitter:314 - Mkdirs

Re: [Structured Streaming SPARK-23966] Why non-atomic rename is problem in State Store ?

2018-10-01 Thread Steve Loughran
On 11 Aug 2018, at 17:33, chandan prakash <chandanbaran...@gmail.com> wrote: Hi All, I was going through this pull request about the new CheckpointFileManager abstraction in structured streaming coming in 2.4: https://issues.apache.org/jira/browse/SPARK-23966

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-27 Thread Steve Loughran
> On 25 Sep 2018, at 07:52, tigerquoll wrote: > > To give some Kerberos specific examples, The spark-submit args: > -–conf spark.yarn.keytab=path_to_keytab -–conf > spark.yarn.principal=princi...@realm.com > > are currently not passed through to the data sources. > > > I'm not sure why

Re: sql compile failing with Zinc?

2018-08-14 Thread Steve Loughran
t more memory with -J-Xmx2g or whatever. If you're running ./build/mvn and letting it run zinc, we might need to increase the memory that it requests in the script. On Tue, Aug 14, 2018 at 2:56 PM Steve Loughran <ste...@hortonworks.com> wrote: Is anyone else getting the sql module

sql compile failing with Zinc?

2018-08-14 Thread Steve Loughran
Is anyone else getting the sql module maven build on master branch failing when you use zinc for incremental builds? [warn] ^ java.lang.OutOfMemoryError: GC overhead limit exceeded at scala.tools.nsc.backend.icode.GenICode$Scope.<init>(GenICode.scala:2225) at

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-07 Thread Steve Loughran
CSV with schema inference is a full read of the data, so that could be one of the problems. Do it at most once, print out the schema, and use it from then on during ingress & use something else for persistence. On 6 Aug 2018, at 05:44, makatun <d.i.maka...@gmail.com> wrote: a.
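The "do it at most once" advice, sketched out: pay for the inference pass a single time, persist the schema (its JSON form is convenient), and supply it explicitly on every later read. File paths here are illustrative.

```scala
// Infer once, then reuse the schema so later reads skip the full data scan.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StructType}

val spark = SparkSession.builder().appName("csv-schema-once").getOrCreate()

// One-off: full pass over the wide CSV to infer column types.
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/wide.csv")
  .schema
val schemaJson = inferred.json // persist this string somewhere durable

// Every subsequent read: no inference pass at all.
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val df = spark.read.option("header", "true").schema(schema).csv("data/wide.csv")
```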

Opentrace in ASF projects

2018-07-06 Thread Steve Loughran
FYI, there's some initial exploring of what it would take to move the HDFS wire protocol from HTrace to OpenTracing, and wire up the other stores too https://issues.apache.org/jira/browse/HADOOP-15566 If anyone has any input/insight or code review capacity, it'd be

hadoop-aws versions (was Re: [VOTE] Spark 2.3.1 (RC4))

2018-06-26 Thread Steve Loughran
following up after a ref to this in https://issues.apache.org/jira/browse/HADOOP-15559 the AWS SDK is a very fast moving project, with a release cycle of ~2 weeks, but it's in the state Fred Brooks described, "the number of bugs is constant, they just move around"; bumping up an AWS release

Re: Running lint-java during PR builds?

2018-05-28 Thread Steve Loughran
> On 21 May 2018, at 17:20, Marcelo Vanzin wrote: > > Is there a way to trigger it conditionally? e.g. only if the diff > touches java files. > what about adding it as another command alongside "jenkins test this please", something like "lint this

Re: saveAsNewAPIHadoopDataset must not enable speculation for parquet file?

2018-04-26 Thread Steve Loughran
sorry, hadn't noticed this followup. Been busy with other issues On 3 Apr 2018, at 11:19, cane wrote: Now, if we use saveAsNewAPIHadoopDataset with speculation enabled, it may cause data loss. I checked the comment of this API: We should

Re: time for Apache Spark 3.0?

2018-04-05 Thread Steve Loughran
On 5 Apr 2018, at 18:04, Matei Zaharia wrote: Java 9/10 support would be great to add as well. Be aware that the work moving hadoop core to java 9+ is still a big piece of work being undertaken by Akira Ajisaka & colleagues at NTT

Re: saveAsNewAPIHadoopDataset must not enable speculation for parquet file?

2018-04-03 Thread Steve Loughran
> On 3 Apr 2018, at 11:19, cane wrote: > > Now, if we use saveAsNewAPIHadoopDataset with speculation enabled, it may cause > data loss. > I checked the comment of this API: > > We should make sure our tasks are idempotent when speculation is enabled, > i.e. do > * not
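The idempotency requirement quoted above is usually met by having each task attempt write to a private temporary file and then promote it with an atomic rename, so two speculative attempts of the same task can never interleave partial output. A minimal sketch of that pattern — file names and layout are illustrative, not Spark's committer protocol:

```python
import os


def commit_task_output(out_dir, task_id, attempt_id, data):
    # Each attempt writes to its own temp file, so concurrent
    # speculative attempts never touch each other's output...
    tmp = os.path.join(out_dir, f"_tmp-task{task_id}-attempt{attempt_id}")
    final = os.path.join(out_dir, f"part-{task_id:05d}")
    with open(tmp, "w") as f:
        f.write(data)
    # ...and promotes it with an atomic rename. A losing attempt just
    # replaces the final file with identical content, so the committed
    # result is the same whichever attempt "wins".
    os.replace(tmp, final)
    return final
```

`os.replace` is an atomic rename on POSIX filesystems, which is exactly the property object stores like S3 lack — hence the data-loss concern in the thread.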

Re: Hadoop 3 support

2018-04-03 Thread Steve Loughran
On 3 Apr 2018, at 01:30, Saisai Shao wrote: Yes, the main blocking issue is the hive version used in Spark (1.2.1.spark) doesn't support running on Hadoop 3. Hive will check the Hadoop version in the runtime [1]. Besides this I think some

Re: A new external catalog

2018-02-16 Thread Steve Loughran
On 14 Feb 2018, at 19:56, Tayyebi, Ameen wrote: Newbie question: I want to add system/integration tests for the new functionality. There is a set of existing tests around Spark Catalog that I can leverage. Great. The provider I’m writing is

Re: A new external catalog

2018-02-16 Thread Steve Loughran
On 14 Feb 2018, at 13:51, Tayyebi, Ameen wrote: Thanks a lot Steve. I’ll go through the Jiras you linked in detail. I took a quick look and am sufficiently scared for now. I had run into that warning from the S3 stream before. Sigh. things

Re: Regarding NimbusDS JOSE JWT jar 3.9 security vulnerability

2018-02-14 Thread Steve Loughran
might be coming in transitively https://issues.apache.org/jira/browse/HADOOP-14799 On 13 Feb 2018, at 18:18, PJ Fanning wrote: Hi Sujith, I didn't find the nimbusds dependency in any spark 2.2 jars. Maybe I missed something. Could you tell us

Re: A new external catalog

2018-02-13 Thread Steve Loughran
1.11.199 because it didn't have any issues that we hadn't already got under control (https://github.com/aws/aws-sdk-java/issues/1211) Like I said: upgrades bring fear From: Steve Loughran Date: Tuesday, February 13, 2018

Re: A new external catalog

2018-02-13 Thread Steve Loughran
On 13 Feb 2018, at 19:50, Tayyebi, Ameen wrote: The biggest challenge is that I had to upgrade the AWS SDK to a newer version so that it includes the Glue client, since Glue is a new service. So far, I haven’t seen any jar hell issues, but

Re: Corrupt parquet file

2018-02-13 Thread Steve Loughran
On 12 Feb 2018, at 20:21, Ryan Blue wrote: I wouldn't say we have a primary failure mode that we deal with. What we concluded was that all the schemes we came up with to avoid corruption couldn't cover all cases. For example, what about when

Re: Drop the Hadoop 2.6 profile?

2018-02-12 Thread Steve Loughran
I'd advocate 2.7 over 2.6, primarily due to Kerberos and JVM versions. 2.6 is not even qualified for Java 7, let alone Java 8: you've got no guarantees that things work on the min Java version Spark requires. Kerberos is always the failure point here, as well as various libraries (jetty) which

Re: Corrupt parquet file

2018-02-12 Thread Steve Loughran
On 12 Feb 2018, at 19:35, Dong Jiang wrote: I got no error messages from EMR. We write directly from dataframe to S3. There doesn’t appear to be an issue with the S3 file; we can still download the parquet file and read most of the columns, just one
