Re: How to run spark connect in kubernetes?

2024-10-07 Thread Steve Loughran
https://isitdns.com/ On Wed, 2 Oct 2024 at 22:45, kant kodali wrote: > please ignore this. it was a dns issue > > On Wed, Oct 2, 2024 at 11:16 AM kant kodali wrote: > >> Here >>

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-29 Thread Steve Loughran
I'm going to join in from an ASF community perspective. Nobody should be making fundamental changes to an ASF code base with a PR up and then merged two hours later because of the needs of a single vendor of a downstream product. This doesn't even give people in different time zones the chance to

Re: Spark decommission

2024-07-29 Thread Steve Loughran
On Fri, 5 Jul 2024 at 01:44, Arun Ravi wrote: > Hi Rajesh, > > We use it production at scale. We run spark on kubernetes on aws cloud and > here are the key things that we do > 1) we run driver on on-demand node > 2) we have configured decommission along with fallback option on to S3, > try the l

Re: [DISCUSS] Spark 4.0.0 release

2024-05-02 Thread Steve Loughran
There's a new parquet RC up this week which would be good to pull in. On Thu, 2 May 2024 at 03:20, Jungtaek Lim wrote: > +1 love to see it! > > On Thu, May 2, 2024 at 10:08 AM Holden Karau > wrote: > >> +1 :) yay previews >> >> On Wed, May 1, 2024 at 5:36 PM Chao Sun wrote: >> >>> +1 >>> >>> O

Re: Which version of spark version supports parquet version 2 ?

2024-04-19 Thread Steve Loughran
Those are some quite good improvements -but committing to storing all your data in an unstable format is, well, "bold". For temporary data as part of a workflow, though, it could be appealing. Now, assuming you are going to be working with s3, you might want to start with merging PARQUET-2117 into

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Steve Loughran
ASF will be unhappy about this, and Stack Overflow exists. Otherwise: Apache Confluence and LinkedIn exist; LI is the option I'd point at. On Mon, 18 Mar 2024 at 10:59, Mich Talebzadeh wrote: > Some of you may be aware that Databricks community Home | Databricks > have just launched a knowledge sh

Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-11 Thread Steve Loughran
I consider the context info as more important than just logging; at the hadoop level we do it to attach things like task/job IDs, kerberos principals etc. to all store requests. https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/auditing.html So worrying about how to pass and manage that at

Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-23 Thread Steve Loughran
Now, if you are ruthless it'd make sense to randomise the order of results if someone left out the order by, to stop complacency. Like that time Sun changed the ordering in which methods were returned from a Class.listMethods() call, and everyone's JUnit test cases failed if they'd assumed that ordering

Re: What else could be removed in Spark 4?

2023-08-24 Thread Steve Loughran
I would recommend cutting them. + historically they've fixed the version of aws-sdk jar used in spark releases, meaning s3a connector through spark rarely used the same sdk release as that qualified through the hadoop sdk update process, so if there were incompatibilities, it'd be up to the spark

Re: Spark writing API

2023-08-07 Thread Steve Loughran
On Thu, 1 Jun 2023 at 00:58, Andrew Melo wrote: > Hi all > > I've been developing for some time a Spark DSv2 plugin "Laurelin" ( > https://github.com/spark-root/laurelin > ) to read the ROOT (https://root.cern) file format (which is used in high > energy physics). I've recently presented my work

Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-21 Thread Steve Loughran
I'd say everyone should *and* http UA in all the clients who make calls of object stores should, as it helps field issues there. s3a and abfs clients do provide the ability to add params there -please set them in your deployments On Fri, 16 Jun 2023 at 21:53, Dongjoon Hyun wrote: > Please vote o

Re: Remove protobuf 2.5.0 from Spark dependencies

2023-06-01 Thread Steve Loughran
gs up. On Sat, 27 May 2023 at 14:39, 张铎(Duo Zhang) wrote: > For hbase, the problem is we rely on protobuf 2.5 for our coprocessors. > > See HBASE-27436. > > Cheng Pan 于2023年5月24日周三 10:00写道: > >> +CC dev@hbase >> >> Thanks, >> Cheng Pan >> >&g

Re: Remove protobuf 2.5.0 from Spark dependencies

2023-05-18 Thread Steve Loughran
client-runtime. > +1. the shaded one which is in use also needs upgrading. > Thanks, > Cheng Pan > > > On May 17, 2023 at 04:10:43, Dongjoon Hyun wrote: > >> Thank you for sharing, Steve. >> >> Dongjoon >> >> On Tue, May 16, 2023 at 11:4

Re: Remove protobuf 2.5.0 from Spark dependencies

2023-05-16 Thread Steve Loughran
I have some bad news here, which is that even though hadoop cut protobuf 2.5 support, the hbase team put it back in (HADOOP-17046). I don't know if the shaded hadoop client has removed that dependency on protobuf 2.5. In HADOOP-18487 I want to allow hadoop to cut that dependency, with hbase having to add it

Re: hadoop-2 profile to be removed in 3.5.0

2023-04-18 Thread Steve Loughran
This is truly wonderful. 1. I have an internal patch related to committer stuff I could submit now 2. if someone wants to look at where FileSystem.open() is used *and you have the file length, file path, or simply know whether you plan to do random or sequential IO*, switch to openFile(). on s3

Re: maven build failing in spark sql w/BouncyCastleProvider CNFE

2022-12-08 Thread Steve Loughran
t; > test > > > > ``` > > > > Yang Jie > > > > *发件人**: *"Yang,Jie(INF)" > *日期**: *2022年12月6日 星期二 18:27 > *收件人**: *Steve Loughran > *抄送**: *Hyukjin Kwon , Apache Spark Dev < > dev@spark.apache.org> > *主题**: *Re: ma

Re: maven build failing in spark sql w/BouncyCastleProvider CNFE

2022-12-06 Thread Steve Loughran
w/BouncyCastleProvider CNFE > > > > Steve, does the lower version of scala plugin work for you? If that > solves, we could temporary downgrade for now. > > > > On Mon, 5 Dec 2022 at 22:23, Steve Loughran > wrote: > > trying to build spark master w/ hadoop trunk

maven build failing in spark sql w/BouncyCastleProvider CNFE

2022-12-05 Thread Steve Loughran
trying to build spark master w/ hadoop trunk and the maven sbt plugin is failing. This doesn't happen with the 3.3.5 RC0; I note that the only mention of this anywhere was me in march. clearly something in hadoop trunk has changed in a way which is incompatible. Has anyone else tried such a bui

Re: CVE-2022-42889

2022-10-27 Thread Steve Loughran
the api doesn't get used in the hadoop libraries; not sure about other dependencies. probably makes sense to say on the jira that there's no need to panic here; I've had to start doing that as some of the security scanners appear to overreact https://issues.apache.org/jira/browse/HDFS-16766 On T

Re: Missing data in spark output

2022-10-25 Thread Steve Loughran
v1 on gcs isn't safe either, as promotion from task attempt to successful task is a dir rename; fast and atomic on hdfs, O(files) and non-atomic on GCS. If I can get that hadoop 3.3.5 rc out soon, the manifest committer will be there to test https://issues.apache.org/jira/browse/MAPREDUCE-7341 unt

Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-06 Thread Steve Loughran
On Wed, 5 Oct 2022 at 21:59, Chao Sun wrote: > +1 > > > and specifically may allow us to finally move off of the ancient version > of Guava (?) > > I think the Guava issue comes from the Hive 2.3 dependency, not Hadoop. > hadoop branch-2 has guava dependencies; not sure which one. A key lesson there

Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-04 Thread Steve Loughran
that sounds suspiciously like something I'd write :) the move to java8 happened in HADOOP-11858; 3.0.0 HADOOP-16219, "[JDK8] Set minimum version of Hadoop 2 to JDK 8" has been open since 2019 and I just closed as WONTFIX. Most of the big production hadoop 2 clusters use java7, because that is wh

Re: Spark32 + Java 11 . Reading parquet java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'

2022-06-14 Thread Steve Loughran
file which works >fine in our environment (with Hadoop3.1) > > > Regards > Pralabh Kumar > > > > On Mon, Jun 13, 2022 at 3:25 PM Steve Loughran > wrote: > >> >> >> On Mon, 13 Jun 2022 at 08:52, Pralabh Kumar >> wrote: >> >>&

Re: Spark32 + Java 11 . Reading parquet java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'

2022-06-13 Thread Steve Loughran
On Mon, 13 Jun 2022 at 08:52, Pralabh Kumar wrote: > Hi Dev team > > I have a spark32 image with Java 11 (Running Spark on K8s) . While > reading a huge parquet file via spark.read.parquet("") . I am getting > the following error . The same error is mentioned in Spark docs > https://spark.apac

Re: [DISCUSS] SPIP: Spark Connect - A client and server interface for Apache Spark.

2022-06-07 Thread Steve Loughran
On Fri, 3 Jun 2022 at 18:46, Martin Grund wrote: > Hi Everyone, > > We would like to start a discussion on the "Spark Connect" proposal. > Please find the links below: > > *JIRA* - https://issues.apache.org/jira/browse/SPARK-39375 > *SPIP Document* - > https://docs.google.com/document/d/1Mnl6jmGs

Re: Issue on Spark on K8s with Proxy user on Kerberized HDFS : Spark-25355

2022-05-03 Thread Steve Loughran
Pralabh, did you follow the URL provided in the exception message? I put a lot of effort into improving the diagnostics, where the wiki articles are part of the troubleshooting process https://issues.apache.org/jira/browse/HADOOP-7469 it's really disappointing when people escalate the problem to o

Re: Spark client for Hadoop 2.x

2022-04-12 Thread Steve Loughran
I should back up Dongjoon's comments with the observation that hadoop 2.10.x is the only branch-2 release which gets any security updates; on branch-3 it is 3.2.x and 3.3.x which do. Dongjoon's colleague Chao Sun was the release manager on the 3.3.2 release, so it got thoroughly tested with Spark. (I'

Re: Log4j 1.2.17 spark CVE

2021-12-14 Thread Steve Loughran
log4j 1.2.17 is not vulnerable. There is an existing CVE there from a log aggregation servlet; Cloudera products ship a patched release with that servlet stripped...asf projects are not allowed to do that. But: some recent Cloudera Products do include log4j 2.x, so colleagues of mine are busy patc

Re: HiveThrift2 ACID Transactions?

2021-11-23 Thread Steve Loughran
without commenting on any other part of this, note that it was in some hive commit operations where a race condition in rename surfaced https://issues.apache.org/jira/browse/HADOOP-16721 if you get odd errors about parent dirs not existing during renames, that'll be it...Upgrade to Hadoop-3.3.1 bi

Re: Handle FileAlreadyExistsException for .spark-staging files

2021-09-10 Thread Steve Loughran
one of the issues here is that Parquet creates files with overwrite=false; other output formats do not do this, so implicitly overwrite the output of previous attempts. Which is fine if you are confident that each task attempt (henceforth: TA) is writing to an isolated path. the next iteration of

Re: Observer Namenode and Committer Algorithm V1

2021-09-07 Thread Steve Loughran
dataLog looking for > the active namenode which I haven't looked into much. I'd expect to only > see that once since it seems to properly reuse a single FileContext > instance. > > Adam > > On Fri, Aug 20, 2021 at 2:22 PM Steve Loughran > wrote: > >> >&

Re: Observer Namenode and Committer Algorithm V1

2021-08-20 Thread Steve Loughran
ooh, this is fun. v2 isn't safe to use unless every task attempt generates files with exactly the same names and it is okay to intermingle the output of two task attempts. This is because task commit can fail partway through (or worse, the process can pause for a full GC), and a second attempt commi

MAPREDUCE-7341. Intermediate Manifest Committer for Azure + GCS

2021-07-07 Thread Steve Loughran
My little committer project, an intermediate manifest committer for Azure and GCS is reaching the stage where it's ready for others to look at https://github.com/apache/hadoop/pull/2971 Goals 1. Correctness even on GCS, which doesn't have atomic dir rename (so v1 isn't safe). It does use F

Re: Spark ACID compatibility

2021-06-22 Thread Steve Loughran
On Mon, 14 Jun 2021 at 19:07, Mich Talebzadeh wrote: > > > Now I am trying to read it in Hive > > 0: jdbc:hive2://rhes75:10099/default> desc test.randomDataDelta; > ++--+--+ > |col_name| data_type | comment | > ++--+-

Re: Missing module spark-hadoop-cloud in Maven central

2021-06-01 Thread Steve Loughran
(can't reply to user@, so pulling @dev instead. sorry) There is no fundamental reason why the hadoop-cloud POM and artifact isn't built/released by the ASF spark project; I think the effort it took to get the spark-hadoop-cloud module in at all w

Re: [DISCUSS] Add error IDs

2021-04-15 Thread Steve Loughran
Machine readable logs are always good, especially if you can read the entire logs into an SQL query. It might be good to use some specific differentiation between hint/warn/fatal errors in the numbering so that any automated analysis of the logs can identify the class of an error even if it's an err

Re: UserGroupInformation.doAS is working well in Spark Executors?

2021-04-15 Thread Steve Loughran
If you are using kerberized HDFS, the spark principal (or whoever is running the cluster) has to be declared as a proxy user. https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html Once done, you call val ugi = UserGroupInformation.createProxyUser("joe", UserGr
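The proxy-user declaration referred to above lives in the cluster's Hadoop configuration. A minimal sketch under stated assumptions: the service principal name "spark" and the wildcard values are illustrative, not taken from the original mail.

```xml
<!-- core-site.xml: allow the "spark" service principal to impersonate
     other users. The "*" values are illustrative; production clusters
     should narrow them to specific hosts and groups. -->
<property>
  <name>hadoop.proxyuser.spark.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.spark.groups</name>
  <value>*</value>
</property>
```

With that in place, filesystem calls are wrapped in `ugi.doAs(...)` using the proxy UGI created as in the snippet above.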

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-04-06 Thread Steve Loughran
On Thu, 11 Mar 2021 at 19:58, Attila Zsolt Piros < piros.attila.zs...@gmail.com> wrote: > I agree with you to extend the documentation around this. Moreover I > support to have specific unit tests for this. > > > There is clearly some demand for Spark to automatically clean up > checkpoints on shu

Re: AWS Consistent S3 & Apache Hadoop's S3A connector

2020-12-10 Thread Steve Loughran
On Mon, 7 Dec 2020 at 07:36, Chang Chen wrote: > Since S3A now works perfectly with S3Guard turned off, Could the Magic > Committer work with S3Guard off? If Yes, will performance degenerate? Or > if HADOOP-17400 is fixed, then it will have comparable performance? > Yes, works really well. * It

AWS Consistent S3 & Apache Hadoop's S3A connector

2020-12-04 Thread Steve Loughran
as sent to hadoop-general. TL;DR. S3 is consistent; S3A now works perfectly with S3Guard turned off, if not, file a JIRA. rename still isn't real, so don't rely on that or create(path, overwrite=false) for atomic operations --- If you've missed the announcement, AWS S3 storage is now strong

Hive isolation and context classloaders

2020-11-10 Thread Steve Loughran
I'm staring at https://issues.apache.org/jira/browse/HADOOP-17372 and a stack trace which claims that a com.amazonaws class doesn't implement an interface which it very much does 2020-11-10 05:27:33,517 [ScalaTest-main-running-S3DataFrameExampleSuite] WARN fs.FileSystem (FileSystem.java:createFil

Re: -Phadoop-provided still includes hadoop jars

2020-11-09 Thread Steve Loughran
On Mon, 12 Oct 2020 at 19:06, Sean Owen wrote: > I don't have a good answer, Steve may know more, but from looking at > dependency:tree, it looks mostly like it's hadoop-common that's at issue. > Without -Phive it remains 'provided' in the assembly/ module, but -Phive > causes it to come back in.

Re: A common naming policy for third-party packages/modules under org.apache.spark?

2020-09-23 Thread Steve Loughran
ility rules. Could you give us some > examples specifically? > > > Can I suggest some common prefix for third-party-classes put into the > spark package tree, just to make clear that they are external contributions? > > Bests, > Dongjoon. > > > On Mon, Sep 21, 2020 at

A common naming policy for third-party packages/modules under org.apache.spark?

2020-09-21 Thread Steve Loughran
I've just been stack-trace-chasing the 404-in-task-commit code: https://issues.apache.org/jira/browse/HADOOP-17216 And although it's got an org.apache.spark. prefix, it's actually org.apache.spark.sql.delta, which lives in github, so the code/issue tracker lives elsewhere. I understand why they'

Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-23 Thread Steve Loughran
On Wed, 22 Jul 2020 at 18:50, Holden Karau wrote: > Wonderful. To be clear the patch is more to start the discussion about how > we want to do it and less what I think is the right way. > > be happy to give a quick online tour of ongoing work on S3A enhancements some time next week, get feedback

Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Steve Loughran
On Wed, 22 Jul 2020 at 00:51, Holden Karau wrote: > Hi Folks, > > In Spark SQL there is the ability to have Spark do its partition > discovery/file listing in parallel on the worker nodes and also avoid > locality lookups. I'd like to expose this in core, but given the Hadoop > APIs it's a bit m

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-07-21 Thread Steve Loughran
On Sun, 12 Jul 2020 at 01:45, gpongracz wrote: > As someone who mainly operates in AWS it would be very welcome to have the > option to use an updated version of hadoop using pyspark sourced from pypi. > > Acknowledging the issues of backwards compatability... > > The most vexing issue is the lac

Re: java.lang.ClassNotFoundException for s3a comitter

2020-07-21 Thread Steve Loughran
al.io.cloud.PathOutputCommitProtocol"); > hadoopConfiguration.set("spark.sql.parquet.output.committer.class", > "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"); > hadoopConfiguration.set("fs.s3a.connection.maximum", > Integer.toString(coreCount

Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-07-01 Thread Steve Loughran
https://issues.apache.org/jira/browse/MAPREDUCE-7282 "MR v2 commit algorithm is dangerous, should be deprecated and not the default" someone do a PR to change the default & if it doesn't break too much I'll merge it On Mon, 29 Jun 2020 at 13:20, Steve Loughran wrote:
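Until such a default change lands, jobs can pin the safer v1 algorithm explicitly. A hedged sketch of the relevant configuration; the application jar name is a placeholder:

```shell
# Pin the MR file output commit algorithm to v1; v2 can expose partial
# output if a job fails mid-commit (see MAPREDUCE-7282).
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 \
  your-app.jar
```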

Re: preferredlocations for hadoopfsrelations based baseRelations

2020-06-29 Thread Steve Loughran
Here's a class which lets you provide a function on a row-by-row basis to declare location https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/cloudera/ParallelizedWithLocalityRDD.scala needs to be in o.a.spark as something you

Re: java.lang.ClassNotFoundException for s3a comitter

2020-06-29 Thread Steve Loughran
you are going to need hadoop-3.1 on your classpath, with hadoop-aws and the same aws-sdk it was built with (1.11.something). Mixing hadoop JARs is doomed. Using a different aws sdk jar is a bit risky, though more recent upgrades have all been fairly low stress On Fri, 19 Jun 2020 at 05:39, murat mig

Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-06-29 Thread Steve Loughran
t; https://issues.apache.org/jira/browse/SPARK-20107?focusedCommentId=15945177&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15945177 > > I also did discuss this a bit with Steve Loughran and his opinion was that > v2 should just be deprecated all together.

Re: RDD order guarantees

2020-05-06 Thread Steve Loughran
On Tue, 7 Apr 2020 at 15:26, Antonin Delpeuch wrote: > Hi, > > Sorry to dig out this thread but this bug is still present. > > The fix proposed in this thread (creating a new FileSystem implementation > which sorts listed files) was rejected, with the suggestion that it is the > FileInputFormat's

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-22 Thread Steve Loughran
On Tue, Nov 19, 2019 at 10:40 PM Cheng Lian wrote: > Hey Steve, > > In terms of Maven artifact, I don't think the default Hadoop version > matters except for the spark-hadoop-cloud module, which is only meaningful > under the hadoop-3.2 profile. All the other spark-* artifacts published to > Mav

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-22 Thread Steve Loughran
On Thu, Nov 21, 2019 at 12:53 AM Dongjoon Hyun wrote: > Thank you for much thoughtful clarification. I agree with your all options. > > Especially, for Hive Metastore connection, `Hive isolated client loader` > is also important with Hive 2.3 because Hive 2.3 client cannot talk with > Hive 2.1 an

Re: Ask for ARM CI for spark

2019-11-17 Thread Steve Loughran
The ASF PR team would like something like that "Spark now supports ARM" in press releases. And don't forget: they do like to be involved in the launch of the final release. On Fri, Nov 15, 2019 at 9:46 AM bo zhaobo wrote: > Hi @Sean Owen , > > Thanks for your idea. > > We may use the bad wo

Re: Adding JIRA ID as the prefix for the test case name

2019-11-17 Thread Steve Loughran
pull/25630 . It will need some more > efforts to investigate as well. > > On Fri, 15 Nov 2019, 20:56 Steve Loughran, > wrote: > >> Junit5: Display names. >> >> Goes all the way to the XML. >> >> >> https://junit.org/junit5/docs/curr

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-17 Thread Steve Loughran
Can I take this moment to remind everyone that the version of hive which spark has historically bundled (the org.spark-project one) is an orphan project put together to deal with Hive's shading issues and a source of unhappiness in the Hive project. Whatever gets shipped should do its best to avoid

Re: Adding JIRA ID as the prefix for the test case name

2019-11-15 Thread Steve Loughran
Junit5: Display names. Goes all the way to the XML. https://junit.org/junit5/docs/current/user-guide/#writing-tests-display-names On Thu, Nov 14, 2019 at 6:13 PM Shixiong(Ryan) Zhu wrote: > Should we also add a guideline for non Scala tests? Other languages (Java, > Python, R) don't support u

Re: [DISCUSS] writing structured streaming dataframe to custom S3 buckets?

2019-11-08 Thread Steve Loughran
> spark.sparkContext.hadoopConfiguration.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") This is some superstition which seems to get carried through stack overflow articles. You do not need to declare the implementation class for s3a:// any more than you have to do for H
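In other words, the scheme-to-class binding already ships in Hadoop's own defaults, so only the settings you actually need have to be supplied. A sketch with placeholder credential values and jar name:

```shell
# No need to set fs.s3a.impl: hadoop-common's core-default.xml already
# maps the s3a:// scheme to S3AFileSystem. Supply only real settings,
# e.g. credentials (values here are placeholders):
spark-submit \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY \
  your-app.jar
```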

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-04 Thread Steve Loughran
On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas wrote: > On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran > wrote: > >> It would be really good if the spark distributions shipped with later >> versions of the hadoop artifacts. >> > > I second this. If we need to

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-04 Thread Steve Loughran
I'd move spark's branch-2 line to 2.9.x as (a) spark's version of httpclient hits a bug in the AWS SDK used in hadoop-2.8 unless you revert that patch https://issues.apache.org/jira/browse/SPARK-22919 (b) there's only one future version of 2.8x planned, which is expected once myself or someone els

Re: Spark 3.0 and S3A

2019-11-01 Thread Steve Loughran
On Mon, Oct 28, 2019 at 3:40 PM Sean Owen wrote: > There will be a "Hadoop 3.x" version of 3.0, as it's essential to get > a JDK 11-compatible build. you can see the hadoop-3.2 profile. > hadoop-aws is pulled in in the hadoop-cloud module I believe, so bears > checking whether the profile updates

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-01 Thread Steve Loughran
What is the current default value? As the 2.x releases are becoming EOL: 2.7 is dead, there might be a 2.8.x; for now 2.9 is the branch-2 release getting attention. 2.10.0 shipped yesterday, but the ".0" means there will inevitably be surprises. One issue about using older versions is that any p

Re: Minimum JDK8 version

2019-10-25 Thread Steve Loughran
On Fri, Oct 25, 2019 at 2:56 AM Dongjoon Hyun wrote: > > All versions of JDK8 are not the same naturally. For example, Hadoop > community also have the following document although they are not specifying > the minimum versions. > > - > https://cwiki.apache.org/confluence/display/HADOOP/Hadoop

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-07 Thread Steve Loughran
command replicates what happens during FileInputFormat scans, so is how I'm going to tune IOPs there. It might also be good to have those bits of the hadoop MR classes which spark uses to log internally @ debug, so everything gets this logging if they ask for it. Happy to take contri

Re: DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading S3 files

2019-09-06 Thread Steve Loughran
On Fri, Sep 6, 2019 at 2:50 PM Sean Owen wrote: > I think the problem is calling globStatus to expand all 300K files. > This is a general problem for object stores and huge numbers of files. > Steve L. may have better thoughts on real solutions. But you might > consider, if possible, running a lo

Re: concurrent writes with dynamic partition overwrite mode

2019-09-05 Thread Steve Loughran
On Sun, Sep 1, 2019 at 7:54 PM Koert Kuipers wrote: > hi, > i am struggling to understand if concurrent writes to same basedir but > different partitions are safe with file sources such as parquet. > > i tested this in spark 2.4 and spark 3.0.0-SNAPSHOT with real concurrent > jobs on hdfs and it

Re: Why two netty libs?

2019-09-04 Thread Steve Loughran
Zookeeper client is/was netty 3, AFAIK, so if you want to use it for anything, it ends up on the CP On Tue, Sep 3, 2019 at 5:18 PM Shixiong(Ryan) Zhu wrote: > Yep, historical reasons. And Netty 4 is under another namespace, so we can > use Netty 3 and Netty 4 in the same JVM. > > On Tue, Sep 3,

Re: IPv6 support

2019-07-17 Thread Steve Loughran
Fairly neglected hadoop patch, FWIW; https://issues.apache.org/jira/browse/HADOOP-11890 FB have been running HDFS &c on IPv6 for a while, but their codebase has diverged; getting the stuff into trunk is going to take effort. At least the JDK has moved on and should be better On Wed, Jul 17, 2019

Re: Option for silent failure while reading a list of files.

2019-07-01 Thread Steve Loughran
Where is this list of files coming from? If you made the list, then yes, the expectation is generally "supply a list of files which are present" on the basis that general convention is "missing files are considered bad" Though you could try setting spark.sql.files.ignoreCorruptFiles=true to see w
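For reference, the setting mentioned above is an ordinary SQL conf; a minimal sketch of passing it at submit time (the jar name is a placeholder):

```shell
# Corrupt or unreadable files are skipped with a warning
# instead of failing the whole job.
spark-submit \
  --conf spark.sql.files.ignoreCorruptFiles=true \
  your-app.jar
```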

Re: Ask for ARM CI for spark

2019-06-28 Thread Steve Loughran
r ARM platform in upstream community, it bring confidence to end user and > customers when they plan to deploy these projects on ARM. > > This is absolute long term work, let's to make it step by step, CI, > testing, issue and resolving. > > Steve Loughran 于2019年6月27日周四 下午

Re: Ask for ARM CI for spark

2019-06-27 Thread Steve Loughran
LevelDB and native codecs are invariably a problem here, as is anything else doing misaligned IO. Protobuf has also had "issues" in the past, see https://issues.apache.org/jira/browse/HADOOP-16100 I think any AArch64 work is going to have to define very clearly what "works" is defined as; spark stan

Re: Detect executor core count

2019-06-18 Thread Steve Loughran
be aware that older java 8 versions count the #of cores in the host, not those allocated for the container they run in https://bugs.openjdk.java.net/browse/JDK-8140793 On Tue, Jun 18, 2019 at 8:13 PM Ilya Matiach wrote: > Hi Andrew, > > I tried to do something similar to that in the LightGBM > c
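The container-vs-host distinction is easy to probe from inside the JVM; a small standalone illustration (not code from the original mail):

```java
public class CoreProbe {
    public static void main(String[] args) {
        // On JDK 8 builds predating the JDK-8140793 fix this reports the
        // host's core count even inside a CPU-limited container; patched
        // builds report the container's allocation instead.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("available processors: " + cores);
    }
}
```

Running this inside and outside the container and comparing the two numbers shows whether your JDK honours the container limit.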

Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

2019-05-22 Thread Steve Loughran
hadoop is still on 1.7.7 branch. A move to 1.9 would probably be as painful as a move to 1.8.x, so submit a patch for hadoop trunk. Last PR there wasn't quite ready and I didn't get any follow up to the "what is this going to break" question https://issues.apache.org/jira/browse/HADOOP-13386 Ther

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-04-03 Thread Steve Loughran
On Tue, Apr 2, 2019 at 9:39 PM Ankur Gupta wrote: > Hi Steve, > > Thanks for your feedback. From your email, I could gather the following > two important points: > >1. Report failures to something (cluster manager) which can opt to >destroy the node and request a new one >2. Pluggable

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-04-02 Thread Steve Loughran
On Fri, Mar 29, 2019 at 6:18 PM Reynold Xin wrote: > We tried enabling blacklisting for some customers and in the cloud, very > quickly they end up having 0 executors due to various transient errors. So > unfortunately I think the current implementation is terrible for cloud > deployments, and sh

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-19 Thread Steve Loughran
you might want to look at the work on FPGA resources; again it should just be a resource available by a scheduler. Key thing is probably just to keep the docs generic https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingFPGA.html I don't know where you get those FPGAs to play w

Re: proposal for expanded & consistent timestamp types

2019-01-02 Thread Steve Loughran
uldn't have to care about the persistence format *or which app created the data* What does Arrow do in this world, incidentally? On 2 Jan 2019, at 11:48, Steve Loughran mailto:ste...@hortonworks.com>> wrote: On 17 Dec 2018, at 17:44, Zoltan Ivanfi mailto:z...@cloudera.com.INVALID

Re: proposal for expanded & consistent timestamp types

2019-01-02 Thread Steve Loughran
On 17 Dec 2018, at 17:44, Zoltan Ivanfi mailto:z...@cloudera.com.INVALID>> wrote: Hi, On Sun, Dec 16, 2018 at 4:43 AM Wenchen Fan mailto:cloud0...@gmail.com>> wrote: Shall we include Parquet and ORC? If they don't support it, it's hard for general query engines like Spark to support it. Fo

Re: Hadoop 3 support

2018-10-23 Thread Steve Loughran
> On 16 Oct 2018, at 22:06, t4 wrote: > > has anyone got spark jars working with hadoop3.1 that they can share? i am > looking to be able to use the latest hadoop-aws fixes from v3.1 we do, but with * a patched hive JAR * building spark with -Phive,yarn,hadoop-3.1,hadoop-cloud,kinesis

Re: Random sampling in tests

2018-10-09 Thread Steve Loughran
Randomized testing can, in theory, help you explore a far larger area of an app's environment than you could explicitly cover, such as "does everything work in the Turkish locale where "I".toLower()!="i"", etc. Good: faster tests, especially on an essentially-non-finite set of options. Bad
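The Turkish-locale trap is a one-liner to reproduce in plain Java (a standalone illustration, not code from the thread):

```java
import java.util.Locale;

public class TurkishI {
    public static void main(String[] args) {
        // Turkish lowercases 'I' to dotless 'ı' (U+0131), so locale-sensitive
        // case folding breaks any code assuming "I".toLowerCase().equals("i").
        System.out.println("I".toLowerCase(new Locale("tr")).equals("i")); // false
        System.out.println("I".toLowerCase(Locale.ROOT).equals("i"));      // true
    }
}
```

This is why case-insensitive comparisons of protocol keywords or config keys should pass Locale.ROOT explicitly rather than rely on the default locale.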

Re: [Discuss] Datasource v2 support for Kerberos

2018-10-02 Thread Steve Loughran
On 2 Oct 2018, at 04:44, tigerquoll mailto:tigerqu...@outlook.com>> wrote: Hi Steve, I think that passing a kerberos keytab around is one of those bad ideas that is entirely appropriate to re-question every single time you come across it. It has been used already in spark when interacting with

Re: saveAsTable in 2.3.2 throws IOException while 2.3.1 works fine?

2018-10-01 Thread Steve Loughran
On 30 Sep 2018, at 19:37, Jacek Laskowski mailto:ja...@japila.pl>> wrote: scala> spark.range(1).write.saveAsTable("demo") 2018-09-30 17:44:27 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException 2018-09-30 17:44:28 ERROR FileOutputCommitter:314 - Mkdirs f

Re: [Structured Streaming SPARK-23966] Why non-atomic rename is problem in State Store ?

2018-10-01 Thread Steve Loughran
On 11 Aug 2018, at 17:33, chandan prakash mailto:chandanbaran...@gmail.com>> wrote: Hi All, I was going through this pull request about new CheckpointFileManager abstraction in structured streaming coming in 2.4 : https://issues.apache.org/jira/browse/SPARK-23966 https://github.com/apache/spar

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-27 Thread Steve Loughran
> On 25 Sep 2018, at 07:52, tigerquoll wrote: > > To give some Kerberos specific examples, The spark-submit args: > -–conf spark.yarn.keytab=path_to_keytab -–conf > spark.yarn.principal=princi...@realm.com > > are currently not passed through to the data sources. > > > I'm not sure why th
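What the thread is asking for can be pictured as a small allowlist that forwards selected session confs into a data source's options. The conf keys below are real Spark keys; the forwarding function itself is hypothetical — it is the missing piece being discussed, not an existing API:

```python
# Hypothetical sketch of forwarding selected security settings from the
# session configuration to a data source's options map. The key names
# are real Spark conf keys; the forwarding mechanism is invented for
# illustration only.
FORWARDED_KEYS = {
    "spark.yarn.keytab",
    "spark.yarn.principal",
}


def datasource_options(session_conf: dict, user_options: dict) -> dict:
    """Merge forwarded security confs under user-supplied options."""
    forwarded = {k: v for k, v in session_conf.items() if k in FORWARDED_KEYS}
    return {**forwarded, **user_options}  # user options win on conflict
```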

Re: sql compile failing with Zinc?

2018-08-14 Thread Steve Loughran
can give it more memory with -J-Xmx2g or whatever. If you're running ./build/mvn and letting it run zinc we might need to increase the memory that it requests in the script. On Tue, Aug 14, 2018 at 2:56 PM Steve Loughran wrote: Is anyone else getting th

sql compile failing with Zinc?

2018-08-14 Thread Steve Loughran
Is anyone else getting the sql module maven build on master branch failing when you use zinc for incremental builds? [warn] ^ java.lang.OutOfMemoryError: GC overhead limit exceeded at scala.tools.nsc.backend.icode.GenICode$Scope.<init>(GenICode.scala:2225) at scala.t

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-07 Thread Steve Loughran
CSV with schema inference is a full read of the data, so that could be one of the problems. Do it at most once, print out the schema and use it from then on during ingress & use something else for persistence. On 6 Aug 2018, at 05:44, makatun wrote: a. cs
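The infer-once advice generalizes: pay for inference over a bounded sample a single time, persist the schema, and supply it explicitly on every later read. A stdlib-only sketch of the idea — this is not Spark's actual inference (which scans the whole file), just the pattern:

```python
import csv
import io
import json


def infer_schema(csv_text: str, sample_rows: int = 100) -> dict:
    """One-off schema inference over a bounded sample of a CSV.

    A stand-in for Spark's inferSchema pass; the point from the reply
    above is to pay this cost once, persist the result, and supply it
    explicitly on subsequent reads.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    schema = {}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for col, val in row.items():
            try:
                int(val)
                t = "int"
            except ValueError:
                t = "string"
            # widen to string when sampled values disagree
            schema[col] = t if schema.get(col, t) == t else "string"
    return schema


# Infer once, persist, and reload for every subsequent ingest:
schema = infer_schema("id,name\n1,ann\n2,bob\n")
saved = json.dumps(schema)
reloaded = json.loads(saved)
```

In Spark itself the equivalent is to capture the inferred schema once and pass it back on later reads instead of re-running inference.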

Opentrace in ASF projects

2018-07-06 Thread Steve Loughran
FYI, there's some initial exploring of what it would take to move the HDFS wire protocol tracing from HTrace to OpenTracing, and wire up the other stores too https://issues.apache.org/jira/browse/HADOOP-15566 If anyone has any input/insight or code review capacity, it'd be welcome.

hadoop-aws versions (was Re: [VOTE] Spark 2.3.1 (RC4))

2018-06-26 Thread Steve Loughran
following up after a ref to this in https://issues.apache.org/jira/browse/HADOOP-15559 the AWS SDK is a very fast moving project, with a release cycle of ~2 weeks, but it's in the state Fred Brooks described, "the number of bugs is constant, they just move around"; bumping up an AWS release is

Re: Running lint-java during PR builds?

2018-05-28 Thread Steve Loughran
> On 21 May 2018, at 17:20, Marcelo Vanzin wrote: > > Is there a way to trigger it conditionally? e.g. only if the diff > touches java files. > what about adding it as another command alongside "jenkins test this please", something like "lint this please" > On Mon, May
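The conditional trigger Marcelo asks about reduces to a predicate over the changed-file list (as produced by, say, `git diff --name-only`). A sketch with a hypothetical helper name:

```python
def should_run_lint_java(changed_files):
    """Decide whether a PR build needs the lint-java step.

    Sketch of the conditional trigger discussed above: only lint when
    the diff touches Java sources. Takes a list of paths such as the
    output of `git diff --name-only`.
    """
    return any(f.endswith(".java") for f in changed_files)
```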

Re: saveAsNewAPIHadoopDataset must not enable speculation for parquet file?

2018-04-26 Thread Steve Loughran
sorry, not noticed this followup. Been busy with other issues. On 3 Apr 2018, at 11:19, cane wrote: Now, if we use saveAsNewAPIHadoopDataset with speculation enabled, it may cause data loss. I check the comment of this API: We should make sure our tasks are idemp

Re: time for Apache Spark 3.0?

2018-04-05 Thread Steve Loughran
On 5 Apr 2018, at 18:04, Matei Zaharia wrote: Java 9/10 support would be great to add as well. Be aware that the work moving hadoop core to java 9+ is still a big piece of work being undertaken by Akira Ajisaka & colleagues at NTT https://issues.apache.org/ji

Re: saveAsNewAPIHadoopDataset must not enable speculation for parquet file?

2018-04-03 Thread Steve Loughran
> On 3 Apr 2018, at 11:19, cane wrote: > > Now, if we use saveAsNewAPIHadoopDataset with speculation enabled, it may cause > data loss. > I check the comment of this API: > > We should make sure our tasks are idempotent when speculation is enabled, > i.e. do > * not use output committer that w
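Why speculation needs an idempotent commit protocol can be shown with a toy model: each speculative attempt writes only to a private attempt directory, and a committer promotes exactly one winner; writing straight to the final path would let a losing attempt clobber the winner's output. All paths and names here are illustrative, not Spark's actual committer layout:

```python
import os
import shutil
import tempfile


def run_attempt(job_dir: str, task: str, attempt: int, data: str) -> str:
    """Each speculative attempt writes only to its private attempt dir."""
    d = os.path.join(job_dir, "_temporary", f"{task}_{attempt}")
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, "part-0"), "w") as f:
        f.write(data)
    return d


def commit_attempt(job_dir: str, attempt_dir: str, task: str) -> None:
    """Promote exactly one winning attempt into the final output.

    Toy model of the idempotence requirement: the first attempt to
    commit wins; later attempts are simply discarded.
    """
    final = os.path.join(job_dir, task)
    if not os.path.exists(final):  # first committer wins
        shutil.move(attempt_dir, final)
    else:
        shutil.rmtree(attempt_dir)  # losing speculative attempt is dropped


job = tempfile.mkdtemp()
a0 = run_attempt(job, "task0", 0, "attempt-0 output")
a1 = run_attempt(job, "task0", 1, "attempt-1 output")
commit_attempt(job, a0, "task0")
commit_attempt(job, a1, "task0")
```

A committer that skipped the attempt-directory indirection and wrote directly to `final` would lose this guarantee, which is the failure mode the thread describes.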

Re: Hadoop 3 support

2018-04-03 Thread Steve Loughran
On 3 Apr 2018, at 01:30, Saisai Shao wrote: Yes, the main blocking issue is the hive version used in Spark (1.2.1.spark) doesn't support run on Hadoop 3. Hive will check the Hadoop version in the runtime [1]. Besides this I think some pom changes should be enou

Re: A new external catalog

2018-02-16 Thread Steve Loughran
On 14 Feb 2018, at 19:56, Tayyebi, Ameen wrote: Newbie question: I want to add system/integration tests for the new functionality. There are a set of existing tests around Spark Catalog that I can leverage. Great. The provider I’m writing is backed by a web servi

Re: A new external catalog

2018-02-16 Thread Steve Loughran
On 14 Feb 2018, at 13:51, Tayyebi, Ameen wrote: Thanks a lot Steve. I’ll go through the Jira’s you linked in detail. I took a quick look and am sufficiently scared for now. I had run into that warning from the S3 stream before. Sigh. things like that are trouble

Re: Regarding NimbusDS JOSE JWT jar 3.9 security vulnerability

2018-02-14 Thread Steve Loughran
might be coming in transitively https://issues.apache.org/jira/browse/HADOOP-14799 On 13 Feb 2018, at 18:18, PJ Fanning wrote: Hi Sujith, I didn't find the nimbusds dependency in any spark 2.2 jars. Maybe I missed something. Could you tell us which spark jar has the
