Spark 3.5.x on Java 21?

2024-05-08 Thread Stephen Coy
Hi everyone, We’re about to upgrade our Spark clusters from Java 11 and Spark 3.2.1 to Spark 3.5.1. I know that 3.5.1 is supposed to be fine on Java 17, but will it run OK on Java 21? Thanks, Steve C

Re: [spark-graphframes]: Generating incorrect edges

2024-04-30 Thread Stephen Coy
Hi Mich, I was just reading random questions on the user list when I noticed that you said: On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh wrote: 1) You are using monotonically_increasing_id(), which is not collision-resistant in distributed environments like Spark. Multiple hosts can
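For reference, a minimal sketch (not from the thread) of how monotonically_increasing_id() is typically attached as an ID column; per the Spark documentation the generated values are unique across partitions but not consecutive, which is the behaviour being discussed here.
```java
import static org.apache.spark.sql.functions.monotonically_increasing_id;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IdExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("monotonically-increasing-id-sketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> vertices = spark.range(0, 10).toDF("value")
                // Per the Spark docs the generated IDs are guaranteed unique
                // across partitions (partition ID in the upper 31 bits,
                // per-partition record number in the lower 33 bits), but they
                // are NOT consecutive.
                .withColumn("id", monotonically_increasing_id());

        vertices.show(false);
        spark.stop();
    }
}
```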

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick, When this has happened to me in the past (admittedly via spark-submit) it has been because another job was still running and had already claimed some of the resources (cores and memory). I think this can also happen if your configuration tries to claim resources that will never be
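As an illustration only (all values here are hypothetical), the resource claims that can leave a job waiting indefinitely are the usual executor settings; if the cluster can never satisfy them, the job sits in a scheduling state that looks like a hang:
```java
import org.apache.spark.sql.SparkSession;

public class ResourceClaimExample {
    public static void main(String[] args) {
        // Hypothetical settings: if the cluster only has, say, 16 cores in
        // total, asking for 64 can never be satisfied and the job just waits.
        SparkSession spark = SparkSession.builder()
                .appName("resource-claim-sketch")
                .config("spark.executor.cores", "4")
                .config("spark.executor.memory", "8g")
                .config("spark.cores.max", "64")   // exceeds what the cluster can offer
                .getOrCreate();

        spark.range(0, 1000).count();
        spark.stop();
    }
}
```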

Re: [Building] Building with JDK11

2022-07-18 Thread Stephen Coy
of Apache Maven. Cheers, Steve C On 18 Jul 2022, at 4:12 pm, Sergey B. <sergey.bushma...@gmail.com> wrote: Hi Steve, Can you shed some light on why they need $JAVA_HOME at all if everything is already in place? Regards, - Sergey On Mon, Jul 18, 2022 at 4:31 AM Stephen Coy

Re: [Building] Building with JDK11

2022-07-17 Thread Stephen Coy
Hi Szymon, There seems to be a common misconception that setting JAVA_HOME will set the version of Java that is used. This is not true, because in most environments you need to have a PATH environment variable set up that points at the version of Java that you want to use. You can set
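A quick way to confirm which JVM actually ran (a small check of my own, not from the thread) is to ask the runtime itself rather than trusting JAVA_HOME:
```java
public class WhichJava {
    public static void main(String[] args) {
        // These reflect the JVM that was actually launched (i.e. the `java`
        // found first on PATH), regardless of what JAVA_HOME says.
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("java.home    = " + System.getProperty("java.home"));
        System.out.println("JAVA_HOME    = " + System.getenv("JAVA_HOME"));
    }
}
```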

Re: Retrieve the count of spark nodes

2022-06-08 Thread Stephen Coy
Hi there, We use something like: /* * Force Spark to initialise the defaultParallelism by executing a dummy parallel operation and then return * the resulting defaultParallelism. */ private int getWorkerCount(SparkContext sparkContext) { sparkContext.parallelize(List.of(1, 2, 3,
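The excerpt is cut off; a minimal sketch of how such a helper can be completed (assuming a JavaSparkContext wrapper, since the bare SparkContext API is Scala-oriented):
```java
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;

class ParallelismHelper {
    /*
     * Force Spark to initialise the defaultParallelism by executing a dummy
     * parallel operation and then return the resulting defaultParallelism.
     */
    static int getWorkerCount(JavaSparkContext sparkContext) {
        sparkContext.parallelize(List.of(1, 2, 3), 3).count();
        return sparkContext.defaultParallelism();
    }
}
```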

Re: Using Avro file format with SparkSQL

2022-02-14 Thread Stephen Coy
Hi Morven, We use --packages for all of our Spark jobs. Spark downloads the specified jar and all of its dependencies from a Maven repository. This means we never have to build fat or uber jars. It does mean that the Apache Ivy configuration has to be set up correctly, though. Cheers, Steve C
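The same behaviour can also be driven from configuration rather than the command line; a sketch only, with an illustrative coordinate and a placeholder path:
```java
import org.apache.spark.sql.SparkSession;

public class PackagesExample {
    public static void main(String[] args) {
        // Equivalent in spirit to spark-submit --packages: Spark resolves the
        // Maven coordinates (and their transitive dependencies) via Ivy at
        // startup, so no fat/uber jar is needed.
        SparkSession spark = SparkSession.builder()
                .appName("packages-sketch")
                .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.1.2")
                .getOrCreate();

        // Placeholder path; the avro format is provided by the package above.
        spark.read().format("avro").load("/path/to/data.avro").show();
        spark.stop();
    }
}
```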

Re: Migration to Spark 3.2

2022-01-27 Thread Stephen Coy
-ooxml:jar:5.2.0:compile [INFO] +- org.apache.poi:poi-ooxml-lite:jar:5.2.0:compile [INFO] +- org.apache.xmlbeans:xmlbeans:jar:5.0.3:compile [INFO] +- com.github.virtuald:curvesapi:jar:1.06:compile [INFO] \- org.apache.logging.log4j:log4j-api:jar:2.17.1:compile On Thu, 27 Jan 2022 at 00:3

Re: Migration to Spark 3.2

2022-01-26 Thread Stephen Coy
Hi Aurélien! Please run mvn dependency:tree and check it for Jackson dependencies. Feel free to respond with the output if you have any questions about it. Cheers, Steve C > On 22 Jan 2022, at 10:49 am, Aurélien Mazoyer wrote: > > Hello, > > I migrated my code to Spark 3.2 and I am

Re: Log4J 2 Support

2021-11-09 Thread Stephen Coy
as do other libs, and that isn't what the shims cover. Could be possible now or with more cleverness but the simple thing didn't work out IIRC. On Tue, Nov 9, 2021, 4:32 PM Stephen Coy <s...@infomedia.com.au> wrote: Hi there, It’s true that the preponderance of log4j 1.2.x in many e

Re: Log4J 2 Support

2021-11-09 Thread Stephen Coy
Hi there, It’s true that the preponderance of log4j 1.2.x in many existing live projects is kind of a pain in the butt. But there is a solution. 1. Migrate all Spark code to use slf4j APIs; 2. Exclude log4j 1.2.x from any dependencies sucking it in; 3. Include the log4j-over-slf4j bridge jar
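For step 1, the application-facing code is the easy part; a minimal sketch of coding against the slf4j facade (nothing Spark-specific), so that the logging backend becomes purely a packaging decision:
```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Slf4jExample {
    // The code only sees the slf4j API; whether log4j 1.2.x, the
    // log4j-over-slf4j bridge, or a log4j 2 backend sits underneath is
    // decided by the dependencies on the classpath, not by this code.
    private static final Logger LOG = LoggerFactory.getLogger(Slf4jExample.class);

    public static void main(String[] args) {
        LOG.info("Hello from slf4j; the backend is chosen at deploy time");
    }
}
```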

Re: Missing module spark-hadoop-cloud in Maven central

2021-06-01 Thread Stephen Coy
I have been building Apache Spark from source just so I can get this dependency. 1. git checkout v3.1.1 2. dev/make-distribution.sh --name hadoop-cloud-3.2 --tgz -Pyarn -Phadoop-3.2 -Pyarn -Phadoop-cloud -Phive-thriftserver -Dhadoop.version=3.2.0 It is kind of a nuisance having to do

Re: pyspark sql load with path of special character

2021-04-25 Thread Stephen Coy
It probably does not like the colons in the path name “…20:04:27+00:00/…”, especially if you’re running on a Windows box. On 24 Apr 2021, at 1:29 am, Regin Quinoa <sweatr...@gmail.com> wrote: Hi, I am using pyspark sql to load files into table following ```LOAD DATA LOCAL INPATH

Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-15 Thread Stephen Coy
Hi there, At risk of stating the obvious, the first step is to ensure that your Spark application and S3 bucket are colocated in the same AWS region. Steve C On 16 Mar 2021, at 3:31 am, Alchemist <alchemistsrivast...@gmail.com> wrote: How to optimize s3 list S3 file using

Re: Unsubscribe

2020-08-26 Thread Stephen Coy
The instructions for all Apache mail lists are in the mail headers: List-Unsubscribe: On 27 Aug 2020, at 7:49 am, Jeff Evans <jeffrey.wayne.ev...@gmail.com> wrote: That is not how you unsubscribe. See here for instructions:

Re: S3 read/write from PySpark

2020-08-11 Thread Stephen Coy
:238) at java.base/java.lang.Thread.run(Thread.java:834) On Thu, 6 Aug 2020 at 17:19, Stephen Coy <s...@infomedia.com.au> wrote: Hi Daniel, It looks like …BasicAWSCredentialsProvider has become org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider. However, the way tha

Re: S3 read/write from PySpark

2020-08-06 Thread Stephen Coy
Hi Daniel, It looks like …BasicAWSCredentialsProvider has become org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider. However, the way that the username and password are provided appears to have changed, so you will probably need to look into that. Cheers, Steve C On 6 Aug 2020, at 11:15
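A hedged sketch of where those credentials normally end up with the newer provider (the key names are the standard Hadoop S3A ones; the bucket path and environment variables here are illustrative):
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

public class S3aCredentialsExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("s3a-credentials-sketch")
                .getOrCreate();

        // SimpleAWSCredentialsProvider reads fs.s3a.access.key and
        // fs.s3a.secret.key from the Hadoop configuration rather than taking
        // credentials from the URL.
        Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
        hadoopConf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider");
        hadoopConf.set("fs.s3a.access.key", System.getenv("AWS_ACCESS_KEY_ID"));
        hadoopConf.set("fs.s3a.secret.key", System.getenv("AWS_SECRET_ACCESS_KEY"));

        spark.read().text("s3a://some-bucket/some/path/").show();
        spark.stop();
    }
}
```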

Re: Tab delimited csv import and empty columns

2020-08-05 Thread Stephen Coy
Because its default is the empty string, empty strings become null by default. On Fri, Jul 31, 2020 at 3:20 AM Stephen Coy <s...@infomedia.com.au.invalid> wrote: That does not work. This is Spark 3.0 by the way. I have been looking at the Spark unit tests and there does not seem to be

Re: Tab delimited csv import and empty columns

2020-07-31 Thread Stephen Coy
") Hope it helps On Thu, 30 Jul 2020 at 08:49, Stephen Coy <s...@infomedia.com.au.invalid> wrote: Hi there, I’m trying to import a tab delimited file with: Dataset catalogData = sparkSession .read() .option("sep", "\t") .option("header", "

Tab delimited csv import and empty columns

2020-07-30 Thread Stephen Coy
Hi there, I’m trying to import a tab delimited file with: Dataset catalogData = sparkSession .read() .option("sep", "\t") .option("header", "true") .csv(args[0]) .cache(); This works great, except for the fact that any column that is empty is given the value null, when I need these
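One workaround (a sketch of my own, not necessarily the fix arrived at in the thread) is to backfill the nulls with empty strings after the read, which sidesteps the CSV reader's empty-string default discussed in the replies above:
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TsvEmptyColumnsExample {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder()
                .appName("tsv-empty-columns-sketch")
                .getOrCreate();

        Dataset<Row> catalogData = sparkSession
                .read()
                .option("sep", "\t")
                .option("header", "true")
                .csv(args[0])
                // Empty fields come back as null; replace them with "" in all
                // string columns after the read.
                .na().fill("")
                .cache();

        catalogData.show();
        sparkSession.stop();
    }
}
```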

When does SparkContext.defaultParallelism have the correct value?

2020-07-06 Thread Stephen Coy
Hi there, I have found that if I invoke sparkContext.defaultParallelism() too early, it will not return the correct value. For example, if I write this: final JavaSparkContext sparkContext = new JavaSparkContext(sparkSession.sparkContext()); final int workerCount =

Re: java.lang.ClassNotFoundException for s3a comitter

2020-07-06 Thread Stephen Coy
Hi Steve, While I understand your point regarding the mixing of Hadoop jars, this does not address the java.lang.ClassNotFoundException. Prebuilt Apache Spark 3.0 builds are only available for Hadoop 2.7 or Hadoop 3.2. Not Hadoop 3.1. The only place that I have found that missing class is in

Re: java.lang.ClassNotFoundException for s3a comitter

2020-06-18 Thread Stephen Coy
Hi Murat Migdisoglu, Unfortunately you need the secret sauce to resolve this. It is necessary to check out the Apache Spark source code and build it with the right command line options. This is what I have been using: dev/make-distribution.sh --name my-spark --tgz -Pyarn -Phadoop-3.2 -Pyarn

Re: [PySpark] How to write HFiles as an 'append' to the same directory?

2020-03-16 Thread Stephen Coy
I encountered a similar problem when trying to: ds.write().save(“s3a://some-bucket/some/path/table”); which writes the content as a bunch of parquet files in the “folder” named “table”. I am using a Flintrock cluster with the Spark 3.0 preview FWIW. Anyway, I just used the AWS SDK to remove
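A sketch of the kind of "remove the existing prefix first" step described here, using the AWS SDK for Java v1 (bucket and prefix names are hypothetical; error handling omitted):
```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class S3PrefixCleaner {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "some-bucket";
        String prefix = "some/path/table/";

        // List everything under the "folder" prefix and delete it before the
        // next ds.write().save(...) so the new output replaces the old.
        ObjectListing listing = s3.listObjects(bucket, prefix);
        while (true) {
            for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                s3.deleteObject(bucket, summary.getKey());
            }
            if (!listing.isTruncated()) {
                break;
            }
            listing = s3.listNextBatchOfObjects(listing);
        }
    }
}
```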

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Stephen Coy
Hi there, I’m kind of new around here, but I have had experience with all of the so-called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL Server, as well as PostgreSQL. They all support the notion of “ANSI padding” for CHAR columns - which means that such columns are always