Spark 3.5.x on Java 21?

2024-05-08 Thread Stephen Coy
Hi everyone,

We’re about to upgrade our Spark clusters from Java 11 and Spark 3.2.1 to Spark 
3.5.1.

I know that 3.5.1 is supposed to be fine on Java 17, but will it run OK on Java 
21?

Thanks,

Steve C




Re: [spark-graphframes]: Generating incorrect edges

2024-04-30 Thread Stephen Coy
Hi Mich,

I was just reading random questions on the user list when I noticed that you 
said:

On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh wrote:

1) You are using monotonically_increasing_id(), which is not 
collision-resistant in distributed environments like Spark. Multiple hosts
   can generate the same ID. I suggest switching to UUIDs (e.g., uuid.uuid4()) 
for guaranteed uniqueness.


It’s my understanding that the *Spark* `monotonically_increasing_id()` function 
exists for the exact purpose of generating a collision-resistant unique id 
across nodes on different hosts.
We use it extensively for this purpose and have never encountered an issue.
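For reference, this is roughly how we use it (a minimal sketch; the dataset and column names are made up):

import static org.apache.spark.sql.functions.monotonically_increasing_id;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// As documented, the generated 64-bit value packs the partition ID into the
// upper bits and a per-partition record counter into the lower bits, so IDs
// produced on different executors cannot collide (unique, but not consecutive).
Dataset<Row> withIds = vertices.withColumn("vertex_id", monotonically_increasing_id());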

Are we wrong or are you thinking of a different (not Spark) function?

Cheers,

Steve C






Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick,

When this has happened to me in the past (admittedly via spark-submit) it has 
been because another job was still running and had already claimed some of the 
resources (cores and memory).

I think this can also happen if your configuration tries to claim resources 
that will never be available.

Cheers,

SteveC


On 11 Aug 2023, at 3:36 am, Patrick Tucci  wrote:

Hello,

I'm attempting to run a query on Spark 3.4.0 through the Spark ThriftServer. 
The cluster has 64 cores, 250GB RAM, and operates in standalone mode using HDFS 
for storage.

The query is as follows:

SELECT ME.*, MB.BenefitID
FROM MemberEnrollment ME
JOIN MemberBenefits MB
ON ME.ID = MB.EnrollmentID
WHERE MB.BenefitID = 5
LIMIT 10

The tables are defined as follows:

-- Contains about 3M rows
CREATE TABLE MemberEnrollment
(
ID INT
, MemberID VARCHAR(50)
, StartDate DATE
, EndDate DATE
-- Other columns, but these are the most important
) STORED AS ORC;

-- Contains about 25m rows
CREATE TABLE MemberBenefits
(
EnrollmentID INT
, BenefitID INT
) STORED AS ORC;

When I execute the query, it runs a single broadcast exchange stage, which 
completes after a few seconds. Then everything just hangs. The JDBC/ODBC tab in 
the UI shows the query state as COMPILED, but no stages or tasks are executing 
or pending:



I've let the query run for as long as 30 minutes with no additional stages, 
progress, or errors. I'm not sure where to start troubleshooting.

Thanks for your help,

Patrick



Re: [Building] Building with JDK11

2022-07-18 Thread Stephen Coy
Hi Sergey,

I’m willing to be corrected, but I think the use of a JAVA_HOME environment 
variable was something that was started by and continues to be perpetuated by 
Apache Tomcat.

… or maybe Apache Ant, but modern versions of Ant do not need it either.

It is not needed for modern releases of Apache Maven.

Cheers,

Steve C

On 18 Jul 2022, at 4:12 pm, Sergey B. <sergey.bushma...@gmail.com> wrote:

Hi Steve,

Can you shed some light on why they need $JAVA_HOME at all if everything is already in place?

Regards,
- Sergey

On Mon, Jul 18, 2022 at 4:31 AM Stephen Coy <s...@infomedia.com.au.invalid> wrote:
Hi Szymon,

There seems to be a common misconception that setting JAVA_HOME will set the 
version of Java that is used.

This is not true, because in most environments you need to have a PATH 
environment variable set up that points at the version of Java that you want to 
use.

You can set JAVA_HOME to anything at all and `java -version` will always return 
the same result.

The way that you configure PATH varies from OS to OS:


  *   On macOS, use `/usr/libexec/java_home -v11`
  *   On Linux, use `sudo alternatives --config java`
  *   On Windows, I have no idea

When you do this, the `mvn` command will compute the value of JAVA_HOME for its own use; there is no need to explicitly set it yourself (unless other Java applications that you use insist upon it).


Cheers,

Steve C

On 16 Jul 2022, at 7:24 am, Szymon Kuryło <szymonkur...@gmail.com> wrote:

Hello,

I'm trying to build a Java 11 Spark distro using the dev/make-distribution.sh 
script.
I have set JAVA_HOME to point to the JDK 11 location, and I've also set the java.version property in pom.xml to 11, effectively also setting `maven.compiler.source` and `maven.compiler.target`.
When inspecting classes from the `dist` directory with `javap -v`, I find that 
the class major version is 52, which is specific to JDK8. Am I missing 
something? Is there a reliable way to set the JDK used in the build process?

Thanks,
Szymon K.




Re: [Building] Building with JDK11

2022-07-17 Thread Stephen Coy
Hi Szymon,

There seems to be a common misconception that setting JAVA_HOME will set the 
version of Java that is used.

This is not true, because in most environments you need to have a PATH 
environment variable set up that points at the version of Java that you want to 
use.

You can set JAVA_HOME to anything at all and `java -version` will always return 
the same result.

The way that you configure PATH varies from OS to OS:


  *   On macOS, use `/usr/libexec/java_home -v11`
  *   On Linux, use `sudo alternatives --config java`
  *   On Windows, I have no idea

When you do this, the `mvn` command will compute the value of JAVA_HOME for its own use; there is no need to explicitly set it yourself (unless other Java applications that you use insist upon it).


Cheers,

Steve C

On 16 Jul 2022, at 7:24 am, Szymon Kuryło <szymonkur...@gmail.com> wrote:

Hello,

I'm trying to build a Java 11 Spark distro using the dev/make-distribution.sh 
script.
I have set JAVA_HOME to point to the JDK 11 location, and I've also set the java.version property in pom.xml to 11, effectively also setting `maven.compiler.source` and `maven.compiler.target`.
When inspecting classes from the `dist` directory with `javap -v`, I find that 
the class major version is 52, which is specific to JDK8. Am I missing 
something? Is there a reliable way to set the JDK used in the build process?

Thanks,
Szymon K.



Re: Retrieve the count of spark nodes

2022-06-08 Thread Stephen Coy
Hi there,

We use something like:


/*
 * Force Spark to initialise the defaultParallelism by executing a dummy
 * parallel operation and then return the resulting defaultParallelism.
 */
private int getWorkerCount(JavaSparkContext sparkContext) {
    sparkContext.parallelize(List.of(1, 2, 3, 4)).collect();
    return sparkContext.defaultParallelism();
}


It's useful for setting certain pool sizes dynamically, such as:


sparkContext.hadoopConfiguration().set("fs.s3a.connection.maximum", 
Integer.toString(workerCount * 2));

This works in our Spark 3.0.1 code; we are just migrating to 3.2.1 now.

Cheers,

Steve C

On 8 Jun 2022, at 4:28 pm, Poorna Murali <poornamur...@gmail.com> wrote:

Hi,

I would like to know if it is possible to  get the count of live master and 
worker spark nodes running in a system.

Please help to clarify the same.

Thanks,
Poorna



Re: Using Avro file format with SparkSQL

2022-02-14 Thread Stephen Coy
Hi Morven,

We use --packages for all of our Spark jobs. Spark downloads the specified jar and all of its dependencies from a Maven repository.

This means we never have to build fat or uber jars.

It does mean that the Apache Ivy configuration has to be set up correctly 
though.

Cheers,

Steve C

> On 15 Feb 2022, at 5:58 pm, Morven Huang  wrote:
>
> I wrote a toy spark job and ran it within my IDE, same error if I don’t add 
> spark-avro to my pom.xml. After putting spark-avro dependency to my pom.xml, 
> everything works fine.
>
> Another thing is, if my memory serves me right, the spark-submit options for 
> extra jars is ‘--jars’ , not ‘--packages’.
>
> Regards,
>
> Morven Huang
>
>
> On 2022/02/10 03:25:28 "Karanika, Anna" wrote:
>> Hello,
>>
>> I have been trying to use spark SQL’s operations that are related to the 
>> Avro file format,
>> e.g., stored as, save, load, in a Java class but they keep failing with the 
>> following stack trace:
>>
>> Exception in thread "main" org.apache.spark.sql.AnalysisException:  Failed 
>> to find data source: avro. Avro is built-in but external data source module 
>> since Spark 2.4. Please deploy the application as per the deployment section 
>> of "Apache Avro Data Source Guide".
>>at 
>> org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1032)
>>at 
>> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
>>at 
>> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
>>at 
>> org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
>>at 
>> org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
>>at 
>> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
>>at xsys.fileformats.SparkSQLvsAvro.main(SparkSQLvsAvro.java:57)
>>at 
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
>> Method)
>>at 
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
>>at 
>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>at java.base/java.lang.reflect.Method.invoke(Method.java:564)
>>at 
>> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>>at 
>> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>>at 
>> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>>at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>>at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>>at 
>> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>>at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>>at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> For context, I am invoking spark-submit and adding arguments --packages 
>> org.apache.spark:spark-avro_2.12:3.2.0.
>> Yet, Spark responds as if the dependency was not added.
>> I am running spark-v3.2.0 (Scala 2.12).
>>
>> On the other hand, everything works great with spark-shell or spark-sql.
>>
>> I would appreciate any advice or feedback to get this running.
>>
>> Thank you,
>> Anna
>>
>>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



Re: Migration to Spark 3.2

2022-01-27 Thread Stephen Coy
api:jar:1.1.0:test
[INFO] |  |  +- org.opentest4j:opentest4j:jar:1.2.0:test
[INFO] |  |  \- org.junit.platform:junit-platform-commons:jar:1.5.2:test
[INFO] |  +- org.junit.jupiter:junit-jupiter-params:jar:5.5.2:test
[INFO] |  \- org.junit.jupiter:junit-jupiter-engine:jar:5.5.2:test
[INFO] | \- org.junit.platform:junit-platform-engine:jar:1.5.2:test
[INFO] +- com.fasterxml.jackson.core:jackson-databind:jar:2.12.6:compile
[INFO] +- com.fasterxml.jackson.core:jackson-core:jar:2.12.6:compile
[INFO] +- com.thoughtworks.paranamer:paranamer:jar:2.8:compile
[INFO] +- com.fasterxml.jackson.core:jackson-annotations:jar:2.12.6:compile
[INFO] +- org.mockito:mockito-junit-jupiter:jar:2.23.0:test
[INFO] |  \- org.mockito:mockito-core:jar:2.23.0:test
[INFO] | +- net.bytebuddy:byte-buddy:jar:1.9.0:test
[INFO] | +- net.bytebuddy:byte-buddy-agent:jar:1.9.0:test
[INFO] | \- org.objenesis:objenesis:jar:2.6:provided
[INFO] +- org.hamcrest:hamcrest-library:jar:1.3:test
[INFO] |  \- org.hamcrest:hamcrest-core:jar:1.3:test
[INFO] +- com.crealytics:spark-excel_2.12:jar:3.2.0_0.16.0:compile
[INFO] |  +- com.norbitltd:spoiwo_2.12:jar:2.0.0:compile
[INFO] |  +- com.github.pjfanning:excel-streaming-reader:jar:3.2.4:compile
[INFO] |  \- com.github.pjfanning:poi-shared-strings:jar:2.3.0:compile
[INFO] | \- com.h2database:h2:jar:2.0.202:runtime
[INFO] +- org.apache.commons:commons-compress:jar:1.21:compile
[INFO] +- commons-io:commons-io:jar:2.11.0:compile
[INFO] \- org.apache.poi:poi-ooxml:jar:5.2.0:compile
[INFO]    +- org.apache.poi:poi-ooxml-lite:jar:5.2.0:compile
[INFO]    +- org.apache.xmlbeans:xmlbeans:jar:5.0.3:compile
[INFO]    +- com.github.virtuald:curvesapi:jar:1.06:compile
[INFO]    \- org.apache.logging.log4j:log4j-api:jar:2.17.1:compile


On Thu, 27 Jan 2022 at 00:39, Stephen Coy <s...@infomedia.com.au> wrote:
Hi Aurélien!

Please run

mvn dependency:tree

and check it for Jackson dependencies.

Feel free to respond with the output if you have any questions about it.

Cheers,

Steve C

> On 22 Jan 2022, at 10:49 am, Aurélien Mazoyer <aurel...@aepsilon.com> wrote:
>
> Hello,
>
> I migrated my code to Spark 3.2 and I am facing some issues. When I run my 
> unit tests via Maven, I get this error:
> java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.spark.rdd.RDDOperationScope$
> which is not super nice.
>
> However, when I run my test via Intellij, I get the following one:
> java.lang.ExceptionInInitializerError
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
> at org.apache.spark.rdd.RDD.map(RDD.scala:421)
> ...
> Caused by: com.fasterxml.jackson.databind.JsonMappingException: Scala module 
> 2.12.3 requires Jackson Databind version >= 2.12.0 and < 2.13.0
> which is far better imo since it gives me some clue on what is missing in my 
> pom.xml file to make it work. After putting a few more dependencies, my tests 
> are again passing in intellij, but I am stuck on the same error when I am 
> running maven command :-/.
> It seems that jdk and maven versions are the same and both are using the same 
> .m2 directory.
> Any clue on what can be going wrong?
>
> Thank you,
>
> Aurelien




Re: Migration to Spark 3.2

2022-01-26 Thread Stephen Coy
Hi Aurélien!

Please run

mvn dependency:tree

and check it for Jackson dependencies.

Feel free to respond with the output if you have any questions about it.

Cheers,

Steve C

> On 22 Jan 2022, at 10:49 am, Aurélien Mazoyer  wrote:
>
> Hello,
>
> I migrated my code to Spark 3.2 and I am facing some issues. When I run my 
> unit tests via Maven, I get this error:
> java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.spark.rdd.RDDOperationScope$
> which is not super nice.
>
> However, when I run my test via Intellij, I get the following one:
> java.lang.ExceptionInInitializerError
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
> at org.apache.spark.rdd.RDD.map(RDD.scala:421)
> ...
> Caused by: com.fasterxml.jackson.databind.JsonMappingException: Scala module 
> 2.12.3 requires Jackson Databind version >= 2.12.0 and < 2.13.0
> which is far better imo since it gives me some clue on what is missing in my 
> pom.xml file to make it work. After putting a few more dependencies, my tests 
> are again passing in intellij, but I am stuck on the same error when I am 
> running maven command :-/.
> It seems that jdk and maven versions are the same and both are using the same 
> .m2 directory.
> Any clue on what can be going wrong?
>
> Thank you,
>
> Aurelien


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Log4J 2 Support

2021-11-09 Thread Stephen Coy
Hi Sean,

I have had a more detailed look at what Spark is doing with log4j APIs and at this point I suspect that a log4j 2.x migration might be more appropriate at the code level.

That still does not solve the libraries issue though. That would need more 
investigation.

I could be tempted to tackle it if there is enough interest.

Cheers,

Steve C

On 10 Nov 2021, at 9:42 am, Sean Owen <sro...@gmail.com> wrote:

Yep that's what I tried, roughly - there is an old jira about it. The issue is 
that Spark does need to configure some concrete logging framework in a few 
cases, as do other libs, and that isn't what the shims cover. Could be possible 
now or with more cleverness but the simple thing didn't work out IIRC.

On Tue, Nov 9, 2021, 4:32 PM Stephen Coy <s...@infomedia.com.au> wrote:
Hi there,

It’s true that the preponderance of log4j 1.2.x in many existing live projects 
is kind of a pain in the butt.

But there is a solution.

1. Migrate all Spark code to use slf4j APIs;

2. Exclude log4j 1.2.x from any dependencies sucking it in;

3. Include the log4j-over-slf4j bridge jar and slf4j-api jars;

4. Choose your favourite modern logging implementation and add it as a "runtime" dependency together with its slf4j binding jar (if needed).

In fact in the short term you can replace steps 1 and 2 with "remove the log4j 
1.2.17 jar from the distribution" and it should still work.

The slf4j project also includes a commons-logging shim for capturing its output 
too.

FWIW, the slf4j project is run by one of the original log4j developers.

Cheers,

Steve C


On 9 Nov 2021, at 11:11 pm, Sean Owen <sro...@gmail.com> wrote:

No plans that I know of. It's not that Spark uses it so much as its 
dependencies. I tried and failed to upgrade it a couple years ago. you are 
welcome to try, and open a PR if successful.

On Tue, Nov 9, 2021 at 6:09 AM Ajay Kumar <ajay.praja...@gmail.com> wrote:
Hi Team,
We wanted to send Spark executor logs to a centralized logging server using a TCP socket. I see that the Spark log4j version is very old (1.2.17) and it does not support JSON logs over TCP sockets on containers.
I wanted to know what the plan is for upgrading the log4j version to log4j 2.
Thanks in advance.
Regards,
Ajay




Re: Log4J 2 Support

2021-11-09 Thread Stephen Coy
Hi there,

It’s true that the preponderance of log4j 1.2.x in many existing live projects 
is kind of a pain in the butt.

But there is a solution.

1. Migrate all Spark code to use slf4j APIs;

2. Exclude log4j 1.2.x from any dependencies sucking it in;

3. Include the log4j-over-slf4j bridge jar and slf4j-api jars;

4. Choose your favourite modern logging implementation and add it as a "runtime" dependency together with its slf4j binding jar (if needed).

In fact in the short term you can replace steps 1 and 2 with "remove the log4j 
1.2.17 jar from the distribution" and it should still work.
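For anyone unfamiliar with step 1, the code-level change is just a switch to the slf4j API; a minimal sketch (the class name is illustrative only):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MyJob {

    // Only the slf4j API is referenced here; the concrete logging
    // implementation is whatever binding is present at runtime.
    private static final Logger LOG = LoggerFactory.getLogger(MyJob.class);

    public static void main(String[] args) {
        LOG.info("Starting job with {} argument(s)", args.length);
    }
}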

The slf4j project also includes a commons-logging shim for capturing its output 
too.

FWIW, the slf4j project is run by one of the original log4j developers.

Cheers,

Steve C


On 9 Nov 2021, at 11:11 pm, Sean Owen <sro...@gmail.com> wrote:

No plans that I know of. It's not that Spark uses it so much as its 
dependencies. I tried and failed to upgrade it a couple years ago. you are 
welcome to try, and open a PR if successful.

On Tue, Nov 9, 2021 at 6:09 AM Ajay Kumar <ajay.praja...@gmail.com> wrote:
Hi Team,
We wanted to send Spark executor logs to a centralized logging server using a TCP socket. I see that the Spark log4j version is very old (1.2.17) and it does not support JSON logs over TCP sockets on containers.
I wanted to know what the plan is for upgrading the log4j version to log4j 2.
Thanks in advance.
Regards,
Ajay



Re: Missing module spark-hadoop-cloud in Maven central

2021-06-01 Thread Stephen Coy
I have been building Apache Spark from source just so I can get this dependency.


  1.  git checkout v3.1.1
  2.  dev/make-distribution.sh --name hadoop-cloud-3.2 --tgz -Pyarn 
-Phadoop-3.2  -Pyarn -Phadoop-cloud -Phive-thriftserver  -Dhadoop.version=3.2.0

It is kind of a nuisance having to do this though.

Steve C


On 31 May 2021, at 10:34 pm, Sean Owen <sro...@gmail.com> wrote:

I know it's not enabled by default when the binary artifacts are built, but not 
exactly sure why it's not built separately at all. It's almost a 
dependencies-only pom artifact, but there are two source files. Steve do you 
have an angle on that?

On Mon, May 31, 2021 at 5:37 AM Erik Torres <etserr...@gmail.com> wrote:
Hi,

I'm following this documentation to configure my Spark-based application to interact with Amazon S3. However, I cannot find the spark-hadoop-cloud module in Maven Central for the non-commercial distribution of Apache Spark. From the documentation I would expect that I can get this module as a Maven dependency in my project. However, I ended up building the spark-hadoop-cloud module from Spark's code.

Is this the expected way to setup the integration with Amazon S3? I think I'm 
missing something here.

Thanks in advance!

Erik



Re: pyspark sql load with path of special character

2021-04-25 Thread Stephen Coy
It probably does not like the colons in the path name “…20:04:27+00:00/…”, 
especially if you’re running on a Windows box.

On 24 Apr 2021, at 1:29 am, Regin Quinoa <sweatr...@gmail.com> wrote:

Hi, I am using pyspark sql to load files into table following
```LOAD DATA LOCAL INPATH '/user/hive/warehouse/students' OVERWRITE INTO TABLE 
test_load;```
 
https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html

It complains pyspark.sql.utils.AnalysisException: load data input path does not 
exist
 when the path string has timestamp in the directory structure like
XX/XX/2021-03-02T20:04:27+00:00/file.parquet

It works with path without timestamp. How to work it around?




Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-15 Thread Stephen Coy
Hi there,

At risk of stating the obvious, the first step is to ensure that your Spark 
application and S3 bucket are colocated in the same AWS region.

Steve C

On 16 Mar 2021, at 3:31 am, Alchemist <alchemistsrivast...@gmail.com> wrote:

How to optimize S3 file listing when using wholeTextFile(): We are using wholeTextFile to read data from S3. As per my understanding, wholeTextFile first lists the files at the given path. Since we are using S3 as the input source, listing files in a bucket is single-threaded, and the S3 API for listing the keys in a bucket only returns keys in chunks of 1000 per call. Since we have millions of files, we are making thousands of API calls. This listing makes our processing very slow. How can we make the S3 listing faster?

Thanks,

Rachana



Re: Unsubscribe

2020-08-26 Thread Stephen Coy
The instructions for all Apache mail lists are in the mail headers:


List-Unsubscribe: 



On 27 Aug 2020, at 7:49 am, Jeff Evans <jeffrey.wayne.ev...@gmail.com> wrote:

That is not how you unsubscribe.  See here for instructions: 
https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e

On Wed, Aug 26, 2020, 4:22 PM Annabel Melongo <melongo_anna...@yahoo.com.invalid> wrote:
Please remove me from the mailing list



Re: S3 read/write from PySpark

2020-08-11 Thread Stephen Coy
Hi there,

Also for the benefit of others, if you attempt to use any version of Hadoop > 
3.2.0 (such as 3.2.1), you will need to update the version of Google Guava used 
by Apache Spark to that consumed by Hadoop.

Hadoop 3.2.1 requires guava-27.0-jre.jar. The latest is guava-29.0-jre.jar 
which also works.

This guava update will resolve the java.lang.NoSuchMethodError issue.

Cheers,

Steve C

On 6 Aug 2020, at 6:06 pm, Daniel Stojanov <m...@danielstojanov.com> wrote:

Hi,

Thanks for your help. Problem solved, but I thought I should add something in 
case this problem is encountered by others.

Both responses are correct; BasicAWSCredentialsProvider is gone, but simply 
making the substitution leads to the traceback just below. 
java.lang.NoSuchMethodError: 'void 
com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, 
java.lang.Object, java.lang.Object)'
I found that changing all of the Hadoop packages from 3.3.0 to 3.2.0 *and* 
changing the options away from BasicAWSCredentialsProvider solved the problem.

Thanks.
Regards,



Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/daniel/packages/spark-3.0.0-bin-hadoop3.2/python/pyspark/sql/readwriter.py",
 line 535, in csv
return 
self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File 
"/home/daniel/packages/spark-3.0.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 line 1305, in __call__
  File 
"/home/daniel/packages/spark-3.0.0-bin-hadoop3.2/python/pyspark/sql/utils.py", 
line 131, in deco
return f(*a, **kw)
  File 
"/home/daniel/packages/spark-3.0.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
 line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o39.csv.
: java.lang.NoSuchMethodError: 'void 
com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, 
java.lang.Object, java.lang.Object)'
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:893)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:869)
at 
org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1580)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:341)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at 
org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)








On Thu, 6 Aug 2020 at 17:19, Stephen Coy <s...@infomedia.com.au> wrote:
Hi Daniel,

It looks like …BasicAWSCredentialsProvider has become 
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.

However, the way that the username and password are provided appears to have changed, so you will probably need to look into that.

Cheers,

Steve C

On 6 Aug 2020, at 11:15 am, Daniel Stojanov <m...@danielstojanov.com> wrote:

Hi,

I am trying to read/write files to S3 from PySpark. The procedure that I have 
used is to download Spark, start PySpark with the hadoop-aws, guava, 
aws-java-sdk-bundle packages. The versions are explicitly specified by looking 
up the exact dependency versi

Re: S3 read/write from PySpark

2020-08-06 Thread Stephen Coy
Hi Daniel,

It looks like …BasicAWSCredentialsProvider has become 
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.

However, the way that the username and password are provided appears to have changed, so you will probably need to look into that.
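As a rough sketch (assuming the usual Hadoop 3.x property names; the variable names are placeholders), the credentials can be supplied through the Hadoop configuration:

import org.apache.hadoop.conf.Configuration;

Configuration conf = sparkSession.sparkContext().hadoopConfiguration();
conf.set("fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider");
conf.set("fs.s3a.access.key", awsAccessKey);   // placeholder
conf.set("fs.s3a.secret.key", awsSecretKey);   // placeholder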

Cheers,

Steve C

On 6 Aug 2020, at 11:15 am, Daniel Stojanov <m...@danielstojanov.com> wrote:

Hi,

I am trying to read/write files to S3 from PySpark. The procedure that I have 
used is to download Spark, start PySpark with the hadoop-aws, guava, 
aws-java-sdk-bundle packages. The versions are explicitly specified by looking 
up the exact dependency version on Maven. Allowing dependencies to be auto 
determined does not work. This procedure works for Spark 3.0.0 with Hadoop 2.7, 
but does not work for Hadoop 3.2. I get this exception:
java.lang.ClassNotFoundException: Class 
org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider


Can somebody point to a procedure for doing this using Spark bundled with 
Hadoop 3.2?

Regards,





To replicate:

Download Spark 3.0.0 with support for Hadoop 3.2.

Launch spark with:

./pyspark --packages 
org.apache.hadoop:hadoop-common:3.3.0,org.apache.hadoop:hadoop-client:3.3.0,org.apache.hadoop:hadoop-aws:3.3.0,com.amazonaws:aws-java-sdk-bundle:1.11.563,com.google.guava:guava:27.1-jre

Then run the following Python.

# Set these 4 parameters as appropriate.
aws_bucket = ""
aws_filename = ""
aws_access_key = ""
aws_secret_key = ""

spark._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", 
"true")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", 
"s3.amazonaws.com")
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_key)
df = spark.read.option("header", 
"true").csv(f"s3a://{aws_bucket}/{aws_filename}")



Leads to this error message:



Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/daniel/.local/share/Trash/files/spark-3.3.0.0-bin-hadoop3.2/python/pyspark/sql/readwriter.py",
 line 535, in csv
return 
self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File 
"/home/daniel/.local/share/Trash/files/spark-3.3.0.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 line 1305, in __call__
  File 
"/home/daniel/.local/share/Trash/files/spark-3.3.0.0-bin-hadoop3.2/python/pyspark/sql/utils.py",
 line 131, in deco
return f(*a, **kw)
  File 
"/home/daniel/.local/share/Trash/files/spark-3.3.0.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
 line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o39.csv.
: java.io.IOException: From option fs.s3a.aws.credentials.provider 
java.lang.ClassNotFoundException: Class 
org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider not found
at 
org.apache.hadoop.fs.s3a.S3AUtils.loadAWSProviderClasses(S3AUtils.java:645)
at 
org.apache.hadoop.fs.s3a.S3AUtils.buildAWSProviderList(S3AUtils.java:668)
at 
org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:619)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:636)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:390)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at 
org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 

Re: Tab delimited csv import and empty columns

2020-08-05 Thread Stephen Coy
Hi Sean, German and others,

Setting the “nullValue” option (for parsing CSV at least) seems to be an 
exercise in futility.

When parsing the file, 
com.univocity.parsers.common.input.AbstractCharInputReader#getString contains 
the following logic:


String out;
if (len <= 0) {
   out = nullValue;
} else {
   out = new String(buffer, pos, len);
}

resulting in the nullValue being assigned to the column value if it has zero 
length, such as with an empty String.

Later, org.apache.spark.sql.catalyst.csv.UnivocityParser#nullSafeDatum is 
called on the column value:


if (datum == options.nullValue || datum == null) {
  if (!nullable) {
throw new RuntimeException(s"null value found but field $name is not 
nullable.")
  }
  null
} else {
  converter.apply(datum)
}

Therefore, the empty String is first converted to the nullValue, and then 
matched against the nullValue and, bingo, we get the literal null.

For now, the “.na.fill(“”)” addition to the code is doing the right thing for 
me.
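In other words, the read ends up looking something like this (a sketch of our own code, not a general fix):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> catalogData = sparkSession
    .read()
    .option("sep", "\t")
    .option("header", "true")
    .csv(args[0])
    .na().fill("")   // map the nulls produced for empty fields back to empty strings
    .cache();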

Thanks for all the help.


Steve C


On 1 Aug 2020, at 1:40 am, Sean Owen <sro...@gmail.com> wrote:

Try setting nullValue to anything besides the empty string. Because its default 
is the empty string, empty strings become null by default.

On Fri, Jul 31, 2020 at 3:20 AM Stephen Coy <s...@infomedia.com.au.invalid> wrote:
That does not work.

This is Spark 3.0 by the way.

I have been looking at the Spark unit tests and there does not seem to be any 
that load a CSV text file and verify that an empty string maps to an empty 
string which I think is supposed to be the default behaviour because the 
“nullValue” option defaults to “".

Thanks anyway

Steve C

On 30 Jul 2020, at 10:01 pm, German Schiavon Matteo <gschiavonsp...@gmail.com> wrote:

Hey,

I understand that your empty values in your CSV are "" , if so, try this option:

.option("emptyValue", "\"\"")

Hope it helps

On Thu, 30 Jul 2020 at 08:49, Stephen Coy <s...@infomedia.com.au.invalid> wrote:
Hi there,

I’m trying to import a tab delimited file with:

Dataset<Row> catalogData = sparkSession
  .read()
  .option("sep", "\t")
  .option("header", "true")
  .csv(args[0])
  .cache();

This works great, except for the fact that any column that is empty is given 
the value null, when I need these values to be literal empty strings.

Is there any option combination that will achieve this?

Thanks,

Steve C






Re: Tab delimited csv import and empty columns

2020-07-31 Thread Stephen Coy
That does not work.

This is Spark 3.0 by the way.

I have been looking at the Spark unit tests and there does not seem to be any 
that load a CSV text file and verify that an empty string maps to an empty 
string which I think is supposed to be the default behaviour because the 
“nullValue” option defaults to “".

Thanks anyway

Steve C

On 30 Jul 2020, at 10:01 pm, German Schiavon Matteo <gschiavonsp...@gmail.com> wrote:

Hey,

I understand that your empty values in your CSV are "" , if so, try this option:

.option("emptyValue", "\"\"")

Hope it helps

On Thu, 30 Jul 2020 at 08:49, Stephen Coy <s...@infomedia.com.au.invalid> wrote:
Hi there,

I’m trying to import a tab delimited file with:

Dataset<Row> catalogData = sparkSession
  .read()
  .option("sep", "\t")
  .option("header", "true")
  .csv(args[0])
  .cache();

This works great, except for the fact that any column that is empty is given 
the value null, when I need these values to be literal empty strings.

Is there any option combination that will achieve this?

Thanks,

Steve C





Tab delimited csv import and empty columns

2020-07-30 Thread Stephen Coy
Hi there,

I’m trying to import a tab delimited file with:

Dataset<Row> catalogData = sparkSession
  .read()
  .option("sep", "\t")
  .option("header", "true")
  .csv(args[0])
  .cache();

This works great, except for the fact that any column that is empty is given 
the value null, when I need these values to be literal empty strings.

Is there any option combination that will achieve this?

Thanks,

Steve C




When does SparkContext.defaultParallelism have the correct value?

2020-07-06 Thread Stephen Coy
Hi there,

I have found that if I invoke

sparkContext.defaultParallelism()

too early, it will not return the correct value.

For example, if I write this:

final JavaSparkContext sparkContext = new 
JavaSparkContext(sparkSession.sparkContext());
final int workerCount = sparkContext.defaultParallelism();

I will get some small number (which I can’t recall right now).

However, if I insert:

sparkContext.parallelize(List.of(1, 2, 3, 4)).collect()

between these two lines, I get the expected value, which is something like node_count * node_core_count.

This seems like a hacky workaround to me. Is there a better way to get this value initialised properly?

FWIW, I need this value to size a connection pool (fs.s3a.connection.maximum) 
correctly in a cluster independent way.
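Putting it together, the workaround currently looks like this (a sketch):

import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;

final JavaSparkContext sparkContext = new JavaSparkContext(sparkSession.sparkContext());

// Dummy action that forces Spark to initialise defaultParallelism properly.
sparkContext.parallelize(List.of(1, 2, 3, 4)).collect();

final int workerCount = sparkContext.defaultParallelism();
sparkContext.hadoopConfiguration().set("fs.s3a.connection.maximum",
        Integer.toString(workerCount * 2));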

Thanks,

Steve C




Re: java.lang.ClassNotFoundException for s3a comitter

2020-07-06 Thread Stephen Coy
Hi Steve,

While I understand your point regarding the mixing of Hadoop jars, this does 
not address the java.lang.ClassNotFoundException.

Prebuilt Apache Spark 3.0 builds are only available for Hadoop 2.7 or Hadoop 
3.2. Not Hadoop 3.1.

The only place that I have found that missing class is in the Spark "hadoop-cloud" source module, and currently the only way to get the jar containing it is to build it yourself. If any of the devs are listening, it would be nice if this were included in the standard distribution. It has a sizeable chunk of a repackaged Jetty embedded in it, which I find a bit odd.

But I am relatively new to this stuff so I could be wrong.

I am currently running Spark 3.0 clusters with no HDFS. Spark is set up like:

hadoopConfiguration.set("spark.hadoop.fs.s3a.committer.name", "directory");
hadoopConfiguration.set("spark.sql.sources.commitProtocolClass", 
"org.apache.spark.internal.io.cloud.PathOutputCommitProtocol");
hadoopConfiguration.set("spark.sql.parquet.output.committer.class", 
"org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter");
hadoopConfiguration.set("fs.s3a.connection.maximum", Integer.toString(coreCount 
* 2));

Querying and updating s3a data sources seems to be working ok.

Thanks,

Steve C

On 29 Jun 2020, at 10:34 pm, Steve Loughran <ste...@cloudera.com.INVALID> wrote:

you are going to need hadoop-3.1 on your classpath, with hadoop-aws and the 
same aws-sdk it was built with (1.11.something). Mixing hadoop JARs is doomed. 
using a different aws sdk jar is a bit risky, though more recent upgrades have 
all be fairly low stress

On Fri, 19 Jun 2020 at 05:39, murat migdisoglu <murat.migdiso...@gmail.com> wrote:
Hi all
I've upgraded my test cluster to Spark 3 and changed my committer to directory, and I still get this error. The documentation is somewhat obscure on that.
Do I need to add a third-party jar to support the new committers?

java.lang.ClassNotFoundException: 
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol


On Thu, Jun 18, 2020 at 1:35 AM murat migdisoglu <murat.migdiso...@gmail.com> wrote:
Hello all,
we have a Hadoop cluster (using YARN) that uses S3 as its filesystem, with S3Guard enabled.
We are using Hadoop 3.2.1 with Spark 2.4.5.

When I try to save a dataframe in parquet format, I get the following exception:
java.lang.ClassNotFoundException: 
com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol

My relevant spark configurations are as following:
"hadoop.mapreduce.outputcommitter.factory.scheme.s3a":"org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
"fs.s3a.committer.name":
 "magic",
"fs.s3a.committer.magic.enabled": true,
"fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",

While spark streaming fails with the exception above, apache beam succeeds 
writing parquet files.
What might be the problem?

Thanks in advance


--
"Talkers aren’t good doers. Rest assured that we’re going there to use our 
hands, not our tongues."
W. Shakespeare


--
"Talkers aren’t good doers. Rest assured that we’re going there to use our 
hands, not our tongues."
W. Shakespeare




Re: java.lang.ClassNotFoundException for s3a comitter

2020-06-18 Thread Stephen Coy
Hi Murat Migdisoglu,

Unfortunately you need the secret sauce to resolve this.

It is necessary to check out the Apache Spark source code and build it with the 
right command line options. This is what I have been using:

dev/make-distribution.sh --name my-spark --tgz -Pyarn -Phadoop-3.2  -Pyarn 
-Phadoop-cloud -Dhadoop.version=3.2.1

This will add additional jars into the build.

Copy hadoop-aws-3.2.1.jar, hadoop-openstack-3.2.1.jar and 
spark-hadoop-cloud_2.12-3.0.0.jar into the “jars” directory of your Spark 
distribution. If you are paranoid you could copy/replace all the 
hadoop-*-3.2.1.jar files but I have not found that necessary.

You will also need to upgrade the version of guava that appears in the spark 
distro because Hadoop 3.2.1 bumped this from guava-14.0.1.jar to 
guava-27.0-jre.jar. Otherwise you will get runtime ClassNotFound exceptions.

I have been using this combo for many months now with the Spark 3.0 
pre-releases and it has been working great.

Cheers,

Steve C


On 19 Jun 2020, at 10:24 am, murat migdisoglu <murat.migdiso...@gmail.com> wrote:

Hi all
I've upgraded my test cluster to Spark 3 and changed my committer to directory, and I still get this error. The documentation is somewhat obscure on that.
Do I need to add a third-party jar to support the new committers?

java.lang.ClassNotFoundException: 
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol


On Thu, Jun 18, 2020 at 1:35 AM murat migdisoglu <murat.migdiso...@gmail.com> wrote:
Hello all,
we have a Hadoop cluster (using YARN) that uses S3 as its filesystem, with S3Guard enabled.
We are using Hadoop 3.2.1 with Spark 2.4.5.

When I try to save a dataframe in parquet format, I get the following exception:
java.lang.ClassNotFoundException: 
com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol

My relevant spark configurations are as following:
"hadoop.mapreduce.outputcommitter.factory.scheme.s3a":"org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
"fs.s3a.committer.name":
 "magic",
"fs.s3a.committer.magic.enabled": true,
"fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",

While spark streaming fails with the exception above, apache beam succeeds 
writing parquet files.
What might be the problem?

Thanks in advance


--
"Talkers aren’t good doers. Rest assured that we’re going there to use our 
hands, not our tongues."
W. Shakespeare


--
"Talkers aren’t good doers. Rest assured that we’re going there to use our 
hands, not our tongues."
W. Shakespeare



Re: [PySpark] How to write HFiles as an 'append' to the same directory?

2020-03-16 Thread Stephen Coy
I encountered a similar problem when trying to:

ds.write().save(“s3a://some-bucket/some/path/table”);

which writes the content as a bunch of parquet files in the “folder” named 
“table”.

I am using a Flintrock cluster with the Spark 3.0 preview FWIW.

Anyway, I just used the AWS SDK to remove it (and any “subdirectories”) before 
kicking off the spark machinery.

I can show you how to do this in Java, but I think the Python SDK may be 
significantly different.
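
For the PySpark side, a rough sketch of the same idea using boto3 (an illustration only, not the Java code referred to above) could look like this, reusing the bucket and prefix from the job below:

import boto3

def delete_s3_prefix(bucket_name, prefix):
    """Delete every object under `prefix` so the output 'directory' no longer exists."""
    s3 = boto3.resource("s3")
    # objects.filter() pages through the listing; delete() issues batched DeleteObjects calls
    s3.Bucket(bucket_name).objects.filter(Prefix=prefix).delete()

# e.g. clear the HFile output location before calling saveAsNewAPIHadoopFile
delete_s3_prefix("gauthama-etl-pipelines-test-working-bucket",
                 "8c6b2f09-2ba0-42fc-af8e-656188c0186a/hfiles/")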

Steve C


On 15 Mar 2020, at 6:23 am, Gautham Acharya 
mailto:gauth...@alleninstitute.org>> wrote:

I have a process in Apache Spark that attempts to write HFiles to S3 in a 
batched process. I want the resulting HFiles in the same directory, as they are 
in the same column family. However, I’m getting a ‘directory already exists’ 
error when I try to run this on AWS EMR. How can I write HFiles via Spark as 
an ‘append’, like I can with a CSV?

The batch writing function looks like this:

for col_group in split_cols:
    processed_chunk = batch_write_pandas_udf_for_col_aggregation(
        joined_dataframe, col_group, pandas_udf_func, group_by_args)

    hfile_writer.write_hfiles(processed_chunk, output_path,
                              zookeeper_ip, table_name,
                              constants.DEFAULT_COL_FAMILY)

The actual function to write the Hfiles is this:

rdd.saveAsNewAPIHadoopFile(output_path,
   constants.OUTPUT_FORMAT_CLASS,
   keyClass=constants.KEY_CLASS,
   valueClass=constants.VALUE_CLASS,
   keyConverter=constants.KEY_CONVERTER,
   valueConverter=constants.VALUE_CONVERTER,
   conf=conf)


The exception I’m getting:


Called with arguments: Namespace(job_args=['matrix_path=/tmp/matrix.csv', 
'metadata_path=/tmp/metadata.csv', 
'output_path=s3://gauthama-etl-pipelines-test-working-bucket/8c6b2f09-2ba0-42fc-af8e-656188c0186a/hfiles',
 'group_by_args=cluster_id', 'zookeeper_ip=ip-172-30-5-36.ec2.internal', 
'table_name=test_wide_matrix__8c6b2f09-2ba0-42fc-af8e-656188c0186a'], 
job_name='matrix_transformations')

job_args_tuples: [['matrix_path', '/tmp/matrix.csv'], ['metadata_path', 
'/tmp/metadata.csv'], ['output_path', 
's3://gauthama-etl-pipelines-test-working-bucket/8c6b2f09-2ba0-42fc-af8e-656188c0186a/hfiles'],
 ['group_by_args', 'cluster_id'], ['zookeeper_ip', 
'ip-172-30-5-36.ec2.internal'], ['table_name', 
'test_wide_matrix__8c6b2f09-2ba0-42fc-af8e-656188c0186a']]

Traceback (most recent call last):

  File "/mnt/var/lib/hadoop/steps/s-2ZIOR335HH9TR/main.py", line 56, in 

job_module.transform(spark, **job_args)

  File 
"/mnt/tmp/spark-745d355c-0a90-4992-a07c-bc1dde4fac3e/userFiles-33d65b1c-bdae-4b44-997e-4d4c7521ad96/dep.zip/src/spark_transforms/pyspark_jobs/jobs/matrix_transformations/job.py",
 line 93, in transform

  File 
"/mnt/tmp/spark-745d355c-0a90-4992-a07c-bc1dde4fac3e/userFiles-33d65b1c-bdae-4b44-997e-4d4c7521ad96/dep.zip/src/spark_transforms/pyspark_jobs/jobs/matrix_transformations/job.py",
 line 73, in write_split_columnwise_transform

  File 
"/mnt/tmp/spark-745d355c-0a90-4992-a07c-bc1dde4fac3e/userFiles-33d65b1c-bdae-4b44-997e-4d4c7521ad96/dep.zip/src/spark_transforms/pyspark_jobs/output_handler/hfile_writer.py",
 line 44, in write_hfiles

  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1438, in 
saveAsNewAPIHadoopFile

  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
line 1257, in __call__

  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, 
in deco

  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
328, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.

: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory 
s3://gauthama-etl-pipelines-test-working-bucket/8c6b2f09-2ba0-42fc-af8e-656188c0186a/hfiles/median
 already exists

at 
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)

at 
org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.assertConf(SparkHadoopWriter.scala:393)

at 
org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)

at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083)

at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)

at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)

at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)

at 

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Stephen Coy
Hi there,

I’m kind of new around here, but I have had experience with all of the so-called 
“big iron” databases such as Oracle, IBM DB2 and Microsoft SQL Server, as well 
as PostgreSQL.

They all support the notion of “ANSI padding” for CHAR columns - which means 
that such columns are always space padded, and they default to having this 
enabled (for ANSI compliance).

MySQL also supports it, but it defaults to leaving it disabled for historical 
reasons not unlike what we have here.

In my opinion we should push toward standards compliance where possible and 
then document where it cannot work.

If users don’t like the padding on CHAR columns then they should change to 
VARCHAR - I believe that was its purpose in the first place, and it does not 
dictate any sort of “padding”.

I can see why you might “ban” the use of CHAR columns where they cannot be 
consistently supported, but VARCHAR is a different animal and I would expect it 
to work consistently everywhere.
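
For anyone who wants to see what their own build does, a minimal probe (assuming a Hive-enabled session; the padding of the CHAR column is exactly what varies between versions and configurations) might be:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Compare CHAR vs VARCHAR padding on this particular Spark build
spark.sql("CREATE TABLE IF NOT EXISTS char_probe (c CHAR(3), v VARCHAR(3)) STORED AS ORC")
spark.sql("INSERT INTO char_probe SELECT 'a ', 'a '")
spark.sql("SELECT c, length(c), v, length(v) FROM char_probe").show()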


Cheers,

Steve C

On 17 Mar 2020, at 10:01 am, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

Hi, Reynold.
(And +Michael Armbrust)

If you think so, do you think it's okay that we change the return value 
silently? Then I'm wondering why we reverted the `TRIM` functions.

> Are we sure "not padding" is "incorrect"?

Bests,
Dongjoon.


On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
Hi,

100% agree with Reynold.


Regards,
Gourav Sengupta

On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <r...@databricks.com> wrote:

Are we sure "not padding" is "incorrect"?

I don't know whether ANSI SQL actually requires padding, but plenty of 
databases don't actually pad.

https://docs.snowflake.net/manuals/sql-reference/data-types-text.html
 : "Snowflake currently deviates from common CHAR semantics in that strings 
shorter than the maximum length are not space-padded at the end."

MySQL: 
https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql

On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Hi, Reynold.

Please see the following for the context.

https://issues.apache.org/jira/browse/SPARK-31136
"Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax"

I raised the above issue according to the new rubric, and the banning was the 
proposed alternative to reduce the potential issue.

Please give us your opinion since it's still a PR.

Bests,
Dongjoon.

On Sat, Mar 14, 2020 at 17:54 Reynold Xin <r...@databricks.com> wrote:
I don’t understand this change. Wouldn’t this “ban” confuse the hell out of 
both new and old users?

For old users, their old code that was working for char(3) would now stop 
working.

For new users, it depends on whether the underlying metastore char(3) is either 
supported but different from ANSI SQL (which is not that big of a deal if we 
explain it) or not supported.

On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Hi, All.

Apache Spark has suffered from a known consistency issue in `CHAR` type 
behavior among its usages and configurations. However, the evolution has been 
gradually moving toward consistency inside Apache Spark because we don't have 
`CHAR` officially. The following is the summary.

With 1.6.x ~ 2.3.x, `STORED AS PARQUET` gives the following different result.
(`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to Hive 
behavior.)

spark-sql> CREATE TABLE t1(a CHAR(3));
spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;

spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
spark-sql> INSERT INTO TABLE t3 SELECT 'a ';

spark-sql> SELECT a, length(a) FROM