Re: PyCharm IDE throws spark error

2020-11-13 Thread Wim Van Leuven
No Java installed? Or the process can't find it? JAVA_HOME not set? On Fri, 13 Nov 2020 at 23:24, Mich Talebzadeh wrote: > Hi, > > This is basically a simple module > > from pyspark import SparkContext > from pyspark.sql import SQLContext > from pyspark.sql import HiveContext > from pyspark.sql
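
A minimal sketch of the usual fix, assuming the IDE's run configuration is not inheriting JAVA_HOME from the shell and a JDK lives at the example path below (adjust to your install):

    import os
    from pyspark.sql import SparkSession

    # Point PySpark at a JDK explicitly when the IDE does not pass
    # JAVA_HOME through; the path here is an example assumption.
    os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk-amd64")

    spark = SparkSession.builder.appName("pycharm-check").getOrCreate()
    print(spark.version)  # if this prints, the JVM was found
    spark.stop()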

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Wim Van Leuven
I think Sean is right, but in your argument you mention that 'functionality is sacrificed in favour of the availability of resources'. That's where I disagree with you but agree with Sean: that is mostly not true. In your previous posts you also mentioned this. The only reason we sometimes

Re: Why spark-submit works with package not with jar

2020-10-21 Thread Wim Van Leuven
19:19, Sean Owen wrote: > >> Yes, it's reasonable to build an uber-jar in development, using Maven/Ivy >> to resolve dependencies (and of course excluding 'provided' dependencies >> like Spark), and push that to production. That gives you a static artifact >> to run that does not

Re: Why spark-submit works with package not with jar

2020-10-21 Thread Wim Van Leuven
ing on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Wed, 21 Oct 2020 at 06:34, Wim Van Leuven < > wim.vanleu...@highestpoint.biz>

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Wim Van Leuven
Sean, The problem with --packages is that in enterprise settings security might not allow the data environment to link to the internet, or even to the internal proxying artefact repository. Also, weren't uber-jars an antipattern? For some reason I don't like them... Kind regards -wim On Wed, 21 Oct
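
A sketch of the middle ground, assuming an internal Maven mirror is reachable from the cluster: spark.jars.packages and spark.jars.repositories are the config equivalents of --packages and --repositories (the coordinates and URL below are placeholders):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("packages-demo")
        # Same effect as spark-submit --packages, but resolved against an
        # internal repository instead of Maven Central in locked-down setups.
        .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.0.1")
        .config("spark.jars.repositories", "https://artifacts.example.com/maven")
        .getOrCreate()
    )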

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Wim Van Leuven
Hey Mich, This is a very fair question. I've seen many data engineering teams start out with Scala because technically it is the best choice for many reasons and, basically, it is what Spark itself is written in. On the other hand, almost all use cases we see these days are data science use cases where

Re: PySpark .collect() output to Scala Array[Row]

2020-05-25 Thread Wim Van Leuven
Looking at the stack trace, your data from Spark gets serialized to an ArrayList (of something), whereas in your Scala code you are using an Array of Rows. So the types don't line up. That's the exception you are seeing: the JVM searches for a method signature that simply does not exist. Try to turn the
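
A common workaround, sketched under the assumption of a user-defined JVM helper: rather than collecting in Python and shipping rows through Py4J (where they arrive as a java.util.ArrayList), hand over the JVM-side Dataset and collect there, where the types line up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("py4j-handoff").getOrCreate()
    df = spark.range(3)  # any DataFrame

    # df._jdf is the Py4J handle to the underlying JVM Dataset[Row]; a Scala
    # helper declared to take a Dataset[Row] can call .collect() itself and
    # get a real Array[Row]. 'com.example.Helper.process' is hypothetical.
    # spark._jvm.com.example.Helper.process(df._jdf)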

Re: find failed test

2020-03-06 Thread Wim Van Leuven
Srsly? On Sat, 7 Mar 2020 at 03:28, Koert Kuipers wrote: > i just ran: > mvn test -fae > log.txt > > at the end of log.txt i find it says there are failures: > [INFO] Spark Project SQL .. FAILURE [47:55 min] > > that is not very helpful. what tests failed? > >
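
One quick way to answer that without scrolling the whole Maven log, assuming the standard Surefire report layout: scan the per-module report files for failures:

    import pathlib

    # Surefire writes one plain-text report per test class; failing classes
    # are marked with "FAILURE!" or "ERROR!" in the summaries.
    for report in pathlib.Path(".").rglob("target/surefire-reports/*.txt"):
        text = report.read_text(errors="ignore")
        if "FAILURE!" in text or "ERROR!" in text:
            print(report)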

Re:

2020-03-02 Thread Wim Van Leuven
Ok, good luck! On Mon, 2 Mar 2020 at 10:04, Hamish Whittal wrote: > Enrico, Wim (and privately Neil), thanks for the replies. I will give your > suggestions a whirl. > > Basically Wim recommended a pre-processing step to weed out the > problematic files. I am going to build that into the

Re:

2020-03-01 Thread Wim Van Leuven
Hey Hamish, I don't think there is an 'automatic fix' for this problem... Are you reading those as partitions of a single dataset? Or are you processing them individually? As your incoming data is apparently not stable, you should implement a preprocessing step on each file to check and, if
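
A minimal sketch of such a preprocessing step, with placeholder paths: probe each incoming file on its own and quarantine the unreadable ones instead of letting one bad file fail the whole load:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("preflight").getOrCreate()

    incoming = ["/landing/a.parquet", "/landing/b.parquet"]  # placeholder paths
    good, bad = [], []
    for path in incoming:
        try:
            # Cheap readability/schema probe before the real load.
            spark.read.parquet(path).limit(1).collect()
            good.append(path)
        except Exception as exc:
            bad.append((path, str(exc)))

    print("quarantined:", bad)
    df = spark.read.parquet(*good) if good else None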

Performance of PySpark 2.3.2 on Microsoft Windows

2019-11-18 Thread Wim Van Leuven
Hello, we are writing a lot of data processing pipelines for Spark using PySpark, adding a lot of integration tests. In our enterprise environment, a lot of people run Windows PCs, and we notice that build times are really slow on Windows because of the integration tests. These metrics
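
One commonly used mitigation (not confirmed in this thread): share a single SparkSession across the whole test run so each test does not pay the JVM startup cost again — a pytest sketch:

    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        # One JVM for the whole run; small shuffle size for small test data.
        session = (
            SparkSession.builder
            .master("local[2]")
            .appName("integration-tests")
            .config("spark.sql.shuffle.partitions", "4")
            .getOrCreate()
        )
        yield session
        session.stop()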