Re: PyCharm IDE throws spark error

2020-11-13 Thread Wim Van Leuven
No Java installed? Or the process can't find it? JAVA_HOME not set?
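
A minimal pre-flight check along those lines (a sketch, not part of the original script; the JDK path in the message is only an example):

import os
import shutil

# PySpark launches a JVM via the java executable, so java must either be on
# PATH or reachable through JAVA_HOME; if neither holds, the gateway launch
# fails with the opaque "WinError 2" seen below.
if shutil.which("java") is None and not os.environ.get("JAVA_HOME"):
    raise EnvironmentError(
        "Java not found: install a JDK and set JAVA_HOME "
        "(e.g. C:\\Program Files\\Java\\jdk1.8.0_261) or add java to PATH"
    )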

On Fri, 13 Nov 2020 at 23:24, Mich Talebzadeh 
wrote:

> Hi,
>
> This is basically a simple module
>
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from pyspark.sql import HiveContext
> from pyspark.sql import SparkSession
> from pyspark.sql import Row
> from pyspark.sql.types import StringType, ArrayType
> from pyspark.sql.functions import udf, col
> import random
> import string
> import math
> spark = SparkSession.builder.appName("sparkp").enableHiveSupport().getOrCreate()
>
> and comes back with the following error
>
>
> Traceback (most recent call last):
>
>   File "C:/Users/whg220/PycharmProjects/sparkp/venv/Scripts/sparkp.py",
> line 43, in 
>
> spark =
> SparkSession.builder.appName("sparkp").enableHiveSupport().getOrCreate()
>
>   File
> "C:\Users\whg220\spark\spark-2.3.4-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\sql\session.py",
> line 173, in getOrCreate
>
>   File
> "C:\Users\whg220\spark\spark-2.3.4-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\context.py",
> line 363, in getOrCreate
>
>   File
> "C:\Users\whg220\spark\spark-2.3.4-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\context.py",
> line 129, in __init__
>
>   File
> "C:\Users\whg220\spark\spark-2.3.4-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\context.py",
> line 312, in _ensure_initialized
>
>   File
> "C:\Users\whg220\spark\spark-2.3.4-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\java_gateway.py",
> line 46, in launch_gateway
>
>   File
> "C:\Users\whg220\spark\spark-2.3.4-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\java_gateway.py",
> line 101, in _launch_gateway
>
>   File "C:\Program Files\Anaconda3\lib\subprocess.py", line 707, in
> __init__
>
> restore_signals, start_new_session)
>
>   File "C:\Program Files\Anaconda3\lib\subprocess.py", line 990, in
> _execute_child
>
> startupinfo)
>
> FileNotFoundError: [WinError 2] The system cannot find the file specified
>
>
> very frustrating. Any help will be appreciated.
>
>
> Thanks
>
>
>


Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Wim Van Leuven
I think Sean is right, but in your argument you mention that 'functionality
is sacrificed in favour of the availability of resources'. That's where I
disagree with you but agree with Sean: that is mostly not true.

In your previous posts you also mentioned this. The only reason we sometimes
have to bail out to Scala is for performance with certain UDFs.
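
(When it really is only UDF speed, vectorised pandas UDFs are often the middle
ground before dropping down to Scala. A sketch added for reference, not from
the original mail; it assumes pyarrow is installed:)

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType
import pandas as pd

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.5,), (4.0,)], ["amount"])

# Row-at-a-time Python UDF: one Python call per row, heavy serialisation cost.
plus_vat_slow = F.udf(lambda x: x * 1.21, DoubleType())

# Vectorised (pandas) UDF: operates on whole Arrow batches, far less overhead.
@F.pandas_udf(DoubleType())
def plus_vat_fast(amount: pd.Series) -> pd.Series:
    return amount * 1.21

df.select(plus_vat_slow("amount"), plus_vat_fast("amount")).show()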

On Thu, 22 Oct 2020 at 23:11, Mich Talebzadeh 
wrote:

> Thanks for the feedback Sean.
>
> Kind regards,
>
> Mich
>
>
>
> On Thu, 22 Oct 2020 at 20:34, Sean Owen  wrote:
>
>> I don't find this trolling; I agree with the observation that 'the skills
>> you have' are a valid and important determiner of what tools you pick.
>> I disagree that you just have to pick the optimal tool for everything.
>> Sounds good until that comes in contact with the real world.
>> For Spark, Python vs Scala just doesn't matter a lot, especially if
>> you're doing DataFrame operations. By design. So I can't see there being
>> one answer to this.
>>
>> On Thu, Oct 22, 2020 at 2:23 PM Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi Mich,
>>>
>>> this is turning into a troll now, can you please stop this?
>>>
>>> No one uses Scala where Python should be used, and no one uses Python
>>> where Scala should be used - it all depends on requirements. Everyone
>>> understands polyglot programming and how to use relevant technologies best
>>> to their advantage.
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>>
>


Re: Why spark-submit works with package not with jar

2020-10-21 Thread Wim Van Leuven
We actually zip the full conda environments during our build and ship those.
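
A sketch of that packaging step (it assumes the conda-pack utility; names and
paths are illustrative):

# Build step: freeze the conda environment into a relocatable archive.
import conda_pack

conda_pack.pack(name="etl_env", output="etl_env.tar.gz")

# The archive can then be shipped with the job, e.g. on YARN:
#   spark-submit --archives etl_env.tar.gz#environment ... job.py
# with PYSPARK_PYTHON pointing at ./environment/bin/python on the executors,
# so nothing at runtime needs internet or repository access.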

On Wed, 21 Oct 2020 at 20:25, Mich Talebzadeh 
wrote:

> How about PySpark? What process can that go through to not depend on
> external repo access in production?
>
>
> On Wed, 21 Oct 2020 at 19:19, Sean Owen  wrote:
>
>> Yes, it's reasonable to build an uber-jar in development, using Maven/Ivy
>> to resolve dependencies (and of course excluding 'provided' dependencies
>> like Spark), and push that to production. That gives you a static artifact
>> to run that does not depend on external repo access in production.
>>
>> On Wed, Oct 21, 2020 at 1:15 PM Wim Van Leuven <
>> wim.vanleu...@highestpoint.biz> wrote:
>>
>>> I like an artefact repo as the proper solution. Problem with
>>> environments that haven't yet fully embraced devops: artefact repos are
>>> considered development tools and are often not yet used to promote packages
>>> to production, air gapped if necessary.
>>> -wim
>>>
>>


Re: Why spark-submit works with package not with jar

2020-10-21 Thread Wim Van Leuven
I like an artefact repo as the proper solution. Problem with environments
that haven't yet fully embraced devops: artefact repos are considered
development tools and are often not yet used to promote packages to
production, air gapped if necessary.
-wim

On Wed, 21 Oct 2020 at 19:00, Mich Talebzadeh 
wrote:

>
> Hi Wim,
>
> This is an issue DEV/OPS face all the time. Cannot access the internet
> behind the company firewall. There is Nexus
> <https://www.sonatype.com/nexus/repository-pro> for this that manages
> dependencies with usual load times in seconds. However, only authorised
> accounts can request it through a service account. I concur it is messy.
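
For reference, --packages can be pointed at such an internal mirror instead of
the public repositories; a sketch (the Nexus URL is a placeholder):

from pyspark.sql import SparkSession

# spark.jars.packages / spark.jars.repositories are the config equivalents of
# the --packages / --repositories flags.
spark = (
    SparkSession.builder
    .appName("bigquery-job")
    .config("spark.jars.packages", "com.github.samelamin:spark-bigquery_2.11:0.2.6")
    .config("spark.jars.repositories", "https://nexus.example.internal/repository/maven-public")
    .getOrCreate()
)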
>
> cheers,
>
>
>
> On Wed, 21 Oct 2020 at 06:34, Wim Van Leuven <
> wim.vanleu...@highestpoint.biz> wrote:
>
>> Sean,
>>
> >> The problem with --packages is that in enterprise settings security might
> >> not allow the data environment to link to the internet or even to the
> >> internal proxying artefact repository.
>>
> >> Also, weren't uberjars an antipattern? For some reason I don't like them...
>>
>> Kind regards
>> -wim
>>
>>
>>
>> On Wed, 21 Oct 2020 at 01:06, Mich Talebzadeh 
>> wrote:
>>
>>> Thanks again all.
>>>
> >>> Anyway, as Nicola suggested, I used the trench-warfare approach to sort this
> >>> out by just using jars and working out their dependencies in the ~/.ivy2/jars
> >>> directory using grep -lRi  :)
>>>
>>>
> >>> This now works with just using jars (newly added ones in grey) after
> >>> resolving the dependencies
>>>
>>>
> >>> ${SPARK_HOME}/bin/spark-submit \
> >>> --master yarn \
> >>> --deploy-mode client \
> >>> --conf spark.executor.memoryOverhead=3000 \
> >>> --class org.apache.spark.repl.Main \
> >>> --name "my own Spark shell on Yarn" "$@" \
> >>> --driver-class-path /home/hduser/jars/ddhybrid.jar \
> >>> --jars /home/hduser/jars/spark-bigquery-latest.jar, \
> >>>        /home/hduser/jars/ddhybrid.jar, \
> >>>        /home/hduser/jars/com.google.http-client_google-http-client-1.24.1.jar, \
> >>>        /home/hduser/jars/com.google.http-client_google-http-client-jackson2-1.24.1.jar, \
> >>>        /home/hduser/jars/com.google.cloud.bigdataoss_util-1.9.4.jar, \
> >>>        /home/hduser/jars/com.google.api-client_google-api-client-1.24.1.jar, \
> >>>        /home/hduser/jars/com.google.oauth-client_google-oauth-client-1.24.1.jar, \
> >>>        /home/hduser/jars/com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar, \
> >>>        /home/hduser/jars/com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar, \
> >>>        /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar
>>>
>>>
>>> Compared to using the package itself as before
>>>
>>>
> >>> ${SPARK_HOME}/bin/spark-submit \
> >>> --master yarn \
> >>> --deploy-mode client \
> >>> --conf spark.executor.memoryOverhead=3000 \
> >>> --class org.apache.spark.repl.Main \
> >>> --name "my own Spark shell on Yarn" "$@" \
> >>> --driver-class-path /home/hduser/jars/ddhybrid.jar \
> >>> --jars /home/hduser/jars/spark-bigquery-latest.jar, \
> >>>        /home/hduser/jars/ddhybrid.jar \
> >>> --packages com.github.samelamin:spark-bigquery_2.11:0.2.6
>>>
>>>
>>>
>>> I think as Sean suggested this approach may or may not work (a manual
>>> process) and if jars change, the whole thing has to be re-evaluated adding
>>> to the complexity.
>>>
>>>
>>> Cheers
>>>
>>>
>>> On Tue, 20 Oct 2020 at 23:01, Sean Owen  wrote:
>>>
>>>> Rather, let --packages (via Ivy) worry about them, because they tell
>>>> Ivy what they need.
>>>> There's no 100% guarantee that conflicting dependencies are resolved in
>>>> a way that works in every single case, which you run into sometimes when
>>>> using incompatible libraries, but yes this is the point of --packages and
>>>> Ivy.
>>>>
>>>> On Tue, Oct 20, 2020 at 4:43 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Thanks again all.
>>>>>
>>>>> Hi Sean,
>>>>>
>>>>> As I understood from your statement, you are suggesting just use
>>>>> --packages without worrying about individual jar dependencies?
>>>>>
>>>>>>
>>>>>>>>


Re: Why spark-submit works with package not with jar

2020-10-20 Thread Wim Van Leuven
Sean,

The problem with --packages is that in enterprise settings security might
not allow the data environment to link to the internet or even to the internal
proxying artefact repository.

Also, weren't uberjars an antipattern? For some reason I don't like them...

Kind regards
-wim



On Wed, 21 Oct 2020 at 01:06, Mich Talebzadeh 
wrote:

> Thanks again all.
>
> Anyway, as Nicola suggested, I used the trench-warfare approach to sort this out
> by just using jars and working out their dependencies in the ~/.ivy2/jars
> directory using grep -lRi  :)
>
>
> This now works with just using jars (newly added ones in grey) after
> resolving the dependencies
>
>
> ${SPARK_HOME}/bin/spark-submit \
> --master yarn \
> --deploy-mode client \
> --conf spark.executor.memoryOverhead=3000 \
> --class org.apache.spark.repl.Main \
> --name "my own Spark shell on Yarn" "$@" \
> --driver-class-path /home/hduser/jars/ddhybrid.jar \
> --jars /home/hduser/jars/spark-bigquery-latest.jar, \
>        /home/hduser/jars/ddhybrid.jar, \
>        /home/hduser/jars/com.google.http-client_google-http-client-1.24.1.jar, \
>        /home/hduser/jars/com.google.http-client_google-http-client-jackson2-1.24.1.jar, \
>        /home/hduser/jars/com.google.cloud.bigdataoss_util-1.9.4.jar, \
>        /home/hduser/jars/com.google.api-client_google-api-client-1.24.1.jar, \
>        /home/hduser/jars/com.google.oauth-client_google-oauth-client-1.24.1.jar, \
>        /home/hduser/jars/com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar, \
>        /home/hduser/jars/com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar, \
>        /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar
>
>
> Compared to using the package itself as before
>
>
> ${SPARK_HOME}/bin/spark-submit \
> --master yarn \
> --deploy-mode client \
> --conf spark.executor.memoryOverhead=3000 \
> --class org.apache.spark.repl.Main \
> --name "my own Spark shell on Yarn" "$@" \
> --driver-class-path /home/hduser/jars/ddhybrid.jar \
> --jars /home/hduser/jars/spark-bigquery-latest.jar, \
>        /home/hduser/jars/ddhybrid.jar \
> --packages com.github.samelamin:spark-bigquery_2.11:0.2.6
>
>
>
> I think as Sean suggested this approach may or may not work (a manual
> process) and if jars change, the whole thing has to be re-evaluated adding
> to the complexity.
>
>
> Cheers
>
>
> On Tue, 20 Oct 2020 at 23:01, Sean Owen  wrote:
>
>> Rather, let --packages (via Ivy) worry about them, because they tell Ivy
>> what they need.
>> There's no 100% guarantee that conflicting dependencies are resolved in a
>> way that works in every single case, which you run into sometimes when
>> using incompatible libraries, but yes this is the point of --packages and
>> Ivy.
>>
>> On Tue, Oct 20, 2020 at 4:43 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks again all.
>>>
>>> Hi Sean,
>>>
>>> As I understood from your statement, you are suggesting just use
>>> --packages without worrying about individual jar dependencies?
>>>

>>


Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Wim Van Leuven
Hey Mich,

This is a very fair question. I've seen many data engineering teams start out
with Scala because technically it is the best choice for many reasons and,
basically, it is what Spark is.

On the other hand, almost all use cases we see these days are data science
use cases where people mostly do Python. So, if you need those two worlds to
collaborate and even hand over code, you don't want the ideological battle of
Scala vs Python. We chose Python for the sake of everybody speaking the same
language.

But that only holds if you do Spark DataFrames, because then PySpark is a thin
layer around everything on the JVM. Even the discussion of Python UDFs doesn't
really hold up: if it works as a Python function (and most of the time it
does), why do Scala? If, however, performance characteristics show you
otherwise, implement those UDFs on the JVM.
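
A sketch of what that can look like from the Python side (the class name is
hypothetical; it stands for your own Scala/Java class implementing Spark's
UDF1 interface, shipped via --jars):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Register a JVM-side UDF under a SQL name; com.example.udf.NormaliseText is a
# placeholder for a Scala implementation of
# org.apache.spark.sql.api.java.UDF1[String, String].
spark.udf.registerJavaFunction("normalise_text", "com.example.udf.NormaliseText", StringType())

df = spark.createDataFrame([("  Some Text ",)], ["raw"])
# The UDF now executes on the JVM; only the query plan crosses over from Python.
df.select(F.expr("normalise_text(raw)").alias("clean")).show()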

The problem with Python? Good engineering practices translated into tools are
much rarer ... a build tool like Maven for Java or SBT for Scala doesn't
exist ... yet? You can look at PyBuilder for this.

So, referring to the website you mention ... in practice, because of the many
data science use cases out there, I see many Spark shops prefer Python over
Scala, because Spark gravitates to DataFrames, where the downsides of Python
do not stack up. Performance of Python as a driver program, which is just the
glue code, becomes irrelevant compared to the processing you are doing on the
JVM. We even notice that Python is much easier, and we hear echoes that
finding (good?) Scala engineers is hard(er).

So, conclusion: Python brings data engineers and data science together. If
you only do data engineering, Scala can be the better choice. It depends on
the context.

Hope this helps
-wim

On Fri, 9 Oct 2020 at 23:27, Mich Talebzadeh 
wrote:

> Thanks
>
> So, ignoring Python lambdas, is it individuals' familiarity with the language
> that is the most important factor? Also, I have noticed that the Spark
> documentation's preferences have switched from Scala to Python as the first
> example. However, some code, for example JDBC calls, is the same for Scala
> and Python.
>
> Some examples like this website
> 
> claim that Scala performance is an order of magnitude better than Python
> and also when it comes to concurrency Scala is a better choice. Maybe it is
> pretty old (2018)?
>
> Also (and maybe this is my ignorance, I have not researched it), does Spark
> offer a REPL in the form of spark-shell with Python?
>
>
> Regards,
>
> Mich
>
>
>
>
> On Fri, 9 Oct 2020 at 21:59, Russell Spitzer 
> wrote:
>
>> As long as you don't use python lambdas in your Spark job there should be
>> almost no difference between the Scala and Python dataframe code. Once you
>> introduce python lambdas you will hit some significant serialization
>> penalties as well as have to run actual work code in python. As long as no
>> lambdas are used, everything will operate with Catalyst compiled java code
>> so there won't be a big difference between python and scala.
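
A minimal illustration of the difference Russell describes (a sketch added for
reference, not part of the original mail):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python lambda as a UDF: every value is shipped to a Python worker and back.
upper_udf = F.udf(lambda s: s.upper(), StringType())
with_lambda = df.withColumn("name_upper", upper_udf("name"))

# Built-in column expression: compiled by Catalyst, runs entirely on the JVM.
with_builtin = df.withColumn("name_upper", F.upper("name"))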
>>
>> On Fri, Oct 9, 2020 at 3:57 PM Mich Talebzadeh 
>> wrote:
>>
>>> I have come across occasions when the teams use Python with Spark for
>>> ETL, for example processing data from S3 buckets into Snowflake with Spark.
>>>
>>> The only reason I think they are choosing Python as opposed to Scala is
>>> that they are more familiar with Python. The fact that Spark itself is
>>> written in Scala is an indication of why I think Scala has an edge.
>>>
>>> I have not done a one-to-one comparison of Spark with Scala vs Spark with
>>> Python. I understand that for data science purposes most libraries like
>>> TensorFlow etc. are written in Python, but I am at a loss to understand the
>>> validity of using Python with Spark for ETL purposes.
>>>
>>> This is my understanding, not fact, so I would like to get some informed
>>> views on it if I can.
>>>
>>> Many thanks,
>>>
>>> Mich
>>>
>>>
>>>
>>>

Re: PySpark .collect() output to Scala Array[Row]

2020-05-25 Thread Wim Van Leuven
Looking at the stack trace, your data from Spark gets serialized to an
ArrayList (of something), whereas in your Scala code you are using an Array
of Rows. So the types don't line up. That's the exception you are seeing:
the JVM searches for a signature that simply does not exist.

Try to turn the Array into a java.util.ArrayList?
-w

On Tue, 26 May 2020 at 03:04, Nick Ruest  wrote:

> Hi,
>
> I've hit a wall with trying to implement a couple of Scala methods in a
> Python version of our project. I've implemented a number of these
> already, but I'm getting hung up with this one.
>
> My Python function looks like this:
>
> def Write_Graphml(data, graphml_path, sc):
>     return sc.getOrCreate()._jvm.io.archivesunleashed.app.WriteGraphML(data,
>         graphml_path).apply
>
>
> Where data is a DataFrame that has been collected; data.collect().
>
> On the Scala side is it basically:
>
> object WriteGraphML {
>   def apply(data: Array[Row], graphmlPath: String): Boolean = {
>     // ... massages an Array[Row] into GraphML ...
>     true
>   }
> }
>
> When I try to use it in PySpark, I end up getting this error message:
>
> Py4JError: An error occurred while calling
> None.io.archivesunleashed.app.WriteGraphML. Trace:
> py4j.Py4JException: Constructor
> io.archivesunleashed.app.WriteGraphML([class java.util.ArrayList, class
> java.lang.String]) does not exist
> at
> py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
> at
> py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
> at py4j.Gateway.invoke(Gateway.java:237)
> at
>
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
>
>
> I originally dug into what the error message stated, and tried a variety
> of tweaks such as:
>
> sc.getOrCreate()._jvm.io.archivesunleashed.app.WriteGraphML.apply(data,
> graphml_path)
>
> And I went as far as trying get_attr, and calling the "WriteGraphML$" and a
> few other varieties with that method.
>
> All the results produced the same variety of error message above; that
> the Constructor or method does not exist.
>
> I came across this [1] based on lots of Googling and Stack Overflow
> searches, and it has me thinking that the problem is because of how Py4J
> is passing off the Python list (data) to the JVM, and then passing it to
> Scala. It's ending up as an ArrayList instead of an Array[Row].
>
> Do I need to tweak data before it is passed to Write_Graphml? Or am I
> doing something else wrong here.
>
> I had originally posted a version of this message to the dev list, and
> Sean Owen suggested WriteGraphML should be a implemented as a class, not
> an object. Is that the right path? I have a number of other Scala
> functions implemented in the PySpark side of our project that are
> objects, and everything works fine.
>
> ...and is there a best practices guide or documentation for implementing
> Scala functions in PySpark? I've found a number of blog posts that have
> been helpful.
>
> Thanks in advance for any help!
>
> cheers!
>
> -nruest
>
> [1]
> https://stackoverflow.com/questions/61928886/pyspark-list-to-scala-sequence
>
>


Re: find failed test

2020-03-06 Thread Wim Van Leuven
Srsly?

On Sat, 7 Mar 2020 at 03:28, Koert Kuipers  wrote:

> i just ran:
> mvn test -fae > log.txt
>
> at the end of log.txt i find it says there are failures:
> [INFO] Spark Project SQL .. FAILURE [47:55
> min]
>
> that is not very helpful. what tests failed?
>
> i could go scroll up but the file has 21,517 lines. ok let's skip that.
>
> so i figure there are test reports in sql/core/target. i was right! it's
> sql/core/target/surefire-reports. but it has 276 files, so that's still a bit
> much to go through. i assume there is some nice summary that shows me the
> failed tests... maybe SparkTestSuite.txt? its 2687 lines, so again a bit
> much, but i do go through it and find nothing useful.
>
> so... how do i quickly find out which test failed exactly?
> there must be some maven trick here?
>
> thanks!
>


Re:

2020-03-02 Thread Wim Van Leuven
Ok, good luck!

On Mon, 2 Mar 2020 at 10:04, Hamish Whittal 
wrote:

> Enrico, Wim (and privately Neil), thanks for the replies. I will give your
> suggestions a whirl.
>
> Basically Wim recommended a pre-processing step to weed out the
> problematic files. I am going to build that into the pipeline. I am not
> sure how the problems are creeping in because this is a regular lift from a
> PGSQL db/table. And so some of these files are correct and some are
> patently wrong.
>
> I'm working around the problem by trying small subsets of the 3000+ files,
> but until I can weed out the problem files the processing is going to fail.
> I need something more bulletproof than what I'm doing. So this is what I'm
> going to try now.
>
> Hamish
>
> On Mon, Mar 2, 2020 at 10:15 AM Enrico Minack 
> wrote:
>
>> Looks like the schema of some files is unexpected.
>>
>> You could either run parquet-tools on each of the files and extract the
>> schema to find the problematic files:
>>
>> hdfs dfs -stat "%n" hdfs://ip-172-24-89-229.blaah.com:8020/user/hadoop/origdata/part-*.parquet |
>> while read file
>> do
>>   echo -n "$file: "
>>   hadoop jar parquet-tools-1.9.0.jar schema $file
>> done
>>
>>
>> https://confusedcoders.com/data-engineering/hadoop/how-to-view-content-of-parquet-files-on-s3hdfs-from-hadoop-cluster-using-parquet-tools
>>
>>
>> Or you can use Spark to investigate the parquet files in parallel:
>>
>> spark.sparkContext
>>   .binaryFiles("hdfs://ip-172-24-89-229.blaah.com:8020/user/hadoop/origdata/part-*.parquet")
>>   .map { case (path, _) =>
>>     import collection.JavaConverters._
>>     val file = HadoopInputFile.fromPath(new Path(path), new Configuration())
>>     val reader = ParquetFileReader.open(file)
>>     try {
>>       val schema = reader.getFileMetaData().getSchema
>>       (
>>         schema.getName,
>>         schema.getFields.asScala.map(f => (
>>           Option(f.getId).map(_.intValue()),
>>           f.getName,
>>           Option(f.getOriginalType).map(_.name()),
>>           Option(f.getRepetition).map(_.name()))
>>         ).toArray
>>       )
>>     } finally {
>>       reader.close()
>>     }
>>   }
>>   .toDF("schema name", "fields")
>>   .show(false)
>>
>> .binaryFiles provides you all filenames that match the given pattern as
>> an RDD, so the following .map is executed on the Spark executors.
>> The map then opens each parquet file via ParquetFileReader and provides
>> access to its schema and data.
>>
>> I hope this points you in the right direction.
>>
>> Enrico
>>
>>
>> Am 01.03.20 um 22:56 schrieb Hamish Whittal:
>>
>> Hi there,
>>
>> I have an hdfs directory with thousands of files. It seems that some of
>> them - and I don't know which ones - have a problem with their schema and
>> it's causing my Spark application to fail with this error:
>>
>> Caused by: org.apache.spark.sql.execution.QueryExecutionException:
>> Parquet column cannot be converted in file hdfs://
>> ip-172-24-89-229.blaah.com:8020/user/hadoop/origdata/part-0-8b83989a-e387-4f64-8ac5-22b16770095e-c000.snappy.parquet.
>> Column: [price], Expected: double, Found: FIXED_LEN_BYTE_ARRAY
>>
>> The problem is not only that it's causing the application to fail, but
>> every time it does fail, I have to copy that file out of the directory and
>> start the app again.
>>
>> I thought of trying to use try-except, but I can't seem to get that to
>> work.
>>
>> Is there any advice anyone can give me because I really can't see myself
>> going through thousands of files trying to figure out which ones are broken.
>>
>> Thanks in advance,
>>
>> hamish
>>
>>
>>
>
> --
> Cloud-Fundis.co.za
> Cape Town, South Africa
> +27 79 614 4913
>


Re:

2020-03-01 Thread Wim Van Leuven
Hey Hamish,

I don't think there is an 'automatic fix' for this problem ...
Are you reading those as partitions of a single dataset? Or are you
processing them individually?

As your incoming data is apparently not stable, you should implement a
preprocessing step on each file to check and, if necessary, align the faulty
datasets with what you expect.

HTH
-wim
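
For reference, a minimal sketch of such a pre-processing check (it assumes
pyarrow is available and the files are reachable from where it runs; the path
and expected column are illustrative, taken from the 'price' error in the
thread):

import glob
import pyarrow.parquet as pq

expected = {"price": "double"}  # column name -> expected Arrow type

def has_bad_schema(path):
    # Read only the footer schema, not the data.
    schema = pq.read_schema(path)
    for col, typ in expected.items():
        idx = schema.get_field_index(col)
        if idx == -1 or str(schema.field(idx).type) != typ:
            return True
    return False

files = glob.glob("/data/origdata/part-*.parquet")  # hypothetical local path
broken = [f for f in files if has_bad_schema(f)]
print(broken)  # quarantine these before the Spark job reads the directory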


Performance of PySpark 2.3.2 on Microsoft Windows

2019-11-18 Thread Wim Van Leuven
Hello,

we are writing a lot of data processing pipelines for Spark using PySpark,
and we add a lot of integration tests.

In our enterprise environment, a lot of people are running Windows PCs, and we
notice that build times are really slow on Windows because of the integration
tests. These timings are compared against runs of the same builds on Mac (dev
PCs) or Linux (our CI servers are Linux).

We cannot easily identify what is causing the slowdown, but it's mostly
PySpark communicating with Spark on the JVM.

Any pointers/clues on where to look for more information?
Obviously, plain help in the matter is more than welcome as well.

Kind regards,
Wim