Re: [Pyspark 2.4] not able to partition the data frame by dates

2019-07-31 Thread Rishi Shah
Thanks for your prompt reply, Gourav. I am using Spark 2.4.0 (Cloudera distribution). The job consistently threw this error, so I narrowed down the dataset by adding a date filter (date range: 2018-01-01 to 2018-06-30). However, it's still throwing the same error! *command*: spark2-submit --master
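(A date filter of that kind might look like the sketch below; the column name "date" is an assumption, since the schema isn't shown in the thread.)

    from pyspark.sql.functions import col

    df = spark.read.parquet(INPUT_PATH)
    # "date" is an assumed column name; restrict to the first half of 2018.
    df = df.filter((col("date") >= "2018-01-01") & (col("date") <= "2018-06-30"))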

Re: [Pyspark 2.4] not able to partition the data frame by dates

2019-07-31 Thread Gourav Sengupta
Hi Rishi, there is no such version as 2.4 :), can you please specify the exact SPARK version you are using? How are you starting the SPARK session? And what is the environment? I know this issue occurs intermittently over large writes to S3 and has to do with S3's eventual-consistency issues. Just
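(One way to confirm the exact patch version from inside the session, as a minimal sketch; the app name here is arbitrary.)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("version-check").getOrCreate()
    # Prints the full version string, e.g. "2.4.0".
    print(spark.version)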

[Pyspark 2.4] not able to partition the data frame by dates

2019-07-31 Thread Rishi Shah
Hi All, I have a dataframe of size 2.7T (parquet) which I need to partition by date, however the Spark program below doesn't help - it keeps failing due to a *file already exists exception*... df = spark.read.parquet(INPUT_PATH)
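(The failing write step is cut off in the archive; below is a minimal sketch of a date-partitioned write, assuming a "date" column and an OUTPUT_PATH variable, neither of which appears in the thread. mode("overwrite") replaces existing output instead of raising the file-already-exists error on reruns.)

    df = spark.read.parquet(INPUT_PATH)
    # "date" column and OUTPUT_PATH are assumptions not shown in the thread.
    df.write.mode("overwrite").partitionBy("date").parquet(OUTPUT_PATH)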

Announcing .NET for Apache Spark 0.4.0

2019-07-31 Thread Terry Kim
We are thrilled to announce that .NET for Apache Spark 0.4.0 has just been released! Some of the highlights of this release include: - Apache Arrow-backed UDFs (Vector UDF, Grouped Map UDF) - Robust UDF-related assembly loading -

Re: Spark Image resizing

2019-07-31 Thread Patrick McCarthy
It won't be very efficient, but you could write a Python UDF using PythonMagick - https://wiki.python.org/moin/ImageMagick If you have PyArrow > 0.10, then you might be able to get a boost by saving images in a column as BinaryType and writing a Pandas UDF. On Wed, Jul 31, 2019 at 6:22 AM Nick
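(A minimal sketch of such a Pandas UDF in Spark 2.4-era syntax; Pillow stands in for PythonMagick here for illustration, and the column name "image_bytes" and the target size are assumptions.)

    import io
    from PIL import Image  # Pillow used in place of PythonMagick for illustration
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import BinaryType

    @pandas_udf(BinaryType(), PandasUDFType.SCALAR)
    def resize_images(data):
        # data is a pandas Series of raw image bytes (BinaryType column).
        def _resize(raw):
            img = Image.open(io.BytesIO(raw))
            img = img.resize((224, 224))  # assumed target size
            buf = io.BytesIO()
            img.save(buf, format="PNG")
            return buf.getvalue()
        return data.map(_resize)

    # df = df.withColumn("resized", resize_images(df["image_bytes"]))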

Re: Core allocation is scattered

2019-07-31 Thread Muthu Jayakumar
> I am running a spark job with 20 cores, but I did not understand why my application gets 1-2 cores on a couple of machines - why does it not just run on two nodes, like node1=16 cores and node2=4 cores? Instead, cores are allocated like node1=2, node2=1, ... node14=1. I believe that's the intended
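(Hedged note: if this is a standalone cluster - an assumption, since the cluster manager isn't stated in the thread - this scattering is the scheduler's default; the master-side setting spark.deploy.spreadOut=false consolidates executors onto fewer nodes instead.)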

Re: Kafka Integration libraries put in the fat jar

2019-07-31 Thread Spico Florin
Hi! Thanks to Jacek Laskowski, I found the answer here: https://stackoverflow.com/questions/51792203/how-to-get-spark-kafka-org-apache-sparkspark-sql-kafka-0-10-2-112-1-0-dependen Just add the Maven shade plugin:
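(The plugin configuration itself is cut off in the archive; below is a minimal pom.xml sketch based on the linked answer, with versions omitted. The ServicesResourceTransformer is the key piece, as it merges the META-INF/services entries that register the Kafka data source in the fat jar.)

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
          <configuration>
            <transformers>
              <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>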

Re: Spark Image resizing

2019-07-31 Thread Nick Dawes
Any other way of resizing the image before creating the DataFrame in Spark? I know OpenCV does it, but I don't have OpenCV on my cluster. I have Anaconda Python packages installed on my cluster. Any ideas will be appreciated. Thank you! On Tue, Jul 30, 2019, 4:17 PM Nick Dawes wrote: > Hi > >