Re: Learning Spark

2019-07-04 Thread Vikas Garg
I am currently working as a data engineer, working with Power BI and SSIS (an ETL tool). For learning purposes, I have set up PySpark and am also able to run queries through Spark against a multi-node cluster DB (I am using Vertica DB and will later move to HDFS or SQL Server). I have good knowledge

Re: Learning Spark

2019-07-04 Thread ayan guha
My best advice is to go through the docs and watch lots of demos/videos from Spark committers.

Re: Learning Spark

2019-07-04 Thread Kurt Fehlhauer
Are you a data scientist or a data engineer?

Learning Spark

2019-07-04 Thread Vikas Garg
Hi, I am a new Spark learner. Can someone guide me on a strategy for gaining expertise in PySpark? Thanks!!!

Avro support broken?

2019-07-04 Thread Paul Wais
Dear List, Has anybody gotten Avro support to work in PySpark? I see multiple reports of it being broken on Stack Overflow and have added my own repro to this ticket:
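
For context, since Spark 2.4 the Avro data source ships as a separate external module, so the most common failure is simply a missing package rather than a broken reader (this may or may not cover the breakage reported above). A minimal sketch, with a hypothetical path and app name; in PySpark the same format("avro") call applies once the package is on the classpath:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AvroCheck {
  public static void main(String[] args) {
    // spark-avro is not bundled with Spark 2.4.x; submit with e.g.
    //   spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.3 ...
    // (the same --packages flag works for pyspark).
    SparkSession spark = SparkSession.builder().appName("avro-check").getOrCreate();

    // Read an Avro file into a DataFrame; the path is a placeholder.
    Dataset<Row> df = spark.read().format("avro").load("s3://my-bucket/data.avro");
    df.printSchema();
    df.show(5);

    spark.stop();
  }
}
```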

Spark 2.4.3 with Hadoop 3.2 Docker image

2019-07-04 Thread José Luis Pedrosa
Hi all, I'm trying to create Docker images that can access Azure services using the abfs Hadoop driver, which is only available in Hadoop 3.2. So I downloaded Spark without Hadoop and generated Spark images using docker-image-tool.sh itself. In a new image using the resulting image as FROM,

Re: Parquet 'bucketBy' creates a ton of files

2019-07-04 Thread Silvio Fiorito
You need to repartition first (at a minimum by bucketColumn1), since each task writes out its own bucket files. If the bucket keys are distributed randomly across the RDD partitions, then you will get multiple files per bucket.
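
A minimal sketch of that suggestion, assuming `dataframe` is the poster's Dataset<Row> from the original message below, with string literals standing in for the poster's column variables; the repartition call is the only addition:

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Group rows by the bucket keys before writing, so each task holds all
// the rows for the buckets it writes; otherwise every task can emit a
// separate file for every bucket it happens to see.
Dataset<Row> grouped = dataframe.repartition(col("bucketColumn1"), col("bucketColumn2"));

grouped.write()
    .format("parquet")
    .bucketBy(500, "bucketColumn1", "bucketColumn2")
    .mode(SaveMode.Overwrite)
    .option("path", "s3://my-bucket")
    .saveAsTable("my_table");
```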

Re: Parquet 'bucketBy' creates a ton of files

2019-07-04 Thread Phillip Henry
Hi, Arwin. If I understand you correctly, this is totally expected behaviour. I don't know much about saving to S3, but maybe you could write to HDFS first and then copy everything to S3? I think the write to HDFS will probably be much faster, as Spark/HDFS will write locally or to a machine on the
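
A sketch of that two-step idea against the write shown below (the HDFS staging path and the distcp step are assumptions, not the poster's code):

```java
// Stage the bucketed table on HDFS first, where the write is typically
// faster, then ship the files to S3 out of band.
dataframe.write()
    .format("parquet")
    .bucketBy(500, "bucketColumn1", "bucketColumn2")
    .mode(SaveMode.Overwrite)
    .option("path", "hdfs:///staging/my_table")  // hypothetical staging path
    .saveAsTable("my_table");

// Then, from a shell, copy the staged files up to S3, e.g.:
//   hadoop distcp hdfs:///staging/my_table s3a://my-bucket/my_table
```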

Parquet 'bucketBy' creates a ton of files

2019-07-04 Thread Arwin Tio
I am trying to use Spark's **bucketBy** feature on a pretty large dataset.

```java
dataframe.write()
    .format("parquet")
    .bucketBy(500, bucketColumn1, bucketColumn2)
    .mode(SaveMode.Overwrite)
    .option("path", "s3://my-bucket")
    .saveAsTable("my_table");
```

The problem is that