Dear List,
Has anybody gotten Avro support to work in PySpark? I see multiple
reports of it being broken on Stack Overflow and added my own repro to
this ticket:
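For reference, below is a minimal sketch of the pattern that is supposed to work on Spark 2.4+, where Avro support lives in the external spark-avro module. The package coordinates, app name, and file paths are assumptions for illustration, not from the original report:
```python
# Launch with the external Avro module on the classpath, e.g.:
#   pyspark --packages org.apache.spark:spark-avro_2.12:2.4.3
# (the version and Scala suffix must match your Spark build)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-repro").getOrCreate()

# "users.avro" is a placeholder path
df = spark.read.format("avro").load("users.avro")
df.show()

# Round-trip it back out as Avro
df.write.format("avro").mode("overwrite").save("users_out.avro")
```
On Spark versions before 2.4, the format name is "com.databricks.spark.avro" from the separate Databricks package instead.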
Hi,
I am a new Spark learner. Can someone guide me on a strategy for
gaining expertise in PySpark?
Thanks!!!
Hi all,
I'm trying to create Docker images that can access Azure services using
the ABFS Hadoop driver, which is only available in Hadoop 3.2.
So I downloaded Spark without Hadoop and generated Spark images using the
docker-image-tool.sh itself.
In a new image using the resulting image as FROM,
You need to repartition first (at a minimum by bucketColumn1), since each task
writes out the bucket files. If the bucket keys are distributed randomly
across the RDD partitions, then you will get multiple files per bucket.
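To illustrate, here is a hedged PySpark sketch of that idea, reusing the column and table names from the Java snippet quoted later in this digest; the input path, app name, and Hive-backed catalog are assumptions:
```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bucketed-write")
         .enableHiveSupport()  # assumption: a Hive-backed catalog for saveAsTable
         .getOrCreate())

# Placeholder input path, not from the original thread
df = spark.read.parquet("s3://my-bucket/input")

# Shuffle rows onto the bucket keys first, so each write task holds
# complete buckets and emits one file per bucket rather than many.
(df.repartition(500, "bucketColumn1", "bucketColumn2")
   .write
   .format("parquet")
   .bucketBy(500, "bucketColumn1", "bucketColumn2")
   .mode("overwrite")
   .option("path", "s3://my-bucket/output")
   .saveAsTable("my_table"))
```
Matching the repartition key (and width) to the bucket specification is what keeps each bucket's rows on a single task.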
Are you a data scientist or data engineer?
On Thu, Jul 4, 2019 at 10:34 PM Vikas Garg wrote:
> Hi,
>
> I am a new Spark learner. Can someone guide me on a strategy for
> gaining expertise in PySpark?
>
> Thanks!!!
>
My best advice is to go through the docs and watch lots of demos/videos
from Spark committers.
On Fri, 5 Jul 2019 at 3:03 pm, Kurt Fehlhauer wrote:
> Are you a data scientist or data engineer?
>
>
> On Thu, Jul 4, 2019 at 10:34 PM Vikas Garg wrote:
>
>> Hi,
>>
>> I am a new Spark learner. Can
I am currently working as a data engineer, working with Power BI and
SSIS (an ETL tool). For learning purposes, I have set up PySpark and am
also able to run queries through Spark against a multi-node cluster database
(I am using Vertica, and will later move to HDFS or SQL Server).
I have good knowledge
I am trying to use Spark's **bucketBy** feature on a pretty large dataset.
```java
dataframe.write()
    .format("parquet")
    .bucketBy(500, "bucketColumn1", "bucketColumn2")  // bucket column names
    .mode(SaveMode.Overwrite)
    .option("path", "s3://my-bucket")
    .saveAsTable("my_table");
```
The problem is that
Hi, Arwin.
If I understand you correctly, this is totally expected behaviour.
I don't know much about saving to S3, but maybe you could write to HDFS
first and then copy everything to S3? I think the write to HDFS will probably
be much faster, as Spark/HDFS will write locally or to a machine on the