Dear List,
Has anybody gotten Avro support to work in PySpark? I see multiple
reports of it being broken on Stack Overflow and added my own repro to
this ticket:
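For reference, below is a minimal sketch of the pattern that is supposed to work on Spark 2.4+, where Avro support lives in the external spark-avro module. The package coordinates, app name, and file paths are assumptions for illustration, not from the original report:
```python
# Launch with the external Avro module on the classpath, e.g.:
#   pyspark --packages org.apache.spark:spark-avro_2.12:2.4.3
# (the version and Scala suffix must match your Spark build)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-repro").getOrCreate()

# "users.avro" is a placeholder path
df = spark.read.format("avro").load("users.avro")
df.show()

# Round-trip it back out as Avro
df.write.format("avro").mode("overwrite").save("users_out.avro")
```
On Spark versions before 2.4, the format name is "com.databricks.spark.avro" from the separate Databricks package instead.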
Hi,
I am a new Spark learner. Can someone guide me on a strategy for
gaining expertise in PySpark?
Thanks!!!
Hi all,
I'm trying to create Docker images that can access Azure services using
the ABFS Hadoop driver, which is only available in Hadoop 3.2.
So I downloaded Spark without Hadoop and generated Spark images using the
docker-image-tool.sh itself.
In a new image using the resulting image as FROM,
You need to repartition first (at a minimum by bucketColumn1), since each task
writes out the bucket files. If the bucket keys are distributed randomly
across the RDD partitions, then you will get multiple files per bucket.
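To illustrate, here is a hedged PySpark sketch of that idea, reusing the column and table names from the Java snippet quoted later in this digest; the input path, app name, and Hive-backed catalog are assumptions:
```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bucketed-write")
         .enableHiveSupport()  # assumption: a Hive-backed catalog for saveAsTable
         .getOrCreate())

# Placeholder input path, not from the original thread
df = spark.read.parquet("s3://my-bucket/input")

# Shuffle rows onto the bucket keys first, so each write task holds
# complete buckets and emits one file per bucket rather than many.
(df.repartition(500, "bucketColumn1", "bucketColumn2")
   .write
   .format("parquet")
   .bucketBy(500, "bucketColumn1", "bucketColumn2")
   .mode("overwrite")
   .option("path", "s3://my-bucket/output")
   .saveAsTable("my_table"))
```
Matching the repartition key (and width) to the bucket specification is what keeps each bucket's rows on a single task.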
Are you a data scientist or data engineer?
On Thu, Jul 4, 2019 at 10:34 PM Vikas Garg wrote:
> Hi,
>
> I am a new Spark learner. Can someone guide me on a strategy for
> gaining expertise in PySpark?
>
> Thanks!!!
>
My best advice is to go through the docs and watch lots of demos/videos
from Spark committers.
On Fri, 5 Jul 2019 at 3:03 pm, Kurt Fehlhauer wrote:
> Are you a data scientist or data engineer?
>
>
> On Thu, Jul 4, 2019 at 10:34 PM Vikas Garg wrote:
>
>> Hi,
>>
>> I am a new Spark learner. Can
I am currently working as a data engineer, working with Power BI and
SSIS (an ETL tool). For learning purposes, I have set up PySpark and am
also able to run queries through Spark against a multi-node cluster database
(I am using Vertica, and will later move to HDFS or SQL Server).
I have good knowledge
I am trying to use Spark's **bucketBy** feature on a pretty large dataset.
```java
dataframe.write()
    .format("parquet")
    .bucketBy(500, "bucketColumn1", "bucketColumn2")  // bucket column names
    .mode(SaveMode.Overwrite)
    .option("path", "s3://my-bucket")
    .saveAsTable("my_table");
```
The problem is that
Hi, Arwin.
If I understand you correctly, this is totally expected behaviour.
I don't know much about saving to S3, but maybe you could write to HDFS
first and then copy everything to S3? I think the write to HDFS will probably
be much faster, as Spark/HDFS will write locally or to a machine on the