Re: Moving millions of files using Spark

2021-06-16 Thread Molotch
Definitely not a Spark task. Moving files within the same filesystem is merely a linking exercise; you don't have to actually move any data. Write a shell script that creates hard links in the new location, and once you're satisfied, remove the old links. Profit.
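A minimal sketch of the linking approach in Python rather than shell (the directory names are hypothetical; os.link creates a hard link, which only works within a single filesystem):

```python
import os

def relink(src_dir: str, dst_dir: str) -> None:
    """Create hard links in dst_dir for every regular file in src_dir.

    A hard link is just a new directory entry pointing at the same
    inode, so no file data is copied or moved.
    """
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            os.link(src, os.path.join(dst_dir, name))

# Once the new links are verified, removing the old directory entries
# (os.remove on each file in src_dir) completes the "move".
```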

Re: Reading Large File in Pyspark

2021-05-27 Thread Molotch
You can specify the line separator to make Spark split your records into separate rows: df = spark.read.option("lineSep", "^^^").text("path"). Then split the value column into an array, e.g. df.select(split(df["value"], r"\*\*\*").alias("arrayColumn")) — note the second argument to split is a regex, so literal asterisks must be escaped — and map over it with getItem to create a column for each
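The splitting logic above can be illustrated without a Spark cluster; this plain-Python sketch mimics what .option("lineSep", "^^^") followed by split on "***" produces (the separators "^^^" and "***" are the ones from the example, not fixed Spark defaults):

```python
def parse_records(raw: str, line_sep: str = "^^^", field_sep: str = "***"):
    """Split raw text into records on line_sep, then each record
    into a list of fields on field_sep -- the same row/array shape
    Spark yields from lineSep plus split()."""
    records = [r for r in raw.split(line_sep) if r]
    return [r.split(field_sep) for r in records]
```

In actual PySpark, each inner list would then be expanded into columns with getItem(i) per field index.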

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Molotch
I would say the pros and cons of Python vs Scala come down to Spark itself, the languages in their own right, and what kind of data engineer you will get when you try to hire for each solution. With PySpark you get less functionality and increased complexity from the py4j Java interop compared

Re: [pyspark 2.3+] Dedupe records

2020-05-30 Thread Molotch
thing else. The memory overhead of JVM objects and GC runs might be brutal on your performance and memory usage depending on your dataset and use case. br, molotch