Re: Moving millions of file using spark

2021-06-16 Thread Molotch
Definitely not a Spark task.

Moving files within the same filesystem is merely a linking exercise; you
don't have to actually move any data. Write a shell script that creates hard
links in the new location, and once you're satisfied, remove the old links.
Profit.
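
For illustration, a minimal sketch of the same idea in Scala (using Java NIO)
rather than a shell script. The paths are made up, and it assumes a local
POSIX filesystem, since hard links can't cross filesystems:

import java.nio.file.{Files, Path, Paths}

// Made-up source and destination directories on the same filesystem.
val src = Paths.get("/data/old-location")
val dst = Paths.get("/data/new-location")

Files.walk(src)
  .filter((p: Path) => Files.isRegularFile(p))
  .forEach { (file: Path) =>
    val target = dst.resolve(src.relativize(file))
    Files.createDirectories(target.getParent)  // mirror the directory layout
    Files.createLink(target, file)             // hard link: no bytes are copied
    ()
  }

Once everything checks out in the new location, deleting the old directory
tree removes the old links without touching the data they point to.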







Re: Reading Large File in Pyspark

2021-05-27 Thread Molotch
You can specify the line separator to make Spark split your records into
separate rows.

val df = spark.read.option("lineSep", "^^^").text("path")

Then you need to split the value column into an array and map over it with
getItem to create a column for each property. Note that split() takes a Java
regex, so the literal asterisks have to be escaped:

df.select(split($"value", "\\*\\*\\*").as("arrayColumn"))

df.select((0 until 8).map(i => $"arrayColumn".getItem(i).as(s"col$i")): _* )

Then you should have a DataFrame with each record on a row and each property
in a column.
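
Putting the pieces together, a rough end-to-end sketch in Scala; the "^^^"
record separator, the "***" field separator, the eight fields and the input
path are all assumptions carried over from this thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

val spark = SparkSession.builder().appName("custom-delimited-read").getOrCreate()
import spark.implicits._

// "^^^" separates records, "***" separates fields (both assumed here).
val raw = spark.read.option("lineSep", "^^^").text("/data/input/largefile.txt")

// split() takes a Java regex, hence the escaped asterisks.
val withArray = raw.select(split($"value", "\\*\\*\\*").as("arrayColumn"))

// One column per field, named col0 .. col7 (assuming eight fields per record).
val result = withArray.select((0 until 8).map(i => $"arrayColumn".getItem(i).as(s"col$i")): _*)
result.show(truncate = false)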






Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Molotch
I would say the pros and cons of Python vs Scala come down to Spark itself,
the languages themselves, and what kind of data engineer you will get when
you try to hire for each option.

With PySpark you get less functionality and added complexity from the Py4J
Java interop compared to vanilla Spark. Why would you want that? Maybe you
want the Python ML tools and have a clear use case; then go for it. If not,
avoid the added complexity and reduced functionality of PySpark.

Python vs Scala? Idiomatic Python is a lesson in bad programming
habits/ideas; there's no other way to put it. Do you really want programmers
who enjoy coding in such a language hacking away at your system?

Scala might be far from perfect with the plethora of ways to express
yourself. But Python < 3.5 is not fit for anything except simple scripting
IMO.

For exploratory data analysis in a Jupyter notebook, PySpark seems like a
fine idea. For coding an entire ETL library including state management, the
whole kitchen including the sink: Scala, every day of the week.






Re: [pyspark 2.3+] Dedupe records

2020-05-30 Thread Molotch
The performant way would be to partition your dataset into reasonably small
chunks and use a bloom filter to see if the entity might be in your set
before you make a lookup.

Check the bloom filter: if the entity might be in the set, rely on partition
pruning to read and backfill the relevant partition. If the entity isn't in
the set, just save it as new data.
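
A minimal sketch of that flow, assuming a parquet-backed store, an "id" key
column, and made-up paths and bloom filter sizing:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("bloom-filter-dedupe").getOrCreate()

// Made-up locations; tune expectedNumItems and fpp to your data.
val existing = spark.read.parquet("/warehouse/events")     // records already stored
val incoming = spark.read.parquet("/staging/new-events")   // batch to deduplicate

// Build a bloom filter over the keys that are already in the store.
val bloom = existing.stat.bloomFilter("id", 10000000L, 0.01)
val mightExist = udf((id: String) => bloom.mightContain(id))

// Definitely-new records can be appended straight away.
val definitelyNew = incoming.filter(!mightExist(col("id")))

// Possible duplicates need the real lookup; partition pruning keeps the
// read/backfill limited to the affected partitions.
val possibleDupes = incoming.filter(mightExist(col("id")))

False positives only mean a few genuinely new records take the more expensive
merge path; the filter never produces false negatives, so no real duplicate
slips through the append path.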

Sooner or later you will probably want to compact the appended partitions to
reduce the number of small files.

Delta Lake has update and compaction semantics, unless you want to do it
manually.
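
For the Delta Lake route, a hedged sketch of an upsert with its merge API;
the table path, the key column and the surrounding session setup are
assumptions:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("delta-upsert").getOrCreate()

// Made-up paths and key column; needs the delta-core dependency and a
// target table already written in the Delta format.
val updates = spark.read.parquet("/staging/new-events")
val target  = DeltaTable.forPath(spark, "/warehouse/events_delta")

target.as("t")
  .merge(updates.as("u"), "t.id = u.id")
  .whenMatched().updateAll()      // overwrite the stored version of matching records
  .whenNotMatched().insertAll()   // append records seen for the first time
  .execute()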

Since 2.4.0 Spark is also able to prune buckets. But as far as I know
there's no way to backfill a single bucket. If there were, the combination of
partition and bucket pruning could dramatically limit the amount of data you
need to read from and write to disk.

RDD vs DataFrame: I'm not sure exactly how and when Tungsten can be used
with RDDs, if at all. Because of that I always try to use DataFrames and the
built-in functions as long as possible, just to get the sweet off-heap
allocation and the "expressions to byte code" thingy along with the Catalyst
optimizations. That will probably do more for your performance than anything
else. The memory overhead of JVM objects and GC runs might be brutal on your
performance and memory usage, depending on your dataset and use case.
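
To make the DataFrame-over-RDD point concrete, a small made-up contrast
between a built-in function and an equivalent RDD lambda:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.upper

val spark = SparkSession.builder().appName("builtin-vs-rdd").getOrCreate()
import spark.implicits._

val people = Seq("alice", "bob").toDF("name")   // made-up sample data

// Built-in function: the expression stays inside Catalyst/Tungsten and is
// compiled to code that works on Spark's off-heap binary row format.
people.select(upper($"name")).show()

// RDD lambda: every record is materialised as a JVM object and Catalyst
// can no longer see inside the function.
people.rdd.map(_.getString(0).toUpperCase).collect().foreach(println)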


br,

molotch


