Re: Moving millions of files using Spark
Definitely not a Spark task. Moving files within the same filesystem is merely a linking exercise; you don't have to actually move any data. Write a shell script that creates hard links in the new location, and once you're satisfied, remove the old links. Profit.
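A rough Scala sketch of the same hard-link idea (a shell script with ln would do just as well); the directory names here are made up:

    import java.nio.file.{Files, Paths}
    import scala.collection.JavaConverters._

    // Hypothetical source and target directories; adjust to your layout.
    val src = Paths.get("/data/old-layout")
    val dst = Paths.get("/data/new-layout")

    // Walk the old tree and hard-link every file into the new one: no bytes move.
    Files.walk(src).iterator().asScala
      .filter(p => Files.isRegularFile(p))
      .foreach { file =>
        val target = dst.resolve(src.relativize(file))
        Files.createDirectories(target.getParent) // mirror the directory structure
        Files.createLink(target, file)            // link first, delete the originals later
      }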
Re: Reading Large File in Pyspark
You can specify the line separator to make Spark split your records into separate rows:

    df = spark.read.option("lineSep", "^^^").text("path")

Then you need to split the value column into an array and map over it with getItem to create a column for each property. Note that split() takes a regex, so a separator like *** has to be escaped. In Scala:

    df.select(split($"value", "\\*\\*\\*").as("arrayColumn"))
      .select((0 until 8).map(i => $"arrayColumn".getItem(i).as(s"col$i")): _*)

Then you should have a DataFrame with each record on a row and each property in a column.
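For reference, a self-contained Scala sketch of the whole pipeline, assuming ^^^ separates records, *** separates the eight properties, and the input path is hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.split

    val spark = SparkSession.builder().appName("wide-records").getOrCreate()
    import spark.implicits._

    // "lineSep" makes the text source break records on ^^^ instead of newlines
    val raw = spark.read.option("lineSep", "^^^").text("/data/large-file.txt")

    // split() takes a Java regex, hence the escaping of the literal separator
    val arrayed = raw.select(split($"value", "\\*\\*\\*").as("arrayColumn"))

    // expand the first eight array elements into named columns
    val result = arrayed.select((0 until 8).map(i => $"arrayColumn".getItem(i).as(s"col$i")): _*)
    result.show(5, truncate = false)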
Re: Scala vs Python for ETL with Spark
I would say the pros and cons of Python vs Scala come down to Spark itself, the languages themselves, and what kind of data engineer you will get when you hire for each solution. With PySpark you get less functionality and increased complexity from the Py4J Java interop compared to vanilla Spark. Why would you want that? Maybe you want the Python ML tools and have a clear use case; then go for it. If not, avoid the increased complexity and reduced functionality of PySpark. Python vs Scala? Idiomatic Python is a lesson in bad programming habits/ideas, there's no other way to put it. Do you really want programmers who enjoy coding in such a language hacking away at your system? Scala might be far from perfect with its plethora of ways to express yourself, but Python < 3.5 is not fit for anything except simple scripting IMO. For doing exploratory data analysis in a Jupyter notebook, PySpark seems like a fine idea. For coding an entire ETL library including state management, the whole kitchen including the sink: Scala, every day of the week.
Re: [pyspark 2.3+] Dedupe records
The performant way would be to partition your dataset into reasonably small chunks and use a Bloom filter to check whether an entity might already be in your set before you make a lookup. Check the Bloom filter; if the entity might be in the set, rely on partition pruning to read and backfill the relevant partition. If the entity isn't in the set, just save it as new data. Sooner or later you will probably want to compact the appended partitions to reduce the number of small files. Delta Lake has update and compaction semantics if you don't want to do it manually.

Since 2.4.0 Spark is also able to prune buckets, but as far as I know there's no way to backfill a single bucket. If there were, the combination of partition and bucket pruning could dramatically limit the amount of data you need to read/write from/to disk.

RDD vs DataFrame: I'm not sure exactly how and when Tungsten can be used with RDDs, if at all. Because of that I always try to use DataFrames and the built-in functions as long as possible, just to get the sweet off-heap allocation and the "expressions to byte code" thingy along with the Catalyst optimizations. That will probably do more for your performance than anything else. The memory overhead of JVM objects and GC runs can be brutal on performance and memory usage, depending on your dataset and use case.

br, molotch
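A rough Scala sketch of the Bloom-filter split, assuming a hypothetical layout with an "id" key and "day" partitions, and that the existing keys fit comfortably in a driver-side filter:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    val spark = SparkSession.builder().appName("dedupe-sketch").getOrCreate()

    // Hypothetical paths and columns: data keyed by "id", partitioned by "day".
    val existing = spark.read.parquet("/warehouse/events")
    val incoming = spark.read.parquet("/incoming/events")

    // Bloom filter over the keys we already have (roughly 1% false positives).
    val bloom = existing.stat.bloomFilter("id", 10000000L, 0.01)
    val bloomBc = spark.sparkContext.broadcast(bloom)
    val mightExist = udf((id: String) => bloomBc.value.mightContain(id))

    // Rows the filter has never seen are guaranteed new: append them directly.
    incoming.filter(!mightExist(col("id")))
      .write.mode("append").partitionBy("day").parquet("/warehouse/events")

    // The rest might be duplicates: partition pruning on "day" keeps the
    // read and backfill of the affected partitions small.
    val needsBackfill = incoming.filter(mightExist(col("id")))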