Re: A simple example that demonstrates that a Spark distributed cluster is faster than Spark Local Standalone

2020-09-24 Thread Sean Owen
If you have the same amount of resources (cores, memory, etc.) on one machine, that is pretty much always going to be faster than using those same resources split across several machines. Even if you have somewhat more resources available on a cluster, the distributed version could be slower if you,

A simple example that demonstrates that a Spark distributed cluster is faster than Spark Local Standalone

2020-09-24 Thread javaguy Java
Hi, I made a post on stackoverflow that I can't seem to make any headway on: https://stackoverflow.com/questions/63834379/spark-performance-local-faster-than-cluster Before someone starts making suggestions on changing the code, note that the code and example on the above post are from a Udemy

Re: Is RDD.persist honoured if multiple actions are executed in parallel

2020-09-24 Thread Michael Mior
If you want to ensure the persisted RDD has been calculated, just run foreach with a dummy function first to force evaluation. -- Michael Mior michael.m...@gmail.com On Thu, Sep 24, 2020 at 00:38, Arya Ketan wrote: > > Thanks, we were able to validate the same behaviour. > > On Wed, 23
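
A minimal PySpark sketch of that suggestion (the RDD and the dummy function below are illustrative, not from the thread):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
    rdd.persist()                 # mark the RDD for caching; nothing is computed yet
    rdd.foreach(lambda _: None)   # dummy action: forces evaluation so the cache is populated

    # actions that run afterwards (possibly in parallel) reuse the persisted data
    print(rdd.count(), rdd.sum())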

Re: [Pyspark 3 Debug] Date values reset to Unix epoch

2020-09-24 Thread EveLiao
I can't see your code and return values. Can you post them again?

Re: [Pyspark 3 Debug] Date values reset to Unix epoch

2020-09-24 Thread Andrew Mullins
My apologies, my code sections were eaten. Code: import datetime as dt import pyspark def get_spark(): return pyspark.sql.SparkSession.builder.enableHiveSupport().getOrCreate() if __name__ == '__main__': spark = get_spark() table = spark.createDataFrame( [("1234",
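
The snippet is cut off by the digest; below is a self-contained sketch along the same lines, with the row contents, column names, and table name assumed rather than taken from the original post:

    import datetime as dt
    import pyspark

    def get_spark():
        return pyspark.sql.SparkSession.builder.enableHiveSupport().getOrCreate()

    if __name__ == '__main__':
        spark = get_spark()
        # assumed schema: an id string plus a date column
        table = spark.createDataFrame(
            [("1234", dt.date(2020, 9, 24))],
            ["id", "as_of_date"],
        )
        table.write.mode("overwrite").saveAsTable("date_repro")   # hypothetical table name
        spark.table("date_repro").show()
        # the thread reports that on Pyspark 3 the stored date came back as 1970-01-01
        # (the Unix epoch), while Pyspark 2.4.4 kept the correct value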

Re: Let multiple jobs share one rdd?

2020-09-24 Thread Khalid Mammadov
Perhaps you can use Global Temp Views? https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createGlobalTempView On 24/09/2020 14:52, Gang Li wrote: Hi all, There are three jobs, among which the first rdd is the same. Can the first rdd be calculated once,
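
A quick PySpark sketch of that suggestion (the view name and data are illustrative):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # compute the shared input once, cache it, and expose it as a global temp view
    shared = spark.range(1000).withColumn("bucket", F.col("id") % 10)
    shared.cache()
    shared.createGlobalTempView("shared_input")

    # any other session of the same application can read it via the global_temp database
    other = spark.newSession()
    other.sql("SELECT bucket, count(*) AS n FROM global_temp.shared_input GROUP BY bucket").show()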

[Pyspark 3 Debug] Date values reset to Unix epoch

2020-09-24 Thread Andrew Mullins
I am encountering a bug with a broken unit test - it passes on Pyspark 2.4.4 but fails on Pyspark 3.0. I've managed to create a minimal reproducible example of the issue. The following code: Returns the following on Pyspark 3: On Pyspark 2.4.4, the final table has the correct date value.

Re: Distribute entire columns to executors

2020-09-24 Thread Jeff Evans
I think you can just select the columns you need into new DataFrames, then process those separately. val dfFirstTwo = ds.select("Col1", "Col2") # do whatever with this one dfFirstTwo.sort(...) # similar for the next two columns val dfNextTwo = ds.select("Col3", "Col4") dfNextTwo.sort(...) These
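
For reference, the same idea as a runnable PySpark sketch (the snippet above is Scala; the column names match it, the data is made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    ds = spark.createDataFrame([(1, 2, 3, 4), (5, 6, 7, 8)], ["Col1", "Col2", "Col3", "Col4"])

    # each pair of columns becomes its own DataFrame and can be processed independently
    df_first_two = ds.select("Col1", "Col2")
    df_first_two.sort("Col1").show()

    df_next_two = ds.select("Col3", "Col4")
    df_next_two.sort("Col3").show()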

Let multiple jobs share one rdd?

2020-09-24 Thread Gang Li
Hi all, There are three jobs, among which the first RDD is the same. Can the first RDD be calculated once, with the subsequent operations then computed in parallel? My code is as follows: sqls
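
One common pattern for this, sketched below rather than taken from the poster's code: persist the shared first step once, then launch the downstream actions from separate threads so Spark can schedule the jobs concurrently (the three queries are placeholders):

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # the shared first step, computed once and cached
    shared = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
    shared.cache()
    shared.count()   # materialize the cache before the parallel jobs start

    def job_a():
        return shared.groupBy("bucket").count().collect()

    def job_b():
        return shared.agg(F.sum("id")).collect()

    def job_c():
        return shared.filter(F.col("id") % 2 == 0).count()

    # each action is its own Spark job; submitting them from threads lets them overlap
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(j) for j in (job_a, job_b, job_c)]
        results = [f.result() for f in futures]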

Re: Distribute entire columns to executors

2020-09-24 Thread Lalwani, Jayesh
You could convert columns to rows. Something like this: val cols = Seq("A", "B", "C") df.flatMap( row => { cols.map(c => (row.getAsTimeStamp("timestamp"), row.getAsInt(c), c) ) }).toDF("timestamp", "value", "colName") If you are using dataframes, all of your columns are of the same type. If
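
A PySpark rendering of the same unpivot idea (the snippet above is Scala pseudo-code; the column names match it, the values are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("2020-09-24 00:00:00", 1, 2, 3)],
        ["timestamp", "A", "B", "C"],
    )

    # stack() turns the three value columns into (colName, value) rows, keeping the timestamp
    long_df = df.select(
        "timestamp",
        F.expr("stack(3, 'A', A, 'B', B, 'C', C) as (colName, value)"),
    )
    long_df.show()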

Re: Edge AI with Spark

2020-09-24 Thread Deepak Sharma
Near-edge would work in this case. On-edge doesn't make much sense, especially for a distributed processing framework such as Spark. On Thu, Sep 24, 2020 at 3:12 PM Gourav Sengupta wrote: > hi, > > its better to use lighter frameworks over edge. Some of the edge devices I > work on run at

Distribute entire columns to executors

2020-09-24 Thread Pedro Cardoso
Hello, Is it possible in Spark to map partitions such that partitions are column-based and not row-based? My use-case is to compute time series of numerical values, i.e. exponential moving averages over the values of a given dataset's column. Suppose there is a dataset with roughly 200

Re: Edge AI with Spark

2020-09-24 Thread Gourav Sengupta
Hi, it's better to use lighter frameworks at the edge. Some of the edge devices I work on run at 40 to 50 degrees Celsius, so using lighter frameworks will be better for the health of the device. Regards, Gourav On Thu, Sep 24, 2020 at 8:42 AM ayan guha wrote: > Too broad a question

Re: Edge AI with Spark

2020-09-24 Thread ayan guha
Too broad a question, and the short answer is yes while the long answer is it depends. Essentially Spark is a compute engine, so it can be wrapped into any containerized model and deployed at the edge. I believe there are various implementations available. On Thu, 24 Sep 2020 at 5:19 pm, Marco

Edge AI with Spark

2020-09-24 Thread Marco Sassarini
Hi, I'd like to know if Spark supports edge AI: can Spark run on physical device such as mobile devices running Android/iOS? Best regards, Marco Sassarini [cid:b995380c-a2a9-47fd-a865-edcad29e4206] Marco Sassarini Artificial Intelligence Department office: +39 0434 562 978