How does Spark handle timestamps during Pandas dataframe conversion
I've described this question in detail, with code snippets and logs, in this Stack Overflow post: https://stackoverflow.com/questions/45308406/how-does-spark-handle-timestamp-types-during-pandas-dataframe-conversion/. I'm looking for efficient ways to handle this.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-does-Spark-handle-timestamps-during-Pandas-dataframe-conversion-tp29004.html
Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
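Not from the thread, but the crux of timestamp conversion questions like this is usually which timezone is assumed when a naive timestamp is mapped to an epoch value. A minimal stdlib sketch of that ambiguity (the zone offsets here are arbitrary examples):

```python
from datetime import datetime, timezone, timedelta

# A naive timestamp carries no zone info; how it maps to an epoch
# value depends entirely on which zone the converter assumes.
naive = datetime(2017, 7, 25, 12, 0, 0)

# Interpreted as UTC:
as_utc = naive.replace(tzinfo=timezone.utc)
# Interpreted as UTC+5:30 (e.g. a local session timezone):
ist = timezone(timedelta(hours=5, minutes=30))
as_ist = naive.replace(tzinfo=ist)

# Same wall-clock value, but the epoch seconds differ by the offset.
print(as_utc.timestamp() - as_ist.timestamp())  # 19800.0 (5.5 hours)
```

The same wall-clock value yields epoch values 5.5 hours apart, which is exactly the kind of drift people observe when a converter on one side assumes UTC and the other side assumes the local/session timezone.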
Informing Spark about specific Partitioning scheme to avoid shuffles
Hi everyone, my environment is PySpark with Spark 2.0.0. I'm using Spark to load data from a large number of files into a DataFrame with fields, say, field1 to field10. While loading the data I ensured that records are partitioned by field1 and field2 (without using partitionBy); this was done while loading the data into an RDD of lists, before the .toDF() call. So I assume Spark would not already know that such a partitioning exists, and might trigger a shuffle if I call a shuffling transform using field1 or field2 as keys and then cache that information. Once I've created the DataFrame, is it possible to inform Spark about my custom partitioning scheme? Or would Spark have already discovered this somehow before the shuffling transform is called?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Informing-Spark-about-specific-Partitioning-scheme-to-avoid-shuffles-tp28922.html
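Not an answer from the thread, but some context on why Spark can't discover this on its own: a shuffle is avoided only when Spark has a recorded partitioner saying that all records for a given key already live in one partition; a layout created manually before .toDF() carries no such record, which is why a common suggestion is to call df.repartition("field1", "field2") after building the DataFrame. A pure-Python sketch of the keyed-layout property itself (hypothetical data, Spark not involved):

```python
# Sketch of hash partitioning, the scheme a shuffle uses: a record's
# partition is determined by its key alone.
def partition_for(key, num_partitions):
    return hash(key) % num_partitions

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = {}
for key, value in records:
    parts.setdefault(partition_for(key, 4), []).append((key, value))

# All records sharing a key land in the same partition, so a later
# key-based operation needs no data movement -- but only if the engine
# *knows* the data is laid out this way (i.e. a partitioner is set).
```

The point of the sketch: the layout and the engine's knowledge of the layout are separate things, and only the latter suppresses a shuffle.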
Spark UI crashes on Large Workloads
Hi, I have a PySpark app which, when given a huge amount of input data, sometimes throws the error explained here: https://stackoverflow.com/questions/32340639/unable-to-understand-error-sparklistenerbus-has-already-stopped-dropping-event. All my code runs inside the main function, and the only slightly peculiar thing this app does is use a custom PySpark ML Transformer (modified from https://stackoverflow.com/questions/32331848/create-a-custom-transformer-in-pyspark-ml). Could this be the issue? How can I debug why this is happening?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-UI-crashes-on-Large-Workloads-tp28873.html
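Not from the thread, but for context: the "SparkListenerBus has already stopped! Dropping event" message generally appears either when the listener event queue backing the UI overflows under heavy load, or when the SparkContext is stopped while events are still in flight. One commonly suggested mitigation is enlarging that queue; a hedged config sketch (the property name below is my assumption for Spark 2.x and should be checked against your version's configuration docs, and `my_app.py` is a placeholder):

```shell
# Enlarge the listener bus event queue (default is much smaller)
# before diagnosing further; an overflow here drops UI events.
spark-submit \
  --conf spark.scheduler.listenerbus.eventqueue.size=100000 \
  my_app.py
```

If the error instead coincides with the app ending, it usually means something stopped the context early (an exception on the driver, or an explicit sc.stop() before outstanding jobs finished), so checking the driver log just before the first dropped event is a reasonable debugging step.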
PySpark working with Generators
Hi, I have a file-reading function called foo which reads a file's contents either into a list of lists, or into a generator of lists of lists representing the same file. When reading the whole file as one chunk (a single-record array) I do something like: rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x). I'd now like to do something similar with the generator, so that I can use more cores with lower memory. I'm not sure how to tackle this, since generators cannot be pickled, so I'm not sure how to distribute the work of reading each file path across the RDD.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html
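Not from the thread, but one detail worth separating out: it is live generator *objects* that cannot be pickled; a module-level function that returns a generator pickles fine (by reference), so the function itself can be shipped to workers and called there. A stdlib sketch (read_rows is a hypothetical stand-in for foo):

```python
import pickle

# Module-level function that *returns* a generator: picklable by
# reference, so it can be shipped to workers.
def read_rows(path):
    # Hypothetical lazy reader; yields one "list of lists" at a time.
    for i in range(3):
        yield [[path, i]]

# The function pickles fine...
pickle.dumps(read_rows)

# ...but a live generator object does not.
gen = read_rows("f1")
try:
    pickle.dumps(gen)
except TypeError as e:
    print("generator not picklable:", e)
```

Under that assumption, the usual pattern is roughly `file_paths_rdd.flatMap(read_rows)`: flatMap consumes the generator lazily on the executor side, so the whole file never needs to materialize as a single list.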
Using Spark with Local File System/NFS
Hi, I've downloaded and kept the same set of data files on all my cluster nodes, in the same absolute path, say /home/xyzuser/data/*. I am now trying to perform an operation (say open(filename).read()) on all these files in Spark, passing local file paths. I was under the assumption that as long as a worker can find the file path, it will be able to execute the task. However, my Spark tasks fail with the error "/home/xyzuser/data/* is not present", and I'm sure it's present on all my worker nodes. If this experiment were successful I was planning to set up an NFS (actually more like a read-only cloud persistent disk attached to my cluster nodes on Dataproc) and use that instead. What exactly is going wrong here? Thanks

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-with-Local-File-System-NFS-tp28781.html
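Not from the thread, but two common culprits for this symptom, offered as guesses: if the path goes through Spark's own readers, a bare path resolves against the cluster's default filesystem (HDFS on Dataproc) unless prefixed with file://; and if it goes through Python's open(), a * wildcard is taken literally and never expanded. A stdlib sketch of the latter pitfall:

```python
import glob
import os
import tempfile

# open() treats "*" literally; it never expands wildcards, so a path
# like "/home/xyzuser/data/*" fails even when files exist there.
d = tempfile.mkdtemp()
for name in ("a.txt", "b.txt"):
    with open(os.path.join(d, name), "w") as f:
        f.write("data")

try:
    open(os.path.join(d, "*")).read()
except OSError as e:
    print("literal wildcard fails:", e)

# Expand the pattern first, then open each match:
matches = sorted(glob.glob(os.path.join(d, "*")))
assert len(matches) == 2
```

So a reasonable check is to confirm whether the failing call is open() receiving the unexpanded wildcard string, versus a Spark reader looking for the path on HDFS rather than the local disk.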
Merging multiple Pandas dataframes
Hi, I am iteratively receiving a file which can only be opened as a Pandas dataframe. For the first such file I receive, I convert it to a Spark dataframe using the createDataFrame utility function. For each subsequent file, I convert it and union it into the first Spark dataframe (the schema always stays the same). After each union, I persist it (at the MEMORY_AND_DISK storage level). After I have converted all such files into a single Spark dataframe, I coalesce it, following some tips from this Stack Overflow post: https://stackoverflow.com/questions/39381183/managing-spark-partitions-after-dataframe-unions. Any suggestions for optimizing this process further?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Merging-multiple-Pandas-dataframes-tp28770.html
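Not from the thread, but one commonly suggested variant of this process: since each union adds lineage and partitions, build the per-file frames first and fold them together once, persisting and coalescing a single time at the end rather than after every step. A pure-Python sketch of the shape of that fold (lists stand in for DataFrames and + for union, since Spark isn't runnable here):

```python
from functools import reduce

# Lists stand in for per-file DataFrames; "+" stands in for union.
per_file_frames = [[("a", 1)], [("b", 2)], [("c", 3)]]

# One fold over all frames, instead of union+persist on every step:
merged = reduce(lambda acc, df: acc + df, per_file_frames)
# In PySpark this shape would be roughly:
#   merged = reduce(lambda a, b: a.union(b), spark_frames).coalesce(n)
# with persist() applied once to `merged` rather than per iteration.
print(merged)  # [('a', 1), ('b', 2), ('c', 3)]
```

This keeps only one persisted result alive instead of a chain of intermediate persisted frames, which is the usual memory win in this pattern.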
Best alternative for Category Type in Spark Dataframe
Hi, I'm trying to convert a Pandas dataframe to a Spark dataframe. One of my columns is of the Category type in Pandas, but there does not seem to be support for an equivalent type in Spark. What is the best alternative?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-Spark-Dataframe-tp28764.html
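Not from the thread, but one common workaround (an assumption on my part): keep the column as plain strings in Spark, or store the pandas category codes as an integer column alongside a small code-to-label mapping kept as metadata; Spark ML's StringIndexer produces a similar codes-plus-labels representation. A stdlib sketch of the codes-plus-mapping idea:

```python
# Minimal sketch of what a categorical column stores internally:
# integer codes plus a categories lookup table.
values = ["red", "green", "red", "blue", "green"]

categories = sorted(set(values))           # ['blue', 'green', 'red']
code_of = {c: i for i, c in enumerate(categories)}
codes = [code_of[v] for v in values]       # [2, 1, 2, 0, 1]

# Ship `codes` to Spark as an integer column and keep `categories`
# as side metadata; decode on the way back:
decoded = [categories[c] for c in codes]
assert decoded == values
```

This preserves the memory benefit of the categorical encoding and round-trips losslessly, at the cost of carrying the categories list alongside the data yourself.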