How does Spark handle timestamps during Pandas dataframe conversion

2017-07-27 Thread saatvikshah1994
I've summarized this question in detail, with code snippets and logs, in this StackOverflow post: https://stackoverflow.com/questions/45308406/how-does-spark-handle-timestamp-types-during-pandas-dataframe-conversion/. I'm looking for efficient solutions to this.
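
For reference, a minimal sketch of the round trip in question (assuming a local SparkSession; the column name ts is illustrative):

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()

    pdf = pd.DataFrame({"ts": pd.to_datetime(["2017-07-27 12:00:00"])})

    # pandas datetime64[ns] columns map to Spark's TimestampType;
    # naive values are interpreted in the session/JVM time zone.
    sdf = spark.createDataFrame(pdf)
    sdf.printSchema()              # ts: timestamp (nullable = true)

    # The reverse conversion yields datetime64[ns] again.
    print(sdf.toPandas().dtypes)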

Informing Spark about specific Partitioning scheme to avoid shuffles

2017-07-22 Thread saatvikshah1994
Hi everyone, My environment is PySpark with Spark 2.0.0. I'm using Spark to load data from a large number of files into a Spark dataframe with fields, say, field1 to field10. While loading the data I have ensured that records are partitioned by field1 and field2 (without using partitionBy). This
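
One commonly suggested workaround, sketched below under stated assumptions (parquet input and the placeholder field names from the message): repartition once on the key fields so that later wide operations on the same keys can reuse that partitioning instead of shuffling again.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("/path/to/files")   # placeholder input path

    # One explicit shuffle up front; the resulting hash partitioning
    # on (field1, field2) is known to the optimizer.
    df = df.repartition("field1", "field2").cache()

    # An aggregation on the same keys can then be planned without
    # a further exchange.
    counts = df.groupBy("field1", "field2").count()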

Spark UI crashes on Large Workloads

2017-07-17 Thread saatvikshah1994
Hi, I have a PySpark app which, when given a huge amount of data as input, sometimes throws the error explained here: https://stackoverflow.com/questions/32340639/unable-to-understand-error-sparklistenerbus-has-already-stopped-dropping-event. All my code runs inside the main function, and
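
The error in that link means the listener bus is dropping events under load; one mitigation that is often suggested (an assumption here, not a verified fix) is to enlarge the event queue:

    from pyspark.sql import SparkSession

    # Spark 2.x: raise the bounded queue that the UI/event-log listeners
    # drain, so bursts of task events are less likely to be dropped.
    spark = (SparkSession.builder
             .config("spark.scheduler.listenerbus.eventqueue.size", "100000")
             .getOrCreate())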

PySpark working with Generators

2017-06-29 Thread saatvikshah1994
Hi, I have a file-reading function called /foo/ which reads a file's contents either into a list of lists or into a generator of lists of lists representing the same file. When reading it as a complete chunk (one record array) I do something like: rdd = file_paths_rdd.map(lambda x:
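
For contrast, a sketch of the two read styles (foo, foo_gen, and file_paths_rdd are stand-ins for the objects described above); flatMap consumes a generator lazily, one record at a time:

    def foo(path):
        # Eager: materialize the whole file as a list of lists.
        with open(path) as f:
            return [line.rstrip("\n").split(",") for line in f]

    def foo_gen(path):
        # Lazy: yield one record (a list) at a time.
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n").split(",")

    records_eager = file_paths_rdd.flatMap(foo)      # whole file held in memory per task
    records_lazy = file_paths_rdd.flatMap(foo_gen)   # streamed record by record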

Using Spark with Local File System/NFS

2017-06-22 Thread saatvikshah1994
Hi, I've downloaded and kept the same set of data files on all my cluster nodes, at the same absolute path - say /home/xyzuser/data/*. I am now trying to perform an operation (say open(filename).read()) on all these files in Spark, by passing local file paths. I was under the assumption that
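
A sketch of that setup, assuming the files really are present at the same absolute path on every node; the glob runs on the driver, while each open() happens on whichever executor runs the task:

    import glob
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    paths = glob.glob("/home/xyzuser/data/*")    # expanded on the driver
    contents = sc.parallelize(paths).map(lambda p: open(p).read())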

Merging multiple Pandas dataframes

2017-06-19 Thread saatvikshah1994
Hi, I am iteratively receiving a file which can only be opened as a Pandas dataframe. For the first such file I receive, I convert it to a Spark dataframe using the 'createDataFrame' utility function. From the next file onward, I convert it and union it into the first Spark dataframe.
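
A sketch of that incremental pattern (pandas_frames is a hypothetical stand-in for the files received so far):

    from functools import reduce
    from pyspark.sql import DataFrame, SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Convert each received pandas dataframe, then fold union over the
    # list; note the query lineage still grows with every union.
    sdfs = [spark.createDataFrame(pdf) for pdf in pandas_frames]
    merged = reduce(DataFrame.union, sdfs)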

Best alternative for Category Type in Spark Dataframe

2017-06-15 Thread saatvikshah1994
Hi, I'm trying to convert a Pandas dataframe to a Spark dataframe. One of my columns is of the Category type in Pandas, but there does not seem to be an equivalent type in Spark. What is the best alternative?
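
The usual workaround (stated as an assumption, not a confirmed best practice) is to cast the Categorical column to strings before the transfer and, if a numeric encoding is needed, re-index on the Spark side:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer

    spark = SparkSession.builder.getOrCreate()

    pdf = pd.DataFrame({"color": pd.Categorical(["red", "blue", "red"])})
    pdf["color"] = pdf["color"].astype(str)     # Category -> plain strings

    sdf = spark.createDataFrame(pdf)            # arrives as StringType

    # Rebuild an integer encoding in Spark if one is needed downstream.
    indexed = (StringIndexer(inputCol="color", outputCol="color_idx")
               .fit(sdf).transform(sdf))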