Re: [PySpark]: How to store NumPy array into single DataFrame cell efficiently

2017-06-28 Thread Judit Planas
Dear Nick, thanks for your quick reply. I quickly implemented your proposal, but I do not see any improvement. In fact, the test data set of around 3 GB occupies a total of 10 GB in worker memory, and the execution time of queries is about 4 times slower

Re: [PySpark]: How to store NumPy array into single DataFrame cell efficiently

2017-06-28 Thread Nick Pentreath
You will need to use PySpark vectors to store the array in a DataFrame. They can be created from NumPy arrays as follows:

    import numpy as np
    from pyspark.ml.linalg import Vectors

    df = spark.createDataFrame([("src1", "pkey1", 1, Vectors.dense(np.array([0, 1, 2])))])

On Wed, 28 Jun 2017 at 12:23 Judit Planas

[PySpark]: How to store NumPy array into single DataFrame cell efficiently

2017-06-28 Thread Judit Planas
Dear all, I am trying to store a NumPy array (loaded from an HDF5 dataset) into one cell of a DataFrame, but I am having problems. In short, my data layout is similar to a database, where I have a few columns with metadata (source of information, primary key,