How does Spark handle timestamps during Pandas dataframe conversion

2017-07-27 Thread saatvikshah1994
I've summarized this question in detail, with code snippets and logs, in this
StackOverflow post:
https://stackoverflow.com/questions/45308406/how-does-spark-handle-timestamp-types-during-pandas-dataframe-conversion/.
I'm looking for efficient solutions to this.
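
For anyone who doesn't want to click through, here is a minimal, self-contained
sketch of the kind of conversion in question (hypothetical column names; forcing
Python datetimes plus an explicit TimestampType schema is just one variant, not
necessarily the efficient solution I'm after):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame({
    "event": ["a", "b"],
    "ts": pd.to_datetime(["2017-07-01 10:00:00", "2017-07-02 11:30:00"]),
})

schema = StructType([
    StructField("event", StringType(), True),
    StructField("ts", TimestampType(), True),
])

# Hand Spark plain Python datetimes rather than numpy datetime64[ns] values.
rows = [(e, t.to_pydatetime()) for e, t in zip(pdf["event"], pdf["ts"])]
sdf = spark.createDataFrame(rows, schema=schema)
sdf.printSchema()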



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-does-Spark-handle-timestamps-during-Pandas-dataframe-conversion-tp29004.html



Informing Spark about specific Partitioning scheme to avoid shuffles

2017-07-22 Thread saatvikshah1994
Hi everyone,

My environment is PySpark with Spark 2.0.0. 

I'm using Spark to load data from a large number of files into a Spark
dataframe with fields, say, field1 to field10. While loading the data I have
ensured that records are partitioned by field1 and field2 (without using
partitionBy). This was done while loading the data into an RDD of lists,
before the .toDF() call. So I assume Spark would not already know that such
a partitioning exists, and might trigger a shuffle if I call a shuffling
transform using field1 or field2 as keys and then cache that result.

Is it possible to inform Spark, once I've created the dataframe, about my
custom partitioning scheme? Or would Spark have already discovered this
somehow before the shuffling transform call?
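
For reference, the only workaround I can think of is paying for one explicit
shuffle up front so that Spark itself establishes the partitioning; a minimal
sketch with hypothetical data (Spark only seems to track partitioning that it
applied itself):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the dataframe built from the loaded files.
df = spark.createDataFrame(
    [(1, "a", 10), (1, "b", 20), (2, "a", 30)],
    ["field1", "field2", "field3"],
)

# One explicit shuffle establishes a hash partitioning on (field1, field2)...
partitioned = df.repartition("field1", "field2").cache()

# ...which later aggregations on the same keys can usually reuse without another
# exchange (verify with explain()).
partitioned.groupBy("field1", "field2").count().explain()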





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Informing-Spark-about-specific-Partitioning-scheme-to-avoid-shuffles-tp28922.html



Spark UI crashes on Large Workloads

2017-07-17 Thread saatvikshah1994
Hi,

I have a PySpark app which, when given a huge amount of data as input,
sometimes throws the error explained here:
https://stackoverflow.com/questions/32340639/unable-to-understand-error-sparklistenerbus-has-already-stopped-dropping-event.
All my code runs inside the main function, and the only slightly peculiar
thing I am doing in this app is using a custom PySpark ML Transformer
(modified from
https://stackoverflow.com/questions/32331848/create-a-custom-transformer-in-pyspark-ml).
Could this be the issue? How can I debug why this is happening?
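
One thing I'm considering trying (an assumption on my part, not a confirmed
fix) is giving the listener event queue that feeds the UI more headroom, since
the dropped-event warnings suggest it is overflowing:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("large-workload-app")   # hypothetical app name
    # Default queue size is 10000 in Spark 2.x; raising it delays event dropping.
    .config("spark.scheduler.listenerbus.eventqueue.size", "100000")
    .getOrCreate()
)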



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-UI-crashes-on-Large-Workloads-tp28873.html



PySpark working with Generators

2017-06-29 Thread saatvikshah1994
Hi,

I have a file-reading function called /foo/ which reads a file's contents
either into a list of lists, or into a generator of lists of lists
representing the same file.

When reading a file as one complete chunk (a single record array) I do
something like:
rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)

I'd now like to do something similar with the generator, so that I can use
more cores with a lower memory footprint. I'm not sure how to tackle this,
since generators cannot be pickled, so how do I distribute the work of
reading each file path across the RDD?
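
The closest I've gotten is creating the generator inside the task itself, so it
never has to be pickled; flatMap will drain a function that yields (the
"generator" mode name for foo below is hypothetical):

def read_lazily(path):
    # foo(path, "generator") is assumed to return a generator of lists of lists;
    # it is built and consumed on the executor, so no generator is ever pickled.
    for record in foo(path, "generator"):
        yield record

rdd = file_paths_rdd.flatMap(read_lazily)

Would this approach distribute the reads the way I expect?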



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html



Using Spark with Local File System/NFS

2017-06-22 Thread saatvikshah1994
Hi,

I've downloaded and kept the same set of data files on all my cluster nodes,
under the same absolute path, say /home/xyzuser/data/*. I am now trying to
perform an operation (say open(filename).read()) on all these files in Spark,
but by passing local file paths. I was under the assumption that as long as
the worker can find the file path, it would be able to execute the operation.
However, my Spark tasks fail with an error (/home/xyzuser/data/* is not
present), and I'm sure the data is present on all my worker nodes.
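
For reference, a simplified sketch of roughly what I'm attempting, next to what
I understand to be the usual file:// route (paths as above, everything else
hypothetical):

from pyspark.sql import SparkSession
import glob

sc = SparkSession.builder.getOrCreate().sparkContext

# Option 1: let Spark read the files itself; a file:// URI should only work if
# every executor can resolve the same absolute path locally.
lines_rdd = sc.textFile("file:///home/xyzuser/data/*")

# Option 2 (roughly what I'm doing): open the files inside the tasks. The open()
# call runs on whichever executor receives the path, so that node must hold the
# file, and it needs a concrete path rather than a glob like /home/xyzuser/data/*.
paths = glob.glob("/home/xyzuser/data/*")        # expanded on the driver
contents_rdd = sc.parallelize(paths).map(lambda p: open(p).read())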

If this experiment were successful, I was planning to set up an NFS share
(actually more like a read-only cloud persistent disk attached to my cluster
nodes in Dataproc) and use that instead.

What exactly is going wrong here?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-with-Local-File-System-NFS-tp28781.html



Merging multiple Pandas dataframes

2017-06-19 Thread saatvikshah1994
Hi, 

I am iteratively receiving files which can only be opened as Pandas
dataframes. For the first such file I receive, I convert it to a Spark
dataframe using the createDataFrame utility function. From the next file
onward, I convert each one and union it into the first Spark dataframe (the
schema always stays the same). After each union, I persist the result in
memory (MEMORY_AND_DISK_ONLY level). After I have converted all such files
into a single Spark dataframe, I coalesce it, following some tips from this
Stack Overflow post:
https://stackoverflow.com/questions/39381183/managing-spark-partitions-after-dataframe-unions.
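
For reference, a simplified sketch of the flow described above (receive_files()
is a hypothetical stand-in for however the files arrive; MEMORY_AND_DISK is the
standard PySpark storage level closest to what I described):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

merged = None
for pdf in receive_files():               # yields one Pandas dataframe per file
    sdf = spark.createDataFrame(pdf)      # schema is identical across files
    merged = sdf if merged is None else merged.union(sdf)
    merged.persist(StorageLevel.MEMORY_AND_DISK)

# Once everything is merged, reduce the partition count accumulated by the unions.
merged = merged.coalesce(spark.sparkContext.defaultParallelism)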
   

Any suggestions for optimizing this process further?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Merging-multiple-Pandas-dataframes-tp28770.html



Best alternative for Category Type in Spark Dataframe

2017-06-15 Thread saatvikshah1994
Hi, 
I'm trying to convert a Pandas dataframe to a Spark dataframe. One of the
columns is of the Category type in Pandas, but there does not seem to be
support for this type in Spark. What is the best alternative?
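
For reference, the workaround I'm currently leaning towards (just a sketch with
a hypothetical column; not sure it's the best option) is to drop the categorical
dtype down to plain strings before the conversion and, if the integer encoding
is needed again, re-encode on the Spark side with StringIndexer:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame({"color": pd.Categorical(["red", "green", "red"])})
pdf["color"] = pdf["color"].astype(str)   # Category -> plain strings

sdf = spark.createDataFrame(pdf)          # column becomes StringType in Spark
indexed = StringIndexer(inputCol="color", outputCol="color_idx").fit(sdf).transform(sdf)
indexed.show()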



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-Spark-Dataframe-tp28764.html