How does Spark handle timestamps during Pandas dataframe conversion
I've described this question in detail, with code snippets and logs, in this Stack Overflow post: https://stackoverflow.com/questions/45308406/how-does-spark-handle-timestamp-types-during-pandas-dataframe-conversion/. I'm looking for efficient ways to handle this.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-does-Spark-handle-timestamps-during-Pandas-dataframe-conversion-tp29004.html
Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
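Not from the thread, but the crux of timestamp conversion questions like this is usually which timezone is assumed when a naive timestamp is mapped to an epoch value. A minimal stdlib sketch of that ambiguity (the zone offsets here are arbitrary examples):

```python
from datetime import datetime, timezone, timedelta

# A naive timestamp carries no zone info; how it maps to an epoch
# value depends entirely on which zone the converter assumes.
naive = datetime(2017, 7, 25, 12, 0, 0)

# Interpreted as UTC:
as_utc = naive.replace(tzinfo=timezone.utc)
# Interpreted as UTC+5:30 (e.g. a local session timezone):
ist = timezone(timedelta(hours=5, minutes=30))
as_ist = naive.replace(tzinfo=ist)

# Same wall-clock value, but the epoch seconds differ by the offset.
print(as_utc.timestamp() - as_ist.timestamp())  # 19800.0 (5.5 hours)
```

The same wall-clock value yields epoch values 5.5 hours apart, which is exactly the kind of drift people observe when a converter on one side assumes UTC and the other side assumes the local/session timezone.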
Informing Spark about specific Partitioning scheme to avoid shuffles
Hi everyone, my environment is PySpark with Spark 2.0.0. I'm using Spark to load data from a large number of files into a DataFrame with fields, say, field1 to field10. While loading the data I ensured that records are partitioned by field1 and field2 (without using partitionBy); this was done while loading the data into an RDD of lists, before the .toDF() call. So I assume Spark would not already know that such a partitioning exists, and might trigger a shuffle if I call a shuffling transform using field1 or field2 as keys and then cache that information. Once I've created the DataFrame, is it possible to inform Spark about my custom partitioning scheme? Or would Spark have already discovered this somehow before the shuffling transform is called?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Informing-Spark-about-specific-Partitioning-scheme-to-avoid-shuffles-tp28922.html
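Not an answer from the thread, but some context on why Spark can't discover this on its own: a shuffle is avoided only when Spark has a recorded partitioner saying that all records for a given key already live in one partition; a layout created manually before .toDF() carries no such record, which is why a common suggestion is to call df.repartition("field1", "field2") after building the DataFrame. A pure-Python sketch of the keyed-layout property itself (hypothetical data, Spark not involved):

```python
# Sketch of hash partitioning, the scheme a shuffle uses: a record's
# partition is determined by its key alone.
def partition_for(key, num_partitions):
    return hash(key) % num_partitions

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = {}
for key, value in records:
    parts.setdefault(partition_for(key, 4), []).append((key, value))

# All records sharing a key land in the same partition, so a later
# key-based operation needs no data movement -- but only if the engine
# *knows* the data is laid out this way (i.e. a partitioner is set).
```

The point of the sketch: the layout and the engine's knowledge of the layout are separate things, and only the latter suppresses a shuffle.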
Spark UI crashes on Large Workloads
Hi, I have a PySpark app which, when given a huge amount of input data, sometimes throws the error explained here: https://stackoverflow.com/questions/32340639/unable-to-understand-error-sparklistenerbus-has-already-stopped-dropping-event. All my code runs inside the main function, and the only slightly peculiar thing this app does is use a custom PySpark ML Transformer (modified from https://stackoverflow.com/questions/32331848/create-a-custom-transformer-in-pyspark-ml). Could this be the issue? How can I debug why this is happening?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-UI-crashes-on-Large-Workloads-tp28873.html
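Not from the thread, but for context: the "SparkListenerBus has already stopped! Dropping event" message generally appears either when the listener event queue backing the UI overflows under heavy load, or when the SparkContext is stopped while events are still in flight. One commonly suggested mitigation is enlarging that queue; a hedged config sketch (the property name below is my assumption for Spark 2.x and should be checked against your version's configuration docs, and `my_app.py` is a placeholder):

```shell
# Enlarge the listener bus event queue (default is much smaller)
# before diagnosing further; an overflow here drops UI events.
spark-submit \
  --conf spark.scheduler.listenerbus.eventqueue.size=100000 \
  my_app.py
```

If the error instead coincides with the app ending, it usually means something stopped the context early (an exception on the driver, or an explicit sc.stop() before outstanding jobs finished), so checking the driver log just before the first dropped event is a reasonable debugging step.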
PySpark working with Generators
Hi, I have a file-reading function called foo which reads a file's contents either into a list of lists, or into a generator of lists of lists representing the same file. When reading the whole file as one chunk (a single-record array) I do something like: rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x). I'd now like to do something similar with the generator, so that I can use more cores with lower memory. I'm not sure how to tackle this, since generators cannot be pickled, so I'm not sure how to distribute the work of reading each file path across the RDD.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html
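Not from the thread, but one detail worth separating out: it is live generator *objects* that cannot be pickled; a module-level function that returns a generator pickles fine (by reference), so the function itself can be shipped to workers and called there. A stdlib sketch (read_rows is a hypothetical stand-in for foo):

```python
import pickle

# Module-level function that *returns* a generator: picklable by
# reference, so it can be shipped to workers.
def read_rows(path):
    # Hypothetical lazy reader; yields one "list of lists" at a time.
    for i in range(3):
        yield [[path, i]]

# The function pickles fine...
pickle.dumps(read_rows)

# ...but a live generator object does not.
gen = read_rows("f1")
try:
    pickle.dumps(gen)
except TypeError as e:
    print("generator not picklable:", e)
```

Under that assumption, the usual pattern is roughly `file_paths_rdd.flatMap(read_rows)`: flatMap consumes the generator lazily on the executor side, so the whole file never needs to materialize as a single list.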
Using Spark with Local File System/NFS
Hi, I've downloaded and kept the same set of data files on all my cluster nodes, in the same absolute path, say /home/xyzuser/data/*. I am now trying to perform an operation (say open(filename).read()) on all these files in Spark, passing local file paths. I was under the assumption that as long as a worker can find the file path, it will be able to execute the task. However, my Spark tasks fail with the error "/home/xyzuser/data/* is not present", and I'm sure it's present on all my worker nodes. If this experiment were successful I was planning to set up an NFS (actually more like a read-only cloud persistent disk attached to my cluster nodes on Dataproc) and use that instead. What exactly is going wrong here? Thanks

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-with-Local-File-System-NFS-tp28781.html
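Not from the thread, but two common culprits for this symptom, offered as guesses: if the path goes through Spark's own readers, a bare path resolves against the cluster's default filesystem (HDFS on Dataproc) unless prefixed with file://; and if it goes through Python's open(), a * wildcard is taken literally and never expanded. A stdlib sketch of the latter pitfall:

```python
import glob
import os
import tempfile

# open() treats "*" literally; it never expands wildcards, so a path
# like "/home/xyzuser/data/*" fails even when files exist there.
d = tempfile.mkdtemp()
for name in ("a.txt", "b.txt"):
    with open(os.path.join(d, name), "w") as f:
        f.write("data")

try:
    open(os.path.join(d, "*")).read()
except OSError as e:
    print("literal wildcard fails:", e)

# Expand the pattern first, then open each match:
matches = sorted(glob.glob(os.path.join(d, "*")))
assert len(matches) == 2
```

So a reasonable check is to confirm whether the failing call is open() receiving the unexpanded wildcard string, versus a Spark reader looking for the path on HDFS rather than the local disk.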
Merging multiple Pandas dataframes
Hi, I am iteratively receiving a file which can only be opened as a Pandas dataframe. For the first such file I receive, I convert it to a Spark dataframe using the createDataFrame utility function. For each subsequent file, I convert it and union it into the first Spark dataframe (the schema always stays the same). After each union, I persist it (at the MEMORY_AND_DISK storage level). After I have converted all such files into a single Spark dataframe, I coalesce it, following some tips from this Stack Overflow post: https://stackoverflow.com/questions/39381183/managing-spark-partitions-after-dataframe-unions. Any suggestions for optimizing this process further?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Merging-multiple-Pandas-dataframes-tp28770.html
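Not from the thread, but one commonly suggested variant of this process: since each union adds lineage and partitions, build the per-file frames first and fold them together once, persisting and coalescing a single time at the end rather than after every step. A pure-Python sketch of the shape of that fold (lists stand in for DataFrames and + for union, since Spark isn't runnable here):

```python
from functools import reduce

# Lists stand in for per-file DataFrames; "+" stands in for union.
per_file_frames = [[("a", 1)], [("b", 2)], [("c", 3)]]

# One fold over all frames, instead of union+persist on every step:
merged = reduce(lambda acc, df: acc + df, per_file_frames)
# In PySpark this shape would be roughly:
#   merged = reduce(lambda a, b: a.union(b), spark_frames).coalesce(n)
# with persist() applied once to `merged` rather than per iteration.
print(merged)  # [('a', 1), ('b', 2), ('c', 3)]
```

This keeps only one persisted result alive instead of a chain of intermediate persisted frames, which is the usual memory win in this pattern.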
Best alternative for Category Type in Spark Dataframe
Hi, I'm trying to convert a Pandas dataframe to a Spark dataframe. One of my columns is of the Category type in Pandas, but there does not seem to be support for an equivalent type in Spark. What is the best alternative?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-Spark-Dataframe-tp28764.html
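Not from the thread, but one common workaround (an assumption on my part): keep the column as plain strings in Spark, or store the pandas category codes as an integer column alongside a small code-to-label mapping kept as metadata; Spark ML's StringIndexer produces a similar codes-plus-labels representation. A stdlib sketch of the codes-plus-mapping idea:

```python
# Minimal sketch of what a categorical column stores internally:
# integer codes plus a categories lookup table.
values = ["red", "green", "red", "blue", "green"]

categories = sorted(set(values))           # ['blue', 'green', 'red']
code_of = {c: i for i, c in enumerate(categories)}
codes = [code_of[v] for v in values]       # [2, 1, 2, 0, 1]

# Ship `codes` to Spark as an integer column and keep `categories`
# as side metadata; decode on the way back:
decoded = [categories[c] for c in codes]
assert decoded == values
```

This preserves the memory benefit of the categorical encoding and round-trips losslessly, at the cost of carrying the categories list alongside the data yourself.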