Hi All,
This is Li Jin. We (my colleagues at Two Sigma and I) have been
using Spark for time series analysis for the past two years, and it has
been a success in scaling up our time series analysis.
Recently, we started a conversation with Reynold about potential
opportunities to collaborate.
> You need a master node (can be started using the SPARK_HOME/sbin
> /start-master.sh script) and at least one worker node (can
> be started using the SPARK_HOME/sbin/start-slave.sh script). SparkConf
> should be created using the master node's address (spark://host:port).
>
> Thanks!
>
> Gangadhar
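For what it's worth, here is a minimal sketch of that whole flow in
PySpark (the host name, port, and app name below are placeholders, not
anything from this thread):

    # Shell commands on the respective machines (not Python):
    #   $SPARK_HOME/sbin/start-master.sh
    #   $SPARK_HOME/sbin/start-slave.sh spark://master-host:7077
    from pyspark import SparkConf, SparkContext

    # Point the driver at the standalone master instead of local mode.
    conf = (SparkConf()
            .setMaster("spark://master-host:7077")
            .setAppName("standalone-example"))
    sc = SparkContext(conf=conf)

    # Sanity check that tasks actually run on the workers.
    print(sc.parallelize(range(1000)).sum())
    sc.stop()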
Hi,
I am wondering: does pyspark standalone (local) mode support multiple
cores/executors?
Thanks,
Li
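For reference, local mode runs everything in one JVM (a single
executor), but it can use multiple cores: the local[N] master string
requests N worker threads, and local[*] uses all available cores. A
minimal sketch:

    from pyspark import SparkConf, SparkContext

    # local[4] = driver plus 4 task threads in one JVM: one executor,
    # but tasks from the same stage run on 4 cores in parallel.
    conf = SparkConf().setMaster("local[4]").setAppName("local-cores")
    sc = SparkContext(conf=conf)
    print(sc.defaultParallelism)  # prints 4
    sc.stop()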
I am not an expert on this, but here is what I think:
Catalyst maintains information on whether a plan node is ordered. If your
dataframe is the result of an order by, Catalyst will skip the sorting when
it does a sort merge join. If your dataframe is created from storage, for
instance a ParquetRelation, Catalyst has no ordering information, so it
will sort both sides before merging.
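One way to check this is to compare the physical plans. A sketch (the
data is made up, and I disable broadcast joins so the planner picks a
sort merge join; whether the extra Sort disappears in the second plan
is exactly the ordering information described above):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    # Disable broadcast joins so the planner chooses SortMergeJoin.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    left = spark.range(0, 100000).withColumnRenamed("id", "k")
    right = spark.range(0, 100000).withColumnRenamed("id", "k")

    # Plan 1: unsorted inputs -- expect Sort operators under the join.
    left.join(right, "k").explain()
    # Plan 2: inputs already ordered by the join key -- the planner
    # may drop the redundant Sort.
    left.orderBy("k").join(right.orderBy("k"), "k").explain()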
Yeoul,
I think one way you could microbenchmark pyspark
serialization/deserialization would be to run a withColumn with a Python
udf that returns a constant, and compare that with similar code in
Scala.
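Roughly something like this (a sketch; instead of the Scala side I use
the built-in lit() as the JVM-only baseline, and the row count is
arbitrary):

    import time
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.range(10 * 1000 * 1000)

    # A Python udf that returns a constant: the remaining cost is
    # almost entirely serialization to and from the Python workers.
    const_udf = F.udf(lambda x: 1, LongType())

    start = time.time()
    # Summing the result keeps the optimizer from pruning the udf away.
    df.select(const_udf("id").alias("c")).agg(F.sum("c")).collect()
    print("python udf: %.2fs" % (time.time() - start))

    start = time.time()
    # Same shape of query, JVM only, no Python serialization at all.
    df.select(F.lit(1).alias("c")).agg(F.sum("c")).collect()
    print("jvm only:   %.2fs" % (time.time() - start))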
I am not sure there is a way to measure just the serialization code,
because the pyspark API only exposes the end-to-end execution path.