Questions on Python support with Spark

2018-11-09 Thread Arijit Tarafdar
Hello All, We have a requirement to run PySpark in standalone cluster mode and also to reference Python libraries (egg/wheel) that are not local but are placed in distributed storage such as HDFS. From the code, it looks like neither of these cases is supported. Questions are: 1. Why is PySpark

[Spark-SQL] - Creating Hive Metastore Parquet table from Avro schema

2018-11-09 Thread pradeepbaji
Hello Everyone, I have my Parquet files stored on HDFS. I am trying to create a table in the Hive Metastore from Spark SQL. I have the Avro schema file from which I generated the Parquet files. I am doing the following to create the table. 1) First, create a dummy Avro table from the schema

What is BDV in Spark Source

2018-11-09 Thread Soheil Pourbafrani
Hi, While reading the Spark source, I came across a type BDV: breeze.linalg.{DenseVector => BDV}, which is used when calculating IDF from term frequencies. What exactly is it?
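For context, the braces in that import are Scala's import-rename syntax, so BDV is a local alias for Breeze's DenseVector, presumably chosen to avoid a name clash with Spark's own DenseVector in org.apache.spark.mllib.linalg. A minimal sketch, assuming Breeze is on the classpath (as it is in spark-shell); the object name and the values are hypothetical and are not taken from the IDF code:

```scala
// Import Breeze's DenseVector under the local name BDV.
// The rename applies to both the class and its companion object.
import breeze.linalg.{DenseVector => BDV}

// Hypothetical object name, used only for illustration.
object BdvAliasSketch {
  def main(args: Array[String]): Unit = {
    // A dense vector of three zeros, created through the alias.
    val counts: BDV[Double] = BDV.zeros[Double](3)

    // Element-wise updates, e.g. accumulating per-term counts.
    counts(0) += 1.0
    counts(2) += 2.0

    // BDV and breeze.linalg.DenseVector are the same type.
    val same: breeze.linalg.DenseVector[Double] = counts
    println(same)
  }
}
```

The alias exists only in the file that declares the import, so other files can refer to the same class under its full name.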

[Spark on K8s] Scaling experiences sharing

2018-11-09 Thread Li Gao
Hi Spark Community, I am reaching out to see if there are any current large-scale production or pre-production deployments of Spark on k8s for batch and micro-batch jobs. Large scale means running hundreds of thousands of Spark jobs daily and thousands of concurrent Spark jobs on a single k8s cluster and 10s of

Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-09 Thread bsikander
Could you please give some feedback?

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-09 Thread purna pradeep
Thanks, this is great news. Can you please let me know whether dynamic resource allocation is available in Spark 2.4? I’m using Spark 2.3.2 on Kubernetes. Do I still need to provide executor memory options as part of the spark-submit command, or will Spark manage the required executor memory based on the Spark job