Questions on Python support with Spark
Hello All,

We have a requirement to run PySpark in standalone cluster mode and also to reference Python libraries (egg/wheel) which are not local but placed in distributed storage such as HDFS. From the code it looks like neither of these cases is supported. Our questions are:

1. Why is PySpark supported only in standalone client mode?
2. Why does --py-files only support local files and not files stored in remote stores?

We would like to update the Spark code to support these scenarios, but first we want to be aware of any technical difficulties the community has faced while trying to support them.

Thanks,
Arijit
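P.S. For concreteness, this is the kind of submission we would like to be able to make (master host, paths, and package names below are just placeholders):

    spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      --py-files hdfs:///libs/mylib-0.1-py3-none-any.whl,hdfs:///libs/deps.egg \
      hdfs:///apps/my_job.py

Today this fails on both counts: a Python application cannot be submitted in standalone cluster mode, and the --py-files entries must be local paths.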
[Spark-SQL] - Creating Hive Metastore Parquet table from Avro schema
Hello Everyone,

I have my Parquet files stored on HDFS and I am trying to create a table in the Hive Metastore from Spark SQL. I have an Avro schema file from which I generated the Parquet files. I am doing the following to create the table.

1) First, create a dummy Avro table from the schema file:

    spark.sql("""
      CREATE TABLE db_test.avro_test
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      TBLPROPERTIES ('avro.schema.url'='/avro-schema/schema.avsc')""")

This step is successful and the table is created in the Hive Metastore.

2) Now create an external table with the same schema as the first one, with its location pointing to the Parquet files directory:

    spark.sql("""
      CREATE EXTERNAL TABLE db_test.parquet_test
      LIKE db_test.avro_test
      STORED AS PARQUET
      LOCATION '/parquet-data-dir'""")

This step is failing. It looks like Spark SQL does not accept the LIKE keyword in the CREATE statement, while the same statement works fine from the Hive shell.

Can someone please help me with creating the Parquet table from the Avro schema? Is it a bug in Spark SQL that it doesn't parse LIKE?

Here is the error that Spark is throwing:

    Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException:
    mismatched input 'LIKE' expecting (line 1, pos 136)

    == SQL ==
    CREATE EXTERNAL TABLE db_test.parquet_test LIKE db_test.avro_test STORED AS PARQUET LOCATION '/parquet-data-dir'
    -^^^
      at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:239)
      at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:115)
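P.S. One workaround I am considering (untested sketch, Scala, using the same table and path names as above) is to avoid LIKE entirely and instead generate an explicit column list from the Avro-backed table's schema:

    // Untested sketch: build the CREATE TABLE statement from the Avro table's
    // schema instead of relying on LIKE, which the Spark SQL parser rejects here.
    val cols = spark.table("db_test.avro_test").schema.fields
      .map(f => s"`${f.name}` ${f.dataType.catalogString}")
      .mkString(",\n  ")

    spark.sql(s"""
      CREATE EXTERNAL TABLE db_test.parquet_test (
        $cols
      )
      STORED AS PARQUET
      LOCATION '/parquet-data-dir'
    """)

I would still prefer the plain LIKE syntax if it is supposed to work, so any pointers are appreciated.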
What is BDV in Spark Source
Hi,

While reading the Spark sources, I came across a type called BDV:

    import breeze.linalg.{DenseVector => BDV}

It is used in calculating IDF from term frequencies. What is it exactly?
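For context, here is a small standalone snippet (my own toy example, not Spark's actual code) showing how such a Breeze vector can be used for this kind of computation:

    import breeze.linalg.{DenseVector => BDV}

    // BDV is just a rename of Breeze's DenseVector; MLlib aliases it for fast
    // in-memory linear algebra, e.g. turning document frequencies into IDF
    // weights with log((numDocs + 1) / (df + 1)).
    val docFreqs = BDV(10.0, 2.0, 7.0)   // number of documents containing each term
    val numDocs  = 100.0
    val idf      = docFreqs.map(df => math.log((numDocs + 1.0) / (df + 1.0)))

    val termFreqs = BDV(1.0, 0.0, 3.0)   // term frequencies for one document
    val tfidf     = BDV.tabulate(termFreqs.length)(i => termFreqs(i) * idf(i))
    println(tfidf)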
[Spark on K8s] Scaling experiences sharing
Hi Spark Community,

I am reaching out to see if there are any current large-scale production or pre-production deployments of Spark on K8s for batch and micro-batch jobs. By large scale I mean running hundreds of thousands of Spark jobs daily, thousands of concurrent Spark jobs on a single K8s cluster, and tens of millions of Spark executor pods daily (not concurrently).

If you happen to run and develop Spark on K8s at such scale, I'd love to learn about your experience, scaling challenges, and solutions.

Thank you,
Li
Re: [Spark-Core] Long scheduling delays (1+ hour)
Could you please give some feedback?
Re: [ANNOUNCE] Announcing Apache Spark 2.4.0
Thanks, this is great news!

Can you please let me know if dynamic resource allocation is available in Spark 2.4? I'm using Spark 2.3.2 on Kubernetes; do I still need to provide executor memory options as part of the spark-submit command, or will Spark manage the required executor memory based on the size of the Spark job?

On Thu, Nov 8, 2018 at 2:18 PM Marcelo Vanzin wrote:
> +user@
>
>> -- Forwarded message --
>> From: Wenchen Fan
>> Date: Thu, Nov 8, 2018 at 10:55 PM
>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>> To: Spark dev list
>>
>> Hi all,
>>
>> Apache Spark 2.4.0 is the fifth release in the 2.x line. This release adds Barrier Execution Mode for better integration with deep learning frameworks, introduces 30+ built-in and higher-order functions to make it easier to deal with complex data types, and improves the K8s integration, along with experimental Scala 2.12 support. Other major updates include the built-in Avro data source, Image data source, flexible streaming sinks, elimination of the 2GB block size limitation during transfer, and Pandas UDF improvements. In addition, this release continues to focus on usability, stability, and polish while resolving around 1100 tickets.
>>
>> We'd like to thank our contributors and users for their contributions and early feedback to this release. This release would not have been possible without you.
>>
>> To download Spark 2.4.0, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-2-4-0.html
>>
>> Thanks,
>> Wenchen
>>
>> PS: If you see any issues with the release notes, webpage or published artifacts, please contact me directly off-list.
>
> --
> Marcelo
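P.S. For reference, this is roughly how I specify executor resources today on Kubernetes (API server URL, image name, class, and sizes below are placeholders); my question is whether 2.4 removes the need for any of these explicit settings:

    spark-submit \
      --master k8s://https://kubernetes-api-host:443 \
      --deploy-mode cluster \
      --name my-job \
      --class com.example.MyJob \
      --conf spark.executor.instances=10 \
      --conf spark.executor.memory=4g \
      --conf spark.executor.cores=2 \
      --conf spark.kubernetes.container.image=myrepo/spark:2.3.2 \
      local:///opt/spark/jars/my-job.jar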