Questions on Python support with Spark

2018-11-09 Thread Arijit Tarafdar
Hello All,

We have a requirement to run PySpark in standalone cluster mode and also to 
reference Python libraries (egg/wheel) that are not local but stored in 
distributed storage such as HDFS. From the code, it looks like neither case 
is supported.

Questions are:


  1.  Why is PySpark on a standalone cluster supported only in client deploy mode?
  2.  Why does --py-files support only local files and not files stored in 
remote stores?

We would like to update the Spark code to support these scenarios, but first 
want to be aware of any technical difficulties the community has faced while 
trying to support them.
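
For concreteness, this is the kind of invocation we would like to be able to 
run (master URL and paths are illustrative); today it is rejected:

spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --py-files hdfs:///libs/deps.egg,hdfs:///libs/utils.whl \
  hdfs:///apps/main.py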

Thanks, Arijit


[Spark-SQL] - Creating Hive Metastore Parquet table from Avro schema

2018-11-09 Thread pradeepbaji
Hello Everyone, 

My Parquet files are stored on HDFS, and I am trying to create a table in the
Hive Metastore from Spark SQL. I have the Avro schema file from which the
Parquet files were generated.

I am doing the following to create the table. 

1) First, create a dummy Avro table from the schema file. 

spark.sql("""
  CREATE TABLE db_test.avro_test
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.url'='/avro-schema/schema.avsc')""")

This step succeeds, and the table is created in the Hive Metastore.

2) Next, create an external table with the same schema as the first one,
with its location pointing to the Parquet files directory. 

spark.sql("""
  CREATE EXTERNAL TABLE db_test.parquet_test
  LIKE db_test.avro_test
  STORED AS PARQUET LOCATION '/parquet-data-dir'""")

This step fails. It looks like Spark SQL does not accept the keyword "LIKE"
in the CREATE statement; the same statement works fine from the Hive shell. 

Can someone please help me with creating the Parquet table from the Avro
schema? Is this a bug in Spark SQL's parser that it doesn't accept "LIKE"? 


Here is the error that Spark throws. 
Exception in thread "main"
org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input 'LIKE' expecting (line 1, pos 136)

== SQL ==
CREATE EXTERNAL TABLE db_test.parquet_test LIKE db_test.avro_test STORED AS
PARQUET LOCATION '/parquet-data-dir'
-^^^

at
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:239)
at
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:115)
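
One workaround I am considering, assuming Spark's native data source syntax
rather than Hive's LIKE, is to create the table directly over the Parquet
directory and let Spark infer the schema from the files:

spark.sql("""
  CREATE TABLE db_test.parquet_test
  USING PARQUET
  LOCATION '/parquet-data-dir'""")

This does not copy the schema from the Avro table the way LIKE would, so I
would still prefer a way to reuse the Avro schema directly.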






What is BDV in Spark Source

2018-11-09 Thread Soheil Pourbafrani
Hi,

While reading the Spark sources, I came across a type BDV:

breeze.linalg.{DenseVector => BDV}

and the Spark code uses it when calculating IDF from term frequencies. What
is it exactly?
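
From what I can tell so far, it looks like a plain Scala import rename; a
minimal sketch of my understanding (Spark's own DenseVector shown for
contrast):

// BDV appears to be just a local alias for Breeze's DenseVector,
// presumably renamed so it does not clash with Spark's own DenseVector.
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.DenseVector

val breezeVec: BDV[Double] = BDV(1.0, 2.0, 3.0)   // breeze.linalg.DenseVector
val sparkVec = new DenseVector(Array(1.0, 2.0))   // unrelated Spark type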


[Spark on K8s] Scaling experiences sharing

2018-11-09 Thread Li Gao
Hi Spark Community,

I am reaching out to see if there are any current large-scale production or
pre-production deployments of Spark on K8s for batch and micro-batch jobs.
By large scale I mean hundreds of thousands of Spark jobs daily, thousands
of concurrent Spark jobs on a single K8s cluster, and tens of millions of
Spark executor pods daily (not concurrently).

If you happen to run and develop Spark on K8s at such scale, I'd like to
learn about your experience, scaling challenges, and solutions.

Thank you,
Li


Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-09 Thread bsikander
Could you please give some feedback?






Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-09 Thread purna pradeep
Thanks, this is great news!

Can you please let me know if dynamic resource allocation is available in
Spark 2.4?

I'm using Spark 2.3.2 on Kubernetes. Do I still need to provide executor
memory options as part of the spark-submit command, or will Spark manage the
required executor memory based on the job size?
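
For reference, this is the kind of explicit sizing I pass today on
Kubernetes (cluster URL, image name, and values are illustrative):

spark-submit \
  --master k8s://https://kube-apiserver:6443 \
  --deploy-mode cluster \
  --conf spark.executor.instances=5 \
  --conf spark.executor.memory=4g \
  --conf spark.kubernetes.container.image=my-spark:2.3.2 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.2.jar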

On Thu, Nov 8, 2018 at 2:18 PM Marcelo Vanzin wrote:

> +user@
>
> >> -- Forwarded message -
> >> From: Wenchen Fan 
> >> Date: Thu, Nov 8, 2018 at 10:55 PM
> >> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
> >> To: Spark dev list 
> >>
> >>
> >> Hi all,
> >>
> >> Apache Spark 2.4.0 is the fifth release in the 2.x line. This release
> adds Barrier Execution Mode for better integration with deep learning
> frameworks, introduces 30+ built-in and higher-order functions to deal with
> complex data types more easily, improves the K8s integration, and adds
> experimental Scala 2.12 support. Other major updates include the built-in
> Avro data source, the Image data source, flexible streaming sinks, the
> elimination of the 2GB block-size limitation during transfer, and Pandas
> UDF improvements. In addition, this release continues to focus on
> usability, stability, and polish while resolving around 1100 tickets.
> >>
> >> We'd like to thank our contributors and users for their contributions
> and early feedback to this release. This release would not have been
> possible without you.
> >>
> >> To download Spark 2.4.0, head over to the download page:
> http://spark.apache.org/downloads.html
> >>
> >> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-4-0.html
> >>
> >> Thanks,
> >> Wenchen
> >>
> >> PS: If you see any issues with the release notes, webpage or published
> artifacts, please contact me directly off-list.
>
>
>
> --
> Marcelo
>