Re: Run spark 2.2 on yarn as usual java application

2018-03-18 Thread Jacek Laskowski
Hi, What's the deployment process then (if not using spark-submit)? How is the AM deployed? Why would you want to skip spark-submit? Jacek On 19 Mar 2018 00:20, "Serega Sheypak" wrote: > Hi, Is it even possible to run Spark on YARN as a usual Java application? > I've built a jar using Maven with s

Calling Pyspark functions in parallel

2018-03-18 Thread Debabrata Ghosh
Hi, My dataframe has 2000 rows. Processing each row takes about 3 seconds, so sequentially it takes 2000 * 3 = 6000 seconds, which is a very long time. I am therefore contemplating running the function in parallel. For example, I would like to divide the tota
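
A minimal PySpark sketch of one way to approach this, assuming the rows live in a DataFrame with an id column and that process_row stands in for the 3-second function (the input path and column name below are made up): repartitioning spreads the rows across executor cores so they are processed concurrently instead of one after another.

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-row-processing").getOrCreate()
df = spark.read.parquet("/path/to/input")   # hypothetical input path

def process_row(row):
    # stand-in for the ~3-second-per-row function described above
    time.sleep(3)
    return (row["id"], "processed")

# Spread the 2000 rows across many partitions so the per-row work runs
# concurrently on executor cores instead of sequentially.
results = df.repartition(100).rdd.map(process_row).collect()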

Re: [PySpark SQL] sql function to_date and to_timestamp return the same data type

2018-03-18 Thread Hyukjin Kwon
Mind if I ask for a reproducer? It seems to return timestamps fine: >>> from pyspark.sql.functions import * >>> spark.range(1).select(to_timestamp(current_timestamp())).printSchema() root |-- to_timestamp(current_timestamp()): timestamp (nullable = false) >>> spark.range(1).select(to_timestamp(current_
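
For comparison, a small sketch (run in the pyspark shell, so a SparkSession named spark is assumed) that selects both functions side by side; to_date should come back as a date column and to_timestamp as a timestamp column:

from pyspark.sql.functions import current_timestamp, to_date, to_timestamp

df = spark.range(1).select(
    to_date(current_timestamp()).alias("as_date"),
    to_timestamp(current_timestamp()).alias("as_timestamp"),
)
df.printSchema()
# expected: as_date reported as date, as_timestamp reported as timestamp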

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread ayan guha
Hi, The issue is not with Spark in this case; it is with Oracle. If you do not know which columns to apply date-related conversion rules to, then you have a problem. You should try either a) Define some config file where you can define table name, date column name and date format @ source so that you can
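
A minimal sketch of the config-driven idea in (a), assuming Spark 2.2+ for the two-argument to_date; the table names, column names and formats below are all made up:

from pyspark.sql.functions import col, to_date

# hypothetical config (could be loaded from a file): table -> (date column, source format)
date_rules = {
    "sales":  ("txn_date", "yyyy-MM-dd"),
    "orders": ("order_dt", "dd-MMM-yy"),
}

def apply_date_rules(df, table_name):
    # cast the configured varchar column to a proper DateType before writing to Oracle
    if table_name in date_rules:
        column, fmt = date_rules[table_name]
        df = df.withColumn(column, to_date(col(column), fmt))
    return df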

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread Deepak Sharma
The other approach would be to write to a temp table and then merge the data, but this may be an expensive solution. Thanks Deepak On Mon, Mar 19, 2018, 08:04 Gurusamy Thirupathy wrote: > Hi, > > I am trying to read data from Hive as a DataFrame, then trying to write the > DF into the Oracle database. I

Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread Gurusamy Thirupathy
Hi, I am trying to read data from Hive as a DataFrame, then trying to write the DF into the Oracle database. In this case, the date field/column in Hive is of type Varchar(20) but the corresponding column type in Oracle is Date. While reading from Hive, the Hive table names are dynamically decid
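
A hedged sketch of the usual shape of such a job, with hypothetical table, column, URL and credential values: cast the varchar date column with to_date before writing over JDBC so it maps to Oracle's DATE type rather than a string type.

from pyspark.sql.functions import col, to_date

df = spark.table("mydb.my_hive_table")   # hypothetical Hive table name

# The Hive column is varchar(20); cast it to DateType so the JDBC writer
# maps it to Oracle's DATE instead of VARCHAR.
df = df.withColumn("event_date", to_date(col("event_date"), "yyyy-MM-dd"))

(df.write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   # hypothetical connection
   .option("dbtable", "TARGET_SCHEMA.TARGET_TABLE")         # hypothetical target table
   .option("user", "app_user")
   .option("password", "...")
   .option("driver", "oracle.jdbc.OracleDriver")
   .mode("append")
   .save())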

Re: parquet late column materialization

2018-03-18 Thread nguyen duc Tuan
You can use the EXPLAIN statement to see the optimized plan for each query. ( https://stackoverflow.com/questions/35883620/spark-how-can-get-the-logical-physical-query-execution-using-thirft-hive ). 2018-03-19 0:52 GMT+07:00 CPC : > Hi nguyen, > > Thank you for the quick response. But what I am trying to und
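
For example, a quick way to get the plans from code rather than through the Thrift server (table and column names follow the question in this thread):

df = spark.sql("select businesskey from mytable where businesskey = 'abc'")

df.explain()        # physical plan only
df.explain(True)    # parsed, analyzed and optimized logical plans plus the physical plan

# or through SQL directly:
spark.sql("explain extended select businesskey from mytable where businesskey = 'abc'").show(truncate=False)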

Run spark 2.2 on yarn as usual java application

2018-03-18 Thread Serega Sheypak
Hi, Is it even possible to run Spark on YARN as a usual Java application? I've built a jar using Maven with the spark-yarn dependency and I manually populate SparkConf with all Hadoop properties. SparkContext fails to start with the exception: 1. Caused by: java.lang.IllegalStateException: Library director
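
This error typically means Spark cannot locate its jars when launched outside spark-submit. A minimal sketch of one common fix, shown here with the PySpark builder (the same properties apply when populating SparkConf directly from a Java/Scala application); the HDFS location of the jars is hypothetical:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")
         .appName("yarn-without-spark-submit")
         # point YARN at the Spark jars explicitly, since there is no SPARK_HOME
         # assembly directory when launching outside spark-submit
         .config("spark.yarn.jars", "hdfs:///apps/spark/2.2.0/jars/*")   # hypothetical HDFS path
         .getOrCreate())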

Re: parquet late column materialization

2018-03-18 Thread CPC
Hi nguyen, Thank you for the quick response. But what I am trying to understand is that in both queries, predicate evaluation requires only one column. So Spark does not actually need to read every projected column if it is not used in the filter predicate. Just to give an example, Amazon Redshift has this kin

Re: NPE in Subexpression Elimination optimization

2018-03-18 Thread Jacek Laskowski
Hi, Filed https://issues.apache.org/jira/browse/SPARK-23731 and am working on a workaround (aka fix). Regards, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Maste

Re: parquet late column materialization

2018-03-18 Thread nguyen duc Tuan
Hi @CPC, Parquet is a columnar storage format, so if you want to read data from only one column, you can do that without reading all of your data. Spark SQL includes a query optimizer (see https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html), so it will optimi

parquet late column materialization

2018-03-18 Thread CPC
Hi everybody, I am trying to understand how Spark reads Parquet files, but I am a little confused. I have a table with 4 columns named businesskey, transactionname, request and response. The request and response columns are huge (10-50 KB per value). When I execute a query like "select * from mytable wher
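
A small sketch to check what is actually read, using the table and column names from the question: compare the ReadSchema reported in the physical plan for select * against selecting only businesskey.

wide = spark.sql("select * from mytable where businesskey = 'abc'")
narrow = spark.sql("select businesskey from mytable where businesskey = 'abc'")

# The Parquet scan line of each physical plan shows the columns actually read
# (ReadSchema) and the filters pushed down to Parquet (PushedFilters).
wide.explain()
narrow.explain()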

Re: Append more files to existing partitioned data

2018-03-18 Thread Serega Sheypak
Thanks a lot! 2018-03-18 9:30 GMT+01:00 Denis Bolshakov : > Please check out. > > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand > > > and > > org.apache.spark.sql.execution.datasources.WriteRelation > > > I guess it's managed by > > job.getConfiguration.set(DATASOURC

Re: Scala - Spark for beginners

2018-03-18 Thread Gerard Maas
This is a good start: https://github.com/deanwampler/JustEnoughScalaForSpark And the corresponding talk: https://www.youtube.com/watch?v=LBoSgiLV_NQ There are many more resources if you search for them. -kr, Gerard. On Sun, Mar 18, 2018 at 11:15 AM, Mahender Sarangam < mahender.bigd...@outlook.com

Scala - Spark for beginners

2018-03-18 Thread Mahender Sarangam
Hi, Can anyone share nice tutorials on Spark with Scala, such as videos or blogs for beginners? Mostly focusing on writing Scala code. Thanks in advance.

Dynamic Key JSON Parsing

2018-03-18 Thread Mahender Sarangam
Hi, I'm new to Spark and Scala and need help transforming nested JSON using Scala. Our upstream returns JSON like { "id": 100, "text": "Hello, world." Users : [ "User1": { "name": "Brett", "id": 200, "Type" : "Employee" "empid":"2" }, "Use
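
The question asks for Scala; below is a minimal sketch in PySpark (the DataFrame API is the same in Scala) of one common approach, assuming the dynamically keyed Users object can be modelled as a map from user key to a user struct. The field names follow the snippet above and the input path is hypothetical.

from pyspark.sql.types import (IntegerType, MapType, StringType,
                               StructField, StructType)

# assumed shape: "Users" is an object whose keys (User1, User2, ...) vary,
# so it is modelled as a map from user key to a user struct
user = StructType([
    StructField("name", StringType()),
    StructField("id", IntegerType()),
    StructField("Type", StringType()),
    StructField("empid", StringType()),
])
schema = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("Users", MapType(StringType(), user)),
])

df = spark.read.json("/path/to/input.json", schema=schema)   # hypothetical path
flat = df.selectExpr("id", "text", "explode(Users) as (user_key, user)")
flat.select("id", "user_key", "user.name", "user.Type", "user.empid").show()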

Re: Append more files to existing partitioned data

2018-03-18 Thread Denis Bolshakov
Please check out org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand and org.apache.spark.sql.execution.datasources.WriteRelation. I guess it's managed by job.getConfiguration.set(DATASOURCE_WRITEJOBUUID, uniqueWriteJobId.toString) On 17 March 2018 at 20:46, Serega
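
For reference, a minimal sketch of the append itself (the paths and partition column below are made up); the classes mentioned above are what carry out this write underneath:

new_df = spark.read.parquet("/data/incoming")        # hypothetical new batch

(new_df.write
       .mode("append")                  # keep the files that are already there
       .partitionBy("event_date")       # hypothetical partition column
       .parquet("/data/warehouse/events"))   # existing partitioned location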

Accessing a file that was passed via --files to spark submit

2018-03-18 Thread Vitaliy Pisarev
I am submitting a script to spark-submit and passing it a file using the --files property. Later on I need to read it in a worker, but I don't understand what API I should use to do that. I figured I'd just try: with open('myfile'): but this did not work. I am able to pass the file using the addFile me
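
A minimal sketch of the usual pattern, assuming the file was distributed with --files (or SparkContext.addFile): resolve its executor-local path with SparkFiles.get inside the function that runs on the worker, instead of opening the bare name.

from pyspark import SparkFiles

# the file was shipped with:  spark-submit --files myfile my_script.py
def process_partition(rows):
    path = SparkFiles.get("myfile")   # local path of the shipped file on this executor
    with open(path) as f:
        lookup = f.read()
    for row in rows:
        yield (row, len(lookup))      # placeholder use of the file contents

result = spark.sparkContext.parallelize(range(4), 2).mapPartitions(process_partition).collect()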