Calling Pyspark functions in parallel

2018-03-18 Thread Debabrata Ghosh
Hi, my dataframe has 2000 rows. Processing each row takes about 3 seconds, so sequentially it takes 2000 * 3 = 6000 seconds, which is a very long time. I am therefore contemplating running the function in parallel. For example, I would like to divide the
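
A minimal sketch of one common way to spread that work across executors in PySpark, assuming the per-row computation can be expressed as an ordinary Python function (process_row below is a hypothetical stand-in, not code from the thread):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("per-row-parallel").getOrCreate()
    df = spark.range(2000)          # stand-in for the 2000-row dataframe

    def process_row(row):
        # placeholder for the ~3-second computation on a single row
        return (row.id, row.id * 2)

    # Split the rows into partitions so each executor core works on its own
    # chunk concurrently, instead of looping over rows sequentially on the driver.
    result = (df.repartition(16)
                .rdd
                .map(process_row)
                .toDF(["id", "value"]))
    result.count()                  # trigger the parallel computation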

Re: [PySpark SQL] sql function to_date and to_timestamp return the same data type

2018-03-18 Thread Hyukjin Kwon
Mind if I ask for a reproducer? It seems to return timestamps fine: >>> from pyspark.sql.functions import * >>> spark.range(1).select(to_timestamp(current_timestamp())).printSchema() root |-- to_timestamp(current_timestamp()): timestamp (nullable = false) >>>
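
For reference, a small standalone check (not from the thread) that the two functions return different types on Spark 2.2+:

    from pyspark.sql.functions import to_date, to_timestamp

    df = spark.createDataFrame([("2018-03-18 12:34:56",)], ["ts"])
    df.select(to_date(df.ts).alias("as_date"),
              to_timestamp(df.ts).alias("as_timestamp")).printSchema()
    # Expected output, roughly:
    # root
    #  |-- as_date: date (nullable = true)
    #  |-- as_timestamp: timestamp (nullable = true)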

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread ayan guha
Hi, the issue is not with Spark in this case, it is with Oracle. If you do not know which columns the date-related conversion rule should apply to, then you have a problem. You should try either a) defining a config file where you specify the table name, date column name and date format @ source so that you can
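
A rough sketch of option (a), using a hypothetical config (none of the table or column names below come from the thread) that maps each table to its date columns and source formats, applied before writing to Oracle:

    from pyspark.sql.functions import col, to_date

    date_config = {
        "sales_hive_table": [("order_date", "yyyy-MM-dd"),
                             ("ship_date", "dd-MMM-yy")],
    }

    def apply_date_rules(df, table_name):
        # Cast each configured varchar column of this table to a proper DateType.
        for column, fmt in date_config.get(table_name, []):
            df = df.withColumn(column, to_date(col(column), fmt))
        return df

    df = apply_date_rules(spark.table("sales_hive_table"), "sales_hive_table")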

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread Deepak Sharma
The other approach would be to write to a temp table and then merge the data. But this may be an expensive solution. Thanks Deepak On Mon, Mar 19, 2018, 08:04 Gurusamy Thirupathy wrote: > Hi, > > I am trying to read data from Hive as DataFrame, then trying to write the > DF into

Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread Gurusamy Thirupathy
Hi, I am trying to read data from Hive as a DataFrame, then trying to write the DF into the Oracle database. In this case, the date field/column in Hive has type Varchar(20), but the corresponding column type in Oracle is Date. While reading from Hive, the Hive table names are dynamically
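
A minimal sketch of the conversion-plus-write step, assuming Spark 2.2+ and a yyyy-MM-dd varchar in Hive; the column, table, and connection details are placeholders, not taken from the thread:

    from pyspark.sql.functions import to_date

    df = spark.table("my_hive_table")
    # Cast the varchar date column to a proper DateType before the JDBC write.
    df = df.withColumn("txn_date", to_date("txn_date", "yyyy-MM-dd"))

    (df.write
       .format("jdbc")
       .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
       .option("dbtable", "MY_ORACLE_TABLE")
       .option("user", "my_user")
       .option("password", "my_password")
       .option("driver", "oracle.jdbc.driver.OracleDriver")
       .mode("append")
       .save())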

Re: parquet late column materialization

2018-03-18 Thread nguyen duc Tuan
You can use the EXPLAIN statement to see the optimized plan for each query ( https://stackoverflow.com/questions/35883620/spark-how-can-get-the-logical-physical-query-execution-using-thirft-hive ). 2018-03-19 0:52 GMT+07:00 CPC : > Hi nguyen, > > Thank you for quick response. But
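
For example (a sketch reusing the table and column names from the original question below):

    # SQL form:
    spark.sql("EXPLAIN EXTENDED select businesskey from mytable").show(truncate=False)
    # DataFrame form (True prints the logical plans as well as the physical plan):
    spark.table("mytable").select("businesskey").explain(True)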

Run spark 2.2 on yarn as usual java application

2018-03-18 Thread Serega Sheypak
Hi, is it even possible to run Spark on YARN as a usual Java application? I've built a jar using Maven with the spark-yarn dependency, and I manually populate SparkConf with all Hadoop properties. SparkContext fails to start with an exception: 1. Caused by: java.lang.IllegalStateException: Library

Re: parquet late column materialization

2018-03-18 Thread CPC
Hi nguyen, Thank you for the quick response. But what I am trying to understand is that in both queries the predicate evaluation requires only one column, so Spark does not actually need to read every column in the projection if it is not used in the filter predicate. Just to give an example, Amazon Redshift has this

Re: NPE in Subexpression Elimination optimization

2018-03-18 Thread Jacek Laskowski
Hi, Filed https://issues.apache.org/jira/browse/SPARK-23731 and am working on a workaround (aka fix). Regards, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming

Re: parquet late column materialization

2018-03-18 Thread nguyen duc Tuan
Hi @CPC, Parquet is a columnar storage format, so if you want to read data from only one column, you can do that without accessing all of your data. Spark SQL includes a query optimizer (see https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html), so it will
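
A quick way to see the pruning for the table described in the original question (a sketch; the exact plan text varies by Spark version):

    df = spark.table("mytable")
    df.select("businesskey").explain()
    # The FileScan parquet line of the physical plan is expected to show
    # something like: ReadSchema: struct<businesskey:string>
    # i.e. the request/response columns are not read at all.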

parquet late column materialization

2018-03-18 Thread CPC
Hi everybody, I am trying to understand how Spark reads Parquet files, but I am a little confused. I have a table with 4 columns named businesskey, transactionname, request and response. The request and response columns are huge (10-50kb). When I execute a query like "select * from mytable

Re: Append more files to existing partitioned data

2018-03-18 Thread Serega Sheypak
Thanks a lot! 2018-03-18 9:30 GMT+01:00 Denis Bolshakov : > Please checkout. > > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand > > > and > > org.apache.spark.sql.execution.datasources.WriteRelation > > > I guess it's managed by > >

Re: Scala - Spark for beginners

2018-03-18 Thread Gerard Maas
This is a good start: https://github.com/deanwampler/JustEnoughScalaForSpark And the corresponding talk: https://www.youtube.com/watch?v=LBoSgiLV_NQ There are many more resources if you search for them. -kr, Gerard. On Sun, Mar 18, 2018 at 11:15 AM, Mahender Sarangam <

Dynamic Key JSON Parsing

2018-03-18 Thread Mahender Sarangam
Hi, I'm new to Spark and Scala and need help transforming nested JSON using Scala. We have an upstream system returning JSON like { "id": 100, "text": "Hello, world." Users : [ "User1": { "name": "Brett", "id": 200, "Type" : "Employee" "empid":"2" },

Re: Append more files to existing partitioned data

2018-03-18 Thread Denis Bolshakov
Please check out org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand and org.apache.spark.sql.execution.datasources.WriteRelation. I guess it's managed by job.getConfiguration.set(DATASOURCE_WRITEJOBUUID, uniqueWriteJobId.toString). On 17 March 2018 at 20:46, Serega
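
At the user-facing API level, appending to an existing partitioned dataset is usually written roughly like this (a sketch; the partition column and output path are placeholders, not from the thread):

    (new_df.write
        .mode("append")              # add new files, keep the existing ones
        .partitionBy("dt")           # placeholder partition column
        .parquet("hdfs:///data/events"))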

Accessing a file that was passed via --files to spark submit

2018-03-18 Thread Vitaliy Pisarev
I am submitting a script to spark-submit and passing it a file using the --files property. Later on I need to read it in a worker, but I don't understand what API I should use to do that. I figured I'd just try: with open('myfile'): but this did not work. I am able to pass the file using the addFile
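
A sketch of the SparkFiles route that goes with addFile (and, typically, with --files): SparkFiles.get resolves the local copy of the distributed file on whichever node the code runs. The name 'myfile' is taken from the message; everything else is illustrative:

    from pyspark import SparkFiles
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    def read_on_executor(_):
        path = SparkFiles.get('myfile')   # local path of the shipped file
        with open(path) as f:
            return [f.readline()]

    print(sc.parallelize([0], 1).mapPartitions(read_on_executor).collect())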