Calling Pyspark functions in parallel

2018-03-18 Thread Debabrata Ghosh
Hi, my dataframe has 2000 rows. Processing each row takes about 3 seconds, so sequentially it takes 2000 * 3 = 6000 seconds, which is a very long time. I am therefore contemplating running the function in parallel. For example, I would like to divide the
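
A minimal sketch of one common way to spread that work across executors in PySpark, assuming the per-row computation can be expressed as an ordinary Python function (process_row below is a hypothetical stand-in, not code from the thread):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("per-row-parallel").getOrCreate()
    df = spark.range(2000)          # stand-in for the 2000-row dataframe

    def process_row(row):
        # placeholder for the ~3-second computation on a single row
        return (row.id, row.id * 2)

    # Split the rows into partitions so each executor core works on its own
    # chunk concurrently, instead of looping over rows sequentially on the driver.
    result = (df.repartition(16)
                .rdd
                .map(process_row)
                .toDF(["id", "value"]))
    result.count()                  # trigger the parallel computation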

Re: [PySpark SQL] sql function to_date and to_timestamp return the same data type

2018-03-18 Thread Hyukjin Kwon
Mind if I ask for a reproducer? It seems to return timestamps fine: >>> from pyspark.sql.functions import * >>> spark.range(1).select(to_timestamp(current_timestamp())).printSchema() root |-- to_timestamp(current_timestamp()): timestamp (nullable = false) >>>
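
For reference, a small standalone check (not from the thread) that the two functions return different types on Spark 2.2+:

    from pyspark.sql.functions import to_date, to_timestamp

    df = spark.createDataFrame([("2018-03-18 12:34:56",)], ["ts"])
    df.select(to_date(df.ts).alias("as_date"),
              to_timestamp(df.ts).alias("as_timestamp")).printSchema()
    # Expected output, roughly:
    # root
    #  |-- as_date: date (nullable = true)
    #  |-- as_timestamp: timestamp (nullable = true)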

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread ayan guha
Hi, the issue is not with Spark in this case, it is with Oracle. If you do not know which columns the date-related conversion rule should apply to, then you have a problem. You should try either a) defining a config file where you specify the table name, date column name and date format @ source so that you can
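
A rough sketch of option (a), using a hypothetical config (none of the table or column names below come from the thread) that maps each table to its date columns and source formats, applied before writing to Oracle:

    from pyspark.sql.functions import col, to_date

    date_config = {
        "sales_hive_table": [("order_date", "yyyy-MM-dd"),
                             ("ship_date", "dd-MMM-yy")],
    }

    def apply_date_rules(df, table_name):
        # Cast each configured varchar column of this table to a proper DateType.
        for column, fmt in date_config.get(table_name, []):
            df = df.withColumn(column, to_date(col(column), fmt))
        return df

    df = apply_date_rules(spark.table("sales_hive_table"), "sales_hive_table")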

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread Deepak Sharma
The other approach would be to write to a temp table and then merge the data. But this may be an expensive solution. Thanks Deepak On Mon, Mar 19, 2018, 08:04 Gurusamy Thirupathy wrote: > Hi, > > I am trying to read data from Hive as DataFrame, then trying to write the > DF into

Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread Gurusamy Thirupathy
Hi, I am trying to read data from Hive as a DataFrame, then trying to write the DF into the Oracle database. In this case, the date field/column in Hive has type Varchar(20), but the corresponding column type in Oracle is Date. While reading from Hive, the Hive table names are dynamically
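
A minimal sketch of the conversion-plus-write step, assuming Spark 2.2+ and a yyyy-MM-dd varchar in Hive; the column, table, and connection details are placeholders, not taken from the thread:

    from pyspark.sql.functions import to_date

    df = spark.table("my_hive_table")
    # Cast the varchar date column to a proper DateType before the JDBC write.
    df = df.withColumn("txn_date", to_date("txn_date", "yyyy-MM-dd"))

    (df.write
       .format("jdbc")
       .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
       .option("dbtable", "MY_ORACLE_TABLE")
       .option("user", "my_user")
       .option("password", "my_password")
       .option("driver", "oracle.jdbc.driver.OracleDriver")
       .mode("append")
       .save())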

Re: parquet late column materialization

2018-03-18 Thread nguyen duc Tuan
You can use the EXPLAIN statement to see the optimized plan for each query ( https://stackoverflow.com/questions/35883620/spark-how-can-get-the-logical-physical-query-execution-using-thirft-hive ). 2018-03-19 0:52 GMT+07:00 CPC : > Hi nguyen, > > Thank you for quick response. But
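
For example (a sketch reusing the table and column names from the original question below):

    # SQL form:
    spark.sql("EXPLAIN EXTENDED select businesskey from mytable").show(truncate=False)
    # DataFrame form (True prints the logical plans as well as the physical plan):
    spark.table("mytable").select("businesskey").explain(True)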

Run spark 2.2 on yarn as usual java application

2018-03-18 Thread Serega Sheypak
Hi, is it even possible to run Spark on YARN as a usual Java application? I've built a jar using Maven with the spark-yarn dependency, and I manually populate SparkConf with all Hadoop properties. SparkContext fails to start with an exception: 1. Caused by: java.lang.IllegalStateException: Library

Re: parquet late column materialization

2018-03-18 Thread CPC
Hi nguyen, Thank you for the quick response. But what I am trying to understand is that in both queries the predicate evaluation requires only one column, so Spark does not actually need to read every column in the projection if it is not used in the filter predicate. Just to give an example, Amazon Redshift has this

Re: NPE in Subexpression Elimination optimization

2018-03-18 Thread Jacek Laskowski
Hi, Filed https://issues.apache.org/jira/browse/SPARK-23731 and am working on a workaround (aka fix). Regards, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming

Re: parquet late column materialization

2018-03-18 Thread nguyen duc Tuan
Hi @CPC, Parquet is a columnar storage format, so if you want to read data from only one column, you can do that without accessing all of your data. Spark SQL includes a query optimizer (see https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html), so it will
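
A quick way to see the pruning for the table described in the original question (a sketch; the exact plan text varies by Spark version):

    df = spark.table("mytable")
    df.select("businesskey").explain()
    # The FileScan parquet line of the physical plan is expected to show
    # something like: ReadSchema: struct<businesskey:string>
    # i.e. the request/response columns are not read at all.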

parquet late column materialization

2018-03-18 Thread CPC
Hi everybody, I am trying to understand how Spark reads Parquet files, but I am a little confused. I have a table with 4 columns named businesskey, transactionname, request and response. The request and response columns are huge (10-50kb). When I execute a query like "select * from mytable

Re: Append more files to existing partitioned data

2018-03-18 Thread Serega Sheypak
Thanks a lot! 2018-03-18 9:30 GMT+01:00 Denis Bolshakov : > Please checkout. > > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand > > > and > > org.apache.spark.sql.execution.datasources.WriteRelation > > > I guess it's managed by > >

Re: Scala - Spark for beginners

2018-03-18 Thread Gerard Maas
This is a good start: https://github.com/deanwampler/JustEnoughScalaForSpark And the corresponding talk: https://www.youtube.com/watch?v=LBoSgiLV_NQ There are many more resources if you search for them. -kr, Gerard. On Sun, Mar 18, 2018 at 11:15 AM, Mahender Sarangam <

Dynamic Key JSON Parsing

2018-03-18 Thread Mahender Sarangam
Hi, I'm new to Spark and Scala and need help transforming nested JSON using Scala. We have an upstream system returning JSON like { "id": 100, "text": "Hello, world." Users : [ "User1": { "name": "Brett", "id": 200, "Type" : "Employee" "empid":"2" },

Re: Append more files to existing partitioned data

2018-03-18 Thread Denis Bolshakov
Please check out org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand and org.apache.spark.sql.execution.datasources.WriteRelation. I guess it's managed by job.getConfiguration.set(DATASOURCE_WRITEJOBUUID, uniqueWriteJobId.toString). On 17 March 2018 at 20:46, Serega
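
At the user-facing API level, appending to an existing partitioned dataset is usually written roughly like this (a sketch; the partition column and output path are placeholders, not from the thread):

    (new_df.write
        .mode("append")              # add new files, keep the existing ones
        .partitionBy("dt")           # placeholder partition column
        .parquet("hdfs:///data/events"))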

Accessing a file that was passed via --files to spark submit

2018-03-18 Thread Vitaliy Pisarev
I am submitting a script to spark-submit and passing it a file using the --files property. Later on I need to read it in a worker, but I don't understand what API I should use to do that. I figured I'd just try: with open('myfile'): but this did not work. I am able to pass the file using the addFile
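
A sketch of the SparkFiles route that goes with addFile (and, typically, with --files): SparkFiles.get resolves the local copy of the distributed file on whichever node the code runs. The name 'myfile' is taken from the message; everything else is illustrative:

    from pyspark import SparkFiles
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    def read_on_executor(_):
        path = SparkFiles.get('myfile')   # local path of the shipped file
        with open(path) as f:
            return [f.readline()]

    print(sc.parallelize([0], 1).mapPartitions(read_on_executor).collect())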