Spark dataset cache vs tempview

2016-11-05 Thread Rohit Verma
I have a parquet file which I read at least 4-5 times within my application. I was wondering what is the most efficient thing to do. Option 1: While writing the parquet file, immediately read it back into a dataset and call cache. I am assuming that by doing an immediate read I might use some existing
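A minimal sketch of Option 1 in the Spark 2.x Java API (not the poster's code; the path, view name, and app name are placeholders): read the freshly written parquet back once, cache the Dataset, and register a temp view on top of it so later queries reuse the same cached data.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CacheVsTempView {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("cache-vs-tempview").getOrCreate();

        // Read the parquet file back once and keep it in memory.
        Dataset<Row> ds = spark.read().parquet("hdfs:///tmp/example.parquet").cache(); // placeholder path

        // A temp view registered on the cached Dataset reuses the same cached plan;
        // the view itself copies no data.
        ds.createOrReplaceTempView("example");

        ds.count();                                        // first action materializes the cache
        spark.sql("SELECT COUNT(*) FROM example").show();  // served from the cached Dataset

        spark.stop();
    }
}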

Reading csv files with quoted fields containing embedded commas

2016-11-05 Thread Femi Anthony
Hi, I am trying to process a very large comma-delimited csv file and I am running into problems. The main problem is that some fields contain quoted strings with embedded commas. It seems as if PySpark is unable to properly parse lines containing such fields the way, say, Pandas does. Here is the code
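A hedged sketch (not the poster's PySpark code) of the DataFrameReader options that typically handle quoted fields with embedded commas; the same "quote" and "escape" options exist in PySpark's spark.read.csv. The path, header setting, and schema inference are assumptions.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class QuotedCsvRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("quoted-csv").getOrCreate();

        Dataset<Row> df = spark.read()
                .option("header", "true")      // assume the first line is a header
                .option("quote", "\"")         // fields may be wrapped in double quotes
                .option("escape", "\"")        // quotes inside a field are escaped by doubling ("")
                .option("inferSchema", "true")
                .csv("hdfs:///tmp/input.csv"); // placeholder path

        df.show(5, false);
        spark.stop();
    }
}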

Re: Optimized way to use spark for db to hdfs etl

2016-11-05 Thread Deepak Sharma
Hi Rohit, You can use accumulators and increment one on every record processed. At the end you can get the value of the accumulator on the driver, which will give you the count. HTH Deepak On Nov 5, 2016 20:09, "Rohit Verma" wrote: > I am using spark to read from a database and
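A minimal sketch of this suggestion, assuming the Spark 2.x Java API (paths and names are placeholders, not from the thread): register a LongAccumulator, bump it as each record flows through the write path, then read the total on the driver.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.LongAccumulator;

public class CountWhileWriting {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("count-with-accumulator").getOrCreate();
        LongAccumulator rowCount = spark.sparkContext().longAccumulator("rowCount");

        Dataset<Row> df = spark.read().parquet("hdfs:///tmp/source.parquet"); // placeholder source

        // Bump the accumulator as each record passes through, so no second pass is needed.
        JavaRDD<Row> counted = df.javaRDD().map(row -> { rowCount.add(1L); return row; });

        spark.createDataFrame(counted, df.schema())
             .write().mode("overwrite").parquet("hdfs:///tmp/target.parquet"); // placeholder target

        System.out.println("records written: " + rowCount.value()); // read on the driver after the write action
        spark.stop();
    }
}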

Optimized way to use spark for db to hdfs etl

2016-11-05 Thread Rohit Verma
I am using spark to read from a database and write to hdfs as a parquet file. Here is a code snippet: private long etlFunction(SparkSession spark){ spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "SNAPPY"); Properties properties = new Properties();
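A hedged reconstruction of how a function like this snippet typically continues, assuming a JDBC source; the connection URL, table, credentials, and output path are placeholders, not the poster's values.

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DbToHdfsEtl {
    private static long etlFunction(SparkSession spark) {
        spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "snappy");

        Properties properties = new Properties();
        properties.setProperty("user", "etl_user");         // placeholder
        properties.setProperty("password", "etl_password"); // placeholder

        // Read the source table over JDBC.
        Dataset<Row> df = spark.read()
                .jdbc("jdbc:postgresql://dbhost:5432/appdb", "public.orders", properties); // placeholders

        // Write to HDFS as snappy-compressed parquet.
        df.write().mode("overwrite").parquet("hdfs:///warehouse/orders"); // placeholder path

        // Returning df.count() costs one extra action; the accumulator approach in the reply avoids it.
        return df.count();
    }
}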

unsubscribe

2016-11-05 Thread junius zhou

why visitCreateFileFormat doesn't support hive STORED BY, just supports STORED AS

2016-11-05 Thread YDB
why visitCreateFileFormat doesn't support hive STORED BY, just supports STORED AS. I hit this when updating spark 1.6.2 to spark 2.0.1. So what I want to ask is: is there a plan to support hive STORED BY, or will it never be supported? configureOutputJobProperties is quite important; is there any other method to
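A small illustration of the two DDL forms in question, assuming a Hive-enabled SparkSession; the table names and the HBase storage handler are only examples, not from the original post.

import org.apache.spark.sql.SparkSession;

public class StoredByVsStoredAs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("stored-by-vs-stored-as")
                .enableHiveSupport()
                .getOrCreate();

        // STORED AS names a file format and is accepted by Spark 2.0.1's parser.
        spark.sql("CREATE TABLE IF NOT EXISTS t_as (k STRING, v STRING) STORED AS PARQUET");

        // STORED BY names a Hive storage handler (HBase here, as an example) and,
        // as the question describes, is rejected by Spark 2.0.1's SQL parser.
        try {
            spark.sql("CREATE TABLE t_by (k STRING, v STRING) "
                    + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
                    + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:v')");
        } catch (Exception e) {
            System.out.println("STORED BY not supported: " + e.getMessage());
        }
        spark.stop();
    }
}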

SparkLauncher 2.0.1 working inconsistently in yarn-client mode

2016-11-05 Thread Elkhan Dadashov
Hi, I'm running Spark 2.0.1 with Spark Launcher 2.0.1 on a Yarn cluster. I launch a map task which spawns a Spark job via SparkLauncher#startApplication(). Deploy mode is yarn-client. I'm running on a Mac laptop. I have this snippet of code: SparkAppHandle appHandle =
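A minimal sketch of launching a Spark job with SparkLauncher#startApplication() in yarn-client mode, roughly the pattern described here; the Spark home, jar path, and main class are placeholders, not the poster's values.

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LauncherExample {
    public static void main(String[] args) throws Exception {
        SparkAppHandle appHandle = new SparkLauncher()
                .setSparkHome("/opt/spark-2.0.1")          // placeholder
                .setAppResource("/path/to/spark-job.jar")  // placeholder
                .setMainClass("com.example.SparkJob")      // placeholder
                .setMaster("yarn")
                .setDeployMode("client")
                .startApplication(new SparkAppHandle.Listener() {
                    @Override
                    public void stateChanged(SparkAppHandle handle) {
                        System.out.println("state: " + handle.getState());
                    }
                    @Override
                    public void infoChanged(SparkAppHandle handle) {
                        // application id and other info updated here
                    }
                });

        // Block until the launched application reaches a terminal state.
        while (!appHandle.getState().isFinal()) {
            Thread.sleep(1000L);
        }
    }
}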