Thanks a ton, sir. Very helpful.

On Monday, 28 March 2016, 22:36, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
Pretty straightforward:

#!/bin/ksh
DIR="hdfs://<hostname>:9000/data/stg/accounts/nw/xxxxxxxxx"
#
## Remove the blank header line from the spreadsheets and compress them
#
echo `date` " ======= Started removing blank header line and compressing all csv files"
for FILE in `ls *.csv`
do
  sed '1d' ${FILE} > ${FILE}.tmp
  mv -f ${FILE}.tmp ${FILE}
  /usr/bin/bzip2 ${FILE}
done
#
## Clear out the hdfs staging directory
#
echo `date` " ======= Started deleting old files from hdfs staging directory ${DIR}"
hdfs dfs -rm -r ${DIR}/*.bz2
echo `date` " ======= Started putting bz2 files to hdfs staging directory ${DIR}"
for FILE in `ls *.bz2`
do
  hdfs dfs -copyFromLocal ${FILE} ${DIR}
done
echo `date` " ======= Checking that all files are moved to hdfs staging directory"
hdfs dfs -ls ${DIR}
exit 0

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 28 March 2016 at 22:24, Ashok Kumar <ashok34...@yahoo.com> wrote:

Hello Mich,

If you can accommodate, can you please share your approach to steps 1-3 above?

Best regards

On Sunday, 27 March 2016, 14:53, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Pretty simple. As usual it is a combination of ETL and ELT. Basically the csv files are loaded into a staging directory on the host and compressed before being pushed into hdfs:

1. ETL --> Get rid of the blank header line on the csv files
2. ETL --> Compress the csv files
3. ETL --> Put the compressed csv files into the hdfs staging directory
4. ELT --> Use the databricks spark-csv package to load the csv files
5. ELT --> Use Spark FP to process the csv data
6. ELT --> Register it as a temporary table
7. ELT --> Create an ORC table in a named database in compressed zlib format in the Hive database
8. ELT --> Insert/select from the temporary table to the Hive table

So the data ends up stored in an ORC table and one can do whatever analysis is needed using Spark, Hive etc. (A sketch of steps 4-8 appears at the end of this thread.)

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 27 March 2016 at 03:05, Koert Kuipers <ko...@tresata.com> wrote:

To me this is expected behavior that I would not want fixed, but if you look at the recent commits for spark-csv it has one that deals with this...

On Mar 26, 2016 21:25, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:

Hi,

I have a standard csv file (saved as csv in HDFS) that has a blank first line above the header, as follows:

[blank line]
Date, Type, Description, Value, Balance, Account Name, Account Number
[blank line]
22/03/2011,SBT,"'FUNDS TRANSFER , FROM A/C 1790999",200.00,200.00,"'BROWN AE","'638585-60125663",

When I read this file using the following standard call

val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")

it crashes:

java.util.NoSuchElementException
        at java.util.ArrayList$Itr.next(ArrayList.java:794)

If I go and manually delete the first blank line it works OK:

val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
df: org.apache.spark.sql.DataFrame = [Date: string, Type: string, Description: string, Value: double, Balance: double, Account Name: string, Account Number: string]

I can easily write a shell script to get rid of the blank line. I was wondering if databricks has a flag to get rid of the first blank line in the csv file format? (A workaround sketch appears at the end of this thread.)

P.S. If the file is stored as a DOS text file, this problem goes away.
Thanks

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
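A minimal sketch of ELT steps 4-8 above, assuming a Spark 1.x shell with Hive support and the spark-csv package on the classpath. The database name "accounts", the table name "ledger", and the column names are hypothetical placeholders; the staging path is the one from the script above:

// Step 4: load the compressed csv files with the databricks spark-csv package
// (bz2 files are splittable and decompressed transparently by Hadoop input formats).
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://<hostname>:9000/data/stg/accounts/nw/xxxxxxxxx")

// Step 6: register the DataFrame as a temporary table.
df.registerTempTable("tmp")

// Step 7: create an ORC table, zlib-compressed, in a named database
// (sqlContext must be a HiveContext; table and column names are hypothetical).
sqlContext.sql("CREATE DATABASE IF NOT EXISTS accounts")
sqlContext.sql("""
  CREATE TABLE IF NOT EXISTS accounts.ledger (
    TransactionDate string, TransactionType string, Description string,
    Value double, Balance double, AccountName string, AccountNumber string
  )
  STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB")
""")

// Step 8: insert/select from the temporary table into the Hive table
// (positional, so the temp table's columns must line up with the ORC table's).
sqlContext.sql("INSERT INTO TABLE accounts.ledger SELECT * FROM tmp")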
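And for the blank first line: if the spark-csv version in use has no option to skip it, one workaround is to strip empty lines with core RDD operations before parsing. A sketch, where the "ac_clean" output directory is a hypothetical placeholder:

// Read the raw text, drop all-whitespace lines, and write a cleaned copy.
val raw = sc.textFile("hdfs://rhes564:9000/data/stg/accounts/ac/")
raw.filter(_.trim.nonEmpty)
   .saveAsTextFile("hdfs://rhes564:9000/data/stg/accounts/ac_clean/")

// Parse the cleaned copy as before; the header is now the true first line.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://rhes564:9000/data/stg/accounts/ac_clean/")

This also drops the blank line between the header and the first data row, so spark-csv sees a well-formed file, at the cost of one extra pass over the data.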