You can convert the character set (is this the issue?) as part of your 
loading process in Spark.
As far as I know, the Spark CSV package is based on Hadoop's TextInputFormat. 
To the best of my knowledge, this format supports only UTF-8, so you have to 
convert the files from the Windows encoding to UTF-8. If you are referring to 
language-specific settings (numbers, dates, etc.), these are not supported 
either.
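
To illustrate the conversion step, here is a minimal sketch in plain Python 
that could run on the staging host before the file is copied into HDFS. It 
assumes the Windows files are encoded as Windows-1252 ("cp1252") and use CRLF 
line endings; the function name convert_to_utf8 and the default encoding are 
my own assumptions for illustration, not part of Spark or the CSV package.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Re-encode a text file to UTF-8 and turn CRLF line endings into LF.

    src_encoding is an assumption: adjust it to whatever the Windows
    side actually writes (for example "cp1250" or "utf-16").
    """
    # newline="" disables Python's newline translation on both sides,
    # so the CRLF -> LF replacement below is explicit and predictable.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line.replace("\r\n", "\n"))
```

Calling convert_to_utf8("in.csv", "out.csv") writes a UTF-8 copy that can 
then be loaded into HDFS. A decoding error here is a useful early signal that 
the assumed source encoding is wrong.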

I started working on the hadoopoffice library (which you can use with Spark), 
which lets you read Excel files directly 
(https://github.com/ZuInnoTe/hadoopoffice). However, there is no official 
release yet. There you can also specify the language in which you want to 
represent data values, numbers, etc. when reading the file.

> On 17 Nov 2016, at 14:11, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Hi,
> 
> In the past, with the Databricks package for CSV files, I occasionally had 
> to do some cleaning at the Linux directory level before ingesting a CSV file 
> into the HDFS staging directory for Spark to read it.
> 
> I have a more generic issue that may have to be ready.
> 
> Assume that a provider uses FTP to push CSV files into Windows directories. 
> The whole solution is built around Windows and .NET.
> 
> Now you want to ingest those files into HDFS and process them with Spark CSV.
> 
> One can create NFS directories visible to the Windows server and HDFS as 
> well. However, there may be issues with character sets etc. What are the best 
> ways of handling this? One way would be to use some scripts to make these 
> spreadsheet-type files compatible with Linux and then load them into HDFS. 
> For example, I know that if I save an Excel spreadsheet file in DOS format, 
> that file will work OK with Spark CSV. Are there tools to do this as well?
> 
> Thanks
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
