Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Jörn Franke
You can use Apache POI's DateUtil to convert the double to a Date
(https://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/DateUtil.html).
Alternatively you can try HadoopOffice
(https://github.com/ZuInnoTe/hadoopoffice/wiki); it supports Spark 1.x and the Spark
2.0 data source API.
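
A minimal sketch of the POI DateUtil route, assuming the date column was read as an Excel serial number (double); the column name "order_date" and the DataFrame `df` (as returned by the crealytics reader) are hypothetical placeholders:

```scala
// Hedged sketch: convert an Excel serial-number column (double) back to a date
// with Apache POI. "order_date" is a hypothetical column name.
import java.sql.Date
import org.apache.poi.ss.usermodel.DateUtil
import org.apache.spark.sql.functions.{col, udf}

val excelSerialToDate = udf { serial: Double =>
  // DateUtil.getJavaDate interprets the serial number in Excel's 1900 date system
  new Date(DateUtil.getJavaDate(serial).getTime)
}

val fixed = df.withColumn("order_date", excelSerialToDate(col("order_date")))
```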

> On 16. Aug 2017, at 20:15, Aakash Basu  wrote:
> 
> Hey Irving,
> 
> Thanks for the quick reply. In Excel that column is purely a string. I actually
> want to import it as a String and later work on the DF to convert it
> back to a date type, but the API itself is not allowing me to dynamically
> assign a schema to the DF, and I'm forced to use inferSchema, which
> converts all numeric columns to double (though I don't know how the
> date column is getting converted to double if it is a string in the Excel
> source).
> 
> Thanks,
> Aakash.
> 
> 
> On 16-Aug-2017 11:39 PM, "Irving Duran"  wrote:
> I think there is a difference between the actual value in the cell and how
> Excel formats that cell.  You probably want to import that field as a string
> or not have it formatted as a date in Excel.
> 
> Just a thought
> 
> 
> Thank You,
> 
> Irving Duran
> 
>> On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu  
>> wrote:
>> Hey all,
>> 
>> Forgot to attach the link to the discussion about overriding the schema
>> through the external package.
>> 
>> https://github.com/crealytics/spark-excel/pull/13
>> 
>> You can see my comment there too.
>> 
>> Thanks,
>> Aakash.
>> 
>>> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu  
>>> wrote:
>>> Hi all,
>>> 
>>> I am working on PySpark (Python 3.6 and Spark 2.1.1) and trying to fetch 
>>> data from an Excel file using 
>>> spark.read.format("com.crealytics.spark.excel"), but it is inferring double 
>>> for a date-type column.
>>> 
>>> The detailed description is given here (the question I posted) -
>>> 
>>> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>>> 
>>> 
>>> It looks like a probable bug in the crealytics Excel reader package.
>>> 
>>> Can somebody help me with a workaround for this?
>>> 
>>> Thanks,
>>> Aakash.
>> 
> 
> 


Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Aakash Basu
Hey Irving,

Thanks for the quick reply. In Excel that column is purely a string. I
actually want to import it as a String and later work on the DF to
convert it back to a date type, but the API itself is not allowing me to
dynamically assign a schema to the DF, and I'm forced to use inferSchema, which
converts all numeric columns to double (though I don't know how the date
column is getting converted to double if it is a string in the Excel
source).

Thanks,
Aakash.


On 16-Aug-2017 11:39 PM, "Irving Duran"  wrote:

I think there is a difference between the actual value in the cell and how
Excel formats that cell.  You probably want to import that field as a
string or not have it formatted as a date in Excel.

Just a thought


Thank You,

Irving Duran

On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu 
wrote:

> Hey all,
>
> Forgot to attach the link to the discussion about overriding the schema
> through the external package.
>
> https://github.com/crealytics/spark-excel/pull/13
>
> You can see my comment there too.
>
> Thanks,
> Aakash.
>
> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu 
> wrote:
>
>> Hi all,
>>
>> I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to
>> fetch data from an Excel file using
>> *spark.read.format("com.crealytics.spark.excel")*, but it is inferring
>> double for a date-type column.
>>
>> The detailed description is given here (the question I posted) -
>>
>> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>>
>>
>> It looks like a probable bug in the crealytics Excel reader package.
>>
>> Can somebody help me with a workaround for this?
>>
>> Thanks,
>> Aakash.
>>
>
>


Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Irving Duran
I think there is a difference between the actual value in the cell and how
Excel formats that cell.  You probably want to import that field as a
string or not have it formatted as a date in Excel.

Just a thought


Thank You,

Irving Duran

On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu 
wrote:

> Hey all,
>
> Forgot to attach the link to the discussion about overriding the schema
> through the external package.
>
> https://github.com/crealytics/spark-excel/pull/13
>
> You can see my comment there too.
>
> Thanks,
> Aakash.
>
> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu 
> wrote:
>
>> Hi all,
>>
>> I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to
>> fetch data from an Excel file using
>> *spark.read.format("com.crealytics.spark.excel")*, but it is inferring
>> double for a date-type column.
>>
>> The detailed description is given here (the question I posted) -
>>
>> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>>
>>
>> It looks like a probable bug in the crealytics Excel reader package.
>>
>> Can somebody help me with a workaround for this?
>>
>> Thanks,
>> Aakash.
>>
>
>


Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Aakash Basu
Hey all,

Forgot to attach the link to the discussion about overriding the schema
through the external package.

https://github.com/crealytics/spark-excel/pull/13

You can see my comment there too.

Thanks,
Aakash.

On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu 
wrote:

> Hi all,
>
> I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to
> fetch data from an Excel file using
> *spark.read.format("com.crealytics.spark.excel")*, but it is inferring
> double for a date-type column.
>
> The detailed description is given here (the question I posted) -
>
> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>
>
> It looks like a probable bug in the crealytics Excel reader package.
>
> Can somebody help me with a workaround for this?
>
> Thanks,
> Aakash.
>


Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Aakash Basu
Hi all,

I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to fetch
data from an Excel file using
*spark.read.format("com.crealytics.spark.excel")*, but it is inferring
double for a date-type column.

The detailed description is given here (the question I posted) -

https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d


It looks like a probable bug in the crealytics Excel reader package.

Can somebody help me with a workaround for this?

Thanks,
Aakash.
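
A plain-Spark workaround sketch (not from the thread): if the inferred double actually holds an Excel serial date in the 1900 date system (serial 25569 corresponds to 1970-01-01), it can be shifted to the Unix epoch and cast; the column name "txn_date" is a hypothetical placeholder and `df` is the DataFrame returned by the crealytics reader with inferSchema on.

```scala
// Hedged sketch: reinterpret an Excel serial-date column that inferSchema read
// as double. Assumes the 1900 date system and that only the date part matters.
import org.apache.spark.sql.functions._

val withDate = df.withColumn(
  "txn_date",
  to_date(from_unixtime(((col("txn_date") - lit(25569)) * 86400).cast("long")))
)
```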


Re: Restart streaming query spark 2.1 structured streaming

2017-08-16 Thread purna pradeep
Also, is query.stop() a graceful stop operation? What happens to data that has
already been received? Will it be processed?

On Tue, Aug 15, 2017 at 7:21 PM purna pradeep 
wrote:

> Ok thanks
>
> Few more
>
> 1. When I looked into the documentation, it says onQueryProgress is not
> thread-safe. So would this method be the right place to refresh the cache,
> with no need to restart the query if I choose the listener approach?
>
> The methods are not thread-safe as they may be called from different
> threads.
>
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala
>
>
>
> 2. If I use StreamingQueryListener's onQueryProgress, my understanding is that the
> method will be executed only while the query is in progress, so if I refresh the
> data frame there without restarting the query, will it impact the application?
>
> 3. Should I use the blocking unpersist(boolean) method or the async method
> unpersist(), given that the data size is big?
>
> I feel your solution is better, as it stops the query --> refreshes the cache -->
> starts the query, if I can compromise on a little downtime even though the cached
> dataframe is huge. I'm not sure how the listener behaves as it's asynchronous;
> correct me if I'm wrong.
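
(To illustrate the listener approach being discussed, a hedged sketch follows; the refresh condition and reload function are hypothetical, and the thread-safety caveat quoted from the docs above still applies.)

```scala
// Hedged sketch: refresh a cached DataFrame from a StreamingQueryListener.
// loadReferenceData and shouldRefresh are hypothetical; the callback runs on a
// separate listener thread, so the usual thread-safety caveats apply.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class CacheRefreshListener(
    loadReferenceData: () => DataFrame,
    shouldRefresh: () => Boolean) extends StreamingQueryListener {

  @volatile private var cached: DataFrame = _

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    if (shouldRefresh()) {
      if (cached != null) cached.unpersist(blocking = true)
      cached = loadReferenceData().persist()
    }
  }
}

// registration (spark is the active SparkSession):
// spark.streams.addListener(new CacheRefreshListener(() => spark.read.parquet("/ref"), () => true))
```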
>
> On Tue, Aug 15, 2017 at 6:36 PM Tathagata Das 
> wrote:
>
>> Both work. The asynchronous method with the listener will have less downtime;
>> it's just that the first trigger/batch after the asynchronous
>> unpersist+persist will probably take longer, as it has to reload the data.
>>
>>
>> On Tue, Aug 15, 2017 at 2:29 PM, purna pradeep 
>> wrote:
>>
>>> Thanks Tathagata Das, actually I'm planning to do something like this:
>>>
>>> activeQuery.stop()
>>>
>>> //unpersist and persist cached data frame
>>>
>>> df.unpersist()
>>>
>>> //read the updated data //data size of df is around 100gb
>>>
>>> df.persist()
>>>
>>>  activeQuery = startQuery()
>>>
>>>
>>> The cached data frame's size is around 100 GB, so the question is: is this the
>>> right place to refresh this huge cached data frame?
>>>
>>> I'm also trying to refresh the cached data frame in the onQueryProgress() method
>>> of a class which extends StreamingQueryListener.
>>>
>>> I would like to know which is the best place to refresh the cached data frame,
>>> and why.
>>>
>>> Thanks again for the below response
>>>
>>> On Tue, Aug 15, 2017 at 4:45 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 You can do something like this.


 def startQuery(): StreamingQuery = {
   // create your streaming dataframes
   // start the query with the same checkpoint directory
 }

 // handle to the active query
 var activeQuery: StreamingQuery = null

 while (!stopped) {

   if (activeQuery == null) {  // if query not active, start query
     activeQuery = startQuery()

   } else if (shouldRestartQuery()) {  // check your condition and restart query
     activeQuery.stop()
     activeQuery = startQuery()
   }

   activeQuery.awaitTermination(100)  // wait for 100 ms
   // if there is any error it will throw exception and quit the loop
   // otherwise it will keep checking the condition every 100 ms
 }




 On Tue, Aug 15, 2017 at 1:13 PM, purna pradeep  wrote:

> Thanks Michael
>
> I guess my question is a little confusing... let me try again.
>
>
> I would like to restart a streaming query programmatically, based on a
> condition, while my streaming application is running. Here is why I want to
> do this:
>
> I want to refresh a cached data frame based on a condition, and the
> best way to do this is to restart the streaming query, as suggested by TD below
> for a similar problem.
>
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201705.mbox/%3cCA+AHuKn+vSEWkJD=bsst6g5bdzdas6wmn+fwmn4jtm1x1nd...@mail.gmail.com%3e
>
> I do understand that checkpointing helps in recovery from failures, but
> I would like to know "how to restart the streaming query programmatically
> without stopping my streaming application".
>
> In place of query.awaitTermination, should I have logic to
> restart the query? Please suggest.
>
>
> On Tue, Aug 15, 2017 at 3:26 PM Michael Armbrust <
> mich...@databricks.com> wrote:
>
>> See
>> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing
>>
>> Though I think that this currently doesn't work with the console sink.
>>
>> On Tue, Aug 15, 2017 at 9:40 AM, purna pradeep <
>> purna2prad...@gmail.com> wrote:
>>
>>> Hi,
>>>

 I'm trying to restart a streaming query to refresh a cached data
 frame.

 Where and how should I restart the streaming query?

>>>
>>>
>>> val sparkSes = SparkSession
>>>
>>>   .builder

Reading parquet file in stream

2017-08-16 Thread HARSH TAKKAR
Hi

I want to read an HDFS directory which contains parquet files. How can I
stream data from this directory using the streaming context (ssc.fileStream)?


Harsh
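
A hedged sketch of one alternative (not from the thread): Structured Streaming's file source can watch a directory of parquet files; the schema and path below are hypothetical, and file sources require an explicit schema.

```scala
// Hedged sketch: watch an HDFS directory for new parquet files with Structured
// Streaming's file source instead of ssc.fileStream. Schema and paths are
// hypothetical placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object ParquetDirStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("parquet-dir-stream").getOrCreate()

    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("value", StringType)
    ))

    val stream = spark.readStream
      .schema(schema)
      .parquet("hdfs:///data/incoming/")

    val query = stream.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```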


Thrift-Server JDBC ResultSet Cursor Reset or Previous

2017-08-16 Thread Imran Rajjad
Dear List,

Are there any future plans to implement cursor reset or previous-record
functionality in the Thrift Server's JDBC driver? Are there any other
alternatives?

java.sql.SQLException: Method not supported
at
org.apache.hive.jdbc.HiveBaseResultSet.previous(HiveBaseResultSet.java:643)
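
(A hedged workaround sketch, assuming the result set is small enough to buffer on the client: copy the forward-only result set into a CachedRowSet, which does support previous(); the connection URL, credentials, and query are hypothetical.)

```scala
// Hedged sketch: buffer a forward-only Thrift/Hive JDBC result set in a
// CachedRowSet so that previous()/absolute() work on the client-side copy.
// URL, credentials, and query are hypothetical; all rows are held in memory.
import java.sql.DriverManager
import javax.sql.rowset.RowSetProvider

val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SELECT id, name FROM some_table")

val cached = RowSetProvider.newFactory().createCachedRowSet()
cached.populate(rs)                 // copies all rows into memory

while (cached.next()) { /* forward pass */ }
if (cached.previous()) {            // backward navigation works on the cached copy
  println(cached.getString("name"))
}

rs.close(); stmt.close(); conn.close()
```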

regards
Imran

-- 
I.R


Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-16 Thread Takeshi Yamamuro
Hi,

Since the CSV source currently supports ASCII-compatible charsets, I
guess Shift-JIS also works well.
You could check Hyukjin's comment in
https://issues.apache.org/jira/browse/SPARK-21289 for more info.
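
(A hedged workaround sketch for the multiLine + non-UTF-8 case discussed below: transcode the Shift_JIS file to UTF-8 once, so that the multiLine reader, which ignored the encoding option at the time, sees UTF-8 bytes. Paths follow the example later in the thread; the file is assumed to fit in driver memory.)

```scala
// Hedged sketch: re-encode a Shift_JIS CSV file to UTF-8 on HDFS, then read it
// with multiLine enabled.
import java.io.ByteArrayOutputStream
import java.nio.charset.{Charset, StandardCharsets}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// read b.txt (Shift_JIS) fully into memory
val in  = fs.open(new Path("b.txt"))
val buf = new ByteArrayOutputStream()
val chunk = new Array[Byte](8192)
var n = in.read(chunk)
while (n != -1) { buf.write(chunk, 0, n); n = in.read(chunk) }
in.close()

// decode as Shift_JIS and write back as UTF-8
val text = new String(buf.toByteArray, Charset.forName("Shift_JIS"))
val out  = fs.create(new Path("b_utf8.txt"), true)
out.write(text.getBytes(StandardCharsets.UTF_8))
out.close()

val df = spark.read.option("multiLine", "true").csv("b_utf8.txt")
df.show(1)
```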


On Wed, Aug 16, 2017 at 2:54 PM, Han-Cheol Cho  wrote:

> My apologies,
>
> It was a problem of our Hadoop cluster.
> When we tested the same code on another cluster (HDP-based), it worked
> without any problem.
>
> ```scala
> ## make sjis text
> cat a.txt
> 8月データだけでやってみよう
> nkf -W -s a.txt >b.txt
> cat b.txt
> 87n%G!<%?$@$1$G$d$C$F$_$h$&
> nkf -s -w b.txt
> 8月データだけでやってみよう
> hdfs dfs -put a.txt b.txt
>
> ## YARN mode test
> spark.read.option("encoding", "utf-8").csv("a.txt").show(1)
> +--+
> |   _c0|
> +--+
> |8月データだけでやってみよう|
> +--+
>
> spark.read.option("encoding", "sjis").csv("b.txt").show(1)
> +--+
> |   _c0|
> +--+
> |8月データだけでやってみよう|
> +--+
>
> spark.read.option("encoding", "utf-8").option("multiLine",
> true).csv("a.txt").show(1)
> +--+
> |   _c0|
> +--+
> |8月データだけでやってみよう|
> +--+
>
> spark.read.option("encoding", "sjis").option("multiLine",
> true).csv("b.txt").show(1)
> +--+
> |   _c0|
> +--+
> |8月データだけでやってみよう|
> +--+
> ```
>
> I am still digging the root cause and will share it later :-)
>
> Best wishes,
> Han-Choel
>
>
> On Wed, Aug 16, 2017 at 1:32 PM, Han-Cheol Cho  wrote:
>
>> Dear Spark ML members,
>>
>>
>> I experienced trouble using the "multiLine" option to load CSV data with
>> Shift-JIS encoding.
>> When option("multiLine", true) is specified, option("encoding",
>> "encoding-name") just doesn't work anymore.
>>
>>
>> In the CSVDataSource.scala file, I found that the MultiLineCSVDataSource.readFile()
>> method doesn't use parser.options.charset at all.
>>
>> object MultiLineCSVDataSource extends CSVDataSource {
>>   override val isSplitable: Boolean = false
>>
>>   override def readFile(
>>   conf: Configuration,
>>   file: PartitionedFile,
>>   parser: UnivocityParser,
>>   schema: StructType): Iterator[InternalRow] = {
>> UnivocityParser.parseStream(
>>   CodecStreams.createInputStreamWithCloseResource(conf, file.filePath),
>>   parser.options.headerFlag,
>>   parser,
>>   schema)
>>   }
>>   ...
>>
>> On the other hand, TextInputCSVDataSource.readFile() method uses it:
>>
>>   override def readFile(
>>   conf: Configuration,
>>   file: PartitionedFile,
>>   parser: UnivocityParser,
>>   schema: StructType): Iterator[InternalRow] = {
>> val lines = {
>>   val linesReader = new HadoopFileLinesReader(file, conf)
>>   Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ =>
>> linesReader.close()))
>>   linesReader.map { line =>
>> new String(line.getBytes, 0, line.getLength, parser.options.charset)  // <-- charset option is used here
>>   }
>> }
>>
>> val shouldDropHeader = parser.options.headerFlag && file.start == 0
>> UnivocityParser.parseIterator(lines, shouldDropHeader, parser,
>> schema)
>>   }
>>
>>
>> It seems like a bug.
>> Is there anyone who had the same problem before?
>>
>>
>> Best wishes,
>> Han-Cheol
>>
>> --
>> ==
>> Han-Cheol Cho, Ph.D.
>> Data scientist, Data Science Team, Data Laboratory
>> NHN Techorus Corp.
>>
>> Homepage: https://sites.google.com/site/priancho/
>> ==
>>
>
>
>
> --
> ==
> Han-Cheol Cho, Ph.D.
> Data scientist, Data Science Team, Data Laboratory
> NHN Techorus Corp.
>
> Homepage: https://sites.google.com/site/priancho/
> ==
>



-- 
---
Takeshi Yamamuro