Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Jörn Franke
You can use Apache POI's DateUtil to convert the double to a Date
(https://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/DateUtil.html).
Alternatively you can try HadoopOffice
(https://github.com/ZuInnoTe/hadoopoffice/wiki); it supports Spark 1.x and the Spark
2.0 data source API.
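
A minimal sketch of the POI DateUtil route, assuming the date column was read as an Excel serial number (double); the column name "order_date" and the DataFrame `df` (as returned by the crealytics reader) are hypothetical placeholders:

```scala
// Hedged sketch: convert an Excel serial-number column (double) back to a date
// with Apache POI. "order_date" is a hypothetical column name.
import java.sql.Date
import org.apache.poi.ss.usermodel.DateUtil
import org.apache.spark.sql.functions.{col, udf}

val excelSerialToDate = udf { serial: Double =>
  // DateUtil.getJavaDate interprets the serial number in Excel's 1900 date system
  new Date(DateUtil.getJavaDate(serial).getTime)
}

val fixed = df.withColumn("order_date", excelSerialToDate(col("order_date")))
```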

> On 16. Aug 2017, at 20:15, Aakash Basu  wrote:
> 
> Hey Irving,
> 
> Thanks for the quick reply. In Excel that column is purely a string. I actually
> want to import it as a String and later work on the DF to convert it
> back to a date type, but the API itself is not allowing me to dynamically
> assign a schema to the DF, and I'm forced to use inferSchema, which
> converts all numeric columns to double (though I don't know how the
> date column is getting converted to double if it is a string in the Excel
> source).
> 
> Thanks,
> Aakash.
> 
> 
> On 16-Aug-2017 11:39 PM, "Irving Duran"  wrote:
> I think there is a difference between the actual value in the cell and how
> Excel formats that cell.  You probably want to import that field as a string
> or not have it formatted as a date in Excel.
> 
> Just a thought
> 
> 
> Thank You,
> 
> Irving Duran
> 
>> On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu  
>> wrote:
>> Hey all,
>> 
>> Forgot to attach the link to the discussion about overriding the schema
>> through the external package.
>> 
>> https://github.com/crealytics/spark-excel/pull/13
>> 
>> You can see my comment there too.
>> 
>> Thanks,
>> Aakash.
>> 
>>> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu  
>>> wrote:
>>> Hi all,
>>> 
>>> I am working on PySpark (Python 3.6 and Spark 2.1.1) and trying to fetch 
>>> data from an Excel file using 
>>> spark.read.format("com.crealytics.spark.excel"), but it is inferring double 
>>> for a date-type column.
>>> 
>>> The detailed description is given here (the question I posted) -
>>> 
>>> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>>> 
>>> 
>>> It looks like a probable bug in the crealytics Excel reader package.
>>> 
>>> Can somebody help me with a workaround for this?
>>> 
>>> Thanks,
>>> Aakash.
>> 
> 
> 


Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Aakash Basu
Hey Irving,

Thanks for the quick reply. In Excel that column is purely a string. I
actually want to import it as a String and later work on the DF to
convert it back to a date type, but the API itself is not allowing me to
dynamically assign a schema to the DF, and I'm forced to use inferSchema, which
converts all numeric columns to double (though I don't know how the date
column is getting converted to double if it is a string in the Excel
source).

Thanks,
Aakash.


On 16-Aug-2017 11:39 PM, "Irving Duran"  wrote:

I think there is a difference between the actual value in the cell and how
Excel formats that cell.  You probably want to import that field as a
string or not have it formatted as a date in Excel.

Just a thought


Thank You,

Irving Duran

On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu 
wrote:

> Hey all,
>
> Forgot to attach the link to the discussion about overriding the schema
> through the external package.
>
> https://github.com/crealytics/spark-excel/pull/13
>
> You can see my comment there too.
>
> Thanks,
> Aakash.
>
> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu 
> wrote:
>
>> Hi all,
>>
>> I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to
>> fetch data from an Excel file using
>> *spark.read.format("com.crealytics.spark.excel")*, but it is inferring
>> double for a date-type column.
>>
>> The detailed description is given here (the question I posted) -
>>
>> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>>
>>
>> It looks like a probable bug in the crealytics Excel reader package.
>>
>> Can somebody help me with a workaround for this?
>>
>> Thanks,
>> Aakash.
>>
>
>


Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Irving Duran
I think there is a difference between the actual value in the cell and how
Excel formats that cell.  You probably want to import that field as a
string or not have it formatted as a date in Excel.

Just a thought


Thank You,

Irving Duran

On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu 
wrote:

> Hey all,
>
> Forgot to attach the link to the discussion about overriding the schema
> through the external package.
>
> https://github.com/crealytics/spark-excel/pull/13
>
> You can see my comment there too.
>
> Thanks,
> Aakash.
>
> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu 
> wrote:
>
>> Hi all,
>>
>> I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to
>> fetch data from an Excel file using
>> *spark.read.format("com.crealytics.spark.excel")*, but it is inferring
>> double for a date-type column.
>>
>> The detailed description is given here (the question I posted) -
>>
>> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>>
>>
>> It looks like a probable bug in the crealytics Excel reader package.
>>
>> Can somebody help me with a workaround for this?
>>
>> Thanks,
>> Aakash.
>>
>
>


Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Aakash Basu
Hey all,

Forgot to attach the link to the discussion about overriding the schema
through the external package.

https://github.com/crealytics/spark-excel/pull/13

You can see my comment there too.

Thanks,
Aakash.

On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu 
wrote:

> Hi all,
>
> I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to
> fetch data from an Excel file using
> *spark.read.format("com.crealytics.spark.excel")*, but it is inferring
> double for a date-type column.
>
> The detailed description is given here (the question I posted) -
>
> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>
>
> It looks like a probable bug in the crealytics Excel reader package.
>
> Can somebody help me with a workaround for this?
>
> Thanks,
> Aakash.
>


Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Aakash Basu
Hi all,

I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to fetch
data from an Excel file using
*spark.read.format("com.crealytics.spark.excel")*, but it is inferring
double for a date-type column.

The detailed description is given here (the question I posted) -

https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d


It looks like a probable bug in the crealytics Excel reader package.

Can somebody help me with a workaround for this?

Thanks,
Aakash.
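
A plain-Spark workaround sketch (not from the thread): if the inferred double actually holds an Excel serial date in the 1900 date system (serial 25569 corresponds to 1970-01-01), it can be shifted to the Unix epoch and cast; the column name "txn_date" is a hypothetical placeholder and `df` is the DataFrame returned by the crealytics reader with inferSchema on.

```scala
// Hedged sketch: reinterpret an Excel serial-date column that inferSchema read
// as double. Assumes the 1900 date system and that only the date part matters.
import org.apache.spark.sql.functions._

val withDate = df.withColumn(
  "txn_date",
  to_date(from_unixtime(((col("txn_date") - lit(25569)) * 86400).cast("long")))
)
```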


Re: Restart streaming query spark 2.1 structured streaming

2017-08-16 Thread purna pradeep
Also, is query.stop() a graceful stop operation? What happens to data that has
already been received? Will it be processed?

On Tue, Aug 15, 2017 at 7:21 PM purna pradeep 
wrote:

> Ok thanks
>
> Few more
>
> 1. When I looked into the documentation, it says onQueryProgress is not
> thread-safe. So would this method be the right place to refresh the cache,
> with no need to restart the query if I choose the listener approach?
>
> The methods are not thread-safe as they may be called from different
> threads.
>
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala
>
>
>
> 2. If I use StreamingQueryListener's onQueryProgress, my understanding is that the
> method will be executed only while the query is in progress, so if I refresh the
> data frame there without restarting the query, will it impact the application?
>
> 3. Should I use the blocking unpersist(boolean) method or the async method
> unpersist(), given that the data size is big?
>
> I feel your solution is better, as it stops the query --> refreshes the cache -->
> starts the query, if I can compromise on a little downtime even though the cached
> dataframe is huge. I'm not sure how the listener behaves as it's asynchronous;
> correct me if I'm wrong.
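
(To illustrate the listener approach being discussed, a hedged sketch follows; the refresh condition and reload function are hypothetical, and the thread-safety caveat quoted from the docs above still applies.)

```scala
// Hedged sketch: refresh a cached DataFrame from a StreamingQueryListener.
// loadReferenceData and shouldRefresh are hypothetical; the callback runs on a
// separate listener thread, so the usual thread-safety caveats apply.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class CacheRefreshListener(
    loadReferenceData: () => DataFrame,
    shouldRefresh: () => Boolean) extends StreamingQueryListener {

  @volatile private var cached: DataFrame = _

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    if (shouldRefresh()) {
      if (cached != null) cached.unpersist(blocking = true)
      cached = loadReferenceData().persist()
    }
  }
}

// registration (spark is the active SparkSession):
// spark.streams.addListener(new CacheRefreshListener(() => spark.read.parquet("/ref"), () => true))
```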
>
> On Tue, Aug 15, 2017 at 6:36 PM Tathagata Das 
> wrote:
>
>> Both work. The asynchronous method with the listener will have less downtime;
>> it's just that the first trigger/batch after the asynchronous
>> unpersist+persist will probably take longer, as it has to reload the data.
>>
>>
>> On Tue, Aug 15, 2017 at 2:29 PM, purna pradeep 
>> wrote:
>>
>>> Thanks Tathagata Das, actually I'm planning to do something like this:
>>>
>>> activeQuery.stop()
>>>
>>> //unpersist and persist cached data frame
>>>
>>> df.unpersist()
>>>
>>> //read the updated data //data size of df is around 100gb
>>>
>>> df.persist()
>>>
>>>  activeQuery = startQuery()
>>>
>>>
>>> The cached data frame's size is around 100 GB, so the question is: is this the
>>> right place to refresh this huge cached data frame?
>>>
>>> I'm also trying to refresh the cached data frame in the onQueryProgress() method
>>> of a class which extends StreamingQueryListener.
>>>
>>> I would like to know which is the best place to refresh the cached data frame,
>>> and why.
>>>
>>> Thanks again for the below response
>>>
>>> On Tue, Aug 15, 2017 at 4:45 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 You can do something like this.


 def startQuery(): StreamingQuery = {
   // create your streaming dataframes
   // start the query with the same checkpoint directory
 }

 // handle to the active query
 var activeQuery: StreamingQuery = null

 while (!stopped) {

   if (activeQuery == null) {  // if query not active, start query
     activeQuery = startQuery()

   } else if (shouldRestartQuery()) {  // check your condition and restart query
     activeQuery.stop()
     activeQuery = startQuery()
   }

   activeQuery.awaitTermination(100)  // wait for 100 ms
   // if there is any error it will throw exception and quit the loop
   // otherwise it will keep checking the condition every 100 ms
 }




 On Tue, Aug 15, 2017 at 1:13 PM, purna pradeep  wrote:

> Thanks Michael
>
> I guess my question is a little confusing... let me try again.
>
>
> I would like to restart a streaming query programmatically, based on a
> condition, while my streaming application is running. Here is why I want to
> do this:
>
> I want to refresh a cached data frame based on a condition, and the
> best way to do this is to restart the streaming query, as suggested by TD below
> for a similar problem.
>
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201705.mbox/%3cCA+AHuKn+vSEWkJD=bsst6g5bdzdas6wmn+fwmn4jtm1x1nd...@mail.gmail.com%3e
>
> I do understand that checkpointing helps in recovery from failures, but
> I would like to know "how to restart the streaming query programmatically
> without stopping my streaming application".
>
> In place of query.awaitTermination, should I have logic to
> restart the query? Please suggest.
>
>
> On Tue, Aug 15, 2017 at 3:26 PM Michael Armbrust <
> mich...@databricks.com> wrote:
>
>> See
>> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing
>>
>> Though I think that this currently doesn't work with the console sink.
>>
>> On Tue, Aug 15, 2017 at 9:40 AM, purna pradeep <
>> purna2prad...@gmail.com> wrote:
>>
>>> Hi,
>>>

 I'm trying to restart a streaming query to refresh a cached data
 frame.

 Where and how should I restart the streaming query?

>>>
>>>
>>> val sparkSes = SparkSession
>>>
>>>   .builder

Reading parquet file in stream

2017-08-16 Thread HARSH TAKKAR
Hi

I want to read an HDFS directory which contains parquet files. How can I
stream data from this directory using the streaming context (ssc.fileStream)?


Harsh
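
A hedged sketch of one alternative (not from the thread): Structured Streaming's file source can watch a directory of parquet files; the schema and path below are hypothetical, and file sources require an explicit schema.

```scala
// Hedged sketch: watch an HDFS directory for new parquet files with Structured
// Streaming's file source instead of ssc.fileStream. Schema and paths are
// hypothetical placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object ParquetDirStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("parquet-dir-stream").getOrCreate()

    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("value", StringType)
    ))

    val stream = spark.readStream
      .schema(schema)
      .parquet("hdfs:///data/incoming/")

    val query = stream.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```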


Thrift-Server JDBC ResultSet Cursor Reset or Previous

2017-08-16 Thread Imran Rajjad
Dear List,

Are there any future plans to implement cursor reset or previous-record
functionality in the Thrift Server's JDBC driver? Are there any other
alternatives?

java.sql.SQLException: Method not supported
at
org.apache.hive.jdbc.HiveBaseResultSet.previous(HiveBaseResultSet.java:643)
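
(A hedged workaround sketch, assuming the result set is small enough to buffer on the client: copy the forward-only result set into a CachedRowSet, which does support previous(); the connection URL, credentials, and query are hypothetical.)

```scala
// Hedged sketch: buffer a forward-only Thrift/Hive JDBC result set in a
// CachedRowSet so that previous()/absolute() work on the client-side copy.
// URL, credentials, and query are hypothetical; all rows are held in memory.
import java.sql.DriverManager
import javax.sql.rowset.RowSetProvider

val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SELECT id, name FROM some_table")

val cached = RowSetProvider.newFactory().createCachedRowSet()
cached.populate(rs)                 // copies all rows into memory

while (cached.next()) { /* forward pass */ }
if (cached.previous()) {            // backward navigation works on the cached copy
  println(cached.getString("name"))
}

rs.close(); stmt.close(); conn.close()
```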

regards
Imran

-- 
I.R


Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-16 Thread Takeshi Yamamuro
Hi,

Since the CSV source currently supports ASCII-compatible charsets, I
guess Shift-JIS also works well.
You could check Hyukjin's comment in
https://issues.apache.org/jira/browse/SPARK-21289 for more info.
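
(A hedged workaround sketch for the multiLine + non-UTF-8 case discussed below: transcode the Shift_JIS file to UTF-8 once, so that the multiLine reader, which ignored the encoding option at the time, sees UTF-8 bytes. Paths follow the example later in the thread; the file is assumed to fit in driver memory.)

```scala
// Hedged sketch: re-encode a Shift_JIS CSV file to UTF-8 on HDFS, then read it
// with multiLine enabled.
import java.io.ByteArrayOutputStream
import java.nio.charset.{Charset, StandardCharsets}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// read b.txt (Shift_JIS) fully into memory
val in  = fs.open(new Path("b.txt"))
val buf = new ByteArrayOutputStream()
val chunk = new Array[Byte](8192)
var n = in.read(chunk)
while (n != -1) { buf.write(chunk, 0, n); n = in.read(chunk) }
in.close()

// decode as Shift_JIS and write back as UTF-8
val text = new String(buf.toByteArray, Charset.forName("Shift_JIS"))
val out  = fs.create(new Path("b_utf8.txt"), true)
out.write(text.getBytes(StandardCharsets.UTF_8))
out.close()

val df = spark.read.option("multiLine", "true").csv("b_utf8.txt")
df.show(1)
```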


On Wed, Aug 16, 2017 at 2:54 PM, Han-Cheol Cho  wrote:

> My apologies,
>
> It was a problem of our Hadoop cluster.
> When we tested the same code on another cluster (HDP-based), it worked
> without any problem.
>
> ```scala
> ## make sjis text
> cat a.txt
> 8月データだけでやってみよう
> nkf -W -s a.txt >b.txt
> cat b.txt
> 87n%G!<%?$@$1$G$d$C$F$_$h$&
> nkf -s -w b.txt
> 8月データだけでやってみよう
> hdfs dfs -put a.txt b.txt
>
> ## YARN mode test
> spark.read.option("encoding", "utf-8").csv("a.txt").show(1)
> +--+
> |   _c0|
> +--+
> |8月データだけでやってみよう|
> +--+
>
> spark.read.option("encoding", "sjis").csv("b.txt").show(1)
> +--+
> |   _c0|
> +--+
> |8月データだけでやってみよう|
> +--+
>
> spark.read.option("encoding", "utf-8").option("multiLine",
> true).csv("a.txt").show(1)
> +--+
> |   _c0|
> +--+
> |8月データだけでやってみよう|
> +--+
>
> spark.read.option("encoding", "sjis").option("multiLine",
> true).csv("b.txt").show(1)
> +--+
> |   _c0|
> +--+
> |8月データだけでやってみよう|
> +--+
> ```
>
> I am still digging the root cause and will share it later :-)
>
> Best wishes,
> Han-Choel
>
>
> On Wed, Aug 16, 2017 at 1:32 PM, Han-Cheol Cho  wrote:
>
>> Dear Spark ML members,
>>
>>
>> I experienced trouble using the "multiLine" option to load CSV data with
>> Shift-JIS encoding.
>> When option("multiLine", true) is specified, option("encoding",
>> "encoding-name") just doesn't work anymore.
>>
>>
>> In the CSVDataSource.scala file, I found that the MultiLineCSVDataSource.readFile()
>> method doesn't use parser.options.charset at all.
>>
>> object MultiLineCSVDataSource extends CSVDataSource {
>>   override val isSplitable: Boolean = false
>>
>>   override def readFile(
>>   conf: Configuration,
>>   file: PartitionedFile,
>>   parser: UnivocityParser,
>>   schema: StructType): Iterator[InternalRow] = {
>> UnivocityParser.parseStream(
>>   CodecStreams.createInputStreamWithCloseResource(conf, file.filePath),
>>   parser.options.headerFlag,
>>   parser,
>>   schema)
>>   }
>>   ...
>>
>> On the other hand, TextInputCSVDataSource.readFile() method uses it:
>>
>>   override def readFile(
>>   conf: Configuration,
>>   file: PartitionedFile,
>>   parser: UnivocityParser,
>>   schema: StructType): Iterator[InternalRow] = {
>> val lines = {
>>   val linesReader = new HadoopFileLinesReader(file, conf)
>>   Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ =>
>> linesReader.close()))
>>   linesReader.map { line =>
>> new String(line.getBytes, 0, line.getLength, parser.options.charset)  // <-- charset option is used here
>>   }
>> }
>>
>> val shouldDropHeader = parser.options.headerFlag && file.start == 0
>> UnivocityParser.parseIterator(lines, shouldDropHeader, parser,
>> schema)
>>   }
>>
>>
>> It seems like a bug.
>> Is there anyone who had the same problem before?
>>
>>
>> Best wishes,
>> Han-Cheol
>>
>> --
>> ==
>> Han-Cheol Cho, Ph.D.
>> Data scientist, Data Science Team, Data Laboratory
>> NHN Techorus Corp.
>>
>> Homepage: https://sites.google.com/site/priancho/
>> ==
>>
>
>
>
> --
> ==
> Han-Cheol Cho, Ph.D.
> Data scientist, Data Science Team, Data Laboratory
> NHN Techorus Corp.
>
> Homepage: https://sites.google.com/site/priancho/
> ==
>



-- 
---
Takeshi Yamamuro