Re: [discuss] SparkR CRAN feasibility check server problem

2018-11-10 Thread Felix Cheung
It’s a great point about the minimum R version. From what I see, mostly because of
fixes and package support, most R users seem to be fairly up to date, so perhaps
3.4 as the minimum version is reasonable, especially for Spark 3.

Are we getting traction with the CRAN sysadmins? It seems like this has broken
a few times.
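
For concreteness, a minimum version like that is easy to enforce mechanically in CI. Below is a purely hypothetical Scala sketch of such a guard (nothing like it exists in Spark's build today), assuming Rscript is on the PATH; getRversion() is a standard base-R function:

import scala.sys.process._

// Hypothetical guard: ask the local R interpreter for its version and
// fail fast if it is older than the proposed minimum of 3.4.
val rVersion = Seq("Rscript", "-e", "cat(as.character(getRversion()))").!!.trim
val Array(major, minor) = rVersion.split("\\.").take(2).map(_.toInt)
require(major > 3 || (major == 3 && minor >= 4),
  s"R $rVersion found; SparkR would require R 3.4+")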











Re: [discuss] SparkR CRAN feasibility check server problem

2018-11-10 Thread Liang-Chi Hsieh


Yeah, thanks Hyukjin Kwon for bringing this up for discussion.

I don't know how widely the newer R versions are used across the R community. If
R 3.1.x is no longer commonly used, I think we can discuss raising the
minimum R version in the next Spark release.

If we end up not upgrading, we can ask the CRAN sysadmins to fix it on the
server side so that malformed R package info is prevented automatically and we
don't have to fix it manually every time.











Re: [discuss] SparkR CRAN feasibility check server problem

2018-11-10 Thread Hyukjin Kwon
> Can upgrading R fix the issue? Is this perhaps not malformed data, but
some new format used by newer versions?
That's my guess. I am not totally sure about it, though.

> Anyway, we should consider upgrading the R version if that fixes the problem.
Yea, we should, and if we do, it should be at least R 3.4. Maybe it's a good
time to start talking about the minimum R version. 3.1.x is too old; it was
released 4.5 years ago. R 3.4.0 was released 1.5 years ago. Considering the
timing for Spark 3.0 and the deprecation of lower versions, bumping R up to
3.4 might be a reasonable option.

Adding Shane as well.

If we end up not upgrading it, I will forward this email to the CRAN
sysadmins to discuss further anyway.



On Fri, Nov 2, 2018 at 12:51 PM, Felix Cheung wrote:

> Thanks for bringing this up, and much appreciated for keeping on top of this
> at all times.
>
> Can upgrading R fix the issue? Is this perhaps not malformed data, but
> some new format used by newer versions? Anyway, we should consider upgrading
> the R version if that fixes the problem.
>
> As an option, we could also disable the repo check in Jenkins, but I can see
> that could also be problematic.
>
>
> On Thu, Nov 1, 2018 at 7:35 PM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I want to raise the CRAN failure issue because it has started to block Spark
>> PRs from time to time. Since the number
>> of PRs in the Spark community has grown hugely, it is critical not to block
>> other PRs.
>>
>> There has been a problem at CRAN (see
>> https://github.com/apache/spark/pull/20005 for the analysis).
>> In short, the root cause is malformed package info served from
>> https://cran.r-project.org/src/contrib/PACKAGES
>> on the server side, and this had to be fixed by requesting the CRAN
>> sysadmins' help.
>>
>> https://issues.apache.org/jira/browse/SPARK-24152 <- newly opened; I am
>> pretty sure it's the same issue
>> https://issues.apache.org/jira/browse/SPARK-25923 <- reopened/resolved 2
>> times
>> https://issues.apache.org/jira/browse/SPARK-22812
>>
>> This has happened 5 times over roughly 10 months, blocking
>> almost all PRs in Apache Spark.
>> Once it blocked all PRs for a few days, and the whole Spark
>> community had to stop working.
>>
>> I assume this has not been a big issue so far for other
>> projects or people because, apparently,
>> newer versions of R have some logic to handle these malformed documents
>> (at least I verified that R 3.4.0 works fine).
>>
>> On our side, Jenkins has an old R version (R 3.1.1, if it hasn't been updated
>> since I last checked),
>> which is unable to parse the server's malformed response.
>>
>> So, I want to talk about how we are going to handle this. Possible
>> solutions are:
>>
>> 1. We start a talk with the CRAN sysadmins to permanently prevent this
>> issue.
>> 2. We upgrade R to 3.4.0 in Jenkins (however, we will then not be able to
>> test lower R versions).
>> 3. ...
>>
>> If we are fine with it, I would like to forward this email to the CRAN
>> sysadmins to discuss this further.
>>
>> Adding Liang-Chi, Felix, and Shivaram, with whom I have already talked about
>> this a few times before.
>>
>> Thanks all.
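
For context on the root cause above: the PACKAGES index that the feasibility check downloads is a plain-text file of "Key: value" records separated by blank lines (DCF format), and a single malformed record is enough to break an old parser. Below is a minimal Scala sketch of a client-side sanity check over that file; the URL is the real index, but the malformedness heuristics are illustrative assumptions, not the actual checks R performs.

import scala.io.Source

// Sketch: fetch CRAN's PACKAGES index (DCF format: "Key: value" records
// separated by blank lines; continuation lines start with whitespace) and
// flag records that an old client might fail to parse.
val url = "https://cran.r-project.org/src/contrib/PACKAGES"
val lines = Source.fromURL(url, "UTF-8").getLines().toList

// Split the flat line list into records on blank lines.
val records = lines.foldLeft(List(List.empty[String])) {
  case (acc, "") => List.empty[String] :: acc
  case (cur :: rest, line) => (line :: cur) :: rest
}.map(_.reverse).filter(_.nonEmpty)

// Heuristics (assumptions, not R's real rules): every non-continuation
// line should look like "Key: value", and every record needs a Package field.
val malformed = records.filter { rec =>
  rec.exists(l => !l.startsWith(" ") && !l.contains(":")) ||
    !rec.exists(_.startsWith("Package:"))
}
malformed.foreach(r => println(s"Suspicious record:\n${r.mkString("\n")}\n"))

Viewed this way, the two proposed fixes attack opposite ends of the pipeline: option 1 stops bad records from being served at all, while option 2 upgrades the parser so it tolerates them.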


Re: Arrow optimization in conversion from R DataFrame to Spark DataFrame

2018-11-10 Thread Hyukjin Kwon
Thanks, guys! 👍

On Sat, Nov 10, 2018 at 7:35 AM, Bryan Cutler wrote:

> Great work Hyukjin!  I'm not too familiar with R, but I'll take a look at
> the PR.
>
> Bryan
>
> On Fri, Nov 9, 2018 at 9:19 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> Thanks Hyukjin! Very cool results
>>
>> Shivaram
>> On Fri, Nov 9, 2018 at 10:58 AM Felix Cheung 
>> wrote:
>> >
>> > Very cool!
>> >
>> >
>> > 
>> > From: Hyukjin Kwon 
>> > Sent: Thursday, November 8, 2018 10:29 AM
>> > To: dev
>> > Subject: Arrow optimization in conversion from R DataFrame to Spark
>> DataFrame
>> >
>> > Hi all,
>> >
>> > I am trying to introduce an R Arrow optimization by reusing the PySpark
>> Arrow optimization.
>> >
>> > It makes the conversion from R DataFrame to Spark DataFrame up to
>> roughly 900% ~ 1200% faster.
>> >
>> > It looks to be working fine so far; however, I would appreciate it if you
>> have some time to take a look (https://github.com/apache/spark/pull/22954) so
>> that we can go ahead as soon as the R API of Arrow is released.
>> >
>> > More importantly, I want some more people who are into the Arrow R API
>> side but are also interested in the Spark side. I have already cc'ed some
>> people I know, but please come review and discuss both the Spark side and
>> the Arrow side.
>> >
>> > Thanks.
>> >
>>


Re: Spark Utf 8 encoding

2018-11-10 Thread Jörn Franke
Is the original file indeed UTF-8? Windows environments in particular tend to mess
up files (e.g., Java on Windows does not use UTF-8 by default). However,
the software that processed the data beforehand could also have modified it.
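
One way to test that theory from the Spark side is to bypass textFile's fixed UTF-8 decoding and decode the raw Hadoop Text bytes with an explicit charset. A sketch, assuming the file is really ISO-8859-1 (where ø is the single byte 0xF8, which is invalid UTF-8 and therefore renders as a replacement character); the path is a placeholder and spark is the shell's session:

import java.nio.charset.StandardCharsets
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// textFile always decodes via Text as UTF-8; to try another encoding,
// read the raw records and decode the bytes yourself.
val path = "hdfs:///path/to/file.bz2"  // placeholder
val lines = spark.sparkContext
  .hadoopFile[LongWritable, Text, TextInputFormat](path)
  .map { case (_, text) =>
    // Text.getBytes returns the backing buffer; only getLength bytes are valid.
    new String(text.getBytes, 0, text.getLength, StandardCharsets.ISO_8859_1)
  }
lines.take(5).foreach(println)

If "KøBENHAVN" comes out intact this way, the file was never UTF-8 to begin with and should be transcoded upstream.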

> On Nov 10, 2018 at 02:17, lsn24 wrote:
> 
> Hello,
> 
> Per the documentation, the default character encoding of Spark is UTF-8. But
> when I try to read non-ASCII characters, Spark tends to read them as question
> marks. What am I doing wrong? Below is my syntax:
> 
> val ds = spark.read.textFile("a .bz2 file from hdfs");
> ds.show();
> 
> The string "KøBENHAVN" gets displayed as "K�BENHAVN"
> 
> I did the testing in the Spark shell and also ran the same command as part of a
> Spark job. Both yield the same result.
> 
> I don't know what I am missing. I read the documentation and couldn't find
> any explicit config, etc.
> 
> Any pointers will be greatly appreciated!
> 
> Thanks
