[jira] [Commented] (SPARK-22632) Fix the behavior of timestamp values for R's DataFrame to respect session timezone

2018-01-07 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16315436#comment-16315436
 ] 

Felix Cheung commented on SPARK-22632:
--

Yes. First, I'd agree we should generalize this to R & Python.
Second, I think the different treatment of timezones between the language 
runtime and Spark has in general been a source of confusion (it has been 
reported at least a few times).
Lastly, this isn't a regression AFAIK, so it's not necessarily a blocker for 
2.3, although it would be very good to have.


> Fix the behavior of timestamp values for R's DataFrame to respect session 
> timezone
> --
>
> Key: SPARK-22632
> URL: https://issues.apache.org/jira/browse/SPARK-22632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> Note: wording is borrowed from SPARK-22395. The symptom is similar and I think 
> that JIRA describes it well.
> When converting an R DataFrame from/to a Spark DataFrame using 
> {{createDataFrame}} or {{collect}}, timestamp values respect the R system 
> timezone instead of the session timezone.
> For example, let's say we use "America/Los_Angeles" as the session timezone and 
> have a timestamp value "1970-01-01 00:00:01" in that timezone. By the way, I'm 
> in South Korea, so the R timezone would be "KST".
> The timestamp value from current collect() will be the following:
> {code}
> > sparkR.session(master = "local[*]", sparkConfig = 
> > list(spark.sql.session.timeZone = "America/Los_Angeles"))
> > collect(sql("SELECT cast(cast(28801 as timestamp) as string) as ts"))
>ts
> 1 1970-01-01 00:00:01
> > collect(sql("SELECT cast(28801 as timestamp) as ts"))
>ts
> 1 1970-01-01 17:00:01
> {code}
> As you can see, the value becomes "1970-01-01 17:00:01" because it respects the 
> R system timezone.
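
A minimal illustrative sketch of a workaround (not the proposed fix), assuming the R process can simply be pinned to the same timezone as the Spark session via the TZ environment variable:

{code}
# Workaround sketch only: align R's timezone with the Spark session timezone so
# collected timestamps render consistently. The actual fix proposed here is for
# SparkR itself to respect spark.sql.session.timeZone during conversion.
Sys.setenv(TZ = "America/Los_Angeles")
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.sql.session.timeZone = "America/Los_Angeles"))
collect(sql("SELECT cast(28801 as timestamp) as ts"))
# With both timezones aligned, ts should display as 1970-01-01 00:00:01
{code}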



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2018-01-07 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16315434#comment-16315434
 ] 

Felix Cheung commented on SPARK-21727:
--

I think we should use
is.atomic(object)

?
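
For reference, a tiny R sketch of the property is.atomic() checks (illustrative only; the actual SparkR type-inference code path is not shown here):

{code}
# is.atomic() distinguishes plain vectors from list columns, which is the
# property the conversion code needs to detect (illustrative sketch only).
is.atomic(c(1L, 2L, 3L))      # TRUE:  a plain atomic vector
is.atomic(list(rep(0, 20)))   # FALSE: a list column, one R list per cell
{code}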

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil Alexander McQuarrie
>Assignee: Neil Alexander McQuarrie
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Kubernetes backend and docker images

2018-01-06 Thread Felix Cheung
+1

Thanks for taking on this.
That was my feedback on one of the long comment threads as well. I think we 
should have one docker image instead of 3 (also pending in the fork are the 
Python and R variants; we should consider having one image that we officially 
release instead of 9, for example).



From: 蒋星博 
Sent: Friday, January 5, 2018 10:57:53 PM
To: Marcelo Vanzin
Cc: dev
Subject: Re: Kubernetes backend and docker images

Agreed, it would be nice to have this simplification, and users can still create 
their custom images by copying/modifying the default one.
Thanks for bringing this up, Marcelo!

2018-01-05 17:06 GMT-08:00 Marcelo Vanzin:
Hey all, especially those working on the k8s stuff.

Currently we have 3 docker images that need to be built and provided
by the user when starting a Spark app: driver, executor, and init
container.

When the initial review went by, I asked why do we need 3, and I was
told that's because they have different entry points. That never
really convinced me, but well, everybody wanted to get things in to
get the ball rolling.

But I still think that's not the best way to go. I did some pretty
simple hacking and got things to work with a single image:

https://github.com/vanzin/spark/commit/k8s-img

Is there a reason why that approach would not work? You could still
create separate images for driver and executor if wanted, but there's
no reason I can see why we should need 3 images for the simple case.

Note that the code there can be cleaned up still, and I don't love the
idea of using env variables to propagate arguments to the container,
but that works for now.

--
Marcelo

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org




[jira] [Commented] (SPARK-16693) Remove R deprecated methods

2018-01-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310872#comment-16310872
 ] 

Felix Cheung commented on SPARK-16693:
--

I thought we did, but I couldn't find any record.
I suppose we keep this until 2.4.0.

> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>
> For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22933) R Structured Streaming API for withWatermark, trigger, partitionBy

2018-01-03 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-22933.
--
  Resolution: Fixed
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> R Structured Streaming API for withWatermark, trigger, partitionBy
> --
>
> Key: SPARK-22933
> URL: https://issues.apache.org/jira/browse/SPARK-22933
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>    Assignee: Felix Cheung
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16693) Remove R deprecated methods

2018-01-02 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308367#comment-16308367
 ] 

Felix Cheung commented on SPARK-16693:
--

These are all non-public methods, so officially not public APIs, but people 
have been known to call them.




> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>
> For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14037) count(df) is very slow for dataframe constructed using SparkR::createDataFrame

2018-01-01 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-14037:
-
Summary: count(df) is very slow for dataframe constructed using 
SparkR::createDataFrame  (was: count(df) is very slow for dataframe constrcuted 
using SparkR::createDataFrame)

> count(df) is very slow for dataframe constructed using SparkR::createDataFrame
> --
>
> Key: SPARK-14037
> URL: https://issues.apache.org/jira/browse/SPARK-14037
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Ubuntu 12.04
> RAM : 6 GB
> Spark 1.6.1 Standalone
>Reporter: Samuel Alexander
>  Labels: performance, sparkR
> Attachments: console.log, spark_ui.png, spark_ui_ray.png
>
>
> Any operation on a dataframe created using SparkR::createDataFrame is very 
> slow.
> I have a CSV of size ~6MB. Below is sample content:
> 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
> 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
> 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
> 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
> 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
> 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
> 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
> 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
> 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
> 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter
> I created an R data.frame using r_df <- read.csv(file="r_df.csv", head=TRUE, 
> sep=",") and then converted it into a Spark dataframe using sp_df <- 
> createDataFrame(sqlContext, r_df).
> count(sp_df) took more than 30 seconds.
> When I loaded the same CSV using spark-csv, like direct_df <- 
> read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = 
> "com.databricks.spark.csv", inferSchema = "false", header="true"),
> count(direct_df) took less than 1 second.
> I know createDataFrame performance has been improved in Spark 1.6, but other 
> operations, like count(), are still very slow.
> How can I get rid of this performance issue? 
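
For reference, a hedged sketch of how the two load paths from this report could be timed side by side (Spark 1.6-era SparkR API as used above; the file paths are the reporter's examples):

{code}
# Timing sketch for the two load paths described in this report
# (Spark 1.6-era SparkR API; file paths are the reporter's examples).
r_df <- read.csv(file = "r_df.csv", head = TRUE, sep = ",")
sp_df <- createDataFrame(sqlContext, r_df)          # via an R data.frame
system.time(print(count(sp_df)))                    # reported: > 30 seconds

direct_df <- read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv",
                     source = "com.databricks.spark.csv",
                     inferSchema = "false", header = "true")
system.time(print(count(direct_df)))                # reported: < 1 second
{code}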



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16366) Time comparison failures in SparkR unit tests

2018-01-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307728#comment-16307728
 ] 

Felix Cheung commented on SPARK-16366:
--

Not 100% sure, but a similar timestamp test failure was reported at one point, 
and I thought both cases were caused by the machine timezone and R picking 
that up.


> Time comparison failures in SparkR unit tests
> -
>
> Key: SPARK-16366
> URL: https://issues.apache.org/jira/browse/SPARK-16366
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Sun Rui
>
> Got the following failures when running SparkR unit tests:
> {panel}
> Failed 
> -
> 1. Failure: date functions on a DataFrame (@test_sparkSQL.R#1261) 
> --
> collect(select(df2, from_utc_timestamp(df2$b, "JST")))[, 1] not equal to 
> c(as.POSIXlt("2012-12-13 21:34:00 UTC"), as.POSIXlt("2014-12-15 10:24:34 
> UTC")).
> Attributes: < Component "tzone": 1 string mismatch >
> 2. Failure: date functions on a DataFrame (@test_sparkSQL.R#1263) 
> --
> collect(select(df2, to_utc_timestamp(df2$b, "JST")))[, 1] not equal to 
> c(as.POSIXlt("2012-12-13 03:34:00 UTC"), as.POSIXlt("2014-12-14 16:24:34 
> UTC")).
> Attributes: < Component "tzone": 1 string mismatch >
> {panel}
> My environment is Ubuntu 14.04, R 3.2.5, LC_TIME=zh_CN.UTF-8
> {code}
> > t<-c(as.POSIXlt("2012-12-13 21:34:00 UTC"), as.POSIXlt("2014-12-15 10:24:34 
> > UTC"))
> > attr(t, "tzone")
> [1] """CST" "CST"
> {code}
> The "tzone" attribute should be cleared before time comparison.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16693) Remove R deprecated methods

2018-01-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307725#comment-16307725
 ] 

Felix Cheung commented on SPARK-16693:
--

[~shivaram] we didn't do this; not sure if we should try to get it into 2.3.0.
What do you think?

> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>
> For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17762) invokeJava fails when serialized argument list is larger than INT_MAX (2,147,483,647) bytes

2018-01-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307724#comment-16307724
 ] 

Felix Cheung commented on SPARK-17762:
--

is this still needed after SPARK-17790 is fixed?

> invokeJava fails when serialized argument list is larger than INT_MAX 
> (2,147,483,647) bytes
> ---
>
> Key: SPARK-17762
> URL: https://issues.apache.org/jira/browse/SPARK-17762
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> We call {{writeBin}} within {{writeRaw}} which is called from invokeJava on 
> the serialized arguments list. Unfortunately, {{writeBin}} has a hard-coded 
> limit set to {{R_LEN_T_MAX}} (which is itself set to {{INT_MAX}} in base). 
> To work around it, we can check for this case and serialize the batch in 
> multiple parts.
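
A hedged sketch of the multi-part idea (illustrative only; the chunk size and helper name below are assumptions, not SparkR's actual serializer code):

{code}
# Sketch: write a large raw vector through multiple writeBin() calls so that no
# single call has to handle more than chunkSize bytes. chunkSize is an
# illustrative assumption, not a SparkR constant.
writeRawInChunks <- function(con, bytes, chunkSize = 2^30) {
  total <- length(bytes)
  start <- 1
  while (start <= total) {
    end <- min(start + chunkSize - 1, total)
    writeBin(bytes[start:end], con)
    start <- end + 1
  }
}
{code}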



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22933) R Structured Streaming API for withWatermark, trigger, partitionBy

2017-12-31 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-22933:


 Summary: R Structured Streaming API for withWatermark, trigger, 
partitionBy
 Key: SPARK-22933
 URL: https://issues.apache.org/jira/browse/SPARK-22933
 Project: Spark
  Issue Type: Bug
  Components: SparkR, Structured Streaming
Affects Versions: 2.3.0
Reporter: Felix Cheung
Assignee: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [Discussion] 0.8.0 Release

2017-12-30 Thread Felix Cheung
+1


From: Jeff Zhang 
Sent: Wednesday, December 27, 2017 3:36:20 PM
To: dev@zeppelin.apache.org
Subject: Re: [Discussion] 0.8.0 Release

I will update that JIRA, and anyone can link JIRAs that they think are
critical for 0.8.0, but inclusion in 0.8.0 is not guaranteed; it needs
community consensus.


Miquel Angel Andreu Febrer wrote on Thursday, December 28, 2017 at 3:03 AM:

> Thanks for your help
>
> As you said, it would be necessary to update the issue links of
> https://issues.apache.org/jira/browse/ZEPPELIN-2385 with only the issues
> that are included in 0.8.0.
>
> Who is going to do that job?
>
>
> On Dec 27, 2017 at 19:35, "moon soo Lee" wrote:
>
> > Cool. Thanks for the volunteer.
> >
> > Why don't we update the issue links of
> > https://issues.apache.org/jira/browse/ZEPPELIN-2385? I think the issue
> > links are a bit outdated.
> > We can remove issue links if they're not expected in 0.8.0, and add issue
> > links if they must be included in 0.8.0.
> > So everyone can track the progress of 0.8.0 in the same place.
> >
> > Thanks,
> > moon
> >
> >
> > On Wed, Dec 27, 2017 at 8:17 AM Belousov Maksim Eduardovich <
> > m.belou...@tinkoff.ru> wrote:
> >
> > > This is great news.
> > >
> > > Our team will take part in testing and bug fixing.
> > >
> > >
> > > Thanks,
> > >
> > > Maksim Belousov
> > >
> > >
> > > -Original Message-
> > > From: Miquel Angel Andreu Febrer [mailto:miquelangeland...@gmail.com]
> > > Sent: Wednesday, December 27, 2017 3:46 PM
> > > To: dev@zeppelin.apache.org
> > > Subject: Re: [Discussion] 0.8.0 Release
> > >
> > > Hello everyone,
> > >
> > > I think it's a good idea. I don't know whether there are any important
> > > issues still pending for 0.8.0; we should check in JIRA. I can help with
> > > the management of the 0.8.0 release if the new release is launched at
> > > the end of January.
> > >
> > > Regards
> > >
> > > On Dec 27, 2017 at 13:35, "Jeff Zhang" wrote:
> > >
> > > > Hi folks,
> > > >
> > > > It has been a long time since our last release, and many new features
> > > > have been added to 0.8.0, so I think we could start preparing the 0.8.0
> > > > release. If there is no volunteer to be the release manager, I volunteer
> > > > to be the release manager for this version. My current thinking is to
> > > > make the release at the end of January. Any thoughts and comments?
> > > >
> > >
> >
>


Re: [DISCUSS] Increase a few numbers in source code

2017-12-30 Thread Felix Cheung
Thanks! They look reasonable to me. Please feel free to open a PR


From: Belousov Maksim Eduardovich 
Sent: Saturday, December 30, 2017 5:20:02 AM
To: dev@zeppelin.apache.org
Subject: [DISCUSS] Increase a few numbers in source code

Hello, team!

There are some hard-coded numbers in the source code. The influence of these 
numbers is non-obvious and very important.

1. SchedulerFactory.java [1]:
executor = ExecutorFactory.singleton().createOrGet("SchedulerFactory", 100);

The meaning of "100" is that Zeppelin server can get only 100 started 
interpreter processes. When analysts run 100 jvm/interpreter processes then 
Zeppelin will be fully stuck: no paragraph can run now even if the paragraph 
run a few minutes ago.

I wrote about this case previously [2].

2. ZeppelinConfiguration.java [3]
ZEPPELIN_INTERPRETER_MAX_POOL_SIZE("zeppelin.interpreter.max.poolsize", 10),

"10" - is the number of paragraphs that will be run under cron scheduling.
This behavior arises after apply of paragraph sequential run.
Also there is no description for "interpreter.max.poolsize".

3. spark.port.maxRetries = 16
This concerns use of the local Spark interpreter.
By default the server can start only 16 local Spark instances/interpreters. An 
analyst cannot start a 17th Spark interpreter and gets an error.
It's non-obvious which setting affects the maximum number of Spark processes.

The most valuable server resources are RAM and CPU. The settings above don't 
optimize use of RAM/CPU and behave more like bugs. It would be good to refactor 
the code and not use settings 1 and 2 at all, but that is low priority.
Therefore I want to put a very big number, for example 65536, in all cases.
Please share your thoughts about increasing the default values in these cases.


1. 
https://github.com/apache/zeppelin/blob/master/zeppelin-interpreter/src/main/java/org/apache/zeppelin/scheduler/SchedulerFactory.java#L55
2. 
https://lists.apache.org/thread.html/30966875b50b5ac8b4326c23a012d702ea7cac24d75f540ae58f31b3@%3Cusers.zeppelin.apache.org%3E
3. 
https://github.com/apache/zeppelin/blob/dd1be03dee9428ade92b8fd47d148c2325179d19/zeppelin-interpreter/src/main/java/org/apache/zeppelin/conf/ZeppelinConfiguration.java#L667



Regards,

Maksim Belousov


[jira] [Updated] (SPARK-22925) ml model persistence creates a lot of small files

2017-12-29 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22925:
-
Description: 
Today in when calling model.save(), some ML models we do makeRDD(data, 1) or 
repartition(1) but in some other models we don't.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60

In the former case issues such as SPARK-19294 have been reported for making 
very large single file.

Whereas in the latter case, models such as RandomForestModel could create 
hundreds or thousands of very small files which is also unmanageable. Looking 
into this, there is no simple way to set/change spark.default.parallelism 
(which would be pick up by sc.parallelize) while the app is running since 
SparkConf seems to be copied/cached by the backend without a way to update them.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135

It seems we need to have a way to make numSlice settable on a per-use basis.


  was:
Today in when calling model.save(), some ML models we do makeRDD(data, 1) or 
repartition(1) but in some other models we don't.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60

In the former case issues such as SPARK-19294 have been reported for making 
very large single file.

Whereas in the latter case, models such as RandomForestModel could create 
hundreds or thousands of very small files which is also unmanageable. Looking 
into this, there is no simple way to set/change spark.default.parallelism 
(which would be pick up by sc.parallelize) while the app is running since 
SparkConf seems to be copied/cached by the backend without a way to update them.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135

It seems we need to have a way to make it settable on a per-use basis.



> ml model persistence creates a lot of small files
> -
>
> Key: SPARK-22925
> URL: https://issues.apache.org/jira/browse/SPARK-22925
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> Today in when calling model.save(), some ML models we do makeRDD(data, 1) or 
> repartition(1) but in some other models we don't.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60
> In the former case issues such as SPARK-19294 have been reported for making 
> very large single file.
> Whereas in the latter case, models such as RandomForestModel could create 
> hundreds or thousands of very small files which is also unmanageable. Looking 
> into this, there is no simple way to set/change spark.default.parallelism 
> (which would be pick up by sc.parallelize) while the app is running since 
> SparkConf seems to be copied/cached by the backend without a way to update 
> them.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135
> It seems we need to have a way to make numSlice settable on a per-use basis.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22925) ml model persistence creates a lot of small files

2017-12-29 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22925:
-
Description: 
Today in when calling model.save(), some ML models we do makeRDD(data, 1) or 
repartition(1) but in some other models we don't.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60

In the former case issues such as SPARK-19294 have been reported for making 
very large single file.

Whereas in the latter case, models such as RandomForestModel could create 
hundreds or thousands of very small files which is also unmanageable. Looking 
into this, there is no simple way to set/change spark.default.parallelism 
(which would be pick up by sc.parallelize) while the app is running since 
SparkConf seems to be copied/cached by the backend without a way to update them.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135

It seems we need to have a way to make it settable on a per-use basis.


  was:
Today in when calling model.save(), some ML models we do makeRDD(data, 1) or 
repartition(1) but in some other models we don't.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60

In the former case issues such as SPARK-19294 have been reported for making 
very large single file.

Whereas in the latter case, models such as RandomForestModel could create 
hundreds or thousands of files which is also unmanageable. Looking into this, 
there is no simple way to set/change spark.default.parallelism (which would be 
pick up by sc.parallelize) while the app is running since SparkConf seems to be 
copied/cached by the backend without a way to update them.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135

It seems we need to have a way to make it settable on a per-use basis.



> ml model persistence creates a lot of small files
> -
>
> Key: SPARK-22925
> URL: https://issues.apache.org/jira/browse/SPARK-22925
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> Today in when calling model.save(), some ML models we do makeRDD(data, 1) or 
> repartition(1) but in some other models we don't.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60
> In the former case issues such as SPARK-19294 have been reported for making 
> very large single file.
> Whereas in the latter case, models such as RandomForestModel could create 
> hundreds or thousands of very small files which is also unmanageable. Looking 
> into this, there is no simple way to set/change spark.default.parallelism 
> (which would be pick up by sc.parallelize) while the app is running since 
> SparkConf seems to be copied/cached by the backend without a way to update 
> them.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135
> It seems we need to have a way to make it settable on a per-use basis.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22925) ml model persistence creates a lot of small files

2017-12-29 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22925:
-
Issue Type: Improvement  (was: Bug)

> ml model persistence creates a lot of small files
> -
>
> Key: SPARK-22925
> URL: https://issues.apache.org/jira/browse/SPARK-22925
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> Today in when calling model.save(), some ML models we do makeRDD(data, 1) or 
> repartition(1) but in some other models we don't.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60
> In the former case issues such as SPARK-19294 have been reported for making 
> very large single file.
> Whereas in the latter case, models such as RandomForestModel could create 
> hundreds or thousands of files which is also unmanageable. Looking into this, 
> there is no simple way to set/change spark.default.parallelism (which would 
> be pick up by sc.parallelize) while the app is running since SparkConf seems 
> to be copied/cached by the backend without a way to update them.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135
> It seems we need to have a way to make it settable on a per-use basis.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22925) ml model persistence creates a lot of small files

2017-12-29 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22925:
-
Description: 
Today in when calling model.save(), some ML models we do makeRDD(data, 1) or 
repartition(1) but in some other models we don't.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60

In the former case issues such as SPARK-19294 have been reported for making 
very large single file.

Whereas in the latter case, models such as RandomForestModel could create 
hundreds or thousands of files which is also unmanageable. Looking into this, 
there is no simple way to set/change spark.default.parallelism (which would be 
pick up by sc.parallelize) while the app is running since SparkConf seems to be 
copied/cached by the backend without a way to update them.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135

It seems we need to have a way to make it settable on a per-use basis.


  was:
Today in when calling model.save(), some ML models we do makeRDD(data, 1) or 
repartition(1) but in some other models we don't.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60

In the former case issue such as SPARK-19294 has been reported for having very 
large single file.

Whereas in the latter case, model such as RandomForestModel could create 
hundreds or thousands of file which is also unmanageable. Looking into this, 
there is no simple way to set/change spark.default.parallelism while the app is 
running since SparkConf seems to be copied/cached by the backend without a way 
to update them.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135

It seems we need to have a way to make it settable on a per-use basis.



> ml model persistence creates a lot of small files
> -
>
> Key: SPARK-22925
> URL: https://issues.apache.org/jira/browse/SPARK-22925
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> Today in when calling model.save(), some ML models we do makeRDD(data, 1) or 
> repartition(1) but in some other models we don't.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60
> In the former case issues such as SPARK-19294 have been reported for making 
> very large single file.
> Whereas in the latter case, models such as RandomForestModel could create 
> hundreds or thousands of files which is also unmanageable. Looking into this, 
> there is no simple way to set/change spark.default.parallelism (which would 
> be pick up by sc.parallelize) while the app is running since SparkConf seems 
> to be copied/cached by the backend without a way to update them.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135
> It seems we need to have a way to make it settable on a per-use basis.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22925) ml model persistence creates a lot of small files

2017-12-29 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16306520#comment-16306520
 ] 

Felix Cheung commented on SPARK-22925:
--

[~josephkb][~holdenkarau][~nick.pentre...@gmail.com][~yanboliang]

> ml model persistence creates a lot of small files
> -
>
> Key: SPARK-22925
> URL: https://issues.apache.org/jira/browse/SPARK-22925
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> Today in when calling model.save(), some ML models we do makeRDD(data, 1) or 
> repartition(1) but in some other models we don't.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60
> In the former case issue such as SPARK-19294 has been reported for having 
> very large single file.
> Whereas in the latter case, model such as RandomForestModel could create 
> hundreds or thousands of file which is also unmanageable. Looking into this, 
> there is no simple way to set/change spark.default.parallelism while the app 
> is running since SparkConf seems to be copied/cached by the backend without a 
> way to update them.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135
> It seems we need to have a way to make it settable on a per-use basis.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22925) ml model persistence creates a lot of small files

2017-12-29 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-22925:


 Summary: ml model persistence creates a lot of small files
 Key: SPARK-22925
 URL: https://issues.apache.org/jira/browse/SPARK-22925
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.2.1, 2.1.2, 2.3.0
Reporter: Felix Cheung


Today in when calling model.save(), some ML models we do makeRDD(data, 1) or 
repartition(1) but in some other models we don't.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60

In the former case issue such as SPARK-19294 has been reported for having very 
large single file.

Whereas in the latter case, model such as RandomForestModel could create 
hundreds or thousands of file which is also unmanageable. Looking into this, 
there is no simple way to set/change spark.default.parallelism while the app is 
running since SparkConf seems to be copied/cached by the backend without a way 
to update them.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135

It seems we need to have a way to make it settable on a per-use basis.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22924) R DataFrame API for sortWithinPartitions

2017-12-29 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-22924:


 Summary: R DataFrame API for sortWithinPartitions
 Key: SPARK-22924
 URL: https://issues.apache.org/jira/browse/SPARK-22924
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22920) R sql functions for current_date, current_timestamp, rtrim/ltrim/trim with trimString

2017-12-29 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-22920.
--
Resolution: Fixed
  Assignee: Felix Cheung

> R sql functions for current_date, current_timestamp, rtrim/ltrim/trim with 
> trimString
> -
>
> Key: SPARK-22920
> URL: https://issues.apache.org/jira/browse/SPARK-22920
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>    Reporter: Felix Cheung
>Assignee: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22920) R sql functions for current_date, current_timestamp, rtrim/ltrim/trim with trimString

2017-12-29 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22920:
-
Target Version/s: 2.3.0
   Fix Version/s: 2.3.0

> R sql functions for current_date, current_timestamp, rtrim/ltrim/trim with 
> trimString
> -
>
> Key: SPARK-22920
> URL: https://issues.apache.org/jira/browse/SPARK-22920
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>    Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22920) R sql functions for current_date, current_timestamp, rtrim/ltrim/trim with trimString

2017-12-28 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-22920:


 Summary: R sql functions for current_date, current_timestamp, 
rtrim/ltrim/trim with trimString
 Key: SPARK-22920
 URL: https://issues.apache.org/jira/browse/SPARK-22920
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21616) SparkR 2.3.0 migration guide, release note

2017-12-28 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305192#comment-16305192
 ] 

Felix Cheung commented on SPARK-21616:
--

SPARK-22315

> SparkR 2.3.0 migration guide, release note
> --
>
> Key: SPARK-21616
> URL: https://issues.apache.org/jira/browse/SPARK-21616
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>    Assignee: Felix Cheung
>
> From looking at changes since 2.2.0, these should be documented in the 
> migration guide / release notes for the 2.3.0 release, as they are behavior 
> changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2017-12-28 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305157#comment-16305157
 ] 

Felix Cheung commented on SPARK-21727:
--

[~neilalex] How is it going?

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil Alexander McQuarrie
>Assignee: Neil Alexander McQuarrie
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Spark 2.2.1 (RC2)

2017-12-27 Thread Felix Cheung
Yes, this is something we are aware of.
Please use the direct link to the release version doc.

I think we are still waiting for the PyPI publication.



From: zzc <441586...@qq.com>
Sent: Tuesday, December 26, 2017 10:03:08 PM
To: dev@spark.apache.org
Subject: Re: [VOTE] Spark 2.2.1 (RC2)

Hi Felix Cheung:
  When will the new version 2.2.1 of the Spark docs be published to the website?
It still shows version 2.2.0.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Passing an array of more than 22 elements in a UDF

2017-12-26 Thread Felix Cheung
Generally, the 22-element limitation comes from Scala 2.10.

In Scala 2.11, the issue with case classes is fixed, but that said, I'm not sure 
whether other limitations might apply to UDFs in Java.

_
From: Aakash Basu <aakash.spark@gmail.com>
Sent: Monday, December 25, 2017 9:13 PM
Subject: Re: Passing an array of more than 22 elements in a UDF
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: ayan guha <guha.a...@gmail.com>, user <user@spark.apache.org>


What's the advantage of using that specific version for this? Please shed some 
light on it.

On Mon, Dec 25, 2017 at 6:51 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Or use it with Scala 2.11?


From: ayan guha <guha.a...@gmail.com<mailto:guha.a...@gmail.com>>
Sent: Friday, December 22, 2017 3:15:14 AM
To: Aakash Basu
Cc: user
Subject: Re: Passing an array of more than 22 elements in a UDF

Hi, I think you are on the right track. You can put all your params in a 
suitable data structure, like an array or dict, and pass that structure as a 
single param to your UDF.

On Fri, 22 Dec 2017 at 2:55 pm, Aakash Basu 
<aakash.spark@gmail.com<mailto:aakash.spark@gmail.com>> wrote:
Hi,

I am using Spark 2.2 with Java; can anyone please suggest how to take more than 
22 parameters in a UDF? I mean, what if I want to pass all the parameters as an 
array of integers?

Thanks,
Aakash.
--
Best Regards,
Ayan Guha





Re: Spark 2.2.1 worker invocation

2017-12-26 Thread Felix Cheung
I think you are looking for spark.executor.extraJavaOptions

https://spark.apache.org/docs/latest/configuration.html#runtime-environment


From: Christopher Piggott 
Sent: Tuesday, December 26, 2017 8:00:56 AM
To: user@spark.apache.org
Subject: Spark 2.2.1 worker invocation

I need to set java.library.path to get access to some native code.  Following 
directions, I made a spark-env.sh:

#!/usr/bin/env bash
export 
LD_LIBRARY_PATH="/usr/local/lib/libcdfNativeLibrary.so:/usr/local/lib/libcdf.so:${LD_LIBRARY_PATH}"
export SPARK_WORKER_OPTS=-Djava.library.path=/usr/local/lib
export SPARK_WORKER_MEMORY=2g

to no avail.  (I tried both with and without exporting the environment).  
Looking at how the worker actually starts up:

 /usr/lib/jvm/default/bin/java -cp /home/spark/conf/:/home/spark/jars/* 
-Xmx1024M -Dspark.driver.port=37219 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@10.1.1.1:37219
 --executor-id 113 --hostname 10.2.2.1 --cores 8 --app-id 
app-20171225145607-0003 --worker-url 
spark://Worker@10.2.2.1:35449



It doesn't seem to take any options.  I put an 'echo' in just to confirm that 
spark-env.sh is getting invoked (and it is).

So, just out of curiosity, I tried to troubleshoot this:



spark@node2-1:~$ grep -R SPARK_WORKER_OPTS *
conf/spark-env.sh:export 
SPARK_WORKER_OPTS=-Djava.library.path=/usr/local/lib
conf/spark-env.sh.template:# - SPARK_WORKER_OPTS, to set config properties 
only for the worker (e.g. "-Dx=y")


The variable doesn't seem to get referenced anywhere in the spark distribution. 
 I checked a number of other options in spark-env.sh.template and they didn't 
seem to be referenced either.  I expected to find them in various startup 
scripts.

I can probably "fix" my problem by hacking the lower-level startup scripts, but 
first I'd like to inquire about what's going on here.  How and where are these 
variables actually used?




Re: Passing an array of more than 22 elements in a UDF

2017-12-24 Thread Felix Cheung
Or use it with Scala 2.11?


From: ayan guha 
Sent: Friday, December 22, 2017 3:15:14 AM
To: Aakash Basu
Cc: user
Subject: Re: Passing an array of more than 22 elements in a UDF

Hi, I think you are on the right track. You can put all your params in a 
suitable data structure, like an array or dict, and pass that structure as a 
single param to your UDF.

On Fri, 22 Dec 2017 at 2:55 pm, Aakash Basu 
> wrote:
Hi,

I am using Spark 2.2 with Java; can anyone please suggest how to take more than 
22 parameters in a UDF? I mean, what if I want to pass all the parameters as an 
array of integers?

Thanks,
Aakash.
--
Best Regards,
Ayan Guha


[jira] [Updated] (SPARK-22889) CRAN checks can fail if older Spark install exists

2017-12-23 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22889:
-
Fix Version/s: 2.3.0
   2.2.2

> CRAN checks can fail if older Spark install exists
> --
>
> Key: SPARK-22889
> URL: https://issues.apache.org/jira/browse/SPARK-22889
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
> Fix For: 2.2.2, 2.3.0
>
>
> Since all CRAN checks go through the same machine, if there is an older 
> partial download or partial install of Spark left behind, the tests fail. One 
> solution is to overwrite the install files when running tests. 
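
For reference, a hedged sketch using SparkR's install.spark(), which exposes an overwrite flag (the exact invocation used in the CRAN test setup may differ):

{code}
# Sketch only: force a clean Spark install before the tests run, so an older
# partial download/install on the shared CRAN machine cannot break the run.
library(SparkR)
install.spark(overwrite = TRUE)
{code}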



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22889) CRAN checks can fail if older Spark install exists

2017-12-23 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-22889.
--
  Resolution: Fixed
Assignee: Shivaram Venkataraman
Target Version/s: 2.2.2, 2.3.0  (was: 2.3.0)

> CRAN checks can fail if older Spark install exists
> --
>
> Key: SPARK-22889
> URL: https://issues.apache.org/jira/browse/SPARK-22889
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>
> Since all CRAN checks go through the same machine, if there is an older 
> partial download or partial install of Spark left behind, the tests fail. One 
> solution is to overwrite the install files when running tests. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] Change some default settings for avoiding unintended usages

2017-12-23 Thread Felix Cheung
Authentication by default is good, but we should avoid having a well-known 
user/password by default; that's a security risk.


From: Belousov Maksim Eduardovich <m.belou...@tinkoff.ru>
Sent: Thursday, December 21, 2017 12:30:57 AM
To: us...@zeppelin.apache.org
Cc: dev@zeppelin.apache.org
Subject: RE: [DISCUSS] Change some default settings for avoiding unintended 
usages

Authentication by default isn't a big deal; it could be enabled.
It would be nice to use another account by default: guest/guest, for example.


Thanks,

Maksim Belousov

From: Jongyoul Lee [mailto:jongy...@gmail.com]
Sent: Monday, December 18, 2017 6:07 AM
To: users <us...@zeppelin.apache.org>
Cc: dev@zeppelin.apache.org
Subject: Re: [DISCUSS] Change some default settings for avoiding unintended 
usages

Agreed. Supporting container services would be good and I like this idea, but I 
don't think it's directly part of this issue. Let's discuss that in another 
email thread.

I want to talk about enabling authentication by default. If it's enabled, we 
would have to log in as admin/password1 at the beginning. What do you think?

On Sat, Dec 2, 2017 at 1:57 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
I’d +1 docker or container support (mesos, dc/os, k8s)

But I think that they are separate things. If users are authenticated and 
interpreter is impersonating each user, the risk of system disruption should be 
low. This is typically how to secure things in a system, through user directory 
(eg LDAP) and access control (normal user can’t sudo and delete everything).

Thought?

_
From: Jeff Zhang <zjf...@gmail.com<mailto:zjf...@gmail.com>>
Sent: Thursday, November 30, 2017 11:51 PM

Subject: Re: [DISCUSS] Change some default settings for avoiding unintended 
usages
To: <dev@zeppelin.apache.org<mailto:dev@zeppelin.apache.org>>
Cc: users <us...@zeppelin.apache.org<mailto:us...@zeppelin.apache.org>>


+1 for running interpreter process in docker container.



Jongyoul Lee <jongy...@gmail.com<mailto:jongy...@gmail.com>> wrote on Friday, 
December 1, 2017 at 3:36 PM:
Yes, exactly, this is not only the shell interpreter's problem; anyone can run
any script through Python and Scala. Shell is just an example.

Using Docker looks good, but it cannot avoid unintended usage of resources
like coin mining.

On Fri, Dec 1, 2017 at 2:36 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
wrote:

> I don’t think that’s limited to the shell interpreter.
>
> You can run any arbitrary program or script from python or Scala (or java)
> as well.
>
> _
> From: Jeff Zhang <zjf...@gmail.com<mailto:zjf...@gmail.com>>
> Sent: Wednesday, November 29, 2017 4:00 PM
> Subject: Re: [DISCUSS] Change some default settings for avoiding
> unintended usages
> To: <dev@zeppelin.apache.org<mailto:dev@zeppelin.apache.org>>
> Cc: users <us...@zeppelin.apache.org<mailto:us...@zeppelin.apache.org>>
>
>
>
> The shell interpreter is a black hole for security; usually we don't recommend
> or allow users to use shell.
>
> We may need to refactor the shell interpreter; running under the zeppelin user
> is too dangerous.
>
>
>
>
>
> On Wed, Nov 29, 2017 at 11:44 PM, Jongyoul Lee 
> <jongy...@gmail.com<mailto:jongy...@gmail.com>> wrote:
>
> > Hi, users and dev,
> >
> > Recently, I've run into an issue about abnormal usage of some
> > interpreters.
> > Zeppelin's users can access the shell via the shell and python interpreters. It
> > means all users can run or execute whatever they want, even if it harms the
> > system. Thus I agree that we need to change some default settings to
> > prevent this kind of abuse. Before we proceed to do it, I want
> > to listen to others' opinions.
> >
> > Feel free to reply to this email
> >
> > Regards,
> > Jongyoul
> >
> > --
> > 이종열, Jongyoul Lee, 李宗烈
> > http://madeng.net
> >
>
>
>


--
이종열, Jongyoul Lee, 李宗烈
http://madeng.net




--
이종열, Jongyoul Lee, 李宗烈
http://madeng.net


Re: [DISCUSS] Review process

2017-12-23 Thread Felix Cheung
I’d suggest first, +1 on the PR to express your interest and support.

Then you could add a reviewer directly on the PR.

But this is outside the scope of this thread.


From: Belousov Maksim Eduardovich <m.belou...@tinkoff.ru>
Sent: Wednesday, December 20, 2017 1:22:07 AM
To: dev@zeppelin.apache.org
Subject: RE: [DISCUSS] Review process

We are speaking in this thread about two reviewers on big PRs, but there are PRs 
without any reviewers =)

What can a contributor do if his PR has not been reviewed for 4 weeks?
For example https://github.com/apache/zeppelin/pull/2684

Thanks,

Maksim Belousov


-Original Message-
From: Jongyoul Lee [mailto:jongy...@gmail.com]
Sent: Tuesday, December 19, 2017 8:34 PM
To: dev <dev@zeppelin.apache.org>
Subject: Re: [DISCUSS] Review process

I agree that some large PRs should be delayed a bit longer. What I meant is we 
don't have to wait for all kinds of PRs.

On Wed, Dec 20, 2017 at 2:11 AM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> +1
> What would be the rough heuristic people will be comfortable with -
> what is small and what is big?
>
> _
> From: Anthony Corbacho <anthonycorba...@apache.org>
> Sent: Monday, December 18, 2017 3:02 PM
> Subject: Re: [DISCUSS] Review process
> To: <dev@zeppelin.apache.org>
>
>
> I think for large PRs (new features or big changes) we should still keep
> more than one approval before merging them, since they will require more 
> attention.
>
> But for bug fixes I think one approval should be enough.
>
> On Tue, 19 Dec 2017 at 7:49 AM Jeff Zhang <zjf...@gmail.com> wrote:
>
> > Agree with @Felix, especially for large PRs and PRs of new
> > features it is still necessary to have more than a +1.
> >
> > I think committers have the ability to identify whether a PR is
> > complicated enough that it needs another committer's review. As long as we
> > have consensus, we could commit some PRs without delay and some PRs for more
> > reviews. So that we can balance the development speed and code quality.
> >
> >
> >
> > On Tue, Dec 19, 2017 at 2:07 AM, Miquel Angel Andreu Febrer
> > <miquelangeland...@gmail.com> wrote:
> >
> > > You can automate that process in jenkins and manage the delay time
> > > of merging a pull request
> > >
> > > El 18 dic. 2017 18:03, "Felix Cheung" <felixcheun...@hotmail.com>
> > > escribió:
> > >
> > > > I think it is still useful to have a time delay after one approve, since
> > > > often there are feedback and updates after one committer approval.
> > > >
> > > > Also github has a tab for all PRs you are subscribed to, it shouldn’t be
> > > > very hard to review all the approved ones again.
> > > >
> > > > 
> > > > From: Jongyoul Lee <jongy...@gmail.com>
> > > > Sent: Monday, December 18, 2017 8:04:51 AM
> > > > To: dev@zeppelin.apache.org
> > > > Subject: Re: [DISCUSS] Review process
> > > >
> > > > Good summary. But actually, no committer merges without delay after
> > > > reviewing it. So I thought we should clarify it officially.
> > > >
> > > > Now, some committers, including me, will be able to merge some PRs
> > > > without delay and burden.
> > > >
> > > > On Mon, 18 Dec 2017 at 11:27 PM, moon soo Lee <m...@apache.org> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > The current review process[1] does require either at least a +1 from a
> > > > > committer or 24 hours for lazy consensus.
> > > > >
> > > > > Pull requests can be open for 1 or 2 days for additional review, but I
> > > > > think those are not hard requirements. (e.g. Hotfixes are already being
> > > > > merged without waiting for additional review.)
> > > > >
> > > > > So, technically, the current policy allows any committer to start a review,
> > > > > mark +1 and merge immediately without any delay if necessary.
> > > > >
> > > > > Thanks,
> > > > > moon
> > > > >
> > > > > [1]
> > > > >
> > > >

[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301717#comment-16301717
 ] 

Felix Cheung commented on SPARK-22683:
--

I think the challenge here is the ability to determine upfront if the task is 
going to be small/quick.



> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependant on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executors could become contention 
> points for other exes trying to remotely access blocks in the old exes (not 
> witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881
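For context, the existing knobs discussed above can be set per job from SparkR roughly 
as below. The values are purely illustrative, and the tasks-per-slot setting only 
exists in the proposal / PR above, not as a released configuration:

{code}
library(SparkR)
sparkR.session(
  master = "yarn",
  sparkConfig = list(
    spark.dynamicAllocation.enabled = "true",
    spark.shuffle.service.enabled = "true",     # required for dynamic allocation on YARN
    spark.executor.cores = "5",
    spark.dynamicAllocation.executorIdleTimeout = "30s",
    spark.dynamicAllocation.maxExecutors = "200",            # the per-job cap the reporter wants to avoid tuning
    spark.dynamicAllocation.schedulerBacklogTimeout = "1s"
    # the proposal would layer a tasks-per-slot knob on top of these
    # (name per the PR above; not an existing setting)
  )
)
{code}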



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301709#comment-16301709
 ] 

Felix Cheung edited comment on SPARK-22683 at 12/22/17 5:35 PM:


I couldn't find the exact source line, but from running Flink previously I'm 
reasonably sure number of task slots  == number of cores. Therefore I don't 
think it's meant to increase utilization by over committing tasks or 
concurrently running multiple tasks and so on.


was (Author: felixcheung):
I couldn't find the exact source line, but from running Flink previously I'm 
reasonably sure number of task slots  == number of cores. Therefore I don't 
think it's meant to increase utilization by concurrently running multiple tasks 
and so on.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependant on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executors could become contention ...

[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301709#comment-16301709
 ] 

Felix Cheung commented on SPARK-22683:
--

I couldn't find the exact source line, but from running Flink previously I'm 
reasonably sure number of task slots  == number of cores. Therefore I don't 
think it's meant to increase utilization by concurrently running multiple tasks 
and so on.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependant on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executors could become contention 
> points for other exes trying to remotely access blocks in the old exes (not 
> witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> ...

[jira] [Commented] (SPARK-22870) Dynamic allocation should allow 0 idle time

2017-12-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301701#comment-16301701
 ] 

Felix Cheung commented on SPARK-22870:
--

+1
Yes, there is more to this than just the check for the value 0, as the cleanup is not 
currently synchronous.

> Dynamic allocation should allow 0 idle time
> ---
>
> Key: SPARK-22870
> URL: https://issues.apache.org/jira/browse/SPARK-22870
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>Priority: Minor
>
> As discussed in SPARK-22765, with SPARK-21656, an executor will not idle out 
> when there are pending tasks to run. When there is no task to run, an 
> executor will die out after {{spark.dynamicAllocation.executorIdleTimeout}}, 
> which is currently required to be greater than zero. However, for efficiency, 
> a user should be able to specify that an executor can die out immediately w/o 
> being required to be idle for at least 1s.
> This is to make {{0}} a valid value for 
> {{spark.dynamicAllocation.executorIdleTimeout}}, and special handling such a 
> case might be needed.
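As a sketch only (assuming the change lands), this is what a user would write; today 
the same call is rejected because the timeout must be greater than zero:

{code}
library(SparkR)
sparkR.session(sparkConfig = list(
  spark.dynamicAllocation.enabled = "true",
  spark.shuffle.service.enabled = "true",
  # 0 would mean "release an executor as soon as it has no task to run",
  # which is exactly what this ticket asks to allow.
  spark.dynamicAllocation.executorIdleTimeout = "0s"
))
{code}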



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20007) Make SparkR apply() functions robust to workers that return empty data.frame

2017-12-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301015#comment-16301015
 ] 

Felix Cheung commented on SPARK-20007:
--

any taker on this for 2.3.0?

> Make SparkR apply() functions robust to workers that return empty data.frame
> 
>
> Key: SPARK-20007
> URL: https://issues.apache.org/jira/browse/SPARK-20007
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> When using {{gapply()}} (or other members of the {{apply()}} family) with a 
> schema, Spark will try to parse data returned from the R process on each 
> worker as Spark DataFrame Rows based on the schema. In this case our provided 
> schema suggests that we have six columns. When an R worker returns results to 
> the JVM, SparkSQL will try to access its columns one by one and cast them to 
> proper types. If the R worker returns nothing, the JVM will throw an 
> {{ArrayIndexOutOfBoundsException}} exception.
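A minimal repro sketch of that failure mode (my own construction, not taken from the 
JIRA), where a per-group function legitimately returns zero rows:

{code}
library(SparkR)
sparkR.session(master = "local[*]")

df <- createDataFrame(data.frame(key = c(1, 1, 2), value = c(10, 20, 30)))
schema <- structType(structField("key", "double"),
                     structField("value", "double"))
out <- gapply(df, "key", function(key, x) {
  # Filtering on the worker can produce an empty data.frame for a group;
  # per this JIRA the JVM side should handle that instead of throwing
  # ArrayIndexOutOfBoundsException.
  x[x$value > 100, , drop = FALSE]
}, schema)
collect(out)
{code}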



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21076) R dapply doesn't return array or raw columns when array have different length

2017-12-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301011#comment-16301011
 ] 

Felix Cheung commented on SPARK-21076:
--

any taker on this for 2.3.0?

> R dapply doesn't return array or raw columns when array have different length
> -
>
> Key: SPARK-21076
> URL: https://issues.apache.org/jira/browse/SPARK-21076
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Xu Yang
>
> Calling SparkR::dapplyCollect with R functions that return dataframes 
> produces an error. This comes up when returning columns of binary data- ie. 
> serialized fitted models. Also happens when functions return columns 
> containing vectors. 
> [SPARK-16785|https://issues.apache.org/jira/browse/SPARK-16785]
> still have this issue when input data is an array column not having the same 
> length on each vector, like:
> {code}
> head(test1)
>key  value
> 1 4dda7d68a202e9e3  1595297780
> 2  4e08f349deb7392  641991337
> 3 4e105531747ee00b  374773009
> 4 4f1d5ef7fdb4620a  2570136926
> 5 4f63a71e6dde04cd  2117602722
> 6 4fa2f96b689624fc  3489692062, 1344510747, 1095592237, 
> 424510360, 3211239587
> sparkR.stop()
> sc <- sparkR.init()
> sqlContext <- sparkRSQL.init(sc)
> spark_df = createDataFrame(sqlContext, test1)
> # Fails
> dapplyCollect(spark_df, function(x) x)
> Caused by: org.apache.spark.SparkException: R computation failed with
>  Error in (function (..., deparse.level = 1, make.row.names = TRUE, 
> stringsAsFactors = default.stringsAsFactors())  : 
>   invalid list argument: all variables should have the same length
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59)
>   at 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29)
>   at 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:186)
>   at 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:183)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   ... 1 more
> # Works fine
> spark_df <- selectExpr(spark_df, "key", "explode(value) value") 
> dapplyCollect(spark_df, function(x) x)
> key value
> 1  4dda7d68a202e9e3 1595297780
> 2   4e08f349deb7392  641991337
> 3  4e105531747ee00b  374773009
> 4  4f1d5ef7fdb4620a 2570136926
> 5  4f63a71e6dde04cd 2117602722
> 6  4fa2f96b689624fc 3489692062
> 7  4fa2f96b689624fc 1344510747
> 8  4fa2f96b689624fc 1095592237
> 9  4fa2f96b689624fc  424510360
> 10 4fa2f96b689624fc 3211239587
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21076) R dapply doesn't return array or raw columns when array have different length

2017-12-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-21076:
-
Target Version/s: 2.3.0

> R dapply doesn't return array or raw columns when array have different length
> -
>
> Key: SPARK-21076
> URL: https://issues.apache.org/jira/browse/SPARK-21076
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Xu Yang
>
> Calling SparkR::dapplyCollect with R functions that return dataframes 
> produces an error. This comes up when returning columns of binary data- ie. 
> serialized fitted models. Also happens when functions return columns 
> containing vectors. 
> [SPARK-16785|https://issues.apache.org/jira/browse/SPARK-16785]
> still have this issue when input data is an array column not having the same 
> length on each vector, like:
> {code}
> head(test1)
>key  value
> 1 4dda7d68a202e9e3  1595297780
> 2  4e08f349deb7392  641991337
> 3 4e105531747ee00b  374773009
> 4 4f1d5ef7fdb4620a  2570136926
> 5 4f63a71e6dde04cd  2117602722
> 6 4fa2f96b689624fc  3489692062, 1344510747, 1095592237, 
> 424510360, 3211239587
> sparkR.stop()
> sc <- sparkR.init()
> sqlContext <- sparkRSQL.init(sc)
> spark_df = createDataFrame(sqlContext, test1)
> # Fails
> dapplyCollect(spark_df, function(x) x)
> Caused by: org.apache.spark.SparkException: R computation failed with
>  Error in (function (..., deparse.level = 1, make.row.names = TRUE, 
> stringsAsFactors = default.stringsAsFactors())  : 
>   invalid list argument: all variables should have the same length
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59)
>   at 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29)
>   at 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:186)
>   at 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:183)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   ... 1 more
> # Works fine
> spark_df <- selectExpr(spark_df, "key", "explode(value) value") 
> dapplyCollect(spark_df, function(x) x)
> key value
> 1  4dda7d68a202e9e3 1595297780
> 2   4e08f349deb7392  641991337
> 3  4e105531747ee00b  374773009
> 4  4f1d5ef7fdb4620a 2570136926
> 5  4f63a71e6dde04cd 2117602722
> 6  4fa2f96b689624fc 3489692062
> 7  4fa2f96b689624fc 1344510747
> 8  4fa2f96b689624fc 1095592237
> 9  4fa2f96b689624fc  424510360
> 10 4fa2f96b689624fc 3211239587
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21291) R bucketBy partitionBy API

2017-12-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-21291:
-
Target Version/s: 2.3.0

> R bucketBy partitionBy API
> --
>
> Key: SPARK-21291
> URL: https://issues.apache.org/jira/browse/SPARK-21291
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>
> partitionBy exists but it's for windowspec only



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22632) Fix the behavior of timestamp values for R's DataFrame to respect session timezone

2017-12-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301006#comment-16301006
 ] 

Felix Cheung commented on SPARK-22632:
--

how are we on this for 2.3?

> Fix the behavior of timestamp values for R's DataFrame to respect session 
> timezone
> --
>
> Key: SPARK-22632
> URL: https://issues.apache.org/jira/browse/SPARK-22632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> Note: wording is borrowed from SPARK-22395. Symptom is similar and I think 
> that JIRA is well descriptive.
> When converting R's DataFrame from/to Spark DataFrame using 
> {{createDataFrame}} or {{collect}}, timestamp values behave to respect R 
> system timezone instead of session timezone.
> For example, let's say we use "America/Los_Angeles" as session timezone and 
> have a timestamp value "1970-01-01 00:00:01" in the timezone. Btw, I'm in 
> South Korea so R timezone would be "KST".
> The timestamp value from current collect() will be the following:
> {code}
> > sparkR.session(master = "local[*]", sparkConfig = 
> > list(spark.sql.session.timeZone = "America/Los_Angeles"))
> > collect(sql("SELECT cast(cast(28801 as timestamp) as string) as ts"))
>ts
> 1 1970-01-01 00:00:01
> > collect(sql("SELECT cast(28801 as timestamp) as ts"))
>ts
> 1 1970-01-01 17:00:01
> {code}
> As you can see, the value becomes "1970-01-01 17:00:01" because it respects R 
> system timezone.
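Until the fix lands, one workaround sketch (an assumption on my part, not part of this 
JIRA) is to keep the R process timezone and the session timezone aligned, so both sides 
format the value the same way:

{code}
library(SparkR)
# Make R's own timezone match spark.sql.session.timeZone before collecting.
Sys.setenv(TZ = "America/Los_Angeles")
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.sql.session.timeZone = "America/Los_Angeles"))
collect(sql("SELECT cast(28801 as timestamp) as ts"))
{code}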



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21940) Support timezone for timestamps in SparkR

2017-12-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-21940:
-
Target Version/s: 2.3.0

> Support timezone for timestamps in SparkR
> -
>
> Key: SPARK-21940
> URL: https://issues.apache.org/jira/browse/SPARK-21940
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> {{SparkR::createDataFrame()}} wipes timezone attribute from POSIXct and 
> POSIXlt. See following example:
> {code}
> > x <- data.frame(x = c(Sys.time()))
> > x
> x
> 1 2017-09-06 19:17:16
> > attr(x$x, "tzone") <- "Europe/Paris"
> > x
> x
> 1 2017-09-07 04:17:16
> > collect(createDataFrame(x))
> x
> 1 2017-09-06 19:17:16
> {code}
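A workaround sketch (again an assumption, not something from the JIRA): carry the zone 
explicitly as a formatted character column, so nothing depends on the dropped "tzone" 
attribute:

{code}
library(SparkR)
sparkR.session(master = "local[*]")

x <- data.frame(x = c(Sys.time()))
# Pre-format the timestamp in the desired zone; a character column
# survives the round trip unchanged.
x$x_paris <- format(x$x, tz = "Europe/Paris", usetz = TRUE)
collect(createDataFrame(x))
{code}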



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22632) Fix the behavior of timestamp values for R's DataFrame to respect session timezone

2017-12-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22632:
-
Target Version/s: 2.3.0

> Fix the behavior of timestamp values for R's DataFrame to respect session 
> timezone
> --
>
> Key: SPARK-22632
> URL: https://issues.apache.org/jira/browse/SPARK-22632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> Note: wording is borrowed from SPARK-22395. Symptom is similar and I think 
> that JIRA is well descriptive.
> When converting R's DataFrame from/to Spark DataFrame using 
> {{createDataFrame}} or {{collect}}, timestamp values behave to respect R 
> system timezone instead of session timezone.
> For example, let's say we use "America/Los_Angeles" as session timezone and 
> have a timestamp value "1970-01-01 00:00:01" in the timezone. Btw, I'm in 
> South Korea so R timezone would be "KST".
> The timestamp value from current collect() will be the following:
> {code}
> > sparkR.session(master = "local[*]", sparkConfig = 
> > list(spark.sql.session.timeZone = "America/Los_Angeles"))
> > collect(sql("SELECT cast(cast(28801 as timestamp) as string) as ts"))
>ts
> 1 1970-01-01 00:00:01
> > collect(sql("SELECT cast(28801 as timestamp) as ts"))
>ts
> 1 1970-01-01 17:00:01
> {code}
> As you can see, the value becomes "1970-01-01 17:00:01" because it respects R 
> system timezone.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20007) Make SparkR apply() functions robust to workers that return empty data.frame

2017-12-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20007:
-
Target Version/s: 2.3.0

> Make SparkR apply() functions robust to workers that return empty data.frame
> 
>
> Key: SPARK-20007
> URL: https://issues.apache.org/jira/browse/SPARK-20007
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> When using {{gapply()}} (or other members of the {{apply()}} family) with a 
> schema, Spark will try to parse data returned from the R process on each 
> worker as Spark DataFrame Rows based on the schema. In this case our provided 
> schema suggests that we have six columns. When an R worker returns results to 
> the JVM, SparkSQL will try to access its columns one by one and cast them to 
> proper types. If the R worker returns nothing, the JVM will throw an 
> {{ArrayIndexOutOfBoundsException}} exception.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR

2017-12-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-21208:
-
Target Version/s: 2.3.0

> Ability to "setLocalProperty" from sc, in sparkR
> 
>
> Key: SPARK-21208
> URL: https://issues.apache.org/jira/browse/SPARK-21208
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Karuppayya
>
> Checked the API 
> [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for 
> sparkR.
> Was not able to find a way to *setLocalProperty* on sc.
> Need ability to *setLocalProperty* on sparkContext(similar to available for 
> pyspark, scala)
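An untested interim sketch, assuming SparkR's experimental JVM bridge 
(sparkR.callJStatic / sparkR.callJMethod) and SparkContext.getOrCreate(); the JIRA 
itself is about adding a first-class API instead of this kind of reach-through:

{code}
library(SparkR)
sparkR.session()
# Reach into the JVM SparkContext and call setLocalProperty directly.
jsc <- sparkR.callJStatic("org.apache.spark.SparkContext", "getOrCreate")
sparkR.callJMethod(jsc, "setLocalProperty", "spark.scheduler.pool", "production")
{code}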



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR

2017-12-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301012#comment-16301012
 ] 

Felix Cheung commented on SPARK-21208:
--

any taker on this for 2.3.0?

> Ability to "setLocalProperty" from sc, in sparkR
> 
>
> Key: SPARK-21208
> URL: https://issues.apache.org/jira/browse/SPARK-21208
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Karuppayya
>
> Checked the API 
> [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for 
> sparkR.
> Was not able to find a way to *setLocalProperty* on sc.
> Need ability to *setLocalProperty* on sparkContext(similar to available for 
> pyspark, scala)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21030) extend hint syntax to support any expression for Python and R

2017-12-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-21030:
-
Target Version/s: 2.3.0

> extend hint syntax to support any expression for Python and R
> -
>
> Key: SPARK-21030
> URL: https://issues.apache.org/jira/browse/SPARK-21030
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>
> See SPARK-20854
> we need to relax checks in 
> https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422
> and
> https://github.com/apache/spark/blob/7f203a248f94df6183a4bc4642a3d873171fef29/R/pkg/R/DataFrame.R#L3746
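For illustration, SparkR's hint() today takes a hint name plus simple parameters; what 
this JIRA asks for is letting arbitrary expressions through. The second call below is 
only a hypothetical shape of the relaxed API, not something that works today:

{code}
library(SparkR)
sparkR.session(master = "local[*]")

df <- sql("SELECT * FROM range(10)")
# Works today: a named hint with no extra parameters.
explain(hint(df, "broadcast"))
# Hypothetical once the checks are relaxed: Column expressions as parameters.
# explain(hint(df, "some_hint", df$id, 10))
{code}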



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21030) extend hint syntax to support any expression for Python and R

2017-12-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301014#comment-16301014
 ] 

Felix Cheung commented on SPARK-21030:
--

any taker on this for 2.3.0?

> extend hint syntax to support any expression for Python and R
> -
>
> Key: SPARK-21030
> URL: https://issues.apache.org/jira/browse/SPARK-21030
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>
> See SPARK-20854
> we need to relax checks in 
> https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422
> and
> https://github.com/apache/spark/blob/7f203a248f94df6183a4bc4642a3d873171fef29/R/pkg/R/DataFrame.R#L3746



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21291) R bucketBy partitionBy API

2017-12-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301010#comment-16301010
 ] 

Felix Cheung commented on SPARK-21291:
--

any taker on this for 2.3.0?

> R bucketBy partitionBy API
> --
>
> Key: SPARK-21291
> URL: https://issues.apache.org/jira/browse/SPARK-21291
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>
> partitionBy exists but it's for windowspec only



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21940) Support timezone for timestamps in SparkR

2017-12-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301007#comment-16301007
 ] 

Felix Cheung commented on SPARK-21940:
--

any taker on this for 2.3.0?

> Support timezone for timestamps in SparkR
> -
>
> Key: SPARK-21940
> URL: https://issues.apache.org/jira/browse/SPARK-21940
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> {{SparkR::createDataFrame()}} wipes timezone attribute from POSIXct and 
> POSIXlt. See following example:
> {code}
> > x <- data.frame(x = c(Sys.time()))
> > x
> x
> 1 2017-09-06 19:17:16
> > attr(x$x, "tzone") <- "Europe/Paris"
> > x
> x
> 1 2017-09-07 04:17:16
> > collect(createDataFrame(x))
> x
> 1 2017-09-06 19:17:16
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22843) R localCheckpoint API

2017-12-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22843:
-
Target Version/s: 2.3.0

> R localCheckpoint API
> -
>
> Key: SPARK-22843
> URL: https://issues.apache.org/jira/browse/SPARK-22843
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22843) R localCheckpoint API

2017-12-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301005#comment-16301005
 ] 

Felix Cheung commented on SPARK-22843:
--

any taker on this for 2.3.0?

> R localCheckpoint API
> -
>
> Key: SPARK-22843
> URL: https://issues.apache.org/jira/browse/SPARK-22843
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22766) Install R linter package in spark lib directory

2017-12-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301003#comment-16301003
 ] 

Felix Cheung commented on SPARK-22766:
--

How does this compare to SPARK-22063?

> Install R linter package in spark lib directory
> ---
>
> Key: SPARK-22766
> URL: https://issues.apache.org/jira/browse/SPARK-22766
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Hossein Falaki
>
> The {{dev/lint-r.R}} file uses devtools to install the {{jimhester/lintr}} 
> package in the default site library location, which is 
> {{/usr/local/lib/R/site-library}}. This is not recommended and can fail 
> because we are running this script as jenkins while that directory is owned 
> by root.
> We need to install the linter package in a local directory.
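A sketch of that direction (an assumption on my part, not the merged change): point 
.libPaths() at a Spark-local library before installing, so nothing is written to the 
root-owned site library:

{code}
local_lib <- file.path(Sys.getenv("SPARK_HOME"), "R", "lib")
dir.create(local_lib, recursive = TRUE, showWarnings = FALSE)
# Installs go to the first entry of .libPaths() by default.
.libPaths(c(local_lib, .libPaths()))
devtools::install_github("jimhester/lintr")
{code}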



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2017-12-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300965#comment-16300965
 ] 

Felix Cheung commented on SPARK-21727:
--

Neil McQuarrie is going to work on this

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil McQuarrie
>Assignee: Neil McQuarrie
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))}}
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2017-12-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-21727:


Assignee: Neil McQuarrie

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil McQuarrie
>Assignee: Neil McQuarrie
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))}}
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22851) Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect checksum

2017-12-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300599#comment-16300599
 ] 

Felix Cheung commented on SPARK-22851:
--

Ah, then it’s the browser - I know Safari does unpack it




> Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect 
> checksum
> --
>
> Key: SPARK-22851
> URL: https://issues.apache.org/jira/browse/SPARK-22851
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: John Brock
>Priority: Critical
>
> The correct sha512 is:
> 349ee4bc95c760259c1c28aaae0d9db4146115b03d710fe57685e0d18c9f9538d0b90d9c28f4031ed45f69def5bd217a5bf77fd50f685d93eb207445787f2685.
> However, the file I downloaded from 
> http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
>  is giving me a different sha512:
> 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9
> It looks like this mirror has a file that isn't actually gzipped, just 
> tarred. If I ungzip one of the copies of spark-2.2.1-bin-hadoop2.7.tgz with 
> the correct sha512, and take the sha512 of the resulting tar, I get the same 
> incorrect hash above of 
> 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9.
> I asked some colleagues to download the incorrect file themselves to check 
> the hash -- some of them got a file that was gzipped and some didn't. I'm 
> assuming there's some caching or mirroring happening that may give you a 
> different file than the one I got.
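
A hedged sketch (my own, not from the report) of one way to confirm locally whether a downloaded .tgz is really gzip-compressed, by checking for the gzip magic bytes; the file name is an assumed local download path.

{code}
f <- "spark-2.2.1-bin-hadoop2.7.tgz"          # assumed local download
magic <- readBin(f, what = "raw", n = 2L)     # first two bytes of the file
if (identical(magic, as.raw(c(0x1f, 0x8b)))) {
  message("File starts with the gzip magic bytes - it is gzip-compressed")
} else {
  message("No gzip magic bytes - likely a plain tar, matching the bad-mirror symptom")
}
{code}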



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22851) Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect checksum

2017-12-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300338#comment-16300338
 ] 

Felix Cheung commented on SPARK-22851:
--

That’s odd. Mirror replication is handled transparently. If it is correct from
some mirrors but not the others, we need to open an INFRA ticket?




> Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect 
> checksum
> --
>
> Key: SPARK-22851
> URL: https://issues.apache.org/jira/browse/SPARK-22851
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: John Brock
>Priority: Critical
>
> The correct sha512 is:
> 349ee4bc95c760259c1c28aaae0d9db4146115b03d710fe57685e0d18c9f9538d0b90d9c28f4031ed45f69def5bd217a5bf77fd50f685d93eb207445787f2685.
> However, the file I downloaded from 
> http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
>  is giving me a different sha512:
> 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9
> It looks like this mirror has a file that isn't actually gzipped, just 
> tarred. If I ungzip one of the copies of spark-2.2.1-bin-hadoop2.7.tgz with 
> the correct sha512, and take the sha512 of the resulting tar, I get the same 
> incorrect hash above of 
> 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9.
> I asked some colleagues to download the incorrect file themselves to check 
> the hash -- some of them got a file that was gzipped and some didn't. I'm 
> assuming there's some caching or mirroring happening that may give you a 
> different file than the one I got.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Timeline for Spark 2.3

2017-12-20 Thread Felix Cheung
+1
I think the earlier we cut a branch the better.


From: Michael Armbrust 
Sent: Tuesday, December 19, 2017 4:41:44 PM
To: Holden Karau
Cc: Sameer Agarwal; Erik Erlandson; dev
Subject: Re: Timeline for Spark 2.3

Do people really need to be around for the branch cut (modulo the person 
cutting the branch)?

1st or 2nd doesn't really matter to me, but I am +1 kicking this off as soon as 
we enter the new year :)

Michael

On Tue, Dec 19, 2017 at 4:39 PM, Holden Karau 
> wrote:
Sounds reasonable, although I'd choose the 2nd perhaps just since lots of folks 
are off on the 1st?

On Tue, Dec 19, 2017 at 4:36 PM, Sameer Agarwal 
> wrote:
Let's aim for the 2.3 branch cut on 1st Jan and RC1 a week after that (i.e., 
week of 8th Jan)?


On Fri, Dec 15, 2017 at 12:54 AM, Holden Karau 
> wrote:
So personally I’d be in favour of pushing to early January; doing a release
over the holidays is a little rough with herding all of the people to vote.

On Thu, Dec 14, 2017 at 11:49 PM Erik Erlandson 
> wrote:
I wanted to check in on the state of the 2.3 freeze schedule.  Original 
proposal was "late Dec", which is a bit open to interpretation.

We are working to get some refactoring done on the integration testing for the 
Kubernetes back-end in preparation for testing upcoming release candidates, 
however holiday vacation time is about to begin taking its toll both on 
upstream reviewing and on the "downstream" spark-on-kube fork.

If the freeze pushed into January, that would take some of the pressure off the 
kube back-end upstreaming. However, regardless, I was wondering if the dates 
could be clarified.
Cheers,
Erik


On Mon, Nov 13, 2017 at 5:13 PM, dji...@dataxu.com 
> wrote:
Hi,

What is the process to request an issue/fix to be included in the next
release? Is there a place to vote for features?
I am interested in https://issues.apache.org/jira/browse/SPARK-13127, to see
if we can get Spark upgrade parquet to 1.9.0, which addresses the
https://issues.apache.org/jira/browse/PARQUET-686.
Can we include the fix in Spark 2.3 release?

Thanks,

Dong



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org


--
Twitter: https://twitter.com/holdenkarau



--
Sameer Agarwal
Software Engineer | Databricks Inc.
http://cs.berkeley.edu/~sameerag



--
Twitter: https://twitter.com/holdenkarau



[jira] [Created] (SPARK-22843) R localCheckpoint API

2017-12-20 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-22843:


 Summary: R localCheckpoint API
 Key: SPARK-22843
 URL: https://issues.apache.org/jira/browse/SPARK-22843
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Docker images

2017-12-19 Thread Felix Cheung
Hi!

Is there a reason the official docker images
https://hub.docker.com/r/_/flink/

have their sources in a different repo
https://github.com/docker-flink/docker-flink
?


Re: [DISCUSS] Review process

2017-12-19 Thread Felix Cheung
+1
What would be the rough heuristic people will be comfortable with- what is 
small and what is big?

_
From: Anthony Corbacho <anthonycorba...@apache.org>
Sent: Monday, December 18, 2017 3:02 PM
Subject: Re: [DISCUSS] Review process
To: <dev@zeppelin.apache.org>


I think for large PRs (new features or big changes) we should still keep more
than one approval before merging, since they will require more attention.

But for a bug fix I think one approval should be enough.

On Tue, 19 Dec 2017 at 7:49 AM Jeff Zhang <zjf...@gmail.com> wrote:

> Agree with @Felix, especially for the large PR and PR of new features it
> is still necessary to have more than +1.
>
> I think committer have the ability to identity whether this PR is
> complicated enough that it needs another committer's review. As long as we
> have consensus, we could commit some PRs without delay and some PRs for more
> reviews. So that we can balance the development speed and code quality.
>
>
>
> Miquel Angel Andreu Febrer <miquelangeland...@gmail.com>于2017年12月19日周二
> 上午2:07写道:
>
> > You can automate that process in jenkins and manage the delay time of
> > merging a pull request
> >
> > El 18 dic. 2017 18:03, "Felix Cheung" <felixcheun...@hotmail.com>
> > escribió:
> >
> > > I think it is still useful to have a time delay after one approval since
> > > often there is further feedback and updates after one committer
> > approval.
> > >
> > > Also github has a tab for all PRs you are subscribed to, it shouldn’t
> be
> > > very hard to review all the approved ones again.
> > >
> > > 
> > > From: Jongyoul Lee <jongy...@gmail.com>
> > > Sent: Monday, December 18, 2017 8:04:51 AM
> > > To: dev@zeppelin.apache.org
> > > Subject: Re: [DISCUSS] Review process
> > >
> > > Good for summary. But actually, no committer merges without delay after
> > > reviewing it. So I thought we should clarify it officially.
> > >
> > > Now, some committers, including me, will be able to merge some PRs
> > without
> > > delay and burden.
> > >
> > > On Mon, 18 Dec 2017 at 11:27 PM moon soo Lee <m...@apache.org> wrote:
> > >
> > > > Hi,
> > > >
> > > > Current review process[1] does require either at least a +1 from
> > > committer
> > > > or 24 hours for lazy consensus.
> > > >
> > > > Pullrequest can be open for 1 or 2 days for additional review, but i
> > > think
> > > > they're not hard requirements. (e.g. Hotfixes are already being
> merged
> > > > without waiting additional review)
> > > >
> > > > So, technically, current policy allows any committer can start
> review,
> > > mark
> > > > +1 and merge immediately without any delay if necessary.
> > > >
> > > > Thanks,
> > > > moon
> > > >
> > > > [1]
> > > >
> > > > http://zeppelin.apache.org/contribution/contributions.
> > > html#the-review-process
> > > >
> > > >
> > > > On Mon, Dec 18, 2017 at 2:13 AM Belousov Maksim Eduardovich <
> > > > m.belou...@tinkoff.ru> wrote:
> > > >
> > > > > +1 for non-delay merging.
> > > > > Our team have opened approved PR [1] for 5 days.
> > > > >
> > > > > I didn't find any pages with `consensus how to review and merge
> > > > > contributions`.
> > > > > It would be nice to write a check list for reviewer.
> > > > >
> > > > > The development of Zeppelin is very important for us and we want to
> > > > review
> > > > > new commits.
> > > > >
> > > > >
> > > > > [1] https://github.com/apache/zeppelin/pull/2697
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Maksim Belousov
> > > > >
> > > > > -Original Message-
> > > > > From: Jongyoul Lee [mailto:jongy...@gmail.com]
> > > > > Sent: Monday, December 18, 2017 12:12 PM
> > > > > To: dev <dev@zeppelin.apache.org>
> > > > > Subject: Re: [DISCUSS] Review process
> > > > >
> > > > > Thank you for the replying it. I think so
> > > > >
> > > > > On Mon, Dec 18, 2017 at 3:15 PM, Miquel Angel Andreu Febrer <
> >

Re: [DISCUSS] Review process

2017-12-18 Thread Felix Cheung
I think it is still useful to have a time delay after one approval since often
there is further feedback and updates after one committer approval.

Also github has a tab for all PRs you are subscribed to, it shouldn’t be very 
hard to review all the approved ones again.


From: Jongyoul Lee 
Sent: Monday, December 18, 2017 8:04:51 AM
To: dev@zeppelin.apache.org
Subject: Re: [DISCUSS] Review process

Good for summary. But actually, no committer merges without delay after
reviewing it. So I thought we should clarify it officially.

Now, some committers, including me, will be able to merge some PRs without
delay and burden.

On Mon, 18 Dec 2017 at 11:27 PM moon soo Lee  wrote:

> Hi,
>
> Current review process[1] does require either at least a +1 from committer
> or 24 hours for lazy consensus.
>
> Pullrequest can be open for 1 or 2 days for additional review, but i think
> they're not hard requirements. (e.g. Hotfixes are already being merged
> without waiting additional review)
>
> So, technically, current policy allows any committer can start review, mark
> +1 and merge immediately without any delay if necessary.
>
> Thanks,
> moon
>
> [1]
>
> http://zeppelin.apache.org/contribution/contributions.html#the-review-process
>
>
> On Mon, Dec 18, 2017 at 2:13 AM Belousov Maksim Eduardovich <
> m.belou...@tinkoff.ru> wrote:
>
> > +1 for non-delay merging.
> > Our team have opened approved PR [1] for 5 days.
> >
> > I didn't find any pages with `consensus how to review and merge
> > contributions`.
> > It would be nice to write a check list for reviewer.
> >
> > The development of Zeppelin is very important for us and we want to
> review
> > new commits.
> >
> >
> > [1] https://github.com/apache/zeppelin/pull/2697
> >
> >
> > Thanks,
> > Maksim Belousov
> >
> > -Original Message-
> > From: Jongyoul Lee [mailto:jongy...@gmail.com]
> > Sent: Monday, December 18, 2017 12:12 PM
> > To: dev 
> > Subject: Re: [DISCUSS] Review process
> >
> > Thank you for the replying it. I think so
> >
> > On Mon, Dec 18, 2017 at 3:15 PM, Miquel Angel Andreu Febrer <
> > miquelangeland...@gmail.com> wrote:
> >
> > > I agree, it is necessary to have no delay after merging. I think
> > > it will help speed up processes and help contributors
> > >
> > > El 18 dic. 2017 4:33, "Jongyoul Lee"  escribió:
> > >
> > > Hi committers,
> > >
> > > I want to suggest one thing about our reviewing process. We have the
> > > policy to wait for one day before merging some PRs. AFAIK, it's
> > > meant to reduce mistakes and prevent abuse from an owner committing
> > > without a thorough review. I would like to change this policy to
> > > remove the delay after merging. Recently, we don't have many reviewers
> > > and committers who can merge continuously, and in my case I
> > > sometimes forget some PRs that I have to merge. And I also believe
> > > all committers have consensus on how to review and merge contributions.
> > >
> > > How do you think of it?
> > >
> > > JL
> > >
> > > --
> > > 이종열, Jongyoul Lee, 李宗烈
> > > http://madeng.net
> > >
> >
> >
> >
> > --
> > 이종열, Jongyoul Lee, 李宗烈
> > http://madeng.net
> >
>
--
이종열, Jongyoul Lee, 李宗烈
http://madeng.net


Re: Accessing Spark UI from Zeppelin

2017-12-16 Thread Felix Cheung
You could set it to replace http://masternode with your custom http hostname.

Perhaps you want that to be set to a known, public (and authenticated?) IP/URL?
If you do have one, it can be set in the Zeppelin config before Zeppelin starts.


From: ankit jain 
Sent: Thursday, December 14, 2017 2:35:32 PM
To: users@zeppelin.apache.org
Cc: Esteban de Jesus Hernandez; Jhon; Emmanuel; Oliver; Phil
Subject: Accessing Spark UI from Zeppelin


Hi Zeppelin users,

We are following https://issues.apache.org/jira/browse/ZEPPELIN-2949 to launch 
spark ui.

Our Zeppelin instance is deployed on AWS EMR master node and setting 
zeppelin.spark.uiWebUrl to a url which elb maps to 
https://masternode:4040.

When user clicks on spark url within Zeppelin it redirects him to Yarn RM( 
something like http://masternode:20888/proxy/application_1511906080313_0023/) 
which fails to load.

Usually to access EMR Web interfaces requires to setup a SSH tunnel and change 
proxy settings in the browser - 
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-history.html

Is there a way we can avoid users having to setup ssh tunnel and allow direct 
access to Spark UI?

Ideally, we will implement a filter which does Authentication on the user and 
then redirect to Spark UI – right now not sure what the redirect URL should be?

--
Thanks & Regards,
Anki


Re: zeppelin build fails with DependencyConvergence error

2017-12-16 Thread Felix Cheung
Instead of exclusion, would it be better to use the version in the cloudera 
repo?

Please do consider contributing these changes back to Zeppelin source. Thanks!

_
From: Ruslan Dautkhanov 
Sent: Monday, December 11, 2017 3:42 PM
Subject: Re: zeppelin build fails with DependencyConvergence error
To: Zeppelin Users 


Looks like the master branch of Zeppelin still has a compatibility issue with
Cloudera dependencies.

When built using


mvn clean package -DskipTests -Pspark-2.2 -Dhadoop.version=2.6.0-cdh5.12.1 
-Phadoop-2.6 -Pvendor-repo -pl '!...list of excluded packages' -e

Maven fails on a Jackson convergence error - see the email below for more details.
Looks like there was a change in Zeppelin that upgraded Jackson's version?
So now it conflicts with the older Jackson library referenced by the Cloudera repo.

Workaround: Zeppelin builds fine with the pom change [1] - the question now is
whether Zeppelin would still be expected to function correctly with these
exclusions?



[1]

--- a/zeppelin-zengine/pom.xml
+++ b/zeppelin-zengine/pom.xml
@@ -364,6 +364,30 @@
           <groupId>com.google.guava</groupId>
           <artifactId>guava</artifactId>
         </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-annotations</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-databind</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.codehaus.jackson</groupId>
+          <artifactId>jackson-mapper-asl</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.codehaus.jackson</groupId>
+          <artifactId>jackson-core-asl</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.apache.zookeeper</groupId>
+          <artifactId>zookeeper</artifactId>
+        </exclusion>
       </exclusions>
     </dependency>




On Sun, Aug 27, 2017 at 2:25 PM, Ruslan Dautkhanov 
> wrote:
Building from a current Zeppelin snapshot fails with
org.apache.maven.plugins.enforcer.DependencyConvergence -
see details below.

Build command
/opt/maven/maven-latest/bin/mvn clean package -DskipTests -Pspark-2.2 
-Dhadoop.version=2.6.0-cdh5.12.0 -Phadoop-2.6 -Pvendor-repo -Pscala-2.10 
-Psparkr -pl '!..excluded certain modules..' -e

maven 3.5.0
jdk 1.8.0_141
RHEL 7.3
npm.x86_64   1:3.10.10-1.6.11.1.1.el7
nodejs.x86_641:6.11.1-1.el7 @epel
latest zeppelin snapshot

Any ideas? It's my first attempt to build on rhel7/jdk8 .. never seen this 
problem before.

Thanks,
Ruslan



[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for 
org.apache.zeppelin:zeppelin-spark-dependencies_2.10:jar:0.8.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin 
com.googlecode.maven-download-plugin:download-maven-plugin @ line 940, column 15
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin 
com.googlecode.maven-download-plugin:download-maven-plugin @ line 997, column 15
[WARNING]
[WARNING] Some problems were encountered while building the effective model for 
org.apache.zeppelin:zeppelin-spark_2.10:jar:0.8.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.scala-tools:maven-scala-plugin @ line 467, 
column 15
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.apache.maven.plugins:maven-surefire-plugin 
@ line 475, column 15
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.apache.maven.plugins:maven-compiler-plugin 
@ line 486, column 15
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.scala-tools:maven-scala-plugin @ line 496, 
column 15
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.apache.maven.plugins:maven-surefire-plugin 
@ line 504, column 15
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten 
the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support 
building such malformed projects.
[WARNING]
[WARNING] The project org.apache.zeppelin:zeppelin-web:war:0.8.0-SNAPSHOT uses 
prerequisites which is only intended for maven-plugin projects but not for non 
maven-plugin projects. For such purposes you should use the 
maven-enforcer-plugin. See 
https://maven.apache.org/enforcer/enforcer-rules/requireMavenVersion.html


... [skip]

[INFO] 
[INFO] Building Zeppelin: Zengine 0.8.0-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @ zeppelin-zengine ---
[INFO]

[jira] [Commented] (SPARK-22812) Failing cran-check on master

2017-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293648#comment-16293648
 ] 

Felix Cheung commented on SPARK-22812:
--

Not exactly... what’s the environment? Seems like something is wrong 
connecting/pulling from CRAN.






> Failing cran-check on master 
> -
>
> Key: SPARK-22812
> URL: https://issues.apache.org/jira/browse/SPARK-22812
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hossein Falaki
>
> When I run {{R/run-tests.sh}} or {{R/check-cran.sh}} I get the following 
> failure message:
> {code}
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) :
>   dims [product 22] do not match the length of object [0]
> {code}
> cc [~felixcheung] have you experienced this error before?
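
A hedged diagnostic sketch (my own, not part of the report) for checking whether the build machine can reach CRAN at all, since the incoming-feasibility check needs to pull CRAN metadata:

{code}
getOption("repos")                                  # repos the R session is configured with
cr <- contrib.url("https://cloud.r-project.org")    # contrib URL for a known CRAN mirror
ap <- available.packages(contriburl = cr)           # 0 rows or a warning suggests a connectivity problem
nrow(ap)
{code}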



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-14 Thread Felix Cheung
;)
The credentials for the user that publishes to PyPI are PMC-only.

+Holden

Had discussed this in the other thread I sent to private@ last week.


On Thu, Dec 14, 2017 at 4:34 AM Sean Owen <so...@cloudera.com> wrote:

> On the various access questions here -- what do you need to have that
> access? We definitely need to give you all necessary access if you're the
> release manager!
>
>
> On Thu, Dec 14, 2017 at 6:32 AM Felix Cheung <felixche...@apache.org>
> wrote:
>
>> And I don’t have access to publish python.
>>
>> On Wed, Dec 13, 2017 at 9:55 AM Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> The R artifacts have some issue that Felix and I are debugging. Lets not
>>> block the announcement for that.
>>>
>>>
>>>


Re: [RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-14 Thread Felix Cheung
And I don’t have access to publish python.

On Wed, Dec 13, 2017 at 9:55 AM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> The R artifacts have some issue that Felix and I are debugging. Lets not
> block the announcement for that.
>
> Thanks
>
> Shivaram
>
> On Wed, Dec 13, 2017 at 5:59 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> Looks like Maven artifacts are up, site's up -- what about the Python and
>> R artifacts?
>> I can also move the spark.apache/docs/latest link to point to 2.2.1 if
>> it's pretty ready.
>> We should announce the release officially too then.
>>
>> On Wed, Dec 6, 2017 at 5:00 PM Felix Cheung <felixche...@apache.org>
>> wrote:
>>
>>> I saw the svn move on Monday so I’m working on the website updates.
>>>
>>> I will look into maven today. I will ask if I couldn’t do it.
>>>
>>>
>>> On Wed, Dec 6, 2017 at 10:49 AM Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Pardon, did this release finish? I don't see it in Maven. I know there
>>>> was some question about getting a hand in finishing the release process,
>>>> including copying artifacts in svn. Was there anything else you're waiting
>>>> on someone to do?
>>>>
>>>>
>>>> On Fri, Dec 1, 2017 at 2:10 AM Felix Cheung <felixche...@apache.org>
>>>> wrote:
>>>>
>>>>> This vote passes. Thanks everyone for testing this release.
>>>>>
>>>>>
>>>>> +1:
>>>>>
>>>>> Sean Owen (binding)
>>>>>
>>>>> Herman van Hövell tot Westerflier (binding)
>>>>>
>>>>> Wenchen Fan (binding)
>>>>>
>>>>> Shivaram Venkataraman (binding)
>>>>>
>>>>> Felix Cheung
>>>>>
>>>>> Henry Robinson
>>>>>
>>>>> Hyukjin Kwon
>>>>>
>>>>> Dongjoon Hyun
>>>>>
>>>>> Kazuaki Ishizaki
>>>>>
>>>>> Holden Karau
>>>>>
>>>>> Weichen Xu
>>>>>
>>>>>
>>>>> 0: None
>>>>>
>>>>> -1: None
>>>>>
>>>>
>


Re: [RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-06 Thread Felix Cheung
I saw the svn move on Monday so I’m working on the website updates.

I will look into maven today. I will ask if I couldn’t do it.


On Wed, Dec 6, 2017 at 10:49 AM Sean Owen <so...@cloudera.com> wrote:

> Pardon, did this release finish? I don't see it in Maven. I know there was
> some question about getting a hand in finishing the release process,
> including copying artifacts in svn. Was there anything else you're waiting
> on someone to do?
>
>
> On Fri, Dec 1, 2017 at 2:10 AM Felix Cheung <felixche...@apache.org>
> wrote:
>
>> This vote passes. Thanks everyone for testing this release.
>>
>>
>> +1:
>>
>> Sean Owen (binding)
>>
>> Herman van Hövell tot Westerflier (binding)
>>
>> Wenchen Fan (binding)
>>
>> Shivaram Venkataraman (binding)
>>
>> Felix Cheung
>>
>> Henry Robinson
>>
>> Hyukjin Kwon
>>
>> Dongjoon Hyun
>>
>> Kazuaki Ishizaki
>>
>> Holden Karau
>>
>> Weichen Xu
>>
>>
>> 0: None
>>
>> -1: None
>>
>


[jira] [Updated] (SPARK-20201) Flaky Test: org.apache.spark.sql.catalyst.expressions.OrderingSuite

2017-12-02 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20201:
-
Target Version/s:   (was: 2.2.1)

> Flaky Test: org.apache.spark.sql.catalyst.expressions.OrderingSuite
> ---
>
> Key: SPARK-20201
> URL: https://issues.apache.org/jira/browse/SPARK-20201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takuya Ueshin
>Priority: Minor
>  Labels: flaky-test
>
> This test failed recently here:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/2856/testReport/junit/org.apache.spark.sql.catalyst.expressions/OrderingSuite/SPARK_16845__GeneratedClass$SpecificOrdering_grows_beyond_64_KB/
> Dashboard
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.catalyst.expressions.OrderingSuite_name=SPARK-16845%3A+GeneratedClass%24SpecificOrdering+grows+beyond+64+KB
> Error Message
> {code}
> java.lang.StackOverflowError
> {code}
> {code}
> com.google.common.util.concurrent.ExecutionError: java.lang.StackOverflowError
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:903)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:188)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:43)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:887)
>   at 
> org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply$mcV$sp(OrderingSuite.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply(OrderingSuite.scala:131)
>   at 
> org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply(OrderingSuite.scala:131)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.Spar

Re: [DISCUSS] Change some default settings for avoiding unintended usages

2017-12-01 Thread Felix Cheung
I’d +1 docker or container support (mesos, dc/os, k8s)

But I think that they are separate things. If users are authenticated and the
interpreter is impersonating each user, the risk of system disruption should be
low. This is typically how things are secured in a system: through a user directory
(e.g. LDAP) and access control (a normal user can’t sudo and delete everything).

Thoughts?

_
From: Jeff Zhang <zjf...@gmail.com>
Sent: Thursday, November 30, 2017 11:51 PM
Subject: Re: [DISCUSS] Change some default settings for avoiding unintended 
usages
To: <dev@zeppelin.apache.org>
Cc: users <us...@zeppelin.apache.org>



+1 for running interpreter process in docker container.



Jongyoul Lee <jongy...@gmail.com<mailto:jongy...@gmail.com>>于2017年12月1日周五 
下午3:36写道:
Yes, exactly, this is not only the shell interpreter problem, all can run
any script through python and Scala. Shell is just an example.

Using docker looks good but it cannot avoid unintended usage of resources,
like coin mining.

On Fri, Dec 1, 2017 at 2:36 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
wrote:

> I don’t think that’s limited to the shell interpreter.
>
> You can run any arbitrary program or script from python or Scala (or java)
> as well.
>
> _
> From: Jeff Zhang <zjf...@gmail.com<mailto:zjf...@gmail.com>>
> Sent: Wednesday, November 29, 2017 4:00 PM
> Subject: Re: [DISCUSS] Change some default settings for avoiding
> unintended usages
> To: <dev@zeppelin.apache.org<mailto:dev@zeppelin.apache.org>>
> Cc: users <us...@zeppelin.apache.org<mailto:us...@zeppelin.apache.org>>
>
>
>
> Shell interpreter is a black hole for security, usually we don't recommend
> or allow user to use shell.
>
> We may need to refactor the shell interpreter, running under zeppelin user
> is too dangerous.
>
>
>
>
>
> Jongyoul Lee <jongy...@gmail.com<mailto:jongy...@gmail.com>>于2017年11月29日周三 
> 下午11:44写道:
>
> > Hi, users and dev,
> >
> > Recently, I've got an issue about the abnormal usage of some
> interpreters.
> > Zeppelin's users can access shell by shell and python interpreters. It
> > means all users can run or execute what they want even if it harms the
> > system. Thus I agree that we need to change some default settings to
> > prevent this kind of abusing situation. Before we proceed to do it, I
> want
> > to listen to others' opinions.
> >
> > Feel free to reply this email
> >
> > Regards,
> > Jongyoul
> >
> > --
> > 이종열, Jongyoul Lee, 李宗烈
> > http://madeng.net
> >
>
>
>


--
이종열, Jongyoul Lee, 李宗烈
http://madeng.net




Re: [DISCUSS] Change some default settings for avoiding unintended usages

2017-12-01 Thread Felix Cheung
I’d +1 docker or container support (mesos, dc/os, k8s)

But I think that they are separate things. If users are authenticated and the
interpreter is impersonating each user, the risk of system disruption should be
low. This is typically how things are secured in a system: through a user directory
(e.g. LDAP) and access control (a normal user can’t sudo and delete everything).

Thoughts?

_
From: Jeff Zhang <zjf...@gmail.com>
Sent: Thursday, November 30, 2017 11:51 PM
Subject: Re: [DISCUSS] Change some default settings for avoiding unintended 
usages
To: <d...@zeppelin.apache.org>
Cc: users <users@zeppelin.apache.org>



+1 for running interpreter process in docker container.



Jongyoul Lee <jongy...@gmail.com<mailto:jongy...@gmail.com>>于2017年12月1日周五 
下午3:36写道:
Yes, exactly, this is not only the shell interpreter problem, all can run
any script through python and Scala. Shell is just an example.

Using docker looks good but it cannot avoid unintended usage of resources,
like coin mining.

On Fri, Dec 1, 2017 at 2:36 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
wrote:

> I don’t think that’s limited to the shell interpreter.
>
> You can run any arbitrary program or script from python or Scala (or java)
> as well.
>
> _
> From: Jeff Zhang <zjf...@gmail.com<mailto:zjf...@gmail.com>>
> Sent: Wednesday, November 29, 2017 4:00 PM
> Subject: Re: [DISCUSS] Change some default settings for avoiding
> unintended usages
> To: <d...@zeppelin.apache.org<mailto:d...@zeppelin.apache.org>>
> Cc: users <users@zeppelin.apache.org<mailto:users@zeppelin.apache.org>>
>
>
>
> Shell interpreter is a black hole for security, usually we don't recommend
> or allow user to use shell.
>
> We may need to refactor the shell interpreter, running under zeppelin user
> is too dangerous.
>
>
>
>
>
> Jongyoul Lee <jongy...@gmail.com<mailto:jongy...@gmail.com>>于2017年11月29日周三 
> 下午11:44写道:
>
> > Hi, users and dev,
> >
> > Recently, I've got an issue about the abnormal usage of some
> interpreters.
> > Zeppelin's users can access shell by shell and python interpreters. It
> > means all users can run or execute what they want even if it harms the
> > system. Thus I agree that we need to change some default settings to
> > prevent this kind of abusing situation. Before we proceed to do it, I
> want
> > to listen to others' opinions.
> >
> > Feel free to reply this email
> >
> > Regards,
> > Jongyoul
> >
> > --
> > 이종열, Jongyoul Lee, 李宗烈
> > http://madeng.net
> >
>
>
>


--
이종열, Jongyoul Lee, 李宗烈
http://madeng.net




[RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-01 Thread Felix Cheung
This vote passes. Thanks everyone for testing this release.


+1:

Sean Owen (binding)

Herman van Hövell tot Westerflier (binding)

Wenchen Fan (binding)

Shivaram Venkataraman (binding)

Felix Cheung

Henry Robinson

Hyukjin Kwon

Dongjoon Hyun

Kazuaki Ishizaki

Holden Karau

Weichen Xu


0: None

-1: None




On Wed, Nov 29, 2017 at 3:21 PM Weichen Xu <weichen...@databricks.com>
wrote:

> +1
>
> On Thu, Nov 30, 2017 at 6:27 AM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> +1
>>
>> SHA, MD5 and signatures look fine. Built and ran Maven tests on my
>> Macbook.
>>
>> Thanks
>> Shivaram
>>
>> On Wed, Nov 29, 2017 at 10:43 AM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> PySpark install into a virtualenv works, PKG-INFO looks correctly
>>> populated (mostly checking for the pypandoc conversion there).
>>>
>>> Thanks for your hard work Felix (and all of the testers :)) :)
>>>
>>> On Wed, Nov 29, 2017 at 9:33 AM, Wenchen Fan <cloud0...@gmail.com>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Nov 30, 2017 at 1:28 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com>
>>>> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests
>>>>> for core/sql-core/sql-catalyst/mllib/mllib-local have passed.
>>>>>
>>>>> $ java -version
>>>>> openjdk version "1.8.0_131"
>>>>> OpenJDK Runtime Environment (build
>>>>> 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
>>>>> OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
>>>>>
>>>>> % build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7
>>>>> -T 24 clean package install
>>>>> % build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl
>>>>> core -pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local
>>>>> ...
>>>>> Run completed in 13 minutes, 54 seconds.
>>>>> Total number of tests run: 1118
>>>>> Suites: completed 170, aborted 0
>>>>> Tests: succeeded 1118, failed 0, canceled 0, ignored 6, pending 0
>>>>> All tests passed.
>>>>> [INFO]
>>>>> 
>>>>> [INFO] Reactor Summary:
>>>>> [INFO]
>>>>> [INFO] Spark Project Core . SUCCESS
>>>>> [17:13 min]
>>>>> [INFO] Spark Project ML Local Library . SUCCESS [
>>>>>  6.065 s]
>>>>> [INFO] Spark Project Catalyst . SUCCESS
>>>>> [11:51 min]
>>>>> [INFO] Spark Project SQL .. SUCCESS
>>>>> [17:55 min]
>>>>> [INFO] Spark Project ML Library ....... SUCCESS
>>>>> [17:05 min]
>>>>> [INFO]
>>>>> 
>>>>> [INFO] BUILD SUCCESS
>>>>> [INFO]
>>>>> 
>>>>> [INFO] Total time: 01:04 h
>>>>> [INFO] Finished at: 2017-11-30T01:48:15+09:00
>>>>> [INFO] Final Memory: 128M/329M
>>>>> [INFO]
>>>>> 
>>>>> [WARNING] The requested profile "hive" could not be activated because
>>>>> it does not exist.
>>>>>
>>>>> Kazuaki Ishizaki
>>>>>
>>>>>
>>>>>
>>>>> From:Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>>> To:Hyukjin Kwon <gurwls...@gmail.com>
>>>>> Cc:Spark dev list <dev@spark.apache.org>, Felix Cheung <
>>>>> felixche...@apache.org>, Sean Owen <so...@cloudera.com>
>>>>> Date:2017/11/29 12:56
>>>>> Subject:Re: [VOTE] Spark 2.2.1 (RC2)
>>>>> --
>>>>>
>>>>>
>>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> RC2 is tested on CentOS, too.
>>>>>
>>>>> Bests,
>>>>> Dongjoon

[jira] [Comment Edited] (SPARK-22472) Datasets generate random values for null primitive types

2017-11-30 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274042#comment-16274042
 ] 

Felix Cheung edited comment on SPARK-22472 at 12/1/17 7:09 AM:
---

I guess it's too late to add to 
http://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide 
(and we don't seem to document this in patch release there anyway)

I guess I'll just add this to the website on the actual release announcement 
like
http://spark.apache.org/releases/spark-release-2-1-2.html
or 
http://spark.apache.org/releases/spark-release-2-1-0.html#known-issues

sounds good?



was (Author: felixcheung):
I guess it's too late to add to 
http://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide 
(and we don't seem to document this in patch release there anyway)

I guess I'll just add this to the website on the actual release announcement 
like
http://spark.apache.org/releases/spark-release-2-1-2.html

sounds good?


> Datasets generate random values for null primitive types
> 
>
> Key: SPARK-22472
> URL: https://issues.apache.org/jira/browse/SPARK-22472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Vladislav Kuzemchik
>Assignee: Wenchen Fan
>  Labels: release-notes
> Fix For: 2.2.1, 2.3.0
>
>
> Not sure if it ever were reported.
> {code}
> scala> val s = 
> sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v")
> s: org.apache.spark.sql.DataFrame = [v: bigint]
> scala> s.show(false)
> ++
> |v   |
> ++
> |null|
> |1   |
> |5   |
> ++
> scala> s.as[Long].map(v => v*2).show(false)
> +-+
> |value|
> +-+
> |-2   |
> |2|
> |10   |
> +-+
> scala> s.select($"v"*2).show(false)
> +---+
> |(v * 2)|
> +---+
> |null   |
> |2  |
> |10 |
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types

2017-11-30 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274042#comment-16274042
 ] 

Felix Cheung commented on SPARK-22472:
--

I guess it's too late to add to 
http://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide 
(and we don't seem to document this in patch release there anyway)

I guess I'll just add this to the website on the actual release announcement 
like
http://spark.apache.org/releases/spark-release-2-1-2.html

sounds good?


> Datasets generate random values for null primitive types
> 
>
> Key: SPARK-22472
> URL: https://issues.apache.org/jira/browse/SPARK-22472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Vladislav Kuzemchik
>Assignee: Wenchen Fan
>  Labels: release-notes
> Fix For: 2.2.1, 2.3.0
>
>
> Not sure if it ever were reported.
> {code}
> scala> val s = 
> sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v")
> s: org.apache.spark.sql.DataFrame = [v: bigint]
> scala> s.show(false)
> ++
> |v   |
> ++
> |null|
> |1   |
> |5   |
> ++
> scala> s.as[Long].map(v => v*2).show(false)
> +-+
> |value|
> +-+
> |-2   |
> |2|
> |10   |
> +-+
> scala> s.select($"v"*2).show(false)
> +---+
> |(v * 2)|
> +---+
> |null   |
> |2  |
> |10 |
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22632) Fix the behavior of timestamp values for R's DataFrame to respect session timezone

2017-11-30 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274029#comment-16274029
 ] 

Felix Cheung commented on SPARK-22632:
--

interesting re: timezone on macOS
https://cran.r-project.org/src/base/NEWS


> Fix the behavior of timestamp values for R's DataFrame to respect session 
> timezone
> --
>
> Key: SPARK-22632
> URL: https://issues.apache.org/jira/browse/SPARK-22632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> Note: wording is borrowed from SPARK-22395. Symptom is similar and I think 
> that JIRA is well descriptive.
> When converting R's DataFrame from/to Spark DataFrame using 
> {{createDataFrame}} or {{collect}}, timestamp values behave to respect R 
> system timezone instead of session timezone.
> For example, let's say we use "America/Los_Angeles" as session timezone and 
> have a timestamp value "1970-01-01 00:00:01" in the timezone. Btw, I'm in 
> South Korea so R timezone would be "KST".
> The timestamp value from current collect() will be the following:
> {code}
> > sparkR.session(master = "local[*]", sparkConfig = 
> > list(spark.sql.session.timeZone = "America/Los_Angeles"))
> > collect(sql("SELECT cast(cast(28801 as timestamp) as string) as ts"))
>ts
> 1 1970-01-01 00:00:01
> > collect(sql("SELECT cast(28801 as timestamp) as ts"))
>ts
> 1 1970-01-01 17:00:01
> {code}
> As you can see, the value becomes "1970-01-01 17:00:01" because it respects R 
> system timezone.
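
A hedged user-level workaround sketch (my own, not a fix for the underlying issue): align the R process timezone with the Spark session timezone before collecting, so both sides render the same wall-clock time. The timezone value is just the one from the example above.

{code}
Sys.setenv(TZ = "America/Los_Angeles")   # make R's formatting timezone match the session timezone
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.sql.session.timeZone = "America/Los_Angeles"))
collect(sql("SELECT cast(28801 as timestamp) as ts"))
{code}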



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] Change some default settings for avoiding unintended usages

2017-11-30 Thread Felix Cheung
I don’t think that’s limited to the shell interpreter.

You can run any arbitrary program or script from python or Scala (or java) as 
well.

_
From: Jeff Zhang 
Sent: Wednesday, November 29, 2017 4:00 PM
Subject: Re: [DISCUSS] Change some default settings for avoiding unintended 
usages
To: 
Cc: users 


Shell interpreter is a black hole for security, usually we don't recommend
or allow user to use shell.

We may need to refactor the shell interpreter, running under zeppelin user
is too dangerous.





Jongyoul Lee 于2017年11月29日周三 下午11:44写道:

> Hi, users and dev,
>
> Recently, I've got an issue about the abnormal usage of some interpreters.
> Zeppelin's users can access shell by shell and python interpreters. It
> means all users can run or execute what they want even if it harms the
> system. Thus I agree that we need to change some default settings to
> prevent this kind of abusing situation. Before we proceed to do it, I want
> to listen to others' opinions.
>
> Feel free to reply this email
>
> Regards,
> Jongyoul
>
> --
> 이종열, Jongyoul Lee, 李宗烈
> http://madeng.net
>




Re: [DISCUSS] Change some default settings for avoiding unintended usages

2017-11-30 Thread Felix Cheung
I don’t think that’s limited to the shell interpreter.

You can run any arbitrary program or script from python or Scala (or java) as 
well.

_
From: Jeff Zhang 
Sent: Wednesday, November 29, 2017 4:00 PM
Subject: Re: [DISCUSS] Change some default settings for avoiding unintended 
usages
To: 
Cc: users 


Shell interpreter is a black hole for security, usually we don't recommend
or allow user to use shell.

We may need to refactor the shell interpreter, running under zeppelin user
is too dangerous.





Jongyoul Lee 于2017年11月29日周三 下午11:44写道:

> Hi, users and dev,
>
> Recently, I've got an issue about the abnormal usage of some interpreters.
> Zeppelin's users can access shell by shell and python interpreters. It
> means all users can run or execute what they want even if it harms the
> system. Thus I agree that we need to change some default settings to
> prevent this kind of abusing situation. Before we proceed to do it, I want
> to listen to others' opinions.
>
> Feel free to reply this email
>
> Regards,
> Jongyoul
>
> --
> 이종열, Jongyoul Lee, 李宗烈
> http://madeng.net
>




[jira] [Updated] (SPARK-22637) CatalogImpl.refresh() has quadratic complexity for a view

2017-11-28 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22637:
-
Target Version/s: 2.2.2, 2.3.0  (was: 2.3.0)

> CatalogImpl.refresh() has quadratic complexity for a view
> -
>
> Key: SPARK-22637
> URL: https://issues.apache.org/jira/browse/SPARK-22637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.2.2, 2.3.0
>
>
> {{org.apache.spark.sql.internal.CatalogImpl.refreshTable}} uses 
> {{foreach(..)}} to refresh all tables in a view. This traverses all nodes in 
> the subtree and calls {{LogicalPlan.refresh()}} on these nodes. However 
> {{LogicalPlan.refresh()}} is also refreshing its children, as a result 
> refreshing a large view can be quite expensive.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types

2017-11-28 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270178#comment-16270178
 ] 

Felix Cheung commented on SPARK-22472:
--

This is going out in 2.2.1 - do we need a rel note on this change?

> Datasets generate random values for null primitive types
> 
>
> Key: SPARK-22472
> URL: https://issues.apache.org/jira/browse/SPARK-22472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Vladislav Kuzemchik
>Assignee: Wenchen Fan
>  Labels: release-notes
> Fix For: 2.2.1, 2.3.0
>
>
> Not sure if it ever were reported.
> {code}
> scala> val s = 
> sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v")
> s: org.apache.spark.sql.DataFrame = [v: bigint]
> scala> s.show(false)
> ++
> |v   |
> ++
> |null|
> |1   |
> |5   |
> ++
> scala> s.as[Long].map(v => v*2).show(false)
> +-+
> |value|
> +-+
> |-2   |
> |2|
> |10   |
> +-+
> scala> s.select($"v"*2).show(false)
> +---+
> |(v * 2)|
> +---+
> |null   |
> |2  |
> |10 |
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22627) Fix formatting of headers in configuration.html page

2017-11-28 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270169#comment-16270169
 ] 

Felix Cheung commented on SPARK-22627:
--

quite possibly this is related to jekyll changes..
yap this looks correct 
https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-docs/_site/configuration.html#execution-behavior

> Fix formatting of headers in configuration.html page
> 
>
> Key: SPARK-22627
> URL: https://issues.apache.org/jira/browse/SPARK-22627
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>    Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.2.1
>
>
> On the page https://spark.apache.org/docs/latest/configuration.html one can 
> see headers in the HTML which look like left overs from the conversion from 
> Markdown:
> {code}
> ### Execution Behavior
> ...
> ### Networking
> ...
> ### Scheduling
> ...
> etc...
> {code}
> The most problems with formatting has the paragraph 
> {code}
> ### Cluster Managers Each cluster manager in Spark has additional 
> configuration options. Configurations can be found on the pages for each 
> mode:  [YARN](running-on-yarn.html#configuration)  
> [Mesos](running-on-mesos.html#configuration)  [Standalone 
> Mode](spark-standalone.html#cluster-launch-scripts) # Environment Variables 
> ...
> {code}
> As a reader of the documentation I want the headers in the text to be 
> formatted correctly and not showing Markdown syntax. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [Spark R]: dapply only works for very small datasets

2017-11-28 Thread Felix Cheung
You can find more discussions in
https://issues.apache.org/jira/browse/SPARK-18924
And
https://issues.apache.org/jira/browse/SPARK-17634

I suspect the cost is linear - so partitioning the data into smaller chunks 
with more executors (one core each) running in parallel would probably help a 
bit.

Unfortunately this is an area where we really could use some improvements,
and I think it *should* be possible (hmm  
https://databricks.com/blog/2017/10/06/accelerating-r-workflows-on-databricks.html.
 ;)
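
A rough SparkR sketch of that suggestion (illustrative only - the dataset, partition count, and schema are assumptions, not tuned values):

library(SparkR)
sparkR.session(master = "local[8]")

df <- createDataFrame(faithful)                 # stand-in dataset from the SparkR docs
df <- repartition(df, numPartitions = 64L)      # many small partitions so more R workers run in parallel

schema <- structType(structField("eruptions", "double"),
                     structField("waiting", "double"),
                     structField("waiting_secs", "double"))
df1 <- dapply(df, function(x) { cbind(x, waiting_secs = x$waiting * 60) }, schema)
head(collect(df1))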

_
From: Kunft, Andreas <andreas.ku...@tu-berlin.de>
Sent: Tuesday, November 28, 2017 3:11 AM
Subject: AW: [Spark R]: dapply only works for very small datasets
To: Felix Cheung <felixcheun...@hotmail.com>, <user@spark.apache.org>



Thanks for the fast reply.


I tried it locally, with 1 - 8 slots on a 8 core machine w/ 25GB memory as well 
as on 4 nodes with the same specifications.

When I shrink the data to around 100MB,

it runs in about 1 hour for 1 core and about 6 min with 8 cores.


I'm aware that the serDe takes time, but it seems there must be something else 
off considering these numbers.


____
Von: Felix Cheung <felixcheun...@hotmail.com>
Gesendet: Montag, 27. November 2017 20:20
An: Kunft, Andreas; user@spark.apache.org
Betreff: Re: [Spark R]: dapply only works for very small datasets

What’s the number of executors and/or the number of partitions you are working with?

I’m afraid most of the problem is with the serialization/deserialization
overhead between the JVM and R...


From: Kunft, Andreas <andreas.ku...@tu-berlin.de>
Sent: Monday, November 27, 2017 10:27:33 AM
To: user@spark.apache.org
Subject: [Spark R]: dapply only works for very small datasets


Hello,


I tried to execute some user defined functions with R using the airline arrival 
performance dataset.

While the examples from the documentation for the `<-` apply operator work 
perfectly fine on a size ~9GB,

the `dapply` operator fails to finish even after ~4 hours.


I'm using a function similar to the one from the documentation:


df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)

I checked Stackoverflow and even asked the question there as well, but till now 
the only answer I got was:
"Avoid using dapply, gapply"

So, am I missing some parameters or is there a general limitation?
I'm using Spark 2.2.0 and read the data from HDFS 2.7.1 and played with several 
DOPs.

Best
Andreas





Re: [VOTE] Spark 2.2.1 (RC2)

2017-11-28 Thread Felix Cheung
+1

Thanks Sean. Please vote!

Tested various scenarios with R package. Ubuntu, Debian, Windows r-devel
and release and on r-hub. Verified CRAN checks are clean (only 1 NOTE!) and
no leaked files (.cache removed, /tmp clean)


On Sun, Nov 26, 2017 at 11:55 AM Sean Owen <so...@cloudera.com> wrote:

> Yes it downloads recent releases. The test worked for me on a second try,
> so I suspect a bad mirror. If this comes up frequently we can just add
> retry logic, as the closer.lua script will return different mirrors each
> time.
>
> The tests all pass for me on the latest Debian, so +1 for this release.
>
> (I committed the change to set -Xss4m for tests consistently, but this
> shouldn't block a release.)
>
>
> On Sat, Nov 25, 2017 at 12:47 PM Felix Cheung <felixche...@apache.org>
> wrote:
>
>> Ah sorry digging through the history it looks like this is changed
>> relatively recently and should only download previous releases.
>>
>> Perhaps we are intermittently hitting a mirror that doesn’t have the
>> files?
>>
>>
>>
>> https://github.com/apache/spark/commit/daa838b8886496e64700b55d1301d348f1d5c9ae
>>
>>
>> On Sat, Nov 25, 2017 at 10:36 AM Felix Cheung <felixche...@apache.org>
>> wrote:
>>
>>> Thanks Sean.
>>>
>>> For the second one, it looks like the
>>>  HiveExternalCatalogVersionsSuite is trying to download the release tgz
>>> from the official Apache mirror, which won’t work unless the release is
>>> actually, released?
>>>
>>> val preferredMirror =
>>> Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true;, "-q",
>>> "-O", "-").!!.trim
>>> val url = s"
>>> $preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
>>>
>>> It’s probably getting an error page instead.
>>>
>>>
>>> On Sat, Nov 25, 2017 at 10:28 AM Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> I hit the same StackOverflowError as in the previous RC test, but,
>>>> pretty sure this is just because the increased thread stack size JVM flag
>>>> isn't applied consistently. This seems to resolve it:
>>>>
>>>> https://github.com/apache/spark/pull/19820
>>>>
>>>> This wouldn't block release IMHO.
>>>>
>>>>
>>>> I am currently investigating this failure though -- seems like the
>>>> mechanism that downloads Spark tarballs needs fixing, or updating, in the
>>>> 2.2 branch?
>>>>
>>>> HiveExternalCatalogVersionsSuite:
>>>>
>>>> gzip: stdin: not in gzip format
>>>>
>>>> tar: Child returned status 1
>>>>
>>>> tar: Error is not recoverable: exiting now
>>>>
>>>> *** RUN ABORTED ***
>>>>
>>>>   java.io.IOException: Cannot run program "./bin/spark-submit" (in
>>>> directory "/tmp/test-spark/spark-2.0.2"): error=2, No such file or 
>>>> directory
>>>>
>>>> On Sat, Nov 25, 2017 at 12:34 AM Felix Cheung <felixche...@apache.org>
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.2.1. The vote is open until Friday December 1, 2017 at
>>>>> 8:00:00 am UTC and passes if a majority of at least 3 PMC +1 votes
>>>>> are cast.
>>>>>
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.2.1
>>>>>
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>>
>>>>>
>>>>> The tag to be voted on is v2.2.1-rc2
>>>>> https://github.com/apache/spark/tree/v2.2.1-rc2  (
>>>>> e30e2698a2193f0bbdcd4edb884710819ab6397c)
>>>>>
>>>>> List of JIRA tickets resolved in this release can be found here
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12340470
>>>>>
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-bin/
>>>>>
>>>>> Release artifacts are signed with the following key:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>

Re: [Spark R]: dapply only works for very small datasets

2017-11-27 Thread Felix Cheung
What's the number of executors and/or the number of partitions you are working with?

I'm afraid most of the problem is the serialization/deserialization
overhead between the JVM and R...


From: Kunft, Andreas 
Sent: Monday, November 27, 2017 10:27:33 AM
To: user@spark.apache.org
Subject: [Spark R]: dapply only works for very small datasets


Hello,


I tried to execute some user defined functions with R using the airline arrival 
performance dataset.

While the examples from the documentation for the `<-` apply operator work
perfectly fine on a dataset of ~9 GB,

the `dapply` operator fails to finish even after ~4 hours.


I'm using a function similar to the one from the documentation:


df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)

I checked Stack Overflow and even asked the question there as well, but so far
the only answer I got was:
"Avoid using dapply, gapply"

So, am I missing some parameters, or is there a general limitation?
I'm using Spark 2.2.0, read the data from HDFS 2.7.1, and played with several
degrees of parallelism (DOPs).

Best
Andreas



Re: [VOTE] Spark 2.2.1 (RC2)

2017-11-25 Thread Felix Cheung
Ah sorry, digging through the history it looks like this was changed
relatively recently and should only download previous releases.

Perhaps we are intermittently hitting a mirror that doesn’t have the files?


https://github.com/apache/spark/commit/daa838b8886496e64700b55d1301d348f1d5c9ae


On Sat, Nov 25, 2017 at 10:36 AM Felix Cheung <felixche...@apache.org>
wrote:

> Thanks Sean.
>
> For the second one, it looks like the  HiveExternalCatalogVersionsSuite is
> trying to download the release tgz from the official Apache mirror, which
> won’t work unless the release is actually, released?
>
> val preferredMirror =
> Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true;, "-q",
> "-O", "-").!!.trim
> val url = s"
> $preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
>
> It’s probably getting an error page instead.
>
>
> On Sat, Nov 25, 2017 at 10:28 AM Sean Owen <so...@cloudera.com> wrote:
>
>> I hit the same StackOverflowError as in the previous RC test, but, pretty
>> sure this is just because the increased thread stack size JVM flag isn't
>> applied consistently. This seems to resolve it:
>>
>> https://github.com/apache/spark/pull/19820
>>
>> This wouldn't block release IMHO.
>>
>>
>> I am currently investigating this failure though -- seems like the
>> mechanism that downloads Spark tarballs needs fixing, or updating, in the
>> 2.2 branch?
>>
>> HiveExternalCatalogVersionsSuite:
>>
>> gzip: stdin: not in gzip format
>>
>> tar: Child returned status 1
>>
>> tar: Error is not recoverable: exiting now
>>
>> *** RUN ABORTED ***
>>
>>   java.io.IOException: Cannot run program "./bin/spark-submit" (in
>> directory "/tmp/test-spark/spark-2.0.2"): error=2, No such file or directory
>>
>> On Sat, Nov 25, 2017 at 12:34 AM Felix Cheung <felixche...@apache.org>
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.2.1. The vote is open until Friday December 1, 2017 at 8:00:00 am UTC
>>> and passes if a majority of at least 3 PMC +1 votes are cast.
>>>
>>>
>>> [ ] +1 Release this package as Apache Spark 2.2.1
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>>
>>> The tag to be voted on is v2.2.1-rc2
>>> https://github.com/apache/spark/tree/v2.2.1-rc2  (
>>> e30e2698a2193f0bbdcd4edb884710819ab6397c)
>>>
>>> List of JIRA tickets resolved in this release can be found here
>>> https://issues.apache.org/jira/projects/SPARK/versions/12340470
>>>
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1257/
>>>
>>> The documentation corresponding to this release can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-docs/_site/index.html
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>> the current RC and see if anything important breaks; in Java/Scala you
>>>>> can add the staging repository to your project's resolvers and test with the
>>>>> RC (make sure to clean up the artifact cache before/after so you don't end
>>>>> up building with an out-of-date RC going forward).
>>>
>>> *What should happen to JIRA tickets still targeting 2.2.1?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.2.2.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless t

Re: [VOTE] Spark 2.2.1 (RC2)

2017-11-25 Thread Felix Cheung
Thanks Sean.

For the second one, it looks like the  HiveExternalCatalogVersionsSuite is
trying to download the release tgz from the official Apache mirror, which
won’t work unless the release is actually, released?

val preferredMirror =
Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true;, "-q", "
-O", "-").!!.trim
val url = s"
$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"

It’s probably getting an error page instead.


On Sat, Nov 25, 2017 at 10:28 AM Sean Owen <so...@cloudera.com> wrote:

> I hit the same StackOverflowError as in the previous RC test, but, pretty
> sure this is just because the increased thread stack size JVM flag isn't
> applied consistently. This seems to resolve it:
>
> https://github.com/apache/spark/pull/19820
>
> This wouldn't block release IMHO.
>
>
> I am currently investigating this failure though -- seems like the
> mechanism that downloads Spark tarballs needs fixing, or updating, in the
> 2.2 branch?
>
> HiveExternalCatalogVersionsSuite:
>
> gzip: stdin: not in gzip format
>
> tar: Child returned status 1
>
> tar: Error is not recoverable: exiting now
>
> *** RUN ABORTED ***
>
>   java.io.IOException: Cannot run program "./bin/spark-submit" (in
> directory "/tmp/test-spark/spark-2.0.2"): error=2, No such file or directory
>
> On Sat, Nov 25, 2017 at 12:34 AM Felix Cheung <felixche...@apache.org>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.1. The vote is open until Friday December 1, 2017 at 8:00:00 am UTC
>> and passes if a majority of at least 3 PMC +1 votes are cast.
>>
>>
>> [ ] +1 Release this package as Apache Spark 2.2.1
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>>
>> The tag to be voted on is v2.2.1-rc2
>> https://github.com/apache/spark/tree/v2.2.1-rc2  (
>> e30e2698a2193f0bbdcd4edb884710819ab6397c)
>>
>> List of JIRA tickets resolved in this release can be found here
>> https://issues.apache.org/jira/projects/SPARK/versions/12340470
>>
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1257/
>>
>> The documentation corresponding to this release can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-docs/_site/index.html
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install the
>> current RC and see if anything important breaks; in Java/Scala you can
>> add the staging repository to your project's resolvers and test with the RC
>> (make sure to clean up the artifact cache before/after so you don't end up
>> building with an out-of-date RC going forward).
>>
>> *What should happen to JIRA tickets still targeting 2.2.1?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.2.2.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.2.0. That being said if
>> there is something which is a regression from 2.2.0 that has not been
>> correctly targeted please ping a committer to help target the issue (you
>> can see the open issues listed as impacting Spark 2.2.1 / 2.2.2 here
>> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20OPEN%20AND%20(affectedVersion%20%3D%202.2.1%20OR%20affectedVersion%20%3D%202.2.2)>
>> .
>>
>> *What are the unresolved issues targeted for 2.2.1
>> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.1>?*
>>
>> At the time of this writing, there is one intermittent failure SPARK-20201
>> <https://issues.apache.org/jira/browse/SPARK-20201> which we are
>> tracking since 2.2.0.
>>
>>


[VOTE] Spark 2.2.1 (RC2)

2017-11-24 Thread Felix Cheung
Please vote on releasing the following candidate as Apache Spark version
2.2.1. The vote is open until Friday December 1, 2017 at 8:00:00 am UTC and
passes if a majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.2.1

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/


The tag to be voted on is v2.2.1-rc2
https://github.com/apache/spark/tree/v2.2.1-rc2  (e30e2698a2193f0bbdcd4edb884710819ab6397c)

List of JIRA tickets resolved in this release can be found here
https://issues.apache.org/jira/projects/SPARK/versions/12340470


The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-bin/

Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1257/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-docs/_site/index.html


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install the
current RC and see if anything important breaks; in Java/Scala you can
add the staging repository to your project's resolvers and test with the RC
(make sure to clean up the artifact cache before/after so you don't end up
building with an out-of-date RC going forward).

*What should happen to JIRA tickets still targeting 2.2.1?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.2.2.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.2.0. That being said, if
there is something which is a regression from 2.2.0 that has not been
correctly targeted please ping a committer to help target the issue (you
can see the open issues listed as impacting Spark 2.2.1 / 2.2.2 here
<https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20OPEN%20AND%20(affectedVersion%20%3D%202.2.1%20OR%20affectedVersion%20%3D%202.2.2)>
.

*What are the unresolved issues targeted for 2.2.1
<https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.1>?*

At the time of this writing, there is one intermittent failure SPARK-20201
<https://issues.apache.org/jira/browse/SPARK-20201> which we are tracking
since 2.2.0.


[jira] [Updated] (SPARK-22402) Allow fetcher URIs to be downloaded to specific locations relative to Mesos Sandbox

2017-11-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22402:
-
Affects Version/s: (was: 2.2.2)
   2.2.0

> Allow fetcher URIs to be downloaded to specific locations relative to Mesos 
> Sandbox
> ---
>
> Key: SPARK-22402
> URL: https://issues.apache.org/jira/browse/SPARK-22402
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Arthur Rand
>Priority: Minor
>
> Currently {{spark.mesos.uris}} will only place files in the sandbox, but some 
> configuration files and applications may need to be in specific locations. 
> The Mesos proto allows for this with the optional {{output_file}} field 
> (https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L671). 
> We can expose this through the command line with {{--conf 
> spark.mesos.uris=:}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22402) Allow fetcher URIs to be downloaded to specific locations relative to Mesos Sandbox

2017-11-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265584#comment-16265584
 ] 

Felix Cheung commented on SPARK-22402:
--

Any update on this one?


> Allow fetcher URIs to be downloaded to specific locations relative to Mesos 
> Sandbox
> ---
>
> Key: SPARK-22402
> URL: https://issues.apache.org/jira/browse/SPARK-22402
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Arthur Rand
>Priority: Minor
>
> Currently {{spark.mesos.uris}} will only place files in the sandbox, but some 
> configuration files and applications may need to be in specific locations. 
> The Mesos proto allows for this with the optional {{output_file}} field 
> (https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L671). 
> We can expose this through the command line with {{--conf 
> spark.mesos.uris=:}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22495) Fix setup of SPARK_HOME variable on Windows

2017-11-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22495:
-
Fix Version/s: 2.2.1

> Fix setup of SPARK_HOME variable on Windows
> ---
>
> Key: SPARK-22495
> URL: https://issues.apache.org/jira/browse/SPARK-22495
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Jakub Nowacki
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> On Windows, pip installed pyspark is unable to find out the spark home. There 
> is already proposed change, sufficient details and discussions in 
> https://github.com/apache/spark/pull/19370 and SPARK-18136



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22595) flaky test: CastSuite.SPARK-22500: cast for struct should not generate codes beyond 64KB

2017-11-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22595:
-
Target Version/s: 2.2.1, 2.3.0  (was: 2.3.0, 2.2.2)

> flaky test: CastSuite.SPARK-22500: cast for struct should not generate codes 
> beyond 64KB
> 
>
> Key: SPARK-22595
> URL: https://issues.apache.org/jira/browse/SPARK-22595
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Wenchen Fan
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22591) GenerateOrdering shouldn't change ctx.INPUT_ROW

2017-11-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22591:
-
Target Version/s: 2.2.1, 2.3.0  (was: 2.3.0, 2.2.2)

> GenerateOrdering shouldn't change ctx.INPUT_ROW
> ---
>
> Key: SPARK-22591
> URL: https://issues.apache.org/jira/browse/SPARK-22591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.1, 2.3.0
>
>
> {{GenerateOrdering}} changes {{ctx.INPUT_ROW}} but doesn't restore the 
> original value. Since {{ctx.INPUT_ROW}} is used when generating code, it is 
> risky to change it arbitrarily.
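
A minimal sketch of the save-and-restore pattern that description implies. The
FakeCodegenContext below is a made-up stand-in (not Spark's real
CodegenContext), and the names are illustrative only; the point is that the
temporary row variable is visible only inside the block, so later code
generation still sees the original INPUT_ROW:

object InputRowScopeSketch {
  class FakeCodegenContext {
    var INPUT_ROW: String = "i" // name of the current input row variable in generated code
  }

  // Save, mutate, restore: the temporary row name never leaks out of `body`.
  def withInputRow[T](ctx: FakeCodegenContext, row: String)(body: => T): T = {
    val saved = ctx.INPUT_ROW
    ctx.INPUT_ROW = row
    try body
    finally ctx.INPUT_ROW = saved // restore even if code generation throws
  }

  def main(args: Array[String]): Unit = {
    val ctx = new FakeCodegenContext
    val code = withInputRow(ctx, "a") { s"${ctx.INPUT_ROW}.getInt(0)" }
    println(code)                // a.getInt(0)
    assert(ctx.INPUT_ROW == "i") // later codegen still sees the original row name
  }
}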



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22548) Incorrect nested AND expression pushed down to JDBC data source

2017-11-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22548:
-
Target Version/s: 2.2.1, 2.3.0  (was: 2.3.0, 2.2.2)

> Incorrect nested AND expression pushed down to JDBC data source
> ---
>
> Key: SPARK-22548
> URL: https://issues.apache.org/jira/browse/SPARK-22548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jia Li
>Assignee: Jia Li
> Fix For: 2.1.3, 2.2.1, 2.3.0
>
>
> Let’s say I have a JDBC data source table ‘foobar’ with 3 rows:
> NAME  THEID
> ==
> fred  1
> mary  2
> joe 'foo' "bar"  3
> This query returns an incorrect result:
> SELECT * FROM foobar WHERE (THEID > 0 AND TRIM(NAME) = 'mary') OR (NAME = 
> 'fred')
> It’s supposed to return:
> fred  1
> mary  2
> But it returns:
> fred  1
> mary  2
> joe 'foo' "bar"  3
> This is because one leg of the nested AND predicate, TRIM(NAME) = 'mary', 
> cannot be pushed down but is lost during the JDBC push-down filter translation. 
> The same translation method is also called by Data Source V2. I have a fix for 
> this issue and will open a PR. 
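
A toy sketch of the translation rule the description implies -- this is not
Spark's actual translateFilter code, and the expression types below are made
up for illustration: an AND may be pushed to the source only when both legs
translate; dropping just the untranslatable leg (the bug) widens the pushed
filter and lets the extra row through.

sealed trait Expr
case class GreaterThan(col: String, value: Int) extends Expr
case class EqualTo(col: String, value: String) extends Expr
case class EqualToTrimmed(col: String, value: String) extends Expr // TRIM(col) = value
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr

object PushdownSketch {
  // Returns Some(sql) only if the whole expression can be evaluated by the source.
  def translate(e: Expr): Option[String] = e match {
    case GreaterThan(c, v)    => Some(s"$c > $v")
    case EqualTo(c, v)        => Some(s"$c = '$v'")
    case EqualToTrimmed(_, _) => None // the JDBC source cannot evaluate TRIM(...)
    case And(l, r) =>
      // Both legs must translate; otherwise push nothing for this AND.
      for { ls <- translate(l); rs <- translate(r) } yield s"($ls AND $rs)"
    case Or(l, r) =>
      for { ls <- translate(l); rs <- translate(r) } yield s"($ls OR $rs)"
  }

  def main(args: Array[String]): Unit = {
    // (THEID > 0 AND TRIM(NAME) = 'mary') OR (NAME = 'fred')
    val pred = Or(And(GreaterThan("THEID", 0), EqualToTrimmed("NAME", "mary")),
                  EqualTo("NAME", "fred"))
    // The inner AND does not fully translate, so nothing is pushed down and
    // Spark has to evaluate the whole predicate itself after the scan.
    println(translate(pred)) // None
  }
}

With that rule the query returns only fred and mary, because the untranslated
predicate is applied by Spark after the rows come back from JDBC.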



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17920) HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

2017-11-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-17920:
-
Target Version/s: 2.2.1, 2.3.0  (was: 2.3.0, 2.2.2)

> HiveWriterContainer passes null configuration to serde.initialize, causing 
> NullPointerException in AvroSerde when using avro.schema.url
> ---
>
> Key: SPARK-17920
> URL: https://issues.apache.org/jira/browse/SPARK-17920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: AWS EMR 5.0.0: Spark 2.0.0, Hive 2.1.0
>Reporter: James Norvell
>Assignee: Vinod KC
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
> Attachments: avro.avsc, avro_data
>
>
> When HiveWriterContainer initializes a serde it explicitly passes null for the 
> Configuration:
> https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161
> When attempting to write to a table stored as Avro with avro.schema.url set, 
> this causes a NullPointerException when it tries to get the FileSystem for 
> the URL:
> https://github.com/apache/hive/blob/release-2.1.0-rc3/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerdeUtils.java#L153
> Reproduction:
> {noformat}
> spark-sql> create external table avro_in (a string) stored as avro location 
> '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> create external table avro_out (a string) stored as avro location 
> '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> select * from avro_in;
> hello
> Time taken: 1.986 seconds, Fetched 1 row(s)
> spark-sql> insert overwrite table avro_out select * from avro_in;
> 16/10/13 19:34:47 WARN AvroSerDe: Encountered exception determining schema. 
> Returning signal schema to indicate problem
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:359)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:236)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 

[jira] [Updated] (SPARK-22500) 64KB JVM bytecode limit problem with cast

2017-11-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22500:
-
Target Version/s: 2.2.1, 2.3.0  (was: 2.2.0, 2.3.0)

> 64KB JVM bytecode limit problem with cast
> -
>
> Key: SPARK-22500
> URL: https://issues.apache.org/jira/browse/SPARK-22500
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>
> {{Cast}} can throw an exception due to the 64KB JVM bytecode limit when it is 
> used with a lot of struct fields
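
A hedged repro sketch of the kind of query this describes, assuming a
spark-shell session (a SparkSession named spark); the field count of 1000 is
arbitrary. Casting a struct generates code per field, so a wide enough struct
can push the generated method past the 64KB limit on unpatched builds:

// Illustrative only -- build an arbitrarily wide struct and cast it.
import org.apache.spark.sql.functions._

val wide = struct((1 to 1000).map(i => lit(i).as(s"c$i")): _*)
val df   = spark.range(1).select(wide.as("s"))

// Casting the whole struct (here to a string) generates per-field cast code.
df.select(col("s").cast("string")).show(false)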



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22549) 64KB JVM bytecode limit problem with concat_ws

2017-11-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22549:
-
Target Version/s: 2.2.1, 2.3.0  (was: 2.3.0, 2.2.2)

> 64KB JVM bytecode limit problem with concat_ws
> --
>
> Key: SPARK-22549
> URL: https://issues.apache.org/jira/browse/SPARK-22549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>
> {{concat_ws}} can throw an exception due to the 64KB JVM bytecode limit when 
> it is used with a lot of arguments



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22550) 64KB JVM bytecode limit problem with elt

2017-11-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22550:
-
Target Version/s: 2.2.1, 2.3.0

> 64KB JVM bytecode limit problem with elt
> 
>
> Key: SPARK-22550
> URL: https://issues.apache.org/jira/browse/SPARK-22550
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>
> {{elt}} can throw an exception due to the 64KB JVM bytecode limit when it is 
> used with a lot of arguments



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22508) 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()

2017-11-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22508:
-
Target Version/s: 2.2.1, 2.3.0  (was: 2.3.0, 2.2.2)

> 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()
> -
>
> Key: SPARK-22508
> URL: https://issues.apache.org/jira/browse/SPARK-22508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>
> {{GenerateUnsafeRowJoiner.create()}} can throw an exception due to the 64KB 
> JVM bytecode limit when it is used with a schema that has a lot of fields



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22544) FileStreamSource should use its own hadoop conf to call globPathIfNecessary

2017-11-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22544:
-
Target Version/s: 2.2.1, 2.3.0  (was: 2.3.0, 2.2.2)
   Fix Version/s: (was: 2.2.2)
  2.2.1

> FileStreamSource should use its own hadoop conf to call globPathIfNecessary
> ---
>
> Key: SPARK-22544
> URL: https://issues.apache.org/jira/browse/SPARK-22544
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.1, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


