[jira] [Comment Edited] (SPARK-23206) Additional Memory Tuning Metrics

2018-05-10 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470777#comment-16470777
 ] 

Felix Cheung edited comment on SPARK-23206 at 5/10/18 5:20 PM:
---

yes, for us network and disk IO stats. We have been discussing with Edwina and 
her team.


was (Author: felixcheung):
yes, for use network and disk IO stats. We have been discussing with Edwina and 
her team.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
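
As a quick, purely illustrative check of the unused-memory formula quoted above
(the executor count, memory setting, and peak heap figure below are made-up
numbers, not data from this ticket), a minimal Scala sketch:

// Estimate per-application unused executor memory using the formula from the
// issue description: numExecutors * (spark.executor.memory - max JVM used memory).
object UnusedMemoryEstimate {
  def unusedGb(numExecutors: Int, executorMemoryGb: Double, maxJvmUsedGb: Double): Double =
    numExecutors * (executorMemoryGb - maxJvmUsedGb)

  def main(args: Array[String]): Unit = {
    // Hypothetical application: 100 executors, 9 GB each, ~3.2 GB peak JVM heap used
    // (roughly the 35% average utilization mentioned in the description).
    val wasted = unusedGb(numExecutors = 100, executorMemoryGb = 9.0, maxJvmUsedGb = 3.2)
    println(f"Approximate unused memory for this application: $wasted%.0f GB") // ~580 GB
  }
}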






[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-05-10 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470777#comment-16470777
 ] 

Felix Cheung commented on SPARK-23206:
--

yes, for us network and disk IO stats. We have been discussing with Edwina and 
her team.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.






Re: Revisiting Online serving of Spark models?

2018-05-10 Thread Felix Cheung
Huge +1 on this!


From: holden.ka...@gmail.com  on behalf of Holden Karau 

Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of 
MLlib models which could be deployed without the big Spark JARs and without a 
SparkContext or SparkSession.  There are related commercial offerings like this 
: ) but the overhead of maintaining those offerings is pretty high.  Building 
good APIs within MLlib to avoid copying logic across libraries will be well 
worth it.

We've talked about this need at Databricks and have also been syncing with the 
creators of MLeap.  It'd be great to get this functionality into Spark itself.  
Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a 
Row to the current Models.  Instead, it would be ideal to have local, 
lightweight versions of models in mllib-local, outside of the main mllib 
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize 
elements of Spark SQL, particularly Rows and Types, which could be moved into a 
local sql package.
* This architecture may require some awkward APIs currently to have model 
prediction logic in mllib-local, local model classes in mllib-local, and 
regular (DataFrame-friendly) model classes in mllib.  We might find it helpful 
to break some DeveloperApis in Spark 3.0 to facilitate this architecture while 
making it feasible for 3rd party developers to extend MLlib APIs (especially in 
Java).
I agree this could be interesting, and feed into the other discussion around 
when (or if) we should be considering Spark 3.0.
I _think_ we could probably do it with optional traits people could mix in to 
avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as 
important as per-Row transformations, but they would be helpful for batching 
for higher throughput.
That could be interesting as well.

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau 
> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as 
any to revisit the online serving situation in Spark ML. DB & others have done 
some excellent work moving a lot of the necessary tools into a local linear 
algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, 
but currently our individual transform/predict methods are private, so those 
solutions either need to copy or re-implement them (or put themselves in 
org.apache.spark) to access them. How would folks feel about adding a new trait 
for ML pipeline stages to expose transformation of single-element inputs (or 
local collections), which could be optionally implemented by stages that support 
this? That way we can have less copy-and-paste code possibly getting out of sync 
with our model training.
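
(Purely as an illustration of the idea above — a minimal sketch of what such an
opt-in trait could look like; the names are hypothetical, not an actual Spark API:)

import org.apache.spark.sql.Row

// Hypothetical opt-in trait for pipeline stages that can transform a single
// Row, or a small local collection, without needing a SparkContext.
trait SupportsLocalTransform {
  // Transform one input row into one output row using only local state.
  def transformRow(row: Row): Row

  // Default local-collection variant built on top of the single-row method.
  def transformLocal(rows: Seq[Row]): Seq[Row] = rows.map(transformRow)
}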

I think continuing to have on-line serving grow in different projects is 
probably the right path forward (folks have different needs), but I'd love to 
see us make it simpler for other projects to build reliable serving tools.

I realize this maybe puts some of the folks in an awkward position with their 
own commercial offerings, but hopefully if we make it easier for everyone the 
commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]



--
Twitter: https://twitter.com/holdenkarau


[jira] [Created] (SPARK-24207) PrefixSpan: R API

2018-05-08 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-24207:


 Summary: PrefixSpan: R API
 Key: SPARK-24207
 URL: https://issues.apache.org/jira/browse/SPARK-24207
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.4.0
Reporter: Felix Cheung









[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466920#comment-16466920
 ] 

Felix Cheung commented on SPARK-23780:
--

I suppose if you load googleVis first and then SparkR it would have the same 
effect as Ivan's steps?

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML code output, as it was with Spark 
> 2.1.0.






[jira] [Updated] (SPARK-24195) sc.addFile for local:/ path is broken

2018-05-06 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24195:
-
Description: 
In changing SPARK-6300
https://github.com/apache/spark/commit/00e730b94cba1202a73af1e2476ff5a44af4b6b2

essentially the change to
new File(path).getCanonicalFile.toURI.toString

breaks when path is local:, as java.io.File doesn't handle it.

eg.

new 
File("local:///home/user/demo/logger.config").getCanonicalFile.toURI.toString
res1: String = file:/user/anotheruser/local:/home/user/demo/logger.config
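
(A minimal illustration, not the actual Spark fix: java.io.File treats the whole
string as a relative path, while java.net.URI keeps the scheme, so checking the
scheme before falling back to File avoids mangling local: paths.)

import java.io.File
import java.net.URI

object LocalPathDemo {
  def resolve(path: String): String = {
    val uri = new URI(path)
    if (uri.getScheme == null) {
      // No scheme: treat it as a plain filesystem path, as the current code does.
      new File(path).getCanonicalFile.toURI.toString
    } else {
      // A scheme is present (e.g. local:, hdfs:): leave the URI untouched.
      uri.toString
    }
  }

  def main(args: Array[String]): Unit = {
    println(resolve("local:///home/user/demo/logger.config")) // local:///home/user/demo/logger.config
    println(resolve("/home/user/demo/logger.config"))         // file:/home/user/demo/logger.config
  }
}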

> sc.addFile for local:/ path is broken
> -
>
> Key: SPARK-24195
> URL: https://issues.apache.org/jira/browse/SPARK-24195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> In changing SPARK-6300
> https://github.com/apache/spark/commit/00e730b94cba1202a73af1e2476ff5a44af4b6b2
> essentially the change to
> new File(path).getCanonicalFile.toURI.toString
> breaks when path is local:, as java.io.File doesn't handle it.
> eg.
> new 
> File("local:///home/user/demo/logger.config").getCanonicalFile.toURI.toString
> res1: String = file:/user/anotheruser/local:/home/user/demo/logger.config






[jira] [Updated] (SPARK-24195) sc.addFile for local:/ path is broken

2018-05-06 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24195:
-
Affects Version/s: 1.3.1
   1.4.1
   1.5.2
   1.6.3
   2.0.2
   2.1.2

> sc.addFile for local:/ path is broken
> -
>
> Key: SPARK-24195
> URL: https://issues.apache.org/jira/browse/SPARK-24195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>







[jira] [Created] (SPARK-24195) sc.addFile for local:/ path is broken

2018-05-06 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-24195:


 Summary: sc.addFile for local:/ path is broken
 Key: SPARK-24195
 URL: https://issues.apache.org/jira/browse/SPARK-24195
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0, 2.2.1
Reporter: Felix Cheung









[jira] [Commented] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-05-06 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465433#comment-16465433
 ] 

Felix Cheung commented on SPARK-23291:
--

I don't disagree about the behavior issue. (ah, so someone did run into it in 2016)

If I recall, a few folks have brought up [https://semver.org/] recently. I think 
it might be worthwhile to consider whether we are following the principles of 
semantic versioning... or not.

I don't feel strongly about this so I'll leave others to comment.

 

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0
>Reporter: Narendra
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Defect Description :
> -
> For example, an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
>  The target is to create a new column named "col2" with the value "12", 
> which is inside the string. "12" can be extracted with "starting position" 
> "6" and "ending position" "7" 
>  (the starting position of the first character is considered to be "1").
> But the current code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument in the "substr" API, which indicates the 
> 'starting position', is given as "7". 
>  Also, observe that the second argument in the "substr" API, which indicates 
> the 'ending position', is given as "8".
> i.e. the number that has to be passed to indicate a position is the 
> "actual position + 1".
> Expected behavior :
> 
> The code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note :
> ---
>  This defect is observed only when the starting position is greater than 
> 1.






[jira] [Comment Edited] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-05-06 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465307#comment-16465307
 ] 

Felix Cheung edited comment on SPARK-23291 at 5/6/18 10:38 PM:
---

actually, I'm not sure we should backport this to a x.x.1 release.

yes, the behavior "was unexpected" but it has been around for the last 3 years, 
if I recall, since the very beginning. and it is not a regression per se.

either users don't care since it has never been reported, or (most likely) 
users have adapted to the behavior, in which case we will break existing jobs in 
a patch release.

anyway, it's just my 2c.


was (Author: felixcheung):
actually, I'm not sure we should backport this to a x.x.1 release.

yes, the behavior "was unexpected" but it has been around for the last 3 years, 
if I recall, since the very beginning.

either users don't care since it has never been reported, or (most likely) 
users have adapted to the behavior, in which case we will break existing jobs in 
a patch release.

anyway, it's just my 2c.

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0
>Reporter: Narendra
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Defect Description :
> -
> For example, an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
>  The target is to create a new column named "col2" with the value "12", 
> which is inside the string. "12" can be extracted with "starting position" 
> "6" and "ending position" "7" 
>  (the starting position of the first character is considered to be "1").
> But the current code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument in the "substr" API, which indicates the 
> 'starting position', is given as "7". 
>  Also, observe that the second argument in the "substr" API, which indicates 
> the 'ending position', is given as "8".
> i.e. the number that has to be passed to indicate a position is the 
> "actual position + 1".
> Expected behavior :
> 
> The code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note :
> ---
>  This defect is observed only when the starting position is greater than 
> 1.






[jira] [Comment Edited] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-05-06 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465307#comment-16465307
 ] 

Felix Cheung edited comment on SPARK-23291 at 5/6/18 10:35 PM:
---

actually, I'm not sure we should backport this to a x.x.1 release.

yes, the behavior "was unexpected" but it has been around for the last 3 years, 
if I recall, since the very beginning.

either users don't care since it has never been reported, or (most likely) 
users have adapted to the behavior, in which case we will break existing jobs in 
a patch release.

anyway, it's just my 2c.


was (Author: felixcheung):
actually, I'm not sure we should backport this to a x.x.1 release.

yes, the behavior "was unexpected" but it has been around for the last 3 years, 
if I recall.

either users don't care since it has never been reported, or users have adapted 
to the behavior, in which case we will break existing jobs in a patch release.

anyway, it's just my 2c.

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0
>Reporter: Narendra
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Defect Description :
> -
> For example, an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
>  The target is to create a new column named "col2" with the value "12", 
> which is inside the string. "12" can be extracted with "starting position" 
> "6" and "ending position" "7" 
>  (the starting position of the first character is considered to be "1").
> But the current code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument in the "substr" API, which indicates the 
> 'starting position', is given as "7". 
>  Also, observe that the second argument in the "substr" API, which indicates 
> the 'ending position', is given as "8".
> i.e. the number that has to be passed to indicate a position is the 
> "actual position + 1".
> Expected behavior :
> 
> The code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note :
> ---
>  This defect is observed only when the starting position is greater than 
> 1.






[jira] [Commented] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-05-06 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465307#comment-16465307
 ] 

Felix Cheung commented on SPARK-23291:
--

actually, I'm not sure we should backport this to a x.x.1 release.

yes, the behavior "was unexpected" but it has been around for the last 3 years, 
if I recall.

either users don't care since it has never been reported, or users have adapted 
to the behavior, in which case we will break existing jobs in a patch release.

 

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0
>Reporter: Narendra
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Defect Description :
> -
> For example, an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
>  The target is to create a new column named "col2" with the value "12", 
> which is inside the string. "12" can be extracted with "starting position" 
> "6" and "ending position" "7" 
>  (the starting position of the first character is considered to be "1").
> But the current code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument in the "substr" API, which indicates the 
> 'starting position', is given as "7". 
>  Also, observe that the second argument in the "substr" API, which indicates 
> the 'ending position', is given as "8".
> i.e. the number that has to be passed to indicate a position is the 
> "actual position + 1".
> Expected behavior :
> 
> The code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note :
> ---
>  This defect is observed only when the starting position is greater than 
> 1.






[jira] [Comment Edited] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-05-06 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465307#comment-16465307
 ] 

Felix Cheung edited comment on SPARK-23291 at 5/6/18 10:34 PM:
---

actually, I'm not sure we should backport this to a x.x.1 release.

yes, the behavior "was unexpected" but it has been around for the last 3 years, 
if I recall.

either users don't care since it has never been reported, or users have adapted 
to the behavior, in which case we will break existing jobs in a patch release.

anyway, it's just my 2c.


was (Author: felixcheung):
actually, I'm not sure we should backport this to a x.x.1 release.

yes, the behavior "was unexpected" but it has been around for the last 3 years, 
if I recall.

either users don't care since it has never been reported, or users have adapted 
to the behavior, in which case we will break existing jobs in a patch release.

 

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0
>Reporter: Narendra
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Defect Description :
> -
> For example, an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
>  The target is to create a new column named "col2" with the value "12", 
> which is inside the string. "12" can be extracted with "starting position" 
> "6" and "ending position" "7" 
>  (the starting position of the first character is considered to be "1").
> But the current code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument in the "substr" API, which indicates the 
> 'starting position', is given as "7". 
>  Also, observe that the second argument in the "substr" API, which indicates 
> the 'ending position', is given as "8".
> i.e. the number that has to be passed to indicate a position is the 
> "actual position + 1".
> Expected behavior :
> 
> The code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note :
> ---
>  This defect is observed only when the starting position is greater than 
> 1.






Re: SparkR test failures in PR builder

2018-05-03 Thread Felix Cheung
This is resolved.

Please see https://issues.apache.org/jira/browse/SPARK-24152


From: Kazuaki Ishizaki 
Sent: Wednesday, May 2, 2018 4:51:11 PM
To: dev
Cc: Joseph Bradley; Hossein Falaki
Subject: Re: SparkR test failures in PR builder

I am not familiar with SparkR or CRAN. However, I remember that we had a 
similar situation before.

Here is some great work from that time. Having just visited this PR, I think 
that we have a similar situation (i.e. a format error) again.
https://github.com/apache/spark/pull/20005

Any other comments are appreciated.

Regards,
Kazuaki Ishizaki



From:Joseph Bradley 
To:dev 
Cc:Hossein Falaki 
Date:2018/05/03 07:31
Subject:SparkR test failures in PR builder




Hi all,

Does anyone know why the PR builder keeps failing on SparkR's CRAN checks?  
I've seen this in a lot of unrelated PRs.  E.g.: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90065/console

Hossein spotted this line:
```
* checking CRAN incoming feasibility ...Error in 
.check_package_CRAN_incoming(pkgdir) :
  dims [product 24] do not match the length of object [0]
```
and suggested that it could be CRAN flakiness.  I'm not familiar with CRAN, but 
do others have thoughts about how to fix this?

Thanks!
Joseph

--
Joseph Bradley
Software Engineer - Machine Learning
Databricks, Inc.




[jira] [Comment Edited] (SPARK-24152) SparkR CRAN feasibility check server problem

2018-05-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461983#comment-16461983
 ] 

Felix Cheung edited comment on SPARK-24152 at 5/3/18 6:29 AM:
--

ok good.

in the event this reoccurs persistently,

option 1:
 * since we have NO_TESTS, we could remove --as-cran from this line 
[https://github.com/apache/spark/blob/master/R/check-cran.sh#L54] (temporarily)

option 2:
 - we could set 

_R_CHECK_CRAN_INCOMING_ to "FALSE" in the environment to disable this check, 

check_CRAN_incoming()

(see 
[http://mtweb.cs.ucl.ac.uk/mus/bin/install_R/R-3.1.1/src/library/tools/R/check.R]

 

 

 


was (Author: felixcheung):
ok good.

in the event this reoccurs persistently,

option 1:
 * since we have NO_TESTS, we could remove --as-cran from this line 
[https://github.com/apache/spark/blob/master/R/check-cran.sh#L54] (temporarily)

option 2:
 - we could set 

_R_CHECK_CRAN_INCOMING_ to "FALSE" in the environment to disable this check 

check_CRAN_incoming()

(see 
[http://mtweb.cs.ucl.ac.uk/mus/bin/install_R/R-3.1.1/src/library/tools/R/check.R]

 

 

 

> SparkR CRAN feasibility check server problem
> 
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> PR builder and master branch tests fail with the following SparkR error for an 
> unknown reason. The error message is:
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we have already started to merge PRs by ignoring this 
> **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175






[jira] [Updated] (SPARK-24152) SparkR CRAN feasibility check server problem

2018-05-03 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-24152:
-
Summary: SparkR CRAN feasibility check server problem  (was: Flaky Test: 
SparkR)

> SparkR CRAN feasibility check server problem
> 
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> PR builder and master branch tests fail with the following SparkR error for an 
> unknown reason. The error message is:
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we have already started to merge PRs by ignoring this 
> **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175






[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem

2018-05-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461993#comment-16461993
 ] 

Felix Cheung commented on SPARK-24152:
--

(I updated the bug title - it's not really flaky..)

> SparkR CRAN feasibility check server problem
> 
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> PR builder and master branch tests fail with the following SparkR error for an 
> unknown reason. The error message is:
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we have already started to merge PRs by ignoring this 
> **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175






[jira] [Comment Edited] (SPARK-24152) Flaky Test: SparkR

2018-05-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461983#comment-16461983
 ] 

Felix Cheung edited comment on SPARK-24152 at 5/3/18 6:26 AM:
--

ok good.

in the event this reoccurs persistently,

option 1:
 * since we have NO_TESTS, we could remove --as-cran from this line 
[https://github.com/apache/spark/blob/master/R/check-cran.sh#L54] (temporarily)

option 2:
 - we could set 

_R_CHECK_CRAN_INCOMING_ to "FALSE" in the environment to disable this check 

check_CRAN_incoming()

(see 
[http://mtweb.cs.ucl.ac.uk/mus/bin/install_R/R-3.1.1/src/library/tools/R/check.R|http://mtweb.cs.ucl.ac.uk/mus/bin/install_R/R-3.1.1/src/library/tools/R/check.R)]

 

 


was (Author: felixcheung):
ok good.

in the event this reoccurs persistently,

option 1:
 * since we have NO_TESTS, we could remove --as-cran from this line 
[https://github.com/apache/spark/blob/master/R/check-cran.sh#L54] (temporarily)

option 2:

- we could set 

_R_CHECK_CRAN_INCOMING_ to "FALSE" in the environment to disable this check 

check_CRAN_incoming()

(see 
[http://mtweb.cs.ucl.ac.uk/mus/bin/install_R/R-3.1.1/src/library/tools/R/check.R)]

 

> Flaky Test: SparkR
> --
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> PR builder and master branch tests fail with the following SparkR error for an 
> unknown reason. The error message is:
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we have already started to merge PRs by ignoring this 
> **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175






[jira] [Comment Edited] (SPARK-24152) Flaky Test: SparkR

2018-05-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461983#comment-16461983
 ] 

Felix Cheung edited comment on SPARK-24152 at 5/3/18 6:26 AM:
--

ok good.

in the event this reoccurs persistently,

option 1:
 * since we have NO_TESTS, we could remove --as-cran from this line 
[https://github.com/apache/spark/blob/master/R/check-cran.sh#L54] (temporarily)

option 2:
 - we could set 

_R_CHECK_CRAN_INCOMING_ to "FALSE" in the environment to disable this check 

check_CRAN_incoming()

(see 
[http://mtweb.cs.ucl.ac.uk/mus/bin/install_R/R-3.1.1/src/library/tools/R/check.R]

 

 

 


was (Author: felixcheung):
ok good.

in the event this reoccurs persistently,

option 1:
 * since we have NO_TESTS, we could remove --as-cran from this line 
[https://github.com/apache/spark/blob/master/R/check-cran.sh#L54] (temporarily)

option 2:
 - we could set 

_R_CHECK_CRAN_INCOMING_ to "FALSE" in the environment to disable this check 

check_CRAN_incoming()

(see 
[http://mtweb.cs.ucl.ac.uk/mus/bin/install_R/R-3.1.1/src/library/tools/R/check.R|http://mtweb.cs.ucl.ac.uk/mus/bin/install_R/R-3.1.1/src/library/tools/R/check.R)]

 

 

> Flaky Test: SparkR
> --
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> PR builder and master branch tests fail with the following SparkR error for an 
> unknown reason. The error message is:
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we have already started to merge PRs by ignoring this 
> **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175






[jira] [Commented] (SPARK-24152) Flaky Test: SparkR

2018-05-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461983#comment-16461983
 ] 

Felix Cheung commented on SPARK-24152:
--

ok good.

in the event this reoccurs persistently,

option 1:
 * since we have NO_TESTS, we could remove --as-cran from this line 
[https://github.com/apache/spark/blob/master/R/check-cran.sh#L54] (temporarily)

option 2:

- we could set 

_R_CHECK_CRAN_INCOMING_ to "FALSE" in the environment to disable this check 

check_CRAN_incoming()

(see 
[http://mtweb.cs.ucl.ac.uk/mus/bin/install_R/R-3.1.1/src/library/tools/R/check.R)]

 

> Flaky Test: SparkR
> --
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> PR builder and master branch tests fail with the following SparkR error for an 
> unknown reason. The error message is:
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we have already started to merge PRs by ignoring this 
> **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175






[jira] [Commented] (SPARK-24152) Flaky Test: SparkR

2018-05-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461974#comment-16461974
 ] 

Felix Cheung commented on SPARK-24152:
--

Is this still a problem?

> Flaky Test: SparkR
> --
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> PR builder and master branch tests fail with the following SparkR error for an 
> unknown reason. The error message is:
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we have already started to merge PRs by ignoring this 
> **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175






Re: all calculations finished, but "VCores Used" value remains at its max

2018-05-01 Thread Felix Cheung
Zeppelin keeps the Spark job alive. This is likely a better question for the 
Zeppelin project.


From: Valery Khamenya 
Sent: Tuesday, May 1, 2018 4:30:24 AM
To: user@spark.apache.org
Subject: all calculations finished, but "VCores Used" value remains at its max

Hi all

I am experiencing a strange thing: when Spark 2.3.0 calculations started from 
Zeppelin 0.7.3 are finished, the "VCores Used" value in the resource manager stays 
at its maximum, although nothing should be running anymore. How come?

If relevant, I have been experiencing this issue since AWS EMR 5.13.0.

best regards
--
Valery


best regards
--
Valery A.Khamenya


Re: zeppelin 0.8 tar file

2018-04-30 Thread Felix Cheung
0.8 is not released yet.


From: Soheil Pourbafrani 
Sent: Sunday, April 29, 2018 9:18:10 AM
To: users@zeppelin.apache.org
Subject: zeppelin 0.8 tar file

Is there any pre-compiled tar file of Zeppelin 0.8 to download?


[jira] [Commented] (SPARK-23954) Converting spark dataframe containing int64 fields to R dataframes leads to impredictable errors.

2018-04-28 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457390#comment-16457390
 ] 

Felix Cheung commented on SPARK-23954:
--

yap, please see discussion in SPARK-12360 in particular

> Converting spark dataframe containing int64 fields to R dataframes leads to 
> impredictable errors.
> -
>
> Key: SPARK-23954
> URL: https://issues.apache.org/jira/browse/SPARK-23954
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: nicolas paris
>Priority: Minor
>
> Converting Spark dataframes containing int64 fields to R dataframes leads to 
> unpredictable errors. 
> The problem comes from R, which does not handle int64 natively. As a result, a 
> good workaround would be to convert bigint columns to strings when transforming 
> Spark dataframes into R dataframes.






Re: [DISCUSS] Adjust test logs for CI

2018-04-24 Thread Felix Cheung
+1 on that too


From: Jongyoul Lee <jongy...@gmail.com>
Sent: Sunday, April 22, 2018 9:30:27 PM
To: dev
Subject: Re: [DISCUSS] Adjust test logs for CI

I like this idea!!

On Mon, Apr 23, 2018 at 12:14 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> Another thing we can do is to skip the remaining test when we hit test
> failure. Currently, Zeppelin won't stop running tests even after hitting a failed
> test.
>
> http://maven.apache.org/surefire/maven-surefire-
> plugin/examples/skip-after-failure.html
>
>
> Jeff Zhang <zjf...@gmail.com>于2018年4月23日周一 上午11:00写道:
>
> >
> > Regarding selenium test, I agree with you that the log in selenium is
> > almost useless. It is hard to figure out what's wrong when selenium test
> > fails. Maybe other frontend expert can help on that.
> >
> >
> >
> > Jongyoul Lee <jongy...@gmail.com>于2018年4月23日周一 上午10:55写道:
> >
> >> @felix,
> >> We can enforce to set different log level like ERROR or WARN but I don't
> >> think it's a proper solution.
> >>
> >> @Jeff,
> >> I found current master might have a problem with Integration test of
> using
> >> Selenium but It's hard to see all logs from that tests because there are
> >> so
> >> many unrelated logs like "sleep...". I can fix this issue, but
> generally,
> >> I
> >> think every contributor doesn't want to see unrelated logs except my
> >> tests.
> >> When developing a new feature, it would be helpful to see debugger/test
> >> logs but after it's implemented, I think it's a bit less-useful and
> it's a
> >> bit bothersome for another developer.
> >>
> >> So it would be better to have a consensus to do our best to reduce my
> >> logs,
> >> especially made by test cases.
> >>
> >> How do you think of it?
> >>
> >> JL
> >>
> >> On Mon, Apr 23, 2018 at 10:04 AM, Jeff Zhang <zjf...@gmail.com> wrote:
> >>
> >> > Jongyoul,
> >> >
> >> > What kind of problem do you have ? Each module has log4j.properties
> >> under
> >> > its test folder that we can change the log level.
> >> >
> >> >
> >> >
> >> > Felix Cheung <felixcheun...@hotmail.com>于2018年4月23日周一 上午3:52写道:
> >> >
> >> > > Is there a way to do this via enable/disable component for logging
> in
> >> > > log4j?
> >> > >
> >> > > 
> >> > > From: Jongyoul Lee <jongy...@gmail.com>
> >> > > Sent: Sunday, April 22, 2018 7:01:54 AM
> >> > > To: dev
> >> > > Subject: [DISCUSS] Adjust test logs for CI
> >> > >
> >> > > Hello contributors,
> >> > >
> >> > > I wonder how you guys think of reducing test logs to help to debug
> >> with
> >> > CI.
> >> > > Recently, Zeppelin's Travis log is too big to read anything.
> >> > >
> >> > > So I suggest these kinds of step:
> >> > > 1. leave test logs as much as you want to test your code passed in
> CI
> >> > > 2. If passed, please remove all of your test's logs
> >> > >
> >> > > How do you think of your guys? I know and understand our CI is not
> >> good
> >> > for
> >> > > running multiple times with same code because of lack of resources.
> >> > >
> >> > > Best regards,
> >> > > Jongyoul Lee
> >> > >
> >> > > --
> >> > > 이종열, Jongyoul Lee, 李宗烈
> >> > > http://madeng.net
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> 이종열, Jongyoul Lee, 李宗烈
> >> http://madeng.net
> >>
> >
>



--
이종열, Jongyoul Lee, 李宗烈
http://madeng.net


Re: Problem running Kubernetes example v2.2.0-kubernetes-0.5.0

2018-04-22 Thread Felix Cheung
You might want to check with the spark-on-k8s
Or try using the Kubernetes support in the official Spark 2.3.0 release. (Yes, we don't 
have an official docker image yet, but you can build one with the script.)


From: Rico Bergmann 
Sent: Wednesday, April 11, 2018 11:02:38 PM
To: user@spark.apache.org
Subject: Problem running Kubernetes example v2.2.0-kubernetes-0.5.0

Hi!

I was trying to get the SparkPi example running using the spark-on-k8s
distro from kubespark. But I get the following error:
+ /sbin/tini -s -- driver
[FATAL tini (11)] exec driver failed: No such file or directory

Did anyone get the example running on a Kubernetes cluster?

Best,
Rico.

invoked cmd:
bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://cluster:port \
  --conf spark.executor.instances=2 \
  --conf spark.app.name=spark-pi \
  --conf
spark.kubernetes.container.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0
\
  --conf
spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0
\
  --conf
spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0
\

local:///opt/spark/examples/jars/spark-examples_2.11-v2.2.0-kubernetes-0.5.0.jar




Re: [Julia] Does Spark.jl work in Zeppelin's existing Spark/livy.spark interpreters?

2018-04-22 Thread Felix Cheung
Actually, I’m not sure we support Julia as a language in the Spark interpreter.

As far as I understand this, this is Julia -> Spark so we would need support 
for this added to enable

Java (Zeppelin) -> Julia -> Spark



From: Jongyoul Lee 
Sent: Saturday, April 21, 2018 11:53:12 PM
To: users@zeppelin.apache.org
Subject: Re: [Julia] Does Spark.jl work in Zeppelin's existing Spark/livy.spark 
interpreters?

Hello,

AFAIK, there is no issue.

Regards
JL

On Wed, 18 Apr 2018 at 2:22 AM Josh Goldsborough 
> wrote:
Wondering if anyone has had success using the 
Spark.jl library for Spark to support Julia 
using one of Zeppelin's spark interpreters.

Thanks!
-Josh
--
이종열, Jongyoul Lee, 李宗烈
http://madeng.net


Re: [DISCUSS] Adjust test logs for CI

2018-04-22 Thread Felix Cheung
Is there a way to do this via enable/disable component for logging in log4j?


From: Jongyoul Lee 
Sent: Sunday, April 22, 2018 7:01:54 AM
To: dev
Subject: [DISCUSS] Adjust test logs for CI

Hello contributors,

I wonder how you guys think of reducing test logs to help to debug with CI.
Recently, Zeppelin's Travis log is too big to read anything.

So I suggest these kinds of step:
1. leave test logs as much as you want to test your code passed in CI
2. If passed, please remove all of your test's logs

How do you think of your guys? I know and understand our CI is not good for
running multiple times with same code because of lack of resources.

Best regards,
Jongyoul Lee

--
이종열, Jongyoul Lee, 李宗烈
http://madeng.net


Re: [discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-16 Thread Felix Cheung
Is it required for DataReader to support all known DataFormat?

Hopefully not, as implied by the 'throw' in the interface. Then specifically, 
how are we going to express the capability of a given reader in terms of its 
supported format(s), or specific support for each of "real-time data in row format, and 
history data in columnar format"?



From: Wenchen Fan 
Sent: Sunday, April 15, 2018 7:45:01 PM
To: Spark dev list
Subject: [discuss][data source v2] remove type parameter in 
DataReader/WriterFactory

Hi all,

I'd like to propose an API change to the data source v2.

One design goal of data source v2 is API type safety. The FileFormat API is a 
bad example: it asks the implementation to return InternalRow even when it's 
actually a ColumnarBatch. In data source v2 we add a type parameter to 
DataReader/WriterFactory and DataReader/Writer, so that a data source supporting 
columnar scan returns ColumnarBatch at the API level.

However, we met some problems when migrating streaming and file-based data 
sources to data source v2.

For the streaming side, we need a variant of DataReader/WriterFactory to add 
streaming specific concept like epoch id and offset. For details please see 
ContinuousDataReaderFactory and 
https://docs.google.com/document/d/1PJYfb68s2AG7joRWbhrgpEWhrsPqbhyRwUVl9V1wPOE/edit#

But this conflicts with the special format mixin traits like 
SupportsScanColumnarBatch. We have to make the streaming variant of 
DataReader/WriterFactory extend the original DataReader/WriterFactory and 
do type casts at runtime, which is unnecessary and violates type safety.

On the file-based data source side, we have a problem with code duplication. 
Let's take the ORC data source as an example. To support both unsafe row and 
columnar batch scans, we need something like

// A lot of parameters to carry to the executor side
class OrcUnsafeRowFactory(...) extends DataReaderFactory[UnsafeRow] {
  def createDataReader ...
}

class OrcColumnarBatchFactory(...) extends DataReaderFactory[ColumnarBatch] {
  def createDataReader ...
}

class OrcDataSourceReader extends DataSourceReader {
  def createUnsafeRowFactories = ... // logic to prepare the parameters and create factories

  def createColumnarBatchFactories = ... // logic to prepare the parameters and create factories
}

You can see that we have duplicated logic for preparing parameters and defining 
the factory.

Here I propose to remove all the special format mixin traits and change the 
factory interface to

public enum DataFormat {
  ROW,
  INTERNAL_ROW,
  UNSAFE_ROW,
  COLUMNAR_BATCH
}

interface DataReaderFactory {
  DataFormat dataFormat();

  default DataReader createRowDataReader() {
    throw new IllegalStateException();
  }

  default DataReader createUnsafeRowDataReader() {
    throw new IllegalStateException();
  }

  default DataReader createColumnarBatchDataReader() {
    throw new IllegalStateException();
  }
}

Spark will look at the dataFormat and decide which of the create-data-reader methods to 
call.

Now we don't have the problem for the streaming side as these special format 
mixin traits go away. And the ORC data source can also be simplified to

class OrcReaderFactory(...) extends DataReaderFactory {
  def createUnsafeRowReader ...

  def createColumnarBatchReader ...
}

class OrcDataSourceReader extends DataSourceReader {
  def createReadFactories = ... // logic to prepare the parameters and create factories
}

There is also a potential benefit for supporting hybrid storage data sources, which may 
keep real-time data in row format and historical data in columnar format. They can then 
make some DataReaderFactory instances output InternalRow and others output ColumnarBatch.

Thoughts?


Re: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-06 Thread Felix Cheung
Instead of writing to the console, you need to write to the memory sink for the result to be queryable:

  .format("memory")
  .queryName("tableName")
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
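
A minimal sketch in PySpark (the table name below is only a placeholder), assuming
aggregate_func is a streaming DataFrame with an aggregation, as in the code further down:

    query = aggregate_func.writeStream \
        .outputMode("complete") \
        .format("memory") \
        .queryName("agg_table") \
        .start()

    # the in-memory table is then queryable from the same SparkSession
    spark.sql("select * from agg_table").show()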


From: Aakash Basu 
Sent: Friday, April 6, 2018 3:22:07 AM
To: user
Subject: Fwd: [Structured Streaming Query] Calculate Running Avg from Kafka 
feed using SQL query

Any help?

I need urgent help. Could someone please clarify this doubt?


-- Forwarded message --
From: Aakash Basu 
>
Date: Mon, Apr 2, 2018 at 1:01 PM
Subject: [Structured Streaming Query] Calculate Running Avg from Kafka feed 
using SQL query
To: user >, "Bowden, Chris" 
>


Hi,

This is a very interesting requirement, but I am getting stuck at a few 
places.

Requirement -

Col1  Col2
1     10
2     11
3     12
4     13
5     14

I have to calculate the avg of Col1 and then divide each row of Col2 by that avg. 
And the avg should be updated with every new batch of data fed through Kafka into 
Spark Streaming.

Avg(Col1) = Running Avg
Col2 = Col2/Avg(Col1)


Queries -


1) I am currently trying to simply run an inner query inside a query and print the 
Avg with the other Col value, and then do the calculation later. But I am getting an error.

Query -

select t.Col2 , (Select AVG(Col1) as Avg from transformed_Stream_DF) as myAvg 
from transformed_Stream_DF t

Error -

pyspark.sql.utils.StreamingQueryException: u'Queries with streaming sources 
must be executed with writeStream.start();

Even though I already have writeStream.start() in my code, it is probably 
throwing the error because of the inner select query (I think Spark treats it as 
another query altogether, which requires its own writeStream.start()). Any help?


2) How should I go about it? I have another approach in mind, i.e., querying the table to 
get the avg and storing it in a variable, then simply passing that variable into a second 
query and dividing the second column to produce the appropriate result. But is that the 
right approach?

3) Final question: how do I do the calculation over the entire data and not just the 
latest batch? Do I need to keep appending somewhere and reuse it repeatedly? My average 
and all the rows of Col2 will change with every new piece of incoming data.


Code -


from pyspark.sql import SparkSession
import time
from pyspark.sql.functions import split, col

class test:

    spark = SparkSession.builder \
        .appName("Stream_Col_Oper_Spark") \
        .getOrCreate()

    data = spark.readStream.format("kafka") \
        .option("startingOffsets", "latest") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "test1") \
        .load()

    ID = data.select('value') \
        .withColumn('value', data.value.cast("string")) \
        .withColumn("Col1", split(col("value"), ",").getItem(0)) \
        .withColumn("Col2", split(col("value"), ",").getItem(1)) \
        .drop('value')

    ID.createOrReplaceTempView("transformed_Stream_DF")
    aggregate_func = spark.sql(
        "select t.Col2 , (Select AVG(Col1) as Avg from transformed_Stream_DF) "
        "as myAvg from transformed_Stream_DF t")  # (Col2/(AVG(Col1)) as Col3)

    # ---For Console Print---

    query = aggregate_func \
        .writeStream \
        .format("console") \
        .start()
    # .outputMode("complete") \
    # ---Console Print ends---

    query.awaitTermination()

    # /home/kafka/Downloads/spark-2.3.0-bin-hadoop2.7/bin/spark-submit \
    #     --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 \
    #     /home/aakashbasu/PycharmProjects/AllMyRnD/Kafka_Spark/Stream_Col_Oper_Spark.py



Thanks,
Aakash.



Re: Hadoop 3 support

2018-04-04 Thread Felix Cheung
What would be the strategy with hive? Cherry pick patches? Update to more 
“modern” versions (like 2.3?)

I know of a few critical schema evolution fixes that we could port to hive 
1.2.1-spark


_
From: Steve Loughran 
Sent: Tuesday, April 3, 2018 1:33 PM
Subject: Re: Hadoop 3 support
To: Apache Spark Dev 




On 3 Apr 2018, at 01:30, Saisai Shao 
> wrote:

Yes, the main blocking issue is the hive version used in Spark (1.2.1.spark) 
doesn't support run on Hadoop 3. Hive will check the Hadoop version in the 
runtime [1]. Besides this I think some pom changes should be enough to support 
Hadoop 3.

If we want to use Hadoop 3 shaded client jar, then the pom requires lots of 
changes, but this is not necessary.


[1] 
https://github.com/apache/hive/blob/6751225a5cde4c40839df8b46e8d241fdda5cd34/shims/common/src/main/java/org/apache/hadoop/hive/shims/ShimLoader.java#L144

2018-04-03 4:57 GMT+08:00 Marcelo Vanzin 
>:
Saisai filed SPARK-23534, but the main blocking issue is really SPARK-18673.


On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin 
> wrote:
> Does anybody know what needs to be done in order for Spark to support Hadoop
> 3?
>


To be ruthless, I'd view Hadoop 3.1 as the first one to play with...3.0.x was 
more of a wide-version check. Hadoop 3.1RC0 is out this week, making it the 
ideal (last!) time to find showstoppers.

1. I've got a PR which adds a profile to build spark against hadoop 3, with 
some fixes for zk import along with better hadoop-cloud profile

https://github.com/apache/spark/pull/20923


Apply that and patch and both mvn and sbt can build with the RC0 from the ASF 
staging repo:

build/sbt -Phadoop-3,hadoop-cloud,yarn -Psnapshots-and-staging



2. Everything Marcelo says about hive.

You can build hadoop locally with a -Dhadoop.version=2.11 and the hive 
1.2.1-spark version check goes through. You can't safely bring up HDFS like 
that, but you can run spark standalone against things.

Some strategies

Short term: build a new hive-1.2.x-spark which fixes up the version check and 
merges in those critical patches that Cloudera, Hortonworks, Databricks, + 
anyone else has got in for their production systems. I don't think we have that 
many.

That leaves a "how to release" story, as the ASF will want it to come out under 
the ASF auspices, and, given the liability disclaimers, so should everyone. The 
Hive team could be "invited" to publish it as their own if people ask nicely.

Long term
 -do something about that subclassing to get the thrift endpoint to work. That 
can include fixing hive's service to be subclass friendly.
 -move to hive 2

That's a major piece of work.




[jira] [Created] (ZEPPELIN-3385) PySpark interpreter should handle .. for autocomplete

2018-04-04 Thread Felix Cheung (JIRA)
Felix Cheung created ZEPPELIN-3385:
--

 Summary: PySpark interpreter should handle .. for autocomplete
 Key: ZEPPELIN-3385
 URL: https://issues.apache.org/jira/browse/ZEPPELIN-3385
 Project: Zeppelin
  Issue Type: Bug
  Components: python-interpreter
Reporter: Felix Cheung


See thread here 
[https://github.com/apache/zeppelin/pull/2901#discussion_r178472173]

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional

2018-04-04 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425206#comment-16425206
 ] 

Felix Cheung commented on SPARK-23285:
--

fixed in 2.3, updating.

> Allow spark.executor.cores to be fractional
> ---
>
> Key: SPARK-23285
> URL: https://issues.apache.org/jira/browse/SPARK-23285
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Scheduler, Spark Submit
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Assignee: Yinan Li
>Priority: Minor
> Fix For: 2.4.0
>
>
> There is a strong check for an integral number of cores per executor in 
> [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272].
>  Given we're reusing that property in K8s, does it make sense to relax it?
>  
> K8s treats CPU as a "compressible resource" and can actually assign millicpus 
> to individual containers. Also to be noted - spark.driver.cores has no such 
> check in place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23285) Allow spark.executor.cores to be fractional

2018-04-04 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23285:
-
Fix Version/s: 2.4.0

> Allow spark.executor.cores to be fractional
> ---
>
> Key: SPARK-23285
> URL: https://issues.apache.org/jira/browse/SPARK-23285
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Scheduler, Spark Submit
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Assignee: Yinan Li
>Priority: Minor
> Fix For: 2.4.0
>
>
> There is a strong check for an integral number of cores per executor in 
> [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272].
>  Given we're reusing that property in K8s, does it make sense to relax it?
>  
> K8s treats CPU as a "compressible resource" and can actually assign millicpus 
> to individual containers. Also to be noted - spark.driver.cores has no such 
> check in place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23285) Allow spark.executor.cores to be fractional

2018-04-04 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425206#comment-16425206
 ] 

Felix Cheung edited comment on SPARK-23285 at 4/4/18 8:53 AM:
--

fixed in 2.4, updating.


was (Author: felixcheung):
fixed in 2.3, updating.

> Allow spark.executor.cores to be fractional
> ---
>
> Key: SPARK-23285
> URL: https://issues.apache.org/jira/browse/SPARK-23285
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Scheduler, Spark Submit
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Assignee: Yinan Li
>Priority: Minor
> Fix For: 2.4.0
>
>
> There is a strong check for an integral number of cores per executor in 
> [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272].
>  Given we're reusing that property in K8s, does it make sense to relax it?
>  
> K8s treats CPU as a "compressible resource" and can actually assign millicpus 
> to individual containers. Also to be noted - spark.driver.cores has no such 
> check in place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23680) entrypoint.sh does not accept arbitrary UIDs, returning as an error

2018-04-04 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425202#comment-16425202
 ] 

Felix Cheung commented on SPARK-23680:
--

Also, please check the fix version and target version when resolving an issue. In 
this case this was merged into master only, so it is only going into 2.4 and not 2.3.1.

> entrypoint.sh does not accept arbitrary UIDs, returning as an error
> ---
>
> Key: SPARK-23680
> URL: https://issues.apache.org/jira/browse/SPARK-23680
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: OpenShift
>Reporter: Ricardo Martinelli de Oliveira
>Assignee: Ricardo Martinelli de Oliveira
>Priority: Major
>  Labels: easyfix
> Fix For: 2.4.0
>
>
> Openshift supports running pods using arbitrary UIDs 
> ([https://docs.openshift.com/container-platform/3.7/creating_images/guidelines.html#openshift-specific-guidelines)]
>   to improve security. Although entrypoint.sh was developed to cover this 
> feature, the script is returning an error[1].
> The issue is that the script uses getent to find the passwd entry of the 
> current UID, and if the entry is not found it creates an entry in 
> /etc/passwd. According to the getent man page:
> {code:java}
> EXIT STATUS
>    One of the following exit values can be returned by getent:
>   0 Command completed successfully.
>   1 Missing arguments, or database unknown.
>   2 One or more supplied key could not be found in the 
> database.
>   3 Enumeration not supported on this database.
> {code}
> And since the script begin with a "set -ex" command, which means it turns on 
> debug and breaks the script if the command pipelines returns an exit code 
> other than 0.--
> Having that said, this line below must be changed to remove the "-e" flag 
> from set command:
> https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L20
>  
>  
> [1]https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L25-L34
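
For illustration only, a minimal sketch of the behavior described above and of the
change it proposes (variable names and the passwd entry format are illustrative, not
the actual patch):

    # with "set -ex", a getent miss (exit status 2 when the UID has no passwd entry) aborts the script
    set -ex
    uidentry=$(getent passwd "$(id -u)")

    # dropping "-e", as proposed above, lets the script handle the miss and add the entry itself
    set -x
    uidentry=$(getent passwd "$(id -u)" || true)
    if [ -z "$uidentry" ]; then
        echo "$(id -u):x:$(id -u):0:anonymous uid:$SPARK_HOME:/bin/false" >> /etc/passwd
    fi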



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23680) entrypoint.sh does not accept arbitrary UIDs, returning as an error

2018-04-04 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23680:
-
Target Version/s: 2.4.0  (was: 2.3.1, 2.4.0)

> entrypoint.sh does not accept arbitrary UIDs, returning as an error
> ---
>
> Key: SPARK-23680
> URL: https://issues.apache.org/jira/browse/SPARK-23680
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: OpenShift
>Reporter: Ricardo Martinelli de Oliveira
>Assignee: Ricardo Martinelli de Oliveira
>Priority: Major
>  Labels: easyfix
> Fix For: 2.4.0
>
>
> Openshift supports running pods using arbitrary UIDs 
> ([https://docs.openshift.com/container-platform/3.7/creating_images/guidelines.html#openshift-specific-guidelines)]
>   to improve security. Although entrypoint.sh was developed to cover this 
> feature, the script is returning an error[1].
> The issue is that the script uses getent to find the passwd entry of the 
> current UID, and if the entry is not found it creates an entry in 
> /etc/passwd. According to the getent man page:
> {code:java}
> EXIT STATUS
>    One of the following exit values can be returned by getent:
>   0 Command completed successfully.
>   1 Missing arguments, or database unknown.
>   2 One or more supplied key could not be found in the 
> database.
>   3 Enumeration not supported on this database.
> {code}
> And since the script begin with a "set -ex" command, which means it turns on 
> debug and breaks the script if the command pipelines returns an exit code 
> other than 0.--
> Having that said, this line below must be changed to remove the "-e" flag 
> from set command:
> https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L20
>  
>  
> [1]https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L25-L34



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23680) entrypoint.sh does not accept arbitrary UIDs, returning as an error

2018-04-04 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23680:
-
Fix Version/s: 2.4.0

> entrypoint.sh does not accept arbitrary UIDs, returning as an error
> ---
>
> Key: SPARK-23680
> URL: https://issues.apache.org/jira/browse/SPARK-23680
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: OpenShift
>Reporter: Ricardo Martinelli de Oliveira
>Assignee: Ricardo Martinelli de Oliveira
>Priority: Major
>  Labels: easyfix
> Fix For: 2.4.0
>
>
> Openshift supports running pods using arbitrary UIDs 
> ([https://docs.openshift.com/container-platform/3.7/creating_images/guidelines.html#openshift-specific-guidelines)]
>   to improve security. Although entrypoint.sh was developed to cover this 
> feature, the script is returning an error[1].
> The issue is that the script uses getent to find the passwd entry of the 
> current UID, and if the entry is not found it creates an entry in 
> /etc/passwd. According to the getent man page:
> {code:java}
> EXIT STATUS
>    One of the following exit values can be returned by getent:
>   0 Command completed successfully.
>   1 Missing arguments, or database unknown.
>   2 One or more supplied key could not be found in the 
> database.
>   3 Enumeration not supported on this database.
> {code}
> And since the script begin with a "set -ex" command, which means it turns on 
> debug and breaks the script if the command pipelines returns an exit code 
> other than 0.--
> Having that said, this line below must be changed to remove the "-e" flag 
> from set command:
> https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L20
>  
>  
> [1]https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L25-L34



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23680) entrypoint.sh does not accept arbitrary UIDs, returning as an error

2018-04-04 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-23680:


Assignee: Ricardo Martinelli de Oliveira

> entrypoint.sh does not accept arbitrary UIDs, returning as an error
> ---
>
> Key: SPARK-23680
> URL: https://issues.apache.org/jira/browse/SPARK-23680
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: OpenShift
>Reporter: Ricardo Martinelli de Oliveira
>Assignee: Ricardo Martinelli de Oliveira
>Priority: Major
>  Labels: easyfix
>
> Openshift supports running pods using arbitrary UIDs 
> ([https://docs.openshift.com/container-platform/3.7/creating_images/guidelines.html#openshift-specific-guidelines)]
>   to improve security. Although entrypoint.sh was developed to cover this 
> feature, the script is returning an error[1].
> The issue is that the script uses getent to find the passwd entry of the 
> current UID, and if the entry is not found it creates an entry in 
> /etc/passwd. According to the getent man page:
> {code:java}
> EXIT STATUS
>    One of the following exit values can be returned by getent:
>   0 Command completed successfully.
>   1 Missing arguments, or database unknown.
>   2 One or more supplied key could not be found in the 
> database.
>   3 Enumeration not supported on this database.
> {code}
> And since the script begin with a "set -ex" command, which means it turns on 
> debug and breaks the script if the command pipelines returns an exit code 
> other than 0.--
> Having that said, this line below must be changed to remove the "-e" flag 
> from set command:
> https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L20
>  
>  
> [1]https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L25-L34



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23680) entrypoint.sh does not accept arbitrary UIDs, returning as an error

2018-04-04 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425200#comment-16425200
 ] 

Felix Cheung commented on SPARK-23680:
--

Try now. I also had to add rmartine - it seemed like this was his first 
contribution, congrats!

> entrypoint.sh does not accept arbitrary UIDs, returning as an error
> ---
>
> Key: SPARK-23680
> URL: https://issues.apache.org/jira/browse/SPARK-23680
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: OpenShift
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Major
>  Labels: easyfix
>
> Openshift supports running pods using arbitrary UIDs 
> ([https://docs.openshift.com/container-platform/3.7/creating_images/guidelines.html#openshift-specific-guidelines)]
>   to improve security. Although entrypoint.sh was developed to cover this 
> feature, the script is returning an error[1].
> The issue is that the script uses getent to find the passwd entry of the 
> current UID, and if the entry is not found it creates an entry in 
> /etc/passwd. According to the getent man page:
> {code:java}
> EXIT STATUS
>    One of the following exit values can be returned by getent:
>   0 Command completed successfully.
>   1 Missing arguments, or database unknown.
>   2 One or more supplied key could not be found in the 
> database.
>   3 Enumeration not supported on this database.
> {code}
> And since the script begin with a "set -ex" command, which means it turns on 
> debug and breaks the script if the command pipelines returns an exit code 
> other than 0.--
> Having that said, this line below must be changed to remove the "-e" flag 
> from set command:
> https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L20
>  
>  
> [1]https://github.com/apache/spark/blob/v2.3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L25-L34



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [Spark R] Proposal: Exposing RBackend in RRunner

2018-03-30 Thread Felix Cheung
Automatic reference counting should already be handled by SparkR.

Can you elaborate on which object and how that would be used?


From: Jeremy Liu <jeremy.jl@gmail.com>
Sent: Thursday, March 29, 2018 8:23:58 AM
To: Reynold Xin
Cc: Felix Cheung; dev@spark.apache.org
Subject: Re: [Spark R] Proposal: Exposing RBackend in RRunner

Use case is to cache a reference to the JVM object created by SparkR.

On Wed, Mar 28, 2018 at 12:03 PM Reynold Xin 
<r...@databricks.com<mailto:r...@databricks.com>> wrote:
If you need the functionality I would recommend you just copying the code over 
to your project and use it that way.

On Wed, Mar 28, 2018 at 9:02 AM Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
I think the difference is that py4j is a public library, whereas the R backend is 
specific to SparkR.

Can you elaborate on what you need JVMObjectTracker for? We have provided 
convenient R APIs to call into the JVM: sparkR.callJMethod, for example.

_
From: Jeremy Liu <jeremy.jl@gmail.com<mailto:jeremy.jl@gmail.com>>
Sent: Tuesday, March 27, 2018 12:20 PM
Subject: Re: [Spark R] Proposal: Exposing RBackend in RRunner
To: <dev@spark.apache.org<mailto:dev@spark.apache.org>>



Spark Dev,

On second thought, the below topic seems more appropriate for spark-dev rather 
than spark-users:

Spark Users,

In SparkR, RBackend is created in RRunner.main(). This in particular makes it 
difficult to control or use the RBackend. For my use case, I am looking to 
access the JVMObjectTracker that RBackend maintains for SparkR dataframes.

Analogously, pyspark starts a py4j.GatewayServer in PythonRunner.main(). It's 
then possible to start a ClientServer that then has access to the object 
bindings between Python/Java.

Is there something similar for SparkR? Or a reasonable way to expose RBackend?

Thanks!
--
-
Jeremy Liu
jeremy.jl@gmail.com<mailto:jeremy.jl@gmail.com>


--
-
Jeremy Liu
jeremy.jl@gmail.com<mailto:jeremy.jl@gmail.com>


Re: [Spark R] Proposal: Exposing RBackend in RRunner

2018-03-28 Thread Felix Cheung
I think the difference is that py4j is a public library, whereas the R backend is 
specific to SparkR.

Can you elaborate on what you need JVMObjectTracker for? We have provided 
convenient R APIs to call into the JVM: sparkR.callJMethod, for example.

_
From: Jeremy Liu 
Sent: Tuesday, March 27, 2018 12:20 PM
Subject: Re: [Spark R] Proposal: Exposing RBackend in RRunner
To: 


Spark Dev,

On second thought, the below topic seems more appropriate for spark-dev rather 
than spark-users:

Spark Users,

In SparkR, RBackend is created in RRunner.main(). This in particular makes it 
difficult to control or use the RBackend. For my use case, I am looking to 
access the JVMObjectTracker that RBackend maintains for SparkR dataframes.

Analogously, pyspark starts a py4j.GatewayServer in PythonRunner.main(). It's 
then possible to start a ClientServer that then has access to the object 
bindings between Python/Java.

Is there something similar for SparkR? Or a reasonable way to expose RBackend?

Thanks!
--
-
Jeremy Liu
jeremy.jl@gmail.com




Re: [Spark R]: Linear Mixed-Effects Models in Spark R

2018-03-26 Thread Felix Cheung
If your data can be split into groups and you can call into your favorite R 
package on each group of data (in parallel):

https://spark.apache.org/docs/latest/sparkr.html#run-a-given-function-on-a-large-dataset-grouping-by-input-columns-and-using-gapply-or-gapplycollect



From: Nisha Muktewar 
Sent: Monday, March 26, 2018 2:27:52 PM
To: Josh Goldsborough
Cc: user
Subject: Re: [Spark R]: Linear Mixed-Effects Models in Spark R

Look at LinkedIn's Photon ML package: https://github.com/linkedin/photon-ml

One of the caveats is/was that the input data has to be in Avro in a specific 
format.

On Mon, Mar 26, 2018 at 1:46 PM, Josh Goldsborough 
> wrote:
The company I work for is trying to do some mixed-effects regression modeling 
in our new big data platform including SparkR.

We can run via SparkR's support of native R & use lme4.  But it runs single 
threaded.  So we're looking for tricks/techniques to process large data sets.


This was asked a couple years ago:
https://stackoverflow.com/questions/39790820/mixed-effects-models-in-spark-or-other-technology

But I wanted to ask again, in case anyone had an answer now.

Thanks,
Josh Goldsborough



[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-03-25 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412918#comment-16412918
 ] 

Felix Cheung commented on SPARK-23780:
--

though there are other methods

 

[https://www.rforge.net/doc/packages/JSON/toJSON.html]

 

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use googleVis library with Spark 2.2.1, and faced with problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Than I got following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But expected result is to get some HTML code output, as it was with Spark 
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-03-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412458#comment-16412458
 ] 

Felix Cheung edited comment on SPARK-23780 at 3/24/18 6:53 AM:
---

here

[https://github.com/mages/googleVis/blob/master/R/zzz.R#L39]

 or here

[https://github.com/jeroen/jsonlite/blob/master/R/toJSON.R#L2] 


was (Author: felixcheung):
here

[https://github.com/mages/googleVis/blob/master/R/zzz.R#L39]

 

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use googleVis library with Spark 2.2.1, and faced with problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Than I got following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But expected result is to get some HTML code output, as it was with Spark 
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-03-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412458#comment-16412458
 ] 

Felix Cheung commented on SPARK-23780:
--

here

[https://github.com/mages/googleVis/blob/master/R/zzz.R#L39]

 

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use googleVis library with Spark 2.2.1, and faced with problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Than I got following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But expected result is to get some HTML code output, as it was with Spark 
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-03-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412457#comment-16412457
 ] 

Felix Cheung commented on SPARK-23780:
--

hmm, I think the cause of this is the incompatibility of the method signature 
of toJSON

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use googleVis library with Spark 2.2.1, and faced with problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Than I got following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But expected result is to get some HTML code output, as it was with Spark 
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23497) Sparklyr Applications doesn't disconnect spark driver in client mode

2018-03-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410858#comment-16410858
 ] 

Felix Cheung commented on SPARK-23497:
--

you should probably follow up with sparklyr/rstudio on this.

> Sparklyr Applications doesn't disconnect spark driver in client mode
> 
>
> Key: SPARK-23497
> URL: https://issues.apache.org/jira/browse/SPARK-23497
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 2.1.0
>Reporter: bharath kumar
>Priority: Major
>
> Hello,
> When we use Sparklyr to connect to Yarn cluster manager in client mode or 
> cluster mode, Spark driver will not disconnect unless we mention the 
> spark_disconnect(sc) in the code.
> Does it make sense to add a timeout feature for driver to exit after certain 
> amount of time, in client mode or cluster mode. I think its only happening 
> with connection from Sparklyr to Yarn. Some times the driver stays there for 
> weeks and holds minimum resources .
> *More  Details:*
> Yarn -2.7.0
> Spark -2.1.0
> Rversion:
> Microsoft R Open 3.4.2
> Rstudio Version:
> rstudio-server-1.1.414-1.x86_64
> yarn application -status application_id
> 18/01/22 09:08:45 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM 
> address to resourcemanager.com/resourcemanager:8032
>  
> Application Report : 
>     Application-Id : application_id
>     Application-Name : sparklyr
>     Application-Type : SPARK
>     User : userid
>     Queue : root.queuename
>     Start-Time : 1516245523965
>     Finish-Time : 0
>     Progress : 0%
>     State : RUNNING
>     Final-State : UNDEFINED
>     Tracking-URL : N/A
>     RPC Port : -1
>     AM Host : N/A
>     Aggregate Resource Allocation :266468 MB-seconds, 59 vcore-seconds
>     Diagnostics : N/A
>  
> [http://spark.rstudio.com/]
>  
> I can provide more details if required
>  
> Thanks,
> Bharath



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410856#comment-16410856
 ] 

Felix Cheung commented on SPARK-23650:
--

can you clarify where you see "R environment inside the thread for applying UDF 
is not getting cached. It is created and destroyed with each query."?

> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
> Attachments: read_model_in_udf.txt, sparkR_log2.txt, sparkRlag.txt
>
>
> For eg, I am getting streams from Kafka and I want to implement a model made 
> in R for those streams. For this, I am using dapply.
> My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = 
> "localhost:9092", topic = "source")
> lines<- select(kafka, cast(kafka$value, "string"))
> schema<-schema(lines)
> df1<-dapply(lines,function(x){
> i_model<-SparkR:::value(randomMatBr)
> for (row in 1:nrow(x))
> { y<-fromJSON(as.character(x[row,"value"])) y$predict=predict(i_model,y) 
> y<-toJSON(y) x[row,"value"] = y }
> x
> },schema)
> Every time when Kafka streams are fetched the dapply method creates new 
> runner thread and ships the variables again, which causes a huge lag(~2s for 
> shipping model) every time. I even tried without broadcast variables but it 
> takes same time to ship variables. Can some other techniques be applied to 
> improve its performance?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-21 Thread Felix Cheung
I like being able to customize the docker image itself - but I realize this 
thread is more about “API” for the stock image.

Environment is nice. Probably we need a way to set custom spark config (as a 
file??)



From: Holden Karau 
Sent: Wednesday, March 21, 2018 10:44:20 PM
To: Erik Erlandson
Cc: dev
Subject: Re: Toward an "API" for spark images used by the Kubernetes back-end

I’m glad this discussion is happening on dev@ :)

Personally I like customizing with shell env variables while rolling my own 
image, but documenting the expectations/usage of the variables is definitely 
needed before we can really call it an API.

On the related question, I suspect two of the more “common” likely 
customizations are adding additional jars for bootstrapping fetching from a DFS 
& also similarly complicated Python dependencies (although given the Python 
support isn’t merged yet it’s hard to say what exactly this would look like).

I could also see some vendors wanting to add some bootstrap/setup scripts to 
fetch keys or other things.

What other ways do folks foresee customizing their Spark docker containers?

On Wed, Mar 21, 2018 at 5:04 PM Erik Erlandson 
> wrote:
During the review of the recent PR to remove use of the init_container from 
kube pods as created by the Kubernetes back-end, the topic of documenting the 
"API" for these container images also came up. What information does the 
back-end provide to these containers? In what form? What assumptions does the 
back-end make about the structure of these containers?  This information is 
important in a scenario where a user wants to create custom images, 
particularly if these are not based on the reference dockerfiles.

A related topic is deciding what such an API should look like.  For example, 
early incarnations were based more purely on environment variables, which could 
have advantages in terms of an API that is easy to describe in a document.  If 
we document the current API, should we annotate it as Experimental?  If not, 
does that effectively freeze the API?

We are interested in community input about possible customization use cases and 
opinions on possible API designs!
Cheers,
Erik
--
Twitter: https://twitter.com/holdenkarau


Re: "IPython is available, use IPython for PySparkInterpreter"

2018-03-20 Thread Felix Cheung
I think that's a good point - perhaps this shouldn't be a warning.


From: Ruslan Dautkhanov 
Sent: Monday, March 19, 2018 11:10:48 AM
To: users
Subject: "IPython is available, use IPython for PySparkInterpreter"

We're getting " IPython is available, use IPython for PySparkInterpreter "
warning each time we start %pyspark notebooks.

Although there is no difference between %pyspark and %ipyspark afaik.
At least we can use all ipython magic commands etc.
(maybe because we have zeppelin.pyspark.useIPython=true?)

If that's the case, how we can disable "IPython is available, use IPython for 
PySparkInterpreter" warning ?


--
Ruslan Dautkhanov


Re: Build zeppelin 0.8 with spark 2.3

2018-03-19 Thread Felix Cheung
Are you running with branch-0.8? I think there is a recent change in master for 
this.


From: Felix Cheung <felixcheun...@hotmail.com>
Sent: Monday, March 19, 2018 9:49:10 AM
To: dev@zeppelin.apache.org; dev@zeppelin.apache.org
Subject: Re: Build zeppelin 0.8 with spark 2.3

Spark 2.3 does not support Scala 2.10. There should be a script to switch 
Zeppelin to build for Scala 2.11 only...


From: Xiaohui Liu <hero...@gmail.com>
Sent: Sunday, March 18, 2018 9:20:13 PM
To: dev@zeppelin.apache.org
Subject: Build zeppelin 0.8 with spark 2.3

Hi,

I am building zep 0.8 with Spark 2.3.

>> ./dev/change_scala_version.sh 2.11

>> mvn clean package -Pspark-2.3 -Pr -Pscala-2.11 -Pbuild-distr -DskipTests

But it fails with the following error:

[ERROR] Failed to execute goal on project spark-scala-2.10: Could not resolve
dependencies for project org.apache.zeppelin:spark-scala-2.10:jar:0.8.0-SNAPSHOT:
The following artifacts could not be resolved:
org.apache.spark:spark-repl_2.10:jar:2.3.0,
org.apache.spark:spark-core_2.10:jar:2.3.0,
org.apache.spark:spark-hive_2.10:jar:2.3.0: Failure to find
org.apache.spark:spark-repl_2.10:jar:2.3.0 in
https://repo.maven.apache.org/maven2 was cached in the local repository,
resolution will not be reattempted until the update interval of central has
elapsed or updates are forced -> [Help 1]


This is because there is no Scala 2.10 build of spark-core for version 2.3
(https://mvnrepository.com/artifact/org.apache.spark/spark-core).

Any suggestion to mitigate this problem?

Regards


Re: Build zeppelin 0.8 with spark 2.3

2018-03-19 Thread Felix Cheung
Spark 2.3 does not support Scala 2.10. There should be a script to switch 
Zeppelin to build for Scala 2.11 only...


From: Xiaohui Liu 
Sent: Sunday, March 18, 2018 9:20:13 PM
To: dev@zeppelin.apache.org
Subject: Build zeppelin 0.8 with spark 2.3

Hi,

I am building zep 0.8 with Spark 2.3.

>> ./dev/change_scala_version.sh 2.11

>> mvn clean package -Pspark-2.3 -Pr -Pscala-2.11 -Pbuild-distr -DskipTests

But it fails with the following error:

[ERROR] Failed to execute goal on project spark-scala-2.10: Could not resolve
dependencies for project org.apache.zeppelin:spark-scala-2.10:jar:0.8.0-SNAPSHOT:
The following artifacts could not be resolved:
org.apache.spark:spark-repl_2.10:jar:2.3.0,
org.apache.spark:spark-core_2.10:jar:2.3.0,
org.apache.spark:spark-hive_2.10:jar:2.3.0: Failure to find
org.apache.spark:spark-repl_2.10:jar:2.3.0 in
https://repo.maven.apache.org/maven2 was cached in the local repository,
resolution will not be reattempted until the update interval of central has
elapsed or updates are forced -> [Help 1]


This is because there is no Scala 2.10 build of spark-core for version 2.3
(https://mvnrepository.com/artifact/org.apache.spark/spark-core).

Any suggestion to mitigate this problem?

Regards


[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-18 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404198#comment-16404198
 ] 

Felix Cheung commented on SPARK-23650:
--

Is there a reason for the broadcast?

Could you instead distribute the .rds file to all the executors and then call readRDS 
from within your UDF?

I understand this approach has been used quite a bit.



> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
> Attachments: sparkR_log2.txt, sparkRlag.txt
>
>
> For eg, I am getting streams from Kafka and I want to implement a model made 
> in R for those streams. For this, I am using dapply.
> My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = 
> "localhost:9092", topic = "source")
> lines<- select(kafka, cast(kafka$value, "string"))
> schema<-schema(lines)
> df1<-dapply(lines,function(x){
> i_model<-SparkR:::value(randomMatBr)
> for (row in 1:nrow(x))
> { y<-fromJSON(as.character(x[row,"value"])) y$predict=predict(i_model,y) 
> y<-toJSON(y) x[row,"value"] = y }
> x
> },schema)
> Every time when Kafka streams are fetched the dapply method creates new 
> runner thread and ships the variables again, which causes a huge lag(~2s for 
> shipping model) every time. I even tried without broadcast variables but it 
> takes same time to ship variables. Can some other techniques be applied to 
> improve its performance?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Custom metrics sink

2018-03-16 Thread Felix Cheung
There is a proposal to expose them. See SPARK-14151


From: Christopher Piggott 
Sent: Friday, March 16, 2018 1:09:38 PM
To: user@spark.apache.org
Subject: Custom metrics sink

Just for fun, I want to make a stupid program that makes different frequency 
chimes as each worker becomes active.  That way you can 'hear' what the cluster 
is doing and how it's distributing work.

I thought to do this I would make a custom Sink, but the Sink and everything 
else in org.apache.spark.metrics.sink is private to spark.  What I was hoping 
to do was to just pick up the # of active workers in semi real time (once a 
second?) and have them send a UDP message somewhere... then each worker would 
be assigned to a different frequency chime.  It's just a toy, for fun.

How do you add a custom Sink when these classes don't seem to be exposed?

--C



Re: Changing how we compute release hashes

2018-03-16 Thread Felix Cheung
+1 there


From: Sean Owen <sro...@gmail.com>
Sent: Friday, March 16, 2018 9:51:49 AM
To: Felix Cheung
Cc: rb...@netflix.com; Nicholas Chammas; Spark dev list
Subject: Re: Changing how we compute release hashes

I think the issue with that is that OS X doesn't have "sha512sum". Both it and 
Linux have "shasum -a 512" though.

On Fri, Mar 16, 2018 at 11:05 AM Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Instead of using gpg to create the sha512 hash file, we could just change to 
using sha512sum? That would output the right format, which is in turn verifiable.



From: Ryan Blue <rb...@netflix.com.INVALID>
Sent: Friday, March 16, 2018 8:31:45 AM
To: Nicholas Chammas
Cc: Spark dev list
Subject: Re: Changing how we compute release hashes

+1 It's possible to produce the same file with gpg, but the sha*sum utilities 
are a bit easier to remember the syntax for.

On Thu, Mar 15, 2018 at 9:01 PM, Nicholas Chammas 
<nicholas.cham...@gmail.com<mailto:nicholas.cham...@gmail.com>> wrote:

To verify that I’ve downloaded a Hadoop release correctly, I can just do this:

$ shasum --check hadoop-2.7.5.tar.gz.sha256
hadoop-2.7.5.tar.gz: OK


However, since we generate Spark release hashes with 
GPG<https://github.com/apache/spark/blob/c2632edebd978716dbfa7874a2fc0a8f5a4a9951/dev/create-release/release-build.sh#L167-L168>,
 the resulting hash is in a format that doesn’t play well with any tools:

$ shasum --check spark-2.3.0-bin-hadoop2.7.tgz.sha512
shasum: spark-2.3.0-bin-hadoop2.7.tgz.sha512: no properly formatted SHA1 
checksum lines found


GPG doesn’t seem to offer a way to verify a file from a hash.

I know I can always manipulate the SHA512 hash into a different format or just 
manually inspect it, but as a “quality of life” improvement can we change how 
we generate the SHA512 hash so that it plays nicely with shasum? If it’s too 
disruptive to change the format of the SHA512 hash, can we add a SHA256 hash to 
our releases in this format?

I suppose if it’s not easy to update or add hashes to our existing releases, it 
may be too difficult to change anything here. But I’m not sure, so I thought 
I’d ask.

Nick

​



--
Ryan Blue
Software Engineer
Netflix


Re: Changing how we compute release hashes

2018-03-16 Thread Felix Cheung
Instead of using gpg to create the sha512 hash file, we could just change to 
using sha512sum? That would output the right format, which is in turn verifiable.
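
A rough sketch of what that amounts to (the file name is just the example from the
message below; on OS X, "shasum -a 512" produces and checks the same format):

    # generate the hash in a checkable format
    sha512sum spark-2.3.0-bin-hadoop2.7.tgz > spark-2.3.0-bin-hadoop2.7.tgz.sha512

    # verification is then a one-liner
    sha512sum -c spark-2.3.0-bin-hadoop2.7.tgz.sha512
    # or, portably: shasum -a 512 -c spark-2.3.0-bin-hadoop2.7.tgz.sha512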



From: Ryan Blue 
Sent: Friday, March 16, 2018 8:31:45 AM
To: Nicholas Chammas
Cc: Spark dev list
Subject: Re: Changing how we compute release hashes

+1 It's possible to produce the same file with gpg, but the sha*sum utilities 
are a bit easier to remember the syntax for.

On Thu, Mar 15, 2018 at 9:01 PM, Nicholas Chammas 
> wrote:

To verify that I’ve downloaded a Hadoop release correctly, I can just do this:

$ shasum --check hadoop-2.7.5.tar.gz.sha256
hadoop-2.7.5.tar.gz: OK


However, since we generate Spark release hashes with 
GPG,
 the resulting hash is in a format that doesn’t play well with any tools:

$ shasum --check spark-2.3.0-bin-hadoop2.7.tgz.sha512
shasum: spark-2.3.0-bin-hadoop2.7.tgz.sha512: no properly formatted SHA1 
checksum lines found


GPG doesn’t seem to offer a way to verify a file from a hash.

I know I can always manipulate the SHA512 hash into a different format or just 
manually inspect it, but as a “quality of life” improvement can we change how 
we generate the SHA512 hash so that it plays nicely with shasum? If it’s too 
disruptive to change the format of the SHA512 hash, can we add a SHA256 hash to 
our releases in this format?

I suppose if it’s not easy to update or add hashes to our existing releases, it 
may be too difficult to change anything here. But I’m not sure, so I thought 
I’d ask.

Nick

​



--
Ryan Blue
Software Engineer
Netflix


[jira] [Comment Edited] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401525#comment-16401525
 ] 

Felix Cheung edited comment on SPARK-23650 at 3/16/18 7:16 AM:
---

do you mean this?

 

RRunner: Times: boot = 0.010 s, init = 0.005 s, broadcast = 1.894 s, read-input 
= 0.001 s, compute = 0.062 s, write-output = 0.002 s, total = 1.974 s

 

Under the cover it is working with the same R process.

I see

SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006

each time it is creating a new broadcast, which would then need to be 
transferred.

IMO there are a few things to look into:
 # it should detect if the broadcast is the same (not sure if it does that)
 # if it is attributed to the same broadcast in use.daemon mode then it perhaps 
doesn't have to transfer it again (but it would need to keep track of the stage 
executed before and broadcast that was sent before etc)
 # data transfer can be faster (SPARK-18924)

However, as of now RRunner simply picks up the broadcast that is passed to it and 
sends it.

 


was (Author: felixcheung):
do you mean this?

 

RRunner: Times: boot = 0.010 s, init = 0.005 s, broadcast = 1.894 s, read-input 
= 0.001 s, compute = 0.062 s, write-output = 0.002 s, total = 1.974 s

 

Under the cover it is working with the same R process.

I see

SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006

each time it is creating a new broadcast which would then needs to be 
transferred.

IMO there are a few things to look into:
 # it should detect if the broadcast is the same (not sure if it does that)
 # if it is attributed to the same broadcast in use.daemon mode then it perhaps 
doesn't have to transfer it again
 # data transfer can be faster (SPARK-18924)

However, as of now RRunner simply picks up the broadcast that is pass to it and 
sends it.

 

> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
> Attachments: sparkR_log2.txt, sparkRlag.txt
>
>
> For example, I am getting streams from Kafka and I want to apply a model built
> in R to those streams. For this, I am using dapply.
> My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka", subscribe = "source", kafka.bootstrap.servers =
> "localhost:9092", topic = "source")
> lines <- select(kafka, cast(kafka$value, "string"))
> schema <- schema(lines)
> df1 <- dapply(lines, function(x) {
>   i_model <- SparkR:::value(randomMatBr)
>   for (row in 1:nrow(x)) {
>     y <- fromJSON(as.character(x[row, "value"]))
>     y$predict <- predict(i_model, y)
>     y <- toJSON(y)
>     x[row, "value"] <- y
>   }
>   x
> }, schema)
> Every time Kafka streams are fetched, the dapply method creates a new
> runner thread and ships the variables again, which causes a huge lag (~2 s for
> shipping the model) every time. I even tried without broadcast variables, but it
> takes the same time to ship the variables. Can some other techniques be applied to
> improve its performance?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401525#comment-16401525
 ] 

Felix Cheung commented on SPARK-23650:
--

do you mean this?

 

RRunner: Times: boot = 0.010 s, init = 0.005 s, broadcast = 1.894 s, read-input 
= 0.001 s, compute = 0.062 s, write-output = 0.002 s, total = 1.974 s

 

Under the covers it is working with the same R process.

I see

SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006

each time it is creating a new broadcast which would then need to be 
transferred.

IMO there are a few things to look into:
 # it should detect if the broadcast is the same (not sure if it does that)
 # if it is attributed to the same broadcast in use.daemon mode then it perhaps 
doesn't have to transfer it again
 # data transfer can be faster (SPARK-18924)

However, as of now RRunner simply picks up the broadcast that is passed to it and 
sends it.

 

> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
> Attachments: sparkR_log2.txt, sparkRlag.txt
>
>
> For example, I am getting streams from Kafka and I want to apply a model built
> in R to those streams. For this, I am using dapply.
> My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka", subscribe = "source", kafka.bootstrap.servers =
> "localhost:9092", topic = "source")
> lines <- select(kafka, cast(kafka$value, "string"))
> schema <- schema(lines)
> df1 <- dapply(lines, function(x) {
>   i_model <- SparkR:::value(randomMatBr)
>   for (row in 1:nrow(x)) {
>     y <- fromJSON(as.character(x[row, "value"]))
>     y$predict <- predict(i_model, y)
>     y <- toJSON(y)
>     x[row, "value"] <- y
>   }
>   x
> }, schema)
> Every time Kafka streams are fetched, the dapply method creates a new
> runner thread and ships the variables again, which causes a huge lag (~2 s for
> shipping the model) every time. I even tried without broadcast variables, but it
> takes the same time to ship the variables. Can some other techniques be applied to
> improve its performance?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] Not marking Jira issues as resolved in 1.5.0 as resolved in 1.6.0

2018-03-15 Thread Felix Cheung
+1


From: Till Rohrmann 
Sent: Thursday, March 15, 2018 5:45:14 AM
To: dev@flink.apache.org
Subject: Re: [DISCUSS] Not marking Jira issues as resolved in 1.5.0 as resolved 
in 1.6.0

+1 for marking bugs as fixed 1.5.0 only

On Thu, Mar 15, 2018 at 11:09 AM, Ted Yu  wrote:

> +1 on marking bugs as fixed for 1.5.0 only.
>  Original message From: Piotr Nowojski <
> pi...@data-artisans.com> Date: 3/15/18  12:48 AM  (GMT-08:00) To:
> dev@flink.apache.org Subject: Re: [DISCUSS] Not marking Jira issues as
> resolved in 1.5.0 as resolved in 1.6.0
> Same as Chesnay
>
> +1 for marking bugs as fixed 1.5.0 only
>
> > On 15 Mar 2018, at 07:57, Chesnay Schepler  wrote:
> >
> > +1 to mark bugs as fixed in 1.5.0 only.
> >
> > On 15.03.2018 01:40, Aljoscha Krettek wrote:
> >> Hi,
> >>
> >> We currently have some issues that are marked as resolved for both
> 1.5.0 and 1.6.0 [1]. The reason is that we have the release-1.5 branch and
> the master branch, which will eventually become the branch for 1.6.0.
> >>
> >> I think this can lead to confusion because the release notes are
> created based on that data. Say, we fix a bug "foo" after we created the
> release-1.5 branch. Now we will have "[FLINK-] Fixed foo" in the
> release notes for 1.5.0 and 1.6.0. We basically start our Flink 1.6.0
> release notes with around 50 issues that were never bugs in 1.6.0 because
> they were fixed in 1.5.0. Plus, having "[FLINK-] Fixed foo" in the
> 1.6.0 release notes indicates that "foo" was actually a bug in 1.5.0
> (because we now had to fix it), but it wasn't.
> >>
> >> I would propose to remove fixVersion 1.6.0 from all issues that have
> 1.5.0 as fixVersion. What do you think?
> >>
> >> On a side note: a bug that is fixed in 1.5.1 should be marked as fixed
> for 1.6.0 separately, because 1.6.0 is not a direct successor to 1.5.1.
> >>
> >> Best,
> >> Aljoscha
> >>
> >> [1] https://issues.apache.org/jira/issues/?jql=project%20%
> 3D%20FLINK%20and%20fixVersion%20%3D%201.6.0%20and%20resolution%20!%3D%
> 20unresolved
> >
> >
>
>


Re: How to start practicing Python Spark Streaming in Linux?

2018-03-14 Thread Felix Cheung
It’s best to start with Structured Streaming

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#tab_python_0

https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#tab_python_0
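
For reference, a minimal sketch of the Kafka-to-console pattern from those guides,
in Python (the broker address and topic name are placeholders, and the
spark-sql-kafka package needs to be on the classpath):

from pyspark.sql import SparkSession

# Read a Kafka topic as a stream and echo the values to the console
# in micro-batches.
spark = (SparkSession.builder
         .appName("structured-streaming-kafka-demo")
         .getOrCreate())

lines = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
         .option("subscribe", "source")                        # placeholder topic
         .load()
         .selectExpr("CAST(value AS STRING) AS value"))

query = (lines.writeStream
         .outputMode("append")
         .format("console")
         .start())

query.awaitTermination()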

_
From: Aakash Basu 
Sent: Wednesday, March 14, 2018 1:09 AM
Subject: How to start practicing Python Spark Streaming in Linux?
To: user 


Hi all,

Any guide on how to kick-start learning PySpark Streaming on an Ubuntu standalone 
system? Step-wise, practical hands-on material would be great.

Also, connecting Kafka with Spark and getting real time data and processing it 
in micro-batches...

Any help?

Thanks,
Aakash.




Re: Too many open files on Bucketing sink

2018-03-14 Thread Felix Cheung
I have seen this before as well.

My workaround was to limit the parallelism, but it has the unfortunate 
effect of also limiting the number of processing tasks (and so slowing things 
down).

Another alternative is to have bigger buckets (and a smaller number of buckets).

Not sure if there is a good solution.


From: galantaa 
Sent: Tuesday, March 13, 2018 7:08:01 AM
To: user@flink.apache.org
Subject: Too many open files on Bucketing sink

Hey all,
I'm using the bucketing sink with a bucketer that creates a partition per customer
per day.
I sink the files to S3.
It's supposed to work on around 500 files at the same time (according to my
partitioning).

I have a critical problem of 'Too many open files'.
I've brought up two taskmanagers, each with 16 slots. I've checked how many open
files (or file descriptors) exist with 'lsof | wc -l' and it had reached
over a million files on each taskmanager!

After that, I decreased the number of taskSlots to 8 (4 in each taskmanager),
and the concurrency dropped.
Checking 'lsof | wc -l' gave around 250k files on each machine.
I also checked how many actual files exist in my tmp dir (it works on the
files there before uploading them to s3) - around 3000.

I think that each taskSlot works with several threads (maybe 16?), and each
thread holds an fd for the actual file, and that's how the numbers get so
high.

Is that a known problem? Is there anything I can do?
For now, I filter just 10 customers and it works great, but I have to find a
real solution so I can stream all the data.
Maybe I can also work with a single task Slot per machine but I'm not sure
this is a good idea.

Thank you very much,
Alon



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
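
One way to sanity-check the thread theory above is to count the entries under
/proc/<pid>/fd for each TaskManager JVM, since threads within a process share a
single fd table and lsof can report many more rows than there are actual
descriptors. A rough sketch in Python (the process-matching pattern is a
placeholder):

import os

def open_fds(pid):
    # Number of file descriptors the process actually holds.
    try:
        return len(os.listdir("/proc/%d/fd" % pid))
    except (FileNotFoundError, PermissionError):
        return None

def matching_pids(pattern=b"taskmanager"):
    # Very rough: PIDs whose command line mentions the pattern; adjust the
    # pattern to however the TaskManager JVM shows up on your machine.
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % entry, "rb") as f:
                if pattern in f.read():
                    pids.append(int(entry))
        except OSError:
            pass
    return pids

for pid in matching_pids():
    print(pid, open_fds(pid))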


[jira] [Commented] (SPARK-23618) docker-image-tool.sh Fails While Building Image

2018-03-14 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398175#comment-16398175
 ] 

Felix Cheung commented on SPARK-23618:
--

I think this is because the user isn't in the user role list.

you can add him to Contributors here 
https://issues.apache.org/jira/plugins/servlet/project-config/SPARK/roles

> docker-image-tool.sh Fails While Building Image
> ---
>
> Key: SPARK-23618
> URL: https://issues.apache.org/jira/browse/SPARK-23618
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ninad Ingole
>Priority: Major
>
> I am trying to build kubernetes image for version 2.3.0, using 
> {code:java}
> ./bin/docker-image-tool.sh -r ninadingole/spark-docker -t v2.3.0 build
> {code}
> which gives me a docker build error:
> {code:java}
> "docker build" requires exactly 1 argument.
> See 'docker build --help'.
> Usage: docker build [OPTIONS] PATH | URL | - [flags]
> Build an image from a Dockerfile
> {code}
>  
> I am executing the command within the Spark distribution directory. Please let me 
> know what the issue is.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-14 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398174#comment-16398174
 ] 

Felix Cheung commented on SPARK-23650:
--

I see one RRunner - do you have more of the log?

> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
> Attachments: sparkRlag.txt
>
>
> For example, I am getting streams from Kafka and I want to apply a model built
> in R to those streams. For this, I am using dapply.
> My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka", subscribe = "source", kafka.bootstrap.servers =
> "localhost:9092", topic = "source")
> lines <- select(kafka, cast(kafka$value, "string"))
> schema <- schema(lines)
> df1 <- dapply(lines, function(x) {
>   i_model <- SparkR:::value(randomMatBr)
>   for (row in 1:nrow(x)) {
>     y <- fromJSON(as.character(x[row, "value"]))
>     y$predict <- predict(i_model, y)
>     y <- toJSON(y)
>     x[row, "value"] <- y
>   }
>   x
> }, schema)
> Every time Kafka streams are fetched, the dapply method creates a new
> runner thread and ships the variables again, which causes a huge lag (~2 s for
> shipping the model) every time. I even tried without broadcast variables, but it
> takes the same time to ship the variables. Can some other techniques be applied to
> improve its performance?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23632) sparkR.session() error with spark packages - JVM is not ready after 10 seconds

2018-03-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397213#comment-16397213
 ] 

Felix Cheung commented on SPARK-23632:
--

could you explain how you think these environment variables can help?

> sparkR.session() error with spark packages - JVM is not ready after 10 seconds
> --
>
> Key: SPARK-23632
> URL: https://issues.apache.org/jira/browse/SPARK-23632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Jaehyeon Kim
>Priority: Minor
>
> Hi
> When I execute _sparkR.session()_ with _org.apache.hadoop:hadoop-aws:2.8.2_ 
> as follows,
> {code:java}
> library(SparkR, lib.loc=file.path(Sys.getenv('SPARK_HOME'),'R', 'lib'))
> ext_opts <- '-Dhttp.proxyHost=10.74.1.25 -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=10.74.1.25 -Dhttps.proxyPort=8080'
> sparkR.session(master = "spark://master:7077",
>appName = 'ml demo',
>sparkConfig = list(spark.driver.memory = '2g'), 
>sparkPackages = 'org.apache.hadoop:hadoop-aws:2.8.2',
>spark.driver.extraJavaOptions = ext_opts)
> {code}
> I see *JVM is not ready after 10 seconds* error. Below shows some of the log 
> messages.
> {code:java}
> Ivy Default Cache set to: /home/rstudio/.ivy2/cache
> The jars for the packages stored in: /home/rstudio/.ivy2/jars
> :: loading settings :: url = 
> jar:file:/usr/local/spark-2.2.1/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> org.apache.hadoop#hadoop-aws added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>   confs: [default]
>   found org.apache.hadoop#hadoop-aws;2.8.2 in central
> ...
> ...
>   found javax.servlet.jsp#jsp-api;2.1 in central
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap,  : 
>   JVM is not ready after 10 seconds
> ...
> ...
>   found joda-time#joda-time;2.9.4 in central
> downloading 
> https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.2/hadoop-aws-2.8.2.jar
>  ...
> ...
> ...
>   xmlenc#xmlenc;0.52 from central in [default]
>   -
>   |  |modules||   artifacts   |
>   |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
>   -
>   |  default |   76  |   76  |   76  |   0   ||   76  |   76  |
>   -
> :: retrieving :: org.apache.spark#spark-submit-parent
>   confs: [default]
>   76 artifacts copied, 0 already retrieved (27334kB/56ms)
> {code}
> It's fine if I re-execute it after the package and its dependencies are 
> downloaded.
> I consider it's because of this part - 
> https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L181
> {code:java}
> if (!file.exists(path)) {
>   stop("JVM is not ready after 10 seconds")
> }
> {code}
> Just wondering if it may be possible to update this so that a user can determine how 
> long to wait?
> Thanks.
> Regards
> Jaehyeon



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23618) docker-image-tool.sh Fails While Building Image

2018-03-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397211#comment-16397211
 ] 

Felix Cheung commented on SPARK-23618:
--

[~foxish] - Jira has a different user role system, I've added you to the right 
role now, could you try again?

 

> docker-image-tool.sh Fails While Building Image
> ---
>
> Key: SPARK-23618
> URL: https://issues.apache.org/jira/browse/SPARK-23618
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ninad Ingole
>Priority: Major
>
> I am trying to build kubernetes image for version 2.3.0, using 
> {code:java}
> ./bin/docker-image-tool.sh -r ninadingole/spark-docker -t v2.3.0 build
> {code}
> which gives me a docker build error:
> {code:java}
> "docker build" requires exactly 1 argument.
> See 'docker build --help'.
> Usage: docker build [OPTIONS] PATH | URL | - [flags]
> Build an image from a Dockerfile
> {code}
>  
> I am executing the command within the Spark distribution directory. Please let me 
> know what the issue is.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Accept Pinot into Apache Incubator

2018-03-13 Thread Felix Cheung
+1

On Sun, Mar 11, 2018 at 5:34 AM Willem Jiang  wrote:

> +1 (binding)
>
>
> Willem Jiang
>
> Blog: http://willemjiang.blogspot.com (English)
>   http://jnn.iteye.com  (Chinese)
> Twitter: willemjiang
> Weibo: 姜宁willem
>
> On Sun, Mar 11, 2018 at 7:51 PM, Pierre Smits 
> wrote:
>
> > +1
> >
> >
> >
> > Best regards,
> >
> > Pierre Smits
> >
> > V.P. Apache Trafodion
> >
> > On Sun, Mar 11, 2018 at 12:59 AM, Julian Hyde  wrote:
> >
> > > +1 (binding)
> > >
> > > Ironic that Druid — a similar project — has just entered incubation
> too.
> > > But of course that is not a conflict. Both are great projects. Good
> luck!
> > >
> > > Julian
> > >
> > >
> > > > On Mar 9, 2018, at 7:37 PM, Carl Steinbach  wrote:
> > > >
> > > > +1 (binding)
> > > >
> > > > On Fri, Mar 9, 2018, 7:29 PM kishore g  wrote:
> > > >
> > > >> Added Jim Jagielski to the mentor's list.
> > > >>
> > > >> On Fri, Mar 9, 2018 at 6:35 PM, Olivier Lamy 
> > wrote:
> > > >>
> > > >>> +1
> > > >>>
> > > >>> On 9 March 2018 at 17:11, kishore g  wrote:
> > > >>>
> > >  Hi all,
> > > 
> > >  I would like to call a VOTE to accept Pinot into the Apache
> > Incubator.
> > > >>> The
> > >  full proposal is available on the wiki
> > >  
> > > 
> > >  Please cast your vote:
> > > 
> > >   [ ] +1, bring Pinot into Incubator
> > >   [ ] +0, I don't care either way,
> > >   [ ] -1, do not bring Pinot into Incubator, because...
> > > 
> > >  The vote will open at least for 72 hours and only votes from the
> > > >>> Incubator
> > >  PMC are binding.
> > > 
> > >  Thanks,
> > >  Kishore G
> > > 
> > >  Discussion thread:
> > > 
> https://lists.apache.org/thread.html/8119f9478ea1811371f1bf6685290b
> > >  22b57b1a3e0849d1d778d77dcb@%3Cgeneral.incubator.apache.org
> > > 
> > > 
> > >  = Pinot Proposal =
> > > 
> > >  == Abstract ==
> > > 
> > >  Pinot is a distributed columnar storage engine that can ingest
> data
> > in
> > >  real-time and serve analytical queries at low latency. There are
> two
> > > >>> modes
> > >  of data ingestion - batch and/or realtime. Batch mode allows users
> > to
> > >  generate pinot segments externally using systems such as Hadoop.
> > These
> > >  segments can be uploaded into Pinot via simple curl calls. Pinot
> can
> > > >>> ingest
> > >  data in near real-time from streaming sources such as Kafka. Data
> > > >>> ingested
> > >  into Pinot is stored in a columnar format. Pinot provides a SQL
> like
> > >  interface (PQL) that supports filters, aggregations, and group by
> > >  operations. It does not support joins by design, in order to
> > guarantee
> > >  predictable latency. It leverages other Apache projects such as
> > > >>> Zookeeper,
> > >  Kafka, and Helix, along with many libraries from the ASF.
> > > 
> > >  == Proposal ==
> > > 
> > >  Pinot was open sourced by LinkedIn and hosted on GitHub. Majority
> of
> > > >> the
> > >  development happens at LinkedIn with other contributions from Uber
> > and
> > >  Slack. We believe that being a part of Apache Software Foundation
> > will
> > >  improve the diversity and help form a strong community around the
> > > >>> project.
> > > 
> > >  LinkedIn submits this proposal to donate the code base to Apache
> > > >> Software
> > >  Foundation. The code is already under Apache License 2.0.  Code
> and
> > > the
> > >  documentation are hosted on Github.
> > >  * Code: http://github.com/linkedin/pinot
> > >  * Documentation: https://github.com/linkedin/pinot/wiki
> > > 
> > > 
> > >  == Background ==
> > > 
> > >  LinkedIn, similar to other companies, has many applications that
> > > >> provide
> > >  rich real-time insights to members and customers (internal and
> > > >> external).
> > >  The workload characteristics for these applications vary a lot.
> Some
> > >  internal applications simply need ad-hoc query capabilities with
> > > >>> sub-second
> > >  to multiple seconds latency. But external site facing applications
> > > >>> require
> > >  strong SLAs even under very high workloads. Prior to Pinot, LinkedIn had
> > > >>> multiple
> > >  solutions depending on the workload generated by the application
> and
> > > >> this
> > >  was inefficient. Pinot was developed to be the one single platform
> > > that
> > >  addresses all classes of applications. Today at LinkedIn, Pinot
> > powers
> > > >>> more
> > >  than 50 site facing products with workload ranging from few
> queries
> > > per
> > >  second to 1000’s of queries per second while maintaining the 99th
> > >  percentile latency which can be as low as few 

[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396607#comment-16396607
 ] 

Felix Cheung commented on SPARK-23650:
--

which system/platform are you running on?

> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
>
> For example, I am getting streams from Kafka and I want to apply a model built
> in R to those streams. For this, I am using dapply.
> My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka", subscribe = "source", kafka.bootstrap.servers =
> "localhost:9092", topic = "source")
> lines <- select(kafka, cast(kafka$value, "string"))
> schema <- schema(lines)
> df1 <- dapply(lines, function(x) {
>   i_model <- SparkR:::value(randomMatBr)
>   for (row in 1:nrow(x)) {
>     y <- fromJSON(as.character(x[row, "value"]))
>     y$predict <- predict(i_model, y)
>     y <- toJSON(y)
>     x[row, "value"] <- y
>   }
>   x
> }, schema)
> Every time Kafka streams are fetched, the dapply method creates a new
> runner thread and ships the variables again, which causes a huge lag (~2 s for
> shipping the model) every time. I even tried without broadcast variables, but it
> takes the same time to ship the variables. Can some other techniques be applied to
> improve its performance?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23632) sparkR.session() error with spark packages - JVM is not ready after 10 seconds

2018-03-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396606#comment-16396606
 ] 

Felix Cheung commented on SPARK-23632:
--

Well, if the download of packages is taking that long, then there isn't much on the 
R side we can do. Perhaps the spark package code could be changed to perform the 
download asynchronously, but then we would need to check if the JVM is ready.

We could also increase the timeout, but there are times when the JVM is unresponsive, 
and a short timeout can be useful to provide a quick termination.
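
To illustrate the "let the user decide how long to wait" idea in a language-neutral
way, a small sketch (Python, with a hypothetical file path) of polling for the
backend port file with a caller-supplied timeout instead of a hard-coded 10 seconds:

import os
import time

def wait_for_file(path, timeout_s=10.0, poll_s=0.1):
    # Poll for `path` to appear, giving up after `timeout_s` seconds.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_s)
    return False

# e.g. a user who knows sparkPackages must be downloaded could pass a larger value:
# wait_for_file("/tmp/sparkr-backend-port", timeout_s=120)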

> sparkR.session() error with spark packages - JVM is not ready after 10 seconds
> --
>
> Key: SPARK-23632
> URL: https://issues.apache.org/jira/browse/SPARK-23632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Jaehyeon Kim
>Priority: Minor
>
> Hi
> When I execute _sparkR.session()_ with _org.apache.hadoop:hadoop-aws:2.8.2_ 
> as follows,
> {code:java}
> library(SparkR, lib.loc=file.path(Sys.getenv('SPARK_HOME'),'R', 'lib'))
> ext_opts <- '-Dhttp.proxyHost=10.74.1.25 -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=10.74.1.25 -Dhttps.proxyPort=8080'
> sparkR.session(master = "spark://master:7077",
>appName = 'ml demo',
>sparkConfig = list(spark.driver.memory = '2g'), 
>sparkPackages = 'org.apache.hadoop:hadoop-aws:2.8.2',
>spark.driver.extraJavaOptions = ext_opts)
> {code}
> I see *JVM is not ready after 10 seconds* error. Below shows some of the log 
> messages.
> {code:java}
> Ivy Default Cache set to: /home/rstudio/.ivy2/cache
> The jars for the packages stored in: /home/rstudio/.ivy2/jars
> :: loading settings :: url = 
> jar:file:/usr/local/spark-2.2.1/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> org.apache.hadoop#hadoop-aws added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>   confs: [default]
>   found org.apache.hadoop#hadoop-aws;2.8.2 in central
> ...
> ...
>   found javax.servlet.jsp#jsp-api;2.1 in central
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap,  : 
>   JVM is not ready after 10 seconds
> ...
> ...
>   found joda-time#joda-time;2.9.4 in central
> downloading 
> https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.2/hadoop-aws-2.8.2.jar
>  ...
> ...
> ...
>   xmlenc#xmlenc;0.52 from central in [default]
>   -
>   |  |modules||   artifacts   |
>   |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
>   -
>   |  default |   76  |   76  |   76  |   0   ||   76  |   76  |
>   -
> :: retrieving :: org.apache.spark#spark-submit-parent
>   confs: [default]
>   76 artifacts copied, 0 already retrieved (27334kB/56ms)
> {code}
> It's fine if I re-execute it after the package and its dependencies are 
> downloaded.
> I consider it's because of this part - 
> https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L181
> {code:java}
> if (!file.exists(path)) {
>   stop("JVM is not ready after 10 seconds")
> }
> {code}
> Just wondering if it may be possible to update this so that a user can determine how 
> long to wait?
> Thanks.
> Regards
> Jaehyeon



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] Apache Pinot Incubator Proposal

2018-03-09 Thread Felix Cheung
Hi Kishore - do you need one more mentor?


On Tue, Feb 13, 2018 at 12:10 AM kishore g  wrote:

> Hello,
>
> I would like to propose Pinot as an Apache Incubator project. The proposal
> is available as a draft at https://wiki.apache.org/incubator/PinotProposal.
> I
> have also included the text of the proposal below.
>
> Any feedback from the community is much appreciated.
>
> Regards,
> Kishore G
>
> = Pinot Proposal =
>
> == Abstract ==
>
> Pinot is a distributed columnar storage engine that can ingest data in
> real-time and serve analytical queries at low latency. There are two modes
> of data ingestion - batch and/or realtime. Batch mode allows users to
> generate pinot segments externally using systems such as Hadoop. These
> segments can be uploaded into Pinot via simple curl calls. Pinot can ingest
> data in near real-time from streaming sources such as Kafka. Data ingested
> into Pinot is stored in a columnar format. Pinot provides a SQL like
> interface (PQL) that supports filters, aggregations, and group by
> operations. It does not support joins by design, in order to guarantee
> predictable latency. It leverages other Apache projects such as Zookeeper,
> Kafka, and Helix, along with many libraries from the ASF.
>
> == Proposal ==
>
> Pinot was open sourced by LinkedIn and hosted on GitHub. Majority of the
> development happens at LinkedIn with other contributions from Uber and
> Slack. We believe that being a part of Apache Software Foundation will
> improve the diversity and help form a strong community around the project.
>
> LinkedIn submits this proposal to donate the code base to Apache Software
> Foundation. The code is already under Apache License 2.0.  Code and the
> documentation are hosted on Github.
>  * Code: http://github.com/linkedin/pinot
>  * Documentation: https://github.com/linkedin/pinot/wiki
>
>
> == Background ==
>
> LinkedIn, similar to other companies, has many applications that provide
> rich real-time insights to members and customers (internal and external).
> The workload characteristics for these applications vary a lot. Some
> internal applications simply need ad-hoc query capabilities with sub-second
> to multiple seconds latency. But external site facing applications require
> strong SLAs even under very high workloads. Prior to Pinot, LinkedIn had multiple
> solutions depending on the workload generated by the application and this
> was inefficient. Pinot was developed to be the one single platform that
> addresses all classes of applications. Today at LinkedIn, Pinot powers more
> than 50 site facing products with workload ranging from few queries per
> second to 1000’s of queries per second while maintaining the 99th
> percentile latency which can be as low as few milliseconds. All internal
> dashboards at LinkedIn are powered by Pinot.
>
> == Rationale ==
>
> We believe that requirement to develop rich real-time analytic applications
> is applicable to other organizations. Both Pinot and the interested
> communities would benefit from this work being openly available.
>
> == Current Status ==
>
> Pinot is currently open sourced under the Apache License Version 2.0 and
> available at github.com/linkedin/pinot. All the development is done using
> GitHub Pull Requests. We cut releases on a weekly basis and deploy it at
> LinkedIn. mp-0.1.468 is the latest release tag that is deployed in
> production.
>
> == Meritocracy ==
>
> Following the Apache meritocracy model, we intend to build an open and
> diverse community around Pinot. We will encourage the community to
> contribute to discussion and codebase.
>
> == Community ==
>
> Pinot is currently used extensively at LinkedIn and Uber. Several companies
> have expressed interest in the project. We hope to extend the contributor
> base significantly by bringing Pinot into Apache.
>
> == Core Developers ==
>
> Pinot was started by engineers at LinkedIn, and now has committers from
> Uber.
>
> == Alignment ==
>
> Apache is the most natural home for taking Pinot forward. Pinot leverages
> several existing Apache Projects such as Kafka, Helix, Zookeeper, and Avro.
> As Pinot gains adoption, we plan to add support for the ORC and Parquet
> formats, as well as adding integration with Yarn and Mesos.
>
> == Known Risks ==
>
> === Orphaned Products ===
>
> The risk of the Pinot project being abandoned is minimal. The teams at
> LinkedIn and Uber are highly incentivized to continue development of Pinot
> as it is a critical part of their infrastructure.
>
> === Inexperience with Open Source ===
>
> Post open sourcing, Pinot was completely developed on GitHub. All the
> current developers on Pinot are well aware of the open source development
> process. However, most of the developers are new to the Apache process.
> Kishore Gopalakrishna, one of the lead developers in Pinot, is VP and
> committer of the Apache Helix project.
>
> === Homogenous Developers ===
>
> The current core developers are all from 

[jira] [Commented] (SPARK-23632) sparkR.session() error with spark packages - JVM is not ready after 10 seconds

2018-03-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392507#comment-16392507
 ] 

Felix Cheung commented on SPARK-23632:
--

To clarify, are you running into this problem because the package download is taking 
longer than the fixed 10 seconds?

> sparkR.session() error with spark packages - JVM is not ready after 10 seconds
> --
>
> Key: SPARK-23632
> URL: https://issues.apache.org/jira/browse/SPARK-23632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Jaehyeon Kim
>Priority: Minor
>
> Hi
> When I execute _sparkR.session()_ with _org.apache.hadoop:hadoop-aws:2.8.2_ 
> as follows,
> {code:java}
> library(SparkR, lib.loc=file.path(Sys.getenv('SPARK_HOME'),'R', 'lib'))
> ext_opts <- '-Dhttp.proxyHost=10.74.1.25 -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=10.74.1.25 -Dhttps.proxyPort=8080'
> sparkR.session(master = "spark://master:7077",
>appName = 'ml demo',
>sparkConfig = list(spark.driver.memory = '2g'), 
>sparkPackages = 'org.apache.hadoop:hadoop-aws:2.8.2',
>spark.driver.extraJavaOptions = ext_opts)
> {code}
> I see *JVM is not ready after 10 seconds* error. Below shows some of the log 
> messages.
> {code:java}
> Ivy Default Cache set to: /home/rstudio/.ivy2/cache
> The jars for the packages stored in: /home/rstudio/.ivy2/jars
> :: loading settings :: url = 
> jar:file:/usr/local/spark-2.2.1/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> org.apache.hadoop#hadoop-aws added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>   confs: [default]
>   found org.apache.hadoop#hadoop-aws;2.8.2 in central
> ...
> ...
>   found javax.servlet.jsp#jsp-api;2.1 in central
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap,  : 
>   JVM is not ready after 10 seconds
> ...
> ...
>   found joda-time#joda-time;2.9.4 in central
> downloading 
> https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.2/hadoop-aws-2.8.2.jar
>  ...
> ...
> ...
>   xmlenc#xmlenc;0.52 from central in [default]
>   -
>   |  |modules||   artifacts   |
>   |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
>   -
>   |  default |   76  |   76  |   76  |   0   ||   76  |   76  |
>   -
> :: retrieving :: org.apache.spark#spark-submit-parent
>   confs: [default]
>   76 artifacts copied, 0 already retrieved (27334kB/56ms)
> {code}
> It's fine if I re-execute it after the package and its dependencies are 
> downloaded.
> I consider it's because of this part - 
> https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L181
> {code:java}
> if (!file.exists(path)) {
>   stop("JVM is not ready after 10 seconds")
> }
> {code}
> Just wondering if it may be possible to update this so that a user can determine how 
> long to wait?
> Thanks.
> Regards
> Jaehyeon



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-03-07 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23291:
-
Affects Version/s: 2.1.2
   2.2.0
   2.3.0

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0
>Reporter: Narendra
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Defect Description :
> -
> For example ,an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
> The target is to create a new column named "col2" with the value "12" 
> which is inside the string ."12" can be extracted with "starting position" as 
> "6" and "Ending position" as "7"
>  (the starting position of the first character is considered as "1" )
> But,the current code that needs to be written is :
>  
>  df <- withColumn(df,"col2",substr(df$col1,7,8)))
> Observe that the first argument in the "substr" API , which indicates the 
> 'starting position', is mentioned as "7" 
>  Also, observe that the second argument in the "substr" API , which indicates 
> the 'ending position', is mentioned as "8"
> i.e the number that should be mentioned to indicate the position should be 
> the "actual position + 1"
> Expected behavior :
> 
> The code that needs to be written is :
>  
>  df <- withColumn(df,"col2",substr(df$col1,6,7)))
> Note :
> ---
>  This defect is observed only when the starting position is greater than 
> 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-03-07 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23291.
--
  Resolution: Fixed
Assignee: Liang-Chi Hsieh
   Fix Version/s: 2.4.0
Target Version/s: 2.4.0

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Narendra
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Defect Description :
> -
> For example ,an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
> The target is to create a new column named "col2" with the value "12" 
> which is inside the string ."12" can be extracted with "starting position" as 
> "6" and "Ending position" as "7"
>  (the starting position of the first character is considered as "1" )
> But,the current code that needs to be written is :
>  
>  df <- withColumn(df,"col2",substr(df$col1,7,8)))
> Observe that the first argument in the "substr" API , which indicates the 
> 'starting position', is mentioned as "7" 
>  Also, observe that the second argument in the "substr" API , which indicates 
> the 'ending position', is mentioned as "8"
> i.e the number that should be mentioned to indicate the position should be 
> the "actual position + 1"
> Expected behavior :
> 
> The code that needs to be written is :
>  
>  df <- withColumn(df,"col2",substr(df$col1,6,7)))
> Note :
> ---
>  This defect is observed only when the starting position is greater than 
> 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



IPMC join request

2018-03-06 Thread Felix Cheung
Hi all,

I'd like to join the IPMC, initially to help mentor Dr Elephant as an incubator
project, but I am also looking forward to helping mentor other Apache incubator
projects.

I am PPMC/PMC of Apache Zeppelin (since incubation to TLP) and PMC of
Apache Spark, Release Manager for releases.

Thanks!
Felix


[jira] [Assigned] (SPARK-22430) Unknown tag warnings when building R docs with Roxygen 6.0.1

2018-03-05 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-22430:


Assignee: Rekha Joshi

> Unknown tag warnings when building R docs with Roxygen 6.0.1
> 
>
> Key: SPARK-22430
> URL: https://issues.apache.org/jira/browse/SPARK-22430
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
> Environment: Roxygen 6.0.1
>Reporter: Joel Croteau
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.4.0
>
>
> When building R docs using create-rd.sh with Roxygen 6.0.1, a large number of 
> unknown tag warnings are generated:
> {noformat}
> Warning: @export [schema.R#33]: unknown tag
> Warning: @export [schema.R#53]: unknown tag
> Warning: @export [schema.R#63]: unknown tag
> Warning: @export [schema.R#80]: unknown tag
> Warning: @export [schema.R#123]: unknown tag
> Warning: @export [schema.R#141]: unknown tag
> Warning: @export [schema.R#216]: unknown tag
> Warning: @export [generics.R#388]: unknown tag
> Warning: @export [generics.R#403]: unknown tag
> Warning: @export [generics.R#407]: unknown tag
> Warning: @export [generics.R#414]: unknown tag
> Warning: @export [generics.R#418]: unknown tag
> Warning: @export [generics.R#422]: unknown tag
> Warning: @export [generics.R#428]: unknown tag
> Warning: @export [generics.R#432]: unknown tag
> Warning: @export [generics.R#438]: unknown tag
> Warning: @export [generics.R#442]: unknown tag
> Warning: @export [generics.R#446]: unknown tag
> Warning: @export [generics.R#450]: unknown tag
> Warning: @export [generics.R#454]: unknown tag
> Warning: @export [generics.R#459]: unknown tag
> Warning: @export [generics.R#467]: unknown tag
> Warning: @export [generics.R#475]: unknown tag
> Warning: @export [generics.R#479]: unknown tag
> Warning: @export [generics.R#483]: unknown tag
> Warning: @export [generics.R#487]: unknown tag
> Warning: @export [generics.R#498]: unknown tag
> Warning: @export [generics.R#502]: unknown tag
> Warning: @export [generics.R#506]: unknown tag
> Warning: @export [generics.R#512]: unknown tag
> Warning: @export [generics.R#518]: unknown tag
> Warning: @export [generics.R#526]: unknown tag
> Warning: @export [generics.R#530]: unknown tag
> Warning: @export [generics.R#534]: unknown tag
> Warning: @export [generics.R#538]: unknown tag
> Warning: @export [generics.R#542]: unknown tag
> Warning: @export [generics.R#549]: unknown tag
> Warning: @export [generics.R#556]: unknown tag
> Warning: @export [generics.R#560]: unknown tag
> Warning: @export [generics.R#567]: unknown tag
> Warning: @export [generics.R#571]: unknown tag
> Warning: @export [generics.R#575]: unknown tag
> Warning: @export [generics.R#579]: unknown tag
> Warning: @export [generics.R#583]: unknown tag
> Warning: @export [generics.R#587]: unknown tag
> Warning: @export [generics.R#591]: unknown tag
> Warning: @export [generics.R#595]: unknown tag
> Warning: @export [generics.R#599]: unknown tag
> Warning: @export [generics.R#603]: unknown tag
> Warning: @export [generics.R#607]: unknown tag
> Warning: @export [generics.R#611]: unknown tag
> Warning: @export [generics.R#615]: unknown tag
> Warning: @export [generics.R#619]: unknown tag
> Warning: @export [generics.R#623]: unknown tag
> Warning: @export [generics.R#627]: unknown tag
> Warning: @export [generics.R#631]: unknown tag
> Warning: @export [generics.R#635]: unknown tag
> Warning: @export [generics.R#639]: unknown tag
> Warning: @export [generics.R#643]: unknown tag
> Warning: @export [generics.R#647]: unknown tag
> Warning: @export [generics.R#654]: unknown tag
> Warning: @export [generics.R#658]: unknown tag
> Warning: @export [generics.R#663]: unknown tag
> Warning: @export [generics.R#667]: unknown tag
> Warning: @export [generics.R#672]: unknown tag
> Warning: @export [generics.R#676]: unknown tag
> Warning: @export [generics.R#680]: unknown tag
> Warning: @export [generics.R#684]: unknown tag
> Warning: @export [generics.R#690]: unknown tag
> Warning: @export [generics.R#696]: unknown tag
> Warning: @export [generics.R#702]: unknown tag
> Warning: @export [generics.R#706]: unknown tag
> Warning: @export [generics.R#710]: unknown tag
> Warning: @export [generics.R#716]: unknown tag
> Warning: @export [generics.R#720]: unknown tag
> Warning: @export [generics.R#726]: unknown tag
> Warning: @export [generics.R#730]: unknown tag
> Warning: @export [generics.R#734]: unknown tag

[jira] [Resolved] (SPARK-22430) Unknown tag warnings when building R docs with Roxygen 6.0.1

2018-03-05 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-22430.
--
  Resolution: Fixed
   Fix Version/s: 2.4.0
Target Version/s: 2.4.0

> Unknown tag warnings when building R docs with Roxygen 6.0.1
> 
>
> Key: SPARK-22430
> URL: https://issues.apache.org/jira/browse/SPARK-22430
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
> Environment: Roxygen 6.0.1
>Reporter: Joel Croteau
>Priority: Trivial
> Fix For: 2.4.0
>
>
> When building R docs using create-rd.sh with Roxygen 6.0.1, a large number of 
> unknown tag warnings are generated:
> {noformat}
> Warning: @export [schema.R#33]: unknown tag
> Warning: @export [schema.R#53]: unknown tag
> Warning: @export [schema.R#63]: unknown tag
> Warning: @export [schema.R#80]: unknown tag
> Warning: @export [schema.R#123]: unknown tag
> Warning: @export [schema.R#141]: unknown tag
> Warning: @export [schema.R#216]: unknown tag
> Warning: @export [generics.R#388]: unknown tag
> Warning: @export [generics.R#403]: unknown tag
> Warning: @export [generics.R#407]: unknown tag
> Warning: @export [generics.R#414]: unknown tag
> Warning: @export [generics.R#418]: unknown tag
> Warning: @export [generics.R#422]: unknown tag
> Warning: @export [generics.R#428]: unknown tag
> Warning: @export [generics.R#432]: unknown tag
> Warning: @export [generics.R#438]: unknown tag
> Warning: @export [generics.R#442]: unknown tag
> Warning: @export [generics.R#446]: unknown tag
> Warning: @export [generics.R#450]: unknown tag
> Warning: @export [generics.R#454]: unknown tag
> Warning: @export [generics.R#459]: unknown tag
> Warning: @export [generics.R#467]: unknown tag
> Warning: @export [generics.R#475]: unknown tag
> Warning: @export [generics.R#479]: unknown tag
> Warning: @export [generics.R#483]: unknown tag
> Warning: @export [generics.R#487]: unknown tag
> Warning: @export [generics.R#498]: unknown tag
> Warning: @export [generics.R#502]: unknown tag
> Warning: @export [generics.R#506]: unknown tag
> Warning: @export [generics.R#512]: unknown tag
> Warning: @export [generics.R#518]: unknown tag
> Warning: @export [generics.R#526]: unknown tag
> Warning: @export [generics.R#530]: unknown tag
> Warning: @export [generics.R#534]: unknown tag
> Warning: @export [generics.R#538]: unknown tag
> Warning: @export [generics.R#542]: unknown tag
> Warning: @export [generics.R#549]: unknown tag
> Warning: @export [generics.R#556]: unknown tag
> Warning: @export [generics.R#560]: unknown tag
> Warning: @export [generics.R#567]: unknown tag
> Warning: @export [generics.R#571]: unknown tag
> Warning: @export [generics.R#575]: unknown tag
> Warning: @export [generics.R#579]: unknown tag
> Warning: @export [generics.R#583]: unknown tag
> Warning: @export [generics.R#587]: unknown tag
> Warning: @export [generics.R#591]: unknown tag
> Warning: @export [generics.R#595]: unknown tag
> Warning: @export [generics.R#599]: unknown tag
> Warning: @export [generics.R#603]: unknown tag
> Warning: @export [generics.R#607]: unknown tag
> Warning: @export [generics.R#611]: unknown tag
> Warning: @export [generics.R#615]: unknown tag
> Warning: @export [generics.R#619]: unknown tag
> Warning: @export [generics.R#623]: unknown tag
> Warning: @export [generics.R#627]: unknown tag
> Warning: @export [generics.R#631]: unknown tag
> Warning: @export [generics.R#635]: unknown tag
> Warning: @export [generics.R#639]: unknown tag
> Warning: @export [generics.R#643]: unknown tag
> Warning: @export [generics.R#647]: unknown tag
> Warning: @export [generics.R#654]: unknown tag
> Warning: @export [generics.R#658]: unknown tag
> Warning: @export [generics.R#663]: unknown tag
> Warning: @export [generics.R#667]: unknown tag
> Warning: @export [generics.R#672]: unknown tag
> Warning: @export [generics.R#676]: unknown tag
> Warning: @export [generics.R#680]: unknown tag
> Warning: @export [generics.R#684]: unknown tag
> Warning: @export [generics.R#690]: unknown tag
> Warning: @export [generics.R#696]: unknown tag
> Warning: @export [generics.R#702]: unknown tag
> Warning: @export [generics.R#706]: unknown tag
> Warning: @export [generics.R#710]: unknown tag
> Warning: @export [generics.R#716]: unknown tag
> Warning: @export [generics.R#720]: unknown tag
> Warning: @export [generics.R#726]: unknown tag
> Warning: @export [generics.R#730]: unknown tag
> Warning: @export [generics.R#734]: unknown tag

Re: Question on Spark-kubernetes integration

2018-03-02 Thread Felix Cheung
For pyspark specifically, IMO it should be very high on the list to port back...

As for roadmap - should be sharing more soon.


From: lucas.g...@gmail.com <lucas.g...@gmail.com>
Sent: Friday, March 2, 2018 9:41:46 PM
To: user@spark.apache.org
Cc: Felix Cheung
Subject: Re: Question on Spark-kubernetes integration

Oh interesting, given that pyspark was working in spark on kub 2.2 I assumed it 
would be part of what got merged.

Is there a roadmap in terms of when that may get merged up?

Thanks!



On 2 March 2018 at 21:32, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
That’s in the plan. We should be sharing a bit more about the roadmap in future 
releases shortly.

In the mean time this is in the official documentation on what is coming:
https://spark.apache.org/docs/latest/running-on-kubernetes.html#future-work

This support started as a fork of the Apache Spark project, and that fork has 
dynamic scaling support you can check out here:
https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html#dynamic-executor-scaling



From: Lalwani, Jayesh 
<jayesh.lalw...@capitalone.com<mailto:jayesh.lalw...@capitalone.com>>
Sent: Friday, March 2, 2018 8:08:55 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Question on Spark-kubernetes integration

Does the Resource scheduler support dynamic resource allocation? Are there any 
plans to add in the future?



The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.



Re: Question on Spark-kubernetes integration

2018-03-02 Thread Felix Cheung
That's in the plan. We should be sharing a bit more about the roadmap in future 
releases shortly.

In the mean time this is in the official documentation on what is coming:
https://spark.apache.org/docs/latest/running-on-kubernetes.html#future-work

This support started as a fork of the Apache Spark project, and that fork has 
dynamic scaling support you can check out here:
https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html#dynamic-executor-scaling



From: Lalwani, Jayesh 
Sent: Friday, March 2, 2018 8:08:55 AM
To: user@spark.apache.org
Subject: Question on Spark-kubernetes integration

Does the Resource scheduler support dynamic resource allocation? Are there any 
plans to add in the future?



The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Re: Welcoming some new committers

2018-03-02 Thread Felix Cheung
Congrats and welcome!


From: Dongjoon Hyun 
Sent: Friday, March 2, 2018 4:27:10 PM
To: Spark dev list
Subject: Re: Welcoming some new committers

Congrats to all!

Bests,
Dongjoon.

On Fri, Mar 2, 2018 at 4:13 PM, Wenchen Fan 
> wrote:
Congratulations to everyone and welcome!

On Sat, Mar 3, 2018 at 7:26 AM, Cody Koeninger 
> wrote:
Congrats to the new committers, and I appreciate the vote of confidence.

On Fri, Mar 2, 2018 at 4:41 PM, Matei Zaharia 
> wrote:
> Hi everyone,
>
> The Spark PMC has recently voted to add several new committers to the 
> project, based on their contributions to Spark 2.3 and other past work:
>
> - Anirudh Ramanathan (contributor to Kubernetes support)
> - Bryan Cutler (contributor to PySpark and Arrow support)
> - Cody Koeninger (contributor to streaming and Kafka support)
> - Erik Erlandson (contributor to Kubernetes support)
> - Matt Cheah (contributor to Kubernetes support and other parts of Spark)
> - Seth Hendrickson (contributor to MLlib and PySpark)
>
> Please join me in welcoming Anirudh, Bryan, Cody, Erik, Matt and Seth as 
> committers!
>
> Matei
> -
> To unsubscribe e-mail: 
> dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org





Re: Using bundler for Jekyll?

2018-03-01 Thread Felix Cheung
Also part of the problem is that the latest news panel is static on each page, 
so any new link added changes hundreds of files?


From: holden.ka...@gmail.com  on behalf of Holden Karau 

Sent: Thursday, March 1, 2018 6:36:43 PM
To: dev
Subject: Using bundler for Jekyll?

One of the things which comes up when folks update the Spark website is that we 
often get lots of unnecessarily changed files. I _think_ some of this might 
come from different jekyll versions on different machines, would folks be OK if 
we added a requirements that folks use bundler so we can have more consistent 
versions?

--
Twitter: https://twitter.com/holdenkarau


Re: Help needed in R documentation generation

2018-02-27 Thread Felix Cheung
I had agreed it was a compromise when it was proposed back in May 2017.

I don’t think I can capture the long reviews and many discussions that went in; 
for further discussion please start from JIRA SPARK-20889.




From: Marcelo Vanzin <van...@cloudera.com>
Sent: Tuesday, February 27, 2018 10:26:23 AM
To: Felix Cheung
Cc: Mihály Tóth; Mihály Tóth; dev@spark.apache.org
Subject: Re: Help needed in R documentation generation

I followed Misi's instructions:
- click on 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/index.html
- click on "s" at the top
- find "sin" and click on it

And that does not give me the documentation for the "sin" function.
That leads you to a really ugly list of functions that's basically
unreadable. There's lots of things like this:

## S4 method for signature 'Column'
abs(x)

Which look to me like the docs weren't properly generated. So it
doesn't look like it's a discoverability problem, it seems there's
something odd going on with the new docs.

On the previous version those same steps take me to a nicely formatted
doc for the "sin" function.



On Tue, Feb 27, 2018 at 10:14 AM, Felix Cheung
<felixcheun...@hotmail.com> wrote:
> I think what you are calling out is discoverability of names from index - I
> agree this should be improved.
>
> There are several reasons for this change, if I recall, some are:
>
> - we have too many doc pages and a very long index page because of the
> atypically large number of functions - many R packages only have dozens (or a
> dozen) and we have hundreds; this also affects discoverability
>
> - a side effect of high number of functions is that we have hundreds of
> pages of cross links between functions in the same and different categories
> that are very hard to read or find
>
> - many function examples are too simple or incomplete - it would be good to
> make them runnable, for instance
>
> There was a proposal for a search feature on the doc index at one point, IMO
> that would be very useful and would address the discoverability issue.
>
>
> ________
> From: Mihály Tóth <misut...@gmail.com>
> Sent: Tuesday, February 27, 2018 9:13:18 AM
> To: Felix Cheung
> Cc: Mihály Tóth; dev@spark.apache.org
>
> Subject: Re: Help needed in R documentation generation
>
> Hi,
>
> Earlier, at https://spark.apache.org/docs/latest/api/R/index.html I see
>
> sin as a title
> description describes what sin does
> usage, arguments, note, see also are specific to sin function
>
> When opening sin from
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/index.html:
>
> Title is 'Math functions for Column operations', not very specific to sin
> Description is 'Math functions defined for Column.'
> Usage contains a list of functions, scrolling down you can see sin as well
> though ...
>
> To me that sounds like a problem. Do I overlook something here?
>
> Best Regards,
>   Misi
>
>
> 2018-02-27 16:15 GMT+00:00 Felix Cheung <felixcheun...@hotmail.com>:
>>
>> The help content on sin is in
>>
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/column_math_functions.html
>>
>> It’s a fairly long list but sin is in there. Is that not what you are
>> seeing?
>>
>>
>> 
>> From: Mihály Tóth <mt...@cloudera.com>
>> Sent: Tuesday, February 27, 2018 8:03:34 AM
>> To: dev@spark.apache.org
>> Subject: Fwd: Help needed in R documentation generation
>>
>> Hi,
>>
>> Actually, when I open the link you provided and click on - for example -
>> 'sin' the page does not seem to describe that function at all. Actually I
>> get the same effect that I get locally. I have attached a screenshot of that:
>>
>>
>>
>>
>>
>> I tried with Chrome and then with Safari too and got the same result.
>>
>> When I go to https://spark.apache.org/docs/latest/api/R/index.html (Spark
>> 2.2.1) and select 'sin' I get a proper Description, Usage, Arguments, etc.
>> sections.
>>
>> This sounds like a bug in the documentation of Spark R, doesn't it? Shall
>> I file a Jira about it?
>>
>> Locally I ran SPARK_HOME/R/create-docs.sh and it returned successfully.
>> Unfortunately with the result mentioned above.
>>
>> Best Regards,
>>
>>   Misi
>>
>>
>>>
>>> 
>>>
>>> From: Felix Cheung <felixcheun...@hotmail.com>
>>> Date: 2018-02-26 20:42 GMT+00:00
>>> Subject: Re: Help needed in R documentation generation
>>>

Re: Help needed in R documentation generation

2018-02-27 Thread Felix Cheung
I think what you are calling out is discoverability of names from index - I 
agree this should be improved.

There are several reasons for this change, if I recall, some are:

- we have too many doc pages and a very long index page because of the atypical 
large number of functions - many R packages only have dozens (or a dozen) and 
we have hundreds; this also affects discoverability

- a side effect of high number of functions is that we have hundreds of pages 
of cross links between functions in the same and different categories that are 
very hard to read or find

- many function examples are too simple or incomplete - it would be good to 
make them runnable, for instance

There was a proposal for a search feature on the doc index at one point, IMO 
that would be very useful and would address the discoverability issue.



From: Mihály Tóth <misut...@gmail.com>
Sent: Tuesday, February 27, 2018 9:13:18 AM
To: Felix Cheung
Cc: Mihály Tóth; dev@spark.apache.org
Subject: Re: Help needed in R documentation generation

Hi,

Earlier, at https://spark.apache.org/docs/latest/api/R/index.html I see

  1.  sin as a title
  2.  description describes what sin does
  3.  usage, arguments, note, see also are specific to sin function

When opening sin from 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/index.html:

  1.  Title is 'Math functions for Column operations', not very specific to sin
  2.  Description is 'Math functions defined for Column.'
  3.  Usage contains a list of functions, scrolling down you can see sin as 
well though ...

To me that sounds like a problem. Do I overlook something here?

Best Regards,
  Misi


2018-02-27 16:15 GMT+00:00 Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>:
The help content on sin is in
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/column_math_functions.html

It’s a fairly long list but sin is in there. Is that not what you are seeing?



From: Mihály Tóth <mt...@cloudera.com<mailto:mt...@cloudera.com>>
Sent: Tuesday, February 27, 2018 8:03:34 AM
To: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Fwd: Help needed in R documentation generation

Hi,

Actually, when I open the link you provided and click on - for example - 'sin' 
the page does not seem to describe that function at all. Actually I get the same 
effect that I get locally. I have attached a screenshot of that:


[Inline image 1]


I tried with Chrome and then with Safari too and got the same result.

When I go to https://spark.apache.org/docs/latest/api/R/index.html (Spark 
2.2.1) and select 'sin' I get a proper Description, Usage, Arguments, etc. 
sections.

This sounds like a bug in the documentation of Spark R, doesn't it? Shall I 
file a Jira about it?

Locally I ran SPARK_HOME/R/create-docs.sh and it returned successfully. 
Unfortunately with the result mentioned above.

Best Regards,

  Misi





From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Date: 2018-02-26 20:42 GMT+00:00
Subject: Re: Help needed in R documentation generation
To: Mihály Tóth <misut...@gmail.com<mailto:misut...@gmail.com>>
Cc: "dev@spark.apache.org<mailto:dev@spark.apache.org>" 
<dev@spark.apache.org<mailto:dev@spark.apache.org>>


Could you tell me more about the steps you are taking? Which page are you 
clicking on?

Could you try 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/index.html


From: Mihály Tóth <misut...@gmail.com<mailto:misut...@gmail.com>>
Sent: Monday, February 26, 2018 8:06:59 AM
To: Felix Cheung
Cc: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Re: Help needed in R documentation generation

I see.

When I click on such a selected function, like 'sin', the page falls apart and 
does not tell anything about the sin function. How is it supposed to work when all 
functions link to the same column_math_functions.html?

Thanks,

  Misi


On Sun, Feb 25, 2018, 22:53 Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
This is a recent change. The html file column_math_functions.html should have the 
right help content.

What is the problem you are experiencing?


From: Mihály Tóth <misut...@gmail.com<mailto:misut...@gmail.com>>
Sent: Sunday, February 25, 2018 10:42:50 PM
To: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Help needed in R documentation generation

Hi,

I am having difficulties generating R documentation.

In the R/pkg/html/index.html file, the individual function entries reference
column_math_functions.html instead of the function page itself. Like

<a href="column_math_functions.html">asin</a>

Have you met with such a problem?

Thanks,

  Misi








Re: Help needed in R documentation generation

2018-02-27 Thread Felix Cheung
The help content on sin is in
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/column_math_functions.html

It’s a fairly long list but sin is in there. Is that not what you are seeing?



From: Mihály Tóth <mt...@cloudera.com>
Sent: Tuesday, February 27, 2018 8:03:34 AM
To: dev@spark.apache.org
Subject: Fwd: Help needed in R documentation generation

Hi,

Actually, when I open the link you provided and click on - for example - 'sin' 
the page does not seem to describe that function at all. Actually I get the same 
effect that I get locally. I have attached a screenshot of that:


[Inline image 1]


I tried with Chrome and then with Safari too and got the same result.

When I go to https://spark.apache.org/docs/latest/api/R/index.html (Spark 
2.2.1) and select 'sin' I get a proper Description, Usage, Arguments, etc. 
sections.

This sounds like a bug in the documentation of Spark R, doesn't it? Shall I 
file a Jira about it?

Locally I ran SPARK_HOME/R/create-docs.sh and it returned successfully. 
Unfortunately with the result mentioned above.

Best Regards,

  Misi





From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Date: 2018-02-26 20:42 GMT+00:00
Subject: Re: Help needed in R documentation generation
To: Mihály Tóth <misut...@gmail.com<mailto:misut...@gmail.com>>
Cc: "dev@spark.apache.org<mailto:dev@spark.apache.org>" 
<dev@spark.apache.org<mailto:dev@spark.apache.org>>


Could you tell me more about the steps you are taking? Which page are you 
clicking on?

Could you try 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/index.html


From: Mihály Tóth <misut...@gmail.com<mailto:misut...@gmail.com>>
Sent: Monday, February 26, 2018 8:06:59 AM
To: Felix Cheung
Cc: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Re: Help needed in R documentation generation

I see.

When I click on such a selected function, like 'sin', the page falls apart and 
does not tell anything about the sin function. How is it supposed to work when all 
functions link to the same column_math_functions.html?

Thanks,

  Misi


On Sun, Feb 25, 2018, 22:53 Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
This is a recent change. The html file column_math_functions.html should have the 
right help content.

What is the problem you are experiencing?


From: Mihály Tóth <misut...@gmail.com<mailto:misut...@gmail.com>>
Sent: Sunday, February 25, 2018 10:42:50 PM
To: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Help needed in R documentation generation

Hi,

I am having difficulties generating R documentation.

In the R/pkg/html/index.html file, the individual function entries reference
column_math_functions.html instead of the function page itself. Like

<a href="column_math_functions.html">asin</a>

Have you met with such a problem?

Thanks,

  Misi







Re: Spark on K8s - using files fetched by init-container?

2018-02-27 Thread Felix Cheung
Yes you were pointing to HDFS on a loopback address...


From: Jenna Hoole 
Sent: Monday, February 26, 2018 1:11:35 PM
To: Yinan Li; user@spark.apache.org
Subject: Re: Spark on K8s - using files fetched by init-container?

Oh, duh. I completely forgot that file:// is a prefix I can use. Up and running 
now :)

Thank you so much!
Jenna

On Mon, Feb 26, 2018 at 1:00 PM, Yinan Li 
> wrote:
OK, it looks like you will need to use 
`file:///var/spark-data/spark-files/flights.csv` instead. The 'file://' scheme 
must be explicitly used as it seems it defaults to 'hdfs' in your setup.
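
In other words, keep --files pointing at the HDFS source but pass the 
application the staged local path with an explicit file:// scheme. A rough 
sketch of the adjusted arguments, reusing the paths from the original command 
quoted below (all other spark-submit/Kubernetes options are omitted):

spark-submit \
  --files hdfs://192.168.0.1:8020/user/jhoole/flights.csv \
  local:///opt/spark/examples/src/main/r/data-manipulation.R \
  file:///var/spark-data/spark-files/flights.csv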

On Mon, Feb 26, 2018 at 12:57 PM, Jenna Hoole 
> wrote:
Thank you for the quick response! However, I'm still having problems.

When I try to look for /var/spark-data/spark-files/flights.csv I get told:

Error: Error in loadDF : analysis error - Path does not exist: 
hdfs://192.168.0.1:8020/var/spark-data/spark-files/flights.csv;

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User 
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

And when I try to look for local:///var/spark-data/spark-files/flights.csv, I 
get:

Error in file(file, "rt") : cannot open the connection

Calls: read.csv -> read.table -> file

In addition: Warning message:

In file(file, "rt") :

  cannot open file 'local:///var/spark-data/spark-files/flights.csv': No such 
file or directory

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User 
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

I can see from a kubectl describe that the directory is getting mounted.

Mounts:

  /etc/hadoop/conf from hadoop-properties (rw)

  /var/run/secrets/kubernetes.io/serviceaccount from spark-token-pxz79 (ro)

  /var/spark-data/spark-files from download-files (rw)

  /var/spark-data/spark-jars from download-jars-volume (rw)

  /var/spark/tmp from spark-local-dir-0-tmp (rw)

Is there something else I need to be doing in my set up?

Thanks,
Jenna

On Mon, Feb 26, 2018 at 12:02 PM, Yinan Li 
> wrote:
The files specified through --files are localized by the init-container to 
/var/spark-data/spark-files by default. So in your case, the file should be 
located at /var/spark-data/spark-files/flights.csv locally in the container.
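
If in doubt, a quick way to confirm what the init-container actually staged is 
to list that directory inside the running driver pod (the pod name below is a 
placeholder):

kubectl exec <driver-pod-name> -- ls -l /var/spark-data/spark-files/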

On Mon, Feb 26, 2018 at 10:51 AM, Jenna Hoole 
> wrote:
This is probably stupid user error, but I can't for the life of me figure out 
how to access the files that are staged by the init-container.

I'm trying to run the SparkR example data-manipulation.R which requires the 
path to its datafile. I supply the hdfs location via --files and then the full 
hdfs path.


--files hdfs://192.168.0.1:8020/user/jhoole/flights.csv local:///opt/spark/examples/src/main/r/data-manipulation.R hdfs://192.168.0.1:8020/user/jhoole/flights.csv

The init-container seems to load my file.

18/02/26 18:29:09 INFO spark.SparkContext: Added file hdfs://192.168.0.1:8020/user/jhoole/flights.csv at hdfs://192.168.0.1:8020/user/jhoole/flights.csv with timestamp 1519669749519

18/02/26 18:29:09 INFO util.Utils: Fetching hdfs://192.168.0.1:8020/user/jhoole/flights.csv to /var/spark/tmp/spark-d943dae6-9b95-4df0-87a3-9f7978d6d4d2/userFiles-4112b7aa-b9e7-47a9-bcbc-7f7a01f93e38/fetchFileTemp7872615076522023165.tmp

However, I get an error that my file does not exist.

Error in file(file, "rt") : cannot open the connection

Calls: read.csv -> read.table -> file

In addition: Warning message:

In file(file, "rt") :

  cannot open file 'hdfs://192.168.0.1:8020/user/jhoole/flights.csv': No such file or directory

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User 
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

If I try supplying just flights.csv, I get a different error

--files hdfs://192.168.0.1:8020/user/jhoole/flights.csv local:///opt/spark/examples/src/main/r/data-manipulation.R flights.csv


Error: Error in loadDF : analysis error - Path does not exist: 

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-27 Thread Felix Cheung
+1

Tested R:

install from package, CRAN tests, manual tests, help check, vignettes check

Filed this https://issues.apache.org/jira/browse/SPARK-23461
This is not a regression so not a blocker of the release.

Tested this on win-builder and r-hub. On r-hub on multiple platforms everything 
passed. For win-builder, tests failed on x86 but passed on x64 - perhaps due to an 
intermittent download issue causing a gzip error; re-testing now, but I won’t hold 
the release on this.
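
For reference, the package-level checks listed above boil down to something 
like the following sketch (not the exact commands run; the source package file 
name is an assumption):

# build the SparkR source package and run the CRAN-style checks, which cover
# the help/Rd checks and re-build the vignettes
R CMD build $SPARK_HOME/R/pkg
R CMD check --as-cran SparkR_2.3.0.tar.gz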


From: Nan Zhu 
Sent: Monday, February 26, 2018 4:03:22 PM
To: Michael Armbrust
Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC5)

+1  (non-binding), tested with internal workloads and benchmarks

On Mon, Feb 26, 2018 at 12:09 PM, Michael Armbrust 
> wrote:
+1 all our pipelines have been running the RC for several days now.

On Mon, Feb 26, 2018 at 10:33 AM, Dongjoon Hyun 
> wrote:
+1 (non-binding).

Bests,
Dongjoon.



On Mon, Feb 26, 2018 at 9:14 AM, Ryan Blue 
> wrote:
+1 (non-binding)

On Sat, Feb 24, 2018 at 4:17 PM, Xiao Li 
> wrote:
+1 (binding) in Spark SQL, Core and PySpark.

Xiao

2018-02-24 14:49 GMT-08:00 Ricardo Almeida 
>:
+1 (non-binding)

same as previous RC

On 24 February 2018 at 11:10, Hyukjin Kwon 
> wrote:
+1

2018-02-24 16:57 GMT+09:00 Bryan Cutler 
>:
+1
Tests passed and additionally ran Arrow related tests and did some perf checks 
with python 2.7.14

On Fri, Feb 23, 2018 at 6:18 PM, Holden Karau 
> wrote:
Note: given the state of Jenkins I'd love to see Bryan Cutler or someone with 
Arrow experience sign off on this release.

On Fri, Feb 23, 2018 at 6:13 PM, Cheng Lian 
> wrote:

+1 (binding)

Passed all the tests, looks good.

Cheng

On 2/23/18 15:00, Holden Karau wrote:
+1 (binding)
PySpark artifacts install in a fresh Py3 virtual env

On Feb 23, 2018 7:55 AM, "Denny Lee" 
> wrote:
+1 (non-binding)

On Fri, Feb 23, 2018 at 07:08 Josh Goldsborough 
> wrote:
New to testing out Spark RCs for the community but I was able to run some of 
the basic unit tests without error so for what it's worth, I'm a +1.

On Thu, Feb 22, 2018 at 4:23 PM, Sameer Agarwal 
> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.3.0. 
The vote is open until Tuesday February 27, 2018 at 8:00:00 am UTC and passes 
if a majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.3.0

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.3.0-rc5: 
https://github.com/apache/spark/tree/v2.3.0-rc5 
(992447fb30ee9ebb3cf794f2d06f4d63a2d792db)

List of JIRA tickets resolved in this release can be found here: 
https://issues.apache.org/jira/projects/SPARK/versions/12339551

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/

Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1266/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/index.html


FAQ

===
What are the unresolved issues targeted for 2.3.0?
===

Please see https://s.apache.org/oXKi. At the time of writing, there are 
currently no known release blockers.

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks; in Java/Scala you can add 
the staging repository to your project's resolvers and test with the RC (make 
sure to clean up the artifact cache before/after so you don't end up building 
with an out-of-date RC going forward).
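
As a concrete illustration of the PySpark part, a fresh virtual env plus a pip 
install of the RC artifact is enough (a sketch only; the exact artifact file 
name under the -bin/ directory is an assumption):

python -m venv spark-2.3.0-rc5-env
source spark-2.3.0-rc5-env/bin/activate
pip install https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/pyspark-2.3.0.tar.gz
python -c "import pyspark; print(pyspark.__version__)"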

===
What should happen to JIRA tickets still targeting 2.3.0?
===

Committers should look at those and triage. 

[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16377913#comment-16377913
 ] 

Felix Cheung commented on SPARK-23206:
--

[~elu] Hi Edwina, we are interested in this as well. We have requirements on 
shuffle that we are currently looking into and a different approach to metric 
collection that we could discuss. Let me know if there is any 
sync/call/discussion being planned.

 

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Help needed in R documentation generation

2018-02-26 Thread Felix Cheung
Could you tell me more about the steps you are taking? Which page are you 
clicking on?

Could you try 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/api/R/index.html


From: Mihály Tóth <misut...@gmail.com>
Sent: Monday, February 26, 2018 8:06:59 AM
To: Felix Cheung
Cc: dev@spark.apache.org
Subject: Re: Help needed in R documentation generation

I see.

When I click on such a selected function, like 'sin', the page falls apart and 
does not tell anything about the sin function. How is it supposed to work when all 
functions link to the same column_math_functions.html?

Thanks,

  Misi


On Sun, Feb 25, 2018, 22:53 Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
This is a recent change. The html file column_math_functions.html should have the 
right help content.

What is the problem you are experiencing?


From: Mihály Tóth <misut...@gmail.com<mailto:misut...@gmail.com>>
Sent: Sunday, February 25, 2018 10:42:50 PM
To: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Help needed in R documentation generation

Hi,

I am having difficulties generating R documentation.

In the R/pkg/html/index.html file, the individual function entries reference
column_math_functions.html instead of the function page itself. Like

asin

Have you met with such a problem?

Thanks,

  Misi




Re: Help needed in R documentation generation

2018-02-25 Thread Felix Cheung
This is a recent change. The html file column_math_functions.html should have the 
right help content.

What is the problem you are experiencing?


From: Mihály Tóth 
Sent: Sunday, February 25, 2018 10:42:50 PM
To: dev@spark.apache.org
Subject: Help needed in R documentation generation

Hi,

I am having difficulties generating R documentation.

In the R/pkg/html/index.html file, the individual function entries reference
column_math_functions.html instead of the function page itself. Like

asin

Have you met with such a problem?

Thanks,

  Misi




Re: Github pull requests

2018-02-21 Thread Felix Cheung
Re JIRA - the merge PR script in Spark closes the JIRA automatically.
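
(For anyone looking for it, the script Felix refers to is dev/merge_spark_pr.py 
in the Spark repo. It is interactive, so the sketch below is just the 
invocation; the prompts then ask for the pull request number and which JIRA to 
resolve.)

# run from a clone of apache/spark; requires committer push rights
./dev/merge_spark_pr.py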

_
From: Julian Hyde 
Sent: Wednesday, February 21, 2018 8:46 PM
Subject: Re: Github pull requests
To: Jonas Pfefferle 
Cc: , Patrick Stuedi 


I believe that there are tools to do git / CI / JIRA integration. Spark is one 
of the projects with the most integration. Search their lists and JIRA to find 
out how they did it.

Speaking for my own project: Calcite doesn’t have very much integration because 
we don’t have spare cycles to research and troubleshoot. A documented manual 
process suffices.

Julian


> On Feb 21, 2018, at 2:26 AM, Jonas Pfefferle  wrote:
>
> We just closed our first pull request and were wondering if there is also a 
> way to automatically close the corresponding JIRA ticket. Also, is there a way 
> we can technically enforce that we have a certain number of people who 
> approved the code? Or do we have to do this informally?
>
> Thanks,
> Jonas
>
> On Wed, 14 Feb 2018 10:53:04 -0800
> Julian Hyde  wrote:
>> The nice thing about git is that every git repo is capable of being a master 
>> / slave. (The ASF git repo is special only in that it gathers audit logs 
>> when people push to it, e.g. the IP address where the push came from. Those 
>> logs will be useful if the provenance of our IP is ever challenged.)
>> So, the merging doesn’t happen on the GitHub repo. It happens in the repo on 
>> your laptop. Before merging, you pull the latest from the apache master 
>> branch (it doesn’t matter whether this comes from the GitHub mirror or the 
>> ASF repo - it is bitwise identical, as the commit SHAs will attest), and you 
>> pull from a GitHub repo the commit(s) referenced in the GitHub PR. You 
>> append these commits to the commit chain, test, then push to the ASF master 
>> branch.
>> If you add ‘Close #NN’ to the commit comments (and you generally will), an 
>> ASF commit hook will close PR #NN at the time that the commit arrives in ASF 
>> git.
>> Julian
>>> On Feb 14, 2018, at 6:59 AM, Jonas Pfefferle  wrote:
>>> I think you are missing a 3rd option:
>>> Basically option 1) but we merge the pull request on github and push the 
>>> changes to the apache git. So no need to delete the PRs. However we have to 
>>> be careful to only commit changes to github to not get the histories out of 
>>> sync.
>>> Jonas
>>> On Wed, 14 Feb 2018 13:58:58 +0100
>>> Patrick Stuedi  wrote:
 Hi all,
 If the github repo is synced with git repo only in one direction, then
 what is the recommended way to handle new code contributions
 (including code reviews)? We see two options here:
 1) Code contributions are issued as PRs on the Crail Apache github
 (and reviewed there), then merged outside in a private repo and
 committed back to the Apache git repo (the PR may need to be deleted
 once the commit has happened), from where the Apache Crail github repo
 will again pick it up (sync).
 2) We don't use the git repo at all, only the github repo. PRs are
 reviewed and merged directly at the github level.
 Option (1) looks complicated, option (2) might not be according to the
 Apache policies (?). What is the recommended way?
 -Patrick
 On Mon, Feb 12, 2018 at 5:25 PM, Julian Hyde  
 wrote:
> No.
> Julian
>> On Feb 12, 2018, at 08:03, Jonas Pfefferle  wrote:
>> Hi @all,
>> Is the Apache Crail github repository synced both ways with the Apache 
>> Crail git? I.e. can we merge pull request in github?
>> Regards,
>> Jonas
>
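
A minimal illustration of the commit-message convention Julian describes above; 
the JIRA key, summary, and PR number are placeholders:

# the ASF commit hook closes GitHub PR #12 once this commit reaches the ASF master branch
git commit -m "CRAIL-100: Fix typo in README (Close #12)"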





Re: [graphframes]how Graphframes Deal With BidirectionalRelationships

2018-02-20 Thread Felix Cheung
No it does not support bi directional edges as of now.

_
From: xiaobo <guxiaobo1...@qq.com>
Sent: Tuesday, February 20, 2018 4:35 AM
Subject: Re: [graphframes]how Graphframes Deal With BidirectionalRelationships
To: Felix Cheung <felixcheun...@hotmail.com>, <user@spark.apache.org>


So the question becomes: does graphframes support bidirectional relationships 
natively with only one edge?



------ Original ------
From: Felix Cheung <felixcheun...@hotmail.com>
Date: Tue,Feb 20,2018 10:01 AM
To: xiaobo <guxiaobo1...@qq.com>, user@spark.apache.org <user@spark.apache.org>
Subject: Re: [graphframes]how Graphframes Deal With BidirectionalRelationships

Generally that would be the approach.
But since you have effectively doubled the number of edges, this will likely 
affect the scale at which your job will run.


From: xiaobo <guxiaobo1...@qq.com>
Sent: Monday, February 19, 2018 3:22:02 AM
To: user@spark.apache.org
Subject: [graphframes]how Graphframes Deal With Bidirectional Relationships

Hi,
To represent a bidirectional relationship, one solution is to insert two edges 
for the vertex pair; my question is: do the algorithms of graphframes still 
work when we do this?

Thanks





Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Felix Cheung
Ah sorry, I realize my wording was unclear (not enough zzz or coffee).

So to clarify,
1) When searching for a word in the SQL function doc, it does return the 
search result page correctly; however, none of the links in the results open the 
actual doc page. To take the search I included as an example: if you click 
on approx_percentile, for instance, it opens the web directory instead.

2) The dist location we are voting on has a .iml file, which is 
normally not included in a release or release RC, and it is unsigned and without 
a hash (therefore it seems like it should not be in the release).

Thanks!

_
From: Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
Sent: Tuesday, February 20, 2018 2:24 AM
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: Sean Owen <sro...@gmail.com>, dev <dev@spark.apache.org>


FWIW The search result link works for me

Shivaram

On Mon, Feb 19, 2018 at 6:21 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
These are two separate things:

Does the search result links work for you?

The second is the dist location we are voting on has a .iml file.

_
From: Sean Owen <sro...@gmail.com<mailto:sro...@gmail.com>>
Sent: Tuesday, February 20, 2018 2:19 AM
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: dev <dev@spark.apache.org<mailto:dev@spark.apache.org>>



Maybe I misunderstand, but I don't see any .iml file in the 4 results on that 
page? it looks reasonable.

On Mon, Feb 19, 2018 at 8:02 PM Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Any idea with sql func docs search result returning broken links as below?

From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Sunday, February 18, 2018 10:05:22 AM
To: Sameer Agarwal; Sameer Agarwal

Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
Quick questions:

is the search link for sql functions quite right? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/search.html?q=app

this file shouldn't be included? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml








Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Felix Cheung
These are two separate things:

Does the search result links work for you?

The second is the dist location we are voting on has a .iml file.

_
From: Sean Owen <sro...@gmail.com>
Sent: Tuesday, February 20, 2018 2:19 AM
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: dev <dev@spark.apache.org>


Maybe I misunderstand, but I don't see any .iml file in the 4 results on that 
page? it looks reasonable.

On Mon, Feb 19, 2018 at 8:02 PM Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Any idea with sql func docs search result returning broken links as below?

From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Sunday, February 18, 2018 10:05:22 AM
To: Sameer Agarwal; Sameer Agarwal

Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
Quick questions:

is the search link for sql functions quite right? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/search.html?q=app

this file shouldn't be included? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml





Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Felix Cheung
Any idea with sql func docs search result returning broken links as below?


From: Felix Cheung <felixcheun...@hotmail.com>
Sent: Sunday, February 18, 2018 10:05:22 AM
To: Sameer Agarwal; Sameer Agarwal
Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)

Quick questions:

is the search link for sql functions quite right? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/search.html?q=app

this file shouldn't be included? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml



From: Sameer Agarwal <sameer.a...@gmail.com>
Sent: Saturday, February 17, 2018 1:43:39 PM
To: Sameer Agarwal
Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)

I'll start with a +1 once again.

All blockers reported against RC3 have been resolved and the builds are healthy.

On 17 February 2018 at 13:41, Sameer Agarwal 
<samee...@apache.org<mailto:samee...@apache.org>> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.3.0. 
The vote is open until Thursday February 22, 2018 at 8:00:00 am UTC and passes 
if a majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.3.0

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.3.0-rc4: 
https://github.com/apache/spark/tree/v2.3.0-rc4 
(44095cb65500739695b0324c177c19dfa1471472)

List of JIRA tickets resolved in this release can be found here: 
https://issues.apache.org/jira/projects/SPARK/versions/12339551

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/

Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1265/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/index.html


FAQ

===
What are the unresolved issues targeted for 2.3.0?
===

Please see https://s.apache.org/oXKi. At the time of writing, there are 
currently no known release blockers.

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks; in Java/Scala you can add 
the staging repository to your project's resolvers and test with the RC (make 
sure to clean up the artifact cache before/after so you don't end up building 
with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.0?
===

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.3.1 or 2.4.0 as appropriate.

===
Why is my bug not fixed?
===

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.2.0. That being said, if there is 
something which is a regression from 2.2.0 and has not been correctly targeted 
please ping me or a committer to help target the issue (you can see the open 
issues listed as impacting Spark 2.3.0 at https://s.apache.org/WmoI).



--
Sameer Agarwal
Computer Science | UC Berkeley
http://cs.berkeley.edu/~sameerag


Re: [graphframes]how Graphframes Deal With Bidirectional Relationships

2018-02-19 Thread Felix Cheung
Generally that would be the approach.
But since you have effectively doubled the number of edges, this will likely 
affect the scale at which your job will run.


From: xiaobo 
Sent: Monday, February 19, 2018 3:22:02 AM
To: user@spark.apache.org
Subject: [graphframes]how Graphframes Deal With Bidirectional Relationships

Hi,
To represent a bidirectional relationship, one solution is to insert two edges 
for the vertex pair; my question is: do the algorithms of graphframes still 
work when we do this?

Thanks



[jira] [Updated] (SPARK-23461) vignettes should include model predictions for some ML models

2018-02-18 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23461:
-
Description: 
eg. 

Linear Support Vector Machine (SVM) Classifier
h4. Logistic Regression

Tree - GBT, RF, DecisionTree

(and ALS was disabled)

By doing something like {{head(select(gmmFitted, "V1", "V2", "prediction"))}}

  was:
eg. 

Linear Support Vector Machine (SVM) Classifier
h4. Logistic Regression

Tree

(and ALS was disabled)

By doing something like {{head(select(gmmFitted, "V1", "V2", "prediction"))}}


> vignettes should include model predictions for some ML models
> -
>
> Key: SPARK-23461
> URL: https://issues.apache.org/jira/browse/SPARK-23461
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> eg. 
> Linear Support Vector Machine (SVM) Classifier
> h4. Logistic Regression
> Tree - GBT, RF, DecisionTree
> (and ALS was disabled)
> By doing something like {{head(select(gmmFitted, "V1", "V2", "prediction"))}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-18 Thread Felix Cheung
Quick questions:

is the search link for sql functions quite right? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/search.html?q=app

this file shouldn't be included? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml



From: Sameer Agarwal 
Sent: Saturday, February 17, 2018 1:43:39 PM
To: Sameer Agarwal
Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)

I'll start with a +1 once again.

All blockers reported against RC3 have been resolved and the builds are healthy.

On 17 February 2018 at 13:41, Sameer Agarwal 
> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.3.0. 
The vote is open until Thursday February 22, 2018 at 8:00:00 am UTC and passes 
if a majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.3.0

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.3.0-rc4: 
https://github.com/apache/spark/tree/v2.3.0-rc4 
(44095cb65500739695b0324c177c19dfa1471472)

List of JIRA tickets resolved in this release can be found here: 
https://issues.apache.org/jira/projects/SPARK/versions/12339551

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/

Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1265/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/index.html


FAQ

===
What are the unresolved issues targeted for 2.3.0?
===

Please see https://s.apache.org/oXKi. At the time of writing, there are 
currently no known release blockers.

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks; in Java/Scala you can add 
the staging repository to your project's resolvers and test with the RC (make 
sure to clean up the artifact cache before/after so you don't end up building 
with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.0?
===

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.3.1 or 2.4.0 as appropriate.

===
Why is my bug not fixed?
===

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.2.0. That being said, if there is 
something which is a regression from 2.2.0 and has not been correctly targeted 
please ping me or a committer to help target the issue (you can see the open 
issues listed as impacting Spark 2.3.0 at https://s.apache.org/WmoI).



--
Sameer Agarwal
Computer Science | UC Berkeley
http://cs.berkeley.edu/~sameerag


Re: Does Pyspark Support Graphx?

2018-02-18 Thread Felix Cheung
Hi - I’m maintaining it. As of now there is an issue with 2.2 that breaks 
personalized page rank, and that’s largely the reason there isn’t a release for 
2.2 support.

There are attempts to address this issue - if you are interested we would love 
your help.


From: Nicolas Paris 
Sent: Sunday, February 18, 2018 12:31:27 AM
To: Denny Lee
Cc: xiaobo; user@spark.apache.org
Subject: Re: Does Pyspark Support Graphx?

> Most likely not, as most of the effort is currently on GraphFrames - a great
> blog post on what GraphFrames offers can be found at: https://

Is the graphframes package still active? The github repository
indicates it's not extremely active. Right now, there is no available
package for spark-2.2, so one needs to compile it from sources.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[jira] [Created] (SPARK-23461) vignettes should include model predictions for some ML models

2018-02-18 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-23461:


 Summary: vignettes should include model predictions for some ML 
models
 Key: SPARK-23461
 URL: https://issues.apache.org/jira/browse/SPARK-23461
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.1, 2.3.0
Reporter: Felix Cheung


eg. 

Linear Support Vector Machine (SVM) Classifier
h4. Logistic Regression

Tree

(and ALS was disabled)

By doing something like {{head(select(gmmFitted, "V1", "V2", "prediction"))}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23435) R tests should support latest testthat

2018-02-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368351#comment-16368351
 ] 

Felix Cheung commented on SPARK-23435:
--

Working on this. Debugging a problem.

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Felix Cheung
>    Assignee: Felix Cheung
>Priority: Major
>
> To follow up on SPARK-22817, the latest version of testthat, 2.0.0 was 
> released in Dec 2017, and its method has been changed.
> In order for our tests to keep working, we need to detect that and call a 
> different method.
> Jenkins is running 1.0.1 though, we need to check if it is going to work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23435) R tests should support latest testthat

2018-02-16 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-23435:


Assignee: Felix Cheung

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Felix Cheung
>    Assignee: Felix Cheung
>Priority: Major
>
> To follow up on SPARK-22817, the latest version of testthat, 2.0.0 was 
> released in Dec 2017, and its method has been changed.
> In order for our tests to keep working, we need to detect that and call a 
> different method.
> Jenkins is running 1.0.1 though, we need to check if it is going to work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


