[jira] [Created] (SPARK-23435) R tests should support latest testthat

2018-02-15 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-23435:


 Summary: R tests should support latest testthat
 Key: SPARK-23435
 URL: https://issues.apache.org/jira/browse/SPARK-23435
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.1, 2.4.0
Reporter: Felix Cheung


To follow up on SPARK-22817: the latest version of testthat, 2.0.0, was released 
in Dec 2017, and the internal method we rely on has changed.

In order for our tests to keep working, we need to detect the installed version 
and call a different method.

Jenkins is still running 1.0.1, though, so we need to check whether the change 
will work there.
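A minimal sketch of the kind of version dispatch this implies (illustrative only: the test path and reporter are assumptions, and the internal run_tests signature follows the testthat 1.0.2 source linked from SPARK-22817):

# Hedged sketch, not the actual run-all.R change: branch on the installed
# testthat version; test_dir() is public API in 2.x, run_tests() is the
# internal 1.x entry point.
test_path <- file.path("R", "pkg", "tests", "fulltests")  # illustrative path
if (packageVersion("testthat") >= "2.0.0") {
  testthat::test_dir(test_path, reporter = "summary")
} else {
  testthat:::run_tests("SparkR", test_path, NULL, "summary")
}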






[jira] [Comment Edited] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor

2018-02-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365263#comment-16365263
 ] 

Felix Cheung edited comment on SPARK-22817 at 2/15/18 9:13 AM:
---

I should have caught this - -we need to fix the test because it will fail in 
CRAN - another option is to fix the dependency version in DESCRIPTION file-

Scratch that: on CRAN we are calling test_package, which works fine.
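For reference, test_package() is part of testthat's public API in both 1.x and 2.x, so the CRAN path is unaffected by the removal of the internal run_tests; roughly:

testthat::test_package("SparkR")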


was (Author: felixcheung):
I should have caught this - we need to fix the test because it will fail in 
CRAN - another option is to fix the dependency version in DESCRIPTION file

> Use fixed testthat version for SparkR tests in AppVeyor
> ---
>
> Key: SPARK-22817
> URL: https://issues.apache.org/jira/browse/SPARK-22817
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.2.2, 2.3.0
>
>
> We happened to access the internal {{run_tests}} - 
> https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75 - in 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58
> This seems to have been removed in 2.0.0.






[jira] [Commented] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor

2018-02-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365263#comment-16365263
 ] 

Felix Cheung commented on SPARK-22817:
--

I should have caught this - we need to fix the test because it will fail on 
CRAN. Another option is to pin the dependency version in the DESCRIPTION file.

> Use fixed testthat version for SparkR tests in AppVeyor
> ---
>
> Key: SPARK-22817
> URL: https://issues.apache.org/jira/browse/SPARK-22817
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.2.2, 2.3.0
>
>
> We happened to access the internal {{run_tests}} - 
> https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75 - in 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58
> This seems to have been removed in 2.0.0.






Re: NullPointerException in paragraph when getting batched TableEnvironment

2018-02-14 Thread Felix Cheung
Does it work within the Flink Scala Shell?


From: André Schütz 
Sent: Wednesday, February 14, 2018 4:02:30 AM
To: us...@zeppelin.incubator.apache.org
Subject: NullPointerException in paragraph when getting batched TableEnvironment

Hi,

within the Flink Interpreter context, we try to get a Batch
TableEnvironment with the following code.

[code]
import org.apache.flink.table.api._
import org.apache.flink.table.api.scala._
import org.apache.flink.table.sources._

val batchEnvironment = benv
val batchTableEnvironment = TableEnvironment.getTableEnvironment(batchEnvironment)
[/code]

When executing the paragraph, we get the following error.

[error]
Caused by: java.lang.ExceptionInInitializerError: java.lang.NullPointerException
Caused by: java.lang.NullPointerException
  at org.apache.flink.table.api.scala.BatchTableEnvironment.<init>(BatchTableEnvironment.scala:47)
  at org.apache.flink.table.api.TableEnvironment$.getTableEnvironment(TableEnvironment.scala:1049)
[/error]

Any ideas why there is the NullPointerException?

I am grateful for any ideas.

Kind regards,
Andre

--
Andre Schütz
COO / Founder - Wegtam GmbH
an...@wegtam.com | P: +49 (0) 381-80 699 041 | M: +49 (0) 176-218 02 604
www.wegtam.com | 
www.tensei-data.com | 
www.wegtam.net


Re: SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread Felix Cheung
Yes, it is an issue with the newer release of testthat.

To work around it, could you install an earlier version with devtools? I will 
follow up with a fix.
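For example (a hedged sketch: devtools must already be installed, and 1.0.2 is just an assumed 1.x release to pin to):

devtools::install_version("testthat", version = "1.0.2",
                          repos = "https://cloud.r-project.org")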

_
From: Hyukjin Kwon 
Sent: Wednesday, February 14, 2018 6:49 PM
Subject: Re: SparkR test script issue: unable to run run-tests.h on spark 2.2
To: chandan prakash 
Cc: user @spark 


From a very quick look, I think it is a testthat version issue with SparkR.

I had to pin that version to 1.x before in AppVeyor. There are a few details in 
https://github.com/apache/spark/pull/20003

Can you check and lower the testthat version?


On 14 Feb 2018 6:09 pm, "chandan prakash" 
> wrote:
Hi All,
I am trying to run the R test script ./R/run-tests.sh but am hitting the same 
ERROR every time.
I tried running on a Mac as well as a CentOS machine; the same issue comes up.
I am using Spark 2.2 (branch-2.2).
I followed the Apache doc and these steps:
1. installed R
2. installed packages like testthat as mentioned in the doc
3. ran run-tests.sh


Every time I am getting this error line:

Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
  object 'run_tests' not found
Calls: ::: -> get
Execution halted


Any Help?

--
Chandan Prakash





[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional

2018-02-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357755#comment-16357755
 ] 

Felix Cheung commented on SPARK-23285:
--

Sounds reasonable to me



> Allow spark.executor.cores to be fractional
> ---
>
> Key: SPARK-23285
> URL: https://issues.apache.org/jira/browse/SPARK-23285
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Scheduler, Spark Submit
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> There is a strong check for an integral number of cores per executor in 
> [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272].
>  Given we're reusing that property in K8s, does it make sense to relax it?
>  
> K8s treats CPU as a "compressible resource" and can actually assign millicpus 
> to individual containers. Also to be noted - spark.driver.cores has no such 
> check in place.






[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16351333#comment-16351333
 ] 

Felix Cheung commented on SPARK-23314:
--

I've isolated this down to this particular file

[https://raw.githubusercontent.com/BuzzFeedNews/2016-04-federal-surveillance-planes/master/data/feds/feds3.csv]

Without converting to pandas it seems to read fine, so I'm not sure if it's a 
data problem.

> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest 
> branch-2.3 (last change from Feb 1 UTC)






[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-02 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16351188#comment-16351188
 ] 

Felix Cheung commented on SPARK-23314:
--

Thanks. I have isolated this to a different subset of data, but I am not yet able 
to pinpoint the exact row (mostly, the value displayed is local but the data is 
UTC, and there is no match after adjusting for the time zone). It might be an 
issue with the data; in that case, is there a way to help debug this?


> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest 
> branch-2.3 (last change from Feb 1 UTC)






[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-02 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350819#comment-16350819
 ] 

Felix Cheung commented on SPARK-23314:
--

I'm running Python 2
pandas 0.22.0
pyarrow 0.8.0



> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest 
> branch-2.3 (last change from Feb 1 UTC)






[jira] [Updated] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23314:
-
Description: 
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For details, see Comment box. I'm able to reproduce this on the latest 
branch-2.3 (last change from Feb 1 UTC)

  was:
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For details, see Comment box


> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest 
> branch-2.3 (last change from Feb 1 UTC)






[jira] [Updated] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23314:
-
Description: 
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For details, see Comment box

  was:
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For detailed on repo, see Comment box


> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box






[jira] [Updated] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23314:
-
Description: 
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For detailed on repo, see Comment box

  was:
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For detailed on repo, see Comment box


> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For detailed on repo, see Comment box






[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349899#comment-16349899
 ] 

Felix Cheung commented on SPARK-23314:
--

[~icexelloss] [~bryanc]

> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For detailed on repo, see Comment box






[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349898#comment-16349898
 ] 

Felix Cheung commented on SPARK-23314:
--

log


[Stage 3:=> (195 + 5) / 
200]18/02/01 19:17:26 ERROR Executor: Exception in task 7.0 in stage 3.0 (TID 
205)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
229, in main
process()
File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 257, in 
dump_stream
batch = _create_batch(series, self._timezone)
File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 235, in 
_create_batch
arrs = [create_array(s, t) for s, t in series]
File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 230, in 
create_array
s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
File "/Users/felixcheung/spark/python/pyspark/sql/types.py", line 1733, in 
_check_series_convert_timestamps_internal
return s.dt.tz_localize(tz).dt.tz_convert('UTC')
File "/usr/local/lib/python2.7/site-packages/pandas/core/accessor.py", line 
115, in f
return self._delegate_method(name, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/accessors.py", 
line 131, in _delegate_method
result = method(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/util/_decorators.py", line 
118, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/datetimes.py", 
line 1858, in tz_localize
errors=errors)
File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc
AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:164)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:114)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/02/01 19:17:26 WARN TaskSetManager: Lost task 7.0 in stage 3.0 (TID 205, 
localhost, executor driver): org.apache.spark.api.python.PythonException: 
Traceback (most recent call last):
File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
229, in main
process()
File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 257, in 
dump_stream
batch = _create_batch(series, self._timezone)
File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 235, in 
_create_batch
arrs = [create_array(s, t) for s, t in series]
File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 230, in 
create_array
s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
File "/Users/felixcheung/spark/python/pyspark/sql/types.py", line 1733, in 
_check_series_co

[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349897#comment-16349897
 ] 

Felix Cheung commented on SPARK-23314:
--

code

 

>>> flights = spark.read.option("inferSchema", True).option("header", True).option("dateFormat", "-MM-dd HH:mm:ss").csv("data*.csv")
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> @pandas_udf(flights.schema, PandasUDFType.GROUPED_MAP)
... def subtract_mean_year_mfr(pdf):
...     return pdf.assign(year_mfr=pdf.year_mfr - pdf.year_mfr.mean())
...
>>> g = flights.groupby('mfr').apply(subtract_mean_year_mfr)
>>>
>>> g.count()

> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
>     Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For detailed on repo, see Comment box






[jira] [Updated] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23314:
-
Environment: (was: data sample

adshex,flight_id,latitude,longitude,altitude,speed,track,squawk,type,timestamp,name,other_names1,other_names2,n_number,serial_number,mfr_mdl_code,mfr,model,year_mfr,type_aircraft,agency
A72AA1,72791e8,33.2552,-117.91699,5499,111,137,4401,B350,2015-08-18T07:58:54Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2659,-117.928,5500,109,138,4401,B350,2015-08-18T07:58:39Z,US 
DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2741,-117.93599,5500,109,137,4401,B350,2015-08-18T07:58:28Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.28251,-117.945,5500,112,138,4401,B350,2015-08-18T07:58:13Z,US 
DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.29341,-117.95699,5500,102,134,4401,B350,2015-08-18T07:57:58Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs


>>> flights = spark.read.option("inferSchema", True).option("header", 
>>> True).option("dateFormat", "-MM-dd HH:mm:ss").csv("data*.csv")
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> @pandas_udf(flights.schema, PandasUDFType.GROUPED_MAP)
... def subtract_mean_year_mfr(pdf):
... return pdf.assign(year_mfr=pdf.year_mfr - pdf.year_mfr.mean())
...
g = flights.groupby('mfr').apply(subtract_mean_year_mfr)

>>> g = flights.groupby('mfr').apply(subtract_mean_year_mfr)
>>>
>>> g.count()
[Stage 3:=> (195 + 5) / 
200]18/02/01 19:17:26 ERROR Executor: Exception in task 7.0 in stage 3.0 (TID 
205)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File 
"/Users/felixcheung/Uber/spark-chamber/python/lib/pyspark.zip/pyspark/worker.py",
 line 229, in main
process()
File 
"/Users/felixcheung/Uber/spark-chamber/python/lib/pyspark.zip/pyspark/worker.py",
 line 224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/serializers.py", 
line 257, in dump_stream
batch = _create_batch(series, self._timezone)
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/serializers.py", 
line 235, in _create_batch
arrs = [create_array(s, t) for s, t in series]
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/serializers.py", 
line 230, in create_array
s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/sql/types.py", line 
1733, in _check_series_convert_timestamps_internal
return s.dt.tz_localize(tz).dt.tz_convert('UTC')
File "/usr/local/lib/python2.7/site-packages/pandas/core/accessor.py", line 
115, in f
return self._delegate_method(name, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/accessors.py", 
line 131, in _delegate_method
result = method(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/util/_decorators.py", line 
118, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/datetimes.py", 
line 1858, in tz_localize
errors=errors)
File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc
AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:164)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:114)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodege

[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349896#comment-16349896
 ] 

Felix Cheung commented on SPARK-23314:
--

data sample

adshex,flight_id,latitude,longitude,altitude,speed,track,squawk,type,timestamp,name,other_names1,other_names2,n_number,serial_number,mfr_mdl_code,mfr,model,year_mfr,type_aircraft,agency
A72AA1,72791e8,33.2552,-117.91699,5499,111,137,4401,B350,2015-08-18T07:58:54Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2659,-117.928,5500,109,138,4401,B350,2015-08-18T07:58:39Z,US 
DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2741,-117.93599,5500,109,137,4401,B350,2015-08-18T07:58:28Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.28251,-117.945,5500,112,138,4401,B350,2015-08-18T07:58:13Z,US 
DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.29341,-117.95699,5500,102,134,4401,B350,2015-08-18T07:57:58Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs

> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For detailed on repo, see Comment box






[jira] [Updated] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23314:
-
Description: 
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For detailed on repo, see Comment box

  was:
Under  SPARK-22216

When testing pandas_udf on group bys, I saw this error with the timestamp 
column.

AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

For detailed on repo, see Environment box


> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For detailed on repo, see Comment box






[jira] [Created] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-23314:


 Summary: Pandas grouped udf on dataset with timestamp column error 
 Key: SPARK-23314
 URL: https://issues.apache.org/jira/browse/SPARK-23314
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
 Environment: data sample

adshex,flight_id,latitude,longitude,altitude,speed,track,squawk,type,timestamp,name,other_names1,other_names2,n_number,serial_number,mfr_mdl_code,mfr,model,year_mfr,type_aircraft,agency
A72AA1,72791e8,33.2552,-117.91699,5499,111,137,4401,B350,2015-08-18T07:58:54Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2659,-117.928,5500,109,138,4401,B350,2015-08-18T07:58:39Z,US 
DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2741,-117.93599,5500,109,137,4401,B350,2015-08-18T07:58:28Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.28251,-117.945,5500,112,138,4401,B350,2015-08-18T07:58:13Z,US 
DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.29341,-117.95699,5500,102,134,4401,B350,2015-08-18T07:57:58Z,US
 DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & 
MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs


>>> flights = spark.read.option("inferSchema", True).option("header", True).option("dateFormat", "-MM-dd HH:mm:ss").csv("data*.csv")
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> @pandas_udf(flights.schema, PandasUDFType.GROUPED_MAP)
... def subtract_mean_year_mfr(pdf):
...     return pdf.assign(year_mfr=pdf.year_mfr - pdf.year_mfr.mean())
...
>>> g = flights.groupby('mfr').apply(subtract_mean_year_mfr)
>>>
>>> g.count()
[Stage 3:=> (195 + 5) / 
200]18/02/01 19:17:26 ERROR Executor: Exception in task 7.0 in stage 3.0 (TID 
205)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File 
"/Users/felixcheung/Uber/spark-chamber/python/lib/pyspark.zip/pyspark/worker.py",
 line 229, in main
process()
File 
"/Users/felixcheung/Uber/spark-chamber/python/lib/pyspark.zip/pyspark/worker.py",
 line 224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/serializers.py", 
line 257, in dump_stream
batch = _create_batch(series, self._timezone)
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/serializers.py", 
line 235, in _create_batch
arrs = [create_array(s, t) for s, t in series]
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/serializers.py", 
line 230, in create_array
s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
File "/Users/felixcheung/Uber/spark-chamber/python/pyspark/sql/types.py", line 
1733, in _check_series_convert_timestamps_internal
return s.dt.tz_localize(tz).dt.tz_convert('UTC')
File "/usr/local/lib/python2.7/site-packages/pandas/core/accessor.py", line 
115, in f
return self._delegate_method(name, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/accessors.py", 
line 131, in _delegate_method
result = method(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/util/_decorators.py", line 
118, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/datetimes.py", 
line 1858, in tz_localize
errors=errors)
File "pandas/_libs/tslib.pyx", line 3593, in 
pandas._libs.tslib.tz_localize_to_utc
AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
01:29:30'), try using the 'ambiguous' argument

at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:164)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:114)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)

Re: data source v2 online meetup

2018-02-01 Thread Felix Cheung
+1 hangout


From: Xiao Li 
Sent: Wednesday, January 31, 2018 10:46:26 PM
To: Ryan Blue
Cc: Reynold Xin; dev; Wenchen Fen; Russell Spitzer
Subject: Re: data source v2 online meetup

Hi, Ryan,

Wow, your Iceberg already uses the data source V2 API! That is pretty cool! I am 
just afraid these new APIs are not stable; we might deprecate or change some 
data source V2 APIs in the next version (2.4). Sorry for the inconvenience it 
might introduce.

Thanks for your feedback always,

Xiao


2018-01-31 15:54 GMT-08:00 Ryan Blue 
>:
Thanks for suggesting this, I think it's a great idea. I'll definitely attend 
and can talk about the changes that we've made to DataSourceV2 to enable our new 
table format, Iceberg.

On Wed, Jan 31, 2018 at 2:35 PM, Reynold Xin 
> wrote:
The data source v2 API is one of the larger changes in Spark 2.3; whatever has 
already been committed is only the first version, and we'd need more work 
post-2.3 to improve and stabilize it.

I think at this point we should stop making changes to it in branch-2.3, and 
instead focus on using the existing API and getting feedback for 2.4. Would 
people be interested in doing an online hangout to discuss this, perhaps in the 
month of Feb?

It'd be more productive if people attending the hangout have tried the API by 
implementing some new sources or porting an existing source over.





--
Ryan Blue
Software Engineer
Netflix



[jira] [Updated] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-01-31 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23291:
-
Shepherd: Felix Cheung  (was: Hossein Falaki)

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Narendra
>Priority: Major
>
> Defect Description :
> -
> For example, an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
> The target is to create a new column named "col2" with the value "12", which 
> is inside the string. "12" can be extracted with "starting position" "6" and 
> "ending position" "7"
> (the starting position of the first character is considered to be "1").
> But the current code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument in the "substr" API, which indicates the 
> 'starting position', is given as "7".
> Also, observe that the second argument in the "substr" API, which indicates 
> the 'ending position', is given as "8".
> i.e. the number that should be used to indicate the position has to be the 
> "actual position + 1".
> Expected behavior:
> 
> The code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note:
> ---
> This defect is observed only when the starting position is greater than 1.
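As a base R reference point for the indexing the reporter expects (added for clarity; base R substr is 1-based and inclusive):

x <- "2017-12-01"
substr(x, 6, 7)  # returns "12" in base R; per the report, SparkR currently needs 7, 8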






[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-27 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342441#comment-16342441
 ] 

Felix Cheung commented on SPARK-23114:
--

Sure!


> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website






[jira] [Resolved] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23114.
--
Resolution: Fixed

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website






[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-27 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342394#comment-16342394
 ] 

Felix Cheung commented on SPARK-23114:
--

Resolving.

[~sameerag] please see release note above.

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website






[jira] [Resolved] (SPARK-23117) SparkR 2.3 QA: Check for new R APIs requiring example code

2018-01-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23117.
--
Resolution: Won't Fix
  Assignee: Felix Cheung

> SparkR 2.3 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-23117
> URL: https://issues.apache.org/jira/browse/SPARK-23117
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Major
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").






[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341232#comment-16341232
 ] 

Felix Cheung commented on SPARK-23107:
--

Thanks.
My bad, RFormula does have a page:

https://spark.apache.org/docs/2.2.0/ml-features.html#rformula


> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-26 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340876#comment-16340876
 ] 

Felix Cheung commented on SPARK-23107:
--

We have never had any doc for it, and it's not new in 2.3.0, so I figure it's not 
a blocker for the release.


> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Assigned] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-01-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-23200:


Assignee: Santiago Saavedra  (was: Anirudh Ramanathan)

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Assignee: Santiago Saavedra
>Priority: Major
> Fix For: 2.4.0
>
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516






[jira] [Commented] (SPARK-23213) SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1

2018-01-25 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339748#comment-16339748
 ] 

Felix Cheung commented on SPARK-23213:
--

To clarify, we don't support RDDs in R.

Anything you access via SparkR::: is not supported; that includes unionRDD.


> SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1 
> -
>
> Key: SPARK-23213
> URL: https://issues.apache.org/jira/browse/SPARK-23213
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
> Environment: JAVA_HOME=/opt/jdk1.8.0_161/
> spark 2.2.1
> R version 3.4.3 (2017-11-30) – "Kite-Eating Tree"
>Reporter: Tony 
>Priority: Major
>
> Welcome to
>                   __ 
>    / __/__  ___ _/ /__ 
>   _\ \/ _ \/ _ `/ __/  '_/ 
>  /___/ .__/\_,_/_/ /_/\_\   version  2.2.1 
>     /_/ 
>  
>  
>  SparkSession available as 'spark'.
> > 
> sc1 <- sparkR.session(appName = "wordcount")
> lines <- SparkR:::textFile(sc1,"/opt/test333")
> 18/01/25 02:33:37 ERROR RBackendHandler: defaultParallelism on 1 failed
> java.lang.IllegalArgumentException: invalid method defaultParallelism for 
> object 1
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:193)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:748)






[jira] [Commented] (SPARK-23213) SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1

2018-01-25 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339559#comment-16339559
 ] 

Felix Cheung commented on SPARK-23213:
--

If you have any specifics on what you need, we should have an alternative API 
you can use that is not RDD based.

Anything you access with SparkR::: (3 colons) is accessing methods inside the 
namespace that are not exported, so they are not public API.
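As an illustration of the non-RDD route (a sketch only, using the public read.text() API and the path from the report above):

library(SparkR)
sparkR.session(appName = "wordcount")
# read.text() returns a SparkDataFrame with a single string column named "value"
lines <- read.text("/opt/test333")
head(lines)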


> SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1 
> -
>
> Key: SPARK-23213
> URL: https://issues.apache.org/jira/browse/SPARK-23213
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
> Environment: JAVA_HOME=/opt/jdk1.8.0_161/
> spark 2.2.1
> R version 3.4.3 (2017-11-30) – "Kite-Eating Tree"
>Reporter: Tony 
>Priority: Major
>
> Welcome to
>                   __ 
>    / __/__  ___ _/ /__ 
>   _\ \/ _ \/ _ `/ __/  '_/ 
>  /___/ .__/\_,_/_/ /_/\_\   version  2.2.1 
>     /_/ 
>  
>  
>  SparkSession available as 'spark'.
> > 
> sc1 <- sparkR.session(appName = "wordcount")
> lines <- SparkR:::textFile(sc1,"/opt/test333")
> 18/01/25 02:33:37 ERROR RBackendHandler: defaultParallelism on 1 failed
> java.lang.IllegalArgumentException: invalid method defaultParallelism for 
> object 1
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:193)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23213) SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1

2018-01-25 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339522#comment-16339522
 ] 

Felix Cheung commented on SPARK-23213:
--

You can convert a DataFrame into an RDD, but again textFile and the RDD APIs
(all of them) are not supported public API, sorry.

It would help if you could elaborate on what you are trying to do and what you
might need.


> SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1 
> -
>
> Key: SPARK-23213
> URL: https://issues.apache.org/jira/browse/SPARK-23213
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
> Environment: JAVA_HOME=/opt/jdk1.8.0_161/
> spark 2.2.1
> R version 3.4.3 (2017-11-30) – "Kite-Eating Tree"
>Reporter: Tony 
>Priority: Major
>
> Welcome to
>                   __ 
>    / __/__  ___ _/ /__ 
>   _\ \/ _ \/ _ `/ __/  '_/ 
>  /___/ .__/\_,_/_/ /_/\_\   version  2.2.1 
>     /_/ 
>  
>  
>  SparkSession available as 'spark'.
> > 
> sc1 <- sparkR.session(appName = "wordcount")
> lines <- SparkR:::textFile(sc1,"/opt/test333")
> 18/01/25 02:33:37 ERROR RBackendHandler: defaultParallelism on 1 failed
> java.lang.IllegalArgumentException: invalid method defaultParallelism for 
> object 1
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:193)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23213) SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1

2018-01-25 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338987#comment-16338987
 ] 

Felix Cheung commented on SPARK-23213:
--

Try read.text instead?

[http://spark.apache.org/docs/latest/api/R/read.text.html]

SparkR:::textFile is an internal method. Is there a reason you need it?
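
A minimal sketch of the suggested replacement, reusing the /opt/test333 path
from the report (no SparkContext handle needs to be passed; read.text uses the
current session):

{code}
sparkR.session(appName = "wordcount")
# read.text returns a SparkDataFrame with a single string column named "value"
lines <- read.text("/opt/test333")
printSchema(lines)
head(lines)
{code}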

> SparkR:::textFile(sc1,"/opt/test333") can not work on spark2.2.1 
> -
>
> Key: SPARK-23213
> URL: https://issues.apache.org/jira/browse/SPARK-23213
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
> Environment: JAVA_HOME=/opt/jdk1.8.0_161/
> spark 2.2.1
> R version 3.4.3 (2017-11-30) – "Kite-Eating Tree"
>Reporter: Tony 
>Priority: Major
>
> Welcome to
>                   __ 
>    / __/__  ___ _/ /__ 
>   _\ \/ _ \/ _ `/ __/  '_/ 
>  /___/ .__/\_,_/_/ /_/\_\   version  2.2.1 
>     /_/ 
>  
>  
>  SparkSession available as 'spark'.
> > 
> sc1 <- sparkR.session(appName = "wordcount")
> lines <- SparkR:::textFile(sc1,"/opt/test333")
> 18/01/25 02:33:37 ERROR RBackendHandler: defaultParallelism on 1 failed
> java.lang.IllegalArgumentException: invalid method defaultParallelism for 
> object 1
> at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:193)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108)
> at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23117) SparkR 2.3 QA: Check for new R APIs requiring example code

2018-01-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338043#comment-16338043
 ] 

Felix Cheung commented on SPARK-23117:
--

I'm OK to sign off even if we don't have examples for SPARK-20307 or SPARK-21381.

Perhaps this is something we should explain more in the ML guide, since the
changes go into the Python and Scala APIs as well.

> SparkR 2.3 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-23117
> URL: https://issues.apache.org/jira/browse/SPARK-23117
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs

2018-01-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23115.
--
Resolution: Fixed
  Assignee: Felix Cheung

> SparkR 2.3 QA: New R APIs and API docs
> --
>
> Key: SPARK-23115
> URL: https://issues.apache.org/jira/browse/SPARK-23115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs

2018-01-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16337947#comment-16337947
 ] 

Felix Cheung commented on SPARK-23115:
--

done

> SparkR 2.3 QA: New R APIs and API docs
> --
>
> Key: SPARK-23115
> URL: https://issues.apache.org/jira/browse/SPARK-23115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs

2018-01-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333677#comment-16333677
 ] 

Felix Cheung edited comment on SPARK-23115 at 1/24/18 7:17 AM:
---

Another pass, we should add API doc for

-SPARK-20906 (PR pending)-


was (Author: felixcheung):
Another pass, we should add API doc for

SPARK-20906

> SparkR 2.3 QA: New R APIs and API docs
> --
>
> Key: SPARK-23115
> URL: https://issues.apache.org/jira/browse/SPARK-23115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2018-01-23 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-21727.
--
  Resolution: Fixed
   Fix Version/s: 2.4.0
  2.3.0
Target Version/s: 2.3.0

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil Alexander McQuarrie
>Assignee: Neil Alexander McQuarrie
>Priority: Major
> Fix For: 2.3.0, 2.4.0
>
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22522) Convert to apache-release to publish Maven artifacts

2018-01-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336254#comment-16336254
 ] 

Felix Cheung commented on SPARK-22522:
--

It’s going to the right place, but not through the supported plugin.

It’s curling the endpoint directly, which could be fragile.


> Convert to apache-release to publish Maven artifacts 
> -
>
> Key: SPARK-22522
> URL: https://issues.apache.org/jira/browse/SPARK-22522
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> see http://www.apache.org/dev/publishing-maven-artifacts.html
> to publish to Nexus/repository.apache.org which can be promoted to maven 
> central (when release).
> this is the same repo we are publishing to today. this JIRA is only tracking 
> the tooling changes.
> ...at the very least we need to revisit all the calls to curl (and/or gpg) in 
> the release-build.sh for the publish-release path - seems like some errors 
> are ignored (running into that myself) and it would be very easy to miss 
> publishing one or more or all files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22522) Convert to apache-release to publish Maven artifacts

2018-01-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336204#comment-16336204
 ] 

Felix Cheung commented on SPARK-22522:
--

No it’s not done AFAIK


> Convert to apache-release to publish Maven artifacts 
> -
>
> Key: SPARK-22522
> URL: https://issues.apache.org/jira/browse/SPARK-22522
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> see http://www.apache.org/dev/publishing-maven-artifacts.html
> to publish to Nexus/repository.apache.org which can be promoted to maven 
> central (when release).
> this is the same repo we are publishing to today. this JIRA is only tracking 
> the tooling changes.
> ...at the very least we need to revisit all the calls to curl (and/or gpg) in 
> the release-build.sh for the publish-release path - seems like some errors 
> are ignored (running into that myself) and it would be very easy to miss 
> publishing one or more or all files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335872#comment-16335872
 ] 

Felix Cheung commented on SPARK-23114:
--

I’m merely asking whether anyone has a real workload to test this fix with;
issues with job timeouts have been reported against earlier releases, so there
must be some long-running jobs out there.

I don’t have access to real customer datasets myself.

Anyway, as for the other issues you have reported, I think we have had
follow-ups, and it would be great for everyone in the community to chime in.


> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333712#comment-16333712
 ] 

Felix Cheung edited comment on SPARK-23107 at 1/21/18 11:08 PM:


We don't have a doc on RFormula, but it would be a good idea to add one now and
also to allow for documenting changes like SPARK-20619 and SPARK-20899 in a
language-independent way.


was (Author: felixcheung):
We don't have a doc on RFormula, but it would be a good idea to also allow for
documenting changes like SPARK-20619 and SPARK-20899 in a language-independent
way.

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333733#comment-16333733
 ] 

Felix Cheung commented on SPARK-21727:
--

how are we doing?

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil Alexander McQuarrie
>Assignee: Neil Alexander McQuarrie
>Priority: Major
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333730#comment-16333730
 ] 

Felix Cheung edited comment on SPARK-23114 at 1/21/18 11:03 PM:


[~falaki] [~hyukjin.kwon]

About SPARK-21093, do you think you could have real data and real workload to 
test for long haul or heavy load or many short/bursty tasks?

 


was (Author: felixcheung):
[~falaki] [~hyukjin.kwon]

About SPARK-21093, do you think you could have real data and real workload to 
test for long haul or heavy load or many tasks?

 

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333725#comment-16333725
 ] 

Felix Cheung edited comment on SPARK-23114 at 1/21/18 11:02 PM:


[~sameerag]

Here are some ideas for the release notes (which go to spark-website in the
announcements).

For SparkR, new in 2.3.0:

SQL changes:

SQL functions, cubing & nested structure

collect_list, collect_set, split_string, repeat_string, rollup, cube
 explode_outer, posexplode_outer, %<=>%, !, not, create_array, create_map, 
grouping_bit, grouping_id
 input_file_name, alias, trunc, date_trunc, map_keys, map_values, current_date, 
current_timestamp, trim/trimString,
 dayofweek, unionByName,

to_json (map or array of maps)

Data Source -  multiLine (json/csv)

 

ML changes:

Decision Tree (regression and classification)

Constrained Logistic Regression
 offset in SparkR GLM [https://github.com/apache/spark/pull/18831]
 stringIndexerOrderType
 handleInvalid (spark.svmLinear, spark.logit, spark.mlp, spark.naiveBayes, 
spark.gbt, spark.decisionTree, spark.randomForest)

 

SS changes:

Structured Streaming API for withWatermark, trigger (once, processingTime), 
partitionBy

stream-stream join

 

Documentation:

major overhaul and simplification of API doc for SQL functions

 


was (Author: felixcheung):
[~sameerag]

Here are some ideas for the release notes (which go to spark-website in the
announcements).

For SparkR, new in 2.3.0:

SQL changes:

SQL functions, cubing & nested structure

collect_list, collect_set, split_string, repeat_string, rollup, cube
 explode_outer, posexplode_outer, %<=>%, !, not, create_array, create_map, 
grouping_bit, grouping_id
 input_file_name, alias, trunc, date_trunc, map_keys, map_values, current_date, 
current_timestamp, trim/trimString,
 dayofweek, unionByName,

to_json (map or array of maps)

Data Source -  multiLine (json/csv)

 

ML changes:

Decision Tree (regression and classification)

Constrained Logistic Regression
offset in SparkR GLM https://github.com/apache/spark/pull/18831
stringIndexerOrderType
handleInvalid (spark.svmLinear, spark.logit, spark.mlp, spark.naiveBayes, 
spark.gbt, spark.decisionTree, spark.randomForest)

 

SS changes:

Structured Streaming API for withWatermark, trigger (once, processingTime), 
partitionBy

stream-stream join

 

Documentation:

major overhaul and simplification of API doc

 

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333730#comment-16333730
 ] 

Felix Cheung commented on SPARK-23114:
--

[~falaki] [~hyukjin.kwon]

About SPARK-21093, do you think you could have real data and real workload to 
test for long haul or heavy load or many tasks?

 

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333725#comment-16333725
 ] 

Felix Cheung commented on SPARK-23114:
--

[~sameerag]

Here are some ideas for the release notes (which go to spark-website in the
announcements).

For SparkR, new in 2.3.0:

SQL changes:

SQL functions, cubing & nested structure

collect_list, collect_set, split_string, repeat_string, rollup, cube
 explode_outer, posexplode_outer, %<=>%, !, not, create_array, create_map, 
grouping_bit, grouping_id
 input_file_name, alias, trunc, date_trunc, map_keys, map_values, current_date, 
current_timestamp, trim/trimString,
 dayofweek, unionByName,

to_json (map or array of maps)

Data Source -  multiLine (json/csv)

 

ML changes:

Decision Tree (regression and classification)

Constrained Logistic Regression
offset in SparkR GLM https://github.com/apache/spark/pull/18831
stringIndexerOrderType
handleInvalid (spark.svmLinear, spark.logit, spark.mlp, spark.naiveBayes, 
spark.gbt, spark.decisionTree, spark.randomForest)

 

SS changes:

Structured Streaming API for withWatermark, trigger (once, processingTime), 
partitionBy

stream-stream join

 

Documentation:

major overhaul and simplification of API doc

 

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23117) SparkR 2.3 QA: Check for new R APIs requiring example code

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333718#comment-16333718
 ] 

Felix Cheung edited comment on SPARK-23117 at 1/21/18 10:47 PM:


I did a pass. I think these could use an example, preferably a bit more
detailed one:

SPARK-20307

SPARK-21381

 

Others:

Constrained Logistic Regression - SPARK-20906 - should go to ML guide

stringIndexerOrderType - SPARK-20619 SPARK-14659 SPARK-20899 - should have 
RFormula in ML guide


was (Author: felixcheung):
I did a pass. I think these could use an example, preferably a bit more
detailed one:

SPARK-20307

SPARK-21381

 

Others:

Constrained Logistic Regression - SPARK-20906 - should go to ML guide

stringIndexerOrderType - SPARK-20899 - should have RFormula in ML guide

> SparkR 2.3 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-23117
> URL: https://issues.apache.org/jira/browse/SPARK-23117
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23116) SparkR 2.3 QA: Update user guide for new features & APIs

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333717#comment-16333717
 ] 

Felix Cheung commented on SPARK-23116:
--

I did a pass.

> SparkR 2.3 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-23116
> URL: https://issues.apache.org/jira/browse/SPARK-23116
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.3.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23116) SparkR 2.3 QA: Update user guide for new features & APIs

2018-01-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23116.
--
   Resolution: Fixed
 Assignee: Felix Cheung
Fix Version/s: 2.3.0

> SparkR 2.3 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-23116
> URL: https://issues.apache.org/jira/browse/SPARK-23116
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
> Fix For: 2.3.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23117) SparkR 2.3 QA: Check for new R APIs requiring example code

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333718#comment-16333718
 ] 

Felix Cheung commented on SPARK-23117:
--

I did a pass. I think these could use an example, preferably a bit more
detailed one:

SPARK-20307

SPARK-21381

 

Others:

Constrained Logistic Regression - SPARK-20906 - should go to ML guide

stringIndexerOrderType - SPARK-20899 - should have RFormula in ML guide

> SparkR 2.3 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-23117
> URL: https://issues.apache.org/jira/browse/SPARK-23117
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333716#comment-16333716
 ] 

Felix Cheung commented on SPARK-20307:
--

I think [~wm624] could take this, if you have the time.
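
For reference, a hedged sketch of how the new option could be applied to the
snippet quoted below, assuming the handleInvalid argument added for 2.3 accepts
the usual StringIndexer values ("error", "skip", "keep"):

{code}
# "keep" puts unseen labels into a special bucket instead of failing the task
rf <- spark.randomForest(traindf, clicked ~ ., type = "classification",
                         maxDepth = 10, maxBins = 41, numTrees = 100,
                         handleInvalid = "keep")
predictions <- predict(rf, testdf)
head(SparkR::collect(predictions))
{code}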

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> when training a model in SparkR with string variables (tested with 
> spark.randomForest, but i assume is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(A

[jira] [Comment Edited] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333682#comment-16333682
 ] 

Felix Cheung edited comment on SPARK-20307 at 1/21/18 10:40 PM:


Hi Felix,
 I can do that, but I have had a family emergency recently, so it will not
 happen soon.
 Best
 Joseph

 


was (Author: monday0927!):
Hi Felix,
I can do that but I have a family emergency lately. It will not occur soon.
Best
Joseph

On 1/21/18, 2:45 PM, "Felix Cheung (JIRA)" <j...@apache.org> wrote:


[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333663#comment-16333663
 ] 

Felix Cheung commented on SPARK-20307:
--

for SPARK-20307 and SPARK-21381, do you think you can write up example on 
how to use them and also a mention in the R programming guide?

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
StringIndexer
> 

>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> when training a model in SparkR with string variables (tested with 
spark.randomForest, but i assume is valid for all spark.xx functions that apply 
a StringIndexer under the hood), testing on a new dataset with factor levels 
that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid 
on to the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context 
loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", 
"that"), 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 607.0 failed 1 times, most recent failure: Lost task 0.0 in stage 607.0 
(TID 1581, localhost, executor driver): org.apache.spark.SparkException: Failed 
to execute user defined function($anonfun$4: (string) => double)
> at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
> at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.

[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333712#comment-16333712
 ] 

Felix Cheung commented on SPARK-23107:
--

We don't have a doc on RFormula, but it would be a good idea to also allow for
documenting changes like SPARK-20619 and SPARK-20899 in a language-independent
way.

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20906) Constrained Logistic Regression for SparkR

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333654#comment-16333654
 ] 

Felix Cheung edited comment on SPARK-20906 at 1/21/18 10:30 PM:


[~wm624] would you like to add an example of this to the API doc, i.e. the
roxygen2 doc for spark.logit?
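
A hedged sketch of what such an example could show, assuming the bound
parameters mirror the Scala API names (e.g. lowerBoundsOnCoefficients) and that
the bounds matrix is 1 x number-of-features for a binomial model:

{code}
df <- createDataFrame(data.frame(label = c(0, 0, 1, 1),
                                 x1 = c(1.0, 2.0, 3.0, 4.0),
                                 x2 = c(0.5, 1.5, 2.5, 3.5)))
# constrain both coefficients to be non-negative: 1 row (binomial), one column per feature
lb <- matrix(c(0, 0), nrow = 1, ncol = 2)
model <- spark.logit(df, label ~ x1 + x2, lowerBoundsOnCoefficients = lb)
summary(model)
{code}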


was (Author: felixcheung):
[~wm624] would you like to add an example of this to the API doc?

> Constrained Logistic Regression for SparkR
> --
>
> Key: SPARK-20906
> URL: https://issues.apache.org/jira/browse/SPARK-20906
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic 
> Regression for ML. We should add it to SparkR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23108) ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333708#comment-16333708
 ] 

Felix Cheung commented on SPARK-23108:
--

From reviewing the R side, it would be good to document constrained
optimization for logistic regression (SPARK-20906); the R guide just links to
the ML guide, so we should add the doc there.

> ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-23108
> URL: https://issues.apache.org/jira/browse/SPARK-23108
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23118) SparkR 2.3 QA: Programming guide, migration guide, vignettes updates

2018-01-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23118.
--
   Resolution: Fixed
 Assignee: Felix Cheung
Fix Version/s: 2.3.0

> SparkR 2.3 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-23118
> URL: https://issues.apache.org/jira/browse/SPARK-23118
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
> Fix For: 2.3.0
>
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")
>  * Update R vignettes
> Note: This task is for large changes to the guides. New features are handled 
> in SPARK-23116.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23118) SparkR 2.3 QA: Programming guide, migration guide, vignettes updates

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333707#comment-16333707
 ] 

Felix Cheung commented on SPARK-23118:
--

For the programming guide, perhaps SPARK-20906.

But it mostly just links to the API doc and the ML programming guide, so I will 
add a comment on the ML programming guide instead.

> SparkR 2.3 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-23118
> URL: https://issues.apache.org/jira/browse/SPARK-23118
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.3.0
>
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")
>  * Update R vignettes
> Note: This task is for large changes to the guides. New features are handled 
> in SPARK-23116.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333677#comment-16333677
 ] 

Felix Cheung commented on SPARK-23115:
--

Another pass - we should add API doc for SPARK-20906.

> SparkR 2.3 QA: New R APIs and API docs
> --
>
> Key: SPARK-23115
> URL: https://issues.apache.org/jira/browse/SPARK-23115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20906) Constrained Logistic Regression for SparkR

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333654#comment-16333654
 ] 

Felix Cheung edited comment on SPARK-20906 at 1/21/18 8:54 PM:
---

[~wm624] would you like to add an example of this in the API doc?


was (Author: felixcheung):
[~wm624] would you like to add an example of this in the R vignettes?

> Constrained Logistic Regression for SparkR
> --
>
> Key: SPARK-20906
> URL: https://issues.apache.org/jira/browse/SPARK-20906
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic 
> Regression for ML. We should add it to SparkR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0

2018-01-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22208:
-
Labels: releasenotes  (was: )

> Improve percentile_approx by not rounding up targetError and starting from 
> index 0
> --
>
> Key: SPARK-22208
> URL: https://issues.apache.org/jira/browse/SPARK-22208
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> percentile_approx never returns the first element when the percentile is in 
> (relativeError, 1/N], where the relativeError default is 1/10000 and N is the 
> total number of elements. But ideally, percentiles in [0, 1/N] should all 
> return the first element as the answer.
> For example, given input data 1 to 10, if a user queries the 10% (or even 
> smaller) percentile, it should return 1, because the first value 1 already 
> reaches 10%. Currently it returns 2.
> Based on the paper, targetError should not be rounded up, and the search index 
> should start from 0 instead of 1. By following the paper, we should be able to 
> fix the cases mentioned above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333663#comment-16333663
 ] 

Felix Cheung commented on SPARK-20307:
--

For SPARK-20307 and SPARK-21381, do you think you can write up an example of how 
to use them, and also add a mention in the R programming guide?
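
As a starting point, a rough sketch of what the SPARK-20307 example could look 
like, reusing the reporter's snippet from the issue below. This assumes the 
handleInvalid argument that the change adds (with the usual StringIndexer values 
"error", "skip", "keep"); the exact name and values should be verified against 
the merged PR:

{code}
# NOTE: handleInvalid and its accepted values are assumed from SPARK-20307;
# verify against the merged change before documenting.
library(SparkR)
sparkR.session()

data <- data.frame(clicked = base::sample(c(0, 1), 100, replace = TRUE),
                   someString = base::sample(c("this", "that"), 100, replace = TRUE),
                   stringsAsFactors = FALSE)
trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
traindf <- as.DataFrame(data[trainidxs, ])
testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))

# handleInvalid = "keep" puts unseen factor levels such as "the other" into an
# extra bucket instead of failing with an "Unseen label" error at predict time.
rf <- spark.randomForest(traindf, clicked ~ ., type = "classification",
                         maxDepth = 10, maxBins = 41, numTrees = 100,
                         handleInvalid = "keep")
predictions <- predict(rf, testdf)
head(SparkR::collect(predictions))
{code}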

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> When training a model in SparkR with string variables (tested with 
> spark.randomForest, but I assume it is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0, 1), 100, replace = TRUE),
>                   someString = base::sample(c("this", "that"), 100, replace = TRUE),
>                   stringsAsFactors = FALSE)
> trainidxs = base::sample(nrow(data), nrow(data) * 0.7)
> traindf = as.DataFrame(data[trainidxs, ])
> testdf = as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other")))
> rf = spark.randomForest(traindf, clicked ~ ., type = "classification",
>                         maxDepth = 10, maxBins = 41, numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable

[jira] [Commented] (SPARK-22208) Improve percentile_approx by not rounding up targetError and starting from index 0

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333659#comment-16333659
 ] 

Felix Cheung commented on SPARK-22208:
--

Is this documented in the SQL programming guide / migration guide?

[~ZenWzh] [~smilegator]
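
For reference, a minimal SparkR/SQL sketch of the behavior described in this 
issue; the before/after values (2 vs. 1) are taken from the issue description 
quoted below:

{code}
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(x = 1:10))
createOrReplaceTempView(df, "t")

# The 10th percentile of 1..10 is already covered by the first value.
# Per the description below, this returned 2 before the change and 1 after it.
head(sql("SELECT percentile_approx(x, 0.1) AS p FROM t"))
{code}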

 

> Improve percentile_approx by not rounding up targetError and starting from 
> index 0
> --
>
> Key: SPARK-22208
> URL: https://issues.apache.org/jira/browse/SPARK-22208
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> percentile_approx never returns the first element when the percentile is in 
> (relativeError, 1/N], where the relativeError default is 1/10000 and N is the 
> total number of elements. But ideally, percentiles in [0, 1/N] should all 
> return the first element as the answer.
> For example, given input data 1 to 10, if a user queries the 10% (or even 
> smaller) percentile, it should return 1, because the first value 1 already 
> reaches 10%. Currently it returns 2.
> Based on the paper, targetError should not be rounded up, and the search index 
> should start from 0 instead of 1. By following the paper, we should be able to 
> fix the cases mentioned above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20906) Constrained Logistic Regression for SparkR

2018-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333654#comment-16333654
 ] 

Felix Cheung commented on SPARK-20906:
--

[~wm624] would you like to add an example of this in the R vignettes?

> Constrained Logistic Regression for SparkR
> --
>
> Key: SPARK-20906
> URL: https://issues.apache.org/jira/browse/SPARK-20906
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> PR https://github.com/apache/spark/pull/17715 Added Constrained Logistic 
> Regression for ML. We should add it to SparkR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21293) R document update structured streaming

2018-01-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-21293.
--
  Resolution: Fixed
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> R document update structured streaming
> --
>
> Key: SPARK-21293
> URL: https://issues.apache.org/jira/browse/SPARK-21293
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>    Assignee: Felix Cheung
>Priority: Major
> Fix For: 2.3.0
>
>
> add examples for:
> * Window Operations on Event Time
> * Join Operations
> * Streaming Deduplication



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Issue with multiple users running Spark

2018-01-17 Thread Felix Cheung
Should we have some doc on this? I think this could be a common problem.


From: Austin Heyne 
Sent: Monday, January 15, 2018 6:59:55 AM
To: users@zeppelin.apache.org
Subject: Re: Issue with multiple users running Spark


Thanks Jeff and Michael for the help. We're seeing good success just disabling 
'zeppelin.spark.useHiveContext'.

-Austin

On 01/12/2018 07:56 PM, Jeff Zhang wrote:

There are 2 options for you:

1. Disable the HiveContext in Spark by setting zeppelin.spark.useHiveContext to 
false in Spark's interpreter setting.
2. Connect to a Hive metastore service instead of the single Derby instance. You 
can configure that in your hive-site.xml.



Michael Segel wrote on Saturday, January 13, 2018 at 2:40 AM:
Hi,

Quick response… unless you tell Derby to set up as a networked service (this is 
going back to SilverCloud days), it's a single-user instance. So it won't work.
Were you using MySQL or something… you would have better luck…


I think if you go back into Derby's docs and see how to start this as a 
networked server (multi-user), you could try it.
Most people don't do this because not many people know Derby, and I don't know 
how well that portion of the code has been maintained over the years.


HTH

-Mike

> On Jan 12, 2018, at 12:35 PM, Austin Heyne wrote:
>
> Hi everyone,
>
> I'm currently running Zeppelin on a spark master node using the AWS provided 
> Zeppelin install. I'm trying to get the notebook setup so multiple devs can 
> use it (and the spark cluster) concurrently. I have the spark interpreter set 
> to instantiate 'Per Note' in 'isolated' processes. I also have 
> 'spark.dynamicAllocation.enabled' set to 'true' so the multiple spark 
> contexts can share the cluster.
>
> The problem I'm seeing is that when the second spark context tries to instantiate 
> hive, it starts throwing errors because the Derby database has already been 
> booted (by the other context). Full stack trace is available here [1]. How do 
> I go about working around this? Is there a way to have it use another 
> database, or is this a limitation?
>
> Thanks for any help!
>
> [1] https://gist.github.com/aheyne/8d84eaedefb997f248b6e88c1b9e1e34
>
> --
> Austin L. Heyne
>



--
Austin L. Heyne


[jira] [Commented] (SPARK-23118) SparkR 2.3 QA: Programming guide, migration guide, vignettes updates

2018-01-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328430#comment-16328430
 ] 

Felix Cheung commented on SPARK-23118:
--

did this and opened SPARK-21616

> SparkR 2.3 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-23118
> URL: https://issues.apache.org/jira/browse/SPARK-23118
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")
>  * Update R vignettes
> Note: This task is for large changes to the guides. New features are handled 
> in SPARK-23116.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs

2018-01-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328422#comment-16328422
 ] 

Felix Cheung commented on SPARK-23115:
--

did this, and opened this

https://issues.apache.org/jira/browse/SPARK-23069

> SparkR 2.3 QA: New R APIs and API docs
> --
>
> Key: SPARK-23115
> URL: https://issues.apache.org/jira/browse/SPARK-23115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs

2018-01-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328422#comment-16328422
 ] 

Felix Cheung edited comment on SPARK-23115 at 1/17/18 8:01 AM:
---

did this, and opened SPARK-23069


was (Author: felixcheung):
did this, and opened this

https://issues.apache.org/jira/browse/SPARK-23069

> SparkR 2.3 QA: New R APIs and API docs
> --
>
> Key: SPARK-23115
> URL: https://issues.apache.org/jira/browse/SPARK-23115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23114) Spark R 2.3 QA umbrella

2018-01-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328419#comment-16328419
 ] 

Felix Cheung commented on SPARK-23114:
--

sure, [~josephkb]

> Spark R 2.3 QA umbrella
> ---
>
> Key: SPARK-23114
> URL: https://issues.apache.org/jira/browse/SPARK-23114
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23068) Jekyll doc build error does not fail build

2018-01-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23068:
-
Summary: Jekyll doc build error does not fail build  (was: R doc build 
error does not fail build)

> Jekyll doc build error does not fail build
> --
>
> Key: SPARK-23068
> URL: https://issues.apache.org/jira/browse/SPARK-23068
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>
> +++ /usr/local/bin/Rscript -e ' if("devtools" %in% 
> rownames(installed.packages())) { library(devtools); 
> devtools::document(pkg="./pkg", roclets=c("rd")) }'
> Error: 'roxygen2' >= 5.0.0 must be installed for this functionality.
> Execution halted
> jekyll 3.7.0 | Error:  R doc generation failed
> See SPARK-23065



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23068) Jekyll doc build error does not fail build

2018-01-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23068:
-
Component/s: Documentation

> Jekyll doc build error does not fail build
> --
>
> Key: SPARK-23068
> URL: https://issues.apache.org/jira/browse/SPARK-23068
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>
> +++ /usr/local/bin/Rscript -e ' if("devtools" %in% 
> rownames(installed.packages())) { library(devtools); 
> devtools::document(pkg="./pkg", roclets=c("rd")) }'
> Error: 'roxygen2' >= 5.0.0 must be installed for this functionality.
> Execution halted
> jekyll 3.7.0 | Error:  R doc generation failed
> See SPARK-23065



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23073) Fix incorrect R doc page header for generated sql functions

2018-01-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23073:
-
Description: 
See title says 
{code}
asc {SparkR}
{code}

https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/columnfunctions.html
http://spark.apache.org/docs/latest/api/R/columnfunctions.html

asc, contains etc are functions generated at runtime. Because of that, their 
doc entries are dependent on the Generics.R file. Unfortunately, ROxygen2 picks 
the doc page title from the first function name by default, in the presence of 
any function it can parse.

An attempt to fix here 
https://github.com/apache/spark/pull/20263/commits/d433dc930021de85aa338c5017a223bae3526df3#diff-8e3d61ff66c9ffcd6ffb7a8eedc08409R824

{code}
#' @rdname columnfunctions
  #' @export
 +#' @name NULL
  setGeneric("asc", function(x) { standardGeneric("asc") })
{code}

But it causes a more severe issue that fails CRAN checks

{code}
* checking for missing documentation entries ... WARNING
Undocumented code objects:
  'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNotNull'
  'isNull' 'like' 'rlike'
All user-level objects in a package should have documentation entries.
See the chapter 'Writing R documentation files' in the 'Writing R
Extensions' manual.
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... WARNING
Objects in \usage without \alias in documentation object 'columnfunctions':
  'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNull'
  'isNotNull' 'like' 'rlike'
{code}

To follow up we should
- look for a way to set the doc page title
- http://spark.apache.org/docs/latest/api/R/columnfunctions.html is really 
barebone and we should explicitly add a doc page with content (which could also 
address the first point)


  was:
See title says 
{code}
asc {SparkR}
{code}

https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/columnfunctions.html
http://spark.apache.org/docs/latest/api/R/columnfunctions.html

asc, contains etc are functions generated at runtime. Because of that, their 
doc entries are dependent on the Generics.R file. Unfortunately, ROxygen2 picks 
the doc page title from the first function name by default, in the presence of 
any function it can parse.

An attempt to fix here 
https://github.com/apache/spark/pull/20263/commits/d433dc930021de85aa338c5017a223bae3526df3#diff-8e3d61ff66c9ffcd6ffb7a8eedc08409R824

{code}
#' @rdname columnfunctions
  #' @export
 +#' @name NULL
  setGeneric("asc", function(x) { standardGeneric("asc") })
{code}

But it causes a more severe issue that fails CRAN checks

{code}
* checking for missing documentation entries ... WARNING
Undocumented code objects:
  'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNotNull'
  'isNull' 'like' 'rlike'
All user-level objects in a package should have documentation entries.
See the chapter 'Writing R documentation files' in the 'Writing R
Extensions' manual.
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... WARNING
Objects in \usage without \alias in documentation object 'columnfunctions':
  'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNull'
  'isNotNull' 'like' 'rlike'
{code}

To follow up we should
- look for a way to set the doc page title
- http://spark.apache.org/docs/latest/api/R/columnfunctions.html is really 
barebone and we should explicitly add a doc page content (which could also 
address the first point)
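
For the first follow-up point (setting the doc page title), one possible shape - 
shown only as an untested sketch - is to document a dedicated topic on NULL so 
roxygen2 takes the page title from it, while the generated generics keep adding 
their aliases to the same page via @rdname:

{code}
# Untested sketch - not the merged fix.
#' Column helper functions
#'
#' A set of operations, such as asc, desc, contains, like and rlike, that are
#' defined on Column objects and generated at runtime.
#'
#' @name columnfunctions
#' @rdname columnfunctions
NULL
{code}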



> Fix incorrect R doc page header for generated sql functions
> ---
>
> Key: SPARK-23073
> URL: https://issues.apache.org/jira/browse/SPARK-23073
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>    Affects Versions: 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
> Attachments: Screen Shot 2018-01-14 at 11.11.05 AM.png
>
>
> See title says 
> {code}
> asc {SparkR}
> {code}
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/columnfunctions.html
> http://spark.apache.org/docs/latest/api/R/columnfunctions.html
> asc, contains etc are functions generated at runtime. Because of that, their 
> doc entries are dependent on the Generics.R file. Unfortunately, ROxygen2 
> picks the doc page title from the first function name by default, in the 
> presence of any function it can parse.
> An attempt to fix here 
> https://github.com/apache/spark/pull/20263/commits/d433dc930021de85aa338c5017a223bae3526df3#diff-8e3d61ff66c9ffcd6ffb7a8eedc08409R824
> {code}
> #' @rdname columnfunctions
>   #' @export
>  +#' @name NULL
>   setGeneric("asc", function(x) { standardGeneric("asc&quo

[jira] [Updated] (SPARK-23073) Fix incorrect R doc page header for generated sql functions

2018-01-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23073:
-
Component/s: (was: SpakrR)
 SparkR

> Fix incorrect R doc page header for generated sql functions
> ---
>
> Key: SPARK-23073
> URL: https://issues.apache.org/jira/browse/SPARK-23073
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
> Attachments: Screen Shot 2018-01-14 at 11.11.05 AM.png
>
>
> See title says 
> {code}
> asc {SparkR}
> {code}
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/columnfunctions.html
> http://spark.apache.org/docs/latest/api/R/columnfunctions.html
> asc, contains etc are functions generated at runtime. Because of that, their 
> doc entries are dependent on the Generics.R file. Unfortunately, ROxygen2 
> picks the doc page title from the first function name by default, in the 
> presence of any function it can parse.
> An attempt to fix here 
> https://github.com/apache/spark/pull/20263/commits/d433dc930021de85aa338c5017a223bae3526df3#diff-8e3d61ff66c9ffcd6ffb7a8eedc08409R824
> {code}
> #' @rdname columnfunctions
>   #' @export
>  +#' @name NULL
>   setGeneric("asc", function(x) { standardGeneric("asc") })
> {code}
> But it causes a more severe issue that fails CRAN checks
> {code}
> * checking for missing documentation entries ... WARNING
> Undocumented code objects:
>   'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNotNull'
>   'isNull' 'like' 'rlike'
> All user-level objects in a package should have documentation entries.
> See the chapter 'Writing R documentation files' in the 'Writing R
> Extensions' manual.
> * checking for code/documentation mismatches ... OK
> * checking Rd \usage sections ... WARNING
> Objects in \usage without \alias in documentation object 'columnfunctions':
>   'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNull'
>   'isNotNull' 'like' 'rlike'
> {code}
> To follow up we should
> - look for a way to set the doc page title
> - http://spark.apache.org/docs/latest/api/R/columnfunctions.html is really 
> barebone and we should explicitly add a doc page content (which could also 
> address the first point)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23073) Fix incorrect R doc page header for generated sql functions

2018-01-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23073:
-
Description: 
See title says 
{code}
asc {SparkR}
{code}

https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/columnfunctions.html
http://spark.apache.org/docs/latest/api/R/columnfunctions.html

asc, contains etc are functions generated at runtime. Because of that, their 
doc entries are dependent on the Generics.R file. Unfortunately, ROxygen2 picks 
the doc page title from the first function name by default, in the presence of 
any function it can parse.

An attempt to fix here 
https://github.com/apache/spark/pull/20263/commits/d433dc930021de85aa338c5017a223bae3526df3#diff-8e3d61ff66c9ffcd6ffb7a8eedc08409R824

{code}
#' @rdname columnfunctions
  #' @export
 +#' @name NULL
  setGeneric("asc", function(x) { standardGeneric("asc") })
{code}

But it causes a more severe issue that fails CRAN checks

{code}
* checking for missing documentation entries ... WARNING
Undocumented code objects:
  'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNotNull'
  'isNull' 'like' 'rlike'
All user-level objects in a package should have documentation entries.
See the chapter 'Writing R documentation files' in the 'Writing R
Extensions' manual.
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... WARNING
Objects in \usage without \alias in documentation object 'columnfunctions':
  'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNull'
  'isNotNull' 'like' 'rlike'
{code}

To follow up we should
- look for a way to set the doc page title
- http://spark.apache.org/docs/latest/api/R/columnfunctions.html is really 
barebone and we should explicitly add a doc page content (which could also 
address the first point)


  was:
See title says {{asc {SparkR}}}
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/columnfunctions.html
http://spark.apache.org/docs/latest/api/R/columnfunctions.html

asc, contains etc are functions generated at runtime. Because of that, their 
doc entries are dependent on the Generics.R file. Unfortunately, ROxygen2 picks 
the doc page title from the first function name by default, in the presence of 
any function it can parse.

An attempt to fix here 
https://github.com/apache/spark/pull/20263/commits/d433dc930021de85aa338c5017a223bae3526df3#diff-8e3d61ff66c9ffcd6ffb7a8eedc08409R824

{code:java}
#' @rdname columnfunctions
  #' @export
 +#' @name NULL
  setGeneric("asc", function(x) { standardGeneric("asc") })
{code}

But it causes a more severe issue that fails CRAN checks

{code}
* checking for missing documentation entries ... WARNING
Undocumented code objects:
  'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNotNull'
  'isNull' 'like' 'rlike'
All user-level objects in a package should have documentation entries.
See the chapter 'Writing R documentation files' in the 'Writing R
Extensions' manual.
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... WARNING
Objects in \usage without \alias in documentation object 'columnfunctions':
  'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNull'
  'isNotNull' 'like' 'rlike'
{code}

To follow up we should
- look for a way to set the doc page title
- http://spark.apache.org/docs/latest/api/R/columnfunctions.html is really 
barebone and we should explicitly add a doc page content (which could also 
address the first point)



> Fix incorrect R doc page header for generated sql functions
> ---
>
> Key: SPARK-23073
> URL: https://issues.apache.org/jira/browse/SPARK-23073
> Project: Spark
>  Issue Type: Documentation
>  Components: SpakrR
>    Affects Versions: 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
> Attachments: Screen Shot 2018-01-14 at 11.11.05 AM.png
>
>
> See title says 
> {code}
> asc {SparkR}
> {code}
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/columnfunctions.html
> http://spark.apache.org/docs/latest/api/R/columnfunctions.html
> asc, contains etc are functions generated at runtime. Because of that, their 
> doc entries are dependent on the Generics.R file. Unfortunately, ROxygen2 
> picks the doc page title from the first function name by default, in the 
> presence of any function it can parse.
> An attempt to fix here 
> https://github.com/apache/spark/pull/20263/commits/d433dc930021de85aa338c5017a223bae3526df3#diff-8e3d61ff66c9ffcd6ffb7a8eedc08409R824
> {code}
> #' @rdname columnfunctions
>   #' @export
>  +#' @name NULL
>   setGeneric("asc", function(x) { standardGeneric("asc") })
> {

[jira] [Updated] (SPARK-23073) Fix incorrect R doc page header for generated sql functions

2018-01-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23073:
-
Description: 
See title says {{asc {SparkR}}}
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/columnfunctions.html
http://spark.apache.org/docs/latest/api/R/columnfunctions.html

asc, contains etc are functions generated at runtime. Because of that, their 
doc entries are dependent on the Generics.R file. Unfortunately, ROxygen2 picks 
the doc page title from the first function name by default, in the presence of 
any function it can parse.

An attempt to fix here 
https://github.com/apache/spark/pull/20263/commits/d433dc930021de85aa338c5017a223bae3526df3#diff-8e3d61ff66c9ffcd6ffb7a8eedc08409R824

{code:java}
#' @rdname columnfunctions
  #' @export
 +#' @name NULL
  setGeneric("asc", function(x) { standardGeneric("asc") })
{code}

But it causes a more severe issue that fails CRAN checks

{code}
* checking for missing documentation entries ... WARNING
Undocumented code objects:
  'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNotNull'
  'isNull' 'like' 'rlike'
All user-level objects in a package should have documentation entries.
See the chapter 'Writing R documentation files' in the 'Writing R
Extensions' manual.
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... WARNING
Objects in \usage without \alias in documentation object 'columnfunctions':
  'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNull'
  'isNotNull' 'like' 'rlike'
{code}

To follow up we should
- look for a way to set the doc page title
- http://spark.apache.org/docs/latest/api/R/columnfunctions.html is really 
barebone and we should explicitly add a doc page content (which could also 
address the first point)


> Fix incorrect R doc page header for generated sql functions
> ---
>
> Key: SPARK-23073
> URL: https://issues.apache.org/jira/browse/SPARK-23073
> Project: Spark
>  Issue Type: Documentation
>  Components: SpakrR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
> Attachments: Screen Shot 2018-01-14 at 11.11.05 AM.png
>
>
> See title says {{asc {SparkR}}}
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/columnfunctions.html
> http://spark.apache.org/docs/latest/api/R/columnfunctions.html
> asc, contains etc are functions generated at runtime. Because of that, their 
> doc entries are dependent on the Generics.R file. Unfortunately, ROxygen2 
> picks the doc page title from the first function name by default, in the 
> presence of any function it can parse.
> An attempt to fix here 
> https://github.com/apache/spark/pull/20263/commits/d433dc930021de85aa338c5017a223bae3526df3#diff-8e3d61ff66c9ffcd6ffb7a8eedc08409R824
> {code:java}
> #' @rdname columnfunctions
>   #' @export
>  +#' @name NULL
>   setGeneric("asc", function(x) { standardGeneric("asc") })
> {code}
> But it causes a more severe issue that fails CRAN checks
> {code}
> * checking for missing documentation entries ... WARNING
> Undocumented code objects:
>   'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNotNull'
>   'isNull' 'like' 'rlike'
> All user-level objects in a package should have documentation entries.
> See the chapter 'Writing R documentation files' in the 'Writing R
> Extensions' manual.
> * checking for code/documentation mismatches ... OK
> * checking Rd \usage sections ... WARNING
> Objects in \usage without \alias in documentation object 'columnfunctions':
>   'asc' 'contains' 'desc' 'getField' 'getItem' 'isNaN' 'isNull'
>   'isNotNull' 'like' 'rlike'
> {code}
> To follow up we should
> - look for a way to set the doc page title
> - http://spark.apache.org/docs/latest/api/R/columnfunctions.html is really 
> barebone and we should explicitly add a doc page content (which could also 
> address the first point)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23073) Fix incorrect R doc page header for generated sql functions

2018-01-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23073:
-
Attachment: Screen Shot 2018-01-14 at 11.11.05 AM.png

> Fix incorrect R doc page header for generated sql functions
> ---
>
> Key: SPARK-23073
> URL: https://issues.apache.org/jira/browse/SPARK-23073
> Project: Spark
>  Issue Type: Documentation
>  Components: SpakrR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
> Attachments: Screen Shot 2018-01-14 at 11.11.05 AM.png
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23073) Fix incorrect R doc page header for generated sql functions

2018-01-14 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-23073:


 Summary: Fix incorrect R doc page header for generated sql 
functions
 Key: SPARK-23073
 URL: https://issues.apache.org/jira/browse/SPARK-23073
 Project: Spark
  Issue Type: Documentation
  Components: SpakrR
Affects Versions: 2.2.1, 2.3.0
Reporter: Felix Cheung
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23038) Update docker/spark-test (JDK/OS)

2018-01-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23038:
-
Fix Version/s: 2.4.0

> Update docker/spark-test (JDK/OS)
> -
>
> Key: SPARK-23038
> URL: https://issues.apache.org/jira/browse/SPARK-23038
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.2.2, 2.3.0, 2.4.0
>
>
> This issue aims to update the following in `docker/spark-test`.
> - JDK7 -> JDK8: Spark 2.2+ supports JDK8 only.
> - Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial): The end of life 
> of `precise` was April 28, 2017.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23038) Update docker/spark-test (JDK/OS)

2018-01-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23038.
--
  Resolution: Fixed
Assignee: Dongjoon Hyun
   Fix Version/s: 2.3.0
  2.2.2
Target Version/s: 2.2.2, 2.3.0

> Update docker/spark-test (JDK/OS)
> -
>
> Key: SPARK-23038
> URL: https://issues.apache.org/jira/browse/SPARK-23038
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.2.2, 2.3.0
>
>
> This issue aims to update the following in `docker/spark-test`.
> - JDK7 -> JDK8: Spark 2.2+ supports JDK8 only.
> - Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial): The end of life 
> of `precise` was April 28, 2017.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23069) R doc for describe missing text

2018-01-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23069:
-
Issue Type: Documentation  (was: Bug)

> R doc for describe missing text
> ---
>
> Key: SPARK-23069
> URL: https://issues.apache.org/jira/browse/SPARK-23069
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23063) Changes to publish the spark-kubernetes package

2018-01-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23063.
--
   Resolution: Fixed
 Assignee: Anirudh Ramanathan
Fix Version/s: 2.3.0

> Changes to publish the spark-kubernetes package
> ---
>
> Key: SPARK-23063
> URL: https://issues.apache.org/jira/browse/SPARK-23063
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Anirudh Ramanathan
>Priority: Blocker
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23069) R doc for describe missing text

2018-01-13 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-23069:


 Summary: R doc for describe missing text
 Key: SPARK-23069
 URL: https://issues.apache.org/jira/browse/SPARK-23069
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23065) R API doc empty in Spark 2.3.0 RC1

2018-01-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325465#comment-16325465
 ] 

Felix Cheung commented on SPARK-23065:
--

I have checked the doc and it looks good, but I have a few small fixes. Will open 
another JIRA.

> R API doc empty in Spark 2.3.0 RC1
> --
>
> Key: SPARK-23065
> URL: https://issues.apache.org/jira/browse/SPARK-23065
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Sameer Agarwal
>Priority: Blocker
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2018-01-13 at 3.15.48 PM.png, Screen Shot 
> 2018-01-13 at 3.16.06 PM.png
>
>
> [~sameerag]
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/index.html
> Did it fail to build?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23065) R API doc empty in Spark 2.3.0 RC1

2018-01-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23065.
--
  Resolution: Fixed
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> R API doc empty in Spark 2.3.0 RC1
> --
>
> Key: SPARK-23065
> URL: https://issues.apache.org/jira/browse/SPARK-23065
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Sameer Agarwal
>Priority: Blocker
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2018-01-13 at 3.15.48 PM.png, Screen Shot 
> 2018-01-13 at 3.16.06 PM.png
>
>
> [~sameerag]
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/index.html
> Did it fail to build?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23068) R doc build error does not fail build

2018-01-13 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-23068:


 Summary: R doc build error does not fail build
 Key: SPARK-23068
 URL: https://issues.apache.org/jira/browse/SPARK-23068
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Felix Cheung


+++ /usr/local/bin/Rscript -e ' if("devtools" %in% 
rownames(installed.packages())) { library(devtools); 
devtools::document(pkg="./pkg", roclets=c("rd")) }'
Error: 'roxygen2' >= 5.0.0 must be installed for this functionality.
Execution halted
jekyll 3.7.0 | Error:  R doc generation failed

See SPARK-23065
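
As context only (the point of this JIRA is that the build should fail on this 
error, not the missing package itself), the underlying error would typically be 
resolved on the build machine with something like the following sketch:

{code}
# Sketch: install the packages the R doc build expects; the exact set the
# Jenkins/release box needs may differ.
install.packages(c("devtools", "roxygen2"))
{code}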



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23065) R API doc empty in Spark 2.3.0 RC1

2018-01-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325400#comment-16325400
 ] 

Felix Cheung commented on SPARK-23065:
--

Something was definitely cached - not sure what, since I had forced a refresh.

Anyway, the links are back now (when I open it in another browser); let me review 
it more closely. Thanks




> R API doc empty in Spark 2.3.0 RC1
> --
>
> Key: SPARK-23065
> URL: https://issues.apache.org/jira/browse/SPARK-23065
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Sameer Agarwal
>Priority: Blocker
> Attachments: Screen Shot 2018-01-13 at 3.15.48 PM.png, Screen Shot 
> 2018-01-13 at 3.16.06 PM.png
>
>
> [~sameerag]
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/index.html
> Did it fail to build?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23065) R API doc empty in Spark 2.3.0 RC1

2018-01-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325229#comment-16325229
 ] 

Felix Cheung commented on SPARK-23065:
--

Did the Jekyll error fail the doc build? Just want to make sure an error like 
this is discoverable.

I'm still seeing the empty header page - can you check?
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/index.html

Vs
https://spark.apache.org/docs/latest/api/R/index.html




> R API doc empty in Spark 2.3.0 RC1
> --
>
> Key: SPARK-23065
> URL: https://issues.apache.org/jira/browse/SPARK-23065
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Sameer Agarwal
>Priority: Blocker
>
> [~sameerag]
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/index.html
> Did it fail to build?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23065) R API doc empty in Spark 2.3.0 RC1

2018-01-12 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-23065:
-
Priority: Blocker  (was: Major)

> R API doc empty in Spark 2.3.0 RC1
> --
>
> Key: SPARK-23065
> URL: https://issues.apache.org/jira/browse/SPARK-23065
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Blocker
>
> [~sameerag]
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/index.html
> Did it fail to build?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23065) R API doc empty in Spark 2.3.0 RC1

2018-01-12 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-23065:


 Summary: R API doc empty in Spark 2.3.0 RC1
 Key: SPARK-23065
 URL: https://issues.apache.org/jira/browse/SPARK-23065
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Felix Cheung


[~sameerag]

https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/api/R/index.html

Did it fail to build?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Use Bokeh in Apache Zeppelin

2018-01-10 Thread Felix Cheung
Nice!
Can we get this into 
https://github.com/apache/zeppelin/blob/master/docs/interpreter/python.md?


From: Partridge, Lucas (GE Aviation) 
Sent: Wednesday, January 10, 2018 2:27:19 AM
To: Jeff Zhang
Cc: users@zeppelin.apache.org
Subject: Use Bokeh in Apache Zeppelin

Thanks Jeff! I can confirm the following resulted in an inline graph when using 
a notebook bound to the Spark interpreter group in plain Zeppelin 0.7.0:

%pyspark
from bokeh.plotting import figure
from bokeh.io import show,output_notebook
import bkzep
output_notebook(notebook_type='zeppelin')

f = figure()
f.line(x=[1,2],y=[3,4])
show(f)

From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: 09 January 2018 23:24
To: Partridge, Lucas (GE Aviation) 
Cc: users@zeppelin.apache.org
Subject: EXT: Re: Use Bokeh in Apache Zeppelin


Awesome. Glad to see you can use bokeh in zeppelin. For bokeh versions after 
0.12.7, you need bkzep. You can check the README here: https://github.com/zjffdu/bkzep

Actually, you just need to import bkzep. You don't need to call 
install_notebook_hook explicitly.



Partridge, Lucas (GE Aviation) wrote on Wednesday, January 10, 2018 at 12:35 AM:
Hi Jeff,

I eventually managed to get Bokeh running in Zeppelin 0.7.0 after finding your 
code at https://pypkg.com/pypi/bkzep/f/bkzep/__init__.py . So I did ‘pip 
install bkzep’ and restarted Zeppelin. Then if I pasted this code of yours…

from bokeh.io import install_notebook_hook
from bkzep.io import load_notebook, _show_zeppelin_app_with_state, _show_zeppelin_doc_with_state

install_notebook_hook('zeppelin', load_notebook,
                      _show_zeppelin_doc_with_state,
                      _show_zeppelin_app_with_state, overwrite=True)

…into a notebook paragraph before using Bokeh then I could see my plots 
directly within Zeppelin:).
Thanks, Lucas.

From: Partridge, Lucas (GE Aviation)
Sent: 09 January 2018 15:01

To: users@zeppelin.apache.org
Cc: zjf...@gmail.com
Subject: EXT: RE: Use Bokeh in Apache Zeppelin

I forgot to say I’m using Bokeh 0.12.13.

From: Partridge, Lucas (GE Aviation)
Sent: 09 January 2018 13:24
To: users@zeppelin.apache.org
Cc: zjf...@gmail.com
Subject: EXT: RE: Use Bokeh in Apache Zeppelin

Hi Jeff,

Adding support for Bokeh in Zeppelin is great! At 
https://www.zepl.com/viewer/notebooks/bm90ZTovL3pqZmZkdS90ZXN0L2VhMGI0ODQ0MzNhYjQxNjZhODg5MjI1ZjAxZWVjMjdiL25vdGUuanNvbg
 it says:

“If you want to use bokeh in spark interpreter. You need HDP 2.6.0 (Zeppelin 
0.7.0) or afterwards”

I’m not using HDP but I am using Zeppelin 0.7.0 (zeppelin-0.7.0-bin-all.tgz) in 
ubuntu 16.04. And when I do this in a notebook bound to the Spark interpreter 
group:

%pyspark
from bokeh.io import output_notebook
output_notebook(notebook_type='zeppelin')

I get this error:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8411751233295366188.py", line 346, in 
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8411751233295366188.py", line 339, in 
exec(code)
  File "", line 2, in 
  File "/home/lucas/.local/lib/python2.7/site-packages/bokeh/util/api.py", line 
190, in wrapper
return obj(*args, **kw)
  File "/home/lucas/.local/lib/python2.7/site-packages/bokeh/io/output.py", 
line 114, in output_notebook
run_notebook_hook(notebook_type, 'load', resources, verbose, hide_banner, 
load_timeout)
  File "/home/lucas/.local/lib/python2.7/site-packages/bokeh/util/api.py", line 
190, in wrapper
return obj(*args, **kw)
  File "/home/lucas/.local/lib/python2.7/site-packages/bokeh/io/notebook.py", 
line 286, in run_notebook_hook
raise RuntimeError("no display hook installed for notebook type %r" % 
notebook_type)
RuntimeError: no display hook installed for notebook type 'zeppelin'

Can you confirm Bokeh does work with the %pyspark interpreter in Zeppelin 
0.7.0? Or should I move to a later version of Zeppelin? I’d rather stick with 
0.7.0 for now if possible.

Thanks, Lucas.

From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: 02 July 2017 00:01
To: users
Subject: EXT: Use Bokeh in Apache Zeppelin


I wrote a tutorial on using Bokeh in Apache Zeppelin. If you are interested in
data visualization in Zeppelin notebooks, Bokeh is a very good library for you,
and you can take a look at the tutorial here:

https://community.hortonworks.com/articles/109837/use-bokeh-in-apache-zeppelin.html




Re: py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet

2018-01-10 Thread Felix Cheung
java.nio.BufferUnderflowException

Can you try reading the same data in Scala?



From: Liana Napalkova 
Sent: Wednesday, January 10, 2018 12:04:00 PM
To: Timur Shenkao
Cc: user@spark.apache.org
Subject: Re: py4j.protocol.Py4JJavaError: An error occurred while calling 
o794.parquet

The DataFrame is not empty.
Indeed, it has nothing to do with serialization. I think that the issue is 
related to this bug: https://issues.apache.org/jira/browse/SPARK-22769
In my question I have not posted the whole error stack trace, but one of the 
error messages says `Could not find CoarseGrainedScheduler`. So, it's probably 
something related to the resources.


From: Timur Shenkao 
Sent: 10 January 2018 20:07:37
To: Liana Napalkova
Cc: user@spark.apache.org
Subject: Re: py4j.protocol.Py4JJavaError: An error occurred while calling 
o794.parquet


Caused by: org.apache.spark.SparkException: Task not serializable


That's the answer :)

What are you trying to save? Is it empty or None / null?
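
For example, a quick sanity check along these lines (just a sketch, reusing the df
and outputPath names from the original message below; the column names and data are
unknown here, and '/probe.parquet' is only an illustrative path):

# Confirm the DataFrame is non-empty and its schema looks sane, then try writing a
# small slice to see whether the failure depends on the data itself.
df.printSchema()
print(df.count())  # 0 would indicate an empty DataFrame
df.limit(5).write.parquet(outputPath + '/probe.parquet', mode='overwrite')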

On Wed, Jan 10, 2018 at 4:58 PM, Liana Napalkova wrote:

Hello,

Has anybody faced the following problem in PySpark? (Python 2.7.12):

df.show() # works fine and shows the first 5 rows of DataFrame

df.write.parquet(outputPath + '/data.parquet', mode="overwrite")  # throws the error

The last line throws the following error:


py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet.
: org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)

Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)

Caused by: org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:794)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

Caused by: java.lang.IllegalArgumentException
at java.nio.Buffer.position(Buffer.java:244)
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:153)
at java.nio.ByteBuffer.get(ByteBuffer.java:715)

Caused by: java.nio.BufferUnderflowException

at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151)
at java.nio.ByteBuffer.get(ByteBuffer.java:715)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484)
at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)

Thanks.

L.


DISCLAIMER: This message may contain confidential information. If you are not the
intended recipient, please delete it and notify us immediately at the following
address: le...@eurecat.org If the recipient of this message does not consent to the
use of e-mail via the Internet and the recording

[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2018-01-09 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319855#comment-16319855
 ] 

Felix Cheung commented on SPARK-21727:
--

good call...

if (
(is.atomic(object) && !is.raw(object)) &&
length(object) > 1
)

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil Alexander McQuarrie
>Assignee: Neil Alexander McQuarrie
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22998) Value for SPARK_MOUNTED_CLASSPATH in executor pods is not set

2018-01-09 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-22998.
--
  Resolution: Fixed
Assignee: Yinan Li
Target Version/s: 2.3.0

> Value for SPARK_MOUNTED_CLASSPATH in executor pods is not set
> -
>
> Key: SPARK-22998
> URL: https://issues.apache.org/jira/browse/SPARK-22998
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
>Assignee: Yinan Li
> Fix For: 2.3.0
>
>
> The environment variable {{SPARK_MOUNTED_CLASSPATH}} is referenced by the 
> executor's Dockerfile, but is not set by the k8s scheduler backend.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Integration testing and Scheduler Backends

2018-01-08 Thread Felix Cheung
How would (2) be uncommon elsewhere?

On Mon, Jan 8, 2018 at 10:16 PM Anirudh Ramanathan 
wrote:

> This is with regard to the Kubernetes Scheduler Backend and scaling the
> process to accept contributions. Given we're moving past upstreaming
> changes from our fork, and into getting *new* patches, I wanted to start
> this discussion sooner than later. This is more of a post-2.3 question -
> not something we're looking to solve right away.
>
> While unit tests are handy, they're not nearly as good at giving us
> confidence as a successful run of our integration tests against
> single/multi-node k8s clusters. Currently, we have integration testing
> setup at https://github.com/apache-spark-on-k8s/spark-integration and
> it's running continuously against apache/spark:master in
> pepperdata-jenkins (on minikube) & k8s-testgrid (in GKE clusters).
> Now, the question is - how do we make integration-tests
> part of the PR author's workflow?
>
> 1. Keep the integration tests in the separate repo and require that
> contributors run them, add new tests prior to accepting their PRs as a
> policy. Given minikube  is easy
> to setup and can run on a single-node, it would certainly be possible.
> Friction however, stems from contributors potentially having to modify the
> integration test code hosted in that separate repository when
> adding/changing functionality in the scheduler backend. Also, it's
> certainly going to lead to at least brief inconsistencies between the two
> repositories.
>
> 2. Alternatively, we check in the integration tests alongside the actual
> scheduler backend code. This would work really well and is what we did in
> our fork. It would have to be a separate package which would take certain
> parameters (like cluster endpoint) and run integration test code against a
> local or remote cluster. It would include least some code dealing with
> accessing the cluster, reading results from K8s containers, test fixtures,
> etc.
>
> I see value in adopting (2), given it's a clearer path for contributors
> and lets us keep the two pieces consistent, but it seems uncommon
> elsewhere. How do the other backends, i.e. YARN, Mesos and Standalone deal
> with accepting patches and ensuring that they do not break existing
> clusters? Is there automation employed for this thus far? Would love to get
> opinions on (1) v/s (2).
>
> Thanks,
> Anirudh
>
>
>


[jira] [Commented] (SPARK-21293) R document update structured streaming

2018-01-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16317805#comment-16317805
 ] 

Felix Cheung commented on SPARK-21293:
--

leaving it open for the rest of items

> R document update structured streaming
> --
>
> Key: SPARK-21293
> URL: https://issues.apache.org/jira/browse/SPARK-21293
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>    Assignee: Felix Cheung
>
> add examples for
> Window Operations on Event Time
> Join Operations
> Streaming Deduplication



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21292) R document Catalog function metadata refresh

2018-01-08 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-21292.
--
  Resolution: Fixed
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> R document Catalog function metadata refresh
> 
>
> Key: SPARK-21292
> URL: https://issues.apache.org/jira/browse/SPARK-21292
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>    Assignee: Felix Cheung
> Fix For: 2.3.0
>
>
> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/sql-programming-guide.html#metadata-refreshing



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21290) R document Programmatically Specifying the Schema in SQL guide

2018-01-08 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-21290:
-
Target Version/s:   (was: 2.3.0)

> R document Programmatically Specifying the Schema in SQL guide
> --
>
> Key: SPARK-21290
> URL: https://issues.apache.org/jira/browse/SPARK-21290
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21293) R document update structured streaming

2018-01-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16317739#comment-16317739
 ] 

Felix Cheung commented on SPARK-21293:
--

not done: 
Join Operations
Streaming Deduplication

> R document update structured streaming
> --
>
> Key: SPARK-21293
> URL: https://issues.apache.org/jira/browse/SPARK-21293
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>    Assignee: Felix Cheung
>
> add examples for
> Window Operations on Event Time
> Join Operations
> Streaming Deduplication



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Felix Cheung
And Hadoop 3.x is not part of the release and sign-off for 2.2.1.

Maybe we could update the website to avoid any confusion with "later".


From: Josh Rosen 
Sent: Monday, January 8, 2018 10:17:14 AM
To: akshay naidu
Cc: Saisai Shao; Raj Adyanthaya; spark users
Subject: Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

My current best guess is that Spark does not fully support Hadoop 3.x because 
https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive shims for 
Hadoop 3.x) has not been resolved. There are also likely to be transitive 
dependency conflicts which will need to be resolved.

On Mon, Jan 8, 2018 at 8:52 AM, akshay naidu wrote:
Yes, the Spark download page does mention that 2.2.1 is for 'Hadoop 2.7 and
later', but my confusion is because Spark was released on 1st Dec and the stable
Hadoop 3 version was released on 13th Dec. And in reply to my similar question on
stackoverflow.com, Mr. jacek-laskowski said that Spark 2.2.1 doesn't support
Hadoop 3. So I am just looking for more clarity on this doubt before moving on
to upgrades.

Thanks all for help.

Akshay.

On Mon, Jan 8, 2018 at 8:47 AM, Saisai Shao wrote:
AFAIK, there's no large scale test for Hadoop 3.0 in the community. So it is 
not clear whether it is supported or not (or has some issues). I think in the 
download page "Pre-Built for Apache Hadoop 2.7 and later" mostly means that it 
supports Hadoop 2.7+ (2.8...), but not 3.0 (IIUC).

Thanks
Jerry

2018-01-08 4:50 GMT+08:00 Raj Adyanthaya:
Hi Akshay

On the Spark Download page when you select Spark 2.2.1 it gives you an option 
to select package type. In that, there is an option to select  "Pre-Built for 
Apache Hadoop 2.7 and later". I am assuming it means that it does support 
Hadoop 3.0.

http://spark.apache.org/downloads.html

Thanks,
Raj A.

On Sat, Jan 6, 2018 at 8:23 PM, akshay naidu wrote:
Hello users,
I need to know whether we can run the latest Spark on the latest Hadoop version,
i.e., Spark 2.2.1 released on 1st Dec and Hadoop 3.0.0 released on 13th Dec.
Thanks.





[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",

2018-01-07 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16315438#comment-16315438
 ] 

Felix Cheung commented on SPARK-22918:
--

[~sameerag]
we might want to check this for 2.3.0 release

> sbt test (spark - local) fail after upgrading to 2.2.1 with: 
> java.security.AccessControlException: access denied 
> org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )
> 
>
> Key: SPARK-22918
> URL: https://issues.apache.org/jira/browse/SPARK-22918
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Damian Momot
>
> After upgrading 2.2.0 -> 2.2.1 sbt test command in one of my projects started 
> to fail with following exception:
> {noformat}
> java.security.AccessControlException: access denied 
> org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )
>   at 
> java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
>   at 
> java.security.AccessController.checkPermission(AccessController.java:884)
>   at 
> org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown
>  Source)
>   at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown 
> Source)
>   at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
>   at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
>   at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source)
>   at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at java.lang.Class.newInstance(Class.java:442)
>   at 
> org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
>   at 
> org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325)
>   at 
> org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282)
>   at 
> org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:240)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:286)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>   at 
> org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
>   at org.datanucleus.NucleusContext.initialise(Nucleus
