[jira] [Commented] (SPARK-17110) Pyspark with locality ANY throw java.io.StreamCorruptedException

2016-08-26 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440786#comment-15440786
 ] 

Miao Wang commented on SPARK-17110:
---

I set up a two-node cluster (one master, one worker) with 48 cores and 1 GB of memory. 
Running the above code in pyspark works fine, with no exception. It seems that this bug 
has been fixed in the latest master branch. Can you upgrade and try again?

> Pyspark with locality ANY throw java.io.StreamCorruptedException
> 
>
> Key: SPARK-17110
> URL: https://issues.apache.org/jira/browse/SPARK-17110
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: Cluster of 2 AWS r3.xlarge nodes launched via ec2 
> scripts, Spark 2.0.0, hadoop: yarn, pyspark shell
>Reporter: Tomer Kaftan
>Priority: Critical
>
> In PySpark 2.0.0, any task that accesses cached data non-locally throws a 
> StreamCorruptedException, as in the stack trace below:
> {noformat}
> WARN TaskSetManager: Lost task 7.0 in stage 2.0 (TID 26, 172.31.26.184): 
> java.io.StreamCorruptedException: invalid stream header: 12010A80
> at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:807)
> at java.io.ObjectInputStream.<init>(ObjectInputStream.java:302)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63)
> at 
> org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122)
> at 
> org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:146)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:524)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:522)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522)
> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The simplest way I have found to reproduce this is by running the following 
> code in the pyspark shell, on a cluster of 2 nodes set to use only one worker 
> core each:
> {code}
> x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache()
> x.count()
> import time
> def waitMap(x):
>     time.sleep(x)
>     return x
> x.map(waitMap).count()
> {code}
> Or by running the following via spark-submit:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache()
> x.count()
> import time
> def waitMap(x):
>     time.sleep(x)
>     return x
> x.map(waitMap).count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17276) Stop environment parameters flooding Jenkins build output

2016-08-26 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-17276:

Attachment: Screen Shot 2016-08-26 at 10.52.07 PM.png

> Stop environment parameters flooding Jenkins build output
> -
>
> Key: SPARK-17276
> URL: https://issues.apache.org/jira/browse/SPARK-17276
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 2.0.0
>Reporter: Xin Ren
>Priority: Minor
> Attachments: Screen Shot 2016-08-26 at 10.52.07 PM.png
>
>
> When I was trying to find the error message in a failed Jenkins build job, I was 
> annoyed by the huge env output. 
> The env parameter output should be muted.
> {code}
> [info] PipedRDDSuite:
> [info] - basic pipe (51 milliseconds)
>   0   0   0
> [info] - basic pipe with tokenization (60 milliseconds)
> [info] - failure in iterating over pipe input (49 milliseconds)
> [info] - advanced pipe (100 milliseconds)
> [info] - pipe with empty partition (117 milliseconds)
> PATH=/home/anaconda/envs/py3k/bin:/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:/usr/java/jdk1.8.0_60/bin:/home/jenkins/tools/hudson.model.JDK/JDK_7u60/bin:/home/jenkins/.cargo/bin:/home/anaconda/bin:/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.1.1/bin/:/home/android-sdk/:/usr/local/bin:/bin:/usr/bin:/home/anaconda/envs/py3k/bin
> BUILD_CAUSE_GHPRBCAUSE=true
> SBT_MAVEN_PROFILES=-Pyarn -Phadoop-2.3 -Phive -Pkinesis-asl 
> -Phive-thriftserver
> HUDSON_HOME=/var/lib/jenkins
> AWS_SECRET_ACCESS_KEY=
> JOB_URL=https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/
> HUDSON_COOKIE=638da3d2-d27a-4724-b41a-5ff6e8ce6752
> LINES=24
> CURRENT_BLOCK=18
> ANDROID_HOME=/home/android-sdk/
> ghprbActualCommit=70a751c6959048e65c083ab775b01523da4578a2
> ghprbSourceBranch=codeWalkThroughML
> GITHUB_OAUTH_KEY=
> MAIL=/var/mail/jenkins
> AMPLAB_JENKINS=1
> JENKINS_SERVER_COOKIE=472906e9832aeb79
> ghprbPullTitle=[MINOR][MLlib][SQL] Clean up unused variables and unused import
> LOGNAME=jenkins
> PWD=/home/jenkins/workspace/SparkPullRequestBuilder
> JENKINS_URL=https://amplab.cs.berkeley.edu/jenkins/
> SPARK_VERSIONS_SUITE_IVY_PATH=/home/sparkivy/per-executor-caches/9/.ivy2
> ROOT_BUILD_CAUSE_GHPRBCAUSE=true
> ghprbActualCommitAuthorEmail=iamsh...@126.com
> ghprbTargetBranch=master
> BUILD_TAG=jenkins-SparkPullRequestBuilder-64504
> SHELL=/bin/bash
> ROOT_BUILD_CAUSE=GHPRBCAUSE
> SBT_OPTS=-Duser.home=/home/sparkivy/per-executor-caches/9 
> -Dsbt.ivy.home=/home/sparkivy/per-executor-caches/9/.ivy2
> JENKINS_HOME=/var/lib/jenkins
> sha1=origin/pr/14836/merge
> ghprbPullDescription=GitHub pull request #14836 of commit 
> 70a751c6959048e65c083ab775b01523da4578a2 automatically merged.
> NODE_NAME=amp-jenkins-worker-02
> BUILD_DISPLAY_NAME=#64504
> JAVA_7_HOME=/usr/java/jdk1.7.0_79
> GIT_BRANCH=codeWalkThroughML
> SHLVL=3
> AMP_JENKINS_PRB=true
> JAVA_HOME=/usr/java/jdk1.8.0_60
> JENKINS_MASTER_HOSTNAME=amp-jenkins-master
> BUILD_ID=64504
> XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
> ghprbPullLink=https://api.github.com/repos/apache/spark/pulls/14836
> JOB_NAME=SparkPullRequestBuilder
> BUILD_CAUSE=GHPRBCAUSE
> SPARK_SCALA_VERSION=2.11
> AWS_ACCESS_KEY_ID=
> NODE_LABELS=amp-jenkins-worker-02 centos spark-compile spark-test
> HUDSON_URL=https://amplab.cs.berkeley.edu/jenkins/
> SPARK_PREPEND_CLASSES=1
> COLUMNS=80
> WORKSPACE=/home/jenkins/workspace/SparkPullRequestBuilder
> SPARK_TESTING=1
> _=/usr/java/jdk1.8.0_60/bin/java
> GIT_COMMIT=b31b82bcc9d8767561ee720c9e7192252f4fd3fc
> ghprbPullId=14836
> EXECUTOR_NUMBER=9
> SSH_CLIENT=192.168.10.10 44762 22
> HUDSON_SERVER_COOKIE=472906e9832aeb79
> cat: nonexistent_file: No such file or directory
> cat: nonexistent_file: No such file or directory
> 

[jira] [Commented] (SPARK-17276) Stop environment parameters flooding Jenkins build output

2016-08-26 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440782#comment-15440782
 ] 

Xin Ren commented on SPARK-17276:
-

I'm working on it.

> Stop environment parameters flooding Jenkins build output
> -
>
> Key: SPARK-17276
> URL: https://issues.apache.org/jira/browse/SPARK-17276
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 2.0.0
>Reporter: Xin Ren
>Priority: Minor
> Attachments: Screen Shot 2016-08-26 at 10.52.07 PM.png
>
>
> When I was trying to find the error message in a failed Jenkins build job, I was 
> annoyed by the huge env output. 
> The env parameter output should be muted.
> {code}
> [info] PipedRDDSuite:
> [info] - basic pipe (51 milliseconds)
>   0   0   0
> [info] - basic pipe with tokenization (60 milliseconds)
> [info] - failure in iterating over pipe input (49 milliseconds)
> [info] - advanced pipe (100 milliseconds)
> [info] - pipe with empty partition (117 milliseconds)
> PATH=/home/anaconda/envs/py3k/bin:/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:/usr/java/jdk1.8.0_60/bin:/home/jenkins/tools/hudson.model.JDK/JDK_7u60/bin:/home/jenkins/.cargo/bin:/home/anaconda/bin:/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.1.1/bin/:/home/android-sdk/:/usr/local/bin:/bin:/usr/bin:/home/anaconda/envs/py3k/bin
> BUILD_CAUSE_GHPRBCAUSE=true
> SBT_MAVEN_PROFILES=-Pyarn -Phadoop-2.3 -Phive -Pkinesis-asl 
> -Phive-thriftserver
> HUDSON_HOME=/var/lib/jenkins
> AWS_SECRET_ACCESS_KEY=
> JOB_URL=https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/
> HUDSON_COOKIE=638da3d2-d27a-4724-b41a-5ff6e8ce6752
> LINES=24
> CURRENT_BLOCK=18
> ANDROID_HOME=/home/android-sdk/
> ghprbActualCommit=70a751c6959048e65c083ab775b01523da4578a2
> ghprbSourceBranch=codeWalkThroughML
> GITHUB_OAUTH_KEY=
> MAIL=/var/mail/jenkins
> AMPLAB_JENKINS=1
> JENKINS_SERVER_COOKIE=472906e9832aeb79
> ghprbPullTitle=[MINOR][MLlib][SQL] Clean up unused variables and unused import
> LOGNAME=jenkins
> PWD=/home/jenkins/workspace/SparkPullRequestBuilder
> JENKINS_URL=https://amplab.cs.berkeley.edu/jenkins/
> SPARK_VERSIONS_SUITE_IVY_PATH=/home/sparkivy/per-executor-caches/9/.ivy2
> ROOT_BUILD_CAUSE_GHPRBCAUSE=true
> ghprbActualCommitAuthorEmail=iamsh...@126.com
> ghprbTargetBranch=master
> BUILD_TAG=jenkins-SparkPullRequestBuilder-64504
> SHELL=/bin/bash
> ROOT_BUILD_CAUSE=GHPRBCAUSE
> SBT_OPTS=-Duser.home=/home/sparkivy/per-executor-caches/9 
> -Dsbt.ivy.home=/home/sparkivy/per-executor-caches/9/.ivy2
> JENKINS_HOME=/var/lib/jenkins
> sha1=origin/pr/14836/merge
> ghprbPullDescription=GitHub pull request #14836 of commit 
> 70a751c6959048e65c083ab775b01523da4578a2 automatically merged.
> NODE_NAME=amp-jenkins-worker-02
> BUILD_DISPLAY_NAME=#64504
> JAVA_7_HOME=/usr/java/jdk1.7.0_79
> GIT_BRANCH=codeWalkThroughML
> SHLVL=3
> AMP_JENKINS_PRB=true
> JAVA_HOME=/usr/java/jdk1.8.0_60
> JENKINS_MASTER_HOSTNAME=amp-jenkins-master
> BUILD_ID=64504
> XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
> ghprbPullLink=https://api.github.com/repos/apache/spark/pulls/14836
> JOB_NAME=SparkPullRequestBuilder
> BUILD_CAUSE=GHPRBCAUSE
> SPARK_SCALA_VERSION=2.11
> AWS_ACCESS_KEY_ID=
> NODE_LABELS=amp-jenkins-worker-02 centos spark-compile spark-test
> HUDSON_URL=https://amplab.cs.berkeley.edu/jenkins/
> SPARK_PREPEND_CLASSES=1
> COLUMNS=80
> WORKSPACE=/home/jenkins/workspace/SparkPullRequestBuilder
> SPARK_TESTING=1
> _=/usr/java/jdk1.8.0_60/bin/java
> GIT_COMMIT=b31b82bcc9d8767561ee720c9e7192252f4fd3fc
> ghprbPullId=14836
> EXECUTOR_NUMBER=9
> SSH_CLIENT=192.168.10.10 44762 22
> HUDSON_SERVER_COOKIE=472906e9832aeb79
> cat: nonexistent_file: No such file or directory
> cat: nonexistent_file: No such file or directory
> 

[jira] [Created] (SPARK-17276) Stop environment parameters flooding Jenkins build output

2016-08-26 Thread Xin Ren (JIRA)
Xin Ren created SPARK-17276:
---

 Summary: Stop environment parameters flooding Jenkins build output
 Key: SPARK-17276
 URL: https://issues.apache.org/jira/browse/SPARK-17276
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Tests
Affects Versions: 2.0.0
Reporter: Xin Ren
Priority: Minor


When I was trying to find the error message in a failed Jenkins build job, I was 
annoyed by the huge env output. 

The env parameter output should be muted.

{code}
[info] PipedRDDSuite:
[info] - basic pipe (51 milliseconds)
  0   0   0
[info] - basic pipe with tokenization (60 milliseconds)
[info] - failure in iterating over pipe input (49 milliseconds)
[info] - advanced pipe (100 milliseconds)
[info] - pipe with empty partition (117 milliseconds)
PATH=/home/anaconda/envs/py3k/bin:/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:/usr/java/jdk1.8.0_60/bin:/home/jenkins/tools/hudson.model.JDK/JDK_7u60/bin:/home/jenkins/.cargo/bin:/home/anaconda/bin:/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.1.1/bin/:/home/android-sdk/:/usr/local/bin:/bin:/usr/bin:/home/anaconda/envs/py3k/bin
BUILD_CAUSE_GHPRBCAUSE=true
SBT_MAVEN_PROFILES=-Pyarn -Phadoop-2.3 -Phive -Pkinesis-asl -Phive-thriftserver
HUDSON_HOME=/var/lib/jenkins
AWS_SECRET_ACCESS_KEY=
JOB_URL=https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/
HUDSON_COOKIE=638da3d2-d27a-4724-b41a-5ff6e8ce6752
LINES=24
CURRENT_BLOCK=18
ANDROID_HOME=/home/android-sdk/
ghprbActualCommit=70a751c6959048e65c083ab775b01523da4578a2
ghprbSourceBranch=codeWalkThroughML
GITHUB_OAUTH_KEY=
MAIL=/var/mail/jenkins
AMPLAB_JENKINS=1
JENKINS_SERVER_COOKIE=472906e9832aeb79
ghprbPullTitle=[MINOR][MLlib][SQL] Clean up unused variables and unused import
LOGNAME=jenkins
PWD=/home/jenkins/workspace/SparkPullRequestBuilder
JENKINS_URL=https://amplab.cs.berkeley.edu/jenkins/
SPARK_VERSIONS_SUITE_IVY_PATH=/home/sparkivy/per-executor-caches/9/.ivy2
ROOT_BUILD_CAUSE_GHPRBCAUSE=true
ghprbActualCommitAuthorEmail=iamsh...@126.com
ghprbTargetBranch=master
BUILD_TAG=jenkins-SparkPullRequestBuilder-64504
SHELL=/bin/bash
ROOT_BUILD_CAUSE=GHPRBCAUSE
SBT_OPTS=-Duser.home=/home/sparkivy/per-executor-caches/9 
-Dsbt.ivy.home=/home/sparkivy/per-executor-caches/9/.ivy2
JENKINS_HOME=/var/lib/jenkins
sha1=origin/pr/14836/merge
ghprbPullDescription=GitHub pull request #14836 of commit 
70a751c6959048e65c083ab775b01523da4578a2 automatically merged.
NODE_NAME=amp-jenkins-worker-02
BUILD_DISPLAY_NAME=#64504
JAVA_7_HOME=/usr/java/jdk1.7.0_79
GIT_BRANCH=codeWalkThroughML
SHLVL=3
AMP_JENKINS_PRB=true
JAVA_HOME=/usr/java/jdk1.8.0_60
JENKINS_MASTER_HOSTNAME=amp-jenkins-master
BUILD_ID=64504
XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
ghprbPullLink=https://api.github.com/repos/apache/spark/pulls/14836
JOB_NAME=SparkPullRequestBuilder
BUILD_CAUSE=GHPRBCAUSE
SPARK_SCALA_VERSION=2.11
AWS_ACCESS_KEY_ID=
NODE_LABELS=amp-jenkins-worker-02 centos spark-compile spark-test
HUDSON_URL=https://amplab.cs.berkeley.edu/jenkins/
SPARK_PREPEND_CLASSES=1
COLUMNS=80
WORKSPACE=/home/jenkins/workspace/SparkPullRequestBuilder
SPARK_TESTING=1
_=/usr/java/jdk1.8.0_60/bin/java
GIT_COMMIT=b31b82bcc9d8767561ee720c9e7192252f4fd3fc
ghprbPullId=14836
EXECUTOR_NUMBER=9
SSH_CLIENT=192.168.10.10 44762 22
HUDSON_SERVER_COOKIE=472906e9832aeb79
cat: nonexistent_file: No such file or directory
cat: nonexistent_file: No such file or directory

[jira] [Commented] (SPARK-17274) Move join optimizer rules into a separate file

2016-08-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440750#comment-15440750
 ] 

Apache Spark commented on SPARK-17274:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14846

> Move join optimizer rules into a separate file
> --
>
> Key: SPARK-17274
> URL: https://issues.apache.org/jira/browse/SPARK-17274
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17274) Move join optimizer rules into a separate file

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17274:


Assignee: Reynold Xin  (was: Apache Spark)

> Move join optimizer rules into a separate file
> --
>
> Key: SPARK-17274
> URL: https://issues.apache.org/jira/browse/SPARK-17274
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17274) Move join optimizer rules into a separate file

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17274:


Assignee: Apache Spark  (was: Reynold Xin)

> Move join optimizer rules into a separate file
> --
>
> Key: SPARK-17274
> URL: https://issues.apache.org/jira/browse/SPARK-17274
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17275) Flaky test: org.apache.spark.deploy.RPackageUtilsSuite.jars that don't exist are skipped and print warning

2016-08-26 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440747#comment-15440747
 ] 

Yin Huai commented on SPARK-17275:
--

cc [~felixcheung] [~shivaram]

> Flaky test: org.apache.spark.deploy.RPackageUtilsSuite.jars that don't exist 
> are skipped and print warning
> --
>
> Key: SPARK-17275
> URL: https://issues.apache.org/jira/browse/SPARK-17275
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1623/testReport/junit/org.apache.spark.deploy/RPackageUtilsSuite/jars_that_don_t_exist_are_skipped_and_print_warning/
> {code}
> Error Message
> java.io.IOException: Unable to delete directory 
> /home/jenkins/.ivy2/cache/a/mylib.
> Stacktrace
> sbt.ForkMain$ForkError: java.io.IOException: Unable to delete directory 
> /home/jenkins/.ivy2/cache/a/mylib.
>   at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1541)
>   at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270)
>   at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
>   at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
>   at 
> org.apache.spark.deploy.IvyTestUtils$.purgeLocalIvyCache(IvyTestUtils.scala:394)
>   at 
> org.apache.spark.deploy.IvyTestUtils$.withRepository(IvyTestUtils.scala:384)
>   at 
> org.apache.spark.deploy.RPackageUtilsSuite$$anonfun$3.apply$mcV$sp(RPackageUtilsSuite.scala:103)
>   at 
> org.apache.spark.deploy.RPackageUtilsSuite$$anonfun$3.apply(RPackageUtilsSuite.scala:100)
>   at 
> org.apache.spark.deploy.RPackageUtilsSuite$$anonfun$3.apply(RPackageUtilsSuite.scala:100)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.deploy.RPackageUtilsSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(RPackageUtilsSuite.scala:38)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.deploy.RPackageUtilsSuite.runTest(RPackageUtilsSuite.scala:38)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:29)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:29)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
>   at 
> 

[jira] [Created] (SPARK-17275) Flaky test: org.apache.spark.deploy.RPackageUtilsSuite.jars that don't exist are skipped and print warning

2016-08-26 Thread Yin Huai (JIRA)
Yin Huai created SPARK-17275:


 Summary: Flaky test: 
org.apache.spark.deploy.RPackageUtilsSuite.jars that don't exist are skipped 
and print warning
 Key: SPARK-17275
 URL: https://issues.apache.org/jira/browse/SPARK-17275
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Yin Huai


https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1623/testReport/junit/org.apache.spark.deploy/RPackageUtilsSuite/jars_that_don_t_exist_are_skipped_and_print_warning/
{code}
Error Message

java.io.IOException: Unable to delete directory 
/home/jenkins/.ivy2/cache/a/mylib.
Stacktrace

sbt.ForkMain$ForkError: java.io.IOException: Unable to delete directory 
/home/jenkins/.ivy2/cache/a/mylib.
at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1541)
at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270)
at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
at 
org.apache.spark.deploy.IvyTestUtils$.purgeLocalIvyCache(IvyTestUtils.scala:394)
at 
org.apache.spark.deploy.IvyTestUtils$.withRepository(IvyTestUtils.scala:384)
at 
org.apache.spark.deploy.RPackageUtilsSuite$$anonfun$3.apply$mcV$sp(RPackageUtilsSuite.scala:103)
at 
org.apache.spark.deploy.RPackageUtilsSuite$$anonfun$3.apply(RPackageUtilsSuite.scala:100)
at 
org.apache.spark.deploy.RPackageUtilsSuite$$anonfun$3.apply(RPackageUtilsSuite.scala:100)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at 
org.apache.spark.deploy.RPackageUtilsSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(RPackageUtilsSuite.scala:38)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
at 
org.apache.spark.deploy.RPackageUtilsSuite.runTest(RPackageUtilsSuite.scala:38)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:29)
at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:29)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:502)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 

[jira] [Created] (SPARK-17274) Move join optimizer rules into a separate file

2016-08-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-17274:
---

 Summary: Move join optimizer rules into a separate file
 Key: SPARK-17274
 URL: https://issues.apache.org/jira/browse/SPARK-17274
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17273) Move expression optimizer rules into a separate file

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17273:


Assignee: Reynold Xin  (was: Apache Spark)

> Move expression optimizer rules into a separate file
> 
>
> Key: SPARK-17273
> URL: https://issues.apache.org/jira/browse/SPARK-17273
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17273) Move expression optimizer rules into a separate file

2016-08-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440742#comment-15440742
 ] 

Apache Spark commented on SPARK-17273:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14845

> Move expression optimizer rules into a separate file
> 
>
> Key: SPARK-17273
> URL: https://issues.apache.org/jira/browse/SPARK-17273
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17273) Move expression optimizer rules into a separate file

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17273:


Assignee: Apache Spark  (was: Reynold Xin)

> Move expression optimizer rules into a separate file
> 
>
> Key: SPARK-17273
> URL: https://issues.apache.org/jira/browse/SPARK-17273
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17272) Move subquery optimizer rules into its own file

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17272:


Assignee: Apache Spark  (was: Reynold Xin)

> Move subquery optimizer rules into its own file
> ---
>
> Key: SPARK-17272
> URL: https://issues.apache.org/jira/browse/SPARK-17272
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17272) Move subquery optimizer rules into its own file

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17272:


Assignee: Reynold Xin  (was: Apache Spark)

> Move subquery optimizer rules into its own file
> ---
>
> Key: SPARK-17272
> URL: https://issues.apache.org/jira/browse/SPARK-17272
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17273) Move expression optimizer rules into a separate file

2016-08-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-17273:
---

 Summary: Move expression optimizer rules into a separate file
 Key: SPARK-17273
 URL: https://issues.apache.org/jira/browse/SPARK-17273
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17272) Move subquery optimizer rules into its own file

2016-08-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440733#comment-15440733
 ] 

Apache Spark commented on SPARK-17272:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14844

> Move subquery optimizer rules into its own file
> ---
>
> Key: SPARK-17272
> URL: https://issues.apache.org/jira/browse/SPARK-17272
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17272) Move subquery optimizer rules into its own file

2016-08-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-17272:
---

 Summary: Move subquery optimizer rules into its own file
 Key: SPARK-17272
 URL: https://issues.apache.org/jira/browse/SPARK-17272
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17270) Move object optimization rules into its own file

2016-08-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440725#comment-15440725
 ] 

Apache Spark commented on SPARK-17270:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14843

> Move object optimization rules into its own file
> 
>
> Key: SPARK-17270
> URL: https://issues.apache.org/jira/browse/SPARK-17270
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17269) Move finish analysis stage into its own file

2016-08-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17269.
-
   Resolution: Fixed
Fix Version/s: 2.0.1, 2.1.0

> Move finish analysis stage into its own file
> 
>
> Key: SPARK-17269
> URL: https://issues.apache.org/jira/browse/SPARK-17269
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17270) Move object optimization rules into its own file

2016-08-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17270.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

https://github.com/apache/spark/pull/14839 has been merged to master.

> Move object optimization rules into its own file
> 
>
> Key: SPARK-17270
> URL: https://issues.apache.org/jira/browse/SPARK-17270
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10747) add support for NULLS FIRST|LAST in ORDER BY clause

2016-08-26 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-10747:
---
Summary: add support for NULLS FIRST|LAST in ORDER BY clause  (was: add 
support for window specification to include how NULLS are ordered)

> add support for NULLS FIRST|LAST in ORDER BY clause
> ---
>
> Key: SPARK-10747
> URL: https://issues.apache.org/jira/browse/SPARK-10747
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> You cannot express how NULLS are to be sorted in the window order 
> specification and have to use a compensating expression to simulate it.
> Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' 
> near 'nulls'
> line 1:82 missing EOF at 'last' near 'nulls';
> SQLState: null
> Same limitation as Hive (reported in Apache JIRA HIVE-9535).
> This fails
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc 
> nulls last) from tolap
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when 
> c3 is null then 1 else 0 end) from tolap
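For reference, here is a minimal DataFrame-side sketch of the compensating-expression 
workaround described above (ordering on a computed null flag so that NULLs sort last), 
since NULLS FIRST|LAST is not accepted yet. The table and column names (tolap, c1, c3) 
follow the example, and an active {{spark}} session is assumed.
{code}
// Hedged sketch of the existing workaround, not new syntax: emulate
// "ORDER BY c3 DESC NULLS LAST" by ordering on a flag that pushes NULLs after non-NULLs.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, when}

val df = spark.table("tolap")
val byC1 = Window
  .partitionBy(col("c1"))
  .orderBy(when(col("c3").isNull, 1).otherwise(0), col("c3").desc)

val ranked = df.withColumn("rank", dense_rank().over(byC1))
{code}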



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17268) Break Optimizer.scala apart

2016-08-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17268:

Description: 
Optimizer.scala has become too large to maintain. We would need to break it 
apart into multiple files, each of which contains rules that are logically 
related.

We can create the following files for logical grouping:
- finish analysis
- joins
- expressions
- subquery
- objects


  was:
Optimizer.scala has become too large to maintain. We would need to break it 
apart into multiple files each of which contains rules that are logically 
relevant.



> Break Optimizer.scala apart
> ---
>
> Key: SPARK-17268
> URL: https://issues.apache.org/jira/browse/SPARK-17268
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Optimizer.scala has become too large to maintain. We would need to break it 
> apart into multiple files, each of which contains rules that are logically 
> related.
> We can create the following files for logical grouping:
> - finish analysis
> - joins
> - expressions
> - subquery
> - objects
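As a purely illustrative sketch (the object name and the no-op rewrite below are 
assumptions, not the actual rules being moved), a rule grouped into its own file would 
still be an ordinary Catalyst {{Rule[LogicalPlan]}}:
{code}
// Hypothetical example of a rule living in, say, a joins.scala file; the name and body
// are illustrative only.
package org.apache.spark.sql.catalyst.optimizer

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object ExampleJoinRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // join-specific rewrites would go here
    case p => p
  }
}
{code}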



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10747) add support for window specification to include how NULLS are ordered

2016-08-26 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440656#comment-15440656
 ] 

Xin Wu commented on SPARK-10747:


This JIRA may be changed to support the NULLS FIRST|LAST feature in the ORDER BY 
clause. 

> add support for window specification to include how NULLS are ordered
> -
>
> Key: SPARK-10747
> URL: https://issues.apache.org/jira/browse/SPARK-10747
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> You cannot express how NULLS are to be sorted in the window order 
> specification and have to use a compensating expression to simulate it.
> Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' 
> near 'nulls'
> line 1:82 missing EOF at 'last' near 'nulls';
> SQLState: null
> Same limitation as Hive (reported in Apache JIRA HIVE-9535).
> This fails
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc 
> nulls last) from tolap
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when 
> c3 is null then 1 else 0 end) from tolap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10747) add support for window specification to include how NULLS are ordered

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10747:


Assignee: Apache Spark

> add support for window specification to include how NULLS are ordered
> -
>
> Key: SPARK-10747
> URL: https://issues.apache.org/jira/browse/SPARK-10747
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>Assignee: Apache Spark
>
> You cannot express how NULLS are to be sorted in the window order 
> specification and have to use a compensating expression to simulate it.
> Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' 
> near 'nulls'
> line 1:82 missing EOF at 'last' near 'nulls';
> SQLState: null
> Same limitation as Hive (reported in Apache JIRA HIVE-9535).
> This fails
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc 
> nulls last) from tolap
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when 
> c3 is null then 1 else 0 end) from tolap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Commented] (SPARK-10747) add support for window specification to include how NULLS are ordered

2016-08-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440651#comment-15440651
 ] 

Apache Spark commented on SPARK-10747:
--

User 'xwu0226' has created a pull request for this issue:
https://github.com/apache/spark/pull/14842

> add support for window specification to include how NULLS are ordered
> -
>
> Key: SPARK-10747
> URL: https://issues.apache.org/jira/browse/SPARK-10747
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> You cannot express how NULLS are to be sorted in the window order 
> specification and have to use a compensating expression to simulate it.
> Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' 
> near 'nulls'
> line 1:82 missing EOF at 'last' near 'nulls';
> SQLState: null
> Same limitation as Hive (reported in Apache JIRA HIVE-9535).
> This fails
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc 
> nulls last) from tolap
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when 
> c3 is null then 1 else 0 end) from tolap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10747) add support for window specification to include how NULLS are ordered

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10747:


Assignee: (was: Apache Spark)

> add support for window specification to include how NULLS are ordered
> -
>
> Key: SPARK-10747
> URL: https://issues.apache.org/jira/browse/SPARK-10747
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> You cannot express how NULLS are to be sorted in the window order 
> specification and have to use a compensating expression to simulate it.
> Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' 
> near 'nulls'
> line 1:82 missing EOF at 'last' near 'nulls';
> SQLState: null
> Same limitation as Hive (reported in Apache JIRA HIVE-9535).
> This fails
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc 
> nulls last) from tolap
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when 
> c3 is null then 1 else 0 end) from tolap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10747) add support for window specification to include how NULLS are ordered

2016-08-26 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-10747:
---
Issue Type: New Feature  (was: Improvement)

> add support for window specification to include how NULLS are ordered
> -
>
> Key: SPARK-10747
> URL: https://issues.apache.org/jira/browse/SPARK-10747
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> You cannot express how NULLS are to be sorted in the window order 
> specification and have to use a compensating expression to simulate it.
> Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' 
> near 'nulls'
> line 1:82 missing EOF at 'last' near 'nulls';
> SQLState: null
> Same limitation as Hive (reported in Apache JIRA HIVE-9535).
> This fails
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc 
> nulls last) from tolap
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when 
> c3 is null then 1 else 0 end) from tolap



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state

2016-08-26 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440598#comment-15440598
 ] 

Frederick Reiss commented on SPARK-16963:
-

Updated the pull request to address some conflicting changes in the main branch 
and to address some minor review comments. Changed the name of `getMinOffset` 
to `lastCommittedOffset` per Prashant's comments. Changes are still ready for 
review.

> Change Source API so that sources do not need to keep unbounded state
> -
>
> Key: SPARK-16963
> URL: https://issues.apache.org/jira/browse/SPARK-16963
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Frederick Reiss
>
> The version of the Source API in Spark 2.0.0 defines a single getBatch() 
> method for fetching records from the source, with the following Scaladoc 
> comments defining the semantics:
> {noformat}
> /**
>  * Returns the data that is between the offsets (`start`, `end`]. When 
> `start` is `None` then
>  * the batch should begin with the first available record. This method must 
> always return the
>  * same data for a particular `start` and `end` pair.
>  */
> def getBatch(start: Option[Offset], end: Offset): DataFrame
> {noformat}
> These semantics mean that a Source must retain all past history for the 
> stream that it backs. Further, a Source is also required to retain this data 
> across restarts of the process where the Source is instantiated, even when 
> the Source is restarted on a different machine.
> These restrictions make it difficult to implement the Source API, as any 
> implementation requires potentially unbounded amounts of distributed storage.
> See the mailing list thread at 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html]
>  for more information.
> This JIRA will cover augmenting the Source API with an additional callback 
> that will allow the Structured Streaming scheduler to notify the source when it 
> is safe to discard buffered data.
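For illustration, a hedged Scala sketch of the kind of callback this JIRA proposes; only 
getBatch matches the Scaladoc quoted above, while the trait name, the commit method, and 
its exact semantics are assumptions of this sketch (the comment earlier in this thread 
also mentions a lastCommittedOffset accessor on the source side).
{code}
// Hedged sketch only, not the merged API: a source augmented with a callback telling it
// which offsets the scheduler is done with, so buffered history can stay bounded.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Offset

trait BoundedStateSource {
  /** Returns the data that is between the offsets (`start`, `end`]. */
  def getBatch(start: Option[Offset], end: Offset): DataFrame

  /** Assumed callback: all data at or before `end` has been processed and persisted,
    * so records buffered for those offsets may be discarded. */
  def commit(end: Offset): Unit
}
{code}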



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17271) Planner adds un-necessary Sort even if child ordering is semantically same as required ordering

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17271:


Assignee: Apache Spark

> Planner adds un-necessary Sort even if child ordering is semantically same as 
> required ordering
> ---
>
> Key: SPARK-17271
> URL: https://issues.apache.org/jira/browse/SPARK-17271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Tejas Patil
>Assignee: Apache Spark
>
> Found a case where the planner adds an un-needed SORT operation due to a bug 
> in the way the comparison of `SortOrder` is done at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253
> `SortOrder` needs to be compared semantically because the `Expression` within two 
> `SortOrder` objects can be "semantically equal" without being literally equal objects.
> E.g., in the case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")`
> Expression in required SortOrder:
> {code}
>   AttributeReference(
> name = "col1",
> dataType = LongType,
> nullable = false
>   ) (exprId = exprId,
> qualifier = Some("a")
>   )
> {code}
> Expression in child SortOrder:
> {code}
>   AttributeReference(
> name = "col1",
> dataType = LongType,
> nullable = false
>   ) (exprId = exprId)
> {code}
> Notice that the output column has a qualifier but the child attribute does 
> not; the underlying expression is the same, and hence in this case we can say 
> that the child satisfies the required sort order.
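As an illustration only (not the actual patch), a hedged sketch of the semantic comparison 
being argued for: two orderings are treated as compatible when each pair of SortOrder 
entries has the same direction and semantically equal child expressions, which ignores 
cosmetic differences such as the qualifier above. Expression.semanticEquals is the existing 
Catalyst helper assumed here.
{code}
// Hedged sketch, not the merged fix: compare required vs. child ordering semantically so
// that a qualifier-only difference (as in the AttributeReference example) adds no Sort.
import org.apache.spark.sql.catalyst.expressions.SortOrder

def orderingSatisfied(required: Seq[SortOrder], child: Seq[SortOrder]): Boolean =
  required.length <= child.length &&
    required.zip(child).forall { case (req, out) =>
      req.direction == out.direction && req.child.semanticEquals(out.child)
    }
{code}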



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17271) Planner adds un-necessary Sort even if child ordering is semantically same as required ordering

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17271:


Assignee: (was: Apache Spark)

> Planner adds un-necessary Sort even if child ordering is semantically same as 
> required ordering
> ---
>
> Key: SPARK-17271
> URL: https://issues.apache.org/jira/browse/SPARK-17271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Tejas Patil
>
> Found a case where the planner adds an un-needed SORT operation due to a bug 
> in the way the comparison of `SortOrder` is done at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253
> `SortOrder` needs to be compared semantically because the `Expression` within two 
> `SortOrder` objects can be "semantically equal" without being literally equal objects.
> E.g., in the case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")`
> Expression in required SortOrder:
> {code}
>   AttributeReference(
> name = "col1",
> dataType = LongType,
> nullable = false
>   ) (exprId = exprId,
> qualifier = Some("a")
>   )
> {code}
> Expression in child SortOrder:
> {code}
>   AttributeReference(
> name = "col1",
> dataType = LongType,
> nullable = false
>   ) (exprId = exprId)
> {code}
> Notice that the output column has a qualifier but the child attribute does 
> not; the underlying expression is the same, and hence in this case we can say 
> that the child satisfies the required sort order.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17271) Planner adds un-necessary Sort even if child ordering is semantically same as required ordering

2016-08-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440557#comment-15440557
 ] 

Apache Spark commented on SPARK-17271:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/14841

> Planner adds un-necessary Sort even if child ordering is semantically same as 
> required ordering
> ---
>
> Key: SPARK-17271
> URL: https://issues.apache.org/jira/browse/SPARK-17271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Tejas Patil
>
> Found a case where the planner adds an un-needed SORT operation due to a bug 
> in the way the comparison of `SortOrder` is done at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253
> `SortOrder` needs to be compared semantically because the `Expression` within two 
> `SortOrder` objects can be "semantically equal" without being literally equal objects.
> E.g., in the case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")`
> Expression in required SortOrder:
> {code}
>   AttributeReference(
> name = "col1",
> dataType = LongType,
> nullable = false
>   ) (exprId = exprId,
> qualifier = Some("a")
>   )
> {code}
> Expression in child SortOrder:
> {code}
>   AttributeReference(
> name = "col1",
> dataType = LongType,
> nullable = false
>   ) (exprId = exprId)
> {code}
> Notice that the output column has a qualifier but the child attribute does 
> not; the underlying expression is the same, and hence in this case we can say 
> that the child satisfies the required sort order.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17271) Planner adds un-necessary Sort even if child ordering is semantically same as required ordering

2016-08-26 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-17271:

Description: 
Found a case where the planner adds an un-needed SORT operation due to a bug in 
the way the comparison of `SortOrder` is done at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253

`SortOrder` needs to be compared semantically because the `Expression` within two 
`SortOrder` objects can be "semantically equal" without being literally equal objects.

E.g., in the case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")`

Expression in required SortOrder:

{code}
  AttributeReference(
name = "col1",
dataType = LongType,
nullable = false
  ) (exprId = exprId,
qualifier = Some("a")
  )
{code}

Expression in child SortOrder:

{code}
  AttributeReference(
name = "col1",
dataType = LongType,
nullable = false
  ) (exprId = exprId)
{code}

Notice that the output column has a qualifier but the child attribute does not; 
the underlying expression is the same, and hence in this case we can say that 
the child satisfies the required sort order.

  was:
Found a case when the planner is adding un-needed SORT operation due to bug in 
the way comparison for `SortOrder` is done at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253

`SortOrder` needs to be compared semantically because `Expression` within two 
`SortOrder` can be "semantically equal" but not literally equal objects.

eg. In case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")`

Expression in required SortOrder:

```
  AttributeReference(
name = "col1",
dataType = LongType,
nullable = false
  ) (exprId = exprId,
qualifier = Some("a")
  )
```

Expression in child SortOrder:

```
  AttributeReference(
name = "col1",
dataType = LongType,
nullable = false
  ) (exprId = exprId)
```

Notice that the output column has a qualifier but the child attribute does not 
but the inherent expression is the same and hence in this case we can say that 
the child satisfies the required sort order.


> Planner adds un-necessary Sort even if child ordering is semantically same as 
> required ordering
> ---
>
> Key: SPARK-17271
> URL: https://issues.apache.org/jira/browse/SPARK-17271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Tejas Patil
>
> Found a case when the planner is adding un-needed SORT operation due to bug 
> in the way comparison for `SortOrder` is done at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253
> `SortOrder` needs to be compared semantically because `Expression` within two 
> `SortOrder` can be "semantically equal" but not literally equal objects.
> eg. In case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")`
> Expression in required SortOrder:
> {code}
>   AttributeReference(
> name = "col1",
> dataType = LongType,
> nullable = false
>   ) (exprId = exprId,
> qualifier = Some("a")
>   )
> {code}
> Expression in child SortOrder:
> {code}
>   AttributeReference(
> name = "col1",
> dataType = LongType,
> nullable = false
>   ) (exprId = exprId)
> {code}
> Notice that the output column has a qualifier but the child attribute does 
> not but the inherent expression is the same and hence in this case we can say 
> that the child satisfies the required sort order.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16216) CSV data source does not write date and timestamp correctly

2016-08-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440546#comment-15440546
 ] 

Apache Spark commented on SPARK-16216:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14840

> CSV data source does not write date and timestamp correctly
> ---
>
> Key: SPARK-16216
> URL: https://issues.apache.org/jira/browse/SPARK-16216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 2.0.1, 2.1.0
>
>
> Currently, CSV data source write {{DateType}} and {{TimestampType}} as below:
> {code}
> ++
> |date|
> ++
> |14406372|
> |14144598|
> |14540400|
> ++
> {code}
> It would be nicer if it write dates and timestamps as a formatted string just 
> like JSON data sources.
> Also, CSV data source currently supports {{dateFormat}} option to read dates 
> and timestamps in a custom format. It might be better if this option can be 
> applied in writing as well.
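For illustration, a hedged Scala sketch of how the write path could be used once 
such an option is honored (the write-side option name below is an assumption that 
mirrors the read-side `dateFormat` option; it is not a documented write option in 
2.0.0):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-date-write").master("local[2]").getOrCreate()

// One date column and one timestamp column.
val df = spark.sql("SELECT current_date() AS d, current_timestamp() AS ts")

// Hypothetical usage: ask the CSV writer to emit formatted strings instead of
// the underlying integer representation.
df.write
  .option("dateFormat", "yyyy-MM-dd")
  .csv("/tmp/csv-with-formatted-dates")
{code}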



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17266) PrefixComparatorsSuite's "String prefix comparator" failed when both input strings are empty strings

2016-08-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17266.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14837
[https://github.com/apache/spark/pull/14837]

> PrefixComparatorsSuite's "String prefix comparator" failed when both input 
> strings are empty strings
> 
>
> Key: SPARK-17266
> URL: https://issues.apache.org/jira/browse/SPARK-17266
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
> Fix For: 2.1.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1620/testReport/junit/org.apache.spark.util.collection.unsafe.sort/PrefixComparatorsSuite/String_prefix_comparator/
> {code}
> org.scalatest.exceptions.GeneratorDrivenPropertyCheckFailedException: 
> TestFailedException was thrown during property evaluation.   Message: 0 
> equaled 0, but 1 did not equal 0, and 0 was not less than 0, and 0 was not 
> greater than 0   Location: (PrefixComparatorsSuite.scala:42)   Occurred when 
> passed generated values ( arg0 = "", arg1 = ""   )
> {code}
> I could not reproduce it locally. But, let me add this case in the 
> regressionTests to explicitly test it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-08-26 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440505#comment-15440505
 ] 

Sun Rui commented on SPARK-13525:
-

What's your Spark cluster deployment mode? YARN or standalone?

> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head and summary and filter methods are not overridden by spark. Hence 
> I need to call them using `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri Feb 26 16:19:35 2016 
> Attaching package: 'SparkR'
> The following objects are masked from 'package:base':
> colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
> summary, transform
> Launching java with spark-submit command 
> /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
> sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   Use Sepal_Length instead of Sepal.Length  as column name
> 2: In FUN(X[[i]], ...) :
>   Use Sepal_Width instead of Sepal.Width  as column name
> 3: In FUN(X[[i]], ...) :
>   Use Petal_Length instead of Petal.Length  as column name
> 4: In FUN(X[[i]], ...) :
>   Use Petal_Width instead of Petal.Width  as column name
> > training <- filter(df, df$Species != "setosa")
> Error in filter(df, df$Species != "setosa") : 
>   no method for coercing this S4 class to a vector
> > training <- SparkR::filter(df, df$Species != "setosa")
> > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> > family = "binomial")
> 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
> at java.net.ServerSocket.implAccept(ServerSocket.java:530)
> at java.net.ServerSocket.accept(ServerSocket.java:498)
> at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
> at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at 

[jira] [Updated] (SPARK-17266) PrefixComparatorsSuite's "String prefix comparator" failed when both input strings are empty strings

2016-08-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17266:
-
Assignee: Yin Huai

> PrefixComparatorsSuite's "String prefix comparator" failed when both input 
> strings are empty strings
> 
>
> Key: SPARK-17266
> URL: https://issues.apache.org/jira/browse/SPARK-17266
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.1.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1620/testReport/junit/org.apache.spark.util.collection.unsafe.sort/PrefixComparatorsSuite/String_prefix_comparator/
> {code}
> org.scalatest.exceptions.GeneratorDrivenPropertyCheckFailedException: 
> TestFailedException was thrown during property evaluation.   Message: 0 
> equaled 0, but 1 did not equal 0, and 0 was not less than 0, and 0 was not 
> greater than 0   Location: (PrefixComparatorsSuite.scala:42)   Occurred when 
> passed generated values ( arg0 = "", arg1 = ""   )
> {code}
> I could not reproduce it locally. But, let me add this case in the 
> regressionTests to explicitly test it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17271) Planner adds un-necessary Sort even if child ordering is semantically same as required ordering

2016-08-26 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-17271:
---

 Summary: Planner adds un-necessary Sort even if child ordering is 
semantically same as required ordering
 Key: SPARK-17271
 URL: https://issues.apache.org/jira/browse/SPARK-17271
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0, 1.6.2
Reporter: Tejas Patil


Found a case where the planner adds an un-needed SORT operation due to a bug in 
the way the comparison for `SortOrder` is done at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253

`SortOrder` needs to be compared semantically because the `Expression`s within two 
`SortOrder`s can be "semantically equal" without being literally equal objects.

e.g. in the case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")`

Expression in required SortOrder:

{code}
  AttributeReference(
name = "col1",
dataType = LongType,
nullable = false
  ) (exprId = exprId,
qualifier = Some("a")
  )
{code}

Expression in child SortOrder:

{code}
  AttributeReference(
name = "col1",
dataType = LongType,
nullable = false
  ) (exprId = exprId)
{code}

Notice that the output column has a qualifier but the child attribute does not; the 
underlying expression is the same, so in this case we can say that the child 
satisfies the required sort order.
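To make the comparison concrete, here is a minimal Scala sketch against Catalyst; it 
assumes `Expression.semanticEquals` and the `AttributeReference` constructor shown 
above, and it only illustrates semantic vs. object equality, it is not the actual 
planner fix:

{code}
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, NamedExpression}
import org.apache.spark.sql.types.LongType

// Two attributes that differ only in the qualifier, as in the plans above.
val id = NamedExpression.newExprId
val required = AttributeReference("col1", LongType, nullable = false)(exprId = id, qualifier = Some("a"))
val child    = AttributeReference("col1", LongType, nullable = false)(exprId = id)

// Object equality treats the qualifier as significant ...
assert(required != child)
// ... but semantic equality does not, so the child ordering already satisfies the requirement.
assert(required.semanticEquals(child))
{code}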



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-08-26 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440476#comment-15440476
 ] 

Sun Rui commented on SPARK-13525:
-


Another guess: could you check whether "localhost" works for local TCP connections?
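A quick way to run that check from the JVM side is sketched below; it only uses the 
standard `java.net` API and is not SparkR's actual backend code:

{code}
import java.net.{InetAddress, ServerSocket, Socket}

// Bind a server socket on the loopback address and connect to it by the name
// "localhost", roughly what the R worker handshake needs to do.
val server = new ServerSocket(0, 1, InetAddress.getByName("localhost"))
try {
  val client = new Socket("localhost", server.getLocalPort) // fails if "localhost" does not resolve
  val accepted = server.accept()                            // should return immediately
  println(s"ok: connected to localhost:${server.getLocalPort}")
  accepted.close(); client.close()
} finally {
  server.close()
}
{code}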


> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head and summary and filter methods are not overridden by spark. Hence 
> I need to call them using `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri Feb 26 16:19:35 2016 
> Attaching package: 'SparkR'
> The following objects are masked from 'package:base':
> colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
> summary, transform
> Launching java with spark-submit command 
> /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
> sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   Use Sepal_Length instead of Sepal.Length  as column name
> 2: In FUN(X[[i]], ...) :
>   Use Sepal_Width instead of Sepal.Width  as column name
> 3: In FUN(X[[i]], ...) :
>   Use Petal_Length instead of Petal.Length  as column name
> 4: In FUN(X[[i]], ...) :
>   Use Petal_Width instead of Petal.Width  as column name
> > training <- filter(df, df$Species != "setosa")
> Error in filter(df, df$Species != "setosa") : 
>   no method for coercing this S4 class to a vector
> > training <- SparkR::filter(df, df$Species != "setosa")
> > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> > family = "binomial")
> 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
> at java.net.ServerSocket.implAccept(ServerSocket.java:530)
> at java.net.ServerSocket.accept(ServerSocket.java:498)
> at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
> at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at 

[jira] [Commented] (SPARK-17270) Move object optimization rules into its own file

2016-08-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440368#comment-15440368
 ] 

Apache Spark commented on SPARK-17270:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14839

> Move object optimization rules into its own file
> 
>
> Key: SPARK-17270
> URL: https://issues.apache.org/jira/browse/SPARK-17270
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17270) Move object optimization rules into its own file

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17270:


Assignee: Reynold Xin  (was: Apache Spark)

> Move object optimization rules into its own file
> 
>
> Key: SPARK-17270
> URL: https://issues.apache.org/jira/browse/SPARK-17270
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17270) Move object optimization rules into its own file

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17270:


Assignee: Apache Spark  (was: Reynold Xin)

> Move object optimization rules into its own file
> 
>
> Key: SPARK-17270
> URL: https://issues.apache.org/jira/browse/SPARK-17270
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17270) Move object optimization rules into its own file

2016-08-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-17270:
---

 Summary: Move object optimization rules into its own file
 Key: SPARK-17270
 URL: https://issues.apache.org/jira/browse/SPARK-17270
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17269) Move finish analysis stage into its own file

2016-08-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440359#comment-15440359
 ] 

Apache Spark commented on SPARK-17269:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14838

> Move finish analysis stage into its own file
> 
>
> Key: SPARK-17269
> URL: https://issues.apache.org/jira/browse/SPARK-17269
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17269) Move finish analysis stage into its own file

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17269:


Assignee: Reynold Xin  (was: Apache Spark)

> Move finish analysis stage into its own file
> 
>
> Key: SPARK-17269
> URL: https://issues.apache.org/jira/browse/SPARK-17269
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17269) Move finish analysis stage into its own file

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17269:


Assignee: Apache Spark  (was: Reynold Xin)

> Move finish analysis stage into its own file
> 
>
> Key: SPARK-17269
> URL: https://issues.apache.org/jira/browse/SPARK-17269
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17269) Move finish analysis stage into its own file

2016-08-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-17269:
---

 Summary: Move finish analysis stage into its own file
 Key: SPARK-17269
 URL: https://issues.apache.org/jira/browse/SPARK-17269
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17268) Break Optimizer.scala apart

2016-08-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-17268:
---

 Summary: Break Optimizer.scala apart
 Key: SPARK-17268
 URL: https://issues.apache.org/jira/browse/SPARK-17268
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


Optimizer.scala has become too large to maintain. We would need to break it 
apart into multiple files, each of which contains logically related rules.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17244) Joins should not pushdown non-deterministic conditions

2016-08-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17244.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14815
[https://github.com/apache/spark/pull/14815]

> Joins should not pushdown non-deterministic conditions
> --
>
> Key: SPARK-17244
> URL: https://issues.apache.org/jira/browse/SPARK-17244
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17244) Joins should not pushdown non-deterministic conditions

2016-08-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17244:
-
Assignee: Sameer Agarwal

> Joins should not pushdown non-deterministic conditions
> --
>
> Key: SPARK-17244
> URL: https://issues.apache.org/jira/browse/SPARK-17244
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17163) Merge MLOR into a single LOR interface

2016-08-26 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440183#comment-15440183
 ] 

DB Tsai commented on SPARK-17163:
-

It relates to [SPARK-17201], but it seems that it's not a concern. 

> Merge MLOR into a single LOR interface
> --
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and, is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1
> *Update*: Seems we have decided to merge the two estimators. I changed the 
> title to reflect that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17266) PrefixComparatorsSuite's "String prefix comparator" failed when both input strings are empty strings

2016-08-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440146#comment-15440146
 ] 

Apache Spark commented on SPARK-17266:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/14837

> PrefixComparatorsSuite's "String prefix comparator" failed when both input 
> strings are empty strings
> 
>
> Key: SPARK-17266
> URL: https://issues.apache.org/jira/browse/SPARK-17266
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1620/testReport/junit/org.apache.spark.util.collection.unsafe.sort/PrefixComparatorsSuite/String_prefix_comparator/
> {code}
> org.scalatest.exceptions.GeneratorDrivenPropertyCheckFailedException: 
> TestFailedException was thrown during property evaluation.   Message: 0 
> equaled 0, but 1 did not equal 0, and 0 was not less than 0, and 0 was not 
> greater than 0   Location: (PrefixComparatorsSuite.scala:42)   Occurred when 
> passed generated values ( arg0 = "", arg1 = ""   )
> {code}
> I could not reproduce it locally. But, let me add this case in the 
> regressionTests to explicitly test it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17266) PrefixComparatorsSuite's "String prefix comparator" failed when both input strings are empty strings

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17266:


Assignee: (was: Apache Spark)

> PrefixComparatorsSuite's "String prefix comparator" failed when both input 
> strings are empty strings
> 
>
> Key: SPARK-17266
> URL: https://issues.apache.org/jira/browse/SPARK-17266
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1620/testReport/junit/org.apache.spark.util.collection.unsafe.sort/PrefixComparatorsSuite/String_prefix_comparator/
> {code}
> org.scalatest.exceptions.GeneratorDrivenPropertyCheckFailedException: 
> TestFailedException was thrown during property evaluation.   Message: 0 
> equaled 0, but 1 did not equal 0, and 0 was not less than 0, and 0 was not 
> greater than 0   Location: (PrefixComparatorsSuite.scala:42)   Occurred when 
> passed generated values ( arg0 = "", arg1 = ""   )
> {code}
> I could not reproduce it locally. But, let me add this case in the 
> regressionTests to explicitly test it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17266) PrefixComparatorsSuite's "String prefix comparator" failed when both input strings are empty strings

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17266:


Assignee: Apache Spark

> PrefixComparatorsSuite's "String prefix comparator" failed when both input 
> strings are empty strings
> 
>
> Key: SPARK-17266
> URL: https://issues.apache.org/jira/browse/SPARK-17266
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1620/testReport/junit/org.apache.spark.util.collection.unsafe.sort/PrefixComparatorsSuite/String_prefix_comparator/
> {code}
> org.scalatest.exceptions.GeneratorDrivenPropertyCheckFailedException: 
> TestFailedException was thrown during property evaluation.   Message: 0 
> equaled 0, but 1 did not equal 0, and 0 was not less than 0, and 0 was not 
> greater than 0   Location: (PrefixComparatorsSuite.scala:42)   Occurred when 
> passed generated values ( arg0 = "", arg1 = ""   )
> {code}
> I could not reproduce it locally. But, let me add this case in the 
> regressionTests to explicitly test it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17163) Merge MLOR into a single LOR interface

2016-08-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440142#comment-15440142
 ] 

Joseph K. Bradley commented on SPARK-17163:
---

I was guessing that, with regParam=0, optimization would be more likely to diverge 
and return blown-up coefficients when not pivoting than when pivoting. A given 
training dataset could constrain the problem enough to yield a well-defined optimal 
solution with regParam=0 and pivoting, but the same might not hold true when not 
pivoting.

> Merge MLOR into a single LOR interface
> --
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and, is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1
> *Update*: Seems we have decided to merge the two estimators. I changed the 
> title to reflect that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17267) Long running structured streaming requirements

2016-08-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17267:

Priority: Blocker  (was: Major)

> Long running structured streaming requirements
> --
>
> Key: SPARK-17267
> URL: https://issues.apache.org/jira/browse/SPARK-17267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>Priority: Blocker
>
> This is an umbrella ticket to track various things that are required in order 
> to have the engine for structured streaming run non-stop in production.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15698) Ability to remove old metadata for structure streaming MetadataLog

2016-08-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15698:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-17267

> Ability to remove old metadata for structure streaming MetadataLog
> --
>
> Key: SPARK-15698
> URL: https://issues.apache.org/jira/browse/SPARK-15698
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Saisai Shao
>Priority: Minor
>
> Current MetadataLog lacks the ability to remove old checkpoint file, we'd 
> better add this functionality to the MetadataLog and honor it in the place 
> where MetadataLog is used, that will reduce unnecessary small files in the 
> long running scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17235) MetadataLog should support purging old logs

2016-08-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17235:

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-17267

> MetadataLog should support purging old logs
> ---
>
> Key: SPARK-17235
> URL: https://issues.apache.org/jira/browse/SPARK-17235
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Peter Lee
>Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
>
> This is a useful primitive operation to have to support checkpointing and 
> forgetting old logs.
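As a rough illustration of such a primitive, here is a hedged Scala sketch; the trait 
and method names (`SimpleMetadataLog`, `purge`, `thresholdId`) are my own and not 
necessarily the exact interface that was added:

{code}
import scala.collection.mutable

trait SimpleMetadataLog[T] {
  def add(batchId: Long, metadata: T): Boolean
  def get(batchId: Long): Option[T]
  /** Forget every entry whose batch id is strictly below `thresholdId`. */
  def purge(thresholdId: Long): Unit
}

class InMemoryMetadataLog[T] extends SimpleMetadataLog[T] {
  private val entries = mutable.HashMap.empty[Long, T]
  override def add(batchId: Long, metadata: T): Boolean =
    entries.put(batchId, metadata).isEmpty
  override def get(batchId: Long): Option[T] = entries.get(batchId)
  override def purge(thresholdId: Long): Unit =
    entries.keys.filter(_ < thresholdId).toList.foreach(entries.remove)
}

// Usage: keep only the most recent batches around a checkpoint.
val log = new InMemoryMetadataLog[String]
(0L to 5L).foreach(i => log.add(i, s"batch-$i"))
log.purge(thresholdId = 4L)
assert(log.get(2L).isEmpty && log.get(5L).isDefined)
{code}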



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17165) FileStreamSource should not track the list of seen files indefinitely

2016-08-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17165:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17267

> FileStreamSource should not track the list of seen files indefinitely
> -
>
> Key: SPARK-17165
> URL: https://issues.apache.org/jira/browse/SPARK-17165
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
>
> FileStreamSource currently tracks all the files seen indefinitely, which 
> means it can run out of memory or overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17267) Long running structured streaming requirements

2016-08-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-17267:
---

 Summary: Long running structured streaming requirements
 Key: SPARK-17267
 URL: https://issues.apache.org/jira/browse/SPARK-17267
 Project: Spark
  Issue Type: Bug
  Components: SQL, Streaming
Reporter: Reynold Xin


This is an umbrella ticket to track various things that are required in order 
to have the engine for structured streaming run non-stop in production.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17235) MetadataLog should support purging old logs

2016-08-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17235.
-
   Resolution: Fixed
 Assignee: Peter Lee
Fix Version/s: 2.1.0
   2.0.1

> MetadataLog should support purging old logs
> ---
>
> Key: SPARK-17235
> URL: https://issues.apache.org/jira/browse/SPARK-17235
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Streaming
>Reporter: Peter Lee
>Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
>
> This is a useful primitive operation to have to support checkpointing and 
> forgetting old logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17266) PrefixComparatorsSuite's "String prefix comparator" failed when both input strings are empty strings

2016-08-26 Thread Yin Huai (JIRA)
Yin Huai created SPARK-17266:


 Summary: PrefixComparatorsSuite's "String prefix comparator" 
failed when both input strings are empty strings
 Key: SPARK-17266
 URL: https://issues.apache.org/jira/browse/SPARK-17266
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Reporter: Yin Huai


https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1620/testReport/junit/org.apache.spark.util.collection.unsafe.sort/PrefixComparatorsSuite/String_prefix_comparator/

{code}
org.scalatest.exceptions.GeneratorDrivenPropertyCheckFailedException: 
TestFailedException was thrown during property evaluation.   Message: 0 equaled 
0, but 1 did not equal 0, and 0 was not less than 0, and 0 was not greater than 
0   Location: (PrefixComparatorsSuite.scala:42)   Occurred when passed 
generated values ( arg0 = "", arg1 = ""   )
{code}

I could not reproduce it locally. But, let me add this case in the 
regressionTests to explicitly test it.
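A hedged sketch of what that regression case could look like follows; it assumes the 
`PrefixComparators` and `UTF8String` APIs that the suite already exercises and is not 
necessarily the exact test that was committed:

{code}
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.util.collection.unsafe.sort.PrefixComparators

// Two empty strings must compare as equal both through the computed prefixes
// and through the full string comparison.
val s1 = UTF8String.fromString("")
val s2 = UTF8String.fromString("")
val p1 = PrefixComparators.StringPrefixComparator.computePrefix(s1)
val p2 = PrefixComparators.StringPrefixComparator.computePrefix(s2)
assert(PrefixComparators.STRING.compare(p1, p2) == 0)
assert(s1.compareTo(s2) == 0)
{code}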



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-08-26 Thread Arihanth Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15440090#comment-15440090
 ] 

Arihanth Jain commented on SPARK-13525:
---

[~sunrui] I have tried setting "spark.sparkr.use.daemon" to false with no luck. Now I 
am dealing with this by creating an R cluster using the base package "parallel" and 
the makePSOCKcluster function. I believe this gets closer to the finding:

When passing the nodes' hostnames, the workers fail with the following error and hang 
on it forever

@@@
@   WARNING: POSSIBLE DNS SPOOFING DETECTED!  @
@@@
The RSA host key for test02.servers.jiffybox.net has changed,
and the key for the corresponding IP address 134.xxx.xx.xxx
is unchanged. This could either mean that
DNS SPOOFING is happening or the IP address for the host
and its host key have changed at the same time.
Offending key for IP in /root/.ssh/known_hosts:10
@@@
@WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
.
Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending key in /root/.ssh/known_hosts:9
RSA host key for test02.servers.jiffybox.net has changed and you have requested 
strict checking.
Host key verification failed.


The same works fine and all workers are started when passing the nodes' IP addresses 
instead of hostnames.

starting worker pid=32407 on master.jiffybox.net:11575 at 22:18:50.050
starting worker pid=3523 on master.jiffybox.net:11575 at 22:18:50.464
starting worker pid=2583 on master.jiffybox.net:11575 at 22:18:50.885
starting worker pid=5227 on master.jiffybox.net:11575 at 22:18:51.294

--

The above "DNS SPOOFING" issue was simply resolved by removing the matching 
entries from .ssh/known_hosts and recreating them for all nodes "ssh 
root@hostname". This fixed the previous issue and was able to able to create 
socket cluster with 4 nodes (now at port 11977).


starting worker pid=6804 on master.jiffybox.net:11977 at 23:59:23.245
starting worker pid=10257 on master.jiffybox.net:11977 at 23:59:23.668
starting worker pid=9776 on master.jiffybox.net:11977 at 23:59:24.107
starting worker pid=12073 on master.jiffybox.net:11977 at 23:59:24.540

note: Neither the path to Rscript nor any port number was specified.

--

Unfortunately, this did not resolve the problem with SparkR. It still fails with the 
existing "java.net.SocketTimeoutException: Accept timed out" issue.


> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head and summary and filter methods are not overridden by spark. Hence 
> I need to call them using `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri Feb 26 16:19:35 2016 
> Attaching package: 'SparkR'
> The following objects are masked from 'package:base':
> colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
> summary, transform
> Launching java with spark-submit command 
> /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
> sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   

[jira] [Commented] (SPARK-17044) Add window function test in SQLQueryTestSuite

2016-08-26 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439925#comment-15439925
 ] 

Dongjoon Hyun commented on SPARK-17044:
---

Hi, [~rxin].
Could you review this issue?

> Add window function test in SQLQueryTestSuite
> -
>
> Key: SPARK-17044
> URL: https://issues.apache.org/jira/browse/SPARK-17044
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue adds a SQL query test for Window functions for new 
> `SQLQueryTestSuite`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17044) Add window function test in SQLQueryTestSuite

2016-08-26 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17044:
--
Description: This issue adds a SQL query test for Window functions for new 
`SQLQueryTestSuite`.  (was: New `SQLQueryTestSuite` simplifies SQL testcases.
This issue aims to replace `WindowQuerySuite.scala` of `sql/hive` module with 
`window_functions.sql` in `sql/core` module.)

> Add window function test in SQLQueryTestSuite
> -
>
> Key: SPARK-17044
> URL: https://issues.apache.org/jira/browse/SPARK-17044
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue adds a SQL query test for Window functions for new 
> `SQLQueryTestSuite`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439857#comment-15439857
 ] 

Apache Spark commented on SPARK-17243:
--

User 'ajbozarth' has created a pull request for this issue:
https://github.com/apache/spark/pull/14835

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of Spark 2.0 history server web UI keep displaying "Loading 
> history summary..." all the time and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation, "historypage.js" file sends a REST request to 
> /api/v1/applications endpoint of history server REST endpoint and gets back 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application history it is running fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17243:


Assignee: (was: Apache Spark)

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of Spark 2.0 history server web UI keep displaying "Loading 
> history summary..." all the time and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation, "historypage.js" file sends a REST request to 
> /api/v1/applications endpoint of history server REST endpoint and gets back 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application history it is running fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17243:


Assignee: Apache Spark

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>Assignee: Apache Spark
>
> The summary page of Spark 2.0 history server web UI keep displaying "Loading 
> history summary..." all the time and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation, "historypage.js" file sends a REST request to 
> /api/v1/applications endpoint of history server REST endpoint and gets back 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application history it is running fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-26 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439799#comment-15439799
 ] 

Alex Bozarth commented on SPARK-17243:
--

So I decided to work on this as a short break from my current work, and I have a 
fix that just requires some final testing before I open a PR; it should be open by 
EOD.

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of Spark 2.0 history server web UI keep displaying "Loading 
> history summary..." all the time and crashes the browser when there are more 
> than 10K application history event logs on HDFS. 
> I did some investigation, "historypage.js" file sends a REST request to 
> /api/v1/applications endpoint of history server REST endpoint and gets back 
> json response. When there are more than 10K applications inside the event log 
> directory it takes forever to parse them and render the page. When there are 
> only hundreds or thousands of application history it is running fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15882) Discuss distributed linear algebra in spark.ml package

2016-08-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439767#comment-15439767
 ] 

Joseph K. Bradley commented on SPARK-15882:
---

This seems like an important feature, but less critical than other feature 
parity issues in {{spark.ml}}.  Essentially, most users who want distributed 
linear algebra are fairly expert.  Those expert users are often experienced 
enough to know how to work with RDDs and DataFrames to do conversions as 
needed.  Missing algorithms, on the other hand, often impact non-experts who do 
not know how to combine spark.ml with spark.mllib.  Therefore, I'd prioritize 
adding missing algorithms (FPGrowth, etc.) to spark.ml over adding distributed 
linear algebra.

That said, we definitely need to do this task before too long, so it would be 
great to start thinking about design.

RDDs vs. Datasets: For the initial implementation, I'd say we should use RDDs 
to limit initial work required, though I'm open to Datasets if we do scaling 
tests.  However, I strongly prefer only using Datasets in the public APIs, with 
the expectation that we can eventually switch over to Dataset-based 
implementations.  It is true that RDDs offer more flexibility now, but we 
should push for the needed flexibility in Datasets so that we can take 
advantage of their other improvements over RDDs.

Functionality: This can be sketched in the design doc.  The main question is 
whether we want to change APIs from spark.mllib, especially if any are not 
Java-friendly.

Plugging in other local linear algebra: This should be addressed in the design 
doc.  I hope, however, that this decision can be made later (by exposing 
internal APIs as needed) so that the migration is not held up by massive design 
discussions.

Scaling: Regardless of our approach, we'll need to do proper scalability tests 
to make sure we do not have regressions in the migration.

"Migration": I should clarify that I'm assuming we will leave 
spark.mllib.linalg alone and will be adding new APIs in spark.ml.linalg.

> Discuss distributed linear algebra in spark.ml package
> --
>
> Key: SPARK-15882
> URL: https://issues.apache.org/jira/browse/SPARK-15882
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* 
> should be migrated to org.apache.spark.ml.
> Initial questions:
> * Should we use Datasets or RDDs underneath?
> * If Datasets, are there missing features needed for the migration?
> * Do we want to redesign any aspects of the distributed matrices during this 
> move?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17246) Support BigDecimal literal parsing

2016-08-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17246.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> Support BigDecimal literal parsing
> --
>
> Key: SPARK-17246
> URL: https://issues.apache.org/jira/browse/SPARK-17246
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16967) Collect Mesos support code into a module/profile

2016-08-26 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16967.

   Resolution: Fixed
 Assignee: Michael Gummelt
Fix Version/s: 2.1.0

> Collect Mesos support code into a module/profile
> 
>
> Key: SPARK-16967
> URL: https://issues.apache.org/jira/browse/SPARK-16967
> Project: Spark
>  Issue Type: Task
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>Assignee: Michael Gummelt
>Priority: Critical
> Fix For: 2.1.0
>
>
> CC [~mgummelt] [~tnachen] [~skonto] 
> I think this is fairly easy and would be beneficial as more work goes into 
> Mesos. It should be separated into a module the way YARN is, partly on 
> principle, but also because anyone who doesn't need Mesos support can then 
> build without it.
> I'm entirely willing to take a shot at this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17265) EdgeRDD Difference throws an exception

2016-08-26 Thread Shishir Kharel (JIRA)
Shishir Kharel created SPARK-17265:
--

 Summary: EdgeRDD Difference throws an exception
 Key: SPARK-17265
 URL: https://issues.apache.org/jira/browse/SPARK-17265
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
 Environment: windows, ubuntu
Reporter: Shishir Kharel


Subtracting two edge RDDs throws an exception.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17261) Using HiveContext after re-creating SparkContext in Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a stopped sparkContext"

2016-08-26 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439613#comment-15439613
 ] 

Dongjoon Hyun commented on SPARK-17261:
---

Hi, [~dakghar]

For me, those do not seem to work even in `spark-shell`. Could you add a `show` 
at the end? I tested and got the same result in 2.0.0 and the current master branch.
{code}
scala> import org.apache.spark.sql.SparkSession
scala> val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
scala> spark.sql("show databases").show
++
|databaseName|
++
| default|
++

scala> spark.stop()
scala> val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
scala> spark.sql("show databases").show
16/08/26 12:09:22 ERROR Schema: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = 
jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating 
connection pool (set lazyInit to true if you expect to start your database 
after your app). Original Exception: --
java.sql.SQLException: Failed to start database 'metastore_db' with class 
loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@6b60d99c, 
see the next exception for details.
{code}

> Using HiveContext after re-creating SparkContext in Spark 2.0 throws 
> "Java.lang.illegalStateException: Cannot call methods on a stopped 
> sparkContext"
> -
>
> Key: SPARK-17261
> URL: https://issues.apache.org/jira/browse/SPARK-17261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: Amazon AWS EMR 5.0
>Reporter: Rahul Jain
> Fix For: 2.0.0
>
>
> After stopping a SparkSession, recreating it and using HiveContext in it 
> throws an error.
> Steps to reproduce:
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> spark.sql("show databases")
> spark.stop()
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> spark.sql("show databases")
> "Java.lang.illegalStateException: Cannot call methods on a stopped 
> sparkContext"
> The above error occurs only in PySpark, not in spark-shell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17220) Upgrade Py4J to 0.10.3

2016-08-26 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-17220:
-
Component/s: PySpark

> Upgrade Py4J to 0.10.3
> --
>
> Key: SPARK-17220
> URL: https://issues.apache.org/jira/browse/SPARK-17220
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Weiqing Yang
>Priority: Minor
>
> Py4J 0.10.3 has landed. It includes some important bug fixes, for example:
> Both sides: fixed a memory leak issue with ClientServer and a potential 
> deadlock issue by creating a memory leak test suite. (Py4J 0.10.2)
> Both sides: added more memory leak tests and fixed a potential memory leak 
> related to listeners. (Py4J 0.10.3)
> So it's time to upgrade Py4J from 0.10.1 to 0.10.3. The changelog is 
> available at https://www.py4j.org/changelog.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17265) EdgeRDD Difference throws an exception

2016-08-26 Thread Shishir Kharel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shishir Kharel updated SPARK-17265:
---
Description: 
Subtracting two edge RDDs throws an exception.

val difference = graph1.edges.subtract(graph2.edges)

gives

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost 
task 1.0 in stage 0.0 (TID 1, localhost): java.lang.ClassCastException: 
org.apache.spark.graphx.Edge cannot be cast to scala.Tuple2
at 
org.apache.spark.rdd.RDD$$anonfun$subtract$3$$anon$3.getPartition(RDD.scala:968)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:152)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

  was:
Subtracting two edge RDDs throws an exception.



> EdgeRDD Difference throws an exception
> --
>
> Key: SPARK-17265
> URL: https://issues.apache.org/jira/browse/SPARK-17265
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: windows, ubuntu
>Reporter: Shishir Kharel
>
> Subtracting two edge RDDs throws an exception.
> val difference = graph1.edges.subtract(graph2.edges)
> gives
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.ClassCastException: 
> org.apache.spark.graphx.Edge cannot be cast to scala.Tuple2
> at 
> org.apache.spark.rdd.RDD$$anonfun$subtract$3$$anon$3.getPartition(RDD.scala:968)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:152)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
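For reference, a minimal standalone reproduction along the lines of the report 
might look like the following sketch (the edge contents are made up for 
illustration, and `sc` is assumed to be an active SparkContext):

{code}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Two small graphs built from made-up edges.
val edges1: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0), Edge(3L, 1L, 0)))
val edges2: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 0)))
val graph1 = Graph.fromEdges(edges1, defaultValue = 0)
val graph2 = Graph.fromEdges(edges2, defaultValue = 0)

// graph1.edges is an EdgeRDD[Int]; subtract is inherited from RDD[Edge[Int]].
// On 2.0.0 this is where the ClassCastException above is reported to occur.
val difference = graph1.edges.subtract(graph2.edges)
difference.collect()
{code}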



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16501) spark.mesos.secret exposed on UI and command line

2016-08-26 Thread Eric Daniel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439609#comment-15439609
 ] 

Eric Daniel commented on SPARK-16501:
-

Great to know, thanks!

> spark.mesos.secret exposed on UI and command line
> -
>
> Key: SPARK-16501
> URL: https://issues.apache.org/jira/browse/SPARK-16501
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, Web UI
>Affects Versions: 1.6.2
>Reporter: Eric Daniel
>  Labels: security
>
> There are two related problems with spark.mesos.secret:
> 1) The web UI shows its value in the "environment" tab
> 2) Passing it as a command-line option to spark-submit (or creating a 
> SparkContext from Python, with the effect of launching spark-submit) exposes 
> it to "ps"
> I'll be happy to submit a patch but I could use some advice first.
> The first problem is easy enough: just don't show that value in the UI.
> For the second problem, I'm not sure what the best solution is. A 
> "spark.mesos.secret-file" parameter would let the user store the secret in a 
> non-world-readable file. Alternatively, the mesos secret could be obtained 
> from the environment, which other users don't have access to. Either 
> solution would work in client mode, but I don't know whether they're workable 
> in cluster mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17207) Comparing Vector in relative tolerance or absolute tolerance in UnitTests error

2016-08-26 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-17207.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14785
[https://github.com/apache/spark/pull/14785]

> Comparing Vector in relative tolerance or absolute tolerance in UnitTests 
> error 
> 
>
> Key: SPARK-17207
> URL: https://issues.apache.org/jira/browse/SPARK-17207
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
> Fix For: 2.1.0
>
>
> The result of comparing two vectors with the test utilities 
> (org.apache.spark.mllib.util.TestingUtils) is sometimes wrong.
> For example:
> val a = Vectors.dense(Array(1.0, 2.0))
> val b = Vectors.zeros(0)
> a ~== b absTol 1e-1 // the result is true.
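A minimal sketch of the size guard one would expect such a comparison to 
perform (approxEqual is a hypothetical helper, not the actual TestingUtils 
code):

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper: vectors of different sizes should never be considered
// approximately equal, regardless of the tolerance.
def approxEqual(a: Vector, b: Vector, absTol: Double): Boolean =
  a.size == b.size &&
    a.toArray.zip(b.toArray).forall { case (x, y) => math.abs(x - y) <= absTol }

val a = Vectors.dense(Array(1.0, 2.0))
val b = Vectors.zeros(0)
approxEqual(a, b, 1e-1) // false, unlike the behaviour described above
{code}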



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17165) FileStreamSource should not track the list of seen files indefinitely

2016-08-26 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17165:
-
Fix Version/s: 2.1.0
   2.0.1

> FileStreamSource should not track the list of seen files indefinitely
> -
>
> Key: SPARK-17165
> URL: https://issues.apache.org/jira/browse/SPARK-17165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
>
> FileStreamSource currently tracks all the files seen indefinitely, which 
> means it can run out of memory or overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17165) FileStreamSource should not track the list of seen files indefinitely

2016-08-26 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17165.
--
Resolution: Fixed
  Assignee: Peter Lee

> FileStreamSource should not track the list of seen files indefinitely
> -
>
> Key: SPARK-17165
> URL: https://issues.apache.org/jira/browse/SPARK-17165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>Assignee: Peter Lee
>
> FileStreamSource currently tracks all the files seen indefinitely, which 
> means it can run out of memory or overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-26 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439517#comment-15439517
 ] 

Gang Wu commented on SPARK-17243:
-

Thanks [~ajbozarth]! Let me know when it is done.

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: "historypage.js" sends a REST request to the 
> history server's /api/v1/applications endpoint and gets back a JSON response. 
> When there are more than 10K applications inside the event log directory it 
> takes forever to parse them and render the page. With only hundreds or 
> thousands of applications it runs fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17250) Remove HiveClient and setCurrentDatabase from HiveSessionCatalog

2016-08-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17250:
-
Assignee: Xiao Li  (was: Apache Spark)

> Remove HiveClient and setCurrentDatabase from HiveSessionCatalog
> 
>
> Key: SPARK-17250
> URL: https://issues.apache.org/jira/browse/SPARK-17250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> This is the first step to clean `HiveClient` out of `HiveSessionState`. In 
> the metastore interaction, we always set fully qualified names when 
> accessing/operating on a table. That means we always specify the database. 
> Thus, it is not necessary to use `HiveClient` to change the active database 
> in the Hive metastore. 
> In `HiveSessionCatalog`, `setCurrentDatabase` is the only function that uses 
> `HiveClient`. Thus, we can remove it after removing `setCurrentDatabase`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17250) Remove HiveClient and setCurrentDatabase from HiveSessionCatalog

2016-08-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17250.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14821
[https://github.com/apache/spark/pull/14821]

> Remove HiveClient and setCurrentDatabase from HiveSessionCatalog
> 
>
> Key: SPARK-17250
> URL: https://issues.apache.org/jira/browse/SPARK-17250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
> Fix For: 2.1.0
>
>
> This is the first step to clean `HiveClient` out of `HiveSessionState`. In 
> the metastore interaction, we always set fully qualified names when 
> accessing/operating on a table. That means we always specify the database. 
> Thus, it is not necessary to use `HiveClient` to change the active database 
> in the Hive metastore. 
> In `HiveSessionCatalog`, `setCurrentDatabase` is the only function that uses 
> `HiveClient`. Thus, we can remove it after removing `setCurrentDatabase`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17163) Merge MLOR into a single LOR interface

2016-08-26 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439499#comment-15439499
 ] 

Seth Hendrickson commented on SPARK-17163:
--

[~dbtsai] We can discuss these design points on the WIP PR. We can change from 
what is currently implemented there, but I find it is always easier to 
communicate if we can directly look at code :)

> Merge MLOR into a single LOR interface
> --
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is a 
> bit superfluous, since MLOR can do basically all of what BLOR does. We should 
> decide whether it needs to be changed and implement those changes before 2.1.
> *Update*: It seems we have decided to merge the two estimators. I changed the 
> title to reflect that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17192) Issuing an exception when users specify the partitioning columns without a given schema

2016-08-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17192:
-
Assignee: Xiao Li

> Issuing an exception when users specify the partitioning columns without a 
> given schema
> ---
>
> Key: SPARK-17192
> URL: https://issues.apache.org/jira/browse/SPARK-17192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> We need to issue an exception when users specify the partitioning columns 
> without a given schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17192) Issuing an exception when users specify the partitioning columns without a given schema

2016-08-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17192.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14572
[https://github.com/apache/spark/pull/14572]

> Issuing an exception when users specify the partitioning columns without a 
> given schema
> ---
>
> Key: SPARK-17192
> URL: https://issues.apache.org/jira/browse/SPARK-17192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
> Fix For: 2.1.0
>
>
> We need to issue an exception when users specify the partitioning columns 
> without a given schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17252) Performing arithmetic in VALUES can lead to ClassCastException / MatchErrors during query parsing

2016-08-26 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439478#comment-15439478
 ] 

Josh Rosen commented on SPARK-17252:


Looks like this issue only affects 2.0.0, so I'm going to resolve it as fixed 
in 2.0.1.

> Performing arithmetic in VALUES can lead to ClassCastException / MatchErrors 
> during query parsing
> -
>
> Key: SPARK-17252
> URL: https://issues.apache.org/jira/browse/SPARK-17252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
> Fix For: 2.0.1
>
>
> The following example fails with a ClassCastException:
> {code}
> create table t(d double);
> insert into t VALUES (1 * 1.0);
> {code}
>  Here's the error:
> {code}
> java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be 
> cast to java.lang.Integer
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
>   at scala.math.Numeric$IntIsIntegral$.times(Numeric.scala:57)
>   at 
> org.apache.spark.sql.catalyst.expressions.Multiply.nullSafeEval(arithmetic.scala:207)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:416)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateStruct.eval(complexTypeCreator.scala:198)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:320)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1$$anonfun$39.apply(AstBuilder.scala:677)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1$$anonfun$39.apply(AstBuilder.scala:674)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1.apply(AstBuilder.scala:674)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1.apply(AstBuilder.scala:658)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:96)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitInlineTable(AstBuilder.scala:658)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitInlineTable(AstBuilder.scala:43)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$InlineTableContext.accept(SqlBaseParser.java:9358)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:57)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitInlineTableDefault1(SqlBaseBaseVisitor.java:608)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$InlineTableDefault1Context.accept(SqlBaseParser.java:7073)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:57)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitQueryTermDefault(SqlBaseBaseVisitor.java:580)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$QueryTermDefaultContext.accept(SqlBaseParser.java:6895)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:47)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.plan(AstBuilder.scala:83)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleInsertQuery$1.apply(AstBuilder.scala:158)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleInsertQuery$1.apply(AstBuilder.scala:162)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:96)
>   at 
> 

[jira] [Resolved] (SPARK-17252) Performing arithmetic in VALUES can lead to ClassCastException / MatchErrors during query parsing

2016-08-26 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17252.

   Resolution: Fixed
Fix Version/s: 2.0.1

> Performing arithmetic in VALUES can lead to ClassCastException / MatchErrors 
> during query parsing
> -
>
> Key: SPARK-17252
> URL: https://issues.apache.org/jira/browse/SPARK-17252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
> Fix For: 2.0.1
>
>
> The following example fails with a ClassCastException:
> {code}
> create table t(d double);
> insert into t VALUES (1 * 1.0);
> {code}
>  Here's the error:
> {code}
> java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be 
> cast to java.lang.Integer
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
>   at scala.math.Numeric$IntIsIntegral$.times(Numeric.scala:57)
>   at 
> org.apache.spark.sql.catalyst.expressions.Multiply.nullSafeEval(arithmetic.scala:207)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:416)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateStruct.eval(complexTypeCreator.scala:198)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:320)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1$$anonfun$39.apply(AstBuilder.scala:677)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1$$anonfun$39.apply(AstBuilder.scala:674)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1.apply(AstBuilder.scala:674)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1.apply(AstBuilder.scala:658)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:96)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitInlineTable(AstBuilder.scala:658)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitInlineTable(AstBuilder.scala:43)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$InlineTableContext.accept(SqlBaseParser.java:9358)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:57)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitInlineTableDefault1(SqlBaseBaseVisitor.java:608)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$InlineTableDefault1Context.accept(SqlBaseParser.java:7073)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:57)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitQueryTermDefault(SqlBaseBaseVisitor.java:580)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$QueryTermDefaultContext.accept(SqlBaseParser.java:6895)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:47)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.plan(AstBuilder.scala:83)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleInsertQuery$1.apply(AstBuilder.scala:158)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleInsertQuery$1.apply(AstBuilder.scala:162)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:96)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleInsertQuery(AstBuilder.scala:157)
>   at 
> 

[jira] [Commented] (SPARK-17163) Merge MLOR into a single LOR interface

2016-08-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439475#comment-15439475
 ] 

Apache Spark commented on SPARK-17163:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/14834

> Merge MLOR into a single LOR interface
> --
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is a 
> bit superfluous, since MLOR can do basically all of what BLOR does. We should 
> decide whether it needs to be changed and implement those changes before 2.1.
> *Update*: It seems we have decided to merge the two estimators. I changed the 
> title to reflect that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17163) Merge MLOR into a single LOR interface

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17163:


Assignee: Apache Spark

> Merge MLOR into a single LOR interface
> --
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is a 
> bit superfluous, since MLOR can do basically all of what BLOR does. We should 
> decide whether it needs to be changed and implement those changes before 2.1.
> *Update*: It seems we have decided to merge the two estimators. I changed the 
> title to reflect that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17163) Merge MLOR into a single LOR interface

2016-08-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17163:


Assignee: (was: Apache Spark)

> Merge MLOR into a single LOR interface
> --
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is a 
> bit superfluous, since MLOR can do basically all of what BLOR does. We should 
> decide whether it needs to be changed and implement those changes before 2.1.
> *Update*: It seems we have decided to merge the two estimators. I changed the 
> title to reflect that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-26 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439470#comment-15439470
 ] 

Alex Bozarth commented on SPARK-17243:
--

Thanks [~ste...@apache.org], this idea is great. [~wgtmac], based on this I 
might be able to get a small fix for this out next week instead of waiting to 
include it in my larger update next month.

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: "historypage.js" sends a REST request to the 
> history server's /api/v1/applications endpoint and gets back a JSON response. 
> When there are more than 10K applications inside the event log directory it 
> takes forever to parse them and render the page. With only hundreds or 
> thousands of applications it runs fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17165) FileStreamSource should not track the list of seen files indefinitely

2016-08-26 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439391#comment-15439391
 ] 

Frederick Reiss commented on SPARK-17165:
-

This problem is actually deeper than just FileStreamSource. With the current 
version of the Source trait, *every* source needs to keep unbounded state. 
[~scrapco...@gmail.com] ran into that issue while writing a connector for MQTT. 
I opened SPARK-16963 a few weeks back to cover the core issue with the Stream 
trait. My open PR for that JIRA (https://github.com/apache/spark/pull/14553) 
has a fair amount of overlap with the PR here and with the one in SPARK-17235.

Can we merge our efforts here to make a single sequence of small, 
easy-to-review change sets that will resolve these state management issues 
across all sources? I'm thinking that we can create a single JIRA (or reuse one 
of the existing ones) to cover "keep only bounded state for Structured 
Streaming data sources", then divide that JIRA into the following tasks:
# Add a method to `Source` to trigger cleaning of processed data
# Add a method to `HDFSMetadataLog` to clean out processed metadata
# Implement garbage collection of old data (metadata and files) in 
`FileStreamSource`
# Implement garbage collection of old data in `MemoryStream` and other stubs of 
Source
# Modify the scheduler (`StreamExecution`) so that it triggers garbage 
collection of data and metadata

Thoughts?
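
To make item 1 in the list above concrete, here is a purely illustrative sketch 
of what such a hook on `Source` could look like; the trait name, method name, 
and signature are assumptions, not an agreed design:

{code}
import org.apache.spark.sql.execution.streaming.{Offset, Source}

// Hypothetical extension point: the scheduler would call purge() once a batch
// ending at `end` has been committed, so the source can drop any tracking
// state (e.g. the list of seen files) for data at or before that offset.
trait CleanableSource extends Source {
  def purge(end: Offset): Unit
}
{code}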

> FileStreamSource should not track the list of seen files indefinitely
> -
>
> Key: SPARK-17165
> URL: https://issues.apache.org/jira/browse/SPARK-17165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>
> FileStreamSource currently tracks all the files seen indefinitely, which 
> means it can run out of memory or overflow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17251) "ClassCastException: OuterReference cannot be cast to NamedExpression" for correlated subquery on the RHS of an IN operator

2016-08-26 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439360#comment-15439360
 ] 

Herman van Hovell commented on SPARK-17251:
---

Ok, I have taken a look at this one. We should make {{OuterReference}} a 
{{NamedExpression}} and then we are good (I have most of the code working 
locally). 

If we fix this, it will fail analysis because we are using a correlated 
predicate in a {{Project}}. We could make an exception for IN, but I am 
wondering whether we support such an unusual construct at all.

> "ClassCastException: OuterReference cannot be cast to NamedExpression" for 
> correlated subquery on the RHS of an IN operator
> ---
>
> Key: SPARK-17251
> URL: https://issues.apache.org/jira/browse/SPARK-17251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>
> The following test case produces a ClassCastException in the analyzer:
> {code}
> CREATE TABLE t1(a INTEGER);
> INSERT INTO t1 VALUES(1),(2);
> CREATE TABLE t2(b INTEGER);
> INSERT INTO t2 VALUES(1);
> SELECT a FROM t1 WHERE a NOT IN (SELECT a FROM t2);
> {code}
> Here's the exception:
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.NamedExpression
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$1.apply(basicLogicalOperators.scala:48)
>   at 
> scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
>   at scala.collection.immutable.List.exists(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.resolved$lzycompute(basicLogicalOperators.scala:44)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.resolved(basicLogicalOperators.scala:43)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQuery(Analyzer.scala:1091)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1130)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1116)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries(Analyzer.scala:1116)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1148)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1141)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> 

[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439352#comment-15439352
 ] 

Steve Loughran commented on SPARK-17243:


The REST API actually lets you set a time range for querying entries coming 
back, though not a limit.

This problem could presumably be addressed in a couple of ways:

# add a {{limit}} argument to the REST API, declaring the max # of responses to 
return
# leave the REST API alone but tweak the client code to work backwards from now 
to try and get a range. That's more convoluted and is probably brittle with 
respect to clock skew.

Strategy #1 is simpler and would avoid the server being overloaded by large 
requests made directly by arbitrary callers; that serialization is going to be 
expensive too, and is an easy way to bring the history server down.

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: "historypage.js" sends a REST request to the 
> history server's /api/v1/applications endpoint and gets back a JSON response. 
> When there are more than 10K applications inside the event log directory it 
> takes forever to parse them and render the page. With only hundreds or 
> thousands of applications it runs fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16998) select($"column1", explode($"column2")) is extremely slow

2016-08-26 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439336#comment-15439336
 ] 

Takeshi Yamamuro commented on SPARK-16998:
--

can we link this ticket to SPARK-15214?

> select($"column1", explode($"column2")) is extremely slow
> -
>
> Key: SPARK-16998
> URL: https://issues.apache.org/jira/browse/SPARK-16998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: TobiasP
>
> Using a Dataset containing 10,000 rows, each containing null and an array of 
> 5,000 Ints, I observe the following performance (in local mode):
> {noformat}
> scala> time(ds.select(explode($"value")).sample(false, 0.001, 1).collect)
> 1.219052 seconds  
>   
> res9: Array[org.apache.spark.sql.Row] = Array([3761], [3766], [3196])
> scala> time(ds.select($"dummy", explode($"value")).sample(false, 0.001, 
> 1).collect)
> 20.219447 seconds 
>   
> res5: Array[org.apache.spark.sql.Row] = Array([null,3761], [null,3766], 
> [null,3196])
> {noformat}
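
The `time` helper used in the measurements is not shown in the report; a 
minimal version, assuming it just evaluates the expression and prints the 
elapsed seconds, might be:

{code}
// Hypothetical timing helper matching the output format above.
def time[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"${(System.nanoTime() - start) / 1e9}%.6f seconds")
  result
}
{code}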



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16998) select($"column1", explode($"column2")) is extremely slow

2016-08-26 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439331#comment-15439331
 ] 

Takeshi Yamamuro commented on SPARK-16998:
--

yea, no problem. thanks!

> select($"column1", explode($"column2")) is extremely slow
> -
>
> Key: SPARK-16998
> URL: https://issues.apache.org/jira/browse/SPARK-16998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: TobiasP
>
> Using a Dataset containing 10,000 rows, each containing null and an array of 
> 5,000 Ints, I observe the following performance (in local mode):
> {noformat}
> scala> time(ds.select(explode($"value")).sample(false, 0.001, 1).collect)
> 1.219052 seconds  
>   
> res9: Array[org.apache.spark.sql.Row] = Array([3761], [3766], [3196])
> scala> time(ds.select($"dummy", explode($"value")).sample(false, 0.001, 
> 1).collect)
> 20.219447 seconds 
>   
> res5: Array[org.apache.spark.sql.Row] = Array([null,3761], [null,3766], 
> [null,3196])
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17260) move CreateTables to HiveStrategies

2016-08-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17260.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14825
[https://github.com/apache/spark/pull/14825]

> move CreateTables to HiveStrategies
> ---
>
> Key: SPARK-17260
> URL: https://issues.apache.org/jira/browse/SPARK-17260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16998) select($"column1", explode($"column2")) is extremely slow

2016-08-26 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439276#comment-15439276
 ] 

Herman van Hovell commented on SPARK-16998:
---

[~maropu] Do you mind if I do it myself? I already started hacking.

> select($"column1", explode($"column2")) is extremely slow
> -
>
> Key: SPARK-16998
> URL: https://issues.apache.org/jira/browse/SPARK-16998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: TobiasP
>
> Using a Dataset containing 10,000 rows, each containing null and an array of 
> 5,000 Ints, I observe the following performance (in local mode):
> {noformat}
> scala> time(ds.select(explode($"value")).sample(false, 0.001, 1).collect)
> 1.219052 seconds  
>   
> res9: Array[org.apache.spark.sql.Row] = Array([3761], [3766], [3196])
> scala> time(ds.select($"dummy", explode($"value")).sample(false, 0.001, 
> 1).collect)
> 20.219447 seconds 
>   
> res5: Array[org.apache.spark.sql.Row] = Array([null,3761], [null,3766], 
> [null,3196])
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17252) Performing arithmetic in VALUES can lead to ClassCastException / MatchErrors during query parsing

2016-08-26 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439269#comment-15439269
 ] 

Herman van Hovell commented on SPARK-17252:
---

I cannot reproduce this. I tried both on the latest master and branch-2.0.

> Performing arithmetic in VALUES can lead to ClassCastException / MatchErrors 
> during query parsing
> -
>
> Key: SPARK-17252
> URL: https://issues.apache.org/jira/browse/SPARK-17252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>
> The following example fails with a ClassCastException:
> {code}
> create table t(d double);
> insert into t VALUES (1 * 1.0);
> {code}
>  Here's the error:
> {code}
> java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be 
> cast to java.lang.Integer
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
>   at scala.math.Numeric$IntIsIntegral$.times(Numeric.scala:57)
>   at 
> org.apache.spark.sql.catalyst.expressions.Multiply.nullSafeEval(arithmetic.scala:207)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:416)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateStruct.eval(complexTypeCreator.scala:198)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:320)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1$$anonfun$39.apply(AstBuilder.scala:677)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1$$anonfun$39.apply(AstBuilder.scala:674)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1.apply(AstBuilder.scala:674)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1.apply(AstBuilder.scala:658)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:96)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitInlineTable(AstBuilder.scala:658)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitInlineTable(AstBuilder.scala:43)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$InlineTableContext.accept(SqlBaseParser.java:9358)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:57)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitInlineTableDefault1(SqlBaseBaseVisitor.java:608)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$InlineTableDefault1Context.accept(SqlBaseParser.java:7073)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:57)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitQueryTermDefault(SqlBaseBaseVisitor.java:580)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$QueryTermDefaultContext.accept(SqlBaseParser.java:6895)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:47)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.plan(AstBuilder.scala:83)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleInsertQuery$1.apply(AstBuilder.scala:158)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleInsertQuery$1.apply(AstBuilder.scala:162)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:96)
>   at 
> 
