[jira] [Assigned] (SPARK-12530) Build break at Spark-Master-Maven-Snapshots from #1293

2015-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12530:


Assignee: Apache Spark

> Build break at Spark-Master-Maven-Snapshots from #1293
> --
>
> Key: SPARK-12530
> URL: https://issues.apache.org/jira/browse/SPARK-12530
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> A build break occurs in Spark-Master-Maven-Snapshots starting from build #1293 due to
> a compilation error in misc.scala.
> {noformat}
> /home/jenkins/workspace/Spark-Master-Maven-Snapshots/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala:61:
>  error: annotation argument needs to be a constant; found: "_FUNC_(input, 
> bitLength) - Returns a checksum of SHA-2 family as a hex string of the 
> ".+("input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length 
> of 0 is equivalent ").+("to 256")
> "input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length 
> of 0 is equivalent " +
>   
> ^
> {noformat}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/consoleFull
> This file was changed by [SPARK-12456].






[jira] [Assigned] (SPARK-12530) Build break at Spark-Master-Maven-Snapshots from #1293

2015-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12530:


Assignee: (was: Apache Spark)

> Build break at Spark-Master-Maven-Snapshots from #1293
> --
>
> Key: SPARK-12530
> URL: https://issues.apache.org/jira/browse/SPARK-12530
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>
> A build break occurs in Spark-Master-Maven-Snapshots starting from build #1293 due to
> a compilation error in misc.scala.
> {noformat}
> /home/jenkins/workspace/Spark-Master-Maven-Snapshots/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala:61:
>  error: annotation argument needs to be a constant; found: "_FUNC_(input, 
> bitLength) - Returns a checksum of SHA-2 family as a hex string of the 
> ".+("input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length 
> of 0 is equivalent ").+("to 256")
> "input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length 
> of 0 is equivalent " +
>   
> ^
> {noformat}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/consoleFull
> This file was changed by [SPARK-12456].






[jira] [Commented] (SPARK-12530) Build break at Spark-Master-Maven-Snapshots from #1293

2015-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072103#comment-15072103
 ] 

Apache Spark commented on SPARK-12530:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/10488

> Build break at Spark-Master-Maven-Snapshots from #1293
> --
>
> Key: SPARK-12530
> URL: https://issues.apache.org/jira/browse/SPARK-12530
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>
> A build break occurs in Spark-Master-Maven-Snapshots starting from build #1293 due to
> a compilation error in misc.scala.
> {noformat}
> /home/jenkins/workspace/Spark-Master-Maven-Snapshots/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala:61:
>  error: annotation argument needs to be a constant; found: "_FUNC_(input, 
> bitLength) - Returns a checksum of SHA-2 family as a hex string of the 
> ".+("input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length 
> of 0 is equivalent ").+("to 256")
> "input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length 
> of 0 is equivalent " +
>   
> ^
> {noformat}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/consoleFull
> This file was changed by [SPARK-12456].






[jira] [Commented] (SPARK-12461) Add ExpressionDescription to math functions

2015-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072110#comment-15072110
 ] 

Apache Spark commented on SPARK-12461:
--

User 'vectorijk' has created a pull request for this issue:
https://github.com/apache/spark/pull/10489

> Add ExpressionDescription to math functions
> ---
>
> Key: SPARK-12461
> URL: https://issues.apache.org/jira/browse/SPARK-12461
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>







[jira] [Commented] (SPARK-12517) No default RDD name for ones created by sc.textFile

2015-12-27 Thread yaron weinsberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072184#comment-15072184
 ] 

yaron weinsberg commented on SPARK-12517:
-

https://github.com/apache/spark/pull/10456

> No default RDD name for ones created by sc.textFile 
> 
>
> Key: SPARK-12517
> URL: https://issues.apache.org/jira/browse/SPARK-12517
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.2
>Reporter: yaron weinsberg
>Priority: Minor
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Having a default name for an RDD created from a file is very handy. 
> The feature was first added at commit: 7b877b2 but was later removed 
> (probably by mistake) at commit: fc8b581. 
> This change sets the default name of RDDs created via sc.textFile(...) to the
> path argument.
> Here is the symptom:
> Using spark-1.5.2-bin-hadoop2.6:
> scala> sc.textFile("/home/root/.bashrc").name
> res5: String = null
> scala> sc.binaryFiles("/home/root/.bashrc").name
> res6: String = /home/root/.bashrc
> while using Spark 1.3.1:
> scala> sc.textFile("/home/root/.bashrc").name
> res0: String = /home/root/.bashrc
> scala> sc.binaryFiles("/home/root/.bashrc").name
> res1: String = /home/root/.bashrc
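
As a side note, the existing {{RDD.setName}} API already allows naming the RDD explicitly; a minimal sketch of that workaround (local mode, illustrative only, not the actual patch):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object RddNameExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-name-example").setMaster("local[*]"))
    val path = "/home/root/.bashrc"
    // Workaround until sc.textFile assigns a default name: set the name explicitly.
    val lines = sc.textFile(path).setName(path)
    println(lines.name) // prints /home/root/.bashrc
    sc.stop()
  }
}
{code}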






[jira] [Commented] (SPARK-12531) Add median and mode to Summary statistics

2015-12-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072196#comment-15072196
 ] 

Sean Owen commented on SPARK-12531:
---

Those are non-trivial to compute exactly; unlike moments, they can't be
computed from a couple of summary statistics. I am not sure this can be added to
the generic object. Is it valuable enough to warrant a separate implementation?
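
To make the cost concrete, a toy sketch (plain RDD code, not MLlib) of an exact median: unlike mean and variance, it needs a global sort plus lookups by rank instead of a single streaming pass, which is why it does not fit the one-pass summarizer.

{code}
import org.apache.spark.rdd.RDD

// Exact median: full sort (a shuffle) plus lookups by rank; assumes non-empty input.
def exactMedian(data: RDD[Double]): Double = {
  val indexed = data.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
  val n = indexed.count()
  if (n % 2 == 1) indexed.lookup(n / 2).head
  else (indexed.lookup(n / 2 - 1).head + indexed.lookup(n / 2).head) / 2.0
}
{code}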

> Add median and mode to Summary statistics
> -
>
> Key: SPARK-12531
> URL: https://issues.apache.org/jira/browse/SPARK-12531
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Gaurav Kumar
>Priority: Minor
>
> Summary statistics should also include calculating median and mode in 
> addition to mean, variance and others.






[jira] [Commented] (SPARK-12521) DataFrame Partitions in java does not work

2015-12-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072225#comment-15072225
 ] 

Sean Owen commented on SPARK-12521:
---

PS see https://issues.apache.org/jira/browse/SPARK-12515

> DataFrame Partitions in java does not work
> --
>
> Key: SPARK-12521
> URL: https://issues.apache.org/jira/browse/SPARK-12521
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 1.5.2
>Reporter: Sergey Podolsky
>
> Hello,
> Partitioning does not work in the Java interface of the DataFrame:
> {code}
> SQLContext sqlContext = new SQLContext(sc);
> Map<String, String> options = new HashMap<>();
> options.put("driver", ORACLE_DRIVER);
> options.put("url", ORACLE_CONNECTION_URL);
> options.put("dbtable",
> "(SELECT * FROM JOBS WHERE ROWNUM < 1) tt");
> options.put("lowerBound", "2704225000");
> options.put("upperBound", "2704226000");
> options.put("partitionColumn", "ID");
> options.put("numPartitions", "10");
> DataFrame jdbcDF = sqlContext.load("jdbc", options);
> List<Row> jobsRows = jdbcDF.collectAsList();
> System.out.println(jobsRows.size());
> {code}
> gives  while 1000 was expected. Is it because the boundaries are big decimals, or
> does partitioning not work at all in Java?
> Thanks.
> Sergey






[jira] [Reopened] (SPARK-12518) Problem in Spark deserialization with htsjdk BAMRecordCodec

2015-12-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-12518:
---

(Questions should go to the mailing list; this was not a Spark problem and was 
not resolved by a commit so "Fixed" is not the right resolution)

> Problem in Spark deserialization with htsjdk BAMRecordCodec
> ---
>
> Key: SPARK-12518
> URL: https://issues.apache.org/jira/browse/SPARK-12518
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 1.5.2
> Environment: Linux Red Hat 4.8.2-16, Java 8, htsjdk-1.130
>Reporter: Zhanpeng Wu
>
> When I used [htsjdk|https://github.com/samtools/htsjdk] in my Spark
> application, I found a problem in record deserialization. A *SAMRecord* object
> could not be deserialized and threw the following exception:
> {quote}
> WARN ThrowableSerializationWrapper: Task exception could not be deserialized
> java.lang.ClassNotFoundException: htsjdk.samtools.util.RuntimeIOException
> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:340)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
> at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
> at 
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
> at 
> org.apache.spark.ThrowableSerializationWrapper.readObject(TaskEndReason.scala:167)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
> at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply$mcV$sp(TaskResultGetter.scala:108)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:105)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> It seems that the application encountered a premature EOF when deserializing.
> Here is my test code: 
> 

[jira] [Commented] (SPARK-12263) IllegalStateException: Memory can't be 0 for SPARK_WORKER_MEMORY without unit

2015-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072200#comment-15072200
 ] 

Apache Spark commented on SPARK-12263:
--

User 'nssalian' has created a pull request for this issue:
https://github.com/apache/spark/pull/10483

> IllegalStateException: Memory can't be 0 for SPARK_WORKER_MEMORY without unit
> -
>
> Key: SPARK-12263
> URL: https://issues.apache.org/jira/browse/SPARK-12263
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> When starting a worker with the following command (note
> {{SPARK_WORKER_MEMORY=1024}}), it fails, saying that the memory was 0 while it
> was 1024 (without a size unit).
> {code}
> ➜  spark git:(master) ✗ SPARK_WORKER_MEMORY=1024 SPARK_WORKER_CORES=5 
> ./sbin/start-slave.sh spark://localhost:7077
> starting org.apache.spark.deploy.worker.Worker, logging to 
> /Users/jacek/dev/oss/spark/logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> failed to launch org.apache.spark.deploy.worker.Worker:
>   INFO ShutdownHookManager: Shutdown hook called
>   INFO ShutdownHookManager: Deleting directory 
> /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8wgn/T/spark-f4e5f222-e938-46b2-a189-241453cf1f50
> full log in 
> /Users/jacek/dev/oss/spark/logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> {code}
> The full stack trace is as follows:
> {code}
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> INFO Worker: Registered signal handlers for [TERM, HUP, INT]
> Exception in thread "main" java.lang.IllegalStateException: Memory can't be 
> 0, missing a M or G on the end of the memory specification?
> at 
> org.apache.spark.deploy.worker.WorkerArguments.checkWorkerMemory(WorkerArguments.scala:179)
> at 
> org.apache.spark.deploy.worker.WorkerArguments.(WorkerArguments.scala:64)
> at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:691)
> at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
> INFO ShutdownHookManager: Shutdown hook called
> INFO ShutdownHookManager: Deleting directory 
> /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8wgn/T/spark-f4e5f222-e938-46b2-a189-241453cf1f50
> {code}
> The following command starts a Spark standalone worker successfully:
> {code}
> SPARK_WORKER_MEMORY=1g SPARK_WORKER_CORES=5 ./sbin/start-slave.sh 
> spark://localhost:7077
> {code}
> The master reports:
> {code}
> INFO Master: Registering worker 192.168.1.6:63884 with 5 cores, 1024.0 MB RAM
> {code}
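
A hedged reading of the error message (a sketch of the arithmetic, not a citation of the Spark source): a unit-less value appears to be interpreted as bytes and then truncated to whole megabytes, so 1024 ends up as 0 MB.

{code}
// "1024" with no suffix is treated as 1024 bytes, which truncates to 0 MB.
val bytes = 1024L
val mb = bytes / 1024 / 1024 // 0
require(mb > 0,
  "Memory can't be 0, missing a M or G on the end of the memory specification?")
{code}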






[jira] [Assigned] (SPARK-12263) IllegalStateException: Memory can't be 0 for SPARK_WORKER_MEMORY without unit

2015-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12263:


Assignee: Apache Spark

> IllegalStateException: Memory can't be 0 for SPARK_WORKER_MEMORY without unit
> -
>
> Key: SPARK-12263
> URL: https://issues.apache.org/jira/browse/SPARK-12263
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: starter
>
> When starting a worker with the following command (note
> {{SPARK_WORKER_MEMORY=1024}}), it fails, saying that the memory was 0 while it
> was 1024 (without a size unit).
> {code}
> ➜  spark git:(master) ✗ SPARK_WORKER_MEMORY=1024 SPARK_WORKER_CORES=5 
> ./sbin/start-slave.sh spark://localhost:7077
> starting org.apache.spark.deploy.worker.Worker, logging to 
> /Users/jacek/dev/oss/spark/logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> failed to launch org.apache.spark.deploy.worker.Worker:
>   INFO ShutdownHookManager: Shutdown hook called
>   INFO ShutdownHookManager: Deleting directory 
> /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8wgn/T/spark-f4e5f222-e938-46b2-a189-241453cf1f50
> full log in 
> /Users/jacek/dev/oss/spark/logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> {code}
> The full stack trace is as follows:
> {code}
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> INFO Worker: Registered signal handlers for [TERM, HUP, INT]
> Exception in thread "main" java.lang.IllegalStateException: Memory can't be 
> 0, missing a M or G on the end of the memory specification?
> at 
> org.apache.spark.deploy.worker.WorkerArguments.checkWorkerMemory(WorkerArguments.scala:179)
> at 
> org.apache.spark.deploy.worker.WorkerArguments.(WorkerArguments.scala:64)
> at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:691)
> at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
> INFO ShutdownHookManager: Shutdown hook called
> INFO ShutdownHookManager: Deleting directory 
> /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8wgn/T/spark-f4e5f222-e938-46b2-a189-241453cf1f50
> {code}
> The following command starts a Spark standalone worker successfully:
> {code}
> SPARK_WORKER_MEMORY=1g SPARK_WORKER_CORES=5 ./sbin/start-slave.sh 
> spark://localhost:7077
> {code}
> The master reports:
> {code}
> INFO Master: Registering worker 192.168.1.6:63884 with 5 cores, 1024.0 MB RAM
> {code}






[jira] [Created] (SPARK-12532) Join-key Pushdown via Predicate Transitivity

2015-12-27 Thread Xiao Li (JIRA)
Xiao Li created SPARK-12532:
---

 Summary: Join-key Pushdown via Predicate Transitivity
 Key: SPARK-12532
 URL: https://issues.apache.org/jira/browse/SPARK-12532
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Xiao Li


{code}
"SELECT * FROM upperCaseData JOIN lowerCaseData where lowerCaseData.n = 
upperCaseData.N and lowerCaseData.n = 3"
{code}
{code}
== Analyzed Logical Plan ==
N: int, L: string, n: int, l: string
Project [N#16,L#17,n#18,l#19]
+- Filter ((n#18 = N#16) && (n#18 = 3))
   +- Join Inner, None
  :- Subquery upperCaseData
  :  +- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
BeforeAndAfterAll.scala:187
  +- Subquery lowerCaseData
 +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
BeforeAndAfterAll.scala:187
{code}
{code}
== Optimized Logical Plan ==
Project [N#16,L#17,n#18,l#19]
+- Join Inner, Some((n#18 = N#16))
   :- Filter (N#16 = 3)
   :  +- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
BeforeAndAfterAll.scala:187
   +- Filter (n#18 = 3)
  +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
BeforeAndAfterAll.scala:187
{code}
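
A toy sketch of the inference the proposed rule needs (plain Scala over simplified predicate types, not Catalyst expressions): from an attribute-equality predicate and a literal predicate on one side, derive the same literal predicate for the other side so it can be pushed below the join.

{code}
// Toy model: n = N plus n = 3 should also yield N = 3.
sealed trait Pred
case class EqAttr(left: String, right: String) extends Pred // e.g. n = N
case class EqLit(attr: String, value: Int) extends Pred     // e.g. n = 3

def inferTransitive(preds: Seq[Pred]): Seq[Pred] = {
  val derived = for {
    EqAttr(a, b) <- preds
    EqLit(x, v)  <- preds
    inferred <- Seq(a -> b, b -> a).collectFirst { case (`x`, other) => EqLit(other, v) }
  } yield inferred
  (preds ++ derived).distinct
}

// inferTransitive(Seq(EqAttr("n", "N"), EqLit("n", 3)))
//   ==> Seq(EqAttr("n", "N"), EqLit("n", 3), EqLit("N", 3))
{code}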






[jira] [Updated] (SPARK-12532) Join-key Pushdown via Predicate Transitivity

2015-12-27 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12532:

Description: 
{code}
"SELECT * FROM upperCaseData JOIN lowerCaseData where lowerCaseData.n = 
upperCaseData.N and lowerCaseData.n = 3"
{code}
{code}
== Analyzed Logical Plan ==
N: int, L: string, n: int, l: string
Project [N#16,L#17,n#18,l#19]
+- Filter ((n#18 = N#16) && (n#18 = 3))
   +- Join Inner, None
  :- Subquery upperCaseData
  :  +- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
BeforeAndAfterAll.scala:187
  +- Subquery lowerCaseData
 +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
BeforeAndAfterAll.scala:187
{code}
Before the improvement, the optimized logical plan is
{code}
== Optimized Logical Plan ==
Project [N#16,L#17,n#18,l#19]
+- Join Inner, Some((n#18 = N#16))
   :- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
BeforeAndAfterAll.scala:187
   +- Filter (n#18 = 3)
  +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
BeforeAndAfterAll.scala:187
{code}
After the improvement, the optimized logical plan should look like this:
{code}
== Optimized Logical Plan ==
Project [N#16,L#17,n#18,l#19]
+- Join Inner, Some((n#18 = N#16))
   :- Filter (N#16 = 3)
   :  +- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
BeforeAndAfterAll.scala:187
   +- Filter (n#18 = 3)
  +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
BeforeAndAfterAll.scala:187
{code}

  was:
{code}
"SELECT * FROM upperCaseData JOIN lowerCaseData where lowerCaseData.n = 
upperCaseData.N and lowerCaseData.n = 3"
{code}
{code}
== Analyzed Logical Plan ==
N: int, L: string, n: int, l: string
Project [N#16,L#17,n#18,l#19]
+- Filter ((n#18 = N#16) && (n#18 = 3))
   +- Join Inner, None
  :- Subquery upperCaseData
  :  +- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
BeforeAndAfterAll.scala:187
  +- Subquery lowerCaseData
 +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
BeforeAndAfterAll.scala:187
{code}
{code}
== Optimized Logical Plan ==
Project [N#16,L#17,n#18,l#19]
+- Join Inner, Some((n#18 = N#16))
   :- Filter (N#16 = 3)
   :  +- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
BeforeAndAfterAll.scala:187
   +- Filter (n#18 = 3)
  +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
BeforeAndAfterAll.scala:187
{code}


> Join-key Pushdown via Predicate Transitivity
> 
>
> Key: SPARK-12532
> URL: https://issues.apache.org/jira/browse/SPARK-12532
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>  Labels: SQL
>
> {code}
> "SELECT * FROM upperCaseData JOIN lowerCaseData where lowerCaseData.n = 
> upperCaseData.N and lowerCaseData.n = 3"
> {code}
> {code}
> == Analyzed Logical Plan ==
> N: int, L: string, n: int, l: string
> Project [N#16,L#17,n#18,l#19]
> +- Filter ((n#18 = N#16) && (n#18 = 3))
>+- Join Inner, None
>   :- Subquery upperCaseData
>   :  +- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
> BeforeAndAfterAll.scala:187
>   +- Subquery lowerCaseData
>  +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
> BeforeAndAfterAll.scala:187
> {code}
> Before the improvement, the optimized logical plan is
> {code}
> == Optimized Logical Plan ==
> Project [N#16,L#17,n#18,l#19]
> +- Join Inner, Some((n#18 = N#16))
>:- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
> BeforeAndAfterAll.scala:187
>+- Filter (n#18 = 3)
>   +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
> BeforeAndAfterAll.scala:187
> {code}
> After the improvement, the optimized logical plan should look like this:
> {code}
> == Optimized Logical Plan ==
> Project [N#16,L#17,n#18,l#19]
> +- Join Inner, Some((n#18 = N#16))
>:- Filter (N#16 = 3)
>:  +- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
> BeforeAndAfterAll.scala:187
>+- Filter (n#18 = 3)
>   +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
> BeforeAndAfterAll.scala:187
> {code}






[jira] [Commented] (SPARK-12529) Spark streaming: java.lang.NoSuchFieldException: SHUTDOWN_HOOK_PRIORITY

2015-12-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072197#comment-15072197
 ] 

Sean Owen commented on SPARK-12529:
---

This means you've dragged in (old) Hadoop dependencies somehow in your app, or 
in your runtime classpath. I don't think this has to do with Spark per se.
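
One quick way to check that theory (a diagnostic sketch, on the assumption that the Hadoop 2.x field Spark looks up reflectively lives on org.apache.hadoop.fs.FileSystem): print which jar provides FileSystem and whether it exposes SHUTDOWN_HOOK_PRIORITY.

{code}
// Run inside the failing application or spark-shell; an old Hadoop 1.x jar
// showing up here would explain the NoSuchFieldException.
val fsClass = Class.forName("org.apache.hadoop.fs.FileSystem")
println(fsClass.getProtectionDomain.getCodeSource.getLocation)
println(fsClass.getFields.exists(_.getName == "SHUTDOWN_HOOK_PRIORITY"))
{code}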

> Spark streaming: java.lang.NoSuchFieldException: SHUTDOWN_HOOK_PRIORITY
> ---
>
> Key: SPARK-12529
> URL: https://issues.apache.org/jira/browse/SPARK-12529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: MacOSX Standalone
>Reporter: Brad Cox
>
> Originally posted on Stack Overflow; reposted here at the request of Josh Rosen.
> I'm trying to start Spark Streaming in standalone mode (Mac OS X) and getting
> the following error no matter what:
> Exception in thread "main" java.lang.ExceptionInInitializerError at 
> org.apache.spark.storage.DiskBlockManager.addShutdownHook(DiskBlockManager.scala:147)
>  at org.apache.spark.storage.DiskBlockManager.(DiskBlockManager.scala:54) at 
> org.apache.spark.storage.BlockManager.(BlockManager.scala:75) at 
> org.apache.spark.storage.BlockManager.(BlockManager.scala:173) at 
> org.apache.spark.SparkEnv$.create(SparkEnv.scala:347) at 
> org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194) at 
> org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277) at 
> org.apache.spark.SparkContext.(SparkContext.scala:450) at 
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:566)
>  at 
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:578)
>  at org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:90) 
> at 
> org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:78)
>  at io.ascolta.pcap.PcapOfflineReceiver.main(PcapOfflineReceiver.java:103) 
> Caused by: java.lang.NoSuchFieldException: SHUTDOWN_HOOK_PRIORITY at 
> java.lang.Class.getField(Class.java:1584) at 
> org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:220)
>  at 
> org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50)
>  at 
> org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48)
>  at 
> org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:189)
>  at org.apache.spark.util.ShutdownHookManager$.(ShutdownHookManager.scala:58) 
> at org.apache.spark.util.ShutdownHookManager$.(ShutdownHookManager.scala) ... 
> 13 more
> This symptom is discussed in relation to EC2 at 
> https://forums.databricks.com/questions/2227/shutdown-hook-priority-javalangnosuchfieldexceptio.html
>  as a Hadoop2 dependency. But I'm running locally (for now), and am using the 
> spark-1.5.2-bin-hadoop2.6.tgz binary from 
> https://spark.apache.org/downloads.html which I'd hoped would eliminate this 
> possibility.
> I've pruned my code down to essentially nothing; like this:
> SparkConf conf = new SparkConf()
>   .setAppName(appName)
>   .setMaster(master);
>   JavaStreamingContext ssc = new JavaStreamingContext(conf, new 
> Duration(1000));
> I've permuted Maven dependencies to ensure all Spark artifacts are consistent at
> version 1.5.2. Yet the ssc initialization above fails no matter what, so I
> thought it was time to ask for help.
> Build environment is eclipse and maven with the shade plugin. Launch/run is 
> from eclipse debugger, not spark-submit, for now.






[jira] [Resolved] (SPARK-12518) Problem in Spark deserialization with htsjdk BAMRecordCodec

2015-12-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12518.
---
Resolution: Not A Problem

> Problem in Spark deserialization with htsjdk BAMRecordCodec
> ---
>
> Key: SPARK-12518
> URL: https://issues.apache.org/jira/browse/SPARK-12518
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 1.5.2
> Environment: Linux Red Hat 4.8.2-16, Java 8, htsjdk-1.130
>Reporter: Zhanpeng Wu
>
> When I used [htsjdk|https://github.com/samtools/htsjdk] in my Spark
> application, I found a problem in record deserialization. A *SAMRecord* object
> could not be deserialized and threw the following exception:
> {quote}
> WARN ThrowableSerializationWrapper: Task exception could not be deserialized
> java.lang.ClassNotFoundException: htsjdk.samtools.util.RuntimeIOException
> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:340)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
> at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
> at 
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
> at 
> org.apache.spark.ThrowableSerializationWrapper.readObject(TaskEndReason.scala:167)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
> at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply$mcV$sp(TaskResultGetter.scala:108)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:105)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> It seems that the application encountered a premature EOF when deserializing.
> Here is my test code: 
> {code:title=Test.java|borderStyle=solid}
> public class Test {
>   public static void main(String[] args) {
>   SparkConf sparkConf = 

[jira] [Assigned] (SPARK-12263) IllegalStateException: Memory can't be 0 for SPARK_WORKER_MEMORY without unit

2015-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12263:


Assignee: (was: Apache Spark)

> IllegalStateException: Memory can't be 0 for SPARK_WORKER_MEMORY without unit
> -
>
> Key: SPARK-12263
> URL: https://issues.apache.org/jira/browse/SPARK-12263
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> When starting a worker with the following command (note
> {{SPARK_WORKER_MEMORY=1024}}), it fails, saying that the memory was 0 while it
> was 1024 (without a size unit).
> {code}
> ➜  spark git:(master) ✗ SPARK_WORKER_MEMORY=1024 SPARK_WORKER_CORES=5 
> ./sbin/start-slave.sh spark://localhost:7077
> starting org.apache.spark.deploy.worker.Worker, logging to 
> /Users/jacek/dev/oss/spark/logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> failed to launch org.apache.spark.deploy.worker.Worker:
>   INFO ShutdownHookManager: Shutdown hook called
>   INFO ShutdownHookManager: Deleting directory 
> /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8wgn/T/spark-f4e5f222-e938-46b2-a189-241453cf1f50
> full log in 
> /Users/jacek/dev/oss/spark/logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> {code}
> The full stack trace is as follows:
> {code}
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> INFO Worker: Registered signal handlers for [TERM, HUP, INT]
> Exception in thread "main" java.lang.IllegalStateException: Memory can't be 
> 0, missing a M or G on the end of the memory specification?
> at 
> org.apache.spark.deploy.worker.WorkerArguments.checkWorkerMemory(WorkerArguments.scala:179)
> at 
> org.apache.spark.deploy.worker.WorkerArguments.(WorkerArguments.scala:64)
> at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:691)
> at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
> INFO ShutdownHookManager: Shutdown hook called
> INFO ShutdownHookManager: Deleting directory 
> /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8wgn/T/spark-f4e5f222-e938-46b2-a189-241453cf1f50
> {code}
> The following command starts a Spark standalone worker successfully:
> {code}
> SPARK_WORKER_MEMORY=1g SPARK_WORKER_CORES=5 ./sbin/start-slave.sh 
> spark://localhost:7077
> {code}
> The master reports:
> {code}
> INFO Master: Registering worker 192.168.1.6:63884 with 5 cores, 1024.0 MB RAM
> {code}






[jira] [Resolved] (SPARK-12521) DataFrame Partitions in java does not work

2015-12-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12521.
---
Resolution: Not A Problem

It's already described as an arg that controls partition stride. It wouldn't 
make sense to specify filters separately outside the WHERE clause here.
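
In other words, partitionColumn/lowerBound/upperBound/numPartitions only define how the column range is split into per-partition queries; row filtering belongs in the dbtable subquery itself. A sketch against the 1.5-era DataFrameReader API (URL, driver, and table are placeholders):

{code}
import org.apache.spark.sql.{DataFrame, SQLContext}

def loadJobs(sqlContext: SQLContext): DataFrame =
  sqlContext.read.format("jdbc").options(Map(
    "url" -> "jdbc:oracle:thin:@//dbhost:1521/service",
    "driver" -> "oracle.jdbc.OracleDriver",
    // Filter rows in the subquery; the bounds below only control partition stride.
    "dbtable" -> "(SELECT * FROM JOBS WHERE ID BETWEEN 2704225000 AND 2704226000) tt",
    "partitionColumn" -> "ID",
    "lowerBound" -> "2704225000",
    "upperBound" -> "2704226000",
    "numPartitions" -> "10"
  )).load()
{code}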

> DataFrame Partitions in java does not work
> --
>
> Key: SPARK-12521
> URL: https://issues.apache.org/jira/browse/SPARK-12521
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 1.5.2
>Reporter: Sergey Podolsky
>
> Hello,
> Partitioning does not work in the Java interface of the DataFrame:
> {code}
> SQLContext sqlContext = new SQLContext(sc);
> Map<String, String> options = new HashMap<>();
> options.put("driver", ORACLE_DRIVER);
> options.put("url", ORACLE_CONNECTION_URL);
> options.put("dbtable",
> "(SELECT * FROM JOBS WHERE ROWNUM < 1) tt");
> options.put("lowerBound", "2704225000");
> options.put("upperBound", "2704226000");
> options.put("partitionColumn", "ID");
> options.put("numPartitions", "10");
> DataFrame jdbcDF = sqlContext.load("jdbc", options);
> List<Row> jobsRows = jdbcDF.collectAsList();
> System.out.println(jobsRows.size());
> {code}
> gives  while 1000 was expected. Is it because the boundaries are big decimals, or
> does partitioning not work at all in Java?
> Thanks.
> Sergey






[jira] [Updated] (SPARK-11600) Spark MLlib 1.6 QA umbrella

2015-12-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11600:
-
Target Version/s: 1.6.1  (was: 1.6.0)

> Spark MLlib 1.6 QA umbrella
> ---
>
> Key: SPARK-11600
> URL: https://issues.apache.org/jira/browse/SPARK-11600
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next MLlib release's QA period.
> h2. API
> * Check binary API compatibility (SPARK-11601)
> * Audit new public APIs (from the generated html doc)
> ** Scala (SPARK-11602)
> ** Java compatibility (SPARK-11605)
> ** Python coverage (SPARK-11604)
> * Check Experimental, DeveloperApi tags (SPARK-11603)
> h2. Algorithms and performance
> *Performance*
> * _List any other missing performance tests from spark-perf here_
> * ALS.recommendAll (SPARK-7457)
> * perf-tests in Python (SPARK-7539)
> * perf-tests for transformers (SPARK-2838)
> * MultilayerPerceptron (SPARK-11911)
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide (SPARK-11606)
> * For major components, create JIRAs for example code (SPARK-9670)
> * Update Programming Guide for 1.6 (towards end of QA) (SPARK-11608)
> * Update website (SPARK-11607)
> * Merge duplicate content under examples/ (SPARK-11685)






[jira] [Updated] (SPARK-8447) Test external shuffle service with all shuffle managers

2015-12-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8447:

Target Version/s: 1.6.1  (was: 1.6.0)

> Test external shuffle service with all shuffle managers
> ---
>
> Key: SPARK-8447
> URL: https://issues.apache.org/jira/browse/SPARK-8447
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Tests
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Priority: Critical
>
> There is a mismatch between the shuffle managers in Spark core and in the 
> external shuffle service. The latest unsafe shuffle manager is an example of 
> this (SPARK-8430). This issue arose because we apparently do not have 
> sufficient tests for making sure that these two components deal with the same 
> set of shuffle managers.
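
For reference, a configuration sketch of the combination such a test would exercise (1.4-era property names; each shuffle manager would be run with the external service enabled):

{code}
import org.apache.spark.SparkConf

// Repeat with "sort", "hash", and "tungsten-sort" as the shuffle manager.
val conf = new SparkConf()
  .setAppName("external-shuffle-service-test")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.manager", "tungsten-sort")
{code}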






[jira] [Updated] (SPARK-11224) Flaky test: o.a.s.ExternalShuffleServiceSuite

2015-12-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11224:
-
Target Version/s: 1.6.1  (was: 1.6.0)

> Flaky test: o.a.s.ExternalShuffleServiceSuite
> -
>
> Key: SPARK-11224
> URL: https://issues.apache.org/jira/browse/SPARK-11224
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Andrew Or
>Priority: Critical
>  Labels: flaky-test
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3798/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/testReport/junit/org.apache.spark/ExternalShuffleServiceSuite/using_external_shuffle_service/






[jira] [Updated] (SPARK-11266) Peak memory tests swallow failures

2015-12-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11266:
-
Target Version/s: 1.6.1  (was: 1.6.0)

> Peak memory tests swallow failures
> --
>
> Key: SPARK-11266
> URL: https://issues.apache.org/jira/browse/SPARK-11266
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Priority: Critical
>
> You have something like the following without the tests failing:
> {code}
> 22:29:03.493 ERROR org.apache.spark.scheduler.LiveListenerBus: Listener 
> SaveInfoListener threw an exception
> org.scalatest.exceptions.TestFailedException: peak execution memory 
> accumulator not set in 'aggregation with codegen'
>   at 
> org.apache.spark.AccumulatorSuite$$anonfun$verifyPeakExecutionMemorySet$1$$anonfun$27.apply(AccumulatorSuite.scala:340)
>   at 
> org.apache.spark.AccumulatorSuite$$anonfun$verifyPeakExecutionMemorySet$1$$anonfun$27.apply(AccumulatorSuite.scala:340)
>   at scala.Option.getOrElse(Option.scala:120)
> {code}
> E.g. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1936/consoleFull






[jira] [Updated] (SPARK-11607) Update MLlib website for 1.6

2015-12-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11607:
-
Target Version/s: 1.6.1  (was: 1.6.0)

> Update MLlib website for 1.6
> 
>
> Key: SPARK-11607
> URL: https://issues.apache.org/jira/browse/SPARK-11607
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> Update MLlib's website to include features in 1.6.






[jira] [Updated] (SPARK-11603) ML 1.6 QA: API: Experimental, DeveloperApi, final, sealed audit

2015-12-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11603:
-
Target Version/s: 1.6.1  (was: 1.6.0)

> ML 1.6 QA: API: Experimental, DeveloperApi, final, sealed audit
> ---
>
> Key: SPARK-11603
> URL: https://issues.apache.org/jira/browse/SPARK-11603
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.  This will 
> probably not include the Pipeline APIs yet since some parts (e.g., feature 
> attributes) are still under flux.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.






[jira] [Updated] (SPARK-10680) Flaky test: network.RequestTimeoutIntegrationSuite.timeoutInactiveRequests

2015-12-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10680:
-
Target Version/s: 1.6.1  (was: 1.6.0)

> Flaky test: network.RequestTimeoutIntegrationSuite.timeoutInactiveRequests
> --
>
> Key: SPARK-10680
> URL: https://issues.apache.org/jira/browse/SPARK-10680
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Reporter: Xiangrui Meng
>Assignee: Josh Rosen
>Priority: Critical
>  Labels: flaky-test
>
> Saw several failures recently.
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.3,label=spark-test/3560/testReport/junit/org.apache.spark.network/RequestTimeoutIntegrationSuite/timeoutInactiveRequests/
> {code}
> org.apache.spark.network.RequestTimeoutIntegrationSuite.timeoutInactiveRequests
> Failing for the past 1 build (Since Failed#3560 )
> Took 6 sec.
> Stacktrace
> java.lang.NullPointerException: null
>   at 
> org.apache.spark.network.RequestTimeoutIntegrationSuite.timeoutInactiveRequests(RequestTimeoutIntegrationSuite.java:115)
> {code}






[jira] [Updated] (SPARK-12507) Update Streaming configurations for 1.6

2015-12-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12507:
-
Target Version/s: 1.6.1  (was: 1.6.0)

> Update Streaming configurations for 1.6
> ---
>
> Key: SPARK-12507
> URL: https://issues.apache.org/jira/browse/SPARK-12507
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-12534) Document missing command line options to Spark properties mapping

2015-12-27 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-12534:


 Summary: Document missing command line options to Spark properties 
mapping
 Key: SPARK-12534
 URL: https://issues.apache.org/jira/browse/SPARK-12534
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Documentation, YARN
Affects Versions: 1.5.2
Reporter: Felix Cheung
Priority: Minor


The documentation is missing the mapping from several spark-submit command line
options to their equivalent Spark properties.

{quote}
The equivalent for spark-submit --num-executors should be 
spark.executor.instances
When use in SparkConf?
http://spark.apache.org/docs/latest/running-on-yarn.html

Could you try setting that with sparkR.init()?


_
From: Franc Carter 
Sent: Friday, December 25, 2015 9:23 PM
Subject: number of executors in sparkR.init()
To: 



Hi,

I'm having trouble working out how to get the number of executors set when 
using sparkR.init().

If I start sparkR with

  sparkR  --master yarn --num-executors 6 

then I get 6 executors

However, if I start sparkR with

  sparkR 

followed by

  sc <- sparkR.init(master="yarn-client",   
sparkEnvir=list(spark.num.executors='6'))

then I only get 2 executors.

Can anyone point me in the direction of what I might be doing wrong? I need to
initialise it this way so that RStudio can hook into SparkR.

thanks

-- 
Franc
{quote}
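
For the record, a sketch of the flag-to-property mapping in question, in Scala rather than SparkR since the property name is the same either way: on YARN, {{--num-executors 6}} corresponds to the {{spark.executor.instances}} property.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Programmatic equivalent of `spark-submit --master yarn-client --num-executors 6`
// (requires a YARN environment; shown only to illustrate the property name).
val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("num-executors-example")
  .set("spark.executor.instances", "6")
val sc = new SparkContext(conf)
{code}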






[jira] [Commented] (SPARK-4924) Factor out code to launch Spark applications into a separate library

2015-12-27 Thread Jiahongchao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072396#comment-15072396
 ] 

Jiahongchao commented on SPARK-4924:


Where is the official documentation?

> Factor out code to launch Spark applications into a separate library
> 
>
> Key: SPARK-4924
> URL: https://issues.apache.org/jira/browse/SPARK-4924
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.4.0
>
> Attachments: spark-launcher.txt
>
>
> One of the questions we run into rather commonly is "how to start a Spark 
> application from my Java/Scala program?". There currently isn't a good answer 
> to that:
> - Instantiating SparkContext has limitations (e.g., you can only have one 
> active context at the moment, plus you lose the ability to submit apps in 
> cluster mode)
> - Calling SparkSubmit directly is doable but you lose a lot of the logic 
> handled by the shell scripts
> - Calling the shell script directly is doable,  but sort of ugly from an API 
> point of view.
> I think it would be nice to have a small library that handles that for users. 
> On top of that, this library could be used by Spark itself to replace a lot 
> of the code in the current shell scripts, which have a lot of duplication.
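
Regarding the question above: the library this issue produced ships as the spark-launcher module ({{org.apache.spark.launcher.SparkLauncher}}, available since 1.4). A minimal usage sketch with placeholder paths and class names:

{code}
import org.apache.spark.launcher.SparkLauncher

val process = new SparkLauncher()
  .setAppResource("/path/to/my-app.jar")   // placeholder
  .setMainClass("com.example.MyApp")       // placeholder
  .setMaster("local[*]")
  .setConf("spark.executor.memory", "1g")
  .launch()                                // returns a java.lang.Process
process.waitFor()
{code}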






[jira] [Assigned] (SPARK-12534) Document missing command line options to Spark properties mapping

2015-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12534:


Assignee: Apache Spark

> Document missing command line options to Spark properties mapping
> -
>
> Key: SPARK-12534
> URL: https://issues.apache.org/jira/browse/SPARK-12534
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Documentation, YARN
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
>
> The documentation is missing the mapping from several spark-submit command line
> options to their equivalent Spark properties.
> {quote}
> The equivalent for spark-submit --num-executors should be 
> spark.executor.instances
> When use in SparkConf?
> http://spark.apache.org/docs/latest/running-on-yarn.html
> Could you try setting that with sparkR.init()?
> _
> From: Franc Carter 
> Sent: Friday, December 25, 2015 9:23 PM
> Subject: number of executors in sparkR.init()
> To: 
> Hi,
> I'm having trouble working out how to get the number of executors set when 
> using sparkR.init().
> If I start sparkR with
>   sparkR  --master yarn --num-executors 6 
> then I get 6 executors
> However, if I start sparkR with
>   sparkR 
> followed by
>   sc <- sparkR.init(master="yarn-client",   
> sparkEnvir=list(spark.num.executors='6'))
> then I only get 2 executors.
> Can anyone point me in the direction of what I might be doing wrong? I need to
> initialise it this way so that RStudio can hook into SparkR.
> thanks
> -- 
> Franc
> {quote}






[jira] [Assigned] (SPARK-12513) SocketReceiver hang in Netcat example

2015-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12513:


Assignee: Apache Spark

> SocketReceiver hang in Netcat example
> -
>
> Key: SPARK-12513
> URL: https://issues.apache.org/jira/browse/SPARK-12513
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Shawn Guo
>Assignee: Apache Spark
>Priority: Minor
>
> I added a SocketReceiver test based on the NetworkWordCount example.
> Using a pipeline, I tail the continuous output into netcat:
> tail -f xxx.log | nc -lk 
> and create a SocketReceiver to receive the continuous output from the remote
> netcat.
> After about 10 hours, the SocketReceiver hangs and cannot receive any more data.
> Netcat only accepts one socket connection and pushes "tail -f xxx.log" to the
> connected socket; other connections wait in the netcat queue.
> When the SocketReceiver is restarted, a new socket connection is created to connect
> to netcat. However, the old connection is not closed properly, and the new connection
> cannot read anything from netcat.
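
For reference, a minimal sketch of the receiver side of this setup (the port is a placeholder standing in for the elided {{nc -lk}} port above):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Counterpart of `tail -f xxx.log | nc -lk <port>` on the netcat side.
val conf = new SparkConf().setAppName("socket-receiver-example").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.socketTextStream("localhost", 9999).count().print()
ssc.start()
ssc.awaitTermination()
{code}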






[jira] [Assigned] (SPARK-12513) SocketReceiver hang in Netcat example

2015-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12513:


Assignee: (was: Apache Spark)

> SocketReceiver hang in Netcat example
> -
>
> Key: SPARK-12513
> URL: https://issues.apache.org/jira/browse/SPARK-12513
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Shawn Guo
>Priority: Minor
>
> I added a SocketReceiver test based on the NetworkWordCount example.
> Using a pipeline, I tail the continuous output into netcat:
> tail -f xxx.log | nc -lk 
> and create a SocketReceiver to receive the continuous output from the remote
> netcat.
> After about 10 hours, the SocketReceiver hangs and cannot receive any more data.
> Netcat only accepts one socket connection and pushes "tail -f xxx.log" to the
> connected socket; other connections wait in the netcat queue.
> When the SocketReceiver is restarted, a new socket connection is created to connect
> to netcat. However, the old connection is not closed properly, and the new connection
> cannot read anything from netcat.






[jira] [Assigned] (SPARK-12461) Add ExpressionDescription to math functions

2015-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12461:


Assignee: Apache Spark

> Add ExpressionDescription to math functions
> ---
>
> Key: SPARK-12461
> URL: https://issues.apache.org/jira/browse/SPARK-12461
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12532) Join-key Pushdown via Predicate Transitivity

2015-12-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12532:
-
Shepherd: Michael Armbrust

> Join-key Pushdown via Predicate Transitivity
> 
>
> Key: SPARK-12532
> URL: https://issues.apache.org/jira/browse/SPARK-12532
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>  Labels: SQL
>
> {code}
> "SELECT * FROM upperCaseData JOIN lowerCaseData where lowerCaseData.n = 
> upperCaseData.N and lowerCaseData.n = 3"
> {code}
> {code}
> == Analyzed Logical Plan ==
> N: int, L: string, n: int, l: string
> Project [N#16,L#17,n#18,l#19]
> +- Filter ((n#18 = N#16) && (n#18 = 3))
>+- Join Inner, None
>   :- Subquery upperCaseData
>   :  +- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
> BeforeAndAfterAll.scala:187
>   +- Subquery lowerCaseData
>  +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
> BeforeAndAfterAll.scala:187
> {code}
> Before the improvement, the optimized logical plan is
> {code}
> == Optimized Logical Plan ==
> Project [N#16,L#17,n#18,l#19]
> +- Join Inner, Some((n#18 = N#16))
>:- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
> BeforeAndAfterAll.scala:187
>+- Filter (n#18 = 3)
>   +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
> BeforeAndAfterAll.scala:187
> {code}
> After the improvement, the optimized logical plan should be like
> {code}
> == Optimized Logical Plan ==
> Project [N#16,L#17,n#18,l#19]
> +- Join Inner, Some((n#18 = N#16))
>:- Filter (N#16 = 3)
>:  +- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
> BeforeAndAfterAll.scala:187
>+- Filter (n#18 = 3)
>   +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
> BeforeAndAfterAll.scala:187
> {code}
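
In other words, the inferred filter corresponds to rewriting the original query with the constant applied to both join keys, e.g. (hypothetical equivalent SQL, not part of the report above):

{code}
SELECT *
FROM upperCaseData
JOIN lowerCaseData
  ON lowerCaseData.n = upperCaseData.N
WHERE lowerCaseData.n = 3
  AND upperCaseData.N = 3
{code}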



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12532) Join-key Pushdown via Predicate Transitivity

2015-12-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12532:
-
Target Version/s: 2.0.0

> Join-key Pushdown via Predicate Transitivity
> 
>
> Key: SPARK-12532
> URL: https://issues.apache.org/jira/browse/SPARK-12532
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>  Labels: SQL
>
> {code}
> "SELECT * FROM upperCaseData JOIN lowerCaseData where lowerCaseData.n = 
> upperCaseData.N and lowerCaseData.n = 3"
> {code}
> {code}
> == Analyzed Logical Plan ==
> N: int, L: string, n: int, l: string
> Project [N#16,L#17,n#18,l#19]
> +- Filter ((n#18 = N#16) && (n#18 = 3))
>+- Join Inner, None
>   :- Subquery upperCaseData
>   :  +- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
> BeforeAndAfterAll.scala:187
>   +- Subquery lowerCaseData
>  +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
> BeforeAndAfterAll.scala:187
> {code}
> Before the improvement, the optimized logical plan is
> {code}
> == Optimized Logical Plan ==
> Project [N#16,L#17,n#18,l#19]
> +- Join Inner, Some((n#18 = N#16))
>:- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
> BeforeAndAfterAll.scala:187
>+- Filter (n#18 = 3)
>   +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
> BeforeAndAfterAll.scala:187
> {code}
> After the improvement, the optimized logical plan should be like
> {code}
> == Optimized Logical Plan ==
> Project [N#16,L#17,n#18,l#19]
> +- Join Inner, Some((n#18 = N#16))
>:- Filter (N#16 = 3)
>:  +- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
> BeforeAndAfterAll.scala:187
>+- Filter (n#18 = 3)
>   +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
> BeforeAndAfterAll.scala:187
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12532) Join-key Pushdown via Predicate Transitivity

2015-12-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12532:
-
Assignee: Xiao Li

> Join-key Pushdown via Predicate Transitivity
> 
>
> Key: SPARK-12532
> URL: https://issues.apache.org/jira/browse/SPARK-12532
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>  Labels: SQL
>
> {code}
> "SELECT * FROM upperCaseData JOIN lowerCaseData where lowerCaseData.n = 
> upperCaseData.N and lowerCaseData.n = 3"
> {code}
> {code}
> == Analyzed Logical Plan ==
> N: int, L: string, n: int, l: string
> Project [N#16,L#17,n#18,l#19]
> +- Filter ((n#18 = N#16) && (n#18 = 3))
>+- Join Inner, None
>   :- Subquery upperCaseData
>   :  +- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
> BeforeAndAfterAll.scala:187
>   +- Subquery lowerCaseData
>  +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
> BeforeAndAfterAll.scala:187
> {code}
> Before the improvement, the optimized logical plan is
> {code}
> == Optimized Logical Plan ==
> Project [N#16,L#17,n#18,l#19]
> +- Join Inner, Some((n#18 = N#16))
>:- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
> BeforeAndAfterAll.scala:187
>+- Filter (n#18 = 3)
>   +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
> BeforeAndAfterAll.scala:187
> {code}
> After the improvement, the optimized logical plan should be like
> {code}
> == Optimized Logical Plan ==
> Project [N#16,L#17,n#18,l#19]
> +- Join Inner, Some((n#18 = N#16))
>:- Filter (N#16 = 3)
>:  +- LogicalRDD [N#16,L#17], MapPartitionsRDD[17] at beforeAll at 
> BeforeAndAfterAll.scala:187
>+- Filter (n#18 = 3)
>   +- LogicalRDD [n#18,l#19], MapPartitionsRDD[19] at beforeAll at 
> BeforeAndAfterAll.scala:187
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version

2015-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072435#comment-15072435
 ] 

Apache Spark commented on SPARK-12453:
--

User 'BrianLondon' has created a pull request for this issue:
https://github.com/apache/spark/pull/10492

> Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
> 
>
> Key: SPARK-12453
> URL: https://issues.apache.org/jira/browse/SPARK-12453
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Martin Schade
>Priority: Critical
>  Labels: easyfix
>
> The Spark Streaming Kinesis Example (kinesis-asl) is broken due to the wrong 
> AWS Java SDK version (1.9.16) being referenced alongside AWS KCL version 1.3.0.
> AWS KCL 1.3.0 references AWS Java SDK version 1.9.37.
> Using 1.9.16 in combination with 1.3.0 fails to get data out of the stream.
> I tested Spark Streaming with 1.9.37 and it works fine. 
> Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also 
> fails, so the problem is due to the specific versions used in 1.5.2 and not to 
> the Spark-related implementation.
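
For anyone pinning the versions manually until a fix lands, a minimal sbt sketch of the alignment described above (the Maven coordinates are assumed to be the standard AWS artifacts):

{code}
// Hypothetical dependency pinning; versions taken from the report above.
libraryDependencies ++= Seq(
  "com.amazonaws" % "amazon-kinesis-client" % "1.3.0",
  "com.amazonaws" % "aws-java-sdk" % "1.9.37"
)
{code}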



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12505) Pushdown a Limit on top of an Outer-Join

2015-12-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12505:
-
Target Version/s: 2.0.0

> Pushdown a Limit on top of an Outer-Join
> 
>
> Key: SPARK-12505
> URL: https://issues.apache.org/jira/browse/SPARK-12505
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> "Rule that applies to a Limit on top of an OUTER Join. The original Limit 
> won't go away after applying this rule, but additional Limit node(s) will be 
> created on top of the outer-side child (or children if it's a FULL OUTER 
> Join). "
> – from https://issues.apache.org/jira/browse/CALCITE-832
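
A hypothetical illustration of the rule on a LEFT OUTER join (table and column names are placeholders): because every left-side row yields at least one output row, limiting the outer-side (left) child to the same number of rows cannot change the result.

{code}
-- Original query:
SELECT * FROM big_left LEFT OUTER JOIN small_right ON big_left.k = small_right.k LIMIT 10

-- Sketch of the plan after the rule: the original Limit stays on top and an
-- additional Limit is added on the outer-side (left) child.
-- Limit 10
-- +- Join LeftOuter, (k = k)
--    :- Limit 10
--    :  +- big_left
--    +- small_right
{code}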



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12531) Add median and mode to Summary statistics

2015-12-27 Thread Gaurav Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072386#comment-15072386
 ] 

Gaurav Kumar commented on SPARK-12531:
--

[~srowen], I agree these are not exactly difficult to implement, but I guess 
these should be there in the library for the sake of completeness. For 
instance, while doing EDA on the data, one would use the 
{{Statistics.colStats(observations)}} and would expect to get all the required 
summary data similar to what {{R}}'s {{summary(dataset)}} does.
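
For reference, a minimal sketch of the API surface being discussed (MLlib's column statistics; the input data is made up for illustration):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)))

val summary = Statistics.colStats(observations)
// Today this exposes mean, variance, count, min, max, etc., but not median or mode.
println(summary.mean)
println(summary.variance)
{code}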

> Add median and mode to Summary statistics
> -
>
> Key: SPARK-12531
> URL: https://issues.apache.org/jira/browse/SPARK-12531
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Gaurav Kumar
>Priority: Minor
>
> Summary statistics should also include calculating median and mode in 
> addition to mean, variance and others.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12533) hiveContext.table() throws the wrong exception

2015-12-27 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-12533:


 Summary: hiveContext.table() throws the wrong exception
 Key: SPARK-12533
 URL: https://issues.apache.org/jira/browse/SPARK-12533
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Michael Armbrust


This should throw an {{AnalysisException}} that includes the table name instead 
of the following:

{code}
org.apache.spark.sql.catalyst.analysis.NoSuchTableException
at 
org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
at 
org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:122)
at 
org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:384)
at 
org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:458)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161)
at 
org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:458)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:830)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:826)
{code}
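
A minimal sketch of how this surfaces (the table name is a placeholder; the expectation is that the exception message would include it):

{code}
// Hypothetical reproduction: looking up a table that does not exist.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.table("no_such_table")
// Observed: the bare NoSuchTableException above, which does not name the table.
// Expected: an AnalysisException such as "Table not found: no_such_table".
{code}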



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11559) Make `runs` no effect in k-means

2015-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11559:


Assignee: Apache Spark

> Make `runs` no effect in k-means
> 
>
> Key: SPARK-11559
> URL: https://issues.apache.org/jira/browse/SPARK-11559
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> We deprecated `runs` in Spark 1.6 (SPARK-11358). In 1.7.0, we can either 
> remove `runs` or make it have no effect (with warning messages), so that we 
> can simplify the implementation. I prefer the latter for better binary 
> compatibility.
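
A minimal sketch of what the "no effect, with a warning" option could look like (the body and message are assumptions, not the actual change):

{code}
// Hypothetical shape of the setter once `runs` is ignored; logWarning is
// assumed to be available via the Logging trait mixed into the class.
@deprecated("This has no effect and will be removed in a future release.", "1.6.0")
def setRuns(runs: Int): this.type = {
  logWarning("Setting the number of runs has no effect since Spark 1.6.0.")
  this
}
{code}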



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11560) Optimize KMeans implementation

2015-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072403#comment-15072403
 ] 

Apache Spark commented on SPARK-11560:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10306

> Optimize KMeans implementation
> --
>
> Key: SPARK-11560
> URL: https://issues.apache.org/jira/browse/SPARK-11560
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>
> After we dropped `runs`, we can simplify and optimize the k-means 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11559) Make `runs` no effect in k-means

2015-12-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11559:


Assignee: (was: Apache Spark)

> Make `runs` no effect in k-means
> 
>
> Key: SPARK-11559
> URL: https://issues.apache.org/jira/browse/SPARK-11559
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>
> We deprecated `runs` in Spark 1.6 (SPARK-11358). In 1.7.0, we can either 
> remove `runs` or make it have no effect (with warning messages), so that we 
> can simplify the implementation. I prefer the latter for better binary 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns

2015-12-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072254#comment-15072254
 ] 

Apache Spark commented on SPARK-12363:
--

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/10420

> PowerIterationClustering test case failed if we deprecated KMeans.setRuns
> -
>
> Key: SPARK-12363
> URL: https://issues.apache.org/jira/browse/SPARK-12363
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> We plan to deprecate `runs` in KMeans; PowerIterationClustering leverages 
> KMeans to train its model.
> I removed the `setRuns` call used in PowerIterationClustering, but one of the 
> test cases failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12533) hiveContext.table() throws the wrong exception

2015-12-27 Thread Thomas Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072447#comment-15072447
 ] 

Thomas Sebastian commented on SPARK-12533:
--

Hi Michael, can you describe the scenario needed to replicate this exception, 
e.g. any specific commands you are running? I shall work on a fix for this.

> hiveContext.table() throws the wrong exception
> --
>
> Key: SPARK-12533
> URL: https://issues.apache.org/jira/browse/SPARK-12533
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> This should throw an {{AnalysisException}} that includes the table name 
> instead of the following:
> {code}
> org.apache.spark.sql.catalyst.analysis.NoSuchTableException
>   at 
> org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
>   at 
> org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:122)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:384)
>   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:458)
>   at 
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161)
>   at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:458)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:830)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:826)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12531) Add median and mode to Summary statistics

2015-12-27 Thread Gaurav Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072386#comment-15072386
 ] 

Gaurav Kumar edited comment on SPARK-12531 at 12/28/15 3:36 AM:


[~srowen], I agree these are not exactly difficult to implement, but I guess 
these should be there in the library for the sake of completeness. For 
instance, while doing EDA on the data, one would use the 
{{Statistics.colStats(observations)}} and would expect to get all the required 
summary data similar to what {{R}}'s {{summary(dataset)}} does. For the same 
reasons, while we are at it, we should also add 25th and 75th percentiles as 
well.


was (Author: gauravkumar37):
[~srowen], I agree these are not exactly difficult to implement, but I guess 
these should be there in the library for the sake of completeness. For 
instance, while doing EDA on the data, one would use the 
{{Statistics.colStats(observations)}} and would expect to get all the required 
summary data similar to what {{R}}'s {{summary(dataset)}} does.

> Add median and mode to Summary statistics
> -
>
> Key: SPARK-12531
> URL: https://issues.apache.org/jira/browse/SPARK-12531
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Gaurav Kumar
>Priority: Minor
>
> Summary statistics should also include calculating median and mode in 
> addition to mean, variance and others.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12239) SparkR - Not distributing SparkR module in YARN

2015-12-27 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072402#comment-15072402
 ] 

Sun Rui commented on SPARK-12239:
-

To have a formal fix for this issue, there are two ways:
1. Similar to https://github.com/apache/spark/pull/9290 for SPARK-11340. That 
is, if "yarn-client" is detected as the master, then insert "--master yarn-client" 
into SPARKR_SUBMIT_ARGS;
2. A more generic way is to standardize the SPARKR_SUBMIT_ARGS env var and 
document that the command-line arguments intended for spark-submit should be 
assigned to SPARKR_SUBMIT_ARGS in order to launch SparkR from RStudio. A sketch 
of this option is shown below.
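
A minimal sketch of option 2 from RStudio (the trailing "sparkr-shell" token and the exact argument handling are assumptions):

{code}
# Hypothetical RStudio setup: pass the spark-submit arguments via
# SPARKR_SUBMIT_ARGS before initializing SparkR.
Sys.setenv(SPARKR_SUBMIT_ARGS = "--master yarn-client --num-executors 6 sparkr-shell")
library(SparkR, lib.loc = "/opt/apps/spark/R/lib/")
sc <- sparkR.init(appName = "SparkR")
{code}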

> SparkR - Not distributing SparkR module in YARN
> 
>
> Key: SPARK-12239
> URL: https://issues.apache.org/jira/browse/SPARK-12239
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, YARN
>Affects Versions: 1.5.2, 1.5.3
>Reporter: Sebastian YEPES FERNANDEZ
>Priority: Critical
>
> Hello,
> I am trying to use SparkR in a YARN environment and I have encountered 
> the following problem:
> Everything works correctly when using bin/sparkR, but if I try running the 
> same jobs using SparkR directly through R it does not work.
> I have managed to track down what is causing the problem: when SparkR is 
> launched through R, the "SparkR" module is not distributed to the worker nodes.
> I have tried working around this issue using the setting 
> "spark.yarn.dist.archives", but it does not work because it deploys the 
> file/extracted folder with the extension ".zip", while the workers are actually 
> looking for a folder with the name "sparkr".
> Is there currently any way to make this work?
> {code}
> # spark-defaults.conf
> spark.yarn.dist.archives /opt/apps/spark/R/lib/sparkr.zip
> # R
> library(SparkR, lib.loc="/opt/apps/spark/R/lib/")
> sc <- sparkR.init(appName="SparkR", master="yarn-client", 
> sparkEnvir=list(spark.executor.instances="1"))
> sqlContext <- sparkRSQL.init(sc)
> df <- createDataFrame(sqlContext, faithful)
> head(df)
> 15/12/09 09:04:24 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> fr-s-cour-wrk3.alidaho.com): java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
> {code}
> Container stderr:
> {code}
> 15/12/09 09:04:14 INFO storage.MemoryStore: Block broadcast_1 stored as 
> values in memory (estimated size 8.7 KB, free 530.0 MB)
> 15/12/09 09:04:14 INFO r.BufferedStreamThread: Fatal error: cannot open file 
> '/hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02/sparkr/SparkR/worker/daemon.R':
>  No such file or directory
> 15/12/09 09:04:24 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.net.SocketTimeoutException: Accept timed out
>   at java.net.PlainSocketImpl.socketAccept(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
>   at java.net.ServerSocket.implAccept(ServerSocket.java:545)
>   at java.net.ServerSocket.accept(ServerSocket.java:513)
>   at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:426)
> {code}
> Worker Node that runned the Container:
> {code}
> # ls -la 
> /hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02
> total 71M
> drwx--x--- 3 yarn hadoop 4.0K Dec  9 09:04 .
> drwx--x--- 7 yarn hadoop 4.0K Dec  9 09:04 ..
> -rw-r--r-- 1 yarn hadoop  110 Dec  9 09:03 container_tokens
> -rw-r--r-- 1 yarn hadoop   12 Dec  9 09:03 .container_tokens.crc
> -rwx-- 1 yarn hadoop  736 Dec  9 09:03 
> default_container_executor_session.sh
> -rw-r--r-- 1 yarn hadoop   16 Dec  9 09:03 
> .default_container_executor_session.sh.crc
> -rwx-- 1 yarn hadoop  790 Dec  9 09:03 default_container_executor.sh
> -rw-r--r-- 1 yarn hadoop   16 Dec  9 09:03 .default_container_executor.sh.crc
> -rwxr-xr-x 1 yarn hadoop  61K Dec  9 09:04 hadoop-lzo-0.6.0.2.3.2.0-2950.jar
> -rwxr-xr-x 1 yarn hadoop 317K Dec  9 09:04 kafka-clients-0.8.2.2.jar
> -rwx-- 1 yarn hadoop 6.0K Dec  9 09:03 launch_container.sh
> -rw-r--r-- 1 yarn hadoop   56 Dec  9 09:03 .launch_container.sh.crc
> -rwxr-xr-x 1 yarn hadoop 2.2M Dec  9 09:04 
> spark-cassandra-connector_2.10-1.5.0-M3.jar
> -rwxr-xr-x 1 yarn hadoop 7.1M Dec  9 09:04 spark-csv-assembly-1.3.0.jar
> lrwxrwxrwx 1 yarn hadoop  119 Dec  9 09:03 __spark__.jar -> 
> /hadoop/hdfs/disk03/hadoop/yarn/local/usercache/spark/filecache/361/spark-assembly-1.5.3-SNAPSHOT-hadoop2.7.1.jar
> lrwxrwxrwx 1 yarn hadoop   84 Dec  9 09:03 sparkr.zip -> 
> /hadoop/hdfs/disk01/hadoop/yarn/local/usercache/spark/filecache/359/sparkr.zip

[jira] [Resolved] (SPARK-12520) Python API dataframe join returns wrong results on outer join

2015-12-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12520.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10477
[https://github.com/apache/spark/pull/10477]

> Python API dataframe join returns wrong results on outer join
> -
>
> Key: SPARK-12520
> URL: https://issues.apache.org/jira/browse/SPARK-12520
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Aravind  B
> Fix For: 2.0.0
>
>
> Consider the following dataframes:
> """
> left_table:
> +------------+------------+------+--------------+
> |head_id_left|tail_id_left|weight|joining_column|
> +------------+------------+------+--------------+
> |           1|           2|     1|           1~2|
> +------------+------------+------+--------------+
> right_table:
> +-------------+-------------+--------------+
> |head_id_right|tail_id_right|joining_column|
> +-------------+-------------+--------------+
> +-------------+-------------+--------------+
> """
> The following code returns an empty dataframe:
> """
> joined_table = left_table.join(right_table, "joining_column", "outer")
> """
> joined_table has zero rows. 
> However:
> """
> joined_table = left_table.join(right_table, left_table.joining_column == 
> right_table.joining_column, "outer")
> """
> returns the correct answer with one row.
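
A hedged PySpark sketch of the report (schemas reconstructed from the tables above; the empty right-hand dataframe is built from an empty RDD as an assumption about equivalent setup):

{code}
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

left_table = sqlContext.createDataFrame(
    [(1, 2, 1, "1~2")],
    ["head_id_left", "tail_id_left", "weight", "joining_column"])

right_schema = StructType([
    StructField("head_id_right", IntegerType()),
    StructField("tail_id_right", IntegerType()),
    StructField("joining_column", StringType())])
right_table = sqlContext.createDataFrame(sc.emptyRDD(), right_schema)

# Reported to return zero rows:
print(left_table.join(right_table, "joining_column", "outer").count())

# Reported to return the expected single row:
print(left_table.join(
    right_table,
    left_table.joining_column == right_table.joining_column,
    "outer").count())
{code}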



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12520) Python API dataframe join returns wrong results on outer join

2015-12-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12520:
---
Assignee: Xiao Li

> Python API dataframe join returns wrong results on outer join
> -
>
> Key: SPARK-12520
> URL: https://issues.apache.org/jira/browse/SPARK-12520
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Aravind  B
>Assignee: Xiao Li
> Fix For: 1.6.0, 2.0.0
>
>
> Consider the following dataframes:
> """
> left_table:
> +------------+------------+------+--------------+
> |head_id_left|tail_id_left|weight|joining_column|
> +------------+------------+------+--------------+
> |           1|           2|     1|           1~2|
> +------------+------------+------+--------------+
> right_table:
> +-------------+-------------+--------------+
> |head_id_right|tail_id_right|joining_column|
> +-------------+-------------+--------------+
> +-------------+-------------+--------------+
> """
> The following code returns an empty dataframe:
> """
> joined_table = left_table.join(right_table, "joining_column", "outer")
> """
> joined_table has zero rows. 
> However:
> """
> joined_table = left_table.join(right_table, left_table.joining_column == 
> right_table.joining_column, "outer")
> """
> returns the correct answer with one row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12520) Python API dataframe join returns wrong results on outer join

2015-12-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12520:
---
Fix Version/s: 1.6.0

> Python API dataframe join returns wrong results on outer join
> -
>
> Key: SPARK-12520
> URL: https://issues.apache.org/jira/browse/SPARK-12520
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Aravind  B
> Fix For: 1.6.0, 2.0.0
>
>
> Consider the following dataframes:
> """
> left_table:
> +------------+------------+------+--------------+
> |head_id_left|tail_id_left|weight|joining_column|
> +------------+------------+------+--------------+
> |           1|           2|     1|           1~2|
> +------------+------------+------+--------------+
> right_table:
> +-------------+-------------+--------------+
> |head_id_right|tail_id_right|joining_column|
> +-------------+-------------+--------------+
> +-------------+-------------+--------------+
> """
> The following code returns an empty dataframe:
> """
> joined_table = left_table.join(right_table, "joining_column", "outer")
> """
> joined_table has zero rows. 
> However:
> """
> joined_table = left_table.join(right_table, left_table.joining_column == 
> right_table.joining_column, "outer")
> """
> returns the correct answer with one row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12520) Python API dataframe join returns wrong results on outer join

2015-12-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12520:
---
Fix Version/s: 1.5.3

> Python API dataframe join returns wrong results on outer join
> -
>
> Key: SPARK-12520
> URL: https://issues.apache.org/jira/browse/SPARK-12520
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Aravind  B
>Assignee: Xiao Li
> Fix For: 1.5.3, 1.6.0, 2.0.0
>
>
> Consider the following dataframes:
> """
> left_table:
> +------------+------------+------+--------------+
> |head_id_left|tail_id_left|weight|joining_column|
> +------------+------------+------+--------------+
> |           1|           2|     1|           1~2|
> +------------+------------+------+--------------+
> right_table:
> +-------------+-------------+--------------+
> |head_id_right|tail_id_right|joining_column|
> +-------------+-------------+--------------+
> +-------------+-------------+--------------+
> """
> The following code returns an empty dataframe:
> """
> joined_table = left_table.join(right_table, "joining_column", "outer")
> """
> joined_table has zero rows. 
> However:
> """
> joined_table = left_table.join(right_table, left_table.joining_column == 
> right_table.joining_column, "outer")
> """
> returns the correct answer with one row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12535) Generating scaladoc using sbt fails for network-common and catalyst modules

2015-12-27 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-12535:
---

 Summary: Generating scaladoc using sbt fails for network-common 
and catalyst modules
 Key: SPARK-12535
 URL: https://issues.apache.org/jira/browse/SPARK-12535
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Jacek Laskowski
Priority: Blocker


Executing {{./build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 
-Phive -Phive-thriftserver -DskipTests network-common/compile:doc 
catalyst/compile:doc}} fails with scaladoc errors (the command was narrowed to 
the modules that failed - I initially used {{clean publishLocal}}).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12535) Generating scaladoc using sbt fails for network-common and catalyst modules

2015-12-27 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072501#comment-15072501
 ] 

Jacek Laskowski commented on SPARK-12535:
-

I fixed the others, but this one has no solution yet:

{code}
[error] 
/Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala:61:
 annotation argument needs to be a constant; found: "_FUNC_(input, bitLength) - 
Returns a checksum of SHA-2 family as a hex string of the ".+("input. SHA-224, 
SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent 
").+("to 256")
[error] "input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit 
length of 0 is equivalent " +
[error] 
  ^
{code}

> Generating scaladoc using sbt fails for network-common and catalyst modules
> ---
>
> Key: SPARK-12535
> URL: https://issues.apache.org/jira/browse/SPARK-12535
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Blocker
>
> Executing {{./build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 
> -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests 
> network-common/compile:doc catalyst/compile:doc}} fails with scaladoc errors 
> (the command was narrowed to the modules that failed - I initially used 
> {{clean publishLocal}}).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12535) Generating scaladoc using sbt fails for network-common and catalyst modules

2015-12-27 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072525#comment-15072525
 ] 

Herman van Hovell commented on SPARK-12535:
---

This is caused by the same problem as SPARK-12530.

> Generating scaladoc using sbt fails for network-common and catalyst modules
> ---
>
> Key: SPARK-12535
> URL: https://issues.apache.org/jira/browse/SPARK-12535
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Blocker
>
> Executing {{./build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 
> -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests 
> network-common/compile:doc catalyst/compile:doc}} fails with scaladoc errors 
> (the command was narrowed to the modules that failed - I initially used 
> {{clean publishLocal}}).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12531) Add median and mode to Summary statistics

2015-12-27 Thread Gaurav Kumar (JIRA)
Gaurav Kumar created SPARK-12531:


 Summary: Add median and mode to Summary statistics
 Key: SPARK-12531
 URL: https://issues.apache.org/jira/browse/SPARK-12531
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.2
Reporter: Gaurav Kumar
Priority: Minor


Summary statistics should also include calculating median and mode in addition 
to mean, variance and others.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org