[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791708#comment-14791708
 ] 

Reynold Xin commented on SPARK-10474:
-

It seems like the problem is that although we reserve a page, we don't reserve 
memory for the pointer array?
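A conceptual sketch of that hypothesis (illustrative names only, not the actual
{{UnsafeExternalSorter}} code): reserving the data page and the pointer array as
a single request surfaces the shortfall before any rows are buffered, rather
than later inside {{spill()}}.
{code}
// Illustrative only. If the page and the pointer array are reserved separately,
// the second request can fail after memory has already been handed out;
// reserving them together is all-or-nothing.
def reserveSeparately(tryAcquire: Long => Boolean, page: Long, ptrArray: Long): Boolean =
  tryAcquire(page) && tryAcquire(ptrArray) // may fail on the second call

def reserveTogether(tryAcquire: Long => Boolean, page: Long, ptrArray: Long): Boolean =
  tryAcquire(page + ptrArray)              // shortfall detected up front
{code}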

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In an aggregation case, a lost task failed with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.




[jira] [Commented] (SPARK-4440) Enhance the job progress API to expose more information

2015-09-17 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791732#comment-14791732
 ] 

Rui Li commented on SPARK-4440:
---

For Hive on Spark, we want the completion time for each stage so we can compute how 
long the stage takes (there's already a submission time in {{SparkStageInfo}}).
It would also be great to get task metrics. Currently we have to implement a 
SparkListener to collect metrics.

[~chengxiang li] and [~xuefuz], do you have anything to add?
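For reference, a minimal sketch of the SparkListener approach mentioned above,
collecting per-stage wall-clock time and one task metric (1.x listener API;
error handling omitted):
{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

// Reports how long each stage took and a per-task metric, which is the kind of
// information the progress API does not yet expose.
class StageTimingListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    for (submitted <- info.submissionTime; completed <- info.completionTime) {
      println(s"Stage ${info.stageId} took ${completed - submitted} ms")
    }
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // taskMetrics can be null for failed tasks, hence the Option wrapper.
    Option(taskEnd.taskMetrics).foreach { m =>
      println(s"Task in stage ${taskEnd.stageId}: executorRunTime=${m.executorRunTime} ms")
    }
  }
}

// Registration: sc.addSparkListener(new StageTimingListener())
{code}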

> Enhance the job progress API to expose more information
> ---
>
> Key: SPARK-4440
> URL: https://issues.apache.org/jira/browse/SPARK-4440
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Rui Li
>
> The progress API introduced in SPARK-2321 provides a new way for users to 
> monitor job progress. However, the information exposed in the API is 
> relatively limited. It would be much more useful if we could enhance the API 
> to expose more data.
> Possible improvements include, but are not limited to:
> 1. Stage submission and completion time.
> 2. Task metrics.
> The requirement was initially identified for the Hive on Spark project 
> (HIVE-7292), but other applications should benefit as well.






[jira] [Assigned] (SPARK-10642) Crash in rdd.lookup() with "java.lang.Long cannot be cast to java.lang.Integer"

2015-09-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10642:


Assignee: Apache Spark

> Crash in rdd.lookup() with "java.lang.Long cannot be cast to 
> java.lang.Integer"
> ---
>
> Key: SPARK-10642
> URL: https://issues.apache.org/jira/browse/SPARK-10642
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
> Environment: OSX
>Reporter: Thouis Jones
>Assignee: Apache Spark
>
> Running this command:
> {code}
> sc.parallelize([(('a', 'b'), 
> 'c')]).groupByKey().partitionBy(20).cache().lookup(('a', 'b'))
> {code}
> gives the following error:
> {noformat}
> 15/09/16 14:22:23 INFO SparkContext: Starting job: runJob at 
> PythonRDD.scala:361
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/rdd.py", 
> line 2199, in lookup
> return self.ctx.runJob(values, lambda x: x, [self.partitioner(key)])
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/context.py", 
> line 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/sql/utils.py", 
> line 36, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : java.lang.ClassCastException: java.lang.Long cannot be cast to 
> java.lang.Integer
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitJob$1.apply(DAGScheduler.scala:530)
>   at scala.collection.Iterator$class.find(Iterator.scala:780)
>   at scala.collection.AbstractIterator.find(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.find(IterableLike.scala:79)
>   at scala.collection.AbstractIterable.find(Iterable.scala:54)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:530)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:558)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
>   at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:361)
>   at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
>   at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}






[jira] [Commented] (SPARK-10642) Crash in rdd.lookup() with "java.lang.Long cannot be cast to java.lang.Integer"

2015-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791786#comment-14791786
 ] 

Apache Spark commented on SPARK-10642:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/8796

> Crash in rdd.lookup() with "java.lang.Long cannot be cast to 
> java.lang.Integer"
> ---
>
> Key: SPARK-10642
> URL: https://issues.apache.org/jira/browse/SPARK-10642
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
> Environment: OSX
>Reporter: Thouis Jones
>
> Running this command:
> {code}
> sc.parallelize([(('a', 'b'), 
> 'c')]).groupByKey().partitionBy(20).cache().lookup(('a', 'b'))
> {code}
> gives the following error:
> {noformat}
> 15/09/16 14:22:23 INFO SparkContext: Starting job: runJob at 
> PythonRDD.scala:361
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/rdd.py", 
> line 2199, in lookup
> return self.ctx.runJob(values, lambda x: x, [self.partitioner(key)])
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/context.py", 
> line 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/sql/utils.py", 
> line 36, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : java.lang.ClassCastException: java.lang.Long cannot be cast to 
> java.lang.Integer
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitJob$1.apply(DAGScheduler.scala:530)
>   at scala.collection.Iterator$class.find(Iterator.scala:780)
>   at scala.collection.AbstractIterator.find(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.find(IterableLike.scala:79)
>   at scala.collection.AbstractIterable.find(Iterable.scala:54)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:530)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:558)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
>   at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:361)
>   at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
>   at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}






[jira] [Assigned] (SPARK-10642) Crash in rdd.lookup() with "java.lang.Long cannot be cast to java.lang.Integer"

2015-09-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10642:


Assignee: (was: Apache Spark)

> Crash in rdd.lookup() with "java.lang.Long cannot be cast to 
> java.lang.Integer"
> ---
>
> Key: SPARK-10642
> URL: https://issues.apache.org/jira/browse/SPARK-10642
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
> Environment: OSX
>Reporter: Thouis Jones
>
> Running this command:
> {code}
> sc.parallelize([(('a', 'b'), 
> 'c')]).groupByKey().partitionBy(20).cache().lookup(('a', 'b'))
> {code}
> gives the following error:
> {noformat}
> 15/09/16 14:22:23 INFO SparkContext: Starting job: runJob at 
> PythonRDD.scala:361
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/rdd.py", 
> line 2199, in lookup
> return self.ctx.runJob(values, lambda x: x, [self.partitioner(key)])
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/context.py", 
> line 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/sql/utils.py", 
> line 36, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : java.lang.ClassCastException: java.lang.Long cannot be cast to 
> java.lang.Integer
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitJob$1.apply(DAGScheduler.scala:530)
>   at scala.collection.Iterator$class.find(Iterator.scala:780)
>   at scala.collection.AbstractIterator.find(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.find(IterableLike.scala:79)
>   at scala.collection.AbstractIterable.find(Iterable.scala:54)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:530)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:558)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
>   at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:361)
>   at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
>   at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}






[jira] [Created] (SPARK-10660) Doc describe error in the "Running Spark on YARN" page

2015-09-17 Thread yangping wu (JIRA)
yangping wu created SPARK-10660:
---

 Summary: Doc describe error in the "Running Spark on YARN" page
 Key: SPARK-10660
 URL: https://issues.apache.org/jira/browse/SPARK-10660
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.5.0, 1.4.1, 1.4.0
Reporter: yangping wu
Priority: Trivial


In the *Configuration* section, the default values of 
*spark.yarn.driver.memoryOverhead* and *spark.yarn.am.memoryOverhead* should be 
"driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, with minimum 
of 384" respectively, because from Spark 1.4.0 the *MEMORY_OVERHEAD_FACTOR* is 
0.10, not 0.07.
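For reference, the overhead formula as described, written as a standalone sketch
(not the YARN allocator code itself):
{code}
// Overhead = max(factor * memoryMB, 384 MB), with factor = 0.10 since Spark 1.4.0.
val MEMORY_OVERHEAD_FACTOR = 0.10
val MEMORY_OVERHEAD_MIN = 384

def defaultMemoryOverheadMB(memoryMB: Int): Int =
  math.max((MEMORY_OVERHEAD_FACTOR * memoryMB).toInt, MEMORY_OVERHEAD_MIN)

defaultMemoryOverheadMB(4096) // a 4 GB driver gets max(409, 384) = 409 MB by default
{code}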






[jira] [Updated] (SPARK-10660) Doc describe error in the "Running Spark on YARN" page

2015-09-17 Thread yangping wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yangping wu updated SPARK-10660:

Description: In the *Configuration* section, the default values of 
*spark.yarn.driver.memoryOverhead* and *spark.yarn.am.memoryOverhead* should be 
"driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, with minimum 
of 384" respectively, because from Spark 1.4.0 the *MEMORY_OVERHEAD_FACTOR* is 
set to 0.10, not 0.07.  (was: In the *Configuration* section, the default 
values of *spark.yarn.driver.memoryOverhead* and *spark.yarn.am.memoryOverhead* 
should be "driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, 
with minimum of 384" respectively, because from Spark 1.4.0 the 
*MEMORY_OVERHEAD_FACTOR* is 0.10, not 0.07.)

> Doc describe error in the "Running Spark on YARN" page
> --
>
> Key: SPARK-10660
> URL: https://issues.apache.org/jira/browse/SPARK-10660
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: yangping wu
>Priority: Trivial
>
> In the *Configuration* section, the default values of 
> *spark.yarn.driver.memoryOverhead* and *spark.yarn.am.memoryOverhead* should 
> be "driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, with 
> minimum of 384" respectively, because from Spark 1.4.0 the 
> *MEMORY_OVERHEAD_FACTOR* is set to 0.10, not 0.07.






[jira] [Commented] (SPARK-10285) Add @since annotation to pyspark.ml.util

2015-09-17 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791815#comment-14791815
 ] 

Yu Ishikawa commented on SPARK-10285:
-

Closing this PR because those are non-public APIs.

> Add @since annotation to pyspark.ml.util
> 
>
> Key: SPARK-10285
> URL: https://issues.apache.org/jira/browse/SPARK-10285
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
>







[jira] [Commented] (SPARK-10625) Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds unserializable objects into connection properties

2015-09-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791830#comment-14791830
 ] 

Sean Owen commented on SPARK-10625:
---

Dumb question here, but if the driver needs these objects, and Spark needs to 
serialize them, how will that ever work? Does the driver not actually need 
them, just uses them if they're around?

> Spark SQL JDBC read/write is unable to handle JDBC Drivers that adds 
> unserializable objects into connection properties
> --
>
> Key: SPARK-10625
> URL: https://issues.apache.org/jira/browse/SPARK-10625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
> Environment: Ubuntu 14.04
>Reporter: Peng Cheng
>  Labels: jdbc, spark, sparksql
>
> Some JDBC drivers (e.g. SAP HANA) try to optimize connection pooling by 
> adding new objects to the connection properties, which Spark then reuses and 
> ships to workers. When any of these new objects is not serializable, this 
> triggers an org.apache.spark.SparkException: Task not serializable. The 
> following test snippet demonstrates the problem using a modified H2 driver 
> (it assumes the surrounding JDBC test suite provides {{sql}}, {{sqlContext}}, 
> {{url1}} and {{properties}}):
> {code}
> import java.sql.{Connection, DriverManager}
> import java.util.Properties
> import java.util.logging.Logger
>
>   test("INSERT to JDBC Datasource with UnserializableH2Driver") {
>     object UnserializableH2Driver extends org.h2.Driver {
>       override def connect(url: String, info: Properties): Connection = {
>         val result = super.connect(url, info)
>         // The driver stashes a non-serializable object in the shared properties.
>         info.put("unserializableDriver", this)
>         result
>       }
>       override def getParentLogger: Logger = ???
>     }
>
>     import scala.collection.JavaConversions._
>
>     val oldDrivers = DriverManager.getDrivers.filter(_.acceptsURL("jdbc:h2:")).toSeq
>     oldDrivers.foreach(DriverManager.deregisterDriver)
>     DriverManager.registerDriver(UnserializableH2Driver)
>
>     sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE")
>     assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).count)
>     assert(2 === sqlContext.read.jdbc(url1, "TEST.PEOPLE1", properties).collect()(0).length)
>
>     DriverManager.deregisterDriver(UnserializableH2Driver)
>     oldDrivers.foreach(DriverManager.registerDriver)
>   }
> {code}
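One generic way to sidestep this kind of driver behaviour (a sketch for
discussion, not the fix applied in Spark) is to copy only the string-valued
entries before the properties get captured by a task closure:
{code}
import java.util.Properties

// Copies only String -> String entries, so driver-injected helper objects (like
// the "unserializableDriver" entry above) never reach a serialized closure.
def serializableCopy(props: Properties): Properties = {
  val copy = new Properties()
  val names = props.stringPropertyNames().iterator()
  while (names.hasNext) {
    val key = names.next()
    copy.setProperty(key, props.getProperty(key))
  }
  copy
}
{code}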






[jira] [Commented] (SPARK-10660) Doc describe error in the "Running Spark on YARN" page

2015-09-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791834#comment-14791834
 ] 

Sean Owen commented on SPARK-10660:
---

Agree with that, do you want to make a PR?

> Doc describe error in the "Running Spark on YARN" page
> --
>
> Key: SPARK-10660
> URL: https://issues.apache.org/jira/browse/SPARK-10660
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: yangping wu
>Priority: Trivial
>
> In the *Configuration* section, the default values of 
> *spark.yarn.driver.memoryOverhead* and *spark.yarn.am.memoryOverhead* should 
> be "driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, with 
> minimum of 384" respectively, because from Spark 1.4.0 the 
> *MEMORY_OVERHEAD_FACTOR* is set to 0.10, not 0.07.






[jira] [Comment Edited] (SPARK-10614) SystemClock uses non-monotonic time in its wait logic

2015-09-17 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14768900#comment-14768900
 ] 

Steve Loughran edited comment on SPARK-10614 at 9/17/15 10:08 AM:
--

Having done a little more detailed research on the current state of this clock, 
I'm now having doubts about this.

On x86 it's generally assumed that {{System.nanoTime()}} uses the {{TSC}} 
counter to get the timestamp, which is fast and only goes forwards (albeit at a 
rate which depends on CPU power states). But it turns out that on many-core 
CPUs, because the TSC could give different answers on different cores, the OS 
may fall back to alternative mechanisms for the counter, which may be neither 
monotonic nor fast.

# [Inside the Hotspot VM: Clocks, Timers and Scheduling Events - Part I - 
Windows|https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks]
# [JDK-6440250 : On Windows System.nanoTime() may be 25x slower than 
System.currentTimeMillis()|http://bugs.java.com/view_bug.do?bug_id=6440250]
# [JDK-6458294 : nanoTime affected by system clock change on Linux (RH9) or in 
general lacks 
monotonicity|http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6458294]
# [Redhat on timestamps in 
Linux|https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html]
# [Timekeeping in VMware Virtual 
Machines|http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf]

These docs imply that nanoTime may be fast but unreliable on multiple-socket 
systems (the latest many-core parts share one counter), and may downgrade to 
something slower than calls to getTimeMillis(), or even to something that isn't 
guaranteed to be monotonic.

It's not clear that moving to nanoTime actually offers much on deployments of 
physical many-core systems. I don't know about EC2 or other cloud 
infrastructures, though.

Maybe it's best to WONTFIX this, so as not to raise unrealistic expectations 
about nanoTime working.


was (Author: ste...@apache.org):
Having done a little more detailed research on the current state of this clock, 
I'm now having doubts about this.

On x86, its generally assumed that the {{System.nanoTime()}} uses the {{TSC}} 
counter to get the timestamp —which is fast and only goes forwards (albeit at a 
rate which depends on CPU power states). But it turns out that on manycore 
CPUs, because that could lead to different answers on different cores, the OS 
may use alternative mechanisms to return a counter: which may be neither 
monotonic nor fast.

# [Inside the Hotspot VM: Clocks, Timers and Scheduling Events - Part I - 
Windows|https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks]
# [JDK-6440250 : On Windows System.nanoTime() may be 25x slower than 
System.currentTimeMillis()|http://bugs.java.com/view_bug.do?bug_id=6440250]
# [JDK-6458294 : nanoTime affected by system clock change on Linux (RH9) or in 
general lacks monotonicity|JDK-6458294 : nanoTime affected by system clock 
change on Linux (RH9) or in general lacks monotonicity]
# [Redhat on timestamps in 
Linux|https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html]
# [Timekeeping in VMware Virtual 
Machineshttp://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf]

These docs imply that nanotime may be fast-but-unreliable-on multiple socket 
systems (the latest many core parts share one counter) —and may downgrade to 
something slower than calls to getTimeMillis()., or even something that isn't 
guaranteed to be monotonic. 

It's not clear that on deployments of physical many-core systems moving to 
nanotime actually offers much. I don't know about EC2 or other cloud 
infrastructures though.

maybe its just best to WONTFIX this as it won't raise unrealistic expectations 
about nanoTime working

> SystemClock uses non-monotonic time in its wait logic
> -
>
> Key: SPARK-10614
> URL: https://issues.apache.org/jira/browse/SPARK-10614
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Steve Loughran
>Priority: Minor
>
> The consolidated (SPARK-4682) clock uses {{System.currentTimeMillis()}} for 
> measuring time, which means its {{waitTillTime()}} routine is brittle against 
> systems (VMs in particular) whose time can go backwards as well as forward.
> For the {{ExecutorAllocationManager}} this appears to be a regression.
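For context, a minimal sketch (not Spark's actual {{SystemClock}}) of a wait
loop that measures the remaining duration with {{System.nanoTime()}}, so a
backwards wall-clock jump cannot stretch the wait; as discussed above, this
still assumes nanoTime itself behaves monotonically on the host:
{code}
// Wait until the wall clock reaches targetTime (ms since epoch), tracking the
// remaining duration with nanoTime rather than re-reading the wall clock.
def waitTillTime(targetTime: Long): Long = {
  val remainingMs = math.max(targetTime - System.currentTimeMillis(), 0L)
  val deadlineNanos = System.nanoTime() + remainingMs * 1000000L
  while (System.nanoTime() < deadlineNanos) {
    val leftMs = (deadlineNanos - System.nanoTime()) / 1000000L
    Thread.sleep(math.min(math.max(leftMs, 1L), 100L))
  }
  System.currentTimeMillis()
}
{code}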




[jira] [Created] (SPARK-10661) The PipelineModel class inherits from Serializable twice.

2015-09-17 Thread Matt Hagen (JIRA)
Matt Hagen created SPARK-10661:
--

 Summary: The PipelineModel class inherits from Serializable twice.
 Key: SPARK-10661
 URL: https://issues.apache.org/jira/browse/SPARK-10661
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.5.0
Reporter: Matt Hagen
Priority: Minor
 Fix For: 1.5.0


The Scaladoc shows that org.apache.spark.ml.PipelineModel inherits from 
Serializable twice. 






[jira] [Commented] (SPARK-10388) Public dataset loader interface

2015-09-17 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802747#comment-14802747
 ] 

Kai Sasaki commented on SPARK-10388:


[~mengxr] I totally agree with you. The initial version should be minimal and 
simple, so the previous suggestion was just a list of desired features. In that 
sense, the initial suggestion might be sufficient as an MVP.
{quote}
For example, I don't think json and orc are commonly used for ML datasets.
{quote}
Yes, JSON and ORC are not commonly used for machine learning data. I just think 
the public dataset loader should be flexible enough for later extension, so 
that other dataset formats can be added as plugins.
{quote}
A proper implementation would be implementing HTTP as a Hadoop FileSystem.
{quote}
Does that mean a public dataset could be used through an RDD directly? For 
example, could we use {{val data = sc.textFile( // public dataset url )}}?
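If HTTP were wired up as a Hadoop FileSystem, reading a public dataset into an
RDD could look roughly like the sketch below; {{org.example.hadoop.HttpFileSystem}}
and the URL are purely hypothetical placeholders:
{code}
// Hypothetical: point the http:// scheme at a FileSystem implementation, then
// read a public dataset directly.
sc.hadoopConfiguration.set("fs.http.impl", "org.example.hadoop.HttpFileSystem")
val data = sc.textFile("http://example.org/datasets/rcv1_train.binary")
{code}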

> Public dataset loader interface
> ---
>
> Key: SPARK-10388
> URL: https://issues.apache.org/jira/browse/SPARK-10388
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It is very useful to have a public dataset loader to fetch ML datasets from 
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> User should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the 
> API and implementation are pending discussion. Note that this requires http 
> and https support.






[jira] [Updated] (SPARK-10662) Code snippets are not properly formatted in docs

2015-09-17 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-10662:

Issue Type: Task  (was: Bug)
   Summary: Code snippets are not properly formatted in docs  (was: Code 
examples are not properly formatted)

> Code snippets are not properly formatted in docs
> 
>
> Key: SPARK-10662
> URL: https://issues.apache.org/jira/browse/SPARK-10662
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Jacek Laskowski
>
> Backticks (markdown) in tables are not processed and hence not formatted 
> properly. See 
> http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html
>  and search for {{`yarn-client`}}.
> As per [Sean's 
> suggestion|https://github.com/apache/spark/pull/8795#issuecomment-141019047] 
> I'm creating the JIRA task.
> {quote}
> This is a good fix, but this is another instance where I suspect the same 
> issue exists in several markup files, like configuration.html. It's worth a 
> JIRA since I think catching and fixing all of these is one non-trivial 
> logical change.
> If you can, avoid whitespace changes like stripping or adding space at the 
> end of lines. It just adds to the diff and makes for a tiny extra chance of 
> merge conflicts.
> {quote}






[jira] [Created] (SPARK-10662) Code examples are not properly formatted

2015-09-17 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-10662:
---

 Summary: Code examples are not properly formatted
 Key: SPARK-10662
 URL: https://issues.apache.org/jira/browse/SPARK-10662
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.5.0
Reporter: Jacek Laskowski


Backticks (markdown) in tables are not processed and hence not formatted 
properly. See 
http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html
 and search for {{`yarn-client`}}.

As per [Sean's 
suggestion|https://github.com/apache/spark/pull/8795#issuecomment-141019047] 
I'm creating the JIRA task.
{quote}
This is a good fix, but this is another instance where I suspect the same issue 
exists in several markup files, like configuration.html. It's worth a JIRA 
since I think catching and fixing all of these is one non-trivial logical 
change.

If you can, avoid whitespace changes like stripping or adding space at the end 
of lines. It just adds to the diff and makes for a tiny extra chance of merge 
conflicts.
{quote}






[jira] [Commented] (SPARK-10660) Doc describe error in the "Running Spark on YARN" page

2015-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802796#comment-14802796
 ] 

Apache Spark commented on SPARK-10660:
--

User '397090770' has created a pull request for this issue:
https://github.com/apache/spark/pull/8797

> Doc describe error in the "Running Spark on YARN" page
> --
>
> Key: SPARK-10660
> URL: https://issues.apache.org/jira/browse/SPARK-10660
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: yangping wu
>Priority: Trivial
>
> In the *Configuration* section, the default values of 
> *spark.yarn.driver.memoryOverhead* and *spark.yarn.am.memoryOverhead* should 
> be "driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, with 
> minimum of 384" respectively, because from Spark 1.4.0 the 
> *MEMORY_OVERHEAD_FACTOR* is set to 0.10, not 0.07.






[jira] [Assigned] (SPARK-10660) Doc describe error in the "Running Spark on YARN" page

2015-09-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10660:


Assignee: (was: Apache Spark)

> Doc describe error in the "Running Spark on YARN" page
> --
>
> Key: SPARK-10660
> URL: https://issues.apache.org/jira/browse/SPARK-10660
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: yangping wu
>Priority: Trivial
>
> In the *Configuration* section, the default values of 
> *spark.yarn.driver.memoryOverhead* and *spark.yarn.am.memoryOverhead* should 
> be "driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, with 
> minimum of 384" respectively, because from Spark 1.4.0 the 
> *MEMORY_OVERHEAD_FACTOR* is set to 0.10, not 0.07.






[jira] [Commented] (SPARK-10660) Doc describe error in the "Running Spark on YARN" page

2015-09-17 Thread yangping wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802792#comment-14802792
 ] 

yangping wu commented on SPARK-10660:
-

Hi [~srowen], thank you for your reply! I have opened a PR 
[https://github.com/apache/spark/pull/8797]; could you please review it?

> Doc describe error in the "Running Spark on YARN" page
> --
>
> Key: SPARK-10660
> URL: https://issues.apache.org/jira/browse/SPARK-10660
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: yangping wu
>Priority: Trivial
>
> In the *Configuration* section, the default values of 
> *spark.yarn.driver.memoryOverhead* and *spark.yarn.am.memoryOverhead* should 
> be "driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, with 
> minimum of 384" respectively, because from Spark 1.4.0 the 
> *MEMORY_OVERHEAD_FACTOR* is set to 0.10, not 0.07.






[jira] [Assigned] (SPARK-10660) Doc describe error in the "Running Spark on YARN" page

2015-09-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10660:


Assignee: Apache Spark

> Doc describe error in the "Running Spark on YARN" page
> --
>
> Key: SPARK-10660
> URL: https://issues.apache.org/jira/browse/SPARK-10660
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: yangping wu
>Assignee: Apache Spark
>Priority: Trivial
>
> In the *Configuration* section, the default values of 
> *spark.yarn.driver.memoryOverhead* and *spark.yarn.am.memoryOverhead* should 
> be "driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, with 
> minimum of 384" respectively, because from Spark 1.4.0 the 
> *MEMORY_OVERHEAD_FACTOR* is set to 0.10, not 0.07.






[jira] [Updated] (SPARK-10661) The PipelineModel class inherits from Serializable twice.

2015-09-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10661:
--
Target Version/s:   (was: 1.5.0)
Priority: Trivial  (was: Minor)
   Fix Version/s: (was: 1.5.0)

I think that's a scaladoc issue if anything, not Spark. It's pretty harmless in 
any event, so not something that's worth working around.

Please also read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. 
This can't target or be fixed for 1.5.0, since it's not fixed and 1.5.0 is 
released.

> The PipelineModel class inherits from Serializable twice.
> -
>
> Key: SPARK-10661
> URL: https://issues.apache.org/jira/browse/SPARK-10661
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Matt Hagen
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The Scaladoc shows that org.apache.spark.ml.PipelineModel inherits from 
> Serializable twice. 






[jira] [Commented] (SPARK-10661) The PipelineModel class inherits from Serializable twice.

2015-09-17 Thread Matt Hagen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802821#comment-14802821
 ] 

Matt Hagen commented on SPARK-10661:


Thanks. Still calibrating how to report doc issues. Will label as trivial going 
forward. 

> The PipelineModel class inherits from Serializable twice.
> -
>
> Key: SPARK-10661
> URL: https://issues.apache.org/jira/browse/SPARK-10661
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Matt Hagen
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The Scaladoc shows that org.apache.spark.ml.PipelineModel inherits from 
> Serializable twice. 






[jira] [Created] (SPARK-10663) Change test.toDF to test in Spark ML Programming Guide

2015-09-17 Thread Matt Hagen (JIRA)
Matt Hagen created SPARK-10663:
--

 Summary: Change test.toDF to test in Spark ML Programming Guide
 Key: SPARK-10663
 URL: https://issues.apache.org/jira/browse/SPARK-10663
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Matt Hagen
Priority: Trivial


Spark 1.5.0 > Spark ML Programming Guide > Example: Pipeline

I believe model.transform(test.toDF) should be model.transform(test).

Note that "test" is already a DataFrame.






[jira] [Assigned] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10474:


Assignee: (was: Apache Spark)

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In an aggregation case, a lost task failed with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.






[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802906#comment-14802906
 ] 

Apache Spark commented on SPARK-10474:
--

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/8798

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In an aggregation case, a lost task failed with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.




[jira] [Assigned] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10474:


Assignee: Apache Spark

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Apache Spark
>Priority: Blocker
>
> In an aggregation case, a lost task failed with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-17 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802912#comment-14802912
 ] 

Cheng Hao commented on SPARK-10474:
---

The root cause of this failure is 
`TungstenAggregationIterator.switchToSortBasedAggregation`: the hash-based 
aggregation has already eaten up all of the available memory, so when we switch 
to sort-based aggregation we cannot allocate any more memory, not even for the 
spill itself.

I post a workaround solution PR for discussion.
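To make the failure mode concrete, here is a minimal sketch of the control flow 
being described. It is an illustrative Scala sketch, not Spark's actual code; 
everything below (other than the error message quoted from the stack trace) is a 
simplified stand-in.

{code}
// Sketch of the failure mode described above (simplified, not Spark's real code).
// All memory requests go through a shared per-task memory pool.
class FakeMemoryManager(var available: Long) {
  def tryAcquire(bytes: Long): Boolean =
    if (bytes <= available) { available -= bytes; true } else false
}

object AggregationSwitchSketch {
  def main(args: Array[String]): Unit = {
    val memory = new FakeMemoryManager(available = 4L * 1024 * 1024)

    // Phase 1: hash-based aggregation keeps acquiring pages until nothing is left.
    while (memory.tryAcquire(1024 * 1024)) { /* insert rows into the hash map */ }

    // Phase 2: switching to sort-based aggregation needs a fresh allocation
    // before it can spill anything, but the hash map already consumed all of
    // the available memory, so even this small request fails.
    if (!memory.tryAcquire(64 * 1024)) {
      throw new java.io.IOException("Could not acquire 65536 bytes of memory")
    }
  }
}
{code}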

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-17 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802912#comment-14802912
 ] 

Cheng Hao edited comment on SPARK-10474 at 9/17/15 1:48 PM:


The root cause of this failure is the condition that triggers the switch from 
hash-based to sort-based aggregation in the `TungstenAggregationIterator`. The 
current logic only switches to sort-based aggregation once no more memory can 
be allocated; however, since no memory is left at that point, the data spill 
also fails, in `UnsafeExternalSorter.initializeForWriting`.

I post a workaround solution PR for discussion.
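The actual fix is in the linked pull request; as one way to picture the kind of 
workaround being discussed, the sketch below reserves a small spill budget up 
front so the switch to sort-based aggregation still has memory to work with. 
This is a hypothetical illustration, not the approach taken in the PR.

{code}
// Hypothetical mitigation sketch (not the actual PR): keep a small reserve so the
// spill path can still allocate its initial buffers when the hash map fills up.
class ReservingMemoryManager(total: Long, spillReserve: Long) {
  private var used = 0L

  // Hash aggregation may only grow up to (total - spillReserve).
  def tryAcquireForHashMap(bytes: Long): Boolean =
    if (used + bytes <= total - spillReserve) { used += bytes; true } else false

  // The spill path is allowed to dip into the reserved slice.
  def tryAcquireForSpill(bytes: Long): Boolean =
    if (used + bytes <= total) { used += bytes; true } else false
}

object ReserveSketch extends App {
  val mm = new ReservingMemoryManager(total = 4L * 1024 * 1024, spillReserve = 64 * 1024)
  while (mm.tryAcquireForHashMap(1024 * 1024)) { /* fill the hash map */ }
  // Unlike before, the sorter's initial allocation during the switch now succeeds.
  assert(mm.tryAcquireForSpill(64 * 1024))
}
{code}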


was (Author: chenghao):
The root cause of this failure is 
`TungstenAggregationIterator.switchToSortBasedAggregation`: the hash-based 
aggregation has already eaten up all of the available memory, so when we switch 
to sort-based aggregation we cannot allocate any more memory, not even for the 
spill itself.

I post a workaround solution PR for discussion.

> Aggregation failed with unable to acquire memory
> 
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Blocker
>
> In aggregation case, a  Lost task happened with below error.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 

[jira] [Commented] (SPARK-10289) A direct write API for testing Parquet compatibility

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802938#comment-14802938
 ] 

Maximilian Michels commented on SPARK-10289:


User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8460

> A direct write API for testing Parquet compatibility
> 
>
> Key: SPARK-10289
> URL: https://issues.apache.org/jira/browse/SPARK-10289
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 1.6.0
>
>
> Due to a set of unfortunate historical issues, it's relatively hard to 
> achieve full interoperability among various Parquet data models. Spark 1.5 
> implemented all backwards-compatibility rules defined in parquet-format spec 
> on the read path (SPARK-6774) to improve this.  However, testing all those 
> corner cases can be really challenging.  Currently, we are testing Parquet 
> compatibility/interoperability by two means:
> # Generate Parquet files by other systems, bundle them into Spark source tree 
> as testing resources, and write test cases against them to ensure that we can 
> interpret them correctly. Currently, we are testing parquet-thrift and 
> parquet-protobuf compatibility in this way.
> #- Pros: Easy to write test cases, easy to test against multiple versions of 
> a given external system/libraries (by generating Parquet files with these 
> versions)
> #- Cons: Hard to track how testing Parquet files are generated
> # Make external libraries as testing dependencies, and call their APIs 
> directly to write Parquet files and verify them. Currently, parquet-avro 
> compatibility is tested using this approach.
> #- Pros: Easy to track how testing Parquet files are generated
> #- Cons:
> ##- Often requires code generation (Avro/Thrift/ProtoBuf/...), either 
> complicates build system by using build time code generation, or bloats the 
> code base by checking in generated Java files.  The former one is especially 
> annoying because Spark has two build systems, and require two sets of plugins 
> to do code generation (e.g., for Avro, we need both sbt-avro and 
> avro-maven-plugin).
> ##- Can only test a single version of a given target library
> Inspired by the 
> [{{writeDirect}}|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-avro/src/test/java/org/apache/parquet/avro/TestArrayCompatibility.java#L945-L972]
>  method in parquet-avro testing code, a direct write API can be a good 
> complement for testing Parquet compatibilities.  Ideally, this API should
> # be easy to construct arbitrary complex Parquet records
> # have a DSL that reflects the nested nature of Parquet records
> In this way, it would be both easy to track Parquet file generation and easy 
> to cover various versions of external libraries.  However, test case authors 
> must be really careful when constructing the test cases and ensure 
> constructed Parquet structures are identical to those generated by the target 
> systems/libraries.  We're probably not going to replace the above two 
> approaches with this API, but just add it as a complement.
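As a rough illustration of what such a direct-write DSL could look like, here is 
a self-contained Scala sketch. The builder below is hypothetical and exists only 
in this snippet; it is neither parquet-mr's API nor the API that was eventually 
added, it just shows the nested, record-oriented shape the description asks for.

{code}
// Hypothetical DSL shape only: every name here is defined in this snippet.
sealed trait Node
case class Group(name: String, children: Seq[Node]) extends Node
case class Leaf(name: String, value: Any) extends Node

object DirectWriteSketch {
  // Tiny helpers so a test can spell out nested Parquet-like records directly.
  def message(children: Node*): Group = Group("message", children)
  def group(name: String)(children: Node*): Group = Group(name, children)
  def field(name: String, value: Any): Leaf = Leaf(name, value)

  def main(args: Array[String]): Unit = {
    // A nested record written out "directly", mirroring the structure under test.
    val record = message(
      field("id", 1),
      group("location")(
        field("latitude", 0.0),
        field("longitude", 0.0)
      )
    )
    println(record)
  }
}
{code}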



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10226) Error occurred in SparkSQL when using !=

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802940#comment-14802940
 ] 

Maximilian Michels commented on SPARK-10226:


User 'small-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8418

> Error occurred in SparkSQL when using !=
> 
>
> Key: SPARK-10226
> URL: https://issues.apache.org/jira/browse/SPARK-10226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: wangwei
>Assignee: wangwei
> Fix For: 1.5.0
>
>
> DataSource:  
> src/main/resources/kv1.txt
> SQL: 
>   1. create table src(id string, name string);
>   2. load data local inpath 
> '${SparkHome}/examples/src/main/resources/kv1.txt' into table src;
>   3. select count( * ) from src where id != '0';
> [ERROR] Could not expand event
> java.lang.IllegalArgumentException: != 0;: event not found
>   at jline.console.ConsoleReader.expandEvents(ConsoleReader.java:779)
>   at jline.console.ConsoleReader.finishBuffer(ConsoleReader.java:631)
>   at jline.console.ConsoleReader.accept(ConsoleReader.java:2019)
>   at jline.console.ConsoleReader.readLine(ConsoleReader.java:2666)
>   at jline.console.ConsoleReader.readLine(ConsoleReader.java:2269)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:231)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:601)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:666)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:178)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:203)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
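The referenced pull request is not reproduced here; one plausible mitigation, 
assuming the CLI is driven by a jline2 ConsoleReader as in the stack trace 
above, is to disable '!' history expansion so input containing != is passed 
through verbatim. This is a sketch of the idea, not necessarily the fix in the 
linked PR.

{code}
import jline.console.ConsoleReader

object CliReaderSketch {
  // Sketch only: disable jline2 history ("event") expansion so that a query such as
  //   select count(*) from src where id != '0';
  // is not interpreted as a "!..." history reference by the console reader.
  def newReader(): ConsoleReader = {
    val reader = new ConsoleReader()
    reader.setExpandEvents(false)
    reader
  }
}
{code}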



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10635) pyspark - running on a different host

2015-09-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10635:
--
Component/s: PySpark

> pyspark - running on a different host
> -
>
> Key: SPARK-10635
> URL: https://issues.apache.org/jira/browse/SPARK-10635
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Ben Duffield
>
> At various points we assume we only ever talk to a driver on the same host.
> e.g. 
> https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615
> We use pyspark to connect to an existing driver (i.e. do not let pyspark 
> launch the driver itself, but instead construct the SparkContext with the 
> gateway and jsc arguments).
> There are a few reasons for this, but essentially it's to allow more 
> flexibility when running in AWS.
> Before 1.3.1 we were able to monkeypatch around this:  
> {code}
> def _load_from_socket(port, serializer):
> sock = socket.socket()
> sock.settimeout(3)
> try:
> sock.connect((host, port))
> rf = sock.makefile("rb", 65536)
> for item in serializer.load_stream(rf):
> yield item
> finally:
> sock.close()
> pyspark.rdd._load_from_socket = _load_from_socket
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10643) Support HDFS urls in spark-submit

2015-09-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10643:
--
Component/s: Spark Submit

> Support HDFS urls in spark-submit
> -
>
> Key: SPARK-10643
> URL: https://issues.apache.org/jira/browse/SPARK-10643
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Submit
>Reporter: Alan Braithwaite
>Priority: Minor
>
> When using Mesos with Docker and Marathon, it would be nice to be able to 
> make spark-submit deployable on Marathon and have it download a jar from 
> HDFS instead of having to package the jar into the Docker image.
> {code}
> $ docker run -it docker.example.com/spark:latest 
> /usr/local/spark/bin/spark-submit  --class 
> com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar 
> Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Although I'm aware that we can run in cluster mode with mesos, we've already 
> built some nice tools surrounding marathon for logging and monitoring.
> Code in question:
> https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L685-L698



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10226) Error occurred in SparkSQL when using !=

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802939#comment-14802939
 ] 

Maximilian Michels commented on SPARK-10226:


User 'small-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8419

> Error occurred in SparkSQL when using !=
> 
>
> Key: SPARK-10226
> URL: https://issues.apache.org/jira/browse/SPARK-10226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: wangwei
>Assignee: wangwei
> Fix For: 1.5.0
>
>
> DataSource:  
> src/main/resources/kv1.txt
> SQL: 
>   1. create table src(id string, name string);
>   2. load data local inpath 
> '${SparkHome}/examples/src/main/resources/kv1.txt' into table src;
>   3. select count( * ) from src where id != '0';
> [ERROR] Could not expand event
> java.lang.IllegalArgumentException: != 0;: event not found
>   at jline.console.ConsoleReader.expandEvents(ConsoleReader.java:779)
>   at jline.console.ConsoleReader.finishBuffer(ConsoleReader.java:631)
>   at jline.console.ConsoleReader.accept(ConsoleReader.java:2019)
>   at jline.console.ConsoleReader.readLine(ConsoleReader.java:2666)
>   at jline.console.ConsoleReader.readLine(ConsoleReader.java:2269)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:231)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:601)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:666)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:178)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:203)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8887) Explicitly define which data types can be used as dynamic partition columns

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802947#comment-14802947
 ] 

Maximilian Michels commented on SPARK-8887:
---

User 'yjshen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8132

> Explicitly define which data types can be used as dynamic partition columns
> ---
>
> Key: SPARK-8887
> URL: https://issues.apache.org/jira/browse/SPARK-8887
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Cheng Lian
>Assignee: Yijie Shen
> Fix For: 1.6.0
>
>
> {{InsertIntoHadoopFsRelation}} implements Hive-compatible dynamic 
> partitioning insertion, which uses {{String.valueOf}} to encode 
> partition column values into dynamic partition directories. This actually 
> limits the data types that can be used in partition columns. For example, 
> the string representation of {{StructType}} values is not well defined. 
> However, this limitation is not explicitly enforced.
> There are several things we can improve:
> # Enforce dynamic column data type requirements by adding analysis rules and 
> throws {{AnalysisException}} when violation occurs.
> # Abstract away string representation of various data types, so that we don't 
> need to convert internal representation types (e.g. {{UTF8String}}) to 
> external types (e.g. {{String}}). A set of Hive compatible implementations 
> should be provided to ensure compatibility with Hive.
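To illustrate the first improvement listed above, here is a minimal sketch of an 
analysis-time check. The allowed set of type names and the method name are 
purely illustrative assumptions, not Spark's actual rule.

{code}
// Illustrative sketch only: reject partition columns whose type has no
// well-defined string representation for directory names.
object PartitionColumnCheckSketch {
  private val supported =
    Set("StringType", "IntegerType", "LongType", "ShortType", "ByteType",
        "BooleanType", "DateType")

  def checkDynamicPartitionColumn(column: String, dataType: String): Unit =
    if (!supported.contains(dataType))
      throw new IllegalArgumentException(
        s"Column '$column' of type $dataType cannot be used as a dynamic partition column")

  def main(args: Array[String]): Unit = {
    checkDynamicPartitionColumn("dt", "DateType")     // passes
    checkDynamicPartitionColumn("loc", "StructType")  // throws, as the rule would
  }
}
{code}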



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10635) pyspark - running on a different host

2015-09-17 Thread Ben Duffield (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802950#comment-14802950
 ] 

Ben Duffield commented on SPARK-10635:
--

Curious as to why you believe this to be hard to support?

We've been using this in many places for quite a long time without issue prior 
to 1.4.

I guess there's the question of how to plumb the correct host to 
_load_from_socket. I'm also not aware of the reasons for changing the 
ServerSocket in the Python RDD serveIterator to listen explicitly on localhost. 
These are the only two places, though, that I believe need to change.

The alternative is for us to put proxying into the application itself (the 
application acting as driver) and then monkeypatching pyspark as before, but 
this isn't ideal.
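For reference, the second of those two places boils down to which address the 
result-serving socket binds to. A minimal sketch of making that configurable 
follows; the Option-based bind host is an assumption for illustration, not an 
existing Spark configuration or the actual code.

{code}
import java.net.{InetAddress, ServerSocket}

object ServeSocketSketch {
  // Sketch only: bind the result-serving socket to a configurable address instead
  // of hard-coding the loopback interface.
  def openServerSocket(bindHost: Option[String]): ServerSocket = {
    val addr = InetAddress.getByName(bindHost.getOrElse("localhost"))
    new ServerSocket(0, 1, addr) // port 0 = any free port, backlog of 1
  }
}
{code}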



> pyspark - running on a different host
> -
>
> Key: SPARK-10635
> URL: https://issues.apache.org/jira/browse/SPARK-10635
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Ben Duffield
>
> At various points we assume we only ever talk to a driver on the same host.
> e.g. 
> https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615
> We use pyspark to connect to an existing driver (i.e. do not let pyspark 
> launch the driver itself, but instead construct the SparkContext with the 
> gateway and jsc arguments).
> There are a few reasons for this, but essentially it's to allow more 
> flexibility when running in AWS.
> Before 1.3.1 we were able to monkeypatch around this:  
> {code}
> def _load_from_socket(port, serializer):
> sock = socket.socket()
> sock.settimeout(3)
> try:
> sock.connect((host, port))
> rf = sock.makefile("rb", 65536)
> for item in serializer.load_stream(rf):
> yield item
> finally:
> sock.close()
> pyspark.rdd._load_from_socket = _load_from_socket
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10663) Change test.toDF to test in Spark ML Programming Guide

2015-09-17 Thread Jian Feng Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802952#comment-14802952
 ] 

Jian Feng Zhang commented on SPARK-10663:
-

It's correct on the Spark website, and it matches the description in the Spark 
repository on GitHub.

> Change test.toDF to test in Spark ML Programming Guide
> --
>
> Key: SPARK-10663
> URL: https://issues.apache.org/jira/browse/SPARK-10663
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Matt Hagen
>Priority: Trivial
>
> Spark 1.5.0 > Spark ML Programming Guide > Example: Pipeline
> I believe model.transform(test.toDF) should be model.transform(test).
> Note that "test" is already a DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10635) pyspark - running on a different host

2015-09-17 Thread Ben Duffield (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802950#comment-14802950
 ] 

Ben Duffield edited comment on SPARK-10635 at 9/17/15 2:04 PM:
---

Curious as to why you believe this to be hard to support?

We've been using this in many places for quite a long time without issue prior 
to 1.4.

I guess there's the question of how to plumb the correct host to 
_load_from_socket. I'm also not aware of the reasons for changing the 
ServerSocket in the Python RDD serveIterator to listen explicitly on localhost. 
These are the only two places, though, that I believe need to change.

The alternative is for us to put proxying into the application itself (the 
application acting as driver) and then monkeypatching pyspark as before, but 
this isn't ideal.


was (Author: bavardage):
Curious as to why you believe this to be hard to support?

We've been using this in many places for quite a long time without issue prior 
to 1.4.

I guess there's the question of how to plumb the correct host to 
_load_from_socket. I'm also not aware of the reasons for changing the 
ServerSocket in the Python RDD serveIterator to listen explicitly on localhost. 
These are the only two places, though, that I believe need to change.

The alternative is for us to put proxying into the application itself (the 
application acting as driver) and then monkeypatching pyspark as before, but 
this isn't ideal.



> pyspark - running on a different host
> -
>
> Key: SPARK-10635
> URL: https://issues.apache.org/jira/browse/SPARK-10635
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Ben Duffield
>
> At various points we assume we only ever talk to a driver on the same host.
> e.g. 
> https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615
> We use pyspark to connect to an existing driver (i.e. do not let pyspark 
> launch the driver itself, but instead construct the SparkContext with the 
> gateway and jsc arguments).
> There are a few reasons for this, but essentially it's to allow more 
> flexibility when running in AWS.
> Before 1.3.1 we were able to monkeypatch around this:  
> {code}
> def _load_from_socket(port, serializer):
> sock = socket.socket()
> sock.settimeout(3)
> try:
> sock.connect((host, port))
> rf = sock.makefile("rb", 65536)
> for item in serializer.load_stream(rf):
> yield item
> finally:
> sock.close()
> pyspark.rdd._load_from_socket = _load_from_socket
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10635) pyspark - running on a different host

2015-09-17 Thread Patrick Woody (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802958#comment-14802958
 ] 

Patrick Woody commented on SPARK-10635:
---

For a bit of motivation - we have a long-running SparkContext that essentially 
acts as a query server with many clients in IPython Notebook. 

We want to keep the driver on a different box from the python kernels to 
protect it from potentially resource-heavy python processes (we've had OOM 
killer issues in the past). It seems reasonable via py4j, but we are running 
into the above issues post-1.4.

> pyspark - running on a different host
> -
>
> Key: SPARK-10635
> URL: https://issues.apache.org/jira/browse/SPARK-10635
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Ben Duffield
>
> At various points we assume we only ever talk to a driver on the same host.
> e.g. 
> https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615
> We use pyspark to connect to an existing driver (i.e. do not let pyspark 
> launch the driver itself, but instead construct the SparkContext with the 
> gateway and jsc arguments).
> There are a few reasons for this, but essentially it's to allow more 
> flexibility when running in AWS.
> Before 1.3.1 we were able to monkeypatch around this:  
> {code}
> def _load_from_socket(port, serializer):
> sock = socket.socket()
> sock.settimeout(3)
> try:
> sock.connect((host, port))
> rf = sock.makefile("rb", 65536)
> for item in serializer.load_stream(rf):
> yield item
> finally:
> sock.close()
> pyspark.rdd._load_from_socket = _load_from_socket
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2015-09-17 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802968#comment-14802968
 ] 

Jacek Lewandowski commented on SPARK-6028:
--

Hey - what's the estimated date of delivery of this feature?

> Provide an alternative RPC implementation based on the network transport 
> module
> ---
>
> Key: SPARK-6028
> URL: https://issues.apache.org/jira/browse/SPARK-6028
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Critical
>
> Network transport module implements a low level RPC interface. We can build a 
> new RPC implementation on top of that to replace Akka's.
> Design document: 
> https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2015-09-17 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802993#comment-14802993
 ] 

Shixiong Zhu commented on SPARK-6028:
-

I'm working on it. It will be delivered in 1.6.0.

> Provide an alternative RPC implementation based on the network transport 
> module
> ---
>
> Key: SPARK-6028
> URL: https://issues.apache.org/jira/browse/SPARK-6028
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Critical
>
> Network transport module implements a low level RPC interface. We can build a 
> new RPC implementation on top of that to replace Akka's.
> Design document: 
> https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10662) Code snippets are not properly formatted in docs

2015-09-17 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-10662:

Attachment: spark-docs-backticks-tables.png

> Code snippets are not properly formatted in docs
> 
>
> Key: SPARK-10662
> URL: https://issues.apache.org/jira/browse/SPARK-10662
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Jacek Laskowski
> Attachments: spark-docs-backticks-tables.png
>
>
> Backticks (markdown) in tables are not processed and hence not formatted 
> properly. See 
> http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html
>  and search for {{`yarn-client`}}.
> As per [Sean's 
> suggestion|https://github.com/apache/spark/pull/8795#issuecomment-141019047] 
> I'm creating the JIRA task.
> {quote}
> This is a good fix, but this is another instance where I suspect the same 
> issue exists in several markup files, like configuration.html. It's worth a 
> JIRA since I think catching and fixing all of these is one non-trivial 
> logical change.
> If you can, avoid whitespace changes like stripping or adding space at the 
> end of lines. It just adds to the diff and makes for a tiny extra chance of 
> merge conflicts.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2622) Add Jenkins build numbers to SparkQA messages

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803073#comment-14803073
 ] 

Maximilian Michels commented on SPARK-2622:
---

User 'HuangWHWHW' has created a pull request for this issue:
https://github.com/apache/flink/pull/1098

> Add Jenkins build numbers to SparkQA messages
> -
>
> Key: SPARK-2622
> URL: https://issues.apache.org/jira/browse/SPARK-2622
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.0.1
>Reporter: Xiangrui Meng
>Priority: Minor
>
> It takes Jenkins 2 hours to finish testing. It is possible to have the 
> following:
> {code}
> Build 1 started.
> PR updated.
> Build 2 started.
> Build 1 finished successfully.
> A committer merged the PR because the last build seemed to be okay.
> Build 2 failed.
> {code}
> It would be nice to put the build number in the SparkQA message so it is easy 
> to match the result with the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1851) Upgrade Avro dependency to 1.7.6 so Spark can read Avro files

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803076#comment-14803076
 ] 

Maximilian Michels commented on SPARK-1851:
---

User 'aljoscha' has created a pull request for this issue:
https://github.com/apache/flink/pull/592

> Upgrade Avro dependency to 1.7.6 so Spark can read Avro files
> -
>
> Key: SPARK-1851
> URL: https://issues.apache.org/jira/browse/SPARK-1851
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Critical
> Fix For: 1.0.0
>
>
> I tried to set up a basic example getting a Spark job to read an Avro 
> container file with Avro specifics.  This results in a 
> ClassNotFoundException: can't convert GenericData.Record to 
> com.cloudera.sparkavro.User.
> The reason is:
> * When creating records, to decide whether to be specific or generic, Avro 
> tries to load a class with the name specified in the schema.
> * Initially, executors just have the system jars (which include Avro), and 
> load the app jars dynamically with a URLClassLoader that's set as the context 
> classloader for the task threads.
> * Avro tries to load the generated classes with 
> SpecificData.class.getClassLoader(), which sidesteps this URLClassLoader and 
> goes up to the AppClassLoader.
> Avro 1.7.6 has a change (AVRO-987) that falls back to the Thread's context 
> classloader when the SpecificData.class.getClassLoader() fails.  I tested 
> with Avro 1.7.6 and did not observe the problem.
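The AVRO-987 change amounts to a classloader fallback of the following shape; 
this is a simplified sketch of the pattern described above, not Avro's exact 
code.

{code}
// Simplified sketch of the fallback described above: try the defining classloader
// first, then fall back to the thread's context classloader (which, on executors,
// is the URLClassLoader holding the application jars).
object SpecificClassLoaderSketch {
  def loadRecordClass(name: String): Class[_] =
    try {
      Class.forName(name, true, getClass.getClassLoader)
    } catch {
      case _: ClassNotFoundException =>
        Class.forName(name, true, Thread.currentThread().getContextClassLoader)
    }
}
{code}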



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2410) Thrift/JDBC Server

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803075#comment-14803075
 ] 

Maximilian Michels commented on SPARK-2410:
---

User 'twalthr' has created a pull request for this issue:
https://github.com/apache/flink/pull/943

> Thrift/JDBC Server
> --
>
> Key: SPARK-2410
> URL: https://issues.apache.org/jira/browse/SPARK-2410
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.1.0
>
>
> We have this, but need to make sure that it gets merged into master before 
> the 1.1 release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2659) HiveQL: Division operator should always perform fractional division

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803067#comment-14803067
 ] 

Maximilian Michels commented on SPARK-2659:
---

User 'greghogan' has created a pull request for this issue:
https://github.com/apache/flink/pull/1130

> HiveQL: Division operator should always perform fractional division
> ---
>
> Key: SPARK-2659
> URL: https://issues.apache.org/jira/browse/SPARK-2659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Minor
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2637) PEP8 Compliance pull request #1540

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803065#comment-14803065
 ] 

Maximilian Michels commented on SPARK-2637:
---

User 'tillrohrmann' has created a pull request for this issue:
https://github.com/apache/flink/pull/1134

> PEP8 Compliance pull request #1540
> --
>
> Key: SPARK-2637
> URL: https://issues.apache.org/jira/browse/SPARK-2637
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Reporter: Vincent Ohprecio
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2537) Workaround Timezone specific Hive tests

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803069#comment-14803069
 ] 

Maximilian Michels commented on SPARK-2537:
---

User 'chenliang613' has created a pull request for this issue:
https://github.com/apache/flink/pull/1123

> Workaround Timezone specific Hive tests
> ---
>
> Key: SPARK-2537
> URL: https://issues.apache.org/jira/browse/SPARK-2537
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.1, 1.1.0
>Reporter: Cheng Lian
>Priority: Minor
> Fix For: 1.1.0
>
>
> Several Hive tests in {{HiveCompatibilitySuite}} are timezone sensitive:
> - {{timestamp_1}}
> - {{timestamp_2}}
> - {{timestamp_3}}
> - {{timestamp_udf}}
> Their answers differ between different timezones. Caching golden answers 
> naively causes build failures in other timezones. Currently these tests are 
> blacklisted. A not-so-clever solution is to cache golden answers for all 
> timezones for these tests, then select the right version for the current 
> build according to the system timezone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2653) Heap size should be the sum of driver.memory and executor.memory in local mode

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803071#comment-14803071
 ] 

Maximilian Michels commented on SPARK-2653:
---

User 'greghogan' has created a pull request for this issue:
https://github.com/apache/flink/pull/1115

> Heap size should be the sum of driver.memory and executor.memory in local mode
> --
>
> Key: SPARK-2653
> URL: https://issues.apache.org/jira/browse/SPARK-2653
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Davies Liu
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In local mode, the driver and executor run in the same JVM, so the heap size 
> of the JVM should be the sum of spark.driver.memory and spark.executor.memory.
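A quick worked example of the proposed sizing rule (the values are illustrative 
only):

{code}
// Illustrative values only.
val driverMemoryMb   = 1024                              // spark.driver.memory   = 1g
val executorMemoryMb = 2048                              // spark.executor.memory = 2g
val localHeapMb      = driverMemoryMb + executorMemoryMb // => launch the JVM with -Xmx3g
{code}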



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2613) CLONE - word2vec: Distributed Representation of Words

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803072#comment-14803072
 ] 

Maximilian Michels commented on SPARK-2613:
---

User 'nikste' has created a pull request for this issue:
https://github.com/apache/flink/pull/1106

> CLONE - word2vec: Distributed Representation of Words
> -
>
> Key: SPARK-2613
> URL: https://issues.apache.org/jira/browse/SPARK-2613
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yifan Yang
>Assignee: Xiangrui Meng
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> We would like to add a parallel implementation of word2vec to MLlib. word2vec 
> finds distributed representations of words through training on large data 
> sets. The Spark programming model fits nicely with word2vec, as the training 
> algorithm of word2vec is embarrassingly parallel. We will focus on the 
> skip-gram model and negative sampling in our initial implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2591) Add config property to disable incremental collection used in Thrift server

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803070#comment-14803070
 ] 

Maximilian Michels commented on SPARK-2591:
---

User 'willmiao' has created a pull request for this issue:
https://github.com/apache/flink/pull/1121

> Add config property to disable incremental collection used in Thrift server
> ---
>
> Key: SPARK-2591
> URL: https://issues.apache.org/jira/browse/SPARK-2591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Priority: Minor
>
> {{SparkSQLOperationManager}} uses {{RDD.toLocalIterator}} to collect the 
> result set one partition at a time. This is useful to avoid OOM when the 
> result is large, but introduces extra job scheduling costs as each partition 
> is collected with a separate job. Users may want to disable this when the 
> result set is expected to be small.
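The trade-off described above, sketched with the two collection strategies. The 
boolean flag stands in for whatever configuration property the change 
introduces; the property itself is not shown or named here.

{code}
import org.apache.spark.rdd.RDD

object ResultCollectionSketch {
  // Sketch of the two strategies being weighed. `incremental` stands in for a
  // config-driven choice; the actual property name is defined by the change itself.
  def collectRows[T](rdd: RDD[T], incremental: Boolean): Iterator[T] =
    if (incremental) rdd.toLocalIterator // one job per partition, bounded driver memory
    else rdd.collect().iterator          // a single job, whole result held on the driver
}
{code}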



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2566) Update ShuffleWriteMetrics as data is written

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803064#comment-14803064
 ] 

Maximilian Michels commented on SPARK-2566:
---

User 'mjsax' has created a pull request for this issue:
https://github.com/apache/flink/pull/1135

> Update ShuffleWriteMetrics as data is written
> -
>
> Key: SPARK-2566
> URL: https://issues.apache.org/jira/browse/SPARK-2566
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Sandy Ryza
> Fix For: 1.1.0
>
>
> This will allow reporting incremental progress once we have SPARK-2099.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803074#comment-14803074
 ] 

Maximilian Michels commented on SPARK-2557:
---

User 'zentol' has created a pull request for this issue:
https://github.com/apache/flink/pull/1045

> createTaskScheduler should be consistent between local and local-n-failures 
> 
>
> Key: SPARK-2557
> URL: https://issues.apache.org/jira/browse/SPARK-2557
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Xianjin YE
>Priority: Minor
>  Labels: starter
> Fix For: 1.1.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to 
> estimate the number of cores on the machine. I think we should also be able 
> to use * in the local-n-failures mode.
> And judging from the LOCAL_N_REGEX pattern-matching code, I believe the 
> regular expression LOCAL_N_REGEX is wrong. LOCAL_N_REGEX 
> should be 
> {code}
> """local\[([0-9]+|\*)\]""".r
> {code} 
> rather than
> {code}
>  """local\[([0-9\*]+)\]""".r
> {code}
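A quick way to see the difference between the two patterns quoted above, as a 
standalone check independent of Spark:

{code}
object LocalRegexCheck {
  val current  = """local\[([0-9\*]+)\]""".r   // also accepts strings like "local[1*2]"
  val proposed = """local\[([0-9]+|\*)\]""".r  // accepts digits or a single '*'

  def main(args: Array[String]): Unit =
    for (s <- Seq("local[4]", "local[*]", "local[1*2]"))
      println(s"$s -> current=${current.findFirstIn(s).isDefined} " +
              s"proposed=${proposed.findFirstIn(s).isDefined}")
}
{code}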



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2641) Spark submit doesn't pick up executor instances from properties file

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803068#comment-14803068
 ] 

Maximilian Michels commented on SPARK-2641:
---

User 'mxm' has created a pull request for this issue:
https://github.com/apache/flink/pull/1129

> Spark submit doesn't pick up executor instances from properties file
> 
>
> Key: SPARK-2641
> URL: https://issues.apache.org/jira/browse/SPARK-2641
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Kanwaljit Singh
>
> When running spark-submit in YARN cluster mode, we provide a properties file 
> using the --properties-file option.
> spark.executor.instances=5
> spark.executor.memory=2120m
> spark.executor.cores=3
> The submitted job picks up the cores and memory, but not the correct 
> instances.
> I think the issue is here in org.apache.spark.deploy.SparkSubmitArguments:
> // Use properties file as fallback for values which have a direct analog to
> // arguments in this script.
> master = 
> Option(master).getOrElse(defaultProperties.get("spark.master").orNull)
> executorMemory = Option(executorMemory)
>   .getOrElse(defaultProperties.get("spark.executor.memory").orNull)
> executorCores = Option(executorCores)
>   .getOrElse(defaultProperties.get("spark.executor.cores").orNull)
> totalExecutorCores = Option(totalExecutorCores)
>   .getOrElse(defaultProperties.get("spark.cores.max").orNull)
> name = 
> Option(name).getOrElse(defaultProperties.get("spark.app.name").orNull)
> jars = Option(jars).getOrElse(defaultProperties.get("spark.jars").orNull)
> Along with these defaults, we should also set a default for the number of 
> executor instances:
> numExecutors=Option(numExecutors).getOrElse(defaultProperties.get("spark.executor.instances").orNull)
> PS: spark.executor.instances is also not mentioned on 
> http://spark.apache.org/docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2640) In "local[N]", free cores of the only executor should be touched by "spark.task.cpus" for every finish/start-up of tasks.

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803066#comment-14803066
 ] 

Maximilian Michels commented on SPARK-2640:
---

User 'mxm' has created a pull request for this issue:
https://github.com/apache/flink/pull/1132

> In "local[N]", free cores of the only executor should be touched by 
> "spark.task.cpus" for every finish/start-up of tasks.
> -
>
> Key: SPARK-2640
> URL: https://issues.apache.org/jira/browse/SPARK-2640
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: woshilaiceshide
>Assignee: woshilaiceshide
>Priority: Minor
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803079#comment-14803079
 ] 

Maximilian Michels commented on SPARK-2691:
---

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/flink/pull/1140

> Allow Spark on Mesos to be launched with Docker
> ---
>
> Key: SPARK-2691
> URL: https://issues.apache.org/jira/browse/SPARK-2691
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Timothy Chen
>Assignee: Chris Heller
>  Labels: mesos
> Fix For: 1.4.0
>
> Attachments: spark-docker.patch
>
>
> Currently, to launch Spark with Mesos one must upload a tarball and specify 
> the executor URI to be passed in, which is then downloaded on each slave, or 
> even on each execution, depending on whether coarse-grained mode is used.
> We want to make Spark able to support launching Executors via a Docker image 
> that utilizes the recent Docker and Mesos integration work. 
> With the recent integration, Spark can simply specify a Docker image and the 
> options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2357) HashFilteredJoin doesn't match some equi-join query

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803080#comment-14803080
 ] 

Maximilian Michels commented on SPARK-2357:
---

User 'StephanEwen' has created a pull request for this issue:
https://github.com/apache/flink/pull/1139

> HashFilteredJoin doesn't match some equi-join query
> ---
>
> Key: SPARK-2357
> URL: https://issues.apache.org/jira/browse/SPARK-2357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Zongheng Yang
>Priority: Minor
>
> For instance, this query:
> hql("""SELECT * FROM src a JOIN src b ON a.key = 238""") 
> is a case where the HashFilteredJoin pattern doesn't match.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2690) Make unidoc part of our test process

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803078#comment-14803078
 ] 

Maximilian Michels commented on SPARK-2690:
---

User 'tillrohrmann' has created a pull request for this issue:
https://github.com/apache/flink/pull/1141

> Make unidoc part of our test process
> 
>
> Key: SPARK-2690
> URL: https://issues.apache.org/jira/browse/SPARK-2690
> Project: Spark
>  Issue Type: Test
>  Components: Build, Documentation
>Reporter: Yin Huai
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2595) The driver run garbage collection, when the executor throws OutOfMemoryError exception

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803082#comment-14803082
 ] 

Maximilian Michels commented on SPARK-2595:
---

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/flink/pull/1137

> The driver run garbage collection, when the executor throws OutOfMemoryError 
> exception
> --
>
> Key: SPARK-2595
> URL: https://issues.apache.org/jira/browse/SPARK-2595
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Guoqiang Li
>
> [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implemented 
> GC-based cleaning that only considers the memory usage of the driver. We should 
> consider more factors to trigger GC, e.g. executor exit code, task exception, 
> task GC time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2689) Remove use of println in ActorHelper

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803083#comment-14803083
 ] 

Maximilian Michels commented on SPARK-2689:
---

User 'fhueske' has created a pull request for this issue:
https://github.com/apache/flink/pull/1136

> Remove use of println in ActorHelper
> 
>
> Key: SPARK-2689
> URL: https://issues.apache.org/jira/browse/SPARK-2689
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Matei Zaharia
>Priority: Minor
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file

2015-09-17 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803081#comment-14803081
 ] 

Maximilian Michels commented on SPARK-2576:
---

User 'jkovacs' has created a pull request for this issue:
https://github.com/apache/flink/pull/1138

> slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark 
> QL query on HDFS CSV file
> --
>
> Key: SPARK-2576
> URL: https://issues.apache.org/jira/browse/SPARK-2576
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.0.1
> Environment: One Mesos 0.19 master without zookeeper and 4 mesos 
> slaves. 
> JDK 1.7.51 and Scala 2.10.4 on all nodes. 
> HDFS from CDH5.0.3
> Spark version: I tried both with the pre-built CDH5 spark package available 
> from http://spark.apache.org/downloads.html and by packaging spark with sbt 
> 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here 
> http://mesosphere.io/learn/run-spark-on-mesos/
> All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have 
>Reporter: Svend Vanderveken
>Assignee: Prashant Sharma
>Priority: Blocker
> Fix For: 1.0.2, 1.1.0
>
>
> Execution of a SQL query against HDFS systematically throws a class-not-found 
> exception on slave nodes.
> (this was originally reported on the user list: 
> http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html)
> Sample code (ran from spark-shell): 
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Car(timestamp: Long, objectid: String, isGreen: Boolean)
> // I get the same error when pointing to the folder 
> "hdfs://vm28:8020/test/cardata"
> val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0")
> val cars = data.map(_.split(",")).map ( ar => Car(ar(0).toLong, ar(1), 
> ar(2).toBoolean))
> cars.registerAsTable("mcars")
> val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = 
> true")
> allgreens.collect.take(10).foreach(println)
> {code}
> Stack trace on the slave nodes: 
> {code}
> I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0
> I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 
> 20140714-142853-485682442-5050-25487-2
> 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as 
> executor ID 20140714-142853-485682442-5050-25487-2
> 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: 
> mesos,mnubohadoop
> 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(mesos, 
> mnubohadoop)
> 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started
> 14/07/16 13:01:17 INFO Remoting: Starting remoting
> 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://spark@vm23:38230]
> 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: 
> [akka.tcp://spark@vm23:38230]
> 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: 
> akka.tcp://spark@vm28:41632/user/MapOutputTracker
> 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: 
> akka.tcp://spark@vm28:41632/user/BlockManagerMaster
> 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-local-20140716130117-8ea0
> 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 
> MB.
> 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id 
> = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501)
> 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager
> 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager
> 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is 
> /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633
> 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server
> 14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28:33973
> 14/07/16 13:01:18 INFO Executor: Running task ID 2
> 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0
> 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with 
> curMem=0, maxMem=309225062
> 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to 
> memory (estimated size 122.6 KB, free 294.8 MB)
> 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 
> 0.294602722 s
> 14/07/16 13:01:19 INFO HadoopRDD: Input split: 
> hdfs://vm28:8020/test/cardata/part-0:23960450+23960451
> I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown
> 14/07/16 13:01:20 ERROR Executor: Exception 

[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2015-09-17 Thread Martin Tapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803095#comment-14803095
 ] 

Martin Tapp commented on SPARK-2691:


This pull request seems unrelated (Python broken links).

> Allow Spark on Mesos to be launched with Docker
> ---
>
> Key: SPARK-2691
> URL: https://issues.apache.org/jira/browse/SPARK-2691
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Timothy Chen
>Assignee: Chris Heller
>  Labels: mesos
> Fix For: 1.4.0
>
> Attachments: spark-docker.patch
>
>
> Currently, to launch Spark with Mesos one must upload a tarball and specify 
> the executor URI to be passed in, which is then downloaded on each slave (or 
> even on each execution, depending on whether coarse-grained mode is used).
> We want to make Spark able to support launching Executors via a Docker image 
> that utilizes the recent Docker and Mesos integration work. 
> With the recent integration, Spark can simply specify a Docker image and 
> the options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10664) JDBC DataFrameWriter does not save data to Oracle 11 Database

2015-09-17 Thread Dmitriy Atorin (JIRA)
Dmitriy Atorin created SPARK-10664:
--

 Summary: JDBC DataFrameWriter does not save data to Oracle 11 
Database
 Key: SPARK-10664
 URL: https://issues.apache.org/jira/browse/SPARK-10664
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Dmitriy Atorin
Priority: Critical


The issue is that Oracle 11 and earlier do not support the LIMIT clause.
The problem is here:
1. Go to org.apache.spark.sql.execution.datasources.jdbc
2. object JdbcUtils
3.
def tableExists(conn: Connection, table: String): Boolean = {
  // Somewhat hacky, but there isn't a good way to identify whether a table exists for all
  // SQL database systems, considering "table" could also include the database name.
  Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 1").executeQuery().next()).isSuccess
}

I think it is better to write it this way:

s"SELECT count(*) FROM $table WHERE 1=0"




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10284) Add @since annotation to pyspark.ml.tuning

2015-09-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10284.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8694
[https://github.com/apache/spark/pull/8694]

> Add @since annotation to pyspark.ml.tuning
> --
>
> Key: SPARK-10284
> URL: https://issues.apache.org/jira/browse/SPARK-10284
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10283) Add @since annotation to pyspark.ml.regression

2015-09-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10283.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8693
[https://github.com/apache/spark/pull/8693]

> Add @since annotation to pyspark.ml.regression
> --
>
> Key: SPARK-10283
> URL: https://issues.apache.org/jira/browse/SPARK-10283
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10281) Add @since annotation to pyspark.ml.clustering

2015-09-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10281.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8691
[https://github.com/apache/spark/pull/8691]

> Add @since annotation to pyspark.ml.clustering
> --
>
> Key: SPARK-10281
> URL: https://issues.apache.org/jira/browse/SPARK-10281
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10278) Add @since annotation to pyspark.mllib.tree

2015-09-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10278.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8685
[https://github.com/apache/spark/pull/8685]

> Add @since annotation to pyspark.mllib.tree
> ---
>
> Key: SPARK-10278
> URL: https://issues.apache.org/jira/browse/SPARK-10278
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10279) Add @since annotation to pyspark.mllib.util

2015-09-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10279.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8689
[https://github.com/apache/spark/pull/8689]

> Add @since annotation to pyspark.mllib.util
> ---
>
> Key: SPARK-10279
> URL: https://issues.apache.org/jira/browse/SPARK-10279
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10274) Add @since annotation to pyspark.mllib.fpm

2015-09-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10274.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8665
[https://github.com/apache/spark/pull/8665]

> Add @since annotation to pyspark.mllib.fpm
> --
>
> Key: SPARK-10274
> URL: https://issues.apache.org/jira/browse/SPARK-10274
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10282) Add @since annotation to pyspark.ml.recommendation

2015-09-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10282.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8692
[https://github.com/apache/spark/pull/8692]

> Add @since annotation to pyspark.ml.recommendation
> --
>
> Key: SPARK-10282
> URL: https://issues.apache.org/jira/browse/SPARK-10282
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10663) Change test.toDF to test in Spark ML Programming Guide

2015-09-17 Thread Matt Hagen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803154#comment-14803154
 ] 

Matt Hagen commented on SPARK-10663:


I'm curious why we need to convert the DataFrame into another DataFrame:

model.transform(test.toDF)

Also, the other examples in the Spark ML Programming Guide don't convert:

model2.transform(test)
cvModel.transform(test)
model.transform(test)

Thanks, much.

> Change test.toDF to test in Spark ML Programming Guide
> --
>
> Key: SPARK-10663
> URL: https://issues.apache.org/jira/browse/SPARK-10663
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Matt Hagen
>Priority: Trivial
>
> Spark 1.5.0 > Spark ML Programming Guide > Example: Pipeline
> I believe model.transform(test.toDF) should be model.transform(test).
> Note that "test" is already a DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10077) Java package doc for spark.ml.feature

2015-09-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10077.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8740
[https://github.com/apache/spark/pull/8740]

> Java package doc for spark.ml.feature
> -
>
> Key: SPARK-10077
> URL: https://issues.apache.org/jira/browse/SPARK-10077
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: holdenk
>Priority: Minor
> Fix For: 1.6.0
>
>
> Should be the same as SPARK-7808 but use Java for the code example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10459) PythonUDF could process UnsafeRow

2015-09-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10459.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8616
[https://github.com/apache/spark/pull/8616]

> PythonUDF could process UnsafeRow
> -
>
> Key: SPARK-10459
> URL: https://issues.apache.org/jira/browse/SPARK-10459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
> Fix For: 1.6.0
>
>
> Currently, there will be a ConvertToSafe for PythonUDF, which is not actually 
> needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2622) Add Jenkins build numbers to SparkQA messages

2015-09-17 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803169#comment-14803169
 ] 

Nicholas Chammas commented on SPARK-2622:
-

[~mxm] - I noticed you have been posting this kind of message on several Spark 
JIRAs (with a link to an unrelated Flink PR). They appear to be mistakes made 
by some automated bot. Please correct this issue.

> Add Jenkins build numbers to SparkQA messages
> -
>
> Key: SPARK-2622
> URL: https://issues.apache.org/jira/browse/SPARK-2622
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.0.1
>Reporter: Xiangrui Meng
>Priority: Minor
>
> It takes Jenkins 2 hours to finish testing. It is possible to have the 
> following:
> {code}
> Build 1 started.
> PR updated.
> Build 2 started.
> Build 1 finished successfully.
> A committer merged the PR because the last build seemed to be okay.
> Build 2 failed.
> {code}
> It would be nice to put the build number in the SparkQA message so it is easy 
> to match the result with the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet

2015-09-17 Thread Matt Massie (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Massie updated SPARK-7263:
---
Component/s: (was: Block Manager)
 Shuffle

> Add new shuffle manager which stores shuffle blocks in Parquet
> --
>
> Key: SPARK-7263
> URL: https://issues.apache.org/jira/browse/SPARK-7263
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: Matt Massie
>
> I have a working prototype of this feature that can be viewed at
> https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1
> Setting the "spark.shuffle.manager" to "parquet" enables this shuffle manager.
> The dictionary support that Parquet provides appreciably reduces the amount of
> memory that objects use; however, once Parquet data is shuffled, all the
> dictionary information is lost and the column-oriented data is written to 
> shuffle
> blocks in a record-oriented fashion. This shuffle manager addresses this issue
> by reading and writing all shuffle blocks in the Parquet format.
> If shuffle objects are Avro records, then the Avro $SCHEMA is converted to 
> Parquet
> schema and used directly, otherwise, the Parquet schema is generated via 
> reflection.
> Currently, the only non-Avro keys supported are primitive types. The reflection
> code can be improved (or replaced) to support complex records.
> The ParquetShufflePair class allows the shuffle key and value to be stored in
> Parquet blocks as a single record with a single schema.
> This commit adds the following new Spark configuration options:
> "spark.shuffle.parquet.compression" - sets the Parquet compression codec
> "spark.shuffle.parquet.blocksize" - sets the Parquet block size
> "spark.shuffle.parquet.pagesize" - set the Parquet page size
> "spark.shuffle.parquet.enabledictionary" - turns dictionary encoding on/off
> Parquet does not (and has no plans to) support a streaming API. Metadata 
> sections
> are scattered through a Parquet file making a streaming API difficult. As 
> such,
> the ShuffleBlockFetcherIterator has been modified to fetch the entire contents
> of map outputs into temporary blocks before loading the data into the reducer.
> Interesting future asides:
> o There is no need to define a data serializer (although Spark requires it)
> o Parquet supports predicate pushdown and projection, which could be used
>   between shuffle stages to improve performance in the future
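For readers trying the prototype, a hedged sketch of how the options listed in the description could be wired up. The keys come from the description above; the values are illustrative placeholders, and none of these keys exist in stock Spark.

{code}
import org.apache.spark.SparkConf

// Sketch only: these keys are defined by the prototype branch described above,
// not by stock Spark; the values are placeholders.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "parquet")
  .set("spark.shuffle.parquet.compression", "snappy")
  .set("spark.shuffle.parquet.blocksize", (64 * 1024 * 1024).toString)
  .set("spark.shuffle.parquet.pagesize", (1024 * 1024).toString)
  .set("spark.shuffle.parquet.enabledictionary", "true")
{code}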



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2015-09-17 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803184#comment-14803184
 ] 

Imran Rashid commented on SPARK-10620:
--

I think you've done a good job of summarizing the key issues to consider.

Can I ask that we back up one step further, and start by asking what properties 
we want from our metric system?  I'm not at all in love with the current 
TaskMetrics, but I just don't see accumulators as a good replacement.  Because 
accumulators are a public API, we are kind of stuck with the current semantics. 
 We get a bit of wiggle room with internal accumulators, but not a lot.

What are the things we dislike about TaskMetrics?  I think it's:

(a) there is a ton of boilerplate that you need to write for every new metric, 
making adding each one a huge pain
(b) it's a nuisance to filter the metrics for the common use cases -- e.g., it's 
easy to accidentally overcount when there is task failure or speculation, etc.

Some other key differences from accumulators -- I think these are an advantage 
of TaskMetrics, but maybe others see them as a disadvantage?

(c) the metrics are strongly typed, both because the name of the metric is a 
member, so typos like "metrics.shuffleRaedMetrics" are compile errors, and 
because the value is strongly typed, not just the toString of something.
(d) metrics can be aggregated in a variety of ways.  E.g., you can get the sum 
of the metric across tasks, the distribution, a timeline of partial sums, etc.  
You could do this with the individual values of accumulators too, but it's worth 
pointing out that if this is what you use them for, they aren't really 
"accumulating", they're just per-task holders.

I feel like there are other designs we could consider that get around the 
current limitations.  For example, if each metric were keyed by an enum, and 
they were stored in an EnumMap, then you'd get easy iteration so you could 
eliminate lots of boilerplate (a), it'd be easier to write utility functions 
for common filters (b), and you'd still get type safety (c) and flexibility in 
aggregation (d).  I've been told I have an obsession with EnumMaps, so maybe 
others won't be as keen on them -- but my main point is simply that I don't 
think we have only two alternatives here, and I'd prefer we take the time to 
consider this more completely. (Just the conversion back and forth to strings is 
enough to make me feel like accumulators are a kludge.)
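To make that alternative concrete, here is a minimal, hypothetical Scala sketch of the keyed-metrics idea. All names are invented, and a sealed trait plus an ordinary Map stands in for the java.util.EnumMap mentioned above.

{code}
// Hypothetical sketch only -- not Spark code. Keys are real identifiers, so a
// typo is a compile error (c); iterating over the map removes per-metric
// boilerplate (a) and makes common filters (b) and aggregation (d) generic.
sealed trait MetricKey
case object ShuffleBytesRead extends MetricKey
case object HdfsBytesRead extends MetricKey

final case class TaskMetricValues(values: Map[MetricKey, Long]) {
  def merge(other: TaskMetricValues): TaskMetricValues =
    TaskMetricValues(
      (values.keySet ++ other.values.keySet).map { k =>
        k -> (values.getOrElse(k, 0L) + other.values.getOrElse(k, 0L))
      }.toMap)
}

// Aggregating across tasks is just a fold over per-task holders.
val perTask = Seq(
  TaskMetricValues(Map(ShuffleBytesRead -> 10L, HdfsBytesRead -> 100L)),
  TaskMetricValues(Map(ShuffleBytesRead -> 5L)))
val total = perTask.reduce(_ merge _)
{code}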

I also want to point out that it's really hard to get things right for failures, 
not because it's hard to implement, but because it's hard to decide what the 
right *semantics* should be.  For instance:

* If there is stage failure, and a stage is resubmitted but with only a small 
subset of tasks, what should the aggregated value be in the UI?  The value of 
just that stage attempt?  Or should it aggregate over all attempts?  Or 
aggregate in such a way that each *partition* is only counted once, favoring 
the most recent successful attempt for each partition?  There is a case to be 
made for all three.
* Suppose a user is comparing how much data is read from hdfs from two 
different runs of a job -- one with speculative execution & intermittent task 
failure, and the other without either (a "normal" run).  The average user would 
likely want to see the same amount of data read from hdfs in both jobs.  OTOH, 
they are actually reading different amounts of data.  While this difference may 
not get exposed in the standard web UI, do we want to let advanced users have 
any access to this difference, or is it an unsupported use case?  This isn't 
directly related to TaskMetrics vs. Accumulators, but goes to my overall point 
about considering the design.

Thanks for bringing this up; I think this is a great thing for us to be thinking 
about and working to improve.  I hope I'm not derailing the conversation too 
much.

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement cl

[jira] [Commented] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet

2015-09-17 Thread Matt Massie (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803185#comment-14803185
 ] 

Matt Massie commented on SPARK-7263:


The [Parquet shuffle PR](https://github.com/apache/spark/pull/7265) is ready 
for review now. 

> Add new shuffle manager which stores shuffle blocks in Parquet
> --
>
> Key: SPARK-7263
> URL: https://issues.apache.org/jira/browse/SPARK-7263
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: Matt Massie
>
> I have a working prototype of this feature that can be viewed at
> https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1
> Setting the "spark.shuffle.manager" to "parquet" enables this shuffle manager.
> The dictionary support that Parquet provides appreciably reduces the amount of
> memory that objects use; however, once Parquet data is shuffled, all the
> dictionary information is lost and the column-oriented data is written to 
> shuffle
> blocks in a record-oriented fashion. This shuffle manager addresses this issue
> by reading and writing all shuffle blocks in the Parquet format.
> If shuffle objects are Avro records, then the Avro $SCHEMA is converted to 
> Parquet
> schema and used directly, otherwise, the Parquet schema is generated via 
> reflection.
> Currently, the only non-Avro keys supported are primitive types. The reflection
> code can be improved (or replaced) to support complex records.
> The ParquetShufflePair class allows the shuffle key and value to be stored in
> Parquet blocks as a single record with a single schema.
> This commit adds the following new Spark configuration options:
> "spark.shuffle.parquet.compression" - sets the Parquet compression codec
> "spark.shuffle.parquet.blocksize" - sets the Parquet block size
> "spark.shuffle.parquet.pagesize" - set the Parquet page size
> "spark.shuffle.parquet.enabledictionary" - turns dictionary encoding on/off
> Parquet does not (and has no plans to) support a streaming API. Metadata 
> sections
> are scattered through a Parquet file making a streaming API difficult. As 
> such,
> the ShuffleBlockFetcherIterator has been modified to fetch the entire contents
> of map outputs into temporary blocks before loading the data into the reducer.
> Interesting future asides:
> o There is no need to define a data serializer (although Spark requires it)
> o Parquet supports predicate pushdown and projection, which could be used
>   between shuffle stages to improve performance in the future



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet

2015-09-17 Thread Matt Massie (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803185#comment-14803185
 ] 

Matt Massie edited comment on SPARK-7263 at 9/17/15 4:37 PM:
-

The [Parquet shuffle PR|https://github.com/apache/spark/pull/7265] is ready for 
review now. 


was (Author: massie):
The [Parquet shuffle PR](https://github.com/apache/spark/pull/7265) is ready 
for review now. 

> Add new shuffle manager which stores shuffle blocks in Parquet
> --
>
> Key: SPARK-7263
> URL: https://issues.apache.org/jira/browse/SPARK-7263
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: Matt Massie
>
> I have a working prototype of this feature that can be viewed at
> https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1
> Setting the "spark.shuffle.manager" to "parquet" enables this shuffle manager.
> The dictionary support that Parquet provides appreciably reduces the amount of
> memory that objects use; however, once Parquet data is shuffled, all the
> dictionary information is lost and the column-oriented data is written to 
> shuffle
> blocks in a record-oriented fashion. This shuffle manager addresses this issue
> by reading and writing all shuffle blocks in the Parquet format.
> If shuffle objects are Avro records, then the Avro $SCHEMA is converted to 
> Parquet
> schema and used directly, otherwise, the Parquet schema is generated via 
> reflection.
> Currently, the only non-Avro keys supported are primitive types. The reflection
> code can be improved (or replaced) to support complex records.
> The ParquetShufflePair class allows the shuffle key and value to be stored in
> Parquet blocks as a single record with a single schema.
> This commit adds the following new Spark configuration options:
> "spark.shuffle.parquet.compression" - sets the Parquet compression codec
> "spark.shuffle.parquet.blocksize" - sets the Parquet block size
> "spark.shuffle.parquet.pagesize" - set the Parquet page size
> "spark.shuffle.parquet.enabledictionary" - turns dictionary encoding on/off
> Parquet does not (and has no plans to) support a streaming API. Metadata 
> sections
> are scattered through a Parquet file making a streaming API difficult. As 
> such,
> the ShuffleBlockFetcherIterator has been modified to fetch the entire contents
> of map outputs into temporary blocks before loading the data into the reducer.
> Interesting future asides:
> o There is no need to define a data serializer (although Spark requires it)
> o Parquet supports predicate pushdown and projection, which could be used
>   between shuffle stages to improve performance in the future



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10660) Doc describe error in the "Running Spark on YARN" page

2015-09-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10660.

   Resolution: Fixed
 Assignee: yangping wu
Fix Version/s: 1.5.1
   1.6.0
   1.4.2

> Doc describe error in the "Running Spark on YARN" page
> --
>
> Key: SPARK-10660
> URL: https://issues.apache.org/jira/browse/SPARK-10660
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: yangping wu
>Assignee: yangping wu
>Priority: Trivial
> Fix For: 1.4.2, 1.6.0, 1.5.1
>
>
> In the *Configuration* section, the default values of 
> *spark.yarn.driver.memoryOverhead* and *spark.yarn.am.memoryOverhead* should be 
> "driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, with minimum 
> of 384" respectively, because as of Spark 1.4.0 the *MEMORY_OVERHEAD_FACTOR* is 
> set to 0.10, not 0.07.
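As a hedged illustration of the corrected default (not the actual Spark code), the formula works out as follows:

{code}
// Illustration only: default overhead = max(0.10 * memory, 384) MB.
def defaultMemoryOverheadMb(memoryMb: Int): Int =
  math.max((memoryMb * 0.10).toInt, 384)

defaultMemoryOverheadMb(4096)  // 409 MB for a 4 GB driver or AM
defaultMemoryOverheadMb(1024)  // 384 MB, the floor
{code}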



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2537) Workaround Timezone specific Hive tests

2015-09-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803223#comment-14803223
 ] 

Yin Huai commented on SPARK-2537:
-

[~mxm] It seems you are trying to use 
https://github.com/apache/spark/blob/master/dev/github_jira_sync.py for Flink. 
Can you check your settings and make sure it is pointed at the correct project? 
Thanks!


> Workaround Timezone specific Hive tests
> ---
>
> Key: SPARK-2537
> URL: https://issues.apache.org/jira/browse/SPARK-2537
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.1, 1.1.0
>Reporter: Cheng Lian
>Priority: Minor
> Fix For: 1.1.0
>
>
> Several Hive tests in {{HiveCompatibilitySuite}} are timezone sensitive:
> - {{timestamp_1}}
> - {{timestamp_2}}
> - {{timestamp_3}}
> - {{timestamp_udf}}
> Their answers differ between different timezones. Caching golden answers 
> naively cause build failures in other timezones. Currently these tests are 
> blacklisted. A not so clever solution is to cache golden answers of all 
> timezones for these tests, then select the right version for the current 
> build according to system timezone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10565) New /api/v1/[path] APIs don't contain as much information as original /json API

2015-09-17 Thread Kevin Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803224#comment-14803224
 ] 

Kevin Chen commented on SPARK-10565:


To summarize what has been discussed up until now in a separate email thread on 
d...@spark.apache.org:

We plan to add the remaining information to the v1 API without incrementing the 
version number, because this change will only add more endpoints / more fields 
to existing endpoints.

Mark Hamstra has also requested adding endpoints that get results by jobGroup 
(cf. SparkContext#setJobGroup) instead of just a single job.

> New /api/v1/[path] APIs don't contain as much information as original /json 
> API 
> 
>
> Key: SPARK-10565
> URL: https://issues.apache.org/jira/browse/SPARK-10565
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API
>Affects Versions: 1.5.0
>Reporter: Kevin Chen
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> [SPARK-3454] introduced official JSON APIs at /api/v1/[path] for data that 
> originally appeared only on the web UI. However, they do not expose all the 
> information on the web UI or on the previous unofficial endpoint at /json.
> For example, the APIs at /api/v1/[path] do not show the number of cores or 
> amount of memory per slave for each job. These are stored in 
> ApplicationInfo.desc.maxCores and ApplicationInfo.desc.memoryPerSlave, 
> respectively. This information would be useful to expose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10642) Crash in rdd.lookup() with "java.lang.Long cannot be cast to java.lang.Integer"

2015-09-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10642.

   Resolution: Fixed
Fix Version/s: 1.2.3
   1.3.2
   1.4.2
   1.5.1
   1.6.0

Issue resolved by pull request 8796
[https://github.com/apache/spark/pull/8796]

> Crash in rdd.lookup() with "java.lang.Long cannot be cast to 
> java.lang.Integer"
> ---
>
> Key: SPARK-10642
> URL: https://issues.apache.org/jira/browse/SPARK-10642
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
> Environment: OSX
>Reporter: Thouis Jones
> Fix For: 1.6.0, 1.5.1, 1.4.2, 1.3.2, 1.2.3
>
>
> Running this command:
> {code}
> sc.parallelize([(('a', 'b'), 
> 'c')]).groupByKey().partitionBy(20).cache().lookup(('a', 'b'))
> {code}
> gives the following error:
> {noformat}
> 15/09/16 14:22:23 INFO SparkContext: Starting job: runJob at 
> PythonRDD.scala:361
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/rdd.py", 
> line 2199, in lookup
> return self.ctx.runJob(values, lambda x: x, [self.partitioner(key)])
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/context.py", 
> line 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/pyspark/sql/utils.py", 
> line 36, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/Cellar/apache-spark/1.5.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : java.lang.ClassCastException: java.lang.Long cannot be cast to 
> java.lang.Integer
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitJob$1.apply(DAGScheduler.scala:530)
>   at scala.collection.Iterator$class.find(Iterator.scala:780)
>   at scala.collection.AbstractIterator.find(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.find(IterableLike.scala:79)
>   at scala.collection.AbstractIterable.find(Iterable.scala:54)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:530)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:558)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839)
>   at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:361)
>   at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
>   at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10545) HiveMetastoreTypes.toMetastoreType should handle interval type

2015-09-17 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10545:
-
Target Version/s:   (was: 1.6.0, 1.5.1)

> HiveMetastoreTypes.toMetastoreType should handle interval type
> --
>
> Key: SPARK-10545
> URL: https://issues.apache.org/jira/browse/SPARK-10545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Minor
>
> We need to handle interval type at 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L946-L965.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10545) HiveMetastoreTypes.toMetastoreType should handle interval type

2015-09-17 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10545:
-
Priority: Minor  (was: Major)

> HiveMetastoreTypes.toMetastoreType should handle interval type
> --
>
> Key: SPARK-10545
> URL: https://issues.apache.org/jira/browse/SPARK-10545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Minor
>
> We need to handle interval type at 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L946-L965.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10545) HiveMetastoreTypes.toMetastoreType should handle interval type

2015-09-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803233#comment-14803233
 ] 

Yin Huai commented on SPARK-10545:
--

Seems Hive 1.2.1's parser does not allow interval as a column type when 
defining a table (see 
https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g#L2076-L2078).
 We can revisit this issue at a later time.

> HiveMetastoreTypes.toMetastoreType should handle interval type
> --
>
> Key: SPARK-10545
> URL: https://issues.apache.org/jira/browse/SPARK-10545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Minor
>
> We need to handle interval type at 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L946-L965.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10665) Connect the local iterators with the planner

2015-09-17 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-10665:
---

 Summary: Connect the local iterators with the planner
 Key: SPARK-10665
 URL: https://issues.apache.org/jira/browse/SPARK-10665
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


After creating these local iterators, we'd need to actually use them.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10666) Use properties from ActiveJob associated with a Stage

2015-09-17 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-10666:


 Summary: Use properties from ActiveJob associated with a Stage
 Key: SPARK-10666
 URL: https://issues.apache.org/jira/browse/SPARK-10666
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 1.5.0, 1.4.1
Reporter: Mark Hamstra
Assignee: Mark Hamstra


This issue was addressed in #5494, but the fix in that PR, while safe in the 
sense that it will prevent the SparkContext from shutting down, misses the 
actual bug. The intent of submitMissingTasks should be understood as "submit 
the Tasks that are missing for the Stage, and run them as part of the ActiveJob 
identified by jobId". Because of a long-standing bug, the jobId parameter was 
never being used. Instead, we were trying to use the jobId with which the Stage 
was created -- which may no longer exist as an ActiveJob, hence the crash 
reported in SPARK-6880.

The correct fix is to use the ActiveJob specified by the supplied jobId 
parameter, which is guaranteed to exist at the call sites of submitMissingTasks.

This fix should be applied to all maintenance branches, since it has existed 
since 1.0.

Tasks for a Stage that was previously part of a Job that is no longer active 
would be re-submitted as though they were part of the prior Job and with no 
properties set. Since properties are what are used to set an other-than-default 
scheduling pool, this would affect FAIR scheduler usage, but it would also 
affect anything else that depends on the settings of the properties (which 
would be just user code at this point, since Spark itself doesn't really use 
the properties for anything else other than Job Group and Description, which 
end up in the WebUI, can be used to kill by JobGroup, etc.) Even the default, 
FIFO scheduling would be affected, however, since the resubmission of the Tasks 
under the earlier jobId would effectively give them a higher priority/greater 
urgency than the ActiveJob that now actually needs them. In any event, the 
Tasks would generate correct results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10666) Use properties from ActiveJob associated with a Stage

2015-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803244#comment-14803244
 ] 

Apache Spark commented on SPARK-10666:
--

User 'markhamstra' has created a pull request for this issue:
https://github.com/apache/spark/pull/6291

> Use properties from ActiveJob associated with a Stage
> -
>
> Key: SPARK-10666
> URL: https://issues.apache.org/jira/browse/SPARK-10666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>
> This issue was addressed in #5494, but the fix in that PR, while safe in the 
> sense that it will prevent the SparkContext from shutting down, misses the 
> actual bug. The intent of submitMissingTasks should be understood as "submit 
> the Tasks that are missing for the Stage, and run them as part of the 
> ActiveJob identified by jobId". Because of a long-standing bug, the jobId 
> parameter was never being used. Instead, we were trying to use the jobId with 
> which the Stage was created -- which may no longer exist as an ActiveJob, 
> hence the crash reported in SPARK-6880.
> The correct fix is to use the ActiveJob specified by the supplied jobId 
> parameter, which is guaranteed to exist at the call sites of 
> submitMissingTasks.
> This fix should be applied to all maintenance branches, since it has existed 
> since 1.0.
> Tasks for a Stage that was previously part of a Job that is no longer active 
> would be re-submitted as though they were part of the prior Job and with no 
> properties set. Since properties are what are used to set an 
> other-than-default scheduling pool, this would affect FAIR scheduler usage, 
> but it would also affect anything else that depends on the settings of the 
> properties (which would be just user code at this point, since Spark itself 
> doesn't really use the properties for anything else other than Job Group and 
> Description, which end up in the WebUI, can be used to kill by JobGroup, 
> etc.) Even the default, FIFO scheduling would be affected, however, since the 
> resubmission of the Tasks under the earlier jobId would effectively give them 
> a higher priority/greater urgency than the ActiveJob that now actually needs 
> them. In any event, the Tasks would generate correct results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10666) Use properties from ActiveJob associated with a Stage

2015-09-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10666:


Assignee: Mark Hamstra  (was: Apache Spark)

> Use properties from ActiveJob associated with a Stage
> -
>
> Key: SPARK-10666
> URL: https://issues.apache.org/jira/browse/SPARK-10666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>
> This issue was addressed in #5494, but the fix in that PR, while safe in the 
> sense that it will prevent the SparkContext from shutting down, misses the 
> actual bug. The intent of submitMissingTasks should be understood as "submit 
> the Tasks that are missing for the Stage, and run them as part of the 
> ActiveJob identified by jobId". Because of a long-standing bug, the jobId 
> parameter was never being used. Instead, we were trying to use the jobId with 
> which the Stage was created -- which may no longer exist as an ActiveJob, 
> hence the crash reported in SPARK-6880.
> The correct fix is to use the ActiveJob specified by the supplied jobId 
> parameter, which is guaranteed to exist at the call sites of 
> submitMissingTasks.
> This fix should be applied to all maintenance branches, since it has existed 
> since 1.0.
> Tasks for a Stage that was previously part of a Job that is no longer active 
> would be re-submitted as though they were part of the prior Job and with no 
> properties set. Since properties are what are used to set an 
> other-than-default scheduling pool, this would affect FAIR scheduler usage, 
> but it would also affect anything else that depends on the settings of the 
> properties (which would be just user code at this point, since Spark itself 
> doesn't really use the properties for anything else other than Job Group and 
> Description, which end up in the WebUI, can be used to kill by JobGroup, 
> etc.) Even the default, FIFO scheduling would be affected, however, since the 
> resubmission of the Tasks under the earlier jobId would effectively give them 
> a higher priority/greater urgency than the ActiveJob that now actually needs 
> them. In any event, the Tasks would generate correct results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10666) Use properties from ActiveJob associated with a Stage

2015-09-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10666:


Assignee: Apache Spark  (was: Mark Hamstra)

> Use properties from ActiveJob associated with a Stage
> -
>
> Key: SPARK-10666
> URL: https://issues.apache.org/jira/browse/SPARK-10666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Mark Hamstra
>Assignee: Apache Spark
>
> This issue was addressed in #5494, but the fix in that PR, while safe in the 
> sense that it will prevent the SparkContext from shutting down, misses the 
> actual bug. The intent of submitMissingTasks should be understood as "submit 
> the Tasks that are missing for the Stage, and run them as part of the 
> ActiveJob identified by jobId". Because of a long-standing bug, the jobId 
> parameter was never being used. Instead, we were trying to use the jobId with 
> which the Stage was created -- which may no longer exist as an ActiveJob, 
> hence the crash reported in SPARK-6880.
> The correct fix is to use the ActiveJob specified by the supplied jobId 
> parameter, which is guaranteed to exist at the call sites of 
> submitMissingTasks.
> This fix should be applied to all maintenance branches, since it has existed 
> since 1.0.
> Tasks for a Stage that was previously part of a Job that is no longer active 
> would be re-submitted as though they were part of the prior Job and with no 
> properties set. Since properties are what are used to set an 
> other-than-default scheduling pool, this would affect FAIR scheduler usage, 
> but it would also affect anything else that depends on the settings of the 
> properties (which would be just user code at this point, since Spark itself 
> doesn't really use the properties for anything else other than Job Group and 
> Description, which end up in the WebUI, can be used to kill by JobGroup, 
> etc.) Even the default, FIFO scheduling would be affected, however, since the 
> resubmission of the Tasks under the earlier jobId would effectively give them 
> a higher priority/greater urgency than the ActiveJob that now actually needs 
> them. In any event, the Tasks would generate correct results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10172) History Server web UI gets messed up when sorting on any column

2015-09-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-10172:
---
Labels: regression  (was: )

> History Server web UI gets messed up when sorting on any column
> ---
>
> Key: SPARK-10172
> URL: https://issues.apache.org/jira/browse/SPARK-10172
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Min Shen
>Priority: Minor
>  Labels: regression
> Attachments: screen-shot.png
>
>
> If the history web UI displays the "Attempt ID" column, clicking any table 
> header to sort the table messes up the entire page.
> This appears to be a problem with sorttable.js not being able to correctly 
> handle tables that use rowspan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10667) Add 90 Runnable TPCDS Queries into spark-sql-perf

2015-09-17 Thread JESSE CHEN (JIRA)
JESSE CHEN created SPARK-10667:
--

 Summary: Add 90 Runnable TPCDS Queries into spark-sql-perf
 Key: SPARK-10667
 URL: https://issues.apache.org/jira/browse/SPARK-10667
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 1.5.0, 1.4.1
 Environment: RHEL 7.1, Spark 1.5.0 and 1.4.1
Reporter: JESSE CHEN
 Fix For: 1.5.0, 1.4.1


The IBM Spark Technology Center has gotten 86 of the 99 TPCDS business queries 
to run successfully on Spark 1.4.1 and 1.5.0. The existing spark-sql-perf test 
kit (on GitHub from Databricks) contains only a small subset of these. Could we 
add all of the following queries to the kit so the community can learn from and 
leverage them?

query01
query02
query03
query04
query05
query07
query08
query09
query11
query12
query13
query15
query16
query17
query20
query21
query22
query24a
query24b
query25
query26
query27
query28
query29
query30
query31
query32
query33
query34
query36
query37
query38
query39a
query39b
query40
query42
query43
query45
query46
query47
query48
query49
query50
query51
query52
query53
query56
query57
query58
query59
query60
query61
query62
query63
query66
query67
query68
query69
query72
query73
query74
query75
query76
query77
query78
query79
query80
query81
query82
query83
query84
query85
query86
query87
query88
query89
query90
query91
query92
query93
query94
query95
query96
query97
query98
query99

Please contact me for the working queries.
Jesse Chen
jfc...@us.ibm.com




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10667) Add 86 Runnable TPCDS Queries into spark-sql-perf

2015-09-17 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-10667:
---
Summary: Add 86 Runnable TPCDS Queries into spark-sql-perf  (was: Add 90 
Runnable TPCDS Queries into spark-sql-perf)

> Add 86 Runnable TPCDS Queries into spark-sql-perf
> -
>
> Key: SPARK-10667
> URL: https://issues.apache.org/jira/browse/SPARK-10667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 1.4.1, 1.5.0
> Environment: RHEL 7.1, Spark 1.5.0 and 1.4.1
>Reporter: JESSE CHEN
>  Labels: spark, spark-sql-perf, sql
> Fix For: 1.4.1, 1.5.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The IBM Spark Technology Center has gotten 86 of the 99 TPCDS business 
> queries to run successfully on Spark 1.4.1 and 1.5.0. The existing 
> spark-sql-perf test kit (on GitHub from Databricks) contains only a small 
> subset of these. Could we add all of the following queries to the kit so the 
> community can learn from and leverage them?
> query01
> query02
> query03
> query04
> query05
> query07
> query08
> query09
> query11
> query12
> query13
> query15
> query16
> query17
> query20
> query21
> query22
> query24a
> query24b
> query25
> query26
> query27
> query28
> query29
> query30
> query31
> query32
> query33
> query34
> query36
> query37
> query38
> query39a
> query39b
> query40
> query42
> query43
> query45
> query46
> query47
> query48
> query49
> query50
> query51
> query52
> query53
> query56
> query57
> query58
> query59
> query60
> query61
> query62
> query63
> query66
> query67
> query68
> query69
> query72
> query73
> query74
> query75
> query76
> query77
> query78
> query79
> query80
> query81
> query82
> query83
> query84
> query85
> query86
> query87
> query88
> query89
> query90
> query91
> query92
> query93
> query94
> query95
> query96
> query97
> query98
> query99
> Please contact me for the working queries.
> Jesse Chen
> jfc...@us.ibm.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-09-17 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803261#comment-14803261
 ] 

Mark Hamstra commented on SPARK-6880:
-

see SPARK-10666

> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDD
> --
>
> Key: SPARK-6880
> URL: https://issues.apache.org/jira/browse/SPARK-6880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: CentOs6.0, java7
>Reporter: pankaj arora
>Assignee: pankaj arora
> Fix For: 1.4.0
>
>
> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDDs
> Below is the stack trace
> 15/03/27 11:12:43 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
> failed; shutting down SparkContext
> java.util.NoSuchElementException: key not found: 28
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:808)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1389)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10172) History Server web UI gets messed up when sorting on any column

2015-09-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10172.

   Resolution: Fixed
 Assignee: Josiah Samuel Sathiadass
Fix Version/s: 1.5.1
   1.6.0
   1.4.2

> History Server web UI gets messed up when sorting on any column
> ---
>
> Key: SPARK-10172
> URL: https://issues.apache.org/jira/browse/SPARK-10172
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Min Shen
>Assignee: Josiah Samuel Sathiadass
>Priority: Minor
>  Labels: regression
> Fix For: 1.4.2, 1.6.0, 1.5.1
>
> Attachments: screen-shot.png
>
>
> If the history web UI displays the "Attempt ID" column, clicking any table 
> header to sort the table messes up the entire page.
> This appears to be a problem with sorttable.js not being able to correctly 
> handle tables that use rowspan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10531) AppId is set as AppName in status rest api

2015-09-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10531.

   Resolution: Fixed
 Assignee: Jeff Zhang
Fix Version/s: 1.6.0

> AppId is set as AppName in status rest api
> --
>
> Key: SPARK-10531
> URL: https://issues.apache.org/jira/browse/SPARK-10531
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 1.6.0
>
>
> This is the result from http://localhost:4040/api/v1/applications/
> {noformat}
> {
>id: "Spark shell",
>name: "Spark shell",
>attempts: [
>{
>startTime: "2015-09-10T06:38:21.528GMT",
>endTime: "1969-12-31T23:59:59.999GMT",
>sparkUser: "",
>completed: false
>}]
> }
> {noformat}
> And I have to use the appName in the REST URL, such as 
> {code}
> http://localhost:4040/api/v1/applications/Spark%20shell/jobs
> {code}
> * This issue only appears in the Spark job UI. For the master UI and history 
> UI, the appId is correctly populated. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10632) Cannot save DataFrame with User Defined Types

2015-09-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803272#comment-14803272
 ] 

Joseph K. Bradley commented on SPARK-10632:
---

I tried this with the current master, and it worked.  (I haven't gotten a 
chance to try it with 1.5.0 yet.)  Just to confirm: Did you run this on 1.5?  
And could you say what platform you were running on?

> Cannot save DataFrame with User Defined Types
> -
>
> Key: SPARK-10632
> URL: https://issues.apache.org/jira/browse/SPARK-10632
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Joao
>
> Cannot save DataFrames that contain user-defined types.
> I tried to save a DataFrame with instances of the Vector class from MLlib and 
> got the error below.
> The code below should reproduce the error.
> {noformat}
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.sql.SaveMode
> val df = sc.parallelize(Seq((1, Vectors.dense(1,1,1)), (2, Vectors.dense(2,2,2)))).toDF()
> df.write.format("json").mode(SaveMode.Overwrite).save(path)
> {noformat}
> The error log is below
> {noformat}
> 15/09/16 09:58:27 ERROR DefaultWriterContainer: Aborting task.
> scala.MatchError: [1,null,null,[1.0,1.0,1.0]] (of class 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow)
>   at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194)
>   at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133)
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 
> 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6'
>  closed. Now beginning upload
> 15/09/16 09:58:27 INFO NativeS3FileSystem: OutputStream for key 
> 'adad/_temporary/0/_temporary/attempt_201509160958__m_00_0/part-r-0-2a262ed4-be5a-4190-92a1-a5326cc76ed6'
>  upload complete
> 15/09/16 09:58:28 ERROR DefaultWriterContainer: Task attempt 
> attempt_201509160958__m_00_0 aborted.
> 15/09/16 09:58:28 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> org.apache.spark.SparkException: Task failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> jav

[jira] [Updated] (SPARK-10662) Code snippets are not properly formatted in docs

2015-09-17 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-10662:

Issue Type: Bug  (was: Task)

> Code snippets are not properly formatted in docs
> 
>
> Key: SPARK-10662
> URL: https://issues.apache.org/jira/browse/SPARK-10662
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Jacek Laskowski
> Attachments: spark-docs-backticks-tables.png
>
>
> Backticks (markdown) in tables are not processed and hence not formatted 
> properly. See 
> http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html
>  and search for {{`yarn-client`}}.
> As per [Sean's 
> suggestion|https://github.com/apache/spark/pull/8795#issuecomment-141019047] 
> I'm creating the JIRA task.
> {quote}
> This is a good fix, but this is another instance where I suspect the same 
> issue exists in several markup files, like configuration.html. It's worth a 
> JIRA since I think catching and fixing all of these is one non-trivial 
> logical change.
> If you can, avoid whitespace changes like stripping or adding space at the 
> end of lines. It just adds to the diff and makes for a tiny extra chance of 
> merge conflicts.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10650) Spark docs include test and other extra classes

2015-09-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10650.
--
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8787
[https://github.com/apache/spark/pull/8787]

> Spark docs include test and other extra classes
> ---
>
> Key: SPARK-10650
> URL: https://issues.apache.org/jira/browse/SPARK-10650
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>
> In 1.5.0 there are some extra classes in the Spark docs - including a bunch 
> of test classes. We need to figure out what commit introduced those and fix 
> it. The obvious things like genJavadoc version have not changed.
> http://spark.apache.org/docs/1.4.1/api/java/org/apache/spark/streaming/ 
> [before]
> http://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/streaming/ 
> [after]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10668) Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small

2015-09-17 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10668:
-

 Summary: Use WeightedLeastSquares in LinearRegression with L2 
regularization if the number of features is small
 Key: SPARK-10668
 URL: https://issues.apache.org/jira/browse/SPARK-10668
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng
Priority: Critical


If the number of features is small (<= 4096) and the regularization is L2, we 
should use WeightedLeastSquares to solve the problem rather than L-BFGS. The 
former requires only one pass over the data.
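
For intuition, here is a minimal sketch of the closed-form (normal equations) ridge solve that a single-pass weighted-least-squares solver boils down to when the feature count d is small; this is illustrative only, uses Breeze directly, and is not the actual WeightedLeastSquares implementation:

{code:scala}
import breeze.linalg.{DenseMatrix, DenseVector}

// Closed-form ridge regression: solve (X^T X + lambda * I) w = X^T y.
// X^T X and X^T y can each be accumulated in a single pass over the rows,
// which is why this pays off only when d (<= ~4096) keeps the d x d Gram
// matrix small enough to factor cheaply on the driver.
def ridgeNormalEquations(
    x: DenseMatrix[Double],
    y: DenseVector[Double],
    lambda: Double): DenseVector[Double] = {
  val d = x.cols
  val gram = x.t * x            // d x d Gram matrix
  val xty = x.t * y             // d-dimensional vector
  (gram + DenseMatrix.eye[Double](d) * lambda) \ xty
}
{code}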



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10664) JDBC DataFrameWriter does not save data to Oracle 11 Database

2015-09-17 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803331#comment-14803331
 ] 

Suresh Thalamati commented on SPARK-10664:
--

The table-exists case should be fixed as part of the SPARK-9078 fix. That fix 
went in recently on the latest code line. 

> JDBC DataFrameWriter does not save data to Oracle 11 Database
> -
>
> Key: SPARK-10664
> URL: https://issues.apache.org/jira/browse/SPARK-10664
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Dmitriy Atorin
>Priority: Critical
>
> The issue is that Oracle 11 and earlier do not support the LIMIT clause.
> The problem is here:
> 1. Go to org.apache.spark.sql.execution.datasources.jdbc
> 2. object JdbcUtils
> 3.
> def tableExists(conn: Connection, table: String): Boolean = {
>   // Somewhat hacky, but there isn't a good way to identify whether a table exists for all
>   // SQL database systems, considering "table" could also include the database name.
>   Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 1").executeQuery().next()).isSuccess
> }
> I think it would be better to write it this way:
> s"SELECT count(*) FROM $table WHERE 1=0"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10394) Make GBTParams use shared "stepSize"

2015-09-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10394.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8552
[https://github.com/apache/spark/pull/8552]

> Make GBTParams use shared "stepSize"
> 
>
> Key: SPARK-10394
> URL: https://issues.apache.org/jira/browse/SPARK-10394
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> GBTParams currently defines "stepSize" as the learning rate.
> ML has the shared param trait "HasStepSize"; GBTParams can extend it rather 
> than duplicating the implementation.
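
An illustrative sketch of the refactoring, with a local stand-in for the shared trait (the real HasStepSize lives in org.apache.spark.ml.param.shared; names and the default value here are assumptions, not the actual patch):

{code:scala}
import org.apache.spark.ml.param.{DoubleParam, Params}

// Stand-in for spark.ml's shared HasStepSize trait.
trait HasStepSize extends Params {
  final val stepSize: DoubleParam =
    new DoubleParam(this, "stepSize", "step size (learning rate)")
  final def getStepSize: Double = $(stepSize)
}

// Sketch of the refactoring: GBTParams mixes in the shared trait instead of
// declaring its own duplicate stepSize Param.
trait GBTParams extends HasStepSize {
  setDefault(stepSize -> 0.1)   // keep a sensible default learning rate
}
{code}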



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


