[jira] [Created] (SPARK-11023) Error initializing SparkContext. java.net.URISyntaxException

2015-10-09 Thread Jose Antonio (JIRA)
Jose Antonio created SPARK-11023:


 Summary: Error initializing SparkContext. 
java.net.URISyntaxException
 Key: SPARK-11023
 URL: https://issues.apache.org/jira/browse/SPARK-11023
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.1, 1.5.0
 Environment: pyspark + windows 
Reporter: Jose Antonio


Similar to SPARK-10326. 
[https://issues.apache.org/jira/browse/SPARK-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949470#comment-14949470]


C:\WINDOWS\system32>pyspark --master yarn-client
Python 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, Sep 15 2015, 14:26:14) [MSC 
v.1500 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.
IPython 4.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
15/10/08 09:28:05 WARN MetricsSystem: Using default name DAGScheduler for 
source because spark.app.id is not set.
15/10/08 09:28:06 WARN : Your hostname, PC-509512 resolves to a 
loopback/non-reachable address: fe80:0:0:0:0:5efe:a5f:c318%net3, but we 
couldn't find any external IP address!
15/10/08 09:28:08 WARN BlockReaderLocal: The short-circuit local reads feature 
cannot be used because UNIX Domain sockets are not available on Windows.
15/10/08 09:28:08 ERROR SparkContext: Error initializing SparkContext.
java.net.URISyntaxException: Illegal character in opaque part at index 2: 
C:\spark\bin\..\python\lib\pyspark.zip
at java.net.URI$Parser.fail(Unknown Source)
at java.net.URI$Parser.checkChars(Unknown Source)
at java.net.URI$Parser.parse(Unknown Source)
at java.net.URI.<init>(Unknown Source)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$7.apply(Client.scala:558)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$7.apply(Client.scala:557)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:557)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:628)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:523)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
15/10/08 09:28:08 ERROR Utils: Uncaught exception in thread Thread-2
java.lang.NullPointerException
at 
org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1228)
at org.apache.spark.SparkEnv.stop(SparkEnv.scala:100)
at 
org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1749)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1748)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:593)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
---
Py4JJavaError Traceback (most recent call last)
C:\spark\bin\..\python\pyspark\shell.py in <module>()
41 
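For context, here is a minimal, hedged illustration of why the path above breaks URI parsing and the conversion that avoids it; this mirrors the {{File.toURI()}} point made on SPARK-10326 and is not the actual patch:

{code}
// Hedged illustration, not the Spark fix: a bare Windows path is not a valid
// URI because "C:" is parsed as a scheme, which is exactly the
// "Illegal character in opaque part at index 2" failure in the log above.
val path = "C:\\spark\\bin\\..\\python\\lib\\pyspark.zip"

// new java.net.URI(path)               // throws java.net.URISyntaxException
val uri = new java.io.File(path).toURI  // file:/C:/spark/bin/../python/lib/pyspark.zip
println(uri)
{code}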

[jira] [Commented] (SPARK-8333) Spark failed to delete temp directory created by HiveContext

2015-10-09 Thread Dony.Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950075#comment-14950075
 ] 

Dony.Xu commented on SPARK-8333:


When I run the Streaming Java API test on Windows 7, this issue can also be 
reproduced.

  
java.io.IOException: Failed to delete: 
D:\workspace\spark\streaming\target\tmp\1444376717608-0
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
at org.apache.spark.util.Utils.deleteRecursively(Utils.scala)
at 
org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1728)

  

> Spark failed to delete temp directory created by HiveContext
> 
>
> Key: SPARK-8333
> URL: https://issues.apache.org/jira/browse/SPARK-8333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Windows7 64bit
>Reporter: sheng
>Priority: Minor
>  Labels: Hive, metastore, sparksql
> Attachments: test.tar
>
>
> Spark 1.4.0 failed to stop SparkContext.
> {code:title=LocalHiveTest.scala|borderStyle=solid}
>  val sc = new SparkContext("local", "local-hive-test", new SparkConf())
>  val hc = Utils.createHiveContext(sc)
>  ... // execute some HiveQL statements
>  sc.stop()
> {code}
> sc.stop() failed to execute, it threw the following exception:
> {quote}
> 15/06/13 03:19:06 INFO Utils: Shutdown hook called
> 15/06/13 03:19:06 INFO Utils: Deleting directory 
> C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
> 15/06/13 03:19:06 ERROR Utils: Exception while deleting Spark temp dir: 
> C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
> java.io.IOException: Failed to delete: 
> C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:963)
>   at 
> org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:204)
>   at 
> org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:201)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at org.apache.spark.util.Utils$$anonfun$1.apply$mcV$sp(Utils.scala:201)
>   at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2292)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2262)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2262)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2262)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2262)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2262)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2244)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> {quote}
> It seems this bug is introduced by this SPARK-6907. In SPARK-6907, a local 
> hive metastore is created in a temp directory. The problem is the local hive 
> metastore is not shut down correctly. At the end of application,  if 
> SparkContext.stop() is called, it tries to delete the temp directory which is 
> still used by the local hive metastore, and throws an exception.
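The failure mode above is easy to reproduce in isolation on Windows, where a file that is still held open (as the embedded metastore's files are at shutdown) cannot be deleted. A hedged, self-contained sketch, with illustrative file names that are not Spark's:

{code}
// Hedged repro sketch: on Windows, deleting a file another component still
// holds open fails, so a recursive delete of the parent temp dir fails with
// "Failed to delete: ...". All names here are illustrative only.
import java.io.{File, FileOutputStream, IOException}

val tmpDir = new File(System.getProperty("java.io.tmpdir"), "spark-delete-demo")
tmpDir.mkdirs()
val target = new File(tmpDir, "metastore.lck")
val heldOpen = new FileOutputStream(target) // stands in for the live metastore
try {
  if (!target.delete()) {
    // On Windows this branch is what the shutdown hook hits.
    throw new IOException(s"Failed to delete: ${target.getAbsolutePath}")
  }
} finally {
  heldOpen.close() // once closed, the file and directory can be removed
}
{code}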



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11025) Exception key can't be empty at getSystemProperties function in utils

2015-10-09 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-11025:
---

 Summary: Exception key can't be empty at getSystemProperties 
function in utils 
 Key: SPARK-11025
 URL: https://issues.apache.org/jira/browse/SPARK-11025
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.1, 1.4.1, 1.4.0, 1.3.1, 1.3.0
Reporter: Stavros Kontopoulos
Priority: Trivial


At file 
https://github.com/apache/spark/blob/v1.x.x/core/src/main/scala/org/apache/spark/util/Utils.scala
getSystemProperties function fails when someone passes -D to the jvm and as a 
result the key passed is "" (empty).

Exception thrown: java.lang.IllegalArgumentException: key can't be empty
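A minimal sketch of the kind of guard that would avoid the failure, assuming the intent is simply to skip the empty key a bare {{-D}} produces; this is an illustration, not the actual Spark change:

{code}
// Hedged sketch, not the actual fix: copy system properties while skipping the
// empty key produced by a bare "-D", so the copy no longer triggers
// "java.lang.IllegalArgumentException: key can't be empty".
import scala.collection.JavaConverters._

def getSystemProperties: Map[String, String] = {
  System.getProperties.stringPropertyNames().asScala
    .filter(_.nonEmpty)                           // a bare -D yields the key ""
    .map(key => key -> System.getProperty(key))
    .toMap
}
{code}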





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10679) javax.jdo.JDOFatalUserException in executor

2015-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10679:
--
Assignee: Reynold Xin

> javax.jdo.JDOFatalUserException in executor
> ---
>
> Key: SPARK-10679
> URL: https://issues.apache.org/jira/browse/SPARK-10679
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Navis
>Assignee: Reynold Xin
>Priority: Minor
> Fix For: 1.6.0
>
>
> HadoopRDD throws exception in executor, something like below.
> {noformat}
> 15/09/17 18:51:21 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/09/17 18:51:21 INFO metastore.ObjectStore: ObjectStore, initialize called
> 15/09/17 18:51:21 WARN metastore.HiveMetaStore: Retrying creating default 
> database after error: Class 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
> javax.jdo.JDOFatalUserException: Class 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
>   at 
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>   at 
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>   at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
>   at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:166)
>   at 
> org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:803)
>   at 
> org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:782)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:298)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:274)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:274)
>   at 
> 

[jira] [Updated] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11016:
--
Component/s: Spark Core

> Spark fails when running with a task that requires a more recent version of 
> RoaringBitmaps
> --
>
> Key: SPARK-11016
> URL: https://issues.apache.org/jira/browse/SPARK-11016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Charles Allen
>
> The following error appears during Kryo init whenever a more recent version 
> (>0.5.0) of Roaring bitmaps is required by a job. 
> org/roaringbitmap/RoaringArray$Element was removed in 0.5.0
> {code}
> A needed class was not found. This could be due to an error in your runpath. 
> Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
>   at 
> org.apache.spark.serializer.KryoSerializer$.<init>(KryoSerializer.scala:338)
>   at 
> org.apache.spark.serializer.KryoSerializer$.<clinit>(KryoSerializer.scala)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:222)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11018) Support UDT in codegen and unsafe projection

2015-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11018:
--
Component/s: SQL

> Support UDT in codegen and unsafe projection
> 
>
> Key: SPARK-11018
> URL: https://issues.apache.org/jira/browse/SPARK-11018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> UDT is not handled correctly in codegen:
> {code}
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 41, Column 30: No applicable constructor/method found 
> for actual parameters "int, java.lang.Object"; candidates are: "public void 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(int, 
> org.apache.spark.unsafe.types.CalendarInterval)", "public void 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(int, 
> org.apache.spark.sql.types.Decimal, int, int)", "public void 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(int, 
> org.apache.spark.unsafe.types.UTF8String)", "public void 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(int, 
> byte[])"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-10-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950152#comment-14950152
 ] 

Sean Owen commented on SPARK-11016:
---

This may be my ignorance, but is a proper serializer registered for the RoaringBitmap 
classes in your app (or somehow by Kryo by default)? Otherwise, relying on 
default serialization may indeed not work. This isn't a Spark problem, though.
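If the question is about explicit registration, a hedged application-side example of what that could look like; the class choice is illustrative and assumes the job serializes its own RoaringBitmap objects:

{code}
// Hedged example: register the bitmap class with Kryo explicitly instead of
// relying on default serialization. This is an application-side sketch, not a
// change to Spark itself.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("roaringbitmap-kryo-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[org.roaringbitmap.RoaringBitmap]))
{code}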

> Spark fails when running with a task that requires a more recent version of 
> RoaringBitmaps
> --
>
> Key: SPARK-11016
> URL: https://issues.apache.org/jira/browse/SPARK-11016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Charles Allen
>
> The following error appears during Kryo init whenever a more recent version 
> (>0.5.0) of Roaring bitmaps is required by a job. 
> org/roaringbitmap/RoaringArray$Element was removed in 0.5.0
> {code}
> A needed class was not found. This could be due to an error in your runpath. 
> Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
>   at 
> org.apache.spark.serializer.KryoSerializer$.<init>(KryoSerializer.scala:338)
>   at 
> org.apache.spark.serializer.KryoSerializer$.<clinit>(KryoSerializer.scala)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:222)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11025) Exception key can't be empty at getSystemProperties function in utils

2015-10-09 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950157#comment-14950157
 ] 

Stavros Kontopoulos edited comment on SPARK-11025 at 10/9/15 10:04 AM:
---

falling back to previous impl: 
 System.getProperties.clone().asInstanceOf[java.util.Properties].toMap[String, 
String] which was ignoring it, i guess at language level java does not complain 
so i think it is ok to ignore it...unless the general strategy is to catch 
everything that is wrong... but i think we should only validate what we use... 
i know -D only may come up as a mistake... just wanted to bring to the table 
what is the strategy and if for such minor mistakes should we fail the 
execution when spark config is created etc...


was (Author: skonto):
falling back to previous impl: 
 System.getProperties.clone().asInstanceOf[java.util.Properties].toMap[String, 
String] which was ignoring it, i guess at language level java does not complain 
so i think it is ok to ignore it...unless the general strategy is to catch 
everything that is wrong... but i think we should only validate what we use... 
i know -D only may come up as a mistake... just wanted to bring to the table 
what is the strategy and if for such minor mistakes should we fail the 
execution when spark context is created etc...

> Exception key can't be empty at getSystemProperties function in utils 
> --
>
> Key: SPARK-11025
> URL: https://issues.apache.org/jira/browse/SPARK-11025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.1
>Reporter: Stavros Kontopoulos
>Priority: Trivial
>  Labels: easyfix, easytest
>
> At file core/src/main/scala/org/apache/spark/util/Utils.scala
> getSystemProperties function fails when someone passes -D to the jvm and as a 
> result the key passed is "" (empty).
> Exception thrown: java.lang.IllegalArgumentException: key can't be empty
> Empty keys should be ignored or just passed them without filtering at that 
> level as in previous versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10944) Provide self-contained deployment not tightly coupled with Hadoop

2015-10-09 Thread Pranas Baliuka (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pranas Baliuka updated SPARK-10944:
---
 Flags: Patch  (was: Patch,Important)
Labels: patch  (was: easyfix patch)
Remaining Estimate: (was: 2h)
 Original Estimate: (was: 2h)
  Priority: Minor  (was: Major)
   Description: 
Attempt to run Spark cluster on Mac OS machine fails if Hadoop is not 
installed. There should be no real need to install full blown Hadoop 
installation just to run Spark.

Current situation

{code}
# cd $SPARK_HOME
Imin:spark-1.5.1-bin-without-hadoop pranas$ ./sbin/start-master.sh
{code}

Output:
{code}
starting org.apache.spark.deploy.master.Master, logging to 
/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
failed to launch org.apache.spark.deploy.master.Master:
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
full log in 
/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
{code}

Log:
{code}
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
Spark Command: 
/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java -cp 
/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../conf/:/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
 -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip Imin.local --port 
7077 --webui-port 8080

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at 
sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
{code}

Proposed short term fix:
Bundle all required 3rd party libs to the uberjar and/or fix  start-up script 
to include required 3rd party libs.

Long term quality improvement proposal: Introduce integration tests to check 
distribution before releasing.

  was:
Attempt to run Spark cluster on Mac OS machine fails

Invocation:
{code}
# cd $SPARK_HOME
Imin:spark-1.5.1-bin-without-hadoop pranas$ ./sbin/start-master.sh
{code}

Output:
{code}
starting org.apache.spark.deploy.master.Master, logging to 
/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
failed to launch org.apache.spark.deploy.master.Master:
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
full log in 
/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
{code}

Log:
{code}
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
Spark Command: 
/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java -cp 
/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../conf/:/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
 -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip Imin.local --port 
7077 --webui-port 8080

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at 
sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
{code}

Proposed short term fix:

[jira] [Reopened] (SPARK-10944) Provide self-contained deployment not tightly coupled with Hadoop

2015-10-09 Thread Pranas Baliuka (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pranas Baliuka reopened SPARK-10944:


Updated as a feature request. 

> Provide self-contained deployment not tightly coupled with Hadoop
> 
>
> Key: SPARK-10944
> URL: https://issues.apache.org/jira/browse/SPARK-10944
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 1.5.1
> Environment: Mac OS/Java 8/Spark 1.5.1 without hadoop
>Reporter: Pranas Baliuka
>Priority: Minor
>  Labels: patch
>
> Attempt to run Spark cluster on Mac OS machine fails if Hadoop is not 
> installed. There should be no real need to install full blown Hadoop 
> installation just to run Spark.
> Current situation
> {code}
> # cd $SPARK_HOME
> Imin:spark-1.5.1-bin-without-hadoop pranas$ ./sbin/start-master.sh
> {code}
> Output:
> {code}
> starting org.apache.spark.deploy.master.Master, logging to 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
> failed to launch org.apache.spark.deploy.master.Master:
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 7 more
> full log in 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
> {code}
> Log:
> {code}
> # Options read when launching programs locally with
> # ./bin/run-example or ./bin/spark-submit
> Spark Command: 
> /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java -cp 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../conf/:/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
>  -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip Imin.local --port 
> 7077 --webui-port 8080
> 
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
> at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
> at java.lang.Class.getMethod0(Class.java:3018)
> at java.lang.Class.getMethod(Class.java:1784)
> at 
> sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
> at 
> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
> Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> {code}
> Proposed short term fix:
> Bundle all required 3rd party libs to the uberjar and/or fix  start-up script 
> to include required 3rd party libs.
> Long term quality improvement proposal: Introduce integration tests to check 
> distribution before releasing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11022) Spark Worker process find Memory leaking after long time running

2015-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11022:
--
Priority: Minor  (was: Major)

Can you update the title to be clearer about the cause and resolution? You 
are specifically suggesting that the list of executors needs to be garbage 
collected. (Do you really have 17K executors, most of which are dead, in one 
app?)

> Spark Worker process find Memory leaking after long time running
> 
>
> Key: SPARK-11022
> URL: https://issues.apache.org/jira/browse/SPARK-11022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: colin shaw
>Priority: Minor
>
> The Worker process often goes down even though there are no abnormal tasks; it 
> just crashes without any message. After adding "-XX:+HeapDumpOnOutOfMemoryError 
> -XX:HeapDumpPath=${SPARK_HOME}/logs", a heap dump shows "17,010 
> instances of "org.apache.spark.deploy.worker.ExecutorRunner", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0xe2abfcc8" occupy 496,706,920 (96.14%) 
> bytes."
> Almost all of these instances are held by a single 
> "org.apache.spark.deploy.worker.Worker" instance; its finishedExecutors field 
> holds many ExecutorRunner objects.
> The code in Worker.scala only ever does "finishedExecutors(fullId) = executor" 
> and "finishedExecutors.values.toList"; there is no code path that removes a 
> finished executor, so they all accumulate in memory and the Worker eventually 
> crashes after running for a long time.
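A hedged sketch of the kind of bound the description is asking for; the limit name and value are illustrative, and this is not the actual Spark patch:

{code}
// Hedged sketch: keep only the most recent N finished executors instead of
// retaining every ExecutorRunner in the Worker's map forever.
import scala.collection.mutable

val retainedExecutors = 1000                                         // illustrative limit
val finishedExecutors = mutable.LinkedHashMap.empty[String, AnyRef]  // fullId -> ExecutorRunner

def recordFinishedExecutor(fullId: String, runner: AnyRef): Unit = {
  finishedExecutors(fullId) = runner
  if (finishedExecutors.size > retainedExecutors) {
    // LinkedHashMap preserves insertion order, so the first keys are the oldest.
    finishedExecutors.keys.take(finishedExecutors.size - retainedExecutors)
      .toList.foreach(finishedExecutors.remove)
  }
}
{code}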



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11024) Optimize NULL in by folding it to Literal(null)

2015-10-09 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-11024:


 Summary: Optimize NULL in  by folding it to 
Literal(null)
 Key: SPARK-11024
 URL: https://issues.apache.org/jira/browse/SPARK-11024
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Dilip Biswal
Priority: Minor


Add a rule in optimizer to convert NULL [NOT] IN (expr1,...,expr2) to
Literal(null). 

This is a follow up defect to SPARK-8654 and suggested by Wenchen Fan.
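A hedged sketch of what such an optimizer rule could look like; the rule name is illustrative and this is not the pull request itself:

{code}
// Hedged sketch, not the actual PR: fold `NULL [NOT] IN (...)` into a null
// boolean literal during optimization, since the predicate always evaluates
// to NULL regardless of the list contents.
import org.apache.spark.sql.catalyst.expressions.{In, Literal, Not}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

object FoldNullInToLiteral extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Not(In(Literal(null, _), _)) => Literal.create(null, BooleanType)
    case In(Literal(null, _), _)      => Literal.create(null, BooleanType)
  }
}
{code}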






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11024) Optimize NULL in by folding it to Literal(null)

2015-10-09 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950124#comment-14950124
 ] 

Dilip Biswal commented on SPARK-11024:
--

I am currently working on a PR for this issue.

> Optimize NULL in  by folding it to Literal(null)
> 
>
> Key: SPARK-11024
> URL: https://issues.apache.org/jira/browse/SPARK-11024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dilip Biswal
>Priority: Minor
>
> Add a rule in optimizer to convert NULL [NOT] IN (expr1,...,expr2) to
> Literal(null). 
> This is a follow up defect to SPARK-8654 and suggested by Wenchen Fan.
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11025) Exception key can't be empty at getSystemProperties function in utils

2015-10-09 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-11025:

Description: 
At file core/src/main/scala/org/apache/spark/util/Utils.scala
getSystemProperties function fails when someone passes -D to the jvm and as a 
result the key passed is "" (empty).

Exception thrown: java.lang.IllegalArgumentException: key can't be empty



  was:
At file ../core/src/main/scala/org/apache/spark/util/Utils.scala
getSystemProperties function fails when someone passes -D to the jvm and as a 
result the key passed is "" (empty).

Exception thrown: java.lang.IllegalArgumentException: key can't be empty




> Exception key can't be empty at getSystemProperties function in utils 
> --
>
> Key: SPARK-11025
> URL: https://issues.apache.org/jira/browse/SPARK-11025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.1
>Reporter: Stavros Kontopoulos
>Priority: Trivial
>  Labels: easyfix, easytest
>
> At file core/src/main/scala/org/apache/spark/util/Utils.scala
> getSystemProperties function fails when someone passes -D to the jvm and as a 
> result the key passed is "" (empty).
> Exception thrown: java.lang.IllegalArgumentException: key can't be empty



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11025) Exception key can't be empty at getSystemProperties function in utils

2015-10-09 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-11025:

Description: 
At file core/src/main/scala/org/apache/spark/util/Utils.scala
getSystemProperties function fails when someone passes -D to the jvm and as a 
result the key passed is "" (empty).
Exception thrown: java.lang.IllegalArgumentException: key can't be empty
Empty keys should be ignored i think.



  was:
At file core/src/main/scala/org/apache/spark/util/Utils.scala
getSystemProperties function fails when someone passes -D to the jvm and as a 
result the key passed is "" (empty).

Exception thrown: java.lang.IllegalArgumentException: key can't be empty




> Exception key can't be empty at getSystemProperties function in utils 
> --
>
> Key: SPARK-11025
> URL: https://issues.apache.org/jira/browse/SPARK-11025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.1
>Reporter: Stavros Kontopoulos
>Priority: Trivial
>  Labels: easyfix, easytest
>
> At file core/src/main/scala/org/apache/spark/util/Utils.scala
> getSystemProperties function fails when someone passes -D to the jvm and as a 
> result the key passed is "" (empty).
> Exception thrown: java.lang.IllegalArgumentException: key can't be empty
> Empty keys should be ignored i think.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11025) Exception key can't be empty at getSystemProperties function in utils

2015-10-09 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-11025:

Description: 
At file ../core/src/main/scala/org/apache/spark/util/Utils.scala
getSystemProperties function fails when someone passes -D to the jvm and as a 
result the key passed is "" (empty).

Exception thrown: java.lang.IllegalArgumentException: key can't be empty



  was:
At file 
https://github.com/apache/spark/blob/v1.x.x/core/src/main/scala/org/apache/spark/util/Utils.scala
getSystemProperties function fails when someone passes -D to the jvm and as a 
result the key passed is "" (empty).

Exception thrown: java.lang.IllegalArgumentException: key can't be empty




> Exception key can't be empty at getSystemProperties function in utils 
> --
>
> Key: SPARK-11025
> URL: https://issues.apache.org/jira/browse/SPARK-11025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.1
>Reporter: Stavros Kontopoulos
>Priority: Trivial
>  Labels: easyfix, easytest
>
> At file ../core/src/main/scala/org/apache/spark/util/Utils.scala
> getSystemProperties function fails when someone passes -D to the jvm and as a 
> result the key passed is "" (empty).
> Exception thrown: java.lang.IllegalArgumentException: key can't be empty



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11025) Exception key can't be empty at getSystemProperties function in utils

2015-10-09 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-11025:

Description: 
At file core/src/main/scala/org/apache/spark/util/Utils.scala
getSystemProperties function fails when someone passes -D to the jvm and as a 
result the key passed is "" (empty).
Exception thrown: java.lang.IllegalArgumentException: key can't be empty
Empty keys should be ignored or just passed them without filtering at that 
level as in previous versions.



  was:
At file core/src/main/scala/org/apache/spark/util/Utils.scala
getSystemProperties function fails when someone passes -D to the jvm and as a 
result the key passed is "" (empty).
Exception thrown: java.lang.IllegalArgumentException: key can't be empty
Empty keys should be ignored at that level as in previous versions.




> Exception key can't be empty at getSystemProperties function in utils 
> --
>
> Key: SPARK-11025
> URL: https://issues.apache.org/jira/browse/SPARK-11025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.1
>Reporter: Stavros Kontopoulos
>Priority: Trivial
>  Labels: easyfix, easytest
>
> At file core/src/main/scala/org/apache/spark/util/Utils.scala
> getSystemProperties function fails when someone passes -D to the jvm and as a 
> result the key passed is "" (empty).
> Exception thrown: java.lang.IllegalArgumentException: key can't be empty
> Empty keys should be ignored or just passed them without filtering at that 
> level as in previous versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8654) Analysis exception when using "NULL IN (...)": invalid cast

2015-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8654:
-
Assignee: Dilip Biswal

> Analysis exception when using "NULL IN (...)": invalid cast
> ---
>
> Key: SPARK-8654
> URL: https://issues.apache.org/jira/browse/SPARK-8654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Santiago M. Mola
>Assignee: Dilip Biswal
>Priority: Minor
>
> The following query throws an analysis exception:
> {code}
> SELECT * FROM t WHERE NULL NOT IN (1, 2, 3);
> {code}
> The exception is:
> {code}
> org.apache.spark.sql.AnalysisException: invalid cast from int to null;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:66)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
> {code}
> Here is a test that can be added to AnalysisSuite to check the issue:
> {code}
>   test("SPARK- regression test") {
> val plan = Project(Alias(In(Literal(null), Seq(Literal(1), Literal(2))), 
> "a")() :: Nil,
>   LocalRelation()
> )
> caseInsensitiveAnalyze(plan)
>   }
> {code}
> Note that this kind of query is a corner case, but it is still valid SQL. An 
> expression such as "NULL IN (...)" or "NULL NOT IN (...)" always gives NULL 
> as a result, even if the list contains NULL. So it is safe to translate these 
> expressions to Literal(null) during analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11014) RPC Time Out Exceptions

2015-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11014:
--
Component/s: YARN

> RPC Time Out Exceptions
> ---
>
> Key: SPARK-11014
> URL: https://issues.apache.org/jira/browse/SPARK-11014
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: YARN
>Reporter: Gurpreet Singh
>
> I am seeing lots of the following RPC exception messages in YARN logs:
> 
> 15/10/08 13:04:27 WARN executor.Executor: Issue communicating with driver in 
> heartbeater
> org.apache.spark.SparkException: Error sending message [message = 
> Heartbeat(437,[Lscala.Tuple2;@34199eb1,BlockManagerId(437, 
> phxaishdc9dn1294.stratus.phx.ebay.com, 47480))]
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:118)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:452)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:472)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:472)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after 
> [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
> ... 14 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
> [120 seconds]
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:241)
> ... 15 more
> ##
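As the log itself notes, the 120-second limit comes from {{spark.rpc.askTimeout}}. A hedged example of raising it (whether that is the right fix depends on why the driver is slow to answer heartbeats):

{code}
// Hedged mitigation sketch, not a diagnosis: raise the RPC ask timeout named
// in the log above. The heartbeat-interval tweak is an extra assumption.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.rpc.askTimeout", "300s")             // the log shows the 120s default
  .set("spark.executor.heartbeatInterval", "30s")  // assumption: fewer, less contended heartbeats
{code}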



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10973) __gettitem__ method throws IndexError exception when we try to access index after the last non-zero entry.

2015-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10973:
--
Labels: backport-needed  (was: )

> __gettitem__ method throws IndexError exception when we try to access index 
> after the last non-zero entry.
> --
>
> Key: SPARK-10973
> URL: https://issues.apache.org/jira/browse/SPARK-10973
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>  Labels: backport-needed
> Fix For: 1.6.0
>
>
> \_\_gettitem\_\_ method throws IndexError exception when we try to access  
> index  after the last non-zero entry.
> {code}
> from pyspark.mllib.linalg import Vectors
> sv = Vectors.sparse(5, {1: 3})
> sv[0]
> ## 0.0
> sv[1]
> ## 3.0
> sv[2]
> ## Traceback (most recent call last):
> ##   File "<stdin>", line 1, in <module>
> ##   File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__
> ## row_ind = inds[insert_index]
> ## IndexError: index out of bounds
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11025) Exception key can't be empty at getSystemProperties function in utils

2015-10-09 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950157#comment-14950157
 ] 

Stavros Kontopoulos commented on SPARK-11025:
-

falling back to previous impl: 
 System.getProperties.clone().asInstanceOf[java.util.Properties].toMap[String, 
String] which was ignoring it, i guess at language level java does not complain 
so i think it is ok to ignore it...unless the general strategy is to catch 
everything that is wrong... but i think we should only validate what we use... 
i know -D only may come up as a mistake... just wanted to bring to the table 
what is the strategy and if for such minor mistakes should we fail the 
execution when spark context is created etc...

> Exception key can't be empty at getSystemProperties function in utils 
> --
>
> Key: SPARK-11025
> URL: https://issues.apache.org/jira/browse/SPARK-11025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.1
>Reporter: Stavros Kontopoulos
>Priority: Trivial
>  Labels: easyfix, easytest
>
> At file core/src/main/scala/org/apache/spark/util/Utils.scala
> getSystemProperties function fails when someone passes -D to the jvm and as a 
> result the key passed is "" (empty).
> Exception thrown: java.lang.IllegalArgumentException: key can't be empty
> Empty keys should be ignored or just passed them without filtering at that 
> level as in previous versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10326) Cannot launch YARN job on Windows

2015-10-09 Thread Jose Antonio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950065#comment-14950065
 ] 

Jose Antonio commented on SPARK-10326:
--

Bug reported.
Thanks,
Jose





-- 
/ .- .-.. .-.. / -.-- --- ..- / -. . . -.. / .. ... / .-.. --- ...- .
José Antonio Martín H. (PhD)   E-Mail: jamart...@fdi.ucm.es
Computer Science Faculty   Phone: (+34) 91 3947650
Complutense University of Madrid   Fax: (+34) 91 3947527
C/ Prof. José García Santesmases,s/n   28040 Madrid, Spain
web: http://www.dacya.ucm.es/jam/
LinkedIn: http://www.linkedin.com/in/jamartinh (Let's connect)
.-.. --- ...- . / .. ... / .- .-.. .-.. / .-- . / -. . . -..


> Cannot launch YARN job on Windows 
> --
>
> Key: SPARK-10326
> URL: https://issues.apache.org/jira/browse/SPARK-10326
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.5.0
>
>
> The fix is already in master, and it's one line out of the patch for 
> SPARK-5754; the bug is that a Windows file path cannot be used to create a 
> URI, so {{File.toURI()}} needs to be called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10944) Provide self-contained deployment not tightly coupled with Hadoop

2015-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-10944.
-

> Provide self-contained deployment not tightly coupled with Hadoop
> 
>
> Key: SPARK-10944
> URL: https://issues.apache.org/jira/browse/SPARK-10944
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 1.5.1
> Environment: Mac OS/Java 8/Spark 1.5.1 without hadoop
>Reporter: Pranas Baliuka
>Priority: Minor
>  Labels: patch
>
> Attempt to run Spark cluster on Mac OS machine fails if Hadoop is not 
> installed. There should be no real need to install full blown Hadoop 
> installation just to run Spark.
> Current situation
> {code}
> # cd $SPARK_HOME
> Imin:spark-1.5.1-bin-without-hadoop pranas$ ./sbin/start-master.sh
> {code}
> Output:
> {code}
> starting org.apache.spark.deploy.master.Master, logging to 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
> failed to launch org.apache.spark.deploy.master.Master:
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 7 more
> full log in 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
> {code}
> Log:
> {code}
> # Options read when launching programs locally with
> # ./bin/run-example or ./bin/spark-submit
> Spark Command: 
> /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java -cp 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../conf/:/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
>  -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip Imin.local --port 
> 7077 --webui-port 8080
> 
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
> at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
> at java.lang.Class.getMethod0(Class.java:3018)
> at java.lang.Class.getMethod(Class.java:1784)
> at 
> sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
> at 
> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
> Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> {code}
> Proposed short-term fix:
> Bundle all required 3rd-party libs into the uberjar and/or fix the start-up 
> script to include the required 3rd-party libs.
> Long-term quality improvement proposal: introduce integration tests to check 
> the distribution before releasing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10945) GraphX computes Pagerank with NaN (with some datasets)

2015-10-09 Thread Khaled Ammar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950095#comment-14950095
 ] 

Khaled Ammar commented on SPARK-10945:
--

Hi [~ankurd], I wonder if you had a chance to work on this issue.

Thanks,
-Khaled


> GraphX computes Pagerank with NaN (with some datasets)
> --
>
> Key: SPARK-10945
> URL: https://issues.apache.org/jira/browse/SPARK-10945
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.3.0
> Environment: Linux
>Reporter: Khaled Ammar
>  Labels: test
>
> Hi,
> I run GraphX on a medium-size standalone Spark 1.3.0 installation. PageRank 
> typically works fine, except with one dataset (Twitter: 
> http://law.di.unimi.it/webdata/twitter-2010). This is a public dataset that 
> is commonly used in research papers.
> I found that many vertices have NaN values. This is true even if the 
> algorithm runs for 1 iteration only.
> Thanks,
> -Khaled



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10944) Provide self-contained deployment not tightly coupled with Hadoop

2015-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10944.
---
Resolution: Not A Problem

[~pranas] please don't reopen an issue unless there is a clear change in the 
reason that it was closed. Here, Marcelo explained the problem: you're using an 
artifact that requires you to provide Hadoop classes, but you are not. You 
should not use this artifact. In fact, Spark does require Hadoop *classes* no 
matter what.

> Provide self-contained deployment not tightly coupled with Hadoop
> 
>
> Key: SPARK-10944
> URL: https://issues.apache.org/jira/browse/SPARK-10944
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 1.5.1
> Environment: Mac OS/Java 8/Spark 1.5.1 without hadoop
>Reporter: Pranas Baliuka
>Priority: Minor
>  Labels: patch
>
> An attempt to run a Spark cluster on a Mac OS machine fails if Hadoop is not 
> installed. There should be no real need to install a full-blown Hadoop 
> distribution just to run Spark.
> Current situation
> {code}
> # cd $SPARK_HOME
> Imin:spark-1.5.1-bin-without-hadoop pranas$ ./sbin/start-master.sh
> {code}
> Output:
> {code}
> starting org.apache.spark.deploy.master.Master, logging to 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
> failed to launch org.apache.spark.deploy.master.Master:
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 7 more
> full log in 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../logs/spark-pranas-org.apache.spark.deploy.master.Master-1-Imin.local.out
> {code}
> Log:
> {code}
> # Options read when launching programs locally with
> # ./bin/run-example or ./bin/spark-submit
> Spark Command: 
> /Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java -cp 
> /Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/sbin/../conf/:/Users/pranas/Apps/spark-1.5.1-bin-without-hadoop/lib/spark-assembly-1.5.1-hadoop2.2.0.jar
>  -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip Imin.local --port 
> 7077 --webui-port 8080
> 
> Error: A JNI error has occurred, please check your installation and try again
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
> at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
> at java.lang.Class.getMethod0(Class.java:3018)
> at java.lang.Class.getMethod(Class.java:1784)
> at 
> sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
> at 
> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
> Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> {code}
> Proposed short-term fix:
> Bundle all required 3rd-party libs into the uberjar and/or fix the start-up 
> script to include the required 3rd-party libs.
> Long-term quality improvement proposal: introduce integration tests to check 
> the distribution before releasing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11025) Exception key can't be empty at getSystemProperties function in utils

2015-10-09 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-11025:

Description: 
At file core/src/main/scala/org/apache/spark/util/Utils.scala
getSystemProperties function fails when someone passes -D to the jvm and as a 
result the key passed is "" (empty).
Exception thrown: java.lang.IllegalArgumentException: key can't be empty
Empty keys should be ignored at that level a sin previous versions.



  was:
At file core/src/main/scala/org/apache/spark/util/Utils.scala
getSystemProperties function fails when someone passes -D to the jvm and as a 
result the key passed is "" (empty).
Exception thrown: java.lang.IllegalArgumentException: key can't be empty
Empty keys should be ignored i think.




> Exception key can't be empty at getSystemProperties function in utils 
> --
>
> Key: SPARK-11025
> URL: https://issues.apache.org/jira/browse/SPARK-11025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.1
>Reporter: Stavros Kontopoulos
>Priority: Trivial
>  Labels: easyfix, easytest
>
> At file core/src/main/scala/org/apache/spark/util/Utils.scala
> getSystemProperties function fails when someone passes -D to the jvm and as a 
> result the key passed is "" (empty).
> Exception thrown: java.lang.IllegalArgumentException: key can't be empty
> Empty keys should be ignored at that level a sin previous versions.
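For reference, a minimal sketch of the "ignore empty keys" behavior being discussed; the helper name is hypothetical and this is not Spark's actual code:
{code}
import scala.collection.JavaConverters._

// Snapshot the JVM system properties, skipping the empty key that a bare -D produces,
// since System.getProperty("") throws IllegalArgumentException("key can't be empty").
def systemPropertiesIgnoringEmptyKeys(): Map[String, String] =
  System.getProperties.stringPropertyNames().asScala
    .filter(_.nonEmpty)
    .map(key => key -> System.getProperty(key))
    .toMap
{code}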



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11025) Exception key can't be empty at getSystemProperties function in utils

2015-10-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950145#comment-14950145
 ] 

Sean Owen commented on SPARK-11025:
---

What behavior do you suggest - ignoring it? Clearly {{-D}} by itself is a 
mistake though. It should cause an error that you notice.

> Exception key can't be empty at getSystemProperties function in utils 
> --
>
> Key: SPARK-11025
> URL: https://issues.apache.org/jira/browse/SPARK-11025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.1
>Reporter: Stavros Kontopoulos
>Priority: Trivial
>  Labels: easyfix, easytest
>
> At file core/src/main/scala/org/apache/spark/util/Utils.scala
> getSystemProperties function fails when someone passes -D to the jvm and as a 
> result the key passed is "" (empty).
> Exception thrown: java.lang.IllegalArgumentException: key can't be empty
> Empty keys should be ignored at that level as in previous versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11025) Exception key can't be empty at getSystemProperties function in utils

2015-10-09 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-11025:

Description: 
At file core/src/main/scala/org/apache/spark/util/Utils.scala
getSystemProperties function fails when someone passes -D to the jvm and as a 
result the key passed is "" (empty).
Exception thrown: java.lang.IllegalArgumentException: key can't be empty
Empty keys should be ignored at that level as in previous versions.



  was:
At file core/src/main/scala/org/apache/spark/util/Utils.scala
getSystemProperties function fails when someone passes -D to the jvm and as a 
result the key passed is "" (empty).
Exception thrown: java.lang.IllegalArgumentException: key can't be empty
Empty keys should be ignored at that level a sin previous versions.




> Exception key can't be empty at getSystemProperties function in utils 
> --
>
> Key: SPARK-11025
> URL: https://issues.apache.org/jira/browse/SPARK-11025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.1
>Reporter: Stavros Kontopoulos
>Priority: Trivial
>  Labels: easyfix, easytest
>
> At file core/src/main/scala/org/apache/spark/util/Utils.scala
> getSystemProperties function fails when someone passes -D to the jvm and as a 
> result the key passed is "" (empty).
> Exception thrown: java.lang.IllegalArgumentException: key can't be empty
> Empty keys should be ignored at that level as in previous versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11006) Rename NullColumnAccess as NullColumnAccessor

2015-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11006:
--
Assignee: Ted Yu

> Rename NullColumnAccess as NullColumnAccessor
> -
>
> Key: SPARK-11006
> URL: https://issues.apache.org/jira/browse/SPARK-11006
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Trivial
> Fix For: 1.6.0
>
>
> In sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala, 
> NullColumnAccess should be renamed to NullColumnAccessor so that the same 
> convention is adhered to for the accessors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10902) Hive UDF current_database() does not work

2015-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10902:
--
Assignee: Davies Liu

> Hive UDF current_database() does not work
> -
>
> Key: SPARK-10902
> URL: https://issues.apache.org/jira/browse/SPARK-10902
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.6.0
>
>
> Hive UDF current_database() is foldable; it needs to access the SessionState 
> in metadataHive to be evaluated, but that is not accessible while optimizing 
> the query plan. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11004) MapReduce Hive-like join operations for RDDs

2015-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11004:
--
Component/s: Shuffle

> MapReduce Hive-like join operations for RDDs
> 
>
> Key: SPARK-11004
> URL: https://issues.apache.org/jira/browse/SPARK-11004
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce 
> operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, 
> predictable way with graceful failures and recovery.  I have applications 
> that are able to join 2 tables without error in Hive, but these same tables, 
> when converted into RDDs, are unable to join in Spark (I am using the same 
> cluster, and have played around with all of the memory configurations, 
> persisting to disk, checkpointing, etc., and the RDDs are just too big for 
> Spark on my cluster)
> So, Spark is usually able to handle fairly large RDD joins, but occasionally 
> runs into problems when the tables are just too big (e.g. the notorious 2GB 
> shuffle limit issue, memory problems, etc.)  There are so many parameters to 
> adjust (number of partitions, number of cores, memory per core, etc.) that it 
> is difficult to guarantee stability on a shared cluster (say, running Yarn) 
> with other jobs.
> Could a feature be added to Spark that would use disk-only MapReduce commands 
> to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation 
> myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run 
> MapReduce, and then convert the results of the join back into a standard RDD.
> This might add stability for Spark jobs that deal with extremely large data, 
> and enable developers to mix-and-match some Spark and MapReduce operations in 
> the same program, rather than writing Hive scripts and stringing together 
> Spark and MapReduce programs, which incurs a very large overhead converting 
> RDDs to Hive tables and back again.
> Despite memory-level operations being where most of Spark's speed gains lie, 
> sometimes using disk-only may help with stability!
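Not the proposed {{mapReduceJoin}} API, but a minimal sketch of the disk-only workaround that is already possible today (toy data, local master, and variable names are illustrative):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("disk-only-join").setMaster("local[*]"))
// Spill both sides to disk before joining, trading speed for stability on very large inputs.
val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b")).persist(StorageLevel.DISK_ONLY)
val right = sc.parallelize(Seq(1 -> "x", 3 -> "y")).persist(StorageLevel.DISK_ONLY)
val joined = left.join(right)       // still a standard RDD shuffle join
joined.collect().foreach(println)   // (1,(a,x))
sc.stop()
{code}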



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8654) Analysis exception when using "NULL IN (...)": invalid cast

2015-10-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8654:
-
Fix Version/s: (was: 1.6.0)

> Analysis exception when using "NULL IN (...)": invalid cast
> ---
>
> Key: SPARK-8654
> URL: https://issues.apache.org/jira/browse/SPARK-8654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Santiago M. Mola
>Priority: Minor
>
> The following query throws an analysis exception:
> {code}
> SELECT * FROM t WHERE NULL NOT IN (1, 2, 3);
> {code}
> The exception is:
> {code}
> org.apache.spark.sql.AnalysisException: invalid cast from int to null;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:66)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
> {code}
> Here is a test that can be added to AnalysisSuite to check the issue:
> {code}
>   test("SPARK- regression test") {
> val plan = Project(Alias(In(Literal(null), Seq(Literal(1), Literal(2))), 
> "a")() :: Nil,
>   LocalRelation()
> )
> caseInsensitiveAnalyze(plan)
>   }
> {code}
> Note that this kind of query is a corner case, but it is still valid SQL. An 
> expression such as "NULL IN (...)" or "NULL NOT IN (...)" always gives NULL 
> as a result, even if the list contains NULL. So it is safe to translate these 
> expressions to Literal(null) during analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set

2015-10-09 Thread Glyton Camilleri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950156#comment-14950156
 ] 

Glyton Camilleri commented on SPARK-6847:
-

Hi, 

I've also bumped into this very same issue but couldn't find a good value for 
the {{checkpoint}} interval; our setup consists of a Kafka stream with a 10s 
time window, and we tried various values for the checkpoint interval (default, 10s, and 15s). 

It always takes a long time for the exception to appear, often in the range of 
10 hours or so, making the problem relatively painful to debug. We'll be trying 
to investigate further, but it would be great if someone could shed some more 
light on the issue.

> Stack overflow on updateStateByKey which followed by a dstream with 
> checkpoint set
> --
>
> Key: SPARK-6847
> URL: https://issues.apache.org/jira/browse/SPARK-6847
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Jack Hu
>  Labels: StackOverflowError, Streaming
>
> The issue happens with the following sample code, which uses {{updateStateByKey}} 
> followed by a {{map}} with a checkpoint interval of 10 seconds:
> {code}
> val sparkConf = new SparkConf().setAppName("test")
> val streamingContext = new StreamingContext(sparkConf, Seconds(10))
> streamingContext.checkpoint("""checkpoint""")
> val source = streamingContext.socketTextStream("localhost", )
> val updatedResult = source.map(
> (1,_)).updateStateByKey(
> (newlist : Seq[String], oldstate : Option[String]) => 
> newlist.headOption.orElse(oldstate))
> updatedResult.map(_._2)
> .checkpoint(Seconds(10))
> .foreachRDD((rdd, t) => {
>   println("Deep: " + rdd.toDebugString.split("\n").length)
>   println(t.toString() + ": " + rdd.collect.length)
> })
> streamingContext.start()
> streamingContext.awaitTermination()
> {code}
> From the output, we can see that the dependency chain keeps growing over 
> time, the {{updateStateByKey}} state never gets checkpointed, and finally 
> the stack overflow happens.
> Note:
> * The RDD in {{updatedResult.map(_._2)}} gets checkpointed in this case, but 
> not the {{updateStateByKey}} state.
> * If the {{checkpoint(Seconds(10))}} is removed from the map result 
> ( {{updatedResult.map(_._2)}} ), the stack overflow does not happen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7751) Add @Since annotation to stable and experimental methods in MLlib

2015-10-09 Thread Alex Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950169#comment-14950169
 ] 

Alex Hu commented on SPARK-7751:


This is late as the epic is almost complete, but an alternative way of 
determining a string's provenance is to run the following command:
{code}
git log -S{string} {filePath}
{code}

After determining the relevant commit, you can find the containing tag with:
{code}
git tag --contains {commit}
{code}
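For anyone skimming the umbrella, a minimal sketch of what the annotation looks like on a public setter. The {{Since}} class is redefined locally here only so the snippet compiles on its own; Spark's real annotation lives in {{org.apache.spark.annotation}}, and the class, method, and version below are illustrative:
{code}
import scala.annotation.StaticAnnotation

// Local stand-in for org.apache.spark.annotation.Since, for illustration only.
class Since(version: String) extends StaticAnnotation

class ExampleModel {
  private var modelType: String = "multinomial"

  @Since("1.4.0")  // records the Spark version that first shipped this method
  def setModelType(value: String): this.type = {
    this.modelType = value
    this
  }
}
{code}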

> Add @Since annotation to stable and experimental methods in MLlib
> -
>
> Key: SPARK-7751
> URL: https://issues.apache.org/jira/browse/SPARK-7751
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>
> This is useful to check whether a feature exists in some version of Spark. 
> This is an umbrella JIRA to track the progress. We want to have -@since tag- 
> @Since annotation for both stable (those without any 
> Experimental/DeveloperApi/AlphaComponent annotations) and experimental 
> methods in MLlib:
> (Do NOT tag private or package private classes or methods, nor local 
> variables and methods.)
> * an example PR for Scala: https://github.com/apache/spark/pull/8309
> We need to dig through the git history to figure out the Spark version in 
> which a method was first introduced. Take `NaiveBayes.setModelType` as 
> an example. We can grep for `def setModelType` at different version git tags.
> {code}
> meng@xm:~/src/spark
> $ git show 
> v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
>  | grep "def setModelType"
> meng@xm:~/src/spark
> $ git show 
> v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
>  | grep "def setModelType"
>   def setModelType(modelType: String): NaiveBayes = {
> {code}
> If there are better ways, please let us know.
> We cannot add all -@since tags- @Since annotations in a single PR, which would be 
> hard to review. So we made subtasks for each package, for example 
> `org.apache.spark.classification`. Feel free to add more sub-tasks for Python 
> and the `spark.ml` package.
> Plan:
> 1. In 1.5, we try to add @Since annotation to all stable/experimental methods 
> under `spark.mllib`.
> 2. Starting from 1.6, we require @Since annotation in all new PRs.
> 3. In 1.6, we try to add the @Since annotation to all stable/experimental methods 
> under `spark.ml`, `pyspark.mllib`, and `pyspark.ml`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11041) Add (NOT) IN / EXISTS support for predicates

2015-10-09 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao closed SPARK-11041.
-
Resolution: Duplicate

> Add (NOT) IN / EXISTS support for predicates
> 
>
> Key: SPARK-11041
> URL: https://issues.apache.org/jira/browse/SPARK-11041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11043) Hive Thrift Server will log warn "Couldn't find log associated with operation handle"

2015-10-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11043:


Assignee: (was: Apache Spark)

> Hive Thrift Server will log warn "Couldn't find log associated with operation 
> handle"
> -
>
> Key: SPARK-11043
> URL: https://issues.apache.org/jira/browse/SPARK-11043
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>
> The warning log is below:
> {code:title=Warning Log|borderStyle=solid}
> 15/10/09 16:48:23 WARN thrift.ThriftCLIService: Error fetching results: 
> org.apache.hive.service.cli.HiveSQLException: Couldn't find log associated 
> with operation handle: OperationHandle [opType=EXECUTE_STATEMENT, 
> getHandleIdentifier()=fb0900c7-6244-432e-a779-b449ca7f7ca0]
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationLogRowSet(OperationManager.java:229)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:687)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
>   at com.sun.proxy.$Proxy32.fetchResults(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:672)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Once I execute a statement, this warning is logged under the default 
> configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11043) Hive Thrift Server will log warn "Couldn't find log associated with operation handle"

2015-10-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11043:


Assignee: Apache Spark

> Hive Thrift Server will log warn "Couldn't find log associated with operation 
> handle"
> -
>
> Key: SPARK-11043
> URL: https://issues.apache.org/jira/browse/SPARK-11043
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>Assignee: Apache Spark
>
> The warning log is below:
> {code:title=Warning Log|borderStyle=solid}
> 15/10/09 16:48:23 WARN thrift.ThriftCLIService: Error fetching results: 
> org.apache.hive.service.cli.HiveSQLException: Couldn't find log associated 
> with operation handle: OperationHandle [opType=EXECUTE_STATEMENT, 
> getHandleIdentifier()=fb0900c7-6244-432e-a779-b449ca7f7ca0]
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationLogRowSet(OperationManager.java:229)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:687)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
>   at com.sun.proxy.$Proxy32.fetchResults(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:672)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Once I execute a statement, this warning is logged under the default 
> configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10306) sbt hive/update issue

2015-10-09 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951371#comment-14951371
 ] 

holdenk commented on SPARK-10306:
-

So the pull request that I posted has a solution that works for me, but I've 
avoided upstreaming it since the other Spark developers were not 
experiencing the issue. Could other people who experience this run 
"hive/evicted" and "hive/dependencyTree" and post the results here?

> sbt hive/update issue
> -
>
> Key: SPARK-10306
> URL: https://issues.apache.org/jira/browse/SPARK-10306
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: holdenk
>Priority: Trivial
>
> Running sbt hive/update sometimes results in the error "impossible to get 
> artifacts when data has not been loaded. IvyNode = 
> org.scala-lang#scala-library;2.10.3" which is unfortunate since it is always 
> evicted by 2.10.4 currently. An easy (but maybe not super clean) solution 
> would be adding 2.10.3 as a dependency which will then get evicted.
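As a sketch of the workaround described above (an sbt build-definition fragment; exact placement in the Spark build is not shown):
{code}
// Declare the older scala-library explicitly so Ivy's resolution data is loaded for it,
// knowing it will immediately be evicted by 2.10.4.
libraryDependencies += "org.scala-lang" % "scala-library" % "2.10.3"
{code}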



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-10306) sbt hive/update issue

2015-10-09 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reopened SPARK-10306:
-

re-opened since other users are also experiencing the issue

> sbt hive/update issue
> -
>
> Key: SPARK-10306
> URL: https://issues.apache.org/jira/browse/SPARK-10306
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: holdenk
>Priority: Trivial
>
> Running sbt hive/update sometimes results in the error "impossible to get 
> artifacts when data has not been loaded. IvyNode = 
> org.scala-lang#scala-library;2.10.3" which is unfortunate since it is always 
> evicted by 2.10.4 currently. An easy (but maybe not super clean) solution 
> would be adding 2.10.3 as a dependency which will then get evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2309) Generalize the binary logistic regression into multinomial logistic regression

2015-10-09 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951438#comment-14951438
 ] 

DB Tsai commented on SPARK-2309:


I don't quite get you; can you elaborate? I'm pretty sure that the 
implementation in Spark MLlib is the same as the slides, and that is standard 
multinomial LoR. You can check the test code, which shows that the result 
matches R.
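For anyone following along, a minimal sketch of training the multinomial model through the public MLlib API; {{training}} is assumed to be an existing {{RDD[LabeledPoint]}} with labels 0 through 9:
{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// setNumClasses > 2 selects the multinomial formulation of logistic regression.
def trainMultinomial(training: RDD[LabeledPoint]) =
  new LogisticRegressionWithLBFGS()
    .setNumClasses(10)
    .run(training)
{code}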

> Generalize the binary logistic regression into multinomial logistic regression
> --
>
> Key: SPARK-2309
> URL: https://issues.apache.org/jira/browse/SPARK-2309
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Critical
> Fix For: 1.3.0
>
>
> Currently, there is no multi-class classifier in MLlib. Logistic regression 
> can be extended to a multinomial one straightforwardly. 
> The following formula will be implemented. 
> http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11043) Hive Thrift Server will log warn "Couldn't find log associated with operation handle"

2015-10-09 Thread SaintBacchus (JIRA)
SaintBacchus created SPARK-11043:


 Summary: Hive Thrift Server will log warn "Couldn't find log 
associated with operation handle"
 Key: SPARK-11043
 URL: https://issues.apache.org/jira/browse/SPARK-11043
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: SaintBacchus


The warning log is below:
{code:title=Warning Log|borderStyle=solid}
15/10/09 16:48:23 WARN thrift.ThriftCLIService: Error fetching results: 
org.apache.hive.service.cli.HiveSQLException: Couldn't find log associated with 
operation handle: OperationHandle [opType=EXECUTE_STATEMENT, 
getHandleIdentifier()=fb0900c7-6244-432e-a779-b449ca7f7ca0]
at 
org.apache.hive.service.cli.operation.OperationManager.getOperationLogRowSet(OperationManager.java:229)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:687)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
at com.sun.proxy.$Proxy32.fetchResults(Unknown Source)
at 
org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:672)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11043) Hive Thrift Server will log warn "Couldn't find log associated with operation handle"

2015-10-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951526#comment-14951526
 ] 

Apache Spark commented on SPARK-11043:
--

User 'SaintBacchus' has created a pull request for this issue:
https://github.com/apache/spark/pull/9056

> Hive Thrift Server will log warn "Couldn't find log associated with operation 
> handle"
> -
>
> Key: SPARK-11043
> URL: https://issues.apache.org/jira/browse/SPARK-11043
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>
> The warning log is below:
> {code:title=Warning Log|borderStyle=solid}
> 15/10/09 16:48:23 WARN thrift.ThriftCLIService: Error fetching results: 
> org.apache.hive.service.cli.HiveSQLException: Couldn't find log associated 
> with operation handle: OperationHandle [opType=EXECUTE_STATEMENT, 
> getHandleIdentifier()=fb0900c7-6244-432e-a779-b449ca7f7ca0]
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationLogRowSet(OperationManager.java:229)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:687)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
>   at com.sun.proxy.$Proxy32.fetchResults(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:672)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Once I execute a statement, this warning is logged under the default 
> configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11013) SparkPlan may mistakenly register child plan's accumulators for SQL metrics

2015-10-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951534#comment-14951534
 ] 

Wenchen Fan commented on SPARK-11013:
-

The problem is we report accumulators that should not be reported.

For example, take a query plan "Aggregate -> Exchange -> Aggregate", where we define 2 
metrics for `Aggregate`: `numInputRows` and `numOutputRows`. This query has 2 
stages (let's say stg1 and stg2) that are split by the Exchange. When we run 
stg1, we should report 2 accumulators for the bottom Aggregate. When we run 
stg2, we should report another 2 accumulators for the top Aggregate.

However, when we run stg2, we report 4 accumulators, and 2 of them are for the 
bottom Aggregate; these are introduced by the serialization problem described 
before and never get updated. The bottom Aggregate's metrics then receive an extra 
zero-value update, which may lead to wrong results for future metrics like min.

> SparkPlan may mistakenly register child plan's accumulators for SQL metrics
> ---
>
> Key: SPARK-11013
> URL: https://issues.apache.org/jira/browse/SPARK-11013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> The reason is that when we call the RDD API inside a SparkPlan, we are very likely 
> to reference the SparkPlan in the closure and thus serialize and transfer a 
> SparkPlan tree to the executor side. When we deserialize it, the accumulators in 
> the child SparkPlan are also deserialized and registered, and always report a zero 
> value.
> This is not a problem currently because we only have one operation to 
> aggregate the accumulators: add. However, if we want to support more complex 
> metrics like min, the extra zero values will lead to wrong results.
> Take TungstenAggregate as an example: I logged "stageId, partitionId, 
> accumName, accumId" when an accumulator is deserialized and registered, and 
> logged the "accumId -> accumValue" map when a task ends. The output is:
> {code}
> scala> val df = Seq(1 -> "a", 2 -> "b").toDF("a", "b").groupBy().count()
> df: org.apache.spark.sql.DataFrame = [count: bigint]
> scala> df.collect
> register: 0 0 Some(number of input rows) 4
> register: 0 0 Some(number of output rows) 5
> register: 1 0 Some(number of input rows) 4
> register: 1 0 Some(number of output rows) 5
> register: 1 0 Some(number of input rows) 2
> register: 1 0 Some(number of output rows) 3
> Map(5 -> 1, 4 -> 2, 6 -> 4458496)
> Map(5 -> 0, 2 -> 1, 7 -> 4458496, 3 -> 1, 4 -> 0)
> res0: Array[org.apache.spark.sql.Row] = Array([2])
> {code}
> The best choice is to avoid serialize and deserialize a SparkPlan tree, which 
> can be achieved by LocalNode.
> Or we can do some workaround to fix this serialization problem for the 
> problematic SparkPlans like TungstenAggregate, TungstenSort.
> Or we can improve the SQL metrics framework to make it more robust to this 
> case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6567) Large linear model parallelism via a join and reduceByKey

2015-10-09 Thread Ashish Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951540#comment-14951540
 ] 

Ashish Gupta commented on SPARK-6567:
-

Did this effort succeed?

> Large linear model parallelism via a join and reduceByKey
> -
>
> Key: SPARK-6567
> URL: https://issues.apache.org/jira/browse/SPARK-6567
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Reza Zadeh
> Attachments: model-parallelism.pptx
>
>
> To train a linear model, each training point in the training set needs its 
> dot product computed against the model, per iteration. If the model is large 
> (too large to fit in memory on a single machine) then SPARK-4590 proposes 
> using a parameter server.
> There is an easier way to achieve this without parameter servers. In 
> particular, if the data is held as a BlockMatrix and the model as an RDD, 
> then each block can be joined with the relevant part of the model, followed 
> by a reduceByKey to compute the dot products.
> This obviates the need for a parameter server, at least for linear models. 
> However, it's unclear how it compares performance-wise to parameter servers.
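A toy sketch of the join-plus-reduceByKey idea under simplified assumptions: both sides are keyed by feature index, and the data, names, and local master are illustrative:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("join-dot-products").setMaster("local[*]"))
val data  = sc.parallelize(Seq((0, (1L, 2.0)), (1, (1L, 3.0)), (0, (2L, 1.0))))  // (featureIdx, (rowId, x))
val model = sc.parallelize(Seq((0, 0.5), (1, -1.0)))                             // (featureIdx, weight)
val dots = data.join(model)
  .map { case (_, ((rowId, x), w)) => (rowId, x * w) }
  .reduceByKey(_ + _)                                                            // rowId -> dot(row, model)
dots.collect().foreach(println)                                                  // (1,-2.0) and (2,0.5)
sc.stop()
{code}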



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11040) SaslRpcHandler does not delegate all methods to underlying handler

2015-10-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11040:


Assignee: Apache Spark

> SaslRpcHandler does not delegate all methods to underlying handler
> --
>
> Key: SPARK-11040
> URL: https://issues.apache.org/jira/browse/SPARK-11040
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> {{SaslRpcHandler}} only delegates {{receive}} and {{getStreamManager}}, so 
> when SASL is enabled, other events will be missed by apps.
> This affects other versions too, but I think these events aren't actually used 
> there. They'll be used by the new RPC backend in 1.6, though.
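A generic sketch of the bug class being described; the trait and method names are illustrative and are not Spark's actual RpcHandler API:
{code}
trait Handler {
  def receive(message: Array[Byte]): Unit
  def connectionTerminated(): Unit
  def exceptionCaught(cause: Throwable): Unit
}

// A wrapping handler must forward *every* callback. Forwarding only receive()
// means the underlying handler never sees termination or error events.
class WrappingHandler(delegate: Handler) extends Handler {
  override def receive(message: Array[Byte]): Unit = delegate.receive(message)
  override def connectionTerminated(): Unit = delegate.connectionTerminated()
  override def exceptionCaught(cause: Throwable): Unit = delegate.exceptionCaught(cause)
}
{code}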



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10876) display total application time in spark history UI

2015-10-09 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951419#comment-14951419
 ] 

Jakob Odersky commented on SPARK-10876:
---

I'm not sure what you mean. The UI already has a "Duration" field for every job.

> display total application time in spark history UI
> --
>
> Key: SPARK-10876
> URL: https://issues.apache.org/jira/browse/SPARK-10876
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> The history file has application start and application end events.  It 
> would be nice if we could use these to display the total run time for the 
> application in the history UI.
> It could be displayed similarly to "Total Uptime" for a running application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10876) display total application time in spark history UI

2015-10-09 Thread Jakob Odersky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Odersky updated SPARK-10876:
--
Comment: was deleted

(was: I'm not sure what you mean. The UI already has a "Duration" field for 
every job.)

> display total application time in spark history UI
> --
>
> Key: SPARK-10876
> URL: https://issues.apache.org/jira/browse/SPARK-10876
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> The history file has application start and application end events.  It 
> would be nice if we could use these to display the total run time for the 
> application in the history UI.
> It could be displayed similarly to "Total Uptime" for a running application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11041) Add (NOT) IN / EXISTS support for predicates

2015-10-09 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-11041:
-

 Summary: Add (NOT) IN / EXISTS support for predicates
 Key: SPARK-11041
 URL: https://issues.apache.org/jira/browse/SPARK-11041
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11043) Hive Thrift Server will log warn "Couldn't find log associated with operation handle"

2015-10-09 Thread SaintBacchus (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SaintBacchus updated SPARK-11043:
-
Description: 
The warning log is below:
{code:title=Warning Log|borderStyle=solid}
15/10/09 16:48:23 WARN thrift.ThriftCLIService: Error fetching results: 
org.apache.hive.service.cli.HiveSQLException: Couldn't find log associated with 
operation handle: OperationHandle [opType=EXECUTE_STATEMENT, 
getHandleIdentifier()=fb0900c7-6244-432e-a779-b449ca7f7ca0]
at 
org.apache.hive.service.cli.operation.OperationManager.getOperationLogRowSet(OperationManager.java:229)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:687)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
at com.sun.proxy.$Proxy32.fetchResults(Unknown Source)
at 
org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:672)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
Once I execute a statement, this warning is logged under the default 
configuration.

  was:
The warning log is below:
{code:title=Warning Log|borderStyle=solid}
15/10/09 16:48:23 WARN thrift.ThriftCLIService: Error fetching results: 
org.apache.hive.service.cli.HiveSQLException: Couldn't find log associated with 
operation handle: OperationHandle [opType=EXECUTE_STATEMENT, 
getHandleIdentifier()=fb0900c7-6244-432e-a779-b449ca7f7ca0]
at 
org.apache.hive.service.cli.operation.OperationManager.getOperationLogRowSet(OperationManager.java:229)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:687)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
at com.sun.proxy.$Proxy32.fetchResults(Unknown Source)
at 
org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:672)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at 

[jira] [Assigned] (SPARK-11040) SaslRpcHandler does not delegate all methods to underlying handler

2015-10-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11040:


Assignee: (was: Apache Spark)

> SaslRpcHandler does not delegate all methods to underlying handler
> --
>
> Key: SPARK-11040
> URL: https://issues.apache.org/jira/browse/SPARK-11040
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>
> {{SaslRpcHandler}} only delegates {{receive}} and {{getStreamManager}}, so 
> when SASL is enabled, other events will be missed by apps.
> This affects other versions too, but I think these events aren't actually used 
> there. They'll be used by the new RPC backend in 1.6, though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11040) SaslRpcHandler does not delegate all methods to underlying handler

2015-10-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951405#comment-14951405
 ] 

Apache Spark commented on SPARK-11040:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9053

> SaslRpcHandler does not delegate all methods to underlying handler
> --
>
> Key: SPARK-11040
> URL: https://issues.apache.org/jira/browse/SPARK-11040
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>
> {{SaslRpcHandler}} only delegates {{receive}} and {{getStreamManager}}, so 
> when SASL is enabled, other events will be missed by apps.
> This affects other versions too, but I think these events aren't actually used 
> there. They'll be used by the new RPC backend in 1.6, though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager

2015-10-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10985:
--
Assignee: Bowen Zhang

> Avoid passing evicted blocks throughout BlockManager / CacheManager
> ---
>
> Key: SPARK-10985
> URL: https://issues.apache.org/jira/browse/SPARK-10985
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Andrew Or
>Assignee: Bowen Zhang
>Priority: Minor
>
> This is a minor refactoring task.
> Currently when we attempt to put a block in, we get back an array buffer of 
> blocks that are dropped in the process. We do this to propagate these blocks 
> back to our TaskContext, which will add them to its TaskMetrics so we can see 
> them in the SparkUI storage tab properly.
> Now that we have TaskContext.get, we can just use that to propagate this 
> information. This simplifies a lot of the signatures and gets rid of weird 
> return types like the following everywhere:
> {code}
> ArrayBuffer[(BlockId, BlockStatus)]
> {code}
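
A rough sketch of the direction described above, using TaskContext.get to report
dropped blocks instead of returning them up the call chain (the stand-in types
below are not Spark's real classes, and the exact TaskMetrics field is omitted
on purpose):

{code}
import org.apache.spark.TaskContext

// Stand-in types for illustration only; the real ones are
// org.apache.spark.storage.BlockId and BlockStatus.
case class BlockId(name: String)
case class BlockStatus(memSize: Long, diskSize: Long)

// Before: callers must thread the evicted blocks all the way back up the stack.
def putBlockBefore(id: BlockId): Seq[(BlockId, BlockStatus)] = {
  val evicted = Seq.empty[(BlockId, BlockStatus)] // blocks dropped to make room
  evicted
}

// After (roughly): report evictions against the running task directly and
// return Unit, so intermediate signatures no longer carry the buffer around.
def putBlockAfter(id: BlockId): Unit = {
  val evicted = Seq.empty[(BlockId, BlockStatus)]
  Option(TaskContext.get()).foreach { ctx =>
    // In Spark this is where the evicted blocks would be added to the task's
    // TaskMetrics; printing is just a placeholder for that internal update.
    println(s"partition ${ctx.partitionId()} dropped ${evicted.size} blocks")
  }
}
{code}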



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10876) display total application time in spark history UI

2015-10-09 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951434#comment-14951434
 ] 

Jakob Odersky commented on SPARK-10876:
---

Do you mean to display the total run time of uncompleted apps? Completed apps 
already have a "Duration" field.

> display total application time in spark history UI
> --
>
> Key: SPARK-10876
> URL: https://issues.apache.org/jira/browse/SPARK-10876
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> The history file has application start and application end events.  It 
> would be nice if we could use these to display the total run time for the 
> application in the history UI.
> It could be displayed similarly to "Total Uptime" for a running application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2015-10-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951468#comment-14951468
 ] 

Apache Spark commented on SPARK-4226:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/9055

> SparkSQL - Add support for subqueries in predicates
> ---
>
> Key: SPARK-4226
> URL: https://issues.apache.org/jira/browse/SPARK-4226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: Spark 1.2 snapshot
>Reporter: Terry Siu
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = new HiveContext(sc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This 
> thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
>  also brings up the lack of subquery support in SparkSQL. It would be nice to 
> have subquery predicate support in a near-future release (1.3, maybe?).
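
Until subquery predicates are supported, one workaround is to express the IN
subquery as a join. A minimal sketch against the sparkbug table above (it uses
the id column from the DDL, since the customerid column in the reported query
is not part of that DDL, and assumes hc is a HiveContext):

{code}
// Rewrite "id IN (subquery)" as an inner join on the distinct inner result.
val inner = hc.sql("SELECT DISTINCT id FROM sparkbug WHERE id IN (2, 3)")
val outer = hc.table("sparkbug")
val result = outer.join(inner, outer("id") === inner("id"))
  .select(outer("id"), outer("event"))
result.show()
{code}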



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11042) Introduce a mechanism to ban creating new root SQLContexts in a JVM

2015-10-09 Thread Yin Huai (JIRA)
Yin Huai created SPARK-11042:


 Summary: Introduce a mechanism to ban creating new root 
SQLContexts in a JVM
 Key: SPARK-11042
 URL: https://issues.apache.org/jira/browse/SPARK-11042
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai


For some use cases, it will be useful to explicitly ban creating multiple root 
SQLContexts/HiveContexts. Here, the root SQLContext means the first SQLContext 
that gets created.
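
A minimal sketch of what such a ban could look like (the object, method and flag
names here are illustrative, not an actual Spark API or configuration):

{code}
import java.util.concurrent.atomic.AtomicReference

// Illustrative guard only: remember the first ("root") context created in this
// JVM and reject further creations unless explicitly allowed.
object RootSQLContextGuard {
  private val root = new AtomicReference[AnyRef]()

  def onContextCreated(ctx: AnyRef, allowMultipleRootContexts: Boolean): Unit = {
    val isFirst = root.compareAndSet(null, ctx)
    if (!isFirst && !allowMultipleRootContexts) {
      throw new IllegalStateException(
        "A root SQLContext already exists in this JVM; creating another one is banned")
    }
  }
}
{code}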



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11038) Consolidate the format of UnsafeArrayData and UnsafeMapData

2015-10-09 Thread Davies Liu (JIRA)
Davies Liu created SPARK-11038:
--

 Summary: Consolidate the format of UnsafeArrayData and 
UnsafeMapData
 Key: SPARK-11038
 URL: https://issues.apache.org/jira/browse/SPARK-11038
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10930) History "Stages" page "duration" can be confusing

2015-10-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951265#comment-14951265
 ] 

Apache Spark commented on SPARK-10930:
--

User 'd2r' has created a pull request for this issue:
https://github.com/apache/spark/pull/9051

> History "Stages" page "duration" can be confusing
> -
>
> Key: SPARK-10930
> URL: https://issues.apache.org/jira/browse/SPARK-10930
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> The Spark history server's "Stages" page shows each stage's submitted time and 
> duration.  The duration can be confusing since the time a stage actually 
> starts tasks might be much later than when it was submitted if it is waiting on 
> previous stages.  This makes it hard to figure out which stages were really 
> slow without clicking into each stage.
> It would be nice to perhaps have a first-task-launched time or the processing 
> time spent in each stage, to make it easy to find the slow stages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10930) History "Stages" page "duration" can be confusing

2015-10-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10930:


Assignee: Apache Spark

> History "Stages" page "duration" can be confusing
> -
>
> Key: SPARK-10930
> URL: https://issues.apache.org/jira/browse/SPARK-10930
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> The Spark history server's "Stages" page shows each stage's submitted time and 
> duration.  The duration can be confusing since the time a stage actually 
> starts tasks might be much later than when it was submitted if it is waiting on 
> previous stages.  This makes it hard to figure out which stages were really 
> slow without clicking into each stage.
> It would be nice to perhaps have a first-task-launched time or the processing 
> time spent in each stage, to make it easy to find the slow stages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11039) Document all UI "retained*" configurations

2015-10-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11039:


Assignee: (was: Apache Spark)

> Document all UI "retained*" configurations
> --
>
> Key: SPARK-11039
> URL: https://issues.apache.org/jira/browse/SPARK-11039
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Web UI
>Affects Versions: 1.5.1
>Reporter: Nick Pritchard
>Priority: Trivial
>
> Most are documented except these:
> - spark.sql.ui.retainedExecutions
> - spark.streaming.ui.retainedBatches
> They are really helpful for managing the memory usage of the driver 
> application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11039) Document all UI "retained*" configurations

2015-10-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11039:


Assignee: Apache Spark

> Document all UI "retained*" configurations
> --
>
> Key: SPARK-11039
> URL: https://issues.apache.org/jira/browse/SPARK-11039
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Web UI
>Affects Versions: 1.5.1
>Reporter: Nick Pritchard
>Assignee: Apache Spark
>Priority: Trivial
>
> Most are documented except these:
> - spark.sql.ui.retainedExecutions
> - spark.streaming.ui.retainedBatches
> They are really helpful for managing the memory usage of the driver 
> application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10306) sbt hive/update issue

2015-10-09 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951362#comment-14951362
 ] 

Jakob Odersky commented on SPARK-10306:
---

Same issue here

> sbt hive/update issue
> -
>
> Key: SPARK-10306
> URL: https://issues.apache.org/jira/browse/SPARK-10306
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: holdenk
>Priority: Trivial
>
> Running sbt hive/update sometimes results in the error "impossible to get 
> artifacts when data has not been loaded. IvyNode = 
> org.scala-lang#scala-library;2.10.3", which is unfortunate since that artifact 
> is currently always evicted by 2.10.4. An easy (but maybe not super clean) 
> solution would be adding 2.10.3 as a dependency, which will then get evicted.
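
The workaround described above would amount to an sbt setting along these lines
(a sketch only; where exactly it belongs in Spark's build is not spelled out
here):

{code}
// Pin the older scala-library so Ivy can resolve it; it is then evicted by
// 2.10.4 at dependency-resolution time, as the description suggests.
libraryDependencies += "org.scala-lang" % "scala-library" % "2.10.3"
{code}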



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: Move predictNodeIndex to LearningNode

2015-10-09 Thread Luvsandondov Lkhamsuren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949940#comment-14949940
 ] 

Luvsandondov Lkhamsuren commented on SPARK-9963:


Thanks for the tip. I fixed the original PR too. 

> ML RandomForest cleanup: Move predictNodeIndex to LearningNode
> --
>
> Key: SPARK-9963
> URL: https://issues.apache.org/jira/browse/SPARK-9963
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>
> (updated from the original description)
> Move ml.tree.impl.RandomForest.predictNodeIndex to LearningNode.
> We need to keep it as a separate method from Node.predictImpl because (a) it 
> needs to operate on binned features and (b) it needs to return the node ID, 
> not the node (because it can return the ID for nodes which do not yet exist).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8658) AttributeReference equals method only compare name, exprId and dataType

2015-10-09 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949946#comment-14949946
 ] 

Xiao Li commented on SPARK-8658:


Hi, Michael and Antonio, 

Trying to understand the problem and fix it if I can. The expression IDs are the 
same but their qualifiers are different?

Could you give a sample query? I am trying to reproduce the problem. Is this 
problem related to a self join?

Thanks, 

Xiao Li

> AttributeReference equals method only compare name, exprId and dataType
> ---
>
> Key: SPARK-8658
> URL: https://issues.apache.org/jira/browse/SPARK-8658
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0
>Reporter: Antonio Jesus Navarro
>
> The AttributeReference "equals" method only treats objects as different when 
> they have a different name, expression id or dataType. With this behavior, when 
> I tried to do a "transformExpressionsDown" to transform the qualifiers inside 
> "AttributeReferences", these objects were not replaced, because the 
> transformer considers them equal.
> I propose that the "equals" method compare these variables:
> name, dataType, nullable, metadata, exprId, qualifiers
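
A sketch of the proposed comparison, on a simplified stand-in rather than
Spark's real AttributeReference (field types are approximations):

{code}
// Simplified stand-in for AttributeReference, for illustration only.
class Attr(
    val name: String,
    val dataType: String,
    val nullable: Boolean,
    val metadata: Map[String, String],
    val exprId: Long,
    val qualifiers: Seq[String]) {

  // Proposed equality: every field participates, so a transformation that only
  // changes qualifiers produces an object that no longer compares equal.
  override def equals(other: Any): Boolean = other match {
    case that: Attr =>
      name == that.name &&
        dataType == that.dataType &&
        nullable == that.nullable &&
        metadata == that.metadata &&
        exprId == that.exprId &&
        qualifiers == that.qualifiers
    case _ => false
  }

  override def hashCode(): Int =
    Seq(name, dataType, nullable, metadata, exprId, qualifiers).hashCode()
}
{code}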



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11022) Spark Worker process find Memory leak after long time running

2015-10-09 Thread colin shaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

colin shaw updated SPARK-11022:
---
Description: 
The Worker process often goes down even though there were no abnormal tasks; it 
just crashes without any message. After adding "-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=${SPARK_HOME}/logs", a dump file shows there are "17,010 
instances of "org.apache.spark.deploy.worker.ExecutorRunner", loaded by 
"sun.misc.Launcher$AppClassLoader @ 0xe2abfcc8" occupy 496,706,920 (96.14%) 
bytes.",
and all the instances were stored in a single 
"org.apache.spark.deploy.worker.Worker" instance; its finishedExecutors field 
holds many ExecutorRunner objects.

The code (Worker.scala) shows that finishedExecutors is only written via 
"finishedExecutors(fullId) = executor" and read via 
"finishedExecutors.values.toList"; there is no code path that removes executors, 
so they all stay in memory, and after running for a long time the process 
crashes.

  was:
Worker process often down,while there were not any abnormal task,just crash 
without anymessage, after added "-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=${SPARK_HOME}/logs", a dump file show there is "17,010 
instances of "org.apache.spark.deploy.worker.ExecutorRunner", loaded by 
"sun.misc.Launcher$AppClassLoader @ 0xe2abfcc8" occupy 496,706,920 (96.14%) 
bytes. "
and all the instance were stored in a "org.apache.spark.deploy.worker.Worker" 
instance, the finishedExecutors field hold many ExecutorRunner.

The codes(Worker.scala) shows finishedExecutors just "finishedExecutors(fullId) 
= executor" and "finishedExecutors.values.toList",there is no action which 
remove the Executor,all were stored in memory,so after long time 
running,crashed.


> Spark Worker process find Memory leak after long time running
> -
>
> Key: SPARK-11022
> URL: https://issues.apache.org/jira/browse/SPARK-11022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: colin shaw
>
> The Worker process often goes down even though there were no abnormal tasks; 
> it just crashes without any message. After adding 
> "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${SPARK_HOME}/logs", a dump 
> file shows there are "17,010 instances of 
> "org.apache.spark.deploy.worker.ExecutorRunner", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0xe2abfcc8" occupy 496,706,920 (96.14%) 
> bytes.",
> and all the instances were stored in a single 
> "org.apache.spark.deploy.worker.Worker" instance; its finishedExecutors field 
> holds many ExecutorRunner objects.
> The code (Worker.scala) shows that finishedExecutors is only written via 
> "finishedExecutors(fullId) = executor" and read via 
> "finishedExecutors.values.toList"; there is no code path that removes 
> executors, so they all stay in memory, and after running for a long time the 
> process crashes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11022) Spark Worker process find Memory leaking after long time running

2015-10-09 Thread colin shaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

colin shaw updated SPARK-11022:
---
Summary: Spark Worker process find Memory leaking after long time running  
(was: Spark Worker process find Memory leak after long time running)

> Spark Worker process find Memory leaking after long time running
> 
>
> Key: SPARK-11022
> URL: https://issues.apache.org/jira/browse/SPARK-11022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: colin shaw
>
> The Worker process often goes down even though there were no abnormal tasks; 
> it just crashes without any message. After adding 
> "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${SPARK_HOME}/logs", a dump 
> file shows there are "17,010 instances of 
> "org.apache.spark.deploy.worker.ExecutorRunner", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0xe2abfcc8" occupy 496,706,920 (96.14%) 
> bytes.",
> and almost all the instances were stored in a single 
> "org.apache.spark.deploy.worker.Worker" instance; its finishedExecutors field 
> holds many ExecutorRunner objects.
> The code (Worker.scala) shows that finishedExecutors is only written via 
> "finishedExecutors(fullId) = executor" and read via 
> "finishedExecutors.values.toList"; there is no code path that removes 
> executors, so they all stay in memory, and after running for a long time the 
> process crashes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11022) Spark Worker process find Memory leak after long time running

2015-10-09 Thread colin shaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

colin shaw updated SPARK-11022:
---
Description: 
The Worker process often goes down even though there were no abnormal tasks; it 
just crashes without any message. After adding "-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=${SPARK_HOME}/logs", a dump file shows there are "17,010 
instances of "org.apache.spark.deploy.worker.ExecutorRunner", loaded by 
"sun.misc.Launcher$AppClassLoader @ 0xe2abfcc8" occupy 496,706,920 (96.14%) 
bytes.",
and almost all the instances were stored in a single 
"org.apache.spark.deploy.worker.Worker" instance; its finishedExecutors field 
holds many ExecutorRunner objects.

The code (Worker.scala) shows that finishedExecutors is only written via 
"finishedExecutors(fullId) = executor" and read via 
"finishedExecutors.values.toList"; there is no code path that removes executors, 
so they all stay in memory, and after running for a long time the process 
crashes.

  was:
Worker process often down,while there were not any abnormal tasks,just crash 
without anymessage, after added "-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=${SPARK_HOME}/logs", a dump file show there is "17,010 
instances of "org.apache.spark.deploy.worker.ExecutorRunner", loaded by 
"sun.misc.Launcher$AppClassLoader @ 0xe2abfcc8" occupy 496,706,920 (96.14%) 
bytes. "
and all the instance were stored in a "org.apache.spark.deploy.worker.Worker" 
instance, the finishedExecutors field hold many ExecutorRunner.

The codes(Worker.scala) shows finishedExecutors just "finishedExecutors(fullId) 
= executor" and "finishedExecutors.values.toList",there is no action which 
remove the Executor,all were stored in memory,so after long time 
running,crashed.


> Spark Worker process find Memory leak after long time running
> -
>
> Key: SPARK-11022
> URL: https://issues.apache.org/jira/browse/SPARK-11022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: colin shaw
>
> The Worker process often goes down even though there were no abnormal tasks; 
> it just crashes without any message. After adding 
> "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${SPARK_HOME}/logs", a dump 
> file shows there are "17,010 instances of 
> "org.apache.spark.deploy.worker.ExecutorRunner", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0xe2abfcc8" occupy 496,706,920 (96.14%) 
> bytes.",
> and almost all the instances were stored in a single 
> "org.apache.spark.deploy.worker.Worker" instance; its finishedExecutors field 
> holds many ExecutorRunner objects.
> The code (Worker.scala) shows that finishedExecutors is only written via 
> "finishedExecutors(fullId) = executor" and read via 
> "finishedExecutors.values.toList"; there is no code path that removes 
> executors, so they all stay in memory, and after running for a long time the 
> process crashes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11021) SparkSQL cli throws exception when using with Hive 0.12 metastore in spark-1.5.0 version

2015-10-09 Thread iward (JIRA)
iward created SPARK-11021:
-

 Summary: SparkSQL cli throws exception when using with Hive 0.12 
metastore in spark-1.5.0 version
 Key: SPARK-11021
 URL: https://issues.apache.org/jira/browse/SPARK-11021
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: iward


After upgrading Spark from 1.4.1 to 1.5.0, I get the following exception when I 
set the following properties in spark-defaults.conf:
{noformat}
spark.sql.hive.metastore.version=0.12.0
spark.sql.hive.metastore.jars=hive 0.12 jars and hadoop jars
{noformat}

When I run a task, it fails with the following exception:
{noformat}
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.sql.hive.client.Shim_v0_12.loadTable(HiveShim.scala:249)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply$mcV$sp(ClientWrapper.scala:489)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
at 
org.apache.spark.sql.hive.client.ClientWrapper.loadTable(ClientWrapper.scala:488)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:243)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:263)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:927)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:927)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:144)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:129)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:719)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:61)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:311)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:165)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
results from 
hdfs://ns1/user/dd_edw/warehouse/tmp/gdm_m10_afs_task_process_spark/.hive-staging_hive_2015-10-09_11-34-50_831_2280183503220873069-1/-ext-1
 to destination directory: 
/user/dd_edw/warehouse/tmp/gdm_m10_afs_task_process_spark
at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2303)
at org.apache.hadoop.hive.ql.metadata.Table.replaceFiles(Table.java:639)
at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1441)
... 40 more
{noformat}


[jira] [Created] (SPARK-11022) Spark Worker process find Memory leak after long time running

2015-10-09 Thread colin shaw (JIRA)
colin shaw created SPARK-11022:
--

 Summary: Spark Worker process find Memory leak after long time 
running
 Key: SPARK-11022
 URL: https://issues.apache.org/jira/browse/SPARK-11022
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: colin shaw


The Worker process often goes down even though there were no abnormal tasks; it 
just crashes without any message. After adding "-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=${SPARK_HOME}/logs", a dump file shows there are "17,010 
instances of "org.apache.spark.deploy.worker.ExecutorRunner", loaded by 
"sun.misc.Launcher$AppClassLoader @ 0xe2abfcc8" occupy 496,706,920 (96.14%) 
bytes.",
and all the instances were stored in a single 
"org.apache.spark.deploy.worker.Worker" instance; its finishedExecutors field 
holds many ExecutorRunner objects.

The code (Worker.scala) shows that finishedExecutors is only written via 
"finishedExecutors(fullId) = executor" and read via 
"finishedExecutors.values.toList"; there is no code path that removes executors, 
so they all stay in memory, and after running for a long time the process 
crashes.
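
One way to bound this kind of bookkeeping (a sketch only; the cap, its name and
the eviction policy are assumptions, not Spark's actual fix):

{code}
import scala.collection.mutable

// Keep at most `maxRetained` finished executors; evict the oldest beyond that.
class FinishedExecutorBuffer(maxRetained: Int) {
  private val finished = mutable.LinkedHashMap[String, AnyRef]()

  def add(fullId: String, runner: AnyRef): Unit = {
    finished(fullId) = runner
    while (finished.size > maxRetained) {
      finished.remove(finished.head._1) // drop the oldest entry
    }
  }

  def snapshot: List[AnyRef] = finished.values.toList
}
{code}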



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11028) When planning queries without partial aggregation support, we should try to use TungstenAggregate.

2015-10-09 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951210#comment-14951210
 ] 

Josh Rosen commented on SPARK-11028:


[~yhuai], if we fix SPARK-10992 first then will we still need to do this? Will 
it still be the case that _some_ HiveUDAFs don't support partial aggregation, 
requiring this?

> When planning queries without partial aggregation support, we should try to 
> use TungstenAggregate.
> --
>
> Key: SPARK-11028
> URL: https://issues.apache.org/jira/browse/SPARK-11028
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> With SPARK-11017, we can run DeclarativeAggregate Functions in 
> TungstenAggregate. So, when we plan queries having functions that do not 
> support partial aggregation, we can use TungstenAggregate whenever possible. 
> The reason that we only use SortBasedAggregate is that HiveUDAF is the only 
> function that does not support partial aggregation and it is a 
> DeclarativeAggregate function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns

2015-10-09 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951212#comment-14951212
 ] 

Josh Rosen commented on SPARK-9241:
---

[~yhuai], [~rxin], would you like to update this ticket based on recent 
discussions?

> Supporting multiple DISTINCT columns
> 
>
> Key: SPARK-9241
> URL: https://issues.apache.org/jira/browse/SPARK-9241
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now the new aggregation code path only supports a single distinct column 
> (you can use it in multiple aggregate functions in the query). We need to 
> support multiple distinct columns by generating a different plan for handling 
> multiple distinct columns (without changing the aggregate functions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10535) Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark

2015-10-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10535.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8700
[https://github.com/apache/spark/pull/8700]

> Support for recommendUsersForProducts and recommendProductsForUsers  in 
> matrix factorization model for PySpark
> --
>
> Key: SPARK-10535
> URL: https://issues.apache.org/jira/browse/SPARK-10535
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Vladimir Vladimirov
>Assignee: Vladimir Vladimirov
> Fix For: 1.6.0
>
>
> The Scala and Java APIs provide the recommendUsersForProducts and 
> recommendProductsForUsers methods, but the PySpark MLlib API doesn't have them.
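
For reference, the existing Scala usage that the PySpark additions mirror looks
roughly like this (assumes an existing SparkContext named sc and some Rating
data):

{code}
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

val ratings = sc.parallelize(Seq(
  Rating(1, 10, 5.0), Rating(2, 10, 3.0), Rating(1, 20, 4.0)))
val model: MatrixFactorizationModel = ALS.train(ratings, 5, 10) // rank 5, 10 iterations

// Top-3 products for every user, and top-3 users for every product.
val productsForUsers = model.recommendProductsForUsers(3)
val usersForProducts = model.recommendUsersForProducts(3)
{code}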



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-10-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10858.

   Resolution: Fixed
Fix Version/s: 1.6.0
   1.5.2

> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Minor
> Fix For: 1.5.2, 1.6.0
>
>
> The YARN distributed cache feature with --jars, --archives, and --files, where 
> you can rename the file/archive using a # symbol, only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
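
The difference can be reproduced with plain java.net.URI: with a scheme the
fragment (the rename) is parsed out, while a scheme-less string that ends up
treated as a local path keeps "#renamed.jar" in the file name, which matches
the FileNotFoundException above (a sketch of the behavior, not the actual
Client.scala logic):

{code}
import java.net.URI

val withScheme = new URI("file:///home/foo/my.jar#renamed.jar")
println(withScheme.getPath)      // /home/foo/my.jar
println(withScheme.getFragment)  // renamed.jar

val noScheme = new URI("/home/foo/my.jar#renamed.jar")
println(noScheme.getScheme)      // null -- nothing marks this as a URI

// Treating the raw string as a plain local path keeps the '#' in the name:
println(new java.io.File("/home/foo/my.jar#renamed.jar").getName) // my.jar#renamed.jar
{code}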



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11037) Cleanup Option usage in JdbcUtils

2015-10-09 Thread Rick Hillegas (JIRA)
Rick Hillegas created SPARK-11037:
-

 Summary: Cleanup Option usage in JdbcUtils
 Key: SPARK-11037
 URL: https://issues.apache.org/jira/browse/SPARK-11037
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 1.5.1
Reporter: Rick Hillegas
Priority: Trivial


The following issue came up in the review of the pull request for SPARK-10855 
(https://github.com/apache/spark/pull/8982): We should use Option(...) instead 
of Some(...) because the former handles null arguments.
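
The distinction in miniature; Option(...) is the safer constructor whenever the
argument may be null:

{code}
val maybeNull: String = null

Some(maybeNull)   // Some(null) -- the null sneaks through and can blow up later
Option(maybeNull) // None       -- the null is absorbed at the boundary
{code}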



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10984) Simplify *MemoryManager class structure

2015-10-09 Thread Bowen Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951343#comment-14951343
 ] 

Bowen Zhang commented on SPARK-10984:
-

[~andrewor14], sure, assign that to me.

> Simplify *MemoryManager class structure
> ---
>
> Key: SPARK-10984
> URL: https://issues.apache.org/jira/browse/SPARK-10984
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Josh Rosen
>
> This is a refactoring task.
> After SPARK-10956 gets merged, we will have the following:
> - MemoryManager
> - StaticMemoryManager
> - ExecutorMemoryManager
> - TaskMemoryManager
> - ShuffleMemoryManager
> This is pretty confusing. The goal is to merge ShuffleMemoryManager and 
> ExecutorMemoryManager and move them into the top-level MemoryManager abstract 
> class. Then TaskMemoryManager should be renamed something else and used by 
> MemoryManager, such that the new hierarchy becomes:
> - MemoryManager
> - StaticMemoryManager
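
A very rough sketch of the target shape (the method names and bookkeeping below
are illustrative, not the real MemoryManager API):

{code}
// Illustrative only: one abstract MemoryManager owning both execution and
// storage accounting, with the static policy as a concrete subclass.
abstract class MemoryManager {
  def acquireExecutionMemory(numBytes: Long): Long
  def acquireStorageMemory(numBytes: Long): Boolean
  def releaseExecutionMemory(numBytes: Long): Unit
  def releaseStorageMemory(numBytes: Long): Unit
}

class StaticMemoryManager(maxExecution: Long, maxStorage: Long) extends MemoryManager {
  private var usedExecution = 0L
  private var usedStorage = 0L

  override def acquireExecutionMemory(numBytes: Long): Long = synchronized {
    val granted = math.min(numBytes, maxExecution - usedExecution)
    usedExecution += granted
    granted
  }

  override def acquireStorageMemory(numBytes: Long): Boolean = synchronized {
    val fits = usedStorage + numBytes <= maxStorage
    if (fits) usedStorage += numBytes
    fits
  }

  override def releaseExecutionMemory(numBytes: Long): Unit = synchronized {
    usedExecution -= numBytes
  }

  override def releaseStorageMemory(numBytes: Long): Unit = synchronized {
    usedStorage -= numBytes
  }
}
{code}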



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11035) Launcher: allow apps to be launched in-process

2015-10-09 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-11035:
--

 Summary: Launcher: allow apps to be launched in-process
 Key: SPARK-11035
 URL: https://issues.apache.org/jira/browse/SPARK-11035
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Marcelo Vanzin


The launcher library is currently restricted to launching apps as child 
processes. That is fine for a lot of cases, especially if the app is running in 
client mode.

But in certain cases, especially launching in cluster mode, it's more efficient 
to avoid launching a new process, since that process won't be doing much.

We should add support for launching apps in process, even if restricted to 
cluster mode at first. This will require some rework of the launch paths to 
avoid using system properties to propagate configuration.
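
For context, this is roughly how the launcher library is used today, always
forking a child spark-submit process (the paths and class name below are
placeholders):

{code}
import org.apache.spark.launcher.SparkLauncher

// Today launch() always forks a child JVM; the proposal is to optionally run
// the submission in-process, at least for cluster mode.
val process = new SparkLauncher()
  .setSparkHome("/path/to/spark")       // placeholder
  .setAppResource("/path/to/app.jar")   // placeholder
  .setMainClass("com.example.MyApp")    // placeholder
  .setMaster("yarn-cluster")
  .launch()

process.waitFor()
{code}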



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11039) Document all UI "retained*" configurations

2015-10-09 Thread Nick Pritchard (JIRA)
Nick Pritchard created SPARK-11039:
--

 Summary: Document all UI "retained*" configurations
 Key: SPARK-11039
 URL: https://issues.apache.org/jira/browse/SPARK-11039
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Web UI
Affects Versions: 1.5.1
Reporter: Nick Pritchard
Priority: Trivial


Most are documented except these:
- spark.sql.ui.retainedExecutions
- spark.streaming.ui.retainedBatches

They are really helpful for managing the memory usage of the driver application.
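
For example, both can be capped in the application's SparkConf (the values below
are arbitrary examples):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.sql.ui.retainedExecutions", "50")
  .set("spark.streaming.ui.retainedBatches", "100")
{code}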



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns

2015-10-09 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951290#comment-14951290
 ] 

Yin Huai commented on SPARK-9241:
-

Yeah. When we compile the query, we can split a query with multiple distinct 
columns into multiple queries, where every query evaluates a single distinct 
aggregation. Then we can join the results using the group-by keys as the join 
keys. In the join we need to use null-safe equality as the condition. We would 
also need another optimization to make this work efficiently.

Here is an example,
{code}
SELECT COUNT(DISTINCT a), COUNT(DISTINCT b), c FROM t GROUP BY c
{code}
will be rewritten to
{code}
SELECT x.count_a, y.count_b, x.c
FROM
>   (SELECT c, COUNT(DISTINCT a) count_a FROM t GROUP BY c) x JOIN
>   (SELECT c, COUNT(DISTINCT b) count_b FROM t GROUP BY c) y 
  ON coalesce(x.c, 0) = coalesce(y.c, 0) AND x.c <=> y.c
{code}

> Supporting multiple DISTINCT columns
> 
>
> Key: SPARK-9241
> URL: https://issues.apache.org/jira/browse/SPARK-9241
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now the new aggregation code path only supports a single distinct column 
> (you can use it in multiple aggregate functions in the query). We need to 
> support multiple distinct columns by generating a different plan for handling 
> multiple distinct columns (without changing the aggregate functions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8673) Launcher: add support for monitoring launched applications

2015-10-09 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-8673.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 7052
[https://github.com/apache/spark/pull/7052]

> Launcher: add support for monitoring launched applications
> --
>
> Key: SPARK-8673
> URL: https://issues.apache.org/jira/browse/SPARK-8673
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.6.0
>
>
> See parent bug for details.
> This task covers adding the groundwork for being able to communicate with the 
> launched Spark application and provide ways for the code using the launcher 
> library to interact with it.
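
A sketch of how the monitoring support is used, assuming the SparkAppHandle API
that this pull request adds for 1.6 (the paths and class name are placeholders):

{code}
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val handle = new SparkLauncher()
  .setAppResource("/path/to/app.jar")   // placeholder
  .setMainClass("com.example.MyApp")    // placeholder
  .setMaster("local[*]")
  .startApplication(new SparkAppHandle.Listener {
    override def stateChanged(h: SparkAppHandle): Unit = println(s"state: ${h.getState}")
    override def infoChanged(h: SparkAppHandle): Unit = println(s"app id: ${h.getAppId}")
  })

// The handle can also be polled, or used to stop or kill the application.
while (!handle.getState.isFinal) Thread.sleep(500)
{code}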



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11017) Support ImperativeAggregates in TungstenAggregate

2015-10-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11017:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-4366

> Support ImperativeAggregates in TungstenAggregate
> -
>
> Key: SPARK-11017
> URL: https://issues.apache.org/jira/browse/SPARK-11017
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The TungstenAggregate operator currently only supports DeclarativeAggregate 
> functions (i.e. expression-based aggregates); we should extend it to also 
> support ImperativeAggregate functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11009) RowNumber in HiveContext returns negative values in cluster mode

2015-10-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-11009:
--

Assignee: Davies Liu

> RowNumber in HiveContext returns negative values in cluster mode
> 
>
> Key: SPARK-11009
> URL: https://issues.apache.org/jira/browse/SPARK-11009
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Standalone cluster mode. No hadoop/hive is present in 
> the environment (no hive-site.xml), only using HiveContext. Spark build as 
> with hadoop 2.6.0. Default spark configuration variables. cluster has 4 
> nodes, but happens with n nodes as well.
>Reporter: Saif Addin Ellafi
>Assignee: Davies Liu
>
> This issue happens when submitting the job into a standalone cluster. Have 
> not tried YARN or MESOS. Repartition df into 1 piece or default parallelism=1 
> does not fix the issue. Also tried having only one node in the cluster, with 
> same result. Other shuffle configuration changes do not alter the results 
> either.
> The issue does NOT happen in --master local[*].
> val ws = Window.
> partitionBy("client_id").
> orderBy("date")
>  
> val nm = "repeatMe"
> df.select(df.col("*"), rowNumber().over(ws).as(nm))
>  
> 
> df.filter(df("repeatMe").isNotNull).orderBy("repeatMe").take(50).foreach(println(_))
>  
> --->
>  
> Long, DateType, Int
> [219483904822,2006-06-01,-1863462909]
> [219483904822,2006-09-01,-1863462909]
> [219483904822,2007-01-01,-1863462909]
> [219483904822,2007-08-01,-1863462909]
> [219483904822,2007-07-01,-1863462909]
> [192489238423,2007-07-01,-1863462774]
> [192489238423,2007-02-01,-1863462774]
> [192489238423,2006-11-01,-1863462774]
> [192489238423,2006-08-01,-1863462774]
> [192489238423,2007-08-01,-1863462774]
> [192489238423,2006-09-01,-1863462774]
> [192489238423,2007-03-01,-1863462774]
> [192489238423,2006-10-01,-1863462774]
> [192489238423,2007-05-01,-1863462774]
> [192489238423,2006-06-01,-1863462774]
> [192489238423,2006-12-01,-1863462774]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10988) Reduce duplication in Aggregate2's expression rewriting logic

2015-10-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10988:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-4366

> Reduce duplication in Aggregate2's expression rewriting logic
> -
>
> Key: SPARK-10988
> URL: https://issues.apache.org/jira/browse/SPARK-10988
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> In `aggregate/utils.scala`, there is a substantial amount of duplication in 
> the expression-rewriting logic. As a prerequisite to supporting imperative 
> aggregate functions in `TungstenAggregate`, we should refactor this file so 
> that the same expression-rewriting logic is used for both `SortAggregate` and 
> `TungstenAggregate`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10941) .Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity

2015-10-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10941:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-4366

> .Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve 
> code clarity
> --
>
> Key: SPARK-10941
> URL: https://issues.apache.org/jira/browse/SPARK-10941
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> Spark SQL's new AlgebraicAggregate interface is confusingly named.
> AlgebraicAggregate inherits from AggregateFunction2, adds a new set of 
> methods, then effectively bans the use of the inherited methods. This is 
> really confusing. I think that it's an anti-pattern / bad code smell if you 
> end up inheriting and wanting to remove methods inherited from the superclass.
> I think that we should re-name this class and should refactor the class 
> hierarchy so that there's a clear distinction between which parts of the code 
> work with imperative aggregate functions vs. expression-based aggregates.
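
Very roughly, the kind of separation being asked for (the names and signatures
below are illustrative only, not the actual refactoring):

{code}
// A common parent, with two clearly separate styles of aggregate, instead of a
// subclass that inherits methods it is then not allowed to use.
abstract class AggregateFunction {
  def name: String
}

// Expression-based aggregates: logic is described as expressions.
abstract class ExpressionBasedAggregate extends AggregateFunction {
  def initialValues: Seq[String]
  def updateExpressions: Seq[String]
  def evaluateExpression: String
}

// Imperative aggregates: logic is described as methods over a mutable buffer.
abstract class ImperativeStyleAggregate extends AggregateFunction {
  def initialize(buffer: Array[Any]): Unit
  def update(buffer: Array[Any], input: Any): Unit
  def eval(buffer: Array[Any]): Any
}
{code}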



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4429) Build for Scala 2.11 using sbt fails.

2015-10-09 Thread Peter Halliday (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951203#comment-14951203
 ] 

Peter Halliday commented on SPARK-4429:
---

I'm wondering where this is at?  

> Build for Scala 2.11 using sbt fails.
> -
>
> Key: SPARK-4429
> URL: https://issues.apache.org/jira/browse/SPARK-4429
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.2.0
>
>
> I tried to build for Scala 2.11 using sbt with the following command:
> {quote}
> $ sbt/sbt -Dscala-2.11 assembly
> {quote}
> but it ends with the following error messages:
> {quote}
> \[error\] (streaming-kafka/*:update) sbt.ResolveException: unresolved 
> dependency: org.apache.kafka#kafka_2.11;0.8.0: not found
> \[error\] (catalyst/*:update) sbt.ResolveException: unresolved dependency: 
> org.scalamacros#quasiquotes_2.11;2.0.1: not found
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8142) Spark Job Fails with ResultTask ClassCastException

2015-10-09 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951238#comment-14951238
 ] 

Charles Allen commented on SPARK-8142:
--

I had a similar failure to the one reported here and solved it by setting 
"spark.executor.userClassPathFirst" to "false" and 
"spark.driver.userClassPathFirst" to "false".

> Spark Job Fails with ResultTask ClassCastException
> --
>
> Key: SPARK-8142
> URL: https://issues.apache.org/jira/browse/SPARK-8142
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: Dev Lakhani
>
> When running a Spark job, I get no failures in the application code 
> whatsoever, but a weird ResultTask ClassCastException. In my job, I create an 
> RDD from HBase and for each partition do a REST call on an API, using a REST 
> client.  This worked in IntelliJ, but when I deploy to a cluster using 
> spark-submit.sh I get:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, host): java.lang.ClassCastException: 
> org.apache.spark.scheduler.ResultTask cannot be cast to 
> org.apache.spark.scheduler.Task
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> These are the configs I set to override the spark classpath because I want to 
> use my own glassfish jersey version:
>  
> sparkConf.set("spark.driver.userClassPathFirst","true");
> sparkConf.set("spark.executor.userClassPathFirst","true");
> I see no other warnings or errors in any of the logs.
> Unfortunately I cannot post my code, but please ask me questions that will 
> help debug the issue. Using spark 1.3.1 hadoop 2.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10930) History "Stages" page "duration" can be confusing

2015-10-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10930:


Assignee: (was: Apache Spark)

> History "Stages" page "duration" can be confusing
> -
>
> Key: SPARK-10930
> URL: https://issues.apache.org/jira/browse/SPARK-10930
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> The Spark history server's "Stages" page shows each stage's submitted time and 
> duration.  The duration can be confusing since the time a stage actually 
> starts tasks might be much later than when it was submitted if it is waiting on 
> previous stages.  This makes it hard to figure out which stages were really 
> slow without clicking into each stage.
> It would be nice to perhaps have a first-task-launched time or the processing 
> time spent in each stage, to make it easy to find the slow stages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11039) Document all UI "retained*" configurations

2015-10-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951282#comment-14951282
 ] 

Apache Spark commented on SPARK-11039:
--

User 'pnpritchard' has created a pull request for this issue:
https://github.com/apache/spark/pull/9052

> Document all UI "retained*" configurations
> --
>
> Key: SPARK-11039
> URL: https://issues.apache.org/jira/browse/SPARK-11039
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Web UI
>Affects Versions: 1.5.1
>Reporter: Nick Pritchard
>Priority: Trivial
>
> Most are documented except these:
> - spark.sql.ui.retainedExecutions
> - spark.streaming.ui.retainedBatches
> They are really helpful for managing the memory usage of the driver 
> application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11040) SaslRpcHandler does not delegate all methods to underlying handler

2015-10-09 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-11040:
--

 Summary: SaslRpcHandler does not delegate all methods to 
underlying handler
 Key: SPARK-11040
 URL: https://issues.apache.org/jira/browse/SPARK-11040
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Marcelo Vanzin


{{SaslRpcHandler}} only delegates {{receive}} and {{getStreamManager}}, so when 
SASL is enabled, other events will be missed by apps.

This affects other versions too, but I think these events aren't actually used 
there. They'll be used by the new rpc backend in 1.6, though.
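
The delegation problem in miniature, on a made-up handler trait (the real
RpcHandler API has different names and signatures; this only illustrates why
every callback must be forwarded):

{code}
// Hypothetical handler trait, for illustration only.
trait Handler {
  def receive(msg: Array[Byte]): Unit
  def getStreamManager(): AnyRef
  def connectionTerminated(): Unit
  def exceptionCaught(cause: Throwable): Unit
}

// Forwarding only some callbacks silently drops the others.
class PartialWrapper(delegate: Handler) extends Handler {
  override def receive(msg: Array[Byte]): Unit = delegate.receive(msg)
  override def getStreamManager(): AnyRef = delegate.getStreamManager()
  override def connectionTerminated(): Unit = ()            // bug: event lost
  override def exceptionCaught(cause: Throwable): Unit = () // bug: event lost
}

// The fix: forward every method to the underlying handler.
class FullWrapper(delegate: Handler) extends Handler {
  override def receive(msg: Array[Byte]): Unit = delegate.receive(msg)
  override def getStreamManager(): AnyRef = delegate.getStreamManager()
  override def connectionTerminated(): Unit = delegate.connectionTerminated()
  override def exceptionCaught(cause: Throwable): Unit = delegate.exceptionCaught(cause)
}
{code}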



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10855) Add a JDBC dialect for Apache Derby

2015-10-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10855.
-
   Resolution: Fixed
 Assignee: Rick Hillegas
Fix Version/s: 1.6.0

> Add a JDBC dialect for Apache  Derby
> 
>
> Key: SPARK-10855
> URL: https://issues.apache.org/jira/browse/SPARK-10855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Rick Hillegas
>Assignee: Rick Hillegas
>Priority: Minor
> Fix For: 1.6.0
>
>
> In particular, it would be good if the dialect could handle Derby's 
> user-defined types. The following script fails:
> {noformat}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> // the following script was used to create a Derby table
> // which has a column of user-defined type:
> // 
> // create type properties external name 'java.util.Properties' language java;
> // 
> // create function systemProperties() returns properties
> // language java parameter style java no sql
> // external name 'java.lang.System.getProperties';
> // 
> // create table propertiesTable( props properties );
> // 
> // insert into propertiesTable values ( null ), ( systemProperties() );
> // 
> // select * from propertiesTable;
> // cannot handle a table which has a column of type 
> java.sql.Types.JAVA_OBJECT:
> //
> // java.sql.SQLException: Unsupported type 2000
> //
> val df = sqlContext.read.format("jdbc").options( 
>   Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
>   "dbtable" -> "app.propertiesTable")).load()
> // shutdown the Derby engine
> val shutdown = sqlContext.read.format("jdbc").options( 
>   Map("url" -> "jdbc:derby:;shutdown=true",
>   "dbtable" -> "")).load()
> exit()
> {noformat}
> The inability to handle user-defined types probably affects other databases 
> besides Derby.
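
A minimal dialect registration would look roughly like this; mapping JAVA_OBJECT
columns to BinaryType is an assumption for illustration, not necessarily what
the merged dialect does:

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{BinaryType, DataType, MetadataBuilder}

object DerbyDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:derby")

  // Map Derby's user-defined (JAVA_OBJECT) columns to something Spark SQL can
  // carry; BinaryType is just one plausible choice.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (sqlType == java.sql.Types.JAVA_OBJECT) Some(BinaryType) else None
  }
}

JdbcDialects.registerDialect(DerbyDialectSketch)
{code}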



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11036) AttributeReference should not be created outside driver

2015-10-09 Thread Davies Liu (JIRA)
Davies Liu created SPARK-11036:
--

 Summary: AttributeReference should not be created outside driver
 Key: SPARK-11036
 URL: https://issues.apache.org/jira/browse/SPARK-11036
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


If an AttributeReference is created on an executor, its id could be the same as 
ids created on the driver. We should have a way to prevent that.
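
One hypothetical way to enforce this (not a description of Spark's actual 
implementation) is to fail fast when an id is allocated inside a running task, 
since {{TaskContext.get()}} is non-null only while a task is executing on an 
executor. The object and method names below are illustrative:

{code}
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.TaskContext

// Hypothetical guard: allocate ids only on the driver, fail fast otherwise.
object ExprIds {
  private val counter = new AtomicLong(0L)

  def newId(): Long = {
    require(TaskContext.get() == null,
      "expression ids must be allocated on the driver, not inside a task")
    counter.getAndIncrement()
  }
}
{code}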



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11009) RowNumber in HiveContext returns negative values in cluster mode

2015-10-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11009:


Assignee: Apache Spark

> RowNumber in HiveContext returns negative values in cluster mode
> 
>
> Key: SPARK-11009
> URL: https://issues.apache.org/jira/browse/SPARK-11009
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Standalone cluster mode. No Hadoop/Hive is present in 
> the environment (no hive-site.xml); only HiveContext is used. Spark is built 
> with Hadoop 2.6.0 and uses the default configuration variables. The cluster 
> has 4 nodes, but the issue happens with any number of nodes.
>Reporter: Saif Addin Ellafi
>Assignee: Apache Spark
>
> This issue happens when submitting the job to a standalone cluster; YARN and 
> Mesos have not been tried. Repartitioning the df into 1 partition or setting 
> default parallelism=1 does not fix the issue. Having only one node in the 
> cluster gives the same result, and other shuffle configuration changes do not 
> alter the results either.
> The issue does NOT happen with --master local[*].
> val ws = Window.
> partitionBy("client_id").
> orderBy("date")
>  
> val nm = "repeatMe"
> df.select(df.col("*"), rowNumber().over(ws).as(nm))
>  
> 
> df.filter(df("repeatMe").isNotNull).orderBy("repeatMe").take(50).foreach(println(_))
>  
> --->
>  
> Long, DateType, Int
> [219483904822,2006-06-01,-1863462909]
> [219483904822,2006-09-01,-1863462909]
> [219483904822,2007-01-01,-1863462909]
> [219483904822,2007-08-01,-1863462909]
> [219483904822,2007-07-01,-1863462909]
> [192489238423,2007-07-01,-1863462774]
> [192489238423,2007-02-01,-1863462774]
> [192489238423,2006-11-01,-1863462774]
> [192489238423,2006-08-01,-1863462774]
> [192489238423,2007-08-01,-1863462774]
> [192489238423,2006-09-01,-1863462774]
> [192489238423,2007-03-01,-1863462774]
> [192489238423,2006-10-01,-1863462774]
> [192489238423,2007-05-01,-1863462774]
> [192489238423,2006-06-01,-1863462774]
> [192489238423,2006-12-01,-1863462774]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11009) RowNumber in HiveContext returns negative values in cluster mode

2015-10-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951202#comment-14951202
 ] 

Apache Spark commented on SPARK-11009:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9050

> RowNumber in HiveContext returns negative values in cluster mode
> 
>
> Key: SPARK-11009
> URL: https://issues.apache.org/jira/browse/SPARK-11009
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Standalone cluster mode. No Hadoop/Hive is present in 
> the environment (no hive-site.xml); only HiveContext is used. Spark is built 
> with Hadoop 2.6.0 and uses the default configuration variables. The cluster 
> has 4 nodes, but the issue happens with any number of nodes.
>Reporter: Saif Addin Ellafi
>
> This issue happens when submitting the job to a standalone cluster; YARN and 
> Mesos have not been tried. Repartitioning the df into 1 partition or setting 
> default parallelism=1 does not fix the issue. Having only one node in the 
> cluster gives the same result, and other shuffle configuration changes do not 
> alter the results either.
> The issue does NOT happen with --master local[*].
> val ws = Window.
> partitionBy("client_id").
> orderBy("date")
>  
> val nm = "repeatMe"
> df.select(df.col("*"), rowNumber().over(ws).as(nm))
>  
> 
> df.filter(df("repeatMe").isNotNull).orderBy("repeatMe").take(50).foreach(println(_))
>  
> --->
>  
> Long, DateType, Int
> [219483904822,2006-06-01,-1863462909]
> [219483904822,2006-09-01,-1863462909]
> [219483904822,2007-01-01,-1863462909]
> [219483904822,2007-08-01,-1863462909]
> [219483904822,2007-07-01,-1863462909]
> [192489238423,2007-07-01,-1863462774]
> [192489238423,2007-02-01,-1863462774]
> [192489238423,2006-11-01,-1863462774]
> [192489238423,2006-08-01,-1863462774]
> [192489238423,2007-08-01,-1863462774]
> [192489238423,2006-09-01,-1863462774]
> [192489238423,2007-03-01,-1863462774]
> [192489238423,2006-10-01,-1863462774]
> [192489238423,2007-05-01,-1863462774]
> [192489238423,2006-06-01,-1863462774]
> [192489238423,2006-12-01,-1863462774]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11009) RowNumber in HiveContext returns negative values in cluster mode

2015-10-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11009:


Assignee: (was: Apache Spark)

> RowNumber in HiveContext returns negative values in cluster mode
> 
>
> Key: SPARK-11009
> URL: https://issues.apache.org/jira/browse/SPARK-11009
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Standalone cluster mode. No Hadoop/Hive is present in 
> the environment (no hive-site.xml); only HiveContext is used. Spark is built 
> with Hadoop 2.6.0 and uses the default configuration variables. The cluster 
> has 4 nodes, but the issue happens with any number of nodes.
>Reporter: Saif Addin Ellafi
>
> This issue happens when submitting the job to a standalone cluster; YARN and 
> Mesos have not been tried. Repartitioning the df into 1 partition or setting 
> default parallelism=1 does not fix the issue. Having only one node in the 
> cluster gives the same result, and other shuffle configuration changes do not 
> alter the results either.
> The issue does NOT happen with --master local[*].
> val ws = Window.
> partitionBy("client_id").
> orderBy("date")
>  
> val nm = "repeatMe"
> df.select(df.col("*"), rowNumber().over(ws).as(nm))
>  
> 
> df.filter(df("repeatMe").isNotNull).orderBy("repeatMe").take(50).foreach(println(_))
>  
> --->
>  
> Long, DateType, Int
> [219483904822,2006-06-01,-1863462909]
> [219483904822,2006-09-01,-1863462909]
> [219483904822,2007-01-01,-1863462909]
> [219483904822,2007-08-01,-1863462909]
> [219483904822,2007-07-01,-1863462909]
> [192489238423,2007-07-01,-1863462774]
> [192489238423,2007-02-01,-1863462774]
> [192489238423,2006-11-01,-1863462774]
> [192489238423,2006-08-01,-1863462774]
> [192489238423,2007-08-01,-1863462774]
> [192489238423,2006-09-01,-1863462774]
> [192489238423,2007-03-01,-1863462774]
> [192489238423,2006-10-01,-1863462774]
> [192489238423,2007-05-01,-1863462774]
> [192489238423,2006-06-01,-1863462774]
> [192489238423,2006-12-01,-1863462774]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10167) We need to explicitly use transformDown when rewrite aggregation results

2015-10-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10167.

   Resolution: Fixed
 Assignee: Josh Rosen
Fix Version/s: 1.6.0

I changed {{transform}} to {{transformDown}} as part of my refactorings in 
SPARK-10988, so I'm going to mark this as resolved.

> We need to explicitly use transformDown when rewrite aggregation results
> 
>
> Key: SPARK-10167
> URL: https://issues.apache.org/jira/browse/SPARK-10167
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 1.6.0
>
>
> Right now, we use transformDown explicitly at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/utils.scala#L105
>  and 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/utils.scala#L130.
>  We also need to be very clear on using transformDown at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/utils.scala#L300
>  and 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/utils.scala#L334
>  (right now transform means transformDown). The reason we need to use 
> transformDown is that when we rewrite final aggregate results, we should 
> always match aggregate functions first. If we use transformUp, it is possible 
> that we match a grouping expression first when grouping expressions are used 
> as children of aggregate functions.
> There is nothing wrong with our master. We just want to make sure we will not 
> have bugs if we change the behavior of transform (from transformDown to 
> transformUp), which I think is very unlikely (but just in case).
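
To make the ordering concern concrete, here is a small self-contained toy 
(illustration only, not Catalyst code, under the assumption that the rewrite 
rule matches both aggregate results and grouping expressions): a top-down 
traversal matches the outer aggregate node before its child, while a bottom-up 
traversal rewrites the inner grouping expression first, so the outer match sees 
an already-modified child.

{code}
// Toy tree and traversals, for illustration only (not Catalyst).
sealed trait Expr { def mapChildren(f: Expr => Expr): Expr }
case class Grouping(name: String) extends Expr {
  def mapChildren(f: Expr => Expr): Expr = this
}
case class AggResult(child: Expr) extends Expr {
  def mapChildren(f: Expr => Expr): Expr = AggResult(f(child))
}

def transformDown(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
  val once = rule.applyOrElse(e, (x: Expr) => x)
  once.mapChildren(c => transformDown(c)(rule))
}

def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
  val childrenDone = e.mapChildren(c => transformUp(c)(rule))
  rule.applyOrElse(childrenDone, (x: Expr) => x)
}

// The first case of the rule depends on the shape of the child.
val rule: PartialFunction[Expr, Expr] = {
  case AggResult(Grouping(n)) => Grouping(s"bound-to-aggregate-buffer:$n")
  case Grouping(n)            => Grouping(s"bound-to-grouping-key:$n")
}

// Top-down: the aggregate case fires first, as intended.
transformDown(AggResult(Grouping("a")))(rule)
//   => Grouping("bound-to-aggregate-buffer:a")

// Bottom-up: the inner grouping expression is rewritten before the outer node
// is examined, so the result reflects the unintended ordering.
transformUp(AggResult(Grouping("a")))(rule)
//   => Grouping("bound-to-aggregate-buffer:bound-to-grouping-key:a")
{code}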



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11009) RowNumber in HiveContext returns negative values in cluster mode

2015-10-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11009:
-
Target Version/s: 1.5.2, 1.6.0
Priority: Blocker  (was: Major)

> RowNumber in HiveContext returns negative values in cluster mode
> 
>
> Key: SPARK-11009
> URL: https://issues.apache.org/jira/browse/SPARK-11009
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Standalone cluster mode. No Hadoop/Hive is present in 
> the environment (no hive-site.xml); only HiveContext is used. Spark is built 
> with Hadoop 2.6.0 and uses the default configuration variables. The cluster 
> has 4 nodes, but the issue happens with any number of nodes.
>Reporter: Saif Addin Ellafi
>Assignee: Davies Liu
>Priority: Blocker
>
> This issue happens when submitting the job to a standalone cluster; YARN and 
> Mesos have not been tried. Repartitioning the df into 1 partition or setting 
> default parallelism=1 does not fix the issue. Having only one node in the 
> cluster gives the same result, and other shuffle configuration changes do not 
> alter the results either.
> The issue does NOT happen with --master local[*].
> val ws = Window.
> partitionBy("client_id").
> orderBy("date")
>  
> val nm = "repeatMe"
> df.select(df.col("*"), rowNumber().over(ws).as(nm))
>  
> 
> df.filter(df("repeatMe").isNotNull).orderBy("repeatMe").take(50).foreach(println(_))
>  
> --->
>  
> Long, DateType, Int
> [219483904822,2006-06-01,-1863462909]
> [219483904822,2006-09-01,-1863462909]
> [219483904822,2007-01-01,-1863462909]
> [219483904822,2007-08-01,-1863462909]
> [219483904822,2007-07-01,-1863462909]
> [192489238423,2007-07-01,-1863462774]
> [192489238423,2007-02-01,-1863462774]
> [192489238423,2006-11-01,-1863462774]
> [192489238423,2006-08-01,-1863462774]
> [192489238423,2007-08-01,-1863462774]
> [192489238423,2006-09-01,-1863462774]
> [192489238423,2007-03-01,-1863462774]
> [192489238423,2006-10-01,-1863462774]
> [192489238423,2007-05-01,-1863462774]
> [192489238423,2006-06-01,-1863462774]
> [192489238423,2006-12-01,-1863462774]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10970) Executors overload Hive metastore by making massive connections at execution time

2015-10-09 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park resolved SPARK-10970.
---
Resolution: Fixed

Closing this JIRA because it is fixed by SPARK-10679.

SPARK-10679 addresses a different issue, but it also fixes this issue as a 
byproduct.

> Executors overload Hive metastore by making massive connections at execution 
> time
> -
>
> Key: SPARK-10970
> URL: https://issues.apache.org/jira/browse/SPARK-10970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: Hive 1.2, Spark on YARN
>Reporter: Cheolsoo Park
>Priority: Critical
>
> This is a regression in Spark 1.5, more specifically after upgrading the Hive 
> dependency to 1.2.
> HIVE-2573 introduced a new feature that allows users to register functions in 
> the session. The problem is that it added a [static code 
> block|https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L164-L170]
>  to Hive.java:
> {code}
> // register all permanent functions. need improvement
> static {
>   try {
> reloadFunctions();
>   } catch (Exception e) {
> LOG.warn("Failed to access metastore. This class should not accessed in 
> runtime.",e);
>   }
> }
> {code}
> This code block is executed by every Spark executor in the cluster when 
> HadoopRDD tries to access the JobConf. So if a Spark job has high parallelism 
> (e.g. 1000+), the executors will hammer the HCat server, causing it to go 
> down in the worst case.
> Here is the stack trace that I took in an executor when it makes a connection 
> to the Hive metastore:
> {code}
> 15/10/06 19:26:05 WARN conf.HiveConf: HiveConf of name hive.optimize.s3.query 
> does not exist
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> java.lang.Thread.getStackTrace(Thread.java:1589)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:236)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:803)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:782)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:347)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.spark.sql.hive.HadoopTableReader$anonfun$17.apply(TableReader.scala:322)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.spark.sql.hive.HadoopTableReader$anonfun$17.apply(TableReader.scala:322)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.spark.rdd.HadoopRDD$anonfun$getJobConf$6.apply(HadoopRDD.scala:179)
> 15/10/06 19:26:05 INFO hive.metastore: XXX: 
> org.apache.spark.rdd.HadoopRDD$anonfun$getJobConf$6.apply(HadoopRDD.scala:179)
> 

[jira] [Commented] (SPARK-11013) SparkPlan may mistakenly register child plan's accumulators for SQL metrics

2015-10-09 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951606#comment-14951606
 ] 

Shixiong Zhu commented on SPARK-11013:
--

I see. So we implement something like {{LongMinAccumulableParam}}, we can use 
`stringValue` to display "-" for {{None}}. What do you think?
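
A minimal sketch of what such a min-tracking accumulable param could look like 
(purely illustrative; {{LongMinAccumulableParam}} does not exist in Spark, and 
the {{stringValue}} helper below is a hypothetical name used only to show the 
"-" rendering for {{None}}):

{code}
import org.apache.spark.AccumulableParam

// Illustrative sketch: track the minimum as Option[Long] so that copies that
// were deserialized but never updated contribute None instead of a spurious
// zero, and render None as "-" in the UI.
object LongMinAccumulableParamSketch extends AccumulableParam[Option[Long], Long] {
  override def addAccumulator(r: Option[Long], t: Long): Option[Long] =
    Some(r.fold(t)(prev => math.min(prev, t)))

  override def addInPlace(r1: Option[Long], r2: Option[Long]): Option[Long] =
    (r1, r2) match {
      case (Some(a), Some(b)) => Some(math.min(a, b))
      case (a, b)             => a.orElse(b)
    }

  override def zero(initialValue: Option[Long]): Option[Long] = None

  // Hypothetical display helper.
  def stringValue(v: Option[Long]): String = v.map(_.toString).getOrElse("-")
}
{code}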

> SparkPlan may mistakenly register child plan's accumulators for SQL metrics
> ---
>
> Key: SPARK-11013
> URL: https://issues.apache.org/jira/browse/SPARK-11013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> The reason is that when we call the RDD API inside a SparkPlan, we are very 
> likely to reference the SparkPlan in the closure and thus serialize and 
> transfer a SparkPlan tree to the executor side. When we deserialize it, the 
> accumulators in the child SparkPlans are also deserialized and registered, 
> and always report a zero value.
> This is not a problem currently because we only have one operation to 
> aggregate the accumulators: add. However, if we want to support more complex 
> metrics such as min, the extra zero values will lead to wrong results.
> Take TungstenAggregate as an example, I logged "stageId, partitionId, 
> accumName, accumId" when an accumulator is deserialized and registered, and 
> logged the "accumId -> accumValue" map when a task ends. The output is:
> {code}
> scala> val df = Seq(1 -> "a", 2 -> "b").toDF("a", "b").groupBy().count()
> df: org.apache.spark.sql.DataFrame = [count: bigint]
> scala> df.collect
> register: 0 0 Some(number of input rows) 4
> register: 0 0 Some(number of output rows) 5
> register: 1 0 Some(number of input rows) 4
> register: 1 0 Some(number of output rows) 5
> register: 1 0 Some(number of input rows) 2
> register: 1 0 Some(number of output rows) 3
> Map(5 -> 1, 4 -> 2, 6 -> 4458496)
> Map(5 -> 0, 2 -> 1, 7 -> 4458496, 3 -> 1, 4 -> 0)
> res0: Array[org.apache.spark.sql.Row] = Array([2])
> {code}
> The best choice is to avoid serializing and deserializing a SparkPlan tree, 
> which can be achieved by LocalNode.
> Or we can apply a workaround for this serialization problem to the 
> problematic SparkPlans such as TungstenAggregate and TungstenSort.
> Or we can improve the SQL metrics framework to make it more robust to this 
> case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11013) SparkPlan may mistakenly register child plan's accumulators for SQL metrics

2015-10-09 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951606#comment-14951606
 ] 

Shixiong Zhu edited comment on SPARK-11013 at 10/10/15 5:12 AM:


I see. So if we implement something like {{LongMinAccumulableParam}}, we can 
use `stringValue` to display "-" for {{None}}. What do you think?


was (Author: zsxwing):
I see. So we implement something like {{LongMinAccumulableParam}}, we can use 
`stringValue` to display "-" for {{None}}. What do you think?

> SparkPlan may mistakenly register child plan's accumulators for SQL metrics
> ---
>
> Key: SPARK-11013
> URL: https://issues.apache.org/jira/browse/SPARK-11013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> The reason is that when we call the RDD API inside a SparkPlan, we are very 
> likely to reference the SparkPlan in the closure and thus serialize and 
> transfer a SparkPlan tree to the executor side. When we deserialize it, the 
> accumulators in the child SparkPlans are also deserialized and registered, 
> and always report a zero value.
> This is not a problem currently because we only have one operation to 
> aggregate the accumulators: add. However, if we want to support more complex 
> metrics such as min, the extra zero values will lead to wrong results.
> Take TungstenAggregate as an example, I logged "stageId, partitionId, 
> accumName, accumId" when an accumulator is deserialized and registered, and 
> logged the "accumId -> accumValue" map when a task ends. The output is:
> {code}
> scala> val df = Seq(1 -> "a", 2 -> "b").toDF("a", "b").groupBy().count()
> df: org.apache.spark.sql.DataFrame = [count: bigint]
> scala> df.collect
> register: 0 0 Some(number of input rows) 4
> register: 0 0 Some(number of output rows) 5
> register: 1 0 Some(number of input rows) 4
> register: 1 0 Some(number of output rows) 5
> register: 1 0 Some(number of input rows) 2
> register: 1 0 Some(number of output rows) 3
> Map(5 -> 1, 4 -> 2, 6 -> 4458496)
> Map(5 -> 0, 2 -> 1, 7 -> 4458496, 3 -> 1, 4 -> 0)
> res0: Array[org.apache.spark.sql.Row] = Array([2])
> {code}
> The best choice is to avoid serializing and deserializing a SparkPlan tree, 
> which can be achieved by LocalNode.
> Or we can apply a workaround for this serialization problem to the 
> problematic SparkPlans such as TungstenAggregate and TungstenSort.
> Or we can improve the SQL metrics framework to make it more robust to this 
> case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10927) Spark history uses the application name instead of the ID

2015-10-09 Thread Jean-Baptiste Onofré (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Baptiste Onofré resolved SPARK-10927.
--
Resolution: Duplicate

> Spark history uses the application name instead of the ID
> -
>
> Key: SPARK-10927
> URL: https://issues.apache.org/jira/browse/SPARK-10927
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Jean-Baptiste Onofré
>
> Setting spark.eventLog.enabled to true, and a folder location for 
> spark.eventLog.dir, enables the history UI for completed jobs.
> It works fine for jobs without arguments, but if the job expects some 
> arguments (like JavaWordCount, which expects the source file location), the 
> UI is unable to provide application details:
> {code}
> Application history not found (app-20151005185136-0002)
> No event logs found for application JavaWordCount in file:/tmp/spark. Did you 
> specify the correct logging directory?
> {code}
> However, in /tmp/spark, the file app-20151005185136-0002 is there. It seems 
> that the UI uses the application name (JavaWordCount) instead of the 
> application ID (app-20151005185136-0002) to get history details.
> I will work on a fix for that.
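
For reference, the two settings described above can also be supplied 
programmatically; a minimal sketch using the values reported here (the 
directory path is the one from the report):

{code}
import org.apache.spark.SparkConf

// Enable event logging so the history UI can show completed applications.
val conf = new SparkConf()
  .setAppName("JavaWordCount")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "file:/tmp/spark")
{code}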



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set

2015-10-09 Thread Jack Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951593#comment-14951593
 ] 

Jack Hu commented on SPARK-6847:


Hi [~glyton.camilleri]
You can check whether there are two dstreams in the DAG that need to be 
checkpointed (updateStateByKey, reduceByKeyAndWindow); if yes, you can work 
around this by adding an output for the earlier DStream that needs to be 
checkpointed.

{code}
val d1 = input.updateStateByKey(func)
val d2 = d1.map(...).updateStateByKey(func)
d2.foreachRDD(rdd => print(rdd.count))
// work around the stack overflow described in this JIRA
d1.foreachRDD(rdd => rdd.foreach(_ => Unit))
{code}


> Stack overflow on updateStateByKey which followed by a dstream with 
> checkpoint set
> --
>
> Key: SPARK-6847
> URL: https://issues.apache.org/jira/browse/SPARK-6847
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Jack Hu
>  Labels: StackOverflowError, Streaming
>
> The issue happens with the following sample code, which uses 
> {{updateStateByKey}} followed by a {{map}} with a checkpoint interval of 10 
> seconds:
> {code}
> val sparkConf = new SparkConf().setAppName("test")
> val streamingContext = new StreamingContext(sparkConf, Seconds(10))
> streamingContext.checkpoint("""checkpoint""")
> val source = streamingContext.socketTextStream("localhost", )
> val updatedResult = source.map(
> (1,_)).updateStateByKey(
> (newlist : Seq[String], oldstate : Option[String]) => 
> newlist.headOption.orElse(oldstate))
> updatedResult.map(_._2)
> .checkpoint(Seconds(10))
> .foreachRDD((rdd, t) => {
>   println("Deep: " + rdd.toDebugString.split("\n").length)
>   println(t.toString() + ": " + rdd.collect.length)
> })
> streamingContext.start()
> streamingContext.awaitTermination()
> {code}
> From the output, we can see that the dependency chain keeps growing over 
> time, the {{updateStateByKey}} stream never gets check-pointed, and finally 
> the stack overflow happens.
> Note:
> * The RDD in {{updatedResult.map(_._2)}} gets check-pointed in this case, but 
> not the {{updateStateByKey}} stream
> * If we remove the {{checkpoint(Seconds(10))}} from the map result 
> ({{updatedResult.map(_._2)}}), the stack overflow will not happen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6613) Starting stream from checkpoint causes Streaming tab to throw error

2015-10-09 Thread Yutao SUN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951610#comment-14951610
 ] 

Yutao SUN commented on SPARK-6613:
--

Same issue in 1.5.0

> Starting stream from checkpoint causes Streaming tab to throw error
> ---
>
> Key: SPARK-6613
> URL: https://issues.apache.org/jira/browse/SPARK-6613
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.1, 1.2.2, 1.3.1
>Reporter: Marius Soutier
>
> When continuing my streaming job from a checkpoint, the job runs, but the 
> Streaming tab in the standard UI initially no longer works (browser just 
> shows HTTP ERROR: 500). Sometimes it gets back to normal after a while, and 
> sometimes it stays in this state permanently.
> Stacktrace:
> WARN org.eclipse.jetty.servlet.ServletHandler: /streaming/
> java.util.NoSuchElementException: key not found: 0
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:58)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:151)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1$$anonfun$apply$5.apply(StreamingJobProgressListener.scala:150)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.Range.foreach(Range.scala:141)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:150)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener$$anonfun$lastReceivedBatchRecords$1.apply(StreamingJobProgressListener.scala:149)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener.lastReceivedBatchRecords(StreamingJobProgressListener.scala:149)
>   at 
> org.apache.spark.streaming.ui.StreamingPage.generateReceiverStats(StreamingPage.scala:82)
>   at 
> org.apache.spark.streaming.ui.StreamingPage.render(StreamingPage.scala:43)
>   at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
>   at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:68)
>   at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:68)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
>   at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.eclipse.jetty.server.Server.handle(Server.java:370)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
>   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
>   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>   at 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>   at 
