[jira] [Commented] (SPARK-6844) Memory leak occurs when register temp table with cache table on
[ https://issues.apache.org/jira/browse/SPARK-6844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497627#comment-14497627 ] Jack Hu commented on SPARK-6844: Hi, [~marmbrus] Do we have a plan to port this to the 1.3.x branch? Memory leak occurs when register temp table with cache table on --- Key: SPARK-6844 URL: https://issues.apache.org/jira/browse/SPARK-6844 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Jack Hu Labels: Memory, SQL Fix For: 1.4.0 There is a memory leak when registering a temp table with caching on. This is a simple program that reproduces the issue:
{code}
val sparkConf = new SparkConf().setAppName("LeakTest")
val sparkContext = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sparkContext)
val tableName = "tmp"
val jsonrdd = sparkContext.textFile("sample.json")
var loopCount = 1L
while (true) {
  sqlContext.jsonRDD(jsonrdd).registerTempTable(tableName)
  sqlContext.cacheTable(tableName)
  println("L: " + loopCount + " R: " + sqlContext.sql("select count(*) from tmp").count())
  sqlContext.uncacheTable(tableName)
  loopCount += 1
}
{code}
The cause is that {{InMemoryRelation}} and {{InMemoryColumnarTableScan}} use accumulators ({{InMemoryRelation.batchStats}}, {{InMemoryColumnarTableScan.readPartitions}}, {{InMemoryColumnarTableScan.readBatches}}) to collect information from partitions or for tests. These accumulators register themselves into a static map, {{Accumulators.originals}}, and never get cleaned up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
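To illustrate the failure mode described above, here is a minimal, self-contained Scala sketch (not Spark's actual internals; all names are hypothetical) of how a process-wide registry that keeps strong references to every registered accumulator prevents them from ever being garbage collected:
{code}
import scala.collection.mutable

// Hypothetical stand-in for the static Accumulators.originals map: it holds
// a strong reference to everything ever registered and is never cleaned up.
object Registry {
  val originals = mutable.Map.empty[Long, AnyRef]
  private var nextId = 0L
  def register(acc: AnyRef): Long = { nextId += 1; originals(nextId) = acc; nextId }
}

// Each cache/uncache iteration registers fresh accumulator-like objects, but
// nothing ever removes them, so heap usage grows without bound.
for (_ <- 1 to 100000) {
  Registry.register(new Array[Byte](1024)) // stands in for batchStats etc.
}
println(Registry.originals.size) // 100000 entries, all still strongly reachable
{code}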
[jira] [Closed] (SPARK-6933) Thrift Server couldn't strip .inprogress suffix after being stopped
[ https://issues.apache.org/jira/browse/SPARK-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Wang closed SPARK-6933. --- Resolution: Duplicate Thrift Server couldn't strip .inprogress suffix after being stopped --- Key: SPARK-6933 URL: https://issues.apache.org/jira/browse/SPARK-6933 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Tao Wang When I stop the thrift server using stop-thriftserver.sh, the following exception occurs:
15/04/15 21:48:53 INFO Utils: path = /tmp/spark-f05dd451-46a8-47d0-836b-a25004f87ed9/blockmgr-971f5b1c-33ed-4be6-ac63-2fbb739bc649, already present as root for deletion.
15/04/15 21:48:53 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:601)
	at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
	at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
	at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:188)
	at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54)
	at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
	at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
	at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53)
	at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
	at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1197)
	at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
Caused by: java.io.IOException: Filesystem closed
	at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:795)
	at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1985)
	at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1946)
	at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
	... 17 more
15/04/15 21:48:53 INFO ThriftCLIService: Thrift server has stopped
15/04/15 21:48:53 INFO AbstractService: Service:ThriftBinaryCLIService is stopped.
15/04/15 21:48:53 INFO HiveMetaStore: 1: Shutting down the object store...
15/04/15 21:48:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=Shutting down the object store...
15/04/15 21:48:53 INFO HiveMetaStore: 1: Metastore shutdown complete.
15/04/15 21:48:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=Metastore shutdown complete.
15/04/15 21:48:53 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
15/04/15 21:48:53 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
15/04/15 21:48:53 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
15/04/15 21:48:53 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
15/04/15 21:48:53 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
15/04/15 21:48:53 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
15/04/15 21:48:53 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
15/04/15 21:48:53 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
15/04/15 21:48:53 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
15/04/15 21:48:53 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
15/04/15 21:48:53 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
15/04/15 21:48:53 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
15/04/15 21:48:53 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
15/04/15 21:48:53 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
15/04/15 21:48:53 INFO ContextHandler:
[jira] [Created] (SPARK-6959) Support for datetime comparisons in filter for dataframes in pyspark
cynepia created SPARK-6959: -- Summary: Support for datetime comparisons in filter for dataframes in pyspark Key: SPARK-6959 URL: https://issues.apache.org/jira/browse/SPARK-6959 Project: Spark Issue Type: Improvement Reporter: cynepia Currently, strings can be passed to filter for comparison with a date-type column, but this does not address the case where dates may be passed in different formats. We should support datetime.date, or some other standard format, for comparison with date-type columns. Currently, df.filter(df.Datecol > datetime.date(2015,1,1)).show() is not supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6893) Better handling of pipeline parameters in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6893. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5534 [https://github.com/apache/spark/pull/5534] Better handling of pipeline parameters in PySpark - Key: SPARK-6893 URL: https://issues.apache.org/jira/browse/SPARK-6893 Project: Spark Issue Type: Sub-task Components: PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 This is SPARK-5957 for Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6955) Do not let Yarn Shuffle Server retry its server port.
[ https://issues.apache.org/jira/browse/SPARK-6955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6955: - Priority: Minor (was: Major) Affects Version/s: (was: 1.4.0) Do not let Yarn Shuffle Server retry its server port. - Key: SPARK-6955 URL: https://issues.apache.org/jira/browse/SPARK-6955 Project: Spark Issue Type: Bug Components: Shuffle, YARN Reporter: SaintBacchus Priority: Minor It's better to let the NodeManager go down rather than retry on another port when `spark.shuffle.service.port` conflicts while starting the Spark Yarn Shuffle Server, because the retry mechanism makes the actual shuffle port inconsistent with the configured one, so clients fail to find the port. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
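A minimal Scala sketch of the fail-fast behavior the reporter asks for (illustrative only; the function and exception choice are hypothetical, not Spark's actual implementation):
{code}
import java.net.{BindException, ServerSocket}

// Bind exactly the advertised port or abort, instead of silently retrying a
// different one. Executors only know the configured port, so a server that
// binds elsewhere strands every client.
def startShuffleService(configuredPort: Int): ServerSocket =
  try {
    new ServerSocket(configuredPort)
  } catch {
    case e: BindException =>
      throw new IllegalStateException(
        s"spark.shuffle.service.port $configuredPort is unavailable; " +
          "failing NodeManager startup instead of retrying", e)
  }
{code}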
[jira] [Updated] (SPARK-6869) Add pyspark archives path to PYTHONPATH
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weizhong updated SPARK-6869: Description: From SPARK-1920 and SPARK-1520 we know PySpark on Yarn cannot work when the assembly jar is packaged by JDK 1.7+, so ship the pyspark archives to executors via Yarn with --py-files. The pyspark archive name must contain spark-pyspark. 1st: zip pyspark to spark-pyspark_2.10.zip 2nd: ./bin/spark-submit --master yarn-client/yarn-cluster --py-files spark-pyspark_2.10.zip app.py args was: From SPARK-1920 and SPARK-1520 we know PySpark on Yarn cannot work when the assembly jar is packaged by JDK 1.7+, so pass the PYTHONPATH (set in spark-env.sh) to the executor so that the executor Python process can read pyspark files from the local file system rather than from the assembly jar. Summary: Add pyspark archives path to PYTHONPATH (was: Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node) Add pyspark archives path to PYTHONPATH --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Reporter: Weizhong Priority: Minor From SPARK-1920 and SPARK-1520 we know PySpark on Yarn cannot work when the assembly jar is packaged by JDK 1.7+, so ship the pyspark archives to executors via Yarn with --py-files. The pyspark archive name must contain spark-pyspark. 1st: zip pyspark to spark-pyspark_2.10.zip 2nd: ./bin/spark-submit --master yarn-client/yarn-cluster --py-files spark-pyspark_2.10.zip app.py args -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4783: - Priority: Minor (was: Major) System.exit() calls in SparkContext disrupt applications embedding Spark Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: David Semeria Assignee: Sean Owen Priority: Minor Fix For: 1.4.0 A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running SparkContexts. A typical server is created with the following pseudocode: var continue = true; while (continue) { try { server.run() } catch (e) { continue = log_and_examine_error(e) } } The problem is that SparkContext frequently calls System.exit when it encounters a problem, which means the server can only be re-spawned at the process level, which is much messier than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in SparkContext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
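A minimal Scala sketch of the change the reporter proposes (illustrative only, not the actual patch; the helper names are hypothetical). Throwing a catchable exception lets the gateway loop above recover in-process, whereas System.exit() takes the whole JVM down:
{code}
import org.apache.spark.SparkException

// Before (simplified): an unrecoverable condition kills the entire JVM,
// including the embedding gateway server.
def failHard(msg: String): Nothing = sys.exit(1)

// After (sketch): throw instead, so the embedding application's try/catch
// loop can log the error and re-create the SparkContext in-process.
def failSoft(msg: String): Nothing = throw new SparkException(msg)
{code}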
[jira] [Resolved] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4783. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5492 [https://github.com/apache/spark/pull/5492] System.exit() calls in SparkContext disrupt applications embedding Spark Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: David Semeria Fix For: 1.4.0 A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running SparkContexts. A typical server is created with the following pseudocode: var continue = true; while (continue) { try { server.run() } catch (e) { continue = log_and_examine_error(e) } } The problem is that SparkContext frequently calls System.exit when it encounters a problem, which means the server can only be re-spawned at the process level, which is much messier than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in SparkContext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark
[ https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-4783: Assignee: Sean Owen System.exit() calls in SparkContext disrupt applications embedding Spark Key: SPARK-4783 URL: https://issues.apache.org/jira/browse/SPARK-4783 Project: Spark Issue Type: Bug Components: Spark Core Reporter: David Semeria Assignee: Sean Owen Fix For: 1.4.0 A common architectural choice for integrating Spark within a larger application is to employ a gateway to handle Spark jobs. The gateway is a server which contains one or more long-running SparkContexts. A typical server is created with the following pseudocode: var continue = true; while (continue) { try { server.run() } catch (e) { continue = log_and_examine_error(e) } } The problem is that SparkContext frequently calls System.exit when it encounters a problem, which means the server can only be re-spawned at the process level, which is much messier than the simple code above. Therefore, I believe it makes sense to replace all System.exit calls in SparkContext with the throwing of a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6528) IDF transformer
[ https://issues.apache.org/jira/browse/SPARK-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6528: - Assignee: Xusen Yin IDF transformer --- Key: SPARK-6528 URL: https://issues.apache.org/jira/browse/SPARK-6528 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xusen Yin Assignee: Xusen Yin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4194) Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state
[ https://issues.apache.org/jira/browse/SPARK-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4194: - Target Version/s: (was: 1.2.0) Affects Version/s: 1.2.0 1.3.0 Assignee: Marcelo Vanzin Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state - Key: SPARK-4194 URL: https://issues.apache.org/jira/browse/SPARK-4194 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.3.0 Reporter: Josh Rosen Assignee: Marcelo Vanzin Priority: Critical Fix For: 1.4.0 The SparkContext and SparkEnv constructors instantiate a bunch of objects that may need to be cleaned up after they're no longer needed. If an exception is thrown during SparkContext or SparkEnv construction (e.g. due to a bad configuration setting), then objects created earlier in the constructor may not be properly cleaned up. This is unlikely to cause problems for batch jobs submitted through {{spark-submit}}, since failure to construct SparkContext will probably cause the JVM to exit, but it is a potentially serious issue in interactive environments where a user might attempt to create SparkContext with some configuration, fail due to an error, and re-attempt the creation with new settings. In this case, resources from the previous creation attempt might not have been cleaned up and could lead to confusing errors (especially if the old, leaked resources share global state with the new SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
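A minimal Scala sketch of the cleanup-on-failure pattern such a fix implies (all names hypothetical; this is not the actual patch): record each resource as the constructor creates it, and release everything recorded before re-throwing, so a failed creation attempt leaves no leaked state behind:
{code}
import scala.collection.mutable
import scala.util.control.NonFatal

// Hypothetical constructor: if any step throws, everything created so far
// is closed in reverse order, so no partially-built state leaks.
class Context(badConf: Boolean) {
  private val created = mutable.ArrayBuffer.empty[AutoCloseable]
  private def track[T <: AutoCloseable](r: T): T = { created += r; r }

  try {
    val env = track(new java.io.StringWriter()) // stand-in for a real resource
    if (badConf) throw new IllegalArgumentException("bad configuration setting")
  } catch {
    case NonFatal(e) =>
      created.reverse.foreach(r => try r.close() catch { case NonFatal(_) => () })
      throw e // construction fails, but leaves no leaked resources behind
  }
}
{code}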
[jira] [Resolved] (SPARK-4194) Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state
[ https://issues.apache.org/jira/browse/SPARK-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4194. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5335 [https://github.com/apache/spark/pull/5335] Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state - Key: SPARK-4194 URL: https://issues.apache.org/jira/browse/SPARK-4194 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Priority: Critical Fix For: 1.4.0 The SparkContext and SparkEnv constructors instantiate a bunch of objects that may need to be cleaned up after they're no longer needed. If an exception is thrown during SparkContext or SparkEnv construction (e.g. due to a bad configuration setting), then objects created earlier in the constructor may not be properly cleaned up. This is unlikely to cause problems for batch jobs submitted through {{spark-submit}}, since failure to construct SparkContext will probably cause the JVM to exit, but it is a potentially serious issue in interactive environments where a user might attempt to create SparkContext with some configuration, fail due to an error, and re-attempt the creation with new settings. In this case, resources from the previous creation attempt might not have been cleaned up and could lead to confusing errors (especially if the old, leaked resources share global state with the new SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6923) Get invalid hive table columns after save DataFrame to hive table
[ https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497701#comment-14497701 ] pin_zhang commented on SPARK-6923: -- In the spark 1.1.0 client, the JDBC API gets the table schema age(bigint), id(string), while in spark 1.3.0 it gets {name=col, type=array<string>}. That's not expected.
ArrayList<Map<String, String>> results = new ArrayList<>();
DatabaseMetaData meta = cnn.getMetaData();
ResultSet rsColumns = meta.getColumns(database, null, table, null);
while (rsColumns.next()) {
  Map<String, String> col = new HashMap<>();
  col.put("name", rsColumns.getString("COLUMN_NAME"));
  String typeName = rsColumns.getString("TYPE_NAME");
  col.put("type", typeName);
  results.add(col);
}
rsColumns.close();
Get invalid hive table columns after save DataFrame to hive table - Key: SPARK-6923 URL: https://issues.apache.org/jira/browse/SPARK-6923 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: pin_zhang
HiveContext hctx = new HiveContext(sc);
List<String> sample = new ArrayList<String>();
sample.add("{\"id\": \"id_1\", \"age\": 1}");
RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
DataFrame df = hctx.jsonRDD(sampleRDD);
String table = "test";
df.saveAsTable(table, "json", SaveMode.Overwrite);
Table t = hctx.catalog().client().getTable(table);
System.out.println(t.getCols());
-- With the code above to save a DataFrame to a hive table, getting the table cols returns one column named 'col': [FieldSchema(name:col, type:array<string>, comment:from deserializer)] The expected field schema is id, age. As a result, the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern), but the result set metadata for the query select * from test does contain the fields id, age. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves
[ https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497876#comment-14497876 ] Yu Ishikawa commented on SPARK-6935: [~oleksii.mdr] Thank you for replying. I guess this issue hasn't been resolved yet, although I checked the master branch. I understand you are using the 1.2 release. I mean, I would like to confirm whether or not the option names you suggested are wrong. The script at 1.2.0 has `\-\-master-instance-type` too. https://github.com/apache/spark/blob/v1.2.1/ec2/spark_ec2.py#L80 Alright. This is a small problem. I understand what you meant. Thank you for your suggestion. spark/spark-ec2.py add parameters to give different instance types for master and slaves Key: SPARK-6935 URL: https://issues.apache.org/jira/browse/SPARK-6935 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.3.0 Reporter: Oleksii Mandrychenko Priority: Minor Labels: easyfix Original Estimate: 24h Remaining Estimate: 24h I want to start a cluster where I give beefy AWS instances to slaves, such as memory-optimised R3, but the master is not really performing much number-crunching work. So it is a waste to allocate a powerful instance for the master, where a regular one would suffice. Suggested syntax:
{code}
sh spark-ec2
  --instance-type-slave=instance_type   # applies to slaves only
  --instance-type-master=instance_type  # applies to master only
  --instance-type=instance_type         # default, applies to both

# in real world
sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves
[ https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497852#comment-14497852 ] Yu Ishikawa commented on SPARK-6935: That's reasonable. Spark v1.3.0 has the `\-\-master-instance-type` option, not `\-\-instance-type-master`. Do you mean that we should add other new options, deprecating the current option? And I feel like refactoring the script because it has many functions with long lines. So it is a little hard to maintain the source code. Just a comment. spark/spark-ec2.py add parameters to give different instance types for master and slaves Key: SPARK-6935 URL: https://issues.apache.org/jira/browse/SPARK-6935 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.3.0 Reporter: Oleksii Mandrychenko Priority: Minor Labels: easyfix Original Estimate: 24h Remaining Estimate: 24h I want to start a cluster where I give beefy AWS instances to slaves, such as memory-optimised R3, but master is not really performing much number crunching work. So it is a waste to allocate a powerful instance for master, where a regular one would suffice. Suggested syntax: {code} sh spark-ec2 --instance-type-slave=instance_type # applies to slaves only --instance-type-master=instance_type# applies to master only --instance-type=instance_type # default, applies to both # in real world sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5886) Add StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5886: - Description: `StringIndexer` takes a column of string labels (raw categories) and outputs an integer column with labels indexed by their frequency.
{code}
val li = new StringIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")
{code}
In the output column, we should store the label-to-index map as an ML attribute. The index should be ordered by frequency, where the most frequent label gets index 0, to enhance sparsity. We can discuss whether this should index multiple columns at the same time. was: `LabelIndexer` takes a column of labels (raw categories) and outputs an integer column with labels indexed by their frequency.
{code}
val li = new LabelIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")
{code}
In the output column, we should store the label-to-index map as an ML attribute. The index should be ordered by frequency, where the most frequent label gets index 0, to enhance sparsity. We can discuss whether this should index multiple columns at the same time. Add StringIndexer - Key: SPARK-5886 URL: https://issues.apache.org/jira/browse/SPARK-5886 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 `StringIndexer` takes a column of string labels (raw categories) and outputs an integer column with labels indexed by their frequency.
{code}
val li = new StringIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")
{code}
In the output column, we should store the label-to-index map as an ML attribute. The index should be ordered by frequency, where the most frequent label gets index 0, to enhance sparsity. We can discuss whether this should index multiple columns at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
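As a concrete illustration of the frequency ordering described above, here is a plain Scala sketch (not the StringIndexer implementation) that builds such a label-to-index map, with the most frequent label at index 0:
{code}
// Build a label-to-index map ordered by descending frequency.
val labels = Seq("us", "uk", "us", "fr", "us", "uk")
val orderedByFrequency = labels
  .groupBy(identity)
  .map { case (label, occurrences) => (label, occurrences.size) }
  .toSeq
  .sortBy { case (_, count) => -count } // most frequent first
  .map { case (label, _) => label }
val labelToIndex = orderedByFrequency.zipWithIndex.toMap
// labelToIndex == Map("us" -> 0, "uk" -> 1, "fr" -> 2)
{code}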
[jira] [Updated] (SPARK-5886) Add StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5886: - Summary: Add StringIndexer (was: Add LabelIndexer) Add StringIndexer - Key: SPARK-5886 URL: https://issues.apache.org/jira/browse/SPARK-5886 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 `LabelIndexer` takes a column of labels (raw categories) and outputs an integer column with labels indexed by their frequency.
{code}
val li = new LabelIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")
{code}
In the output column, we should store the label-to-index map as an ML attribute. The index should be ordered by frequency, where the most frequent label gets index 0, to enhance sparsity. We can discuss whether this should index multiple columns at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6932) A Prototype of Parameter Server
[ https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497840#comment-14497840 ] uncleGen edited comment on SPARK-6932 at 4/16/15 9:38 AM: -- There are two ways to use a Parameter Server: one is as an independent, standalone Parameter Server service, and the other is to integrate it into Spark. If we adopt the first, we may need to maintain a long-running Parameter Server cluster, and Spark can access the Parameter Server seamlessly through some configuration without breaking any code in Spark core. IMHO, that is not an efficient way. For us, big model training is just one kind of job. We want to run a Spark PS job just like other Spark jobs, and launch a Parameter Server cluster dynamically within the job. So, if we want to use it dynamically, we may have to modify the current Spark core more or less. Well, just as [~chouqin] said, it is just a prototype; many more problems need to be resolved. was (Author: unclegen): There are two ways to use a Parameter Server: one is as an independent, standalone Parameter Server service, and the other is to be integrated into Spark. If we adopt the first, we may need to maintain a long-running Parameter Server cluster, and Spark can access the Parameter Server seamlessly through some configuration without breaking any code in Spark core. IMHO, that is not an efficient way. For us, big model training is just one kind of job. We want to run a Spark PS job just like other Spark jobs, and launch a Parameter Server cluster dynamically within the job. So, if we want to use it dynamically, we may have to modify the current Spark core more or less. Well, just as [~chouqin] said, it is just a prototype; many more problems need to be resolved. A Prototype of Parameter Server --- Key: SPARK-6932 URL: https://issues.apache.org/jira/browse/SPARK-6932 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Qiping Li h2. Introduction As specified in [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be very helpful to integrate a parameter server into Spark for machine learning algorithms, especially for those with ultra-high-dimensional features. After carefully studying the design doc of [Parameter Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] and the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], we propose a prototype of Parameter Server on Spark (Ps-on-Spark), with several key design concerns: * *User friendly interface* Careful investigation was done of most existing Parameter Server systems (including [petuum|http://petuum.github.io], [parameter server|http://parameterserver.org], [paracel|https://github.com/douban/paracel]), and a user-friendly interface is designed by absorbing the essence of all these systems. * *Prototype of distributed array* IndexRDD (see [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem to be a good option for a distributed array, because in most cases the #key updates/second will not be very high. So we implement a distributed HashMap to store the parameters, which can easily be extended for better performance. * *Minimal code change* Quite a lot of effort is done to avoid code changes to Spark core. Tasks which need the parameter server are still created and scheduled by Spark's scheduler. Tasks communicate with the parameter server via a client object, through *akka* or *netty*. With all these concerns we propose the following architecture: h2. Architecture !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg! Data is stored in RDDs and partitioned across workers. During each iteration, each worker gets parameters from the parameter server, then computes new parameters based on the old parameters and the data in its partition. Finally, each worker updates the parameters on the parameter server. A worker communicates with the parameter server through a parameter server client, which is initialized in the `TaskContext` of that worker. The current implementation is based on YARN cluster mode, but it should not be a problem to transplant it to other modes. h3. Interface We refer to existing parameter server systems (petuum, parameter server, paracel) when designing the interface of the parameter server. *`PSClient` provides the following interface for workers to use:*
{code}
// get parameter indexed by key from parameter server
def get[T](key: String): T

// get multiple parameters from parameter server
def multiGet[T](keys: Array[String]): Array[T]

// add parameter indexed by `key` by `delta`,
// if multiple `delta` to update on the same parameter,
// use `reduceFunc`
[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server
[ https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497840#comment-14497840 ] uncleGen commented on SPARK-6932: - There are two ways to use a Parameter Server: one is as an independent, standalone Parameter Server service, and the other is to be integrated into Spark. If we adopt the first, we may need to maintain a long-running Parameter Server cluster, and Spark can access the Parameter Server seamlessly through some configuration without breaking any code in Spark core. IMHO, that is not an efficient way. For us, big model training is just one kind of job. We want to run a Spark PS job just like other Spark jobs, and launch a Parameter Server cluster dynamically within the job. So, if we want to use it dynamically, we may have to modify the current Spark core more or less. Well, just as [~chouqin] said, it is just a prototype; many more problems need to be resolved. A Prototype of Parameter Server --- Key: SPARK-6932 URL: https://issues.apache.org/jira/browse/SPARK-6932 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Qiping Li h2. Introduction As specified in [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be very helpful to integrate a parameter server into Spark for machine learning algorithms, especially for those with ultra-high-dimensional features. After carefully studying the design doc of [Parameter Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] and the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], we propose a prototype of Parameter Server on Spark (Ps-on-Spark), with several key design concerns: * *User friendly interface* Careful investigation was done of most existing Parameter Server systems (including [petuum|http://petuum.github.io], [parameter server|http://parameterserver.org], [paracel|https://github.com/douban/paracel]), and a user-friendly interface is designed by absorbing the essence of all these systems. * *Prototype of distributed array* IndexRDD (see [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem to be a good option for a distributed array, because in most cases the #key updates/second will not be very high. So we implement a distributed HashMap to store the parameters, which can easily be extended for better performance. * *Minimal code change* Quite a lot of effort is done to avoid code changes to Spark core. Tasks which need the parameter server are still created and scheduled by Spark's scheduler. Tasks communicate with the parameter server via a client object, through *akka* or *netty*. With all these concerns we propose the following architecture: h2. Architecture !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg! Data is stored in RDDs and partitioned across workers. During each iteration, each worker gets parameters from the parameter server, then computes new parameters based on the old parameters and the data in its partition. Finally, each worker updates the parameters on the parameter server. A worker communicates with the parameter server through a parameter server client, which is initialized in the `TaskContext` of that worker. The current implementation is based on YARN cluster mode, but it should not be a problem to transplant it to other modes. h3. Interface We refer to existing parameter server systems (petuum, parameter server, paracel) when designing the interface of the parameter server. *`PSClient` provides the following interface for workers to use:*
{code}
// get parameter indexed by key from parameter server
def get[T](key: String): T

// get multiple parameters from parameter server
def multiGet[T](keys: Array[String]): Array[T]

// add parameter indexed by `key` by `delta`,
// if multiple `delta` to update on the same parameter,
// use `reduceFunc` to reduce these `delta`s first.
def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit

// update multiple parameters at the same time, using the same `reduceFunc`.
def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => T): Unit

// advance clock to indicate that the current iteration is finished.
def clock(): Unit

// block until all workers have reached this line of code.
def sync(): Unit
{code}
*`PSContext` provides the following functions to use on the driver:*
{code}
// load parameters from an existing rdd.
def loadPSModel[T](model: RDD[(String, T)])

// fetch parameters from parameter server to construct the model.
def fetchPSModel[T](keys: Array[String]): Array[T]
{code}
*A new function has been added to `RDD` to run parameter server tasks:*
{code}
// run the provided
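To make the proposed PSClient contract concrete, here is a hedged Scala usage sketch of one worker iteration against the interface above (the trait stub, key, and gradient computation are illustrative, not part of the prototype):
{code}
// Minimal stub of the interface above, just enough to type-check the sketch.
trait PSClient {
  def get[T](key: String): T
  def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit
  def clock(): Unit
  def sync(): Unit
}

// One iteration from a worker's point of view (all names illustrative).
def runIteration(client: PSClient, partitionData: Seq[Array[Double]]): Unit = {
  val weights = client.get[Array[Double]]("weights")     // pull parameters
  val delta   = weights.map(_ => 0.1)                    // stand-in for a locally computed gradient
  client.update[Array[Double]]("weights", delta,
    (a, b) => a.zip(b).map { case (x, y) => x + y })     // additive merge of concurrent deltas
  client.clock()                                         // end of this iteration
  client.sync()                                          // wait for the other workers
}
{code}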
[jira] [Comment Edited] (SPARK-6932) A Prototype of Parameter Server
[ https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497840#comment-14497840 ] uncleGen edited comment on SPARK-6932 at 4/16/15 9:39 AM: -- There are two ways to use a Parameter Server: one is as an independent, standalone Parameter Server service, and the other is to integrate it into Spark. If we adopt the first, we may need to maintain a long-running Parameter Server cluster, and Spark can access the Parameter Server seamlessly through some configurations without breaking any code in Spark core. IMHO, that is not an efficient way. For us, big model training is just one kind of job. We want to run a Spark PS job just like other Spark jobs, and launch a Parameter Server cluster dynamically within the job. So, if we want to use it dynamically, we may have to modify the current Spark core more or less. Well, just as [~chouqin] said, it is just a prototype; many more problems need to be resolved. was (Author: unclegen): There are two ways to use a Parameter Server: one is as an independent, standalone Parameter Server service, and the other is to integrate it into Spark. If we adopt the first, we may need to maintain a long-running Parameter Server cluster, and Spark can access the Parameter Server seamlessly through some configuration without breaking any code in Spark core. IMHO, that is not an efficient way. For us, big model training is just one kind of job. We want to run a Spark PS job just like other Spark jobs, and launch a Parameter Server cluster dynamically within the job. So, if we want to use it dynamically, we may have to modify the current Spark core more or less. Well, just as [~chouqin] said, it is just a prototype; many more problems need to be resolved. A Prototype of Parameter Server --- Key: SPARK-6932 URL: https://issues.apache.org/jira/browse/SPARK-6932 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Qiping Li h2. Introduction As specified in [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be very helpful to integrate a parameter server into Spark for machine learning algorithms, especially for those with ultra-high-dimensional features. After carefully studying the design doc of [Parameter Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] and the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], we propose a prototype of Parameter Server on Spark (Ps-on-Spark), with several key design concerns: * *User friendly interface* Careful investigation was done of most existing Parameter Server systems (including [petuum|http://petuum.github.io], [parameter server|http://parameterserver.org], [paracel|https://github.com/douban/paracel]), and a user-friendly interface is designed by absorbing the essence of all these systems. * *Prototype of distributed array* IndexRDD (see [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem to be a good option for a distributed array, because in most cases the #key updates/second will not be very high. So we implement a distributed HashMap to store the parameters, which can easily be extended for better performance. * *Minimal code change* Quite a lot of effort is done to avoid code changes to Spark core. Tasks which need the parameter server are still created and scheduled by Spark's scheduler. Tasks communicate with the parameter server via a client object, through *akka* or *netty*. With all these concerns we propose the following architecture: h2. Architecture !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg! Data is stored in RDDs and partitioned across workers. During each iteration, each worker gets parameters from the parameter server, then computes new parameters based on the old parameters and the data in its partition. Finally, each worker updates the parameters on the parameter server. A worker communicates with the parameter server through a parameter server client, which is initialized in the `TaskContext` of that worker. The current implementation is based on YARN cluster mode, but it should not be a problem to transplant it to other modes. h3. Interface We refer to existing parameter server systems (petuum, parameter server, paracel) when designing the interface of the parameter server. *`PSClient` provides the following interface for workers to use:*
{code}
// get parameter indexed by key from parameter server
def get[T](key: String): T

// get multiple parameters from parameter server
def multiGet[T](keys: Array[String]): Array[T]

// add parameter indexed by `key` by `delta`,
// if multiple `delta` to update on the same parameter,
// use `reduceFunc`
[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves
[ https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497861#comment-14497861 ] Oleksii Mandrychenko commented on SPARK-6935: - I didn't know there was a `--master-instance-type` option in the 1.3 release. I am still using 1.2 release, which doesn't have it. I guess if it is present then this ticket can be closed as fixed in 1.3. Refactoring should probably be another ticket IMO spark/spark-ec2.py add parameters to give different instance types for master and slaves Key: SPARK-6935 URL: https://issues.apache.org/jira/browse/SPARK-6935 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.3.0 Reporter: Oleksii Mandrychenko Priority: Minor Labels: easyfix Original Estimate: 24h Remaining Estimate: 24h I want to start a cluster where I give beefy AWS instances to slaves, such as memory-optimised R3, but master is not really performing much number crunching work. So it is a waste to allocate a powerful instance for master, where a regular one would suffice. Suggested syntax: {code} sh spark-ec2 --instance-type-slave=instance_type # applies to slaves only --instance-type-master=instance_type# applies to master only --instance-type=instance_type # default, applies to both # in real world sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves
[ https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6935. -- Resolution: Not A Problem Ah right, it is already there as --master-instance-type spark/spark-ec2.py add parameters to give different instance types for master and slaves Key: SPARK-6935 URL: https://issues.apache.org/jira/browse/SPARK-6935 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.3.0 Reporter: Oleksii Mandrychenko Priority: Minor Labels: easyfix Original Estimate: 24h Remaining Estimate: 24h I want to start a cluster where I give beefy AWS instances to slaves, such as memory-optimised R3, but master is not really performing much number crunching work. So it is a waste to allocate a powerful instance for master, where a regular one would suffice. Suggested syntax: {code} sh spark-ec2 --instance-type-slave=instance_type # applies to slaves only --instance-type-master=instance_type# applies to master only --instance-type=instance_type # default, applies to both # in real world sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves
[ https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497885#comment-14497885 ] Oleksii Mandrychenko commented on SPARK-6935: - Oh... It's actually there in 1.2 release as well. I did not notice it, because documentation is missing on this point https://spark.apache.org/docs/1.2.1/ec2-scripts.html Perhaps somebody can go through the available options in the script and double check that documentation has matching entries. spark/spark-ec2.py add parameters to give different instance types for master and slaves Key: SPARK-6935 URL: https://issues.apache.org/jira/browse/SPARK-6935 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.3.0 Reporter: Oleksii Mandrychenko Priority: Minor Labels: easyfix Original Estimate: 24h Remaining Estimate: 24h I want to start a cluster where I give beefy AWS instances to slaves, such as memory-optimised R3, but master is not really performing much number crunching work. So it is a waste to allocate a powerful instance for master, where a regular one would suffice. Suggested syntax: {code} sh spark-ec2 --instance-type-slave=instance_type # applies to slaves only --instance-type-master=instance_type# applies to master only --instance-type=instance_type # default, applies to both # in real world sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497887#comment-14497887 ] Apache Spark commented on SPARK-6635: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/5541 DataFrame.withColumn can create columns with identical names Key: SPARK-6635 URL: https://issues.apache.org/jira/browse/SPARK-6635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley DataFrame lets you create multiple columns with the same name, which causes problems when you try to refer to columns by name. Proposal: If a column is added to a DataFrame with the same name as an existing column, then the new column should replace the old column.
{code}
scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
df: org.apache.spark.sql.DataFrame = [x: int]

scala> val df3 = df.withColumn("x", df("x") + 1)
df3: org.apache.spark.sql.DataFrame = [x: int, x: int]

scala> df3.collect()
res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])

scala> df3("x")
org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.;
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
	at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
	at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
	at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
	at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
	at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
	at $iwC$$iwC$$iwC.<init>(<console>:39)
	at $iwC$$iwC.<init>(<console>:41)
	at $iwC.<init>(<console>:43)
	at <init>(<console>:45)
	at .<init>(<console>:49)
	at .<clinit>(<console>)
	at .<init>(<console>:7)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
	at
{code}
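Until that proposal lands, a hedged workaround sketch (the column name is taken from the example above) is to build the new column with select and an explicit alias, so the frame never holds two columns named x:
{code}
// Replace column "x" instead of duplicating it: project around the old
// column and alias the new expression back to the same name.
val df2 = df.select((df("x") + 1).as("x"))
// df2: org.apache.spark.sql.DataFrame = [x: int]  (a single "x" column)
{code}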
[jira] [Assigned] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6635: --- Assignee: Apache Spark DataFrame.withColumn can create columns with identical names Key: SPARK-6635 URL: https://issues.apache.org/jira/browse/SPARK-6635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Apache Spark DataFrame lets you create multiple columns with the same name, which causes problems when you try to refer to columns by name. Proposal: If a column is added to a DataFrame with the same name as an existing column, then the new column should replace the old column.
{code}
scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
df: org.apache.spark.sql.DataFrame = [x: int]

scala> val df3 = df.withColumn("x", df("x") + 1)
df3: org.apache.spark.sql.DataFrame = [x: int, x: int]

scala> df3.collect()
res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])

scala> df3("x")
org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.;
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
	at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
	at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
	at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
	at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
	at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
	at $iwC$$iwC$$iwC.<init>(<console>:39)
	at $iwC$$iwC.<init>(<console>:41)
	at $iwC.<init>(<console>:43)
	at <init>(<console>:45)
	at .<init>(<console>:49)
	at .<clinit>(<console>)
	at .<init>(<console>:7)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
	at
{code}
[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves
[ https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497966#comment-14497966 ] Sean Owen commented on SPARK-6935: -- [~yuu.ishik...@gmail.com] What do you want to work on? It looks like this option already exists. spark/spark-ec2.py add parameters to give different instance types for master and slaves Key: SPARK-6935 URL: https://issues.apache.org/jira/browse/SPARK-6935 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.3.0 Reporter: Oleksii Mandrychenko Priority: Minor Labels: easyfix Original Estimate: 24h Remaining Estimate: 24h I want to start a cluster where I give beefy AWS instances to slaves, such as memory-optimised R3, but the master is not really performing much number-crunching work. So it is a waste to allocate a powerful instance for the master, where a regular one would suffice. Suggested syntax:
{code}
sh spark-ec2
  --instance-type-slave=instance_type   # applies to slaves only
  --instance-type-master=instance_type  # applies to master only
  --instance-type=instance_type         # default, applies to both

# in real world
sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves
[ https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497989#comment-14497989 ] Sean Owen commented on SPARK-6935: -- I don't know if it's that important to update the 1.2 docs since there may be no additional 1.2.x release after 1.2.2 coming out ... today. It's also in the usage message already. spark/spark-ec2.py add parameters to give different instance types for master and slaves Key: SPARK-6935 URL: https://issues.apache.org/jira/browse/SPARK-6935 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.3.0 Reporter: Oleksii Mandrychenko Priority: Minor Labels: easyfix Original Estimate: 24h Remaining Estimate: 24h I want to start a cluster where I give beefy AWS instances to slaves, such as memory-optimised R3, but master is not really performing much number crunching work. So it is a waste to allocate a powerful instance for master, where a regular one would suffice. Suggested syntax: {code} sh spark-ec2 --instance-type-slave=instance_type # applies to slaves only --instance-type-master=instance_type# applies to master only --instance-type=instance_type # default, applies to both # in real world sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6635: --- Assignee: (was: Apache Spark)

DataFrame.withColumn can create columns with identical names
Key: SPARK-6635 URL: https://issues.apache.org/jira/browse/SPARK-6635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley

DataFrame lets you create multiple columns with the same name, which causes problems when you try to refer to columns by name. Proposal: if a column is added to a DataFrame with the same name as an existing column, the new column should replace the old column.
{code}
scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
df: org.apache.spark.sql.DataFrame = [x: int]

scala> val df3 = df.withColumn("x", df("x") + 1)
df3: org.apache.spark.sql.DataFrame = [x: int, x: int]

scala> df3.collect()
res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])

scala> df3("x")
org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.;
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
at $iwC$$iwC$$iwC.<init>(<console>:39)
at $iwC$$iwC.<init>(<console>:41)
at $iwC.<init>(<console>:43)
at <init>(<console>:45)
at .<init>(<console>:49)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}
[jira] [Commented] (SPARK-2734) DROP TABLE should also uncache table
[ https://issues.apache.org/jira/browse/SPARK-2734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497913#comment-14497913 ] Arush Kharbanda commented on SPARK-2734: This issue is occurring for me; I am using Spark 1.3.0.

DROP TABLE should also uncache table
Key: SPARK-2734 URL: https://issues.apache.org/jira/browse/SPARK-2734 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Fix For: 1.1.0

Steps to reproduce:
{code}
hql("CREATE TABLE test(a INT)")
hql("CACHE TABLE test")
hql("DROP TABLE test")
hql("SELECT * FROM test")
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
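Until DROP TABLE also evicts the cached copy, a workaround is to uncache explicitly before dropping. A sketch against the same hql API used in the reproduction above:
{code}
// Workaround sketch: evict the in-memory copy before dropping the
// table, so the cache entry does not outlive the table itself.
hql("CREATE TABLE test(a INT)")
hql("CACHE TABLE test")
// ... queries against the cached table ...
hql("UNCACHE TABLE test")   // or equivalently: sqlContext.uncacheTable("test")
hql("DROP TABLE test")
{code}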
[jira] [Comment Edited] (SPARK-3284) saveAsParquetFile not working on windows
[ https://issues.apache.org/jira/browse/SPARK-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498019#comment-14498019 ] Bogdan Niculescu edited comment on SPARK-3284 at 4/16/15 1:02 PM: -- I get the same type of exception in Spark 1.3.0 under Windows when trying to save to a parquet file. Here is my code:
{code}
case class Person(name: String, age: Int)

object DataFrameTest extends App {
  val conf = new SparkConf().setMaster("local[4]").setAppName("ParquetTest")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val persons = List(Person("a", 1), Person("b", 2))
  val rdd = sc.parallelize(persons)
  val dataFrame = sqlContext.createDataFrame(rdd)

  dataFrame.saveAsParquetFile("test.parquet")
}
{code}
The exception that I'm seeing is:
{code}
Exception in thread "main" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
...
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123)
at org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:922)
at sparkTest.DataFrameTest$.delayedEndpoint$sparkTest$DataFrameTest$1(DataFrameTest.scala:19)
at sparkTest.DataFrameTest$delayedInit$body.apply(DataFrameTest.scala:9)
{code}
was (Author: bogdannb): I get the same type of exception in Spark 1.3.0 under Windows when trying to save to a parquet file. Here is my code:
{code}
case class Person(name: String, age: Int)

object DataFrameTest extends App {
  val conf = new SparkConf().setMaster("local[4]").setAppName("ParquetTest")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val persons = List(Person("a", 1), Person("b", 2))
  val rdd = sc.parallelize(persons)
  val dataFrame = sqlContext.createDataFrame(rdd)

  dataFrame.saveAsParquetFile("test.parquet")
}
{code}
The exception that I'm seeing is: Exception in thread "main" java.lang.NullPointerException at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010) at org.apache.hadoop.util.Shell.runCommand(Shell.java:404) at org.apache.hadoop.util.Shell.run(Shell.java:379) ... at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123) at org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:922) at sparkTest.DataFrameTest$.delayedEndpoint$sparkTest$DataFrameTest$1(DataFrameTest.scala:19) at sparkTest.DataFrameTest$delayedInit$body.apply(DataFrameTest.scala:9)

saveAsParquetFile not working on windows
Key: SPARK-3284 URL: https://issues.apache.org/jira/browse/SPARK-3284 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 1.0.2 Environment: Windows Reporter: Pravesh Jain Priority: Minor

{code}
object parquet {
  case class Person(name: String, age: Int)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
    import sqlContext.createSchemaRDD
    val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
    people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
    val parquetFile = sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
  }
}
{code}
gives the error Exception in thread "main" java.lang.NullPointerException at org.apache.spark.parquet$.main(parquet.scala:16), which is the saveAsParquetFile line. This works fine on Linux, but running it from Eclipse on Windows gives the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
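The NPE out of java.lang.ProcessBuilder.start via org.apache.hadoop.util.Shell is what Hadoop typically produces on Windows when winutils.exe cannot be located. If that is the cause here, pointing hadoop.home.dir at a directory containing bin\winutils.exe before creating the SparkContext may avoid the crash; a sketch, where "C:/hadoop" is a placeholder path and not something taken from this report:
{code}
// Sketch: Hadoop's Shell utilities on Windows look for
// %HADOOP_HOME%\bin\winutils.exe; set it before Spark starts.
System.setProperty("hadoop.home.dir", "C:/hadoop") // must contain bin/winutils.exe
val conf = new SparkConf().setMaster("local[4]").setAppName("ParquetTest")
val sc = new SparkContext(conf)
{code}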
[jira] [Updated] (SPARK-3937) ç
[ https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-3937: Summary: ç (was: Unsafe memory access inside of Snappy library) ç - Key: SPARK-3937 URL: https://issues.apache.org/jira/browse/SPARK-3937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.3.0 Reporter: Patrick Wendell This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't have much information about this other than the stack trace. However, it was concerning enough I figured I should post it. {code} java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code org.xerial.snappy.SnappyNative.rawUncompress(Native Method) org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) org.xerial.snappy.Snappy.uncompress(Snappy.java:480) org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355) org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142) java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712) java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) 
org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves
[ https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497882#comment-14497882 ] Yu Ishikawa edited comment on SPARK-6935 at 4/16/15 10:38 AM: -- [~sowen] I would like to work this issue. Please assign me to it? However, I don't know how to test `ec2/spark_ec2.py`. If you know the way, would you tell me that? was (Author: yuu.ishik...@gmail.com): [~sowen] I would like to work this issue. Please assign me to it? However, I don't know how to test `ec2/spark_ec2.py`. If do you know the way, would you tell me that? spark/spark-ec2.py add parameters to give different instance types for master and slaves Key: SPARK-6935 URL: https://issues.apache.org/jira/browse/SPARK-6935 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.3.0 Reporter: Oleksii Mandrychenko Priority: Minor Labels: easyfix Original Estimate: 24h Remaining Estimate: 24h I want to start a cluster where I give beefy AWS instances to slaves, such as memory-optimised R3, but master is not really performing much number crunching work. So it is a waste to allocate a powerful instance for master, where a regular one would suffice. Suggested syntax: {code} sh spark-ec2 --instance-type-slave=instance_type # applies to slaves only --instance-type-master=instance_type# applies to master only --instance-type=instance_type # default, applies to both # in real world sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3937) Unsafe memory access inside of Snappy library
[ https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497899#comment-14497899 ] Taro L. Saito commented on SPARK-3937: -- I think this error occurs when wrong memory position or corrupted data is read by snappy. I would like to check this binary data read by SnappyInputStream. Unsafe memory access inside of Snappy library - Key: SPARK-3937 URL: https://issues.apache.org/jira/browse/SPARK-3937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.3.0 Reporter: Patrick Wendell This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't have much information about this other than the stack trace. However, it was concerning enough I figured I should post it. {code} java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code org.xerial.snappy.SnappyNative.rawUncompress(Native Method) org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) org.xerial.snappy.Snappy.uncompress(Snappy.java:480) org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355) org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142) java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712) java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves
[ https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497987#comment-14497987 ] Yu Ishikawa commented on SPARK-6935: Oh sorry. I misunderstood we should add --slave-instance-type option, instead of --instances-type. But, as Oleksii said, I think we should modify the documentation at v1.2.x. It should be another ticket. Thanks spark/spark-ec2.py add parameters to give different instance types for master and slaves Key: SPARK-6935 URL: https://issues.apache.org/jira/browse/SPARK-6935 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.3.0 Reporter: Oleksii Mandrychenko Priority: Minor Labels: easyfix Original Estimate: 24h Remaining Estimate: 24h I want to start a cluster where I give beefy AWS instances to slaves, such as memory-optimised R3, but master is not really performing much number crunching work. So it is a waste to allocate a powerful instance for master, where a regular one would suffice. Suggested syntax: {code} sh spark-ec2 --instance-type-slave=instance_type # applies to slaves only --instance-type-master=instance_type# applies to master only --instance-type=instance_type # default, applies to both # in real world sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6932) A Prototype of Parameter Server
[ https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-6932: Component/s: Spark Core

A Prototype of Parameter Server
Key: SPARK-6932 URL: https://issues.apache.org/jira/browse/SPARK-6932 Project: Spark Issue Type: New Feature Components: ML, MLlib, Spark Core Reporter: Qiping Li

h2. Introduction
As specified in [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be very helpful to integrate a parameter server into Spark for machine learning algorithms, especially those with ultra-high-dimensional features. After carefully studying the design doc of [Parameter Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] and the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], we proposed a prototype of Parameter Server on Spark (PS-on-Spark), with several key design concerns:
* *User friendly interface* Most existing Parameter Server systems (including [petuum|http://petuum.github.io], [parameter server|http://parameterserver.org], and [paracel|https://github.com/douban/paracel]) were carefully investigated, and a user-friendly interface was designed by absorbing the essence of all these systems.
* *Prototype of distributed array* IndexRDD (see [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem to be a good option for a distributed array, because in most cases the number of key updates per second is not very high. So we implement a distributed HashMap to store the parameters, which can easily be extended for better performance.
* *Minimal code change* Quite a lot of effort is spent avoiding code changes in Spark core. Tasks which need the parameter server are still created and scheduled by Spark's scheduler. Tasks communicate with the parameter server through a client object, over *akka* or *netty*.
With all these concerns we propose the following architecture:
h2. Architecture
!https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg!
Data is stored in RDDs and partitioned across workers. During each iteration, each worker gets parameters from the parameter server, then computes new parameters based on the old parameters and the data in its partition. Finally, each worker pushes its parameter updates to the parameter server. A worker communicates with the parameter server through a parameter server client, which is initialized in the `TaskContext` of that worker. The current implementation is based on YARN cluster mode, but it should not be a problem to transplant it to other modes.
h3. Interface
We referred to existing parameter server systems (petuum, parameter server, paracel) when designing the interface of the parameter server.
*`PSClient` provides the following interface for workers to use:*
{code}
// get parameter indexed by key from parameter server
def get[T](key: String): T

// get multiple parameters from parameter server
def multiGet[T](keys: Array[String]): Array[T]

// add parameter indexed by `key` by `delta`;
// if there are multiple `delta`s to apply to the same parameter,
// use `reduceFunc` to reduce these `delta`s first.
def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit

// update multiple parameters at the same time, using the same `reduceFunc`.
def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => T): Unit

// advance clock to indicate that current iteration is finished.
def clock(): Unit

// block until all workers have reached this line of code.
def sync(): Unit
{code}
*`PSContext` provides the following functions to use on the driver:*
{code}
// load parameters from an existing rdd.
def loadPSModel[T](model: RDD[(String, T)])

// fetch parameters from parameter server to construct model.
def fetchPSModel[T](keys: Array[String]): Array[T]
{code}
*A new function has been added to `RDD` to run parameter server tasks:*
{code}
// run the provided `func` on each partition of this RDD.
// This function can use the data of the partition (the first argument)
// and a parameter server client (the second argument).
// See the following Logistic Regression for an example.
def runWithPS[U: ClassTag](func: (Array[T], PSClient) => U): Array[U]
{code}
h2. Example
Here is an example of using our prototype to implement logistic regression:
{code:title=LogisticRegression.scala|borderStyle=solid}
def train(
    sc: SparkContext,
    input: RDD[LabeledPoint],
    numIterations: Int,
    stepSize: Double,
    miniBatchFraction: Double): LogisticRegressionModel = {
  // initialize weights
  val numFeatures = input.map(_.features.size).first()
  val initialWeights = new Array[Double](numFeatures)
  ...
{code}
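To make the proposed interface concrete, here is a sketch of a worker-side update loop written against the `runWithPS`/`PSClient` signatures quoted above. The key name "w", the `computeGradient` helper, and `data` are illustrative placeholders; none of this is committed Spark API:
{code}
import org.apache.spark.mllib.regression.LabeledPoint

// Illustrative sketch only, using the proposed PSClient interface.
data.runWithPS { (points: Array[LabeledPoint], ps: PSClient) =>
  val w = ps.get[Array[Double]]("w")      // pull current weights
  val grad = computeGradient(points, w)   // placeholder local gradient
  // push the delta; the server reduces concurrent deltas element-wise
  ps.update[Array[Double]]("w", grad, (a, b) => a.zip(b).map { case (x, y) => x + y })
  ps.clock()                              // mark this iteration finished
}
{code}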
[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves
[ https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497882#comment-14497882 ] Yu Ishikawa commented on SPARK-6935: [~sowen] I would like to work this issue. Please assign me to it? However, I don't know how to test `ec2/spark_ec2.py`. If do you know the way, would you tell me that? spark/spark-ec2.py add parameters to give different instance types for master and slaves Key: SPARK-6935 URL: https://issues.apache.org/jira/browse/SPARK-6935 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.3.0 Reporter: Oleksii Mandrychenko Priority: Minor Labels: easyfix Original Estimate: 24h Remaining Estimate: 24h I want to start a cluster where I give beefy AWS instances to slaves, such as memory-optimised R3, but master is not really performing much number crunching work. So it is a waste to allocate a powerful instance for master, where a regular one would suffice. Suggested syntax: {code} sh spark-ec2 --instance-type-slave=instance_type # applies to slaves only --instance-type-master=instance_type# applies to master only --instance-type=instance_type # default, applies to both # in real world sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6932) A Prototype of Parameter Server
[ https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497840#comment-14497840 ] uncleGen edited comment on SPARK-6932 at 4/16/15 12:56 PM: --- There are two ways to use a Parameter Server: one is an independent, standalone Parameter Server service, and the other is to integrate it into Spark. If we adopt the first one, we may need to maintain a long-running Parameter Server cluster, and Spark can access the Parameter Server seamlessly through some configuration without breaking any code in Spark core. IMHO, that is not an efficient way. For us, big model training is just one kind of job. We want to run a Spark PS job just like other Spark jobs, and launch a Parameter Server cluster dynamically inside the job. So, if we want to use it dynamically, we may need to modify current Spark core more or less. Well, just as [~chouqin] said, it is just a prototype; many more problems need to be resolved. was (Author: unclegen): There are two ways to use a Parameter Server: one is an independent, standalone Parameter Server service, and the other is to integrate it into Spark. If we adopt the first one, we may need to maintain a long-running Parameter Server cluster, and Spark can access the Parameter Server seamlessly through some configuration without breaking any code in Spark core. IMHO, that is not an efficient way. For us, big model training is just one kind of job. We want to run a Spark PS job just like other Spark jobs, and launch a Parameter Server cluster dynamically inside the job. So, if we want to use it dynamically, we may need to modify current Spark core more or less. Well, just as [~chouqin], it is just a prototype; many more problems need to be resolved.

A Prototype of Parameter Server
Key: SPARK-6932 URL: https://issues.apache.org/jira/browse/SPARK-6932 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Qiping Li

h2. Introduction
As specified in [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be very helpful to integrate a parameter server into Spark for machine learning algorithms, especially those with ultra-high-dimensional features. After carefully studying the design doc of [Parameter Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] and the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], we proposed a prototype of Parameter Server on Spark (PS-on-Spark), with several key design concerns:
* *User friendly interface* Most existing Parameter Server systems (including [petuum|http://petuum.github.io], [parameter server|http://parameterserver.org], and [paracel|https://github.com/douban/paracel]) were carefully investigated, and a user-friendly interface was designed by absorbing the essence of all these systems.
* *Prototype of distributed array* IndexRDD (see [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem to be a good option for a distributed array, because in most cases the number of key updates per second is not very high. So we implement a distributed HashMap to store the parameters, which can easily be extended for better performance.
* *Minimal code change* Quite a lot of effort is spent avoiding code changes in Spark core. Tasks which need the parameter server are still created and scheduled by Spark's scheduler. Tasks communicate with the parameter server through a client object, over *akka* or *netty*.
With all these concerns we propose the following architecture:
h2. Architecture
!https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg!
Data is stored in RDDs and partitioned across workers. During each iteration, each worker gets parameters from the parameter server, then computes new parameters based on the old parameters and the data in its partition. Finally, each worker pushes its parameter updates to the parameter server. A worker communicates with the parameter server through a parameter server client, which is initialized in the `TaskContext` of that worker. The current implementation is based on YARN cluster mode, but it should not be a problem to transplant it to other modes.
h3. Interface
We referred to existing parameter server systems (petuum, parameter server, paracel) when designing the interface of the parameter server.
*`PSClient` provides the following interface for workers to use:*
{code}
// get parameter indexed by key from parameter server
def get[T](key: String): T

// get multiple parameters from parameter server
def multiGet[T](keys: Array[String]): Array[T]

// add parameter indexed by `key` by `delta`;
// if there are multiple `delta`s to apply to the same parameter,
// use ...
{code}
[jira] [Updated] (SPARK-3937) Unsafe memory access inside of Snappy library
[ https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-3937: Summary: Unsafe memory access inside of Snappy library (was: ç) Unsafe memory access inside of Snappy library - Key: SPARK-3937 URL: https://issues.apache.org/jira/browse/SPARK-3937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.3.0 Reporter: Patrick Wendell This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't have much information about this other than the stack trace. However, it was concerning enough I figured I should post it. {code} java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code org.xerial.snappy.SnappyNative.rawUncompress(Native Method) org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) org.xerial.snappy.Snappy.uncompress(Snappy.java:480) org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355) org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142) java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712) java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) 
org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Commented] (SPARK-6935) spark/spark-ec2.py add parameters to give different instance types for master and slaves
[ https://issues.apache.org/jira/browse/SPARK-6935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498001#comment-14498001 ] Yu Ishikawa commented on SPARK-6935: I got it. I agree. Thank you for letting me know. spark/spark-ec2.py add parameters to give different instance types for master and slaves Key: SPARK-6935 URL: https://issues.apache.org/jira/browse/SPARK-6935 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.3.0 Reporter: Oleksii Mandrychenko Priority: Minor Labels: easyfix Original Estimate: 24h Remaining Estimate: 24h I want to start a cluster where I give beefy AWS instances to slaves, such as memory-optimised R3, but master is not really performing much number crunching work. So it is a waste to allocate a powerful instance for master, where a regular one would suffice. Suggested syntax: {code} sh spark-ec2 --instance-type-slave=instance_type # applies to slaves only --instance-type-master=instance_type# applies to master only --instance-type=instance_type # default, applies to both # in real world sh spark-ec2 --instance-type-slave=r3.2xlarge --instance-type-master=c3.xlarge {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3284) saveAsParquetFile not working on windows
[ https://issues.apache.org/jira/browse/SPARK-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498019#comment-14498019 ] Bogdan Niculescu commented on SPARK-3284: - I get the same type of exception in Spark 1.3.0 under Windows when trying to save to a parquet file. Here is my code:

case class Person(name: String, age: Int)
object DataFrameTest extends App {
  val conf = new SparkConf().setMaster("local[4]").setAppName("ParquetTest")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val persons = List(Person("a", 1), Person("b", 2))
  val rdd = sc.parallelize(persons)
  val dataFrame = sqlContext.createDataFrame(rdd)
  dataFrame.saveAsParquetFile("test.parquet")
}

The exception that I'm seeing is:

Exception in thread "main" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
...
at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123)
at org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:922)
at sparkTest.DataFrameTest$.delayedEndpoint$sparkTest$DataFrameTest$1(DataFrameTest.scala:19)
at sparkTest.DataFrameTest$delayedInit$body.apply(DataFrameTest.scala:9)

saveAsParquetFile not working on windows
Key: SPARK-3284 URL: https://issues.apache.org/jira/browse/SPARK-3284 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 1.0.2 Environment: Windows Reporter: Pravesh Jain Priority: Minor

{code}
object parquet {
  case class Person(name: String, age: Int)
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
    import sqlContext.createSchemaRDD
    val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
    people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
    val parquetFile = sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
  }
}
{code}
gives the error Exception in thread "main" java.lang.NullPointerException at org.apache.spark.parquet$.main(parquet.scala:16), which is the saveAsParquetFile line. This works fine on Linux, but running it from Eclipse on Windows gives the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6960) Hardcoded Scala version in multiple files, including SparkPluginBuild.scala, as well as docs/ with self-deprecating comment.
Michah Lerner created SPARK-6960: Summary: Hardcoded Scala version in multiple files, including SparkPluginBuild.scala, as well as docs/ with self-deprecating comment. Key: SPARK-6960 URL: https://issues.apache.org/jira/browse/SPARK-6960 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.0 Reporter: Michah Lerner Priority: Minor Fix For: 1.3.1

The Scala version 2.10 is hard-coded in multiple files, including docs/_plugins/copy_api_dirs.rb, docs/_config.yml and spark-1.3.0/project/project/SparkPluginBuild.scala, which leads to:

[error] /path/spark-1.3.0/project/project/SparkPluginBuild.scala:33: can't expand macros compiled by previous versions of Scala

Provided the Scala version is edited manually, the following generally builds successfully, except when the -Pnetlib-lgpl option is added to the mvn build:

mvn -Dscala-2.11 -Pyarn -Phadoop-2.4 -DskipTests clean package
sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 unidoc
cd docs
jekyll b

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6962) Spark gets stuck on a step, hangs forever - jobs do not complete
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6962: - Priority: Major (was: Blocker) Until this is more vetted, reducing from Blocker. It's hard to get any further without more info, like what was stuck where. Do you have a stack dump? Or any code to reproduce? This may be a duplicate of https://issues.apache.org/jira/browse/SPARK-4395 or https://issues.apache.org/jira/browse/SPARK-5060, for example; it's worth searching JIRA first.

Spark gets stuck on a step, hangs forever - jobs do not complete
Key: SPARK-6962 URL: https://issues.apache.org/jira/browse/SPARK-6962 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Jon Chase

Spark SQL queries (though this seems to be a Spark Core issue - I'm just using queries in the REPL to surface this, so I mention Spark SQL) hang indefinitely under certain (not totally understood) circumstances. This is resolved by setting spark.shuffle.blockTransferService=nio, which seems to point to netty as the issue. Netty was set as the default for the block transport layer in 1.2.0, which is when this issue started. Setting the service to nio allows queries to complete normally. I do not see this problem when running queries over smaller (~20 5MB files) datasets. When I increase the scope to include more data (several hundred ~5MB files), the queries will get through several steps but eventually hang indefinitely. Here's the email chain regarding this issue, including stack traces: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com For context, here's the announcement regarding the block transfer service change: http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
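The nio workaround named in the description is an ordinary Spark configuration entry, so it can be applied in code or on the spark-submit command line. A sketch against the 1.2/1.3-era setting; the app name is a placeholder:
{code}
// Sketch: force the nio block transfer service instead of the
// netty default that became standard in 1.2.0.
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.shuffle.blockTransferService", "nio")
val sc = new SparkContext(conf)
{code}
Equivalently, on the command line: spark-submit --conf spark.shuffle.blockTransferService=nio ...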
[jira] [Updated] (SPARK-6962) Spark gets stuck on a step, hangs forever - jobs do not complete
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6962: - Component/s: (was: Spark Core) SQL

Spark gets stuck on a step, hangs forever - jobs do not complete
Key: SPARK-6962 URL: https://issues.apache.org/jira/browse/SPARK-6962 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Jon Chase

Spark SQL queries (though this seems to be a Spark Core issue - I'm just using queries in the REPL to surface this, so I mention Spark SQL) hang indefinitely under certain (not totally understood) circumstances. This is resolved by setting spark.shuffle.blockTransferService=nio, which seems to point to netty as the issue. Netty was set as the default for the block transport layer in 1.2.0, which is when this issue started. Setting the service to nio allows queries to complete normally. I do not see this problem when running queries over smaller (~20 5MB files) datasets. When I increase the scope to include more data (several hundred ~5MB files), the queries will get through several steps but eventually hang indefinitely. Here's the email chain regarding this issue, including stack traces: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com For context, here's the announcement regarding the block transfer service change: http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6694) SparkSQL CLI must be able to specify an option --database on the command line.
[ https://issues.apache.org/jira/browse/SPARK-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6694. --- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5345 [https://github.com/apache/spark/pull/5345]

SparkSQL CLI must be able to specify an option --database on the command line.
Key: SPARK-6694 URL: https://issues.apache.org/jira/browse/SPARK-6694 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Jin Adachi Priority: Critical Fix For: 1.4.0

SparkSQL CLI has an option --database, as shown below, but the option is ignored.
{code}
$ spark-sql --help
:
CLI options:
:
 --database databasename    Specify the database to use
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6959) Support for datetime comparisons in filter for dataframes in pyspark
[ https://issues.apache.org/jira/browse/SPARK-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6959: - Component/s: SQL

Support for datetime comparisons in filter for dataframes in pyspark
Key: SPARK-6959 URL: https://issues.apache.org/jira/browse/SPARK-6959 Project: Spark Issue Type: Improvement Components: SQL Reporter: cynepia

Currently, strings can be passed in filter for comparison with a date type column, but this does not address the case where dates may be passed in different formats. We should have support for datetime.date or some standard format for comparison with date type columns. Currently, df.filter(df.Datecol > datetime.date(2015,1,1)).show() is not supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6962) Spark gets stuck on a step, hangs forever - jobs do not complete
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498233#comment-14498233 ] Jon Chase edited comment on SPARK-6962 at 4/16/15 4:17 PM: --- Attaching the stack dumps I took when Spark is hanging. was (Author: jonchase): Here are the stack dumps I took when Spark is hanging.

Spark gets stuck on a step, hangs forever - jobs do not complete
Key: SPARK-6962 URL: https://issues.apache.org/jira/browse/SPARK-6962 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Jon Chase Attachments: jstacks.txt

Spark SQL queries (though this seems to be a Spark Core issue - I'm just using queries in the REPL to surface this, so I mention Spark SQL) hang indefinitely under certain (not totally understood) circumstances. This is resolved by setting spark.shuffle.blockTransferService=nio, which seems to point to netty as the issue. Netty was set as the default for the block transport layer in 1.2.0, which is when this issue started. Setting the service to nio allows queries to complete normally. I do not see this problem when running queries over smaller (~20 5MB files) datasets. When I increase the scope to include more data (several hundred ~5MB files), the queries will get through several steps but eventually hang indefinitely. Here's the email chain regarding this issue, including stack traces: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com For context, here's the announcement regarding the block transfer service change: http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6675) HiveContext setConf is not stable
[ https://issues.apache.org/jira/browse/SPARK-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6675: -- Priority: Critical (was: Major)

HiveContext setConf is not stable
Key: SPARK-6675 URL: https://issues.apache.org/jira/browse/SPARK-6675 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: AWS ec2 xlarge2 cluster launched by spark's script Reporter: Hao Ren Priority: Critical

I find HiveContext.setConf does not work correctly. Here are some code snippets showing the problem:
snippet 1:
{code}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object Main extends App {
  val conf = new SparkConf()
    .setAppName("context-test")
    .setMaster("local[8]")
  val sc = new SparkContext(conf)
  val hc = new HiveContext(sc)

  hc.setConf("spark.sql.shuffle.partitions", "10")
  hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")

  hc.getAllConfs filter (_._1.contains("warehouse.dir")) foreach println
  hc.getAllConfs filter (_._1.contains("shuffle.partitions")) foreach println
}
{code}
Results:
(hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)
(spark.sql.shuffle.partitions,10)

snippet 2:
{code}
...
hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
hc.setConf("spark.sql.shuffle.partitions", "10")

hc.getAllConfs filter (_._1.contains("warehouse.dir")) foreach println
hc.getAllConfs filter (_._1.contains("shuffle.partitions")) foreach println
...
{code}
Results:
(hive.metastore.warehouse.dir,/user/hive/warehouse)
(spark.sql.shuffle.partitions,10)

You can see that I just permuted the two setConf calls, and that leads to two different Hive configurations. It seems that HiveContext cannot set a new value on the hive.metastore.warehouse.dir key in the first setConf call; you need another setConf call before changing hive.metastore.warehouse.dir. For example, setting hive.metastore.warehouse.dir twice, as in snippet 3, gives the same result as snippet 1:

snippet 3:
{code}
...
hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")

hc.getAllConfs filter (_._1.contains("warehouse.dir")) foreach println
...
{code}
Results:
(hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)

You can reproduce this on the latest branch-1.3 (1.3.1-snapshot, htag = 7d029cb1eb6f1df1bce1a3f5784fb7ce2f981a33). I have also tested the released 1.3.0 (htag = 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc). It has the same problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
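Given the behaviour above, a pragmatic workaround until the ordering bug is fixed is simply what snippet 3 does: make sure hive.metastore.warehouse.dir is not the very first setConf call, for instance by setting it twice:
{code}
// Workaround sketch, mirroring snippet 3: the first call triggers
// Hive configuration initialization, the second one actually sticks.
hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
{code}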
[jira] [Comment Edited] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498185#comment-14498185 ] Reynold Xin edited comment on SPARK-6635 at 4/16/15 3:38 PM: - Should we throw an exception if the name is identical, or just replace it? For example, what happens if the user does {code} df.select(df(*), lit(1).as(col)) {code} If df already contains a column named col? was (Author: rxin): Should we throw an exception if the name is identical, or just replace it? DataFrame.withColumn can create columns with identical names Key: SPARK-6635 URL: https://issues.apache.org/jira/browse/SPARK-6635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley DataFrame lets you create multiple columns with the same name, which causes problems when you try to refer to columns by name. Proposal: If a column is added to a DataFrame with a column of the same name, then the new column should replace the old column. {code} scala val df = sc.parallelize(Array(1,2,3)).toDF(x) df: org.apache.spark.sql.DataFrame = [x: int] scala val df3 = df.withColumn(x, df(x) + 1) df3: org.apache.spark.sql.DataFrame = [x: int, x: int] scala df3.collect() res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4]) scala df3(x) org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33) at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:35) at $iwC$$iwC$$iwC$$iwC.init(console:37) at $iwC$$iwC$$iwC.init(console:39) at $iwC$$iwC.init(console:41) at $iwC.init(console:43) at init(console:45) at .init(console:49) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996) at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at
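For readers following the API discussion above: a minimal sketch of the replace-the-column semantics being proposed, written against the Spark 1.3 DataFrame API. The helper name withColumnReplaced is illustrative, not an existing Spark method.
{code}
import org.apache.spark.sql.{Column, DataFrame}

// Hypothetical helper: keep every column except `name`, then append the new
// expression under that name, so only a single column with that name can exist.
def withColumnReplaced(df: DataFrame, name: String, expr: Column): DataFrame = {
  val kept = df.columns.filterNot(_ == name).map(df.apply)
  df.select((kept :+ expr.as(name)): _*)
}

// withColumnReplaced(df, "x", df("x") + 1) yields a single "x" column and
// avoids the ambiguous-reference error shown in the stack trace above.
{code}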
[jira] [Commented] (SPARK-3937) Unsafe memory access inside of Snappy library
[ https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498189#comment-14498189 ] Guoqiang Li commented on SPARK-3937:
[~joshrosen] This bug occurs every time. I'm not sure whether I can reproduce it with local-cluster mode; if I can, I'll post the code.
Unsafe memory access inside of Snappy library
Key: SPARK-3937 URL: https://issues.apache.org/jira/browse/SPARK-3937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.3.0 Reporter: Patrick Wendell
This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't have much information about this other than the stack trace. However, it was concerning enough that I figured I should post it.
{code}
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code org.xerial.snappy.SnappyNative.rawUncompress(Native Method) org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) org.xerial.snappy.Snappy.uncompress(Snappy.java:480) org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355) org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142) java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712) java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
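Since the fault occurs inside Snappy's native decompression, a workaround sometimes used in the field (an assumption on my part, not something confirmed in this ticket) is to move Spark's block compression off snappy:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// spark.io.compression.codec picks the codec for shuffle/broadcast blocks;
// "lzf" (or "lz4") avoids the snappy-java native path implicated above.
val conf = new SparkConf()
  .setAppName("snappy-workaround")
  .set("spark.io.compression.codec", "lzf")
val sc = new SparkContext(conf)
{code}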
[jira] [Commented] (SPARK-6962) Spark gets stuck on a step, hangs forever - jobs do not complete
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498232#comment-14498232 ] Jon Chase commented on SPARK-6962: -- I think it's different from SPARK-4395, as calling/omitting .cache() doesn't have any effect. Also, once it hangs, I've never seen it finish (even after waiting many hours). Also different from SPARK-5060, I believe, as the web UI accurately reports the remaining tasks as unfinished (or in progress in case of the ones running when the hang occurs). Here's my original post from the email thread: === Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8cpu, 30GB), executor memory 20GB, driver memory 10GB I'm using Spark SQL, mainly via spark-shell, to query 15GB of data spread out over roughly 2,000 Parquet files and my queries frequently hang. Simple queries like select count(*) from ... on the entire data set work ok. Slightly more demanding ones with group by's and some aggregate functions (percentile_approx, avg, etc.) work ok as well, as long as I have some criteria in my where clause to keep the number of rows down. Once I hit some limit on query complexity and rows processed, my queries start to hang. I've left them for up to an hour without seeing any progress. No OOM's either - the job is just stuck. I've tried setting spark.sql.shuffle.partitions to 400 and even 800, but with the same results: usually near the end of the tasks (like 780 of 800 complete), progress just stops: 15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 788.0 in stage 1.0 (TID 1618) in 800 ms on ip-10-209-22-211.eu-west-1.compute.internal (748/800) 15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 793.0 in stage 1.0 (TID 1623) in 622 ms on ip-10-105-12-41.eu-west-1.compute.internal (749/800) 15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 797.0 in stage 1.0 (TID 1627) in 616 ms on ip-10-90-2-201.eu-west-1.compute.internal (750/800) 15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 799.0 in stage 1.0 (TID 1629) in 611 ms on ip-10-90-2-201.eu-west-1.compute.internal (751/800) 15/03/26 20:53:29 INFO scheduler.TaskSetManager: Finished task 795.0 in stage 1.0 (TID 1625) in 669 ms on ip-10-105-12-41.eu-west-1.compute.internal (752/800) ^^^ this is where it stays forever Looking at the Spark UI, several of the executors still list active tasks. I do see that the Shuffle Read for executors that don't have any tasks remaining is around 100MB, whereas it's more like 10MB for the executors that still have tasks. The first stage, mapPartitions, always completes fine. It's the second stage (takeOrdered), that hangs. I've had this issue in 1.2.0 and 1.2.1 as well as 1.3.0. I've also encountered it when using JSON files (instead of Parquet). Spark gets stuck on a step, hangs forever - jobs do not complete Key: SPARK-6962 URL: https://issues.apache.org/jira/browse/SPARK-6962 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Jon Chase Spark SQL queries (though this seems to be a Spark Core issue - I'm just using queries in the REPL to surface this, so I mention Spark SQL) hang indefinitely under certain (not totally understood) circumstances. This is resolved by setting spark.shuffle.blockTransferService=nio, which seems to point to netty as the issue. Netty was set as the default for the block transport layer in 1.2.0, which is when this issue started. Setting the service to nio allows queries to complete normally. 
I do not see this problem when running queries over smaller (~20 5MB files) datasets. When I increase the scope to include more data (several hundred ~5MB files), the queries will get through several steps but eventually hang indefinitely. Here's the email chain regarding this issue, including stack traces: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com For context, here's the announcement regarding the block transfer service change: http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
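The workaround named in the description, spelled out as a configuration sketch (the property and value are taken from the report above; effectiveness is per the reporter):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Fall back from the netty block transfer service (default since 1.2.0) to nio.
val conf = new SparkConf()
  .set("spark.shuffle.blockTransferService", "nio")
val sc = new SparkContext(conf)
{code}
The same setting can be passed on the command line, e.g. spark-shell --conf spark.shuffle.blockTransferService=nio.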
[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498185#comment-14498185 ] Reynold Xin commented on SPARK-6635:
Should we throw an exception if the name is identical, or just replace it?
DataFrame.withColumn can create columns with identical names
Key: SPARK-6635 URL: https://issues.apache.org/jira/browse/SPARK-6635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley
DataFrame lets you create multiple columns with the same name, which causes problems when you try to refer to columns by name. Proposal: If a column is added to a DataFrame that already has a column of the same name, then the new column should replace the old column.
{code}
scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
df: org.apache.spark.sql.DataFrame = [x: int]
scala> val df3 = df.withColumn("x", df("x") + 1)
df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
scala> df3.collect()
res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
scala> df3("x")
org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33) at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:35) at $iwC$$iwC$$iwC$$iwC.init(console:37) at $iwC$$iwC$$iwC.init(console:39) at $iwC$$iwC.init(console:41) at $iwC.init(console:43) at init(console:45) at .init(console:49) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at
[jira] [Updated] (SPARK-6675) HiveContext setConf is not stable
[ https://issues.apache.org/jira/browse/SPARK-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6675: -- Shepherd: Cheng Lian
HiveContext setConf is not stable
Key: SPARK-6675 URL: https://issues.apache.org/jira/browse/SPARK-6675 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: AWS ec2 xlarge2 cluster launched by spark's script Reporter: Hao Ren Priority: Critical
I find that HiveContext.setConf does not work correctly. Here are some code snippets showing the problem:
snippet 1:
{code}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object Main extends App {
  val conf = new SparkConf()
    .setAppName("context-test")
    .setMaster("local[8]")
  val sc = new SparkContext(conf)
  val hc = new HiveContext(sc)
  hc.setConf("spark.sql.shuffle.partitions", "10")
  hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
  hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
  hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println
}
{code}
Results: (hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test) (spark.sql.shuffle.partitions,10)
snippet 2:
{code}
...
hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
hc.setConf("spark.sql.shuffle.partitions", "10")
hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println
...
{code}
Results: (hive.metastore.warehouse.dir,/user/hive/warehouse) (spark.sql.shuffle.partitions,10)
You can see that I just permuted the two setConf calls, and that leads to two different Hive configurations. It seems that HiveContext cannot set a new value for the hive.metastore.warehouse.dir key in the first setConf call; you need another setConf call before changing hive.metastore.warehouse.dir. For example, setting hive.metastore.warehouse.dir twice (snippet 3) gives the snippet 1 result:
snippet 3:
{code}
...
hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
...
{code}
Results: (hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)
You can reproduce this if you move to the latest branch-1.3 (1.3.1-snapshot, htag = 7d029cb1eb6f1df1bce1a3f5784fb7ce2f981a33). I have also tested the released 1.3.0 (htag = 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc). It has the same problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6953) Speedup tests of PySpark, reduce logging
[ https://issues.apache.org/jira/browse/SPARK-6953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6953: - Component/s: PySpark
Speedup tests of PySpark, reduce logging
Key: SPARK-6953 URL: https://issues.apache.org/jira/browse/SPARK-6953 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Davies Liu
Right now, it takes about 30 minutes to complete the PySpark tests with Python 2.6, Python 3.4 and PyPy. It would be better to decrease that. Also, when running pyspark/tests.py, the logging is pretty scary (lots of exceptions); it would be nice to mute the exceptions when they are expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6962) Spark gets stuck on a step, hangs forever - jobs do not complete
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jon Chase updated SPARK-6962: - Attachment: jstacks.txt
Here are the stack dumps I took while Spark was hanging.
Spark gets stuck on a step, hangs forever - jobs do not complete
Key: SPARK-6962 URL: https://issues.apache.org/jira/browse/SPARK-6962 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Jon Chase Attachments: jstacks.txt
Spark SQL queries (though this seems to be a Spark Core issue - I'm just using queries in the REPL to surface this, so I mention Spark SQL) hang indefinitely under certain (not totally understood) circumstances. This is resolved by setting spark.shuffle.blockTransferService=nio, which seems to point to netty as the issue. Netty was set as the default for the block transport layer in 1.2.0, which is when this issue started. Setting the service to nio allows queries to complete normally. I do not see this problem when running queries over smaller (~20 5MB files) datasets. When I increase the scope to include more data (several hundred ~5MB files), the queries will get through several steps but eventually hang indefinitely. Here's the email chain regarding this issue, including stack traces: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com For context, here's the announcement regarding the block transfer service change: http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6963) Flaky test: o.a.s.ContextCleanerSuite automatically cleanup checkpoint
Andrew Or created SPARK-6963: Summary: Flaky test: o.a.s.ContextCleanerSuite automatically cleanup checkpoint Key: SPARK-6963 URL: https://issues.apache.org/jira/browse/SPARK-6963 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Guoqiang Li Priority: Critical {code} sbt.ForkMain$ForkError: fs.exists(org.apache.spark.rdd.RDDCheckpointData.rddCheckpointDataPath(ContextCleanerSuite.this.sc, rddId).get) was true at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.ContextCleanerSuite$$anonfun$9.apply$mcV$sp(ContextCleanerSuite.scala:252) at org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209) at org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.ContextCleanerSuiteBase.org$scalatest$BeforeAndAfter$$super$runTest(ContextCleanerSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.ContextCleanerSuiteBase.org$scalatest$BeforeAndAfterEach$$super$runTest(ContextCleanerSuite.scala:46) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) at org.apache.spark.ContextCleanerSuiteBase.runTest(ContextCleanerSuite.scala:46) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6963) Flaky test: o.a.s.ContextCleanerSuite automatically cleanup checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6963: - Description: Observed on an unrelated streaming PR https://github.com/apache/spark/pull/5428 https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30389/ {code} sbt.ForkMain$ForkError: fs.exists(org.apache.spark.rdd.RDDCheckpointData.rddCheckpointDataPath(ContextCleanerSuite.this.sc, rddId).get) was true at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.ContextCleanerSuite$$anonfun$9.apply$mcV$sp(ContextCleanerSuite.scala:252) at org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209) at org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.ContextCleanerSuiteBase.org$scalatest$BeforeAndAfter$$super$runTest(ContextCleanerSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.ContextCleanerSuiteBase.org$scalatest$BeforeAndAfterEach$$super$runTest(ContextCleanerSuite.scala:46) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) at org.apache.spark.ContextCleanerSuiteBase.runTest(ContextCleanerSuite.scala:46) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) {code} was: {code} sbt.ForkMain$ForkError: fs.exists(org.apache.spark.rdd.RDDCheckpointData.rddCheckpointDataPath(ContextCleanerSuite.this.sc, rddId).get) was true at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.ContextCleanerSuite$$anonfun$9.apply$mcV$sp(ContextCleanerSuite.scala:252) at org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209) at org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209) at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at
[jira] [Commented] (SPARK-6936) SQLContext.sql() caused deadlock in multi-thread env
[ https://issues.apache.org/jira/browse/SPARK-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498305#comment-14498305 ] Paul Wu commented on SPARK-6936:
You are right: I used spark-hive_2.10 instead of spark-hive_2.11 in my build. Sorry, I deleted my comment before reading yours. Thanks,
SQLContext.sql() caused deadlock in multi-thread env
Key: SPARK-6936 URL: https://issues.apache.org/jira/browse/SPARK-6936 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: JDK 1.8.x, RedHat Linux version 2.6.32-431.23.3.el6.x86_64 (mockbu...@x86-027.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014 Reporter: Paul Wu Labels: deadlock, sql, threading
Running the same query from more than one thread with SQLContext.sql may lead to deadlock. Here is a way to reproduce it (since this is a multi-threading issue, reproduction may or may not be easy): 1. Register a relatively big table. 2. Create two different classes; in each, run the same query in a method, put the results in a set, and print the set size. 3. Create two threads, each using an object of one class in its run method, and start the threads. In my tests, a deadlock can occur within just a few runs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
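A minimal sketch of the reproduction steps above, assuming an existing SparkContext sc; the file name, table name, and query are illustrative:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
sqlContext.jsonFile("big_table.json").registerTempTable("big_table")

// Steps 2-3: the same query issued concurrently through the shared SQLContext.
def runQuery(): Unit = {
  val rows = sqlContext.sql("SELECT * FROM big_table").collect()
  println(rows.toSet.size)
}

val threads = Seq.fill(2)(new Thread(new Runnable { def run(): Unit = runQuery() }))
threads.foreach(_.start())
threads.foreach(_.join()) // with the reported bug, this may never return
{code}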
[jira] [Updated] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-4897: -- Priority: Blocker (was: Minor)
Python 3 support
Key: SPARK-4897 URL: https://issues.apache.org/jira/browse/SPARK-4897 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh Rosen Assignee: Davies Liu Priority: Blocker
It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc. and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}:
{code}
[joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark
Python 3.4.2 (default, Oct 19 2014, 17:52:17)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, in <module>
    import pyspark
  File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, in <module>
    from pyspark import accumulators
  File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", line 97, in <module>
    from pyspark.cloudpickle import CloudPickler
  File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 120, in <module>
    class CloudPickler(pickle.Pickler):
  File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 122, in CloudPickler
    dispatch = pickle.Pickler.dispatch.copy()
AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch'
{code}
This code looks like it will be difficult to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6963) Flaky test: o.a.s.ContextCleanerSuite automatically cleanup checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6963: - Labels: flaky-test (was: ) Flaky test: o.a.s.ContextCleanerSuite automatically cleanup checkpoint -- Key: SPARK-6963 URL: https://issues.apache.org/jira/browse/SPARK-6963 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Guoqiang Li Priority: Critical Labels: flaky-test {code} sbt.ForkMain$ForkError: fs.exists(org.apache.spark.rdd.RDDCheckpointData.rddCheckpointDataPath(ContextCleanerSuite.this.sc, rddId).get) was true at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.ContextCleanerSuite$$anonfun$9.apply$mcV$sp(ContextCleanerSuite.scala:252) at org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209) at org.apache.spark.ContextCleanerSuite$$anonfun$9.apply(ContextCleanerSuite.scala:209) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.ContextCleanerSuiteBase.org$scalatest$BeforeAndAfter$$super$runTest(ContextCleanerSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.ContextCleanerSuiteBase.org$scalatest$BeforeAndAfterEach$$super$runTest(ContextCleanerSuite.scala:46) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) at org.apache.spark.ContextCleanerSuiteBase.runTest(ContextCleanerSuite.scala:46) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498329#comment-14498329 ] Joseph K. Bradley commented on SPARK-6635: --
[~rxin] Your select statement does highlight how the correct behavior is ambiguous. What is the best way to replace one column while leaving all others the same? That seems like a useful operation. (Right now, I can only think of getting all of the column names, removing the one being replaced, and using that list in a select statement.)
DataFrame.withColumn can create columns with identical names
Key: SPARK-6635 URL: https://issues.apache.org/jira/browse/SPARK-6635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley
DataFrame lets you create multiple columns with the same name, which causes problems when you try to refer to columns by name. Proposal: If a column is added to a DataFrame that already has a column of the same name, then the new column should replace the old column.
{code}
scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
df: org.apache.spark.sql.DataFrame = [x: int]
scala> val df3 = df.withColumn("x", df("x") + 1)
df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
scala> df3.collect()
res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
scala> df3("x")
org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33) at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:35) at $iwC$$iwC$$iwC$$iwC.init(console:37) at $iwC$$iwC$$iwC.init(console:39) at $iwC$$iwC.init(console:41) at $iwC.init(console:43) at init(console:45) at .init(console:49) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996) at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at
[jira] [Updated] (SPARK-4081) Categorical feature indexing
[ https://issues.apache.org/jira/browse/SPARK-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4081: - Description:
**Updated Description** Decision Trees and tree ensembles require that categorical features be indexed as 0, 1, 2, .... There is currently no code to aid with indexing a dataset. This is a proposal for a helper class for computing indices (and also deciding which features to treat as categorical). Proposed functionality: * This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter. * This can also map categorical feature values to 0-based indices. This is implemented in the spark.ml package for the Pipelines API, and it stores the indexes as column metadata.
was: DecisionTree and RandomForest require that categorical features and labels be indexed as 0, 1, 2, .... There is currently no code to aid with indexing a dataset. This is a proposal for a helper class for computing indices (and also deciding which features to treat as categorical). Proposed functionality: * This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter. * This can also map categorical feature values to 0-based indices. Usage:
{code}
val myData1: RDD[Vector] = ...
val myData2: RDD[Vector] = ...
val datasetIndexer = new DatasetIndexer(maxCategories)
datasetIndexer.fit(myData1)
val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
datasetIndexer.fit(myData2)
val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
val categoricalFeaturesInfo: Map[Double, Int] = datasetIndexer.getCategoricalFeatureIndexes()
{code}
Categorical feature indexing
Key: SPARK-4081 URL: https://issues.apache.org/jira/browse/SPARK-4081 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Fix For: 1.4.0
**Updated Description** Decision Trees and tree ensembles require that categorical features be indexed as 0, 1, 2, .... There is currently no code to aid with indexing a dataset. This is a proposal for a helper class for computing indices (and also deciding which features to treat as categorical). Proposed functionality: * This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter. * This can also map categorical feature values to 0-based indices. This is implemented in the spark.ml package for the Pipelines API, and it stores the indexes as column metadata. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
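For reference, a sketch of the spark.ml indexer this ticket describes (presumably VectorIndexer, the spark.ml feature indexer shipped in Spark 1.4; the column names and dataset are illustrative):
{code}
import org.apache.spark.ml.feature.VectorIndexer

val indexer = new VectorIndexer()
  .setInputCol("features")          // vector column with unknown feature types
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10)             // <= 10 distinct values => treated as categorical

val model = indexer.fit(data)       // decides categorical vs. continuous per feature
val indexed = model.transform(data) // categorical values mapped to 0-based indices
{code}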
[jira] [Commented] (SPARK-6936) SQLContext.sql() caused deadlock in multi-thread env
[ https://issues.apache.org/jira/browse/SPARK-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498247#comment-14498247 ] Paul Wu commented on SPARK-6936: Not sure about HiveContext. I tried to do the following program and I got exception (env: JDK 1.8/ Spark 1.3). Why did I get the error on HiveContext? -- exec-maven-plugin:1.2.1:exec (default-cli) @ Spark-Sample --- Exception in thread main java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object; at org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:574) at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:245) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:137) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:197) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:881) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138) at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:137) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:197) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249) at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:881) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:234) at org.apache.spark.sql.hive.HiveContext$$anonfun$sql$1.apply(HiveContext.scala:92) at org.apache.spark.sql.hive.HiveContext$$anonfun$sql$1.apply(HiveContext.scala:92) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:92) at
[jira] [Commented] (SPARK-6936) SQLContext.sql() caused deadlock in multi-thread env
[ https://issues.apache.org/jira/browse/SPARK-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498258#comment-14498258 ] Michael Armbrust commented on SPARK-6936: -
NoSuchMethodError almost always means you are mixing incompatible versions of libraries (in this case probably Scala?) on your classpath.
SQLContext.sql() caused deadlock in multi-thread env
Key: SPARK-6936 URL: https://issues.apache.org/jira/browse/SPARK-6936 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: JDK 1.8.x, RedHat Linux version 2.6.32-431.23.3.el6.x86_64 (mockbu...@x86-027.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014 Reporter: Paul Wu Labels: deadlock, sql, threading
Running the same query from more than one thread with SQLContext.sql may lead to deadlock. Here is a way to reproduce it (since this is a multi-threading issue, reproduction may or may not be easy): 1. Register a relatively big table. 2. Create two different classes; in each, run the same query in a method, put the results in a set, and print the set size. 3. Create two threads, each using an object of one class in its run method, and start the threads. In my tests, a deadlock can occur within just a few runs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
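A build-side sketch of the fix implied by this exchange: in sbt, %% appends the project's Scala binary version to artifact names, which prevents mixing spark-hive_2.10 with _2.11 jars (the versions shown are illustrative):
{code}
scalaVersion := "2.11.6"

libraryDependencies ++= Seq(
  // %% resolves to spark-sql_2.11 / spark-hive_2.11 under a 2.11.x scalaVersion
  "org.apache.spark" %% "spark-sql"  % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.3.0" % "provided"
)
{code}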
[jira] [Commented] (SPARK-6923) Get invalid hive table columns after save DataFrame to hive table
[ https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498271#comment-14498271 ] Michael Armbrust commented on SPARK-6923: -
Only Spark 1.3 has the ability to read tables that are created with the datasource API.
Get invalid hive table columns after save DataFrame to hive table
Key: SPARK-6923 URL: https://issues.apache.org/jira/browse/SPARK-6923 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: pin_zhang
{code}
HiveContext hctx = new HiveContext(sc);
List<String> sample = new ArrayList<String>();
sample.add("{\"id\": \"id_1\", \"age\": 1}");
RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
DataFrame df = hctx.jsonRDD(sampleRDD);
String table = "test";
df.saveAsTable(table, "json", SaveMode.Overwrite);
Table t = hctx.catalog().client().getTable(table);
System.out.println(t.getCols());
{code}
With the code above to save a DataFrame to a hive table, getting the table cols returns one column named 'col': [FieldSchema(name:col, type:array<string>, comment:from deserializer)] The expected fields are id and age. As a result, the JDBC API cannot retrieve the table columns via ResultSet DatabaseMetaData.getColumns(String catalog, String schemaPattern, String tableNamePattern, String columnNamePattern). However, the ResultSet metadata for the query select * from test does contain the fields id and age. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
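A sketch of one way to read the real schema back (assuming the test table saved above): tables written through the datasource API keep their schema in table properties, so query it through Spark SQL rather than the raw Hive metastore column list:
{code}
// Scala sketch; hctx is the HiveContext from the snippet above.
val t = hctx.table("test")
t.schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))
// prints id and age with the types inferred from the JSON
{code}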
[jira] [Issue Comment Deleted] (SPARK-6936) SQLContext.sql() caused deadlock in multi-thread env
[ https://issues.apache.org/jira/browse/SPARK-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Wu updated SPARK-6936: --- Comment: was deleted (was: Not sure about HiveContext. I tried to do the following program and I got exception (env: JDK 1.8/ Spark 1.3). Why did I get the error on HiveContext? -- exec-maven-plugin:1.2.1:exec (default-cli) @ Spark-Sample --- Exception in thread main java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object; at org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:574) at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:245) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:137) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:197) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:881) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138) at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:137) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:197) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249) at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:881) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:234) at org.apache.spark.sql.hive.HiveContext$$anonfun$sql$1.apply(HiveContext.scala:92) at org.apache.spark.sql.hive.HiveContext$$anonfun$sql$1.apply(HiveContext.scala:92) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:92) at
[jira] [Commented] (SPARK-6960) Hardcoded scala version in multiple files. SparkPluginBuild.scala, as well as docs/ with self-deprecating comment.
[ https://issues.apache.org/jira/browse/SPARK-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498057#comment-14498057 ] Sean Owen commented on SPARK-6960: --
Michah, the build doesn't work at all for 2.11 unless you use the script to translate all the hardcoded occurrences. Did you apply this script? It should be under dev. The docs plugins probably don't matter, as there is no separate docs build for 2.11.
Hardcoded scala version in multiple files. SparkPluginBuild.scala, as well as docs/ with self-deprecating comment.
Key: SPARK-6960 URL: https://issues.apache.org/jira/browse/SPARK-6960 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.0 Reporter: Michah Lerner Priority: Minor Fix For: 1.3.1
Hard-coded Scala version 2.10 in multiple files, including: docs/_plugins/copy_api_dirs.rb, docs/_config.yml and spark-1.3.0/project/project/SparkPluginBuild.scala. [error] /path/spark-1.3.0/project/project/SparkPluginBuild.scala:33: can't expand macros compiled by previous versions of Scala Provided the Scala version is manually edited, the following generally builds successfully, except when the -Pnetlib-lgpl option is added to the mvn build. mvn -Dscala-2.11 -Pyarn -Phadoop-2.4 -DskipTests clean package sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 unidoc cd docs jekyll b -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
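For what it's worth, the script Sean refers to is presumably dev/change-version-to-2.11.sh (name assumed from the Spark 1.3 source tree); the 2.11 build would then look like:
{code}
# Rewrite the hardcoded _2.10 artifact ids before building for Scala 2.11:
./dev/change-version-to-2.11.sh
mvn -Dscala-2.11 -Pyarn -Phadoop-2.4 -DskipTests clean package
{code}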
[jira] [Updated] (SPARK-6960) Hardcoded scala version in multiple files. SparkPluginBuild.scala, as well as docs/ with self-deprecating comment.
[ https://issues.apache.org/jira/browse/SPARK-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6960: - Target Version/s: (was: 1.3.1) Fix Version/s: (was: 1.3.1)
Don't set fix or target version; 1.3.1 is out the door anyway.
Hardcoded scala version in multiple files. SparkPluginBuild.scala, as well as docs/ with self-deprecating comment.
Key: SPARK-6960 URL: https://issues.apache.org/jira/browse/SPARK-6960 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.0 Reporter: Michah Lerner Priority: Minor
Hard-coded Scala version 2.10 in multiple files, including: docs/_plugins/copy_api_dirs.rb, docs/_config.yml and spark-1.3.0/project/project/SparkPluginBuild.scala. [error] /path/spark-1.3.0/project/project/SparkPluginBuild.scala:33: can't expand macros compiled by previous versions of Scala Provided the Scala version is manually edited, the following generally builds successfully, except when the -Pnetlib-lgpl option is added to the mvn build. mvn -Dscala-2.11 -Pyarn -Phadoop-2.4 -DskipTests clean package sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 unidoc cd docs jekyll b -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6964) Support Cancellation in the Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6964: Description: There is already a hook in {{ExecuteStatementOperation}}; we just need to connect it to the job group cancellation support we already have and make sure the various drivers support it. (was: There is already a hook in ) Support Cancellation in the Thrift Server - Key: SPARK-6964 URL: https://issues.apache.org/jira/browse/SPARK-6964 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Priority: Critical There is already a hook in {{ExecuteStatementOperation}}; we just need to connect it to the job group cancellation support we already have and make sure the various drivers support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
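A hedged sketch of what "connecting the hook" could look like, using the public SparkContext job-group API (the names and structure here are illustrative, not the actual patch):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Illustrative wiring: run each statement under its own job group so
// that a cancel request can abort exactly that statement's jobs.
val sc = new SparkContext(new SparkConf().setAppName("ThriftCancelSketch"))
val hiveContext = new HiveContext(sc)
val statementId = java.util.UUID.randomUUID().toString

def runStatement(statement: String): Unit = {
  sc.setJobGroup(statementId, statement)   // public SparkContext API
  hiveContext.sql(statement).collect()
}

// ExecuteStatementOperation.cancel() would then boil down to:
def cancel(): Unit = sc.cancelJobGroup(statementId)
{code}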
[jira] [Updated] (SPARK-6955) Do not let Yarn Shuffle Server retry its server port.
[ https://issues.apache.org/jira/browse/SPARK-6955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6955: - Affects Version/s: 1.2.0 Do not let Yarn Shuffle Server retry its server port. - Key: SPARK-6955 URL: https://issues.apache.org/jira/browse/SPARK-6955 Project: Spark Issue Type: Bug Components: Shuffle, YARN Affects Versions: 1.2.0 Reporter: SaintBacchus Priority: Minor It is better to let the NodeManager go down than to retry on another port when `spark.shuffle.service.port` conflicts while the Spark YARN shuffle server is starting: the retry mechanism leaves the shuffle port inconsistent across nodes, so clients fail to find the port. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
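For context, a sketch of the contract the port retry breaks: executors locate the external shuffle service purely by the configured port, so the server side must bind exactly that port or fail fast (hedged example; 7337 is the documented default, the values are illustrative):
{code}
import org.apache.spark.SparkConf

// Client side: executors register with the external shuffle service at this
// port. If the NodeManager aux service silently retried onto another port,
// this lookup would fail.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.port", "7337")
{code}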
[jira] [Updated] (SPARK-6955) Do not let Yarn Shuffle Server retry its server port.
[ https://issues.apache.org/jira/browse/SPARK-6955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6955: - Assignee: SaintBacchus Do not let Yarn Shuffle Server retry its server port. - Key: SPARK-6955 URL: https://issues.apache.org/jira/browse/SPARK-6955 Project: Spark Issue Type: Bug Components: Shuffle, YARN Affects Versions: 1.2.0 Reporter: SaintBacchus Assignee: SaintBacchus Priority: Minor It is better to let the NodeManager go down than to retry on another port when `spark.shuffle.service.port` conflicts while the Spark YARN shuffle server is starting: the retry mechanism leaves the shuffle port inconsistent across nodes, so clients fail to find the port. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed
[ https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498390#comment-14498390 ] Matt Cheah edited comment on SPARK-6950 at 4/16/15 5:57 PM: There's one way I could reproduce this locally, but I can't confirm if this is what is happening in production. I had to force a race condition to occur for it. Basically, sometimes the Spark Master can attempt to rebuild the UI before the event logging listener renames the event log file to remove the .inprogress extension. If the Spark Master reaches that point before the event logging listener renames the file, it will never check again whether the file has been renamed, and will never build the UI. The SparkContext.stop() method requests the eventLogger to stop, which may not execute until after the Master has called rebuildSparkUi() for the completed application. However, I think the fix to SPARK-6107 will inadvertently also solve this issue. I'll try applying that patch. was (Author: mcheah): There's one way I could reproduce this locally, but I can't confirm if this is what is happening in production. I had to force a race condition to occur for it. Basically, sometimes the Spark Master can attempt to rebuild the UI before the event logging listener renames the event log file to remove the .inprogress extension. If the Spark Master reaches that point before the event logging listener renames the file, it will never check again whether the file has been renamed, and will never build the UI. The SparkContext.stop() method requests the eventLogger to stop, which may not execute until after the Master has called rebuildSparkUi() for the completed application. However, I think the fix to SPARK-6107 will inadvertently also solve this issue. Spark master UI believes some applications are in progress when they are actually completed --- Key: SPARK-6950 URL: https://issues.apache.org/jira/browse/SPARK-6950 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Matt Cheah In Spark 1.2.x, I was able to set my spark event log directory to be a different location from the default, and after the job finishes, I can replay the UI by clicking on the appropriate link under Completed Applications. Now, on a non-deterministic basis (but seems to happen most of the time), when I click on the link under Completed Applications, I instead get a webpage that says: Application history not found (app-20150415052927-0014) Application myApp is still in progress. I am able to view the application's UI using the Spark history server, so something regressed in the Spark master code between 1.2 and 1.3, but that regression does not apply in the history server use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed
[ https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498390#comment-14498390 ] Matt Cheah commented on SPARK-6950: --- There's one way I could reproduce this locally, but I can't confirm if this is what is happening in production. I had to force a race condition to occur for it. Basically, sometimes the Spark Master can attempt to rebuild the UI before the event logging listener renames the event log file to remove the .inprogress extension. If the Spark Master reaches that point before the event logging listener renames the file, it will never check again whether the file has been renamed, and will never build the UI. The SparkContext.stop() method requests the eventLogger to stop, which may not execute until after the Master has called rebuildSparkUi() for the completed application. However, I think the fix to SPARK-6107 will inadvertently also solve this issue. Spark master UI believes some applications are in progress when they are actually completed --- Key: SPARK-6950 URL: https://issues.apache.org/jira/browse/SPARK-6950 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Matt Cheah In Spark 1.2.x, I was able to set my spark event log directory to be a different location from the default, and after the job finishes, I can replay the UI by clicking on the appropriate link under Completed Applications. Now, on a non-deterministic basis (but seems to happen most of the time), when I click on the link under Completed Applications, I instead get a webpage that says: Application history not found (app-20150415052927-0014) Application myApp is still in progress. I am able to view the application's UI using the Spark history server, so something regressed in the Spark master code between 1.2 and 1.3, but that regression does not apply in the history server use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6964) Support Cancellation in the Thrift Server
Michael Armbrust created SPARK-6964: --- Summary: Support Cancellation in the Thrift Server Key: SPARK-6964 URL: https://issues.apache.org/jira/browse/SPARK-6964 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1442) Add Window function support
[ https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1442: Priority: Blocker (was: Critical) Add Window function support --- Key: SPARK-1442 URL: https://issues.apache.org/jira/browse/SPARK-1442 Project: Spark Issue Type: New Feature Components: SQL Reporter: Chengxiang Li Priority: Blocker Attachments: Window Function.pdf Similar to Hive, add window function support for Catalyst. https://issues.apache.org/jira/browse/HIVE-4197 https://issues.apache.org/jira/browse/HIVE-896 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
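For concreteness, this is the kind of Hive-style query the feature would enable (a hedged illustration; the table and columns are made up, and it assumes an existing sqlContext once the feature lands):
{code}
// Illustrative only: "employees", "dept", "salary" are made-up names.
sqlContext.sql("""
  SELECT name, dept, salary,
         rank() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
  FROM employees
""").show()
{code}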
[jira] [Updated] (SPARK-6964) Support Cancellation in the Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6964: Target Version/s: 1.4.0 Support Cancellation in the Thrift Server - Key: SPARK-6964 URL: https://issues.apache.org/jira/browse/SPARK-6964 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6964) Support Cancellation in the Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6964: Description: There is already a hook in Support Cancellation in the Thrift Server - Key: SPARK-6964 URL: https://issues.apache.org/jira/browse/SPARK-6964 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Priority: Critical There is already a hook in -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6965) StringIndexer should convert input to Strings
Joseph K. Bradley created SPARK-6965: Summary: StringIndexer should convert input to Strings Key: SPARK-6965 URL: https://issues.apache.org/jira/browse/SPARK-6965 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor StringIndexer should convert non-String input types to String. That way, it can handle any basic type such as Int, Double, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
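A hedged sketch of what the proposed behavior would look like from the user's side (the column names and data are made up; today the Int column below would need an explicit cast to String before indexing, which this ticket would remove):
{code}
import org.apache.spark.ml.feature.StringIndexer

// "id" and "code" are illustrative column names.
val df = sqlContext.createDataFrame(Seq((0, 12), (1, 45), (2, 12))).toDF("id", "code")

val indexer = new StringIndexer()
  .setInputCol("code")      // an Int column, indexed as "12", "45", ... under the proposal
  .setOutputCol("codeIndex")

val indexed = indexer.fit(df).transform(df)
{code}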
[jira] [Updated] (SPARK-6955) Do not let Yarn Shuffle Server retry its server port.
[ https://issues.apache.org/jira/browse/SPARK-6955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6955: - Target Version/s: 1.4.0 Do not let Yarn Shuffle Server retry its server port. - Key: SPARK-6955 URL: https://issues.apache.org/jira/browse/SPARK-6955 Project: Spark Issue Type: Bug Components: Shuffle, YARN Affects Versions: 1.2.0 Reporter: SaintBacchus Priority: Minor It is better to let the NodeManager go down than to retry on another port when `spark.shuffle.service.port` conflicts while the Spark YARN shuffle server is starting: the retry mechanism leaves the shuffle port inconsistent across nodes, so clients fail to find the port. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2734) DROP TABLE should also uncache table
[ https://issues.apache.org/jira/browse/SPARK-2734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498403#comment-14498403 ] Michael Armbrust commented on SPARK-2734: - How do you know it is occurring? What queries are you running? DROP TABLE should also uncache table Key: SPARK-2734 URL: https://issues.apache.org/jira/browse/SPARK-2734 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Fix For: 1.1.0 Steps to reproduce: {code} hql("CREATE TABLE test(a INT)") hql("CACHE TABLE test") hql("DROP TABLE test") hql("SELECT * FROM test") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6940) PySpark ML.Tuning Wrappers are missing
[ https://issues.apache.org/jira/browse/SPARK-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498663#comment-14498663 ] Joseph K. Bradley commented on SPARK-6940: -- [~omede] Can you please coordinate with [~punya], who opened a similar ticket? As far as conflicting JIRAs/PRs go, you should primarily be aware of [SPARK-5874], which [~mengxr] is working on. PySpark ML.Tuning Wrappers are missing -- Key: SPARK-6940 URL: https://issues.apache.org/jira/browse/SPARK-6940 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.3.0 Reporter: Omede Firouz PySpark doesn't currently have wrappers for any of the ML.Tuning classes: CrossValidator, CrossValidatorModel, ParamGridBuilder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
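For reference, the Scala side of ml.tuning that the missing PySpark wrappers would mirror (a hedged sketch; {{training}} is an assumed DataFrame with "features" and "label" columns):
{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Build a small parameter grid and cross-validate over it.
val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

val cvModel = cv.fit(training) // training: assumed DataFrame
{code}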
[jira] [Comment Edited] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed
[ https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498555#comment-14498555 ] Matt Cheah edited comment on SPARK-6950 at 4/16/15 7:32 PM: This is no longer an issue on the tip of branch-1.3, as: (1) Fixing SPARK-6107 would make any .inprogress event logs be parsed, and (2) I believe somewhere between the released 1.3 and the current 1.3, the event logging listener in the SparkContext was forced to stop before the Spark master UI was notified that the application completed. I'll close this out for now but will re-open it if I continue to see the issue after 1.3.1 is released and I update to it. was (Author: mcheah): This is no longer an issue on the tip of branch-1.3, as: (1) Fixing SPARK-6107 would make any .inprogress event logs be parsed, and (2) I believe somewhere between the released 1.3 and the current 1.3, the event log stopping was forced to stop before the Spark master UI was notified that the application completed. I'll close this out for now but will re-open it if I continue to see the issue after 1.3.1 is released and I update to it. Spark master UI believes some applications are in progress when they are actually completed --- Key: SPARK-6950 URL: https://issues.apache.org/jira/browse/SPARK-6950 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Matt Cheah Fix For: 1.3.1 In Spark 1.2.x, I was able to set my spark event log directory to be a different location from the default, and after the job finishes, I can replay the UI by clicking on the appropriate link under Completed Applications. Now, on a non-deterministic basis (but seems to happen most of the time), when I click on the link under Completed Applications, I instead get a webpage that says: Application history not found (app-20150415052927-0014) Application myApp is still in progress. I am able to view the application's UI using the Spark history server, so something regressed in the Spark master code between 1.2 and 1.3, but that regression does not apply in the history server use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6947) Make ml.tuning accessible from Python API
[ https://issues.apache.org/jira/browse/SPARK-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6947: - Fix Version/s: (was: SPARK-6940) Make ml.tuning accessible from Python API - Key: SPARK-6947 URL: https://issues.apache.org/jira/browse/SPARK-6947 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.3.0 Reporter: Punya Biswal {{CrossValidator}} and {{ParamGridBuilder}} should be available for use in PySpark-based ML pipelines. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6844) Memory leak occurs when register temp table with cache table on
[ https://issues.apache.org/jira/browse/SPARK-6844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498587#comment-14498587 ] Michael Armbrust commented on SPARK-6844: - I was not planning to. I do not think that it is a regression from 1.2, and it is a little risky to backport changes to the way we initialize cached relations. Memory leak occurs when register temp table with cache table on --- Key: SPARK-6844 URL: https://issues.apache.org/jira/browse/SPARK-6844 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Jack Hu Labels: Memory, SQL Fix For: 1.4.0 There is a memory leak when registering a temp table with caching on. This simple code reproduces the issue: {code} val sparkConf = new SparkConf().setAppName("LeakTest") val sparkContext = new SparkContext(sparkConf) val sqlContext = new SQLContext(sparkContext) val tableName = "tmp" val jsonrdd = sparkContext.textFile("sample.json") var loopCount = 1L while(true) { sqlContext.jsonRDD(jsonrdd).registerTempTable(tableName) sqlContext.cacheTable(tableName) println("L: " + loopCount + " R: " + sqlContext.sql("select count(*) from tmp").count()) sqlContext.uncacheTable(tableName) loopCount += 1 } {code} The cause is that {{InMemoryRelation}} and {{InMemoryColumnarTableScan}} use accumulators ({{InMemoryRelation.batchStats}}, {{InMemoryColumnarTableScan.readPartitions}}, {{InMemoryColumnarTableScan.readBatches}}) to get some information from partitions or for tests. These accumulators register themselves into a static map in {{Accumulators.originals}} and never get cleaned up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
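A simplified illustration of the leak mechanism described in the report (a sketch only, not the actual Spark source):
{code}
// Every accumulator registers itself in a JVM-global map when it is created.
// Nothing removes the entry when the cached relation built on top of it is
// uncached, so each cacheTable/uncacheTable cycle strands a few more entries.
object Accumulators {
  val originals = scala.collection.mutable.Map[Long, Any]() // grows without bound
}
{code}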
[jira] [Resolved] (SPARK-6857) Python SQL schema inference should support numpy types
[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-6857. -- Resolution: Not A Problem Python SQL schema inference should support numpy types -- Key: SPARK-6857 URL: https://issues.apache.org/jira/browse/SPARK-6857 Project: Spark Issue Type: Improvement Components: MLlib, PySpark, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley **UPDATE**: Closing this JIRA since a better fix will be better UDT support. See discussion in comments. If you try to use SQL's schema inference to create a DataFrame out of a list or RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy types. It would be handy if it did. E.g.: {code} import numpy from collections import namedtuple from pyspark.sql import SQLContext MyType = namedtuple('MyType', 'x') myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10)) sqlContext = SQLContext(sc) data = sqlContext.createDataFrame(myValues) {code} The above code fails with: {code} Traceback (most recent call last): File stdin, line 1, in module File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in createDataFrame return self.inferSchema(data, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in inferSchema schema = self._inferSchema(rdd, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in _inferSchema schema = _infer_schema(first) File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in _infer_schema fields = [StructField(k, _infer_type(v), True) for k, v in items] File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in _infer_type raise ValueError(not supported type: %s % type(obj)) ValueError: not supported type: type 'numpy.int64' {code} But if we cast to int (not numpy types) first, it's OK: {code} myNativeValues = map(lambda x: MyType(int(x.x)), myValues) data = sqlContext.createDataFrame(myNativeValues) # OK {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6857) Python SQL schema inference should support numpy types
[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6857: - Description: **UPDATE**: Closing this JIRA since a better fix will be better UDT support. See discussion in comments. If you try to use SQL's schema inference to create a DataFrame out of a list or RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy types. It would be handy if it did. E.g.: {code} import numpy from collections import namedtuple from pyspark.sql import SQLContext MyType = namedtuple('MyType', 'x') myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10)) sqlContext = SQLContext(sc) data = sqlContext.createDataFrame(myValues) {code} The above code fails with: {code} Traceback (most recent call last): File stdin, line 1, in module File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in createDataFrame return self.inferSchema(data, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in inferSchema schema = self._inferSchema(rdd, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in _inferSchema schema = _infer_schema(first) File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in _infer_schema fields = [StructField(k, _infer_type(v), True) for k, v in items] File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in _infer_type raise ValueError(not supported type: %s % type(obj)) ValueError: not supported type: type 'numpy.int64' {code} But if we cast to int (not numpy types) first, it's OK: {code} myNativeValues = map(lambda x: MyType(int(x.x)), myValues) data = sqlContext.createDataFrame(myNativeValues) # OK {code} was: If you try to use SQL's schema inference to create a DataFrame out of a list or RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy types. It would be handy if it did. E.g.: {code} import numpy from collections import namedtuple from pyspark.sql import SQLContext MyType = namedtuple('MyType', 'x') myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10)) sqlContext = SQLContext(sc) data = sqlContext.createDataFrame(myValues) {code} The above code fails with: {code} Traceback (most recent call last): File stdin, line 1, in module File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in createDataFrame return self.inferSchema(data, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in inferSchema schema = self._inferSchema(rdd, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in _inferSchema schema = _infer_schema(first) File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in _infer_schema fields = [StructField(k, _infer_type(v), True) for k, v in items] File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in _infer_type raise ValueError(not supported type: %s % type(obj)) ValueError: not supported type: type 'numpy.int64' {code} But if we cast to int (not numpy types) first, it's OK: {code} myNativeValues = map(lambda x: MyType(int(x.x)), myValues) data = sqlContext.createDataFrame(myNativeValues) # OK {code} Python SQL schema inference should support numpy types -- Key: SPARK-6857 URL: https://issues.apache.org/jira/browse/SPARK-6857 Project: Spark Issue Type: Improvement Components: MLlib, PySpark, SQL Affects Versions: 1.3.0 Reporter: Joseph K. 
Bradley **UPDATE**: Closing this JIRA since a better fix will be better UDT support. See discussion in comments. If you try to use SQL's schema inference to create a DataFrame out of a list or RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy types. It would be handy if it did. E.g.: {code} import numpy from collections import namedtuple from pyspark.sql import SQLContext MyType = namedtuple('MyType', 'x') myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10)) sqlContext = SQLContext(sc) data = sqlContext.createDataFrame(myValues) {code} The above code fails with: {code} Traceback (most recent call last): File stdin, line 1, in module File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in createDataFrame return self.inferSchema(data, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in inferSchema schema = self._inferSchema(rdd, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in _inferSchema schema = _infer_schema(first) File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in _infer_schema fields = [StructField(k,
[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498725#comment-14498725 ] Reynold Xin commented on SPARK-6635: cc [~marmbrus] to chime in. Thinking about it more, withColumn should probably overwrite an existing column (or maybe take an argument to control the behavior?). However, we might also want a broader discussion about column names. DataFrame.withColumn can create columns with identical names Key: SPARK-6635 URL: https://issues.apache.org/jira/browse/SPARK-6635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley DataFrame lets you create multiple columns with the same name, which causes problems when you try to refer to columns by name. Proposal: If a column is added to a DataFrame with a column of the same name, then the new column should replace the old column. {code} scala> val df = sc.parallelize(Array(1,2,3)).toDF("x") df: org.apache.spark.sql.DataFrame = [x: int] scala> val df3 = df.withColumn("x", df("x") + 1) df3: org.apache.spark.sql.DataFrame = [x: int, x: int] scala> df3.collect() res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4]) scala> df3("x") org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33) at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:35) at $iwC$$iwC$$iwC$$iwC.init(console:37) at $iwC$$iwC$$iwC.init(console:39) at $iwC$$iwC.init(console:41) at $iwC.init(console:43) at init(console:45) at .init(console:49) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at
[jira] [Created] (SPARK-6970) Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable
John Muller created SPARK-6970: -- Summary: Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable Key: SPARK-6970 URL: https://issues.apache.org/jira/browse/SPARK-6970 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 1.3.0 Reporter: John Muller The save options on DataFrames are not easily discerned: [ResolvedDataSource.apply|https://github.com/apache/spark/blob/b75b3070740803480d235b0c9a86673721344f30/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala#L222] is where the pattern match occurs: {code:title=ddl.scala|borderStyle=solid} case dataSource: SchemaRelationProvider => dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options), schema) {code} Implementing classes are currently: TableScanSuite, JSONRelation, and newParquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
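A hedged example of the 1.3-era call this ticket wants documented; {{df}} is an assumed DataFrame, and the option key shown ("path", read by the built-in json source named above) is source-specific and purely illustrative:
{code}
import org.apache.spark.sql.SaveMode

// The options map is passed through to the data source's createRelation;
// which keys are honored depends entirely on the source implementation.
df.save("json", SaveMode.Append, Map("path" -> "/tmp/out"))
{code}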
[jira] [Updated] (SPARK-5427) Add support for floor function in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-5427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-5427: -- Description: floor() function is supported in Hive SQL. This issue is to add floor() function to Spark SQL. Related thread: http://search-hadoop.com/m/JW1q563fc22 was: floor() function is supported in Hive SQL. This issue is to add floor() function to Spark SQL. Related thread: http://search-hadoop.com/m/JW1q563fc22 Add support for floor function in Spark SQL --- Key: SPARK-5427 URL: https://issues.apache.org/jira/browse/SPARK-5427 Project: Spark Issue Type: Improvement Components: SQL Reporter: Ted Yu Labels: math floor() function is supported in Hive SQL. This issue is to add floor() function to Spark SQL. Related thread: http://search-hadoop.com/m/JW1q563fc22 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
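Illustrative usage once floor() is supported (hedged; the table and column names are made up, mirroring the Hive built-in):
{code}
// "products" and "price" are illustrative names.
sqlContext.sql("SELECT floor(price) FROM products").collect()
{code}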
[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types
[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498593#comment-14498593 ] Davies Liu commented on SPARK-6857: --- It's not good that we use array or numpy.array as part of the API; we cannot change that right now. I'd suggest using Vector as part of the API in ml, and supporting easy and fast conversion from/to numpy.array. numpy/scipy is only useful for mllib/ml; it's better to keep them out of the scope of SQL. Python SQL schema inference should support numpy types -- Key: SPARK-6857 URL: https://issues.apache.org/jira/browse/SPARK-6857 Project: Spark Issue Type: Improvement Components: MLlib, PySpark, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley **UPDATE**: Closing this JIRA since a better fix will be better UDT support. See discussion in comments. If you try to use SQL's schema inference to create a DataFrame out of a list or RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy types. It would be handy if it did. E.g.: {code} import numpy from collections import namedtuple from pyspark.sql import SQLContext MyType = namedtuple('MyType', 'x') myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10)) sqlContext = SQLContext(sc) data = sqlContext.createDataFrame(myValues) {code} The above code fails with: {code} Traceback (most recent call last): File stdin, line 1, in module File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in createDataFrame return self.inferSchema(data, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in inferSchema schema = self._inferSchema(rdd, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in _inferSchema schema = _infer_schema(first) File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in _infer_schema fields = [StructField(k, _infer_type(v), True) for k, v in items] File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in _infer_type raise ValueError(not supported type: %s % type(obj)) ValueError: not supported type: type 'numpy.int64' {code} But if we cast to int (not numpy types) first, it's OK: {code} myNativeValues = map(lambda x: MyType(int(x.x)), myValues) data = sqlContext.createDataFrame(myNativeValues) # OK {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498736#comment-14498736 ] Joseph K. Bradley commented on SPARK-6635: -- Btw, when I say that seems like a useful operation, I really mean that I wanted to do that for MLlib. DataFrame.withColumn can create columns with identical names Key: SPARK-6635 URL: https://issues.apache.org/jira/browse/SPARK-6635 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley DataFrame lets you create multiple columns with the same name, which causes problems when you try to refer to columns by name. Proposal: If a column is added to a DataFrame with a column of the same name, then the new column should replace the old column. {code} scala> val df = sc.parallelize(Array(1,2,3)).toDF("x") df: org.apache.spark.sql.DataFrame = [x: int] scala> val df3 = df.withColumn("x", df("x") + 1) df3: org.apache.spark.sql.DataFrame = [x: int, x: int] scala> df3.collect() res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4]) scala> df3("x") org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: x, x.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33) at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:35) at $iwC$$iwC$$iwC$$iwC.init(console:37) at $iwC$$iwC$$iwC.init(console:39) at $iwC$$iwC.init(console:41) at $iwC.init(console:43) at init(console:45) at .init(console:49) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at
[jira] [Comment Edited] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed
[ https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498555#comment-14498555 ] Matt Cheah edited comment on SPARK-6950 at 4/16/15 7:31 PM: This is no longer an issue on the tip of branch-1.3, as: (1) Fixing SPARK-6107 would make any .inprogress event logs be parsed, and (2) I believe somewhere between the released 1.3 and the current 1.3, the event log stopping was forced to stop before the Spark master UI was notified that the application completed. I'll close this out for now but will re-open it if I continue to see the issue after 1.3.1 is released and I update to it. was (Author: mcheah): This is no longer an issue on the tip of branch-1.3, as: (1) Fixing SPARK-6107 would make any .inprogress event logs be parsed, and (2) I believe somewhere between the released 1.3 and the current 1.3, the event log stopping was forced to stop before the Spark master UI was notified that the application completed. Spark master UI believes some applications are in progress when they are actually completed --- Key: SPARK-6950 URL: https://issues.apache.org/jira/browse/SPARK-6950 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Matt Cheah Fix For: 1.3.1 In Spark 1.2.x, I was able to set my spark event log directory to be a different location from the default, and after the job finishes, I can replay the UI by clicking on the appropriate link under Completed Applications. Now, on a non-deterministic basis (but seems to happen most of the time), when I click on the link under Completed Applications, I instead get a webpage that says: Application history not found (app-20150415052927-0014) Application myApp is still in progress. I am able to view the application's UI using the Spark history server, so something regressed in the Spark master code between 1.2 and 1.3, but that regression does not apply in the history server use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed
[ https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah resolved SPARK-6950. --- Resolution: Cannot Reproduce Fix Version/s: 1.3.1 Spark master UI believes some applications are in progress when they are actually completed --- Key: SPARK-6950 URL: https://issues.apache.org/jira/browse/SPARK-6950 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Matt Cheah Fix For: 1.3.1 In Spark 1.2.x, I was able to set my spark event log directory to be a different location from the default, and after the job finishes, I can replay the UI by clicking on the appropriate link under Completed Applications. Now, on a non-deterministic basis (but seems to happen most of the time), when I click on the link under Completed Applications, I instead get a webpage that says: Application history not found (app-20150415052927-0014) Application myApp is still in progress. I am able to view the application's UI using the Spark history server, so something regressed in the Spark master code between 1.2 and 1.3, but that regression does not apply in the history server use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6962) Netty BlockTransferService hangs in the middle of SQL query
[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-6962: -- Summary: Netty BlockTransferService hangs in the middle of SQL query (was: Spark gets stuck on a step, hangs forever - jobs do not complete) Netty BlockTransferService hangs in the middle of SQL query --- Key: SPARK-6962 URL: https://issues.apache.org/jira/browse/SPARK-6962 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Jon Chase Attachments: jstacks.txt Spark SQL queries (though this seems to be a Spark Core issue - I'm just using queries in the REPL to surface this, so I mention Spark SQL) hang indefinitely under certain (not totally understood) circumstances. This is resolved by setting spark.shuffle.blockTransferService=nio, which seems to point to netty as the issue. Netty was set as the default for the block transport layer in 1.2.0, which is when this issue started. Setting the service to nio allows queries to complete normally. I do not see this problem when running queries over smaller (~20 5MB files) datasets. When I increase the scope to include more data (several hundred ~5MB files), the queries will get through several steps but eventually hang indefinitely. Here's the email chain regarding this issue, including stack traces: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com For context, here's the announcement regarding the block transfer service change: http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
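The workaround called out above, as it would be applied in application code (the same property can equally be passed via --conf on spark-submit):
{code}
import org.apache.spark.SparkConf

// Fall back from the netty transport to the older nio implementation.
val conf = new SparkConf().set("spark.shuffle.blockTransferService", "nio")
{code}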
[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types
[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498651#comment-14498651 ] Joseph K. Bradley commented on SPARK-6857: -- Based on past discussions with [~mengxr], ML should use numpy and scipy types, rather than re-implementing all of that functionality. Supporting numpy and scipy types in SQL does not actually mean having numpy or scipy code in SQL. It would mean: * Extending UDTs so users can register their own UDTs with the SQLContext. * Adding UDTs for numpy and scipy types in MLlib. * Allowing users to import or call something which registers those MLlib UDTs with SQL. Python SQL schema inference should support numpy types -- Key: SPARK-6857 URL: https://issues.apache.org/jira/browse/SPARK-6857 Project: Spark Issue Type: Improvement Components: MLlib, PySpark, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley **UPDATE**: Closing this JIRA since a better fix will be better UDT support. See discussion in comments. If you try to use SQL's schema inference to create a DataFrame out of a list or RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy types. It would be handy if it did. E.g.: {code} import numpy from collections import namedtuple from pyspark.sql import SQLContext MyType = namedtuple('MyType', 'x') myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10)) sqlContext = SQLContext(sc) data = sqlContext.createDataFrame(myValues) {code} The above code fails with: {code} Traceback (most recent call last): File stdin, line 1, in module File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in createDataFrame return self.inferSchema(data, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in inferSchema schema = self._inferSchema(rdd, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in _inferSchema schema = _infer_schema(first) File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in _infer_schema fields = [StructField(k, _infer_type(v), True) for k, v in items] File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in _infer_type raise ValueError(not supported type: %s % type(obj)) ValueError: not supported type: type 'numpy.int64' {code} But if we cast to int (not numpy types) first, it's OK: {code} myNativeValues = map(lambda x: MyType(int(x.x)), myValues) data = sqlContext.createDataFrame(myNativeValues) # OK {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6969) Refresh the cached table when REFRESH TABLE is used
Yin Huai created SPARK-6969: --- Summary: Refresh the cached table when REFRESH TABLE is used Key: SPARK-6969 URL: https://issues.apache.org/jira/browse/SPARK-6969 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Right now, {{REFRESH TABLE}} only invalidates the metadata of a table. If a table is cached and new files are added manually to the table, users still see the cached data after {{REFRESH TABLE}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
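The behavior this ticket targets, with a made-up table name:
{code}
// "logs" is illustrative. Today the second statement can still be served
// from the stale cache, because REFRESH TABLE only invalidates metadata.
sqlContext.sql("REFRESH TABLE logs")
sqlContext.sql("SELECT count(*) FROM logs").collect()
{code}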
[jira] [Commented] (SPARK-6966) JDBC datasources use Class.forName to load driver
[ https://issues.apache.org/jira/browse/SPARK-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498802#comment-14498802 ] Apache Spark commented on SPARK-6966: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/5543 JDBC datasources use Class.forName to load driver - Key: SPARK-6966 URL: https://issues.apache.org/jira/browse/SPARK-6966 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6966) JDBC datasources use Class.forName to load driver
[ https://issues.apache.org/jira/browse/SPARK-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6966: --- Assignee: Apache Spark (was: Michael Armbrust) JDBC datasources use Class.forName to load driver - Key: SPARK-6966 URL: https://issues.apache.org/jira/browse/SPARK-6966 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Apache Spark Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6966) JDBC datasources use Class.forName to load driver
[ https://issues.apache.org/jira/browse/SPARK-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6966: --- Assignee: Michael Armbrust (was: Apache Spark) JDBC datasources use Class.forName to load driver - Key: SPARK-6966 URL: https://issues.apache.org/jira/browse/SPARK-6966 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6967) Internal DateType not handled correctly in caching
[ https://issues.apache.org/jira/browse/SPARK-6967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6967: Target Version/s: 1.3.2, 1.4.0 Internal DateType not handled correctly in caching -- Key: SPARK-6967 URL: https://issues.apache.org/jira/browse/SPARK-6967 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Adrian Wang Priority: Blocker From the user list. It looks like dates are not implemented correctly in in-memory caching. We should also check the JDBC datasource support for date. {code} Stack trace of an exception being reported since upgrade to 1.3.0: java.lang.ClassCastException: java.sql.Date cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:105) ~[scala-library-2.11.6.jar:na] at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(rows.scala:83) ~[spark-catalyst_2.11-1.3.0.jar:1.3.0] at org.apache.spark.sql.columnar.IntColumnStats.gatherStats(ColumnStats.scala:191) ~[spark-sql_2.11-1.3.0.jar:1.3.0] at org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:56) ~[spark-sql_2.11-1.3.0.jar:1.3.0] at org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:87) ~[spark-sql_2.11-1.3.0.jar:1.3.0] at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78) ~[spark-sql_2.11-1.3.0.jar:1.3.0] at org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87) ~[spark-sql_2.11-1.3.0.jar:1.3.0] at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:135) ~[spark-sql_2.11-1.3.0.jar:1.3.0] at {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
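A hedged repro sketch for the caching path described above (made-up data; caching forces the in-memory columnar build, which is where the Date-to-Integer cast fails):
{code}
import java.sql.Date

// "d"/"x" are illustrative column names.
val df = sqlContext.createDataFrame(Seq((Date.valueOf("2015-04-16"), 1))).toDF("d", "x")
df.cache()
df.count() // materializes the cached columnar buffers
{code}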
[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types
[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498508#comment-14498508 ] Joseph K. Bradley commented on SPARK-6857: -- [~davies] Yes, that's OK with me. It's a bit inconsistent: * In MLlib, we want to encourage users to use numpy and scipy types, rather than the mllib.linalg.* types. * In SQL, it's better if users use Python types or mllib.linalg.* types (for which UDTs handle the conversion). Perhaps the best fix will be better UDTs: If we can register any type (such as numpy.array) with the SQLContext as a UDT, then users will be able to use numpy and scipy types everywhere. I hope we can add that support before too long. Python SQL schema inference should support numpy types -- Key: SPARK-6857 URL: https://issues.apache.org/jira/browse/SPARK-6857 Project: Spark Issue Type: Improvement Components: MLlib, PySpark, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley If you try to use SQL's schema inference to create a DataFrame out of a list or RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy types. It would be handy if it did. E.g.: {code} import numpy from collections import namedtuple from pyspark.sql import SQLContext MyType = namedtuple('MyType', 'x') myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10)) sqlContext = SQLContext(sc) data = sqlContext.createDataFrame(myValues) {code} The above code fails with: {code} Traceback (most recent call last): File stdin, line 1, in module File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in createDataFrame return self.inferSchema(data, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in inferSchema schema = self._inferSchema(rdd, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in _inferSchema schema = _infer_schema(first) File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in _infer_schema fields = [StructField(k, _infer_type(v), True) for k, v in items] File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in _infer_type raise ValueError(not supported type: %s % type(obj)) ValueError: not supported type: type 'numpy.int64' {code} But if we cast to int (not numpy types) first, it's OK: {code} myNativeValues = map(lambda x: MyType(int(x.x)), myValues) data = sqlContext.createDataFrame(myNativeValues) # OK {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org