[jira] [Created] (SPARK-11524) Support SparkR with Mesos cluster
Sun Rui created SPARK-11524: --- Summary: Support SparkR with Mesos cluster Key: SPARK-11524 URL: https://issues.apache.org/jira/browse/SPARK-11524 Project: Spark Issue Type: New Feature Components: SparkR Affects Versions: 1.5.1 Reporter: Sun Rui -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11507) Error thrown when using BlockMatrix.add
[ https://issues.apache.org/jira/browse/SPARK-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991261#comment-14991261 ] yuhao yang commented on SPARK-11507: Looking into it. Should be a bug. Breeze may remove the extra end after addition. > Error thrown when using BlockMatrix.add > --- > > Key: SPARK-11507 > URL: https://issues.apache.org/jira/browse/SPARK-11507 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1, 1.5.0 > Environment: Mac/local machine, EC2 > Scala >Reporter: Kareem Alhazred >Priority: Minor > > In certain situations when adding two block matrices, I get an error > regarding colPtr and the operation fails. External issue URL includes full > error and code for reproducing the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11507) Error thrown when using BlockMatrix.add
[ https://issues.apache.org/jira/browse/SPARK-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991261#comment-14991261 ] yuhao yang edited comment on SPARK-11507 at 11/5/15 7:21 AM: - Looking into it. Should be a bug. Breeze may remove the extra end in colPtr after addition. was (Author: yuhaoyan): Looking into it. Should be a bug. Breeze may remove the extra end after addition. > Error thrown when using BlockMatrix.add > --- > > Key: SPARK-11507 > URL: https://issues.apache.org/jira/browse/SPARK-11507 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1, 1.5.0 > Environment: Mac/local machine, EC2 > Scala >Reporter: Kareem Alhazred >Priority: Minor > > In certain situations when adding two block matrices, I get an error > regarding colPtr and the operation fails. External issue URL includes full > error and code for reproducing the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
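The "extra end in colPtr" mentioned in the comment refers to the column-pointer array of a compressed sparse column (CSC) matrix, the layout Breeze uses for sparse matrices. A plain-Java sketch of the invariant (illustrative only, not Breeze or Spark code): `colPtr` must have `numCols + 1` entries, with the trailing entry equal to the total number of non-zeros, so losing it after an addition breaks anything that slices column `c` as the range `[colPtr[c], colPtr[c+1])`.

```java
// Illustrative sketch of the CSC colPtr invariant (not Breeze/Spark code).
public class CscColPtrDemo {
    // Build colPtr from per-column non-zero counts: a prefix sum with
    // one extra trailing entry holding the total non-zero count.
    static int[] colPtr(int[] nnzPerCol) {
        int[] ptr = new int[nnzPerCol.length + 1];
        for (int c = 0; c < nnzPerCol.length; c++) {
            ptr[c + 1] = ptr[c] + nnzPerCol[c];
        }
        return ptr;
    }

    public static void main(String[] args) {
        // 3 columns holding 2, 0 and 1 non-zeros respectively.
        int[] ptr = colPtr(new int[]{2, 0, 1});
        // ptr = [0, 2, 2, 3]; the trailing 3 is the total nnz. If an
        // addition routine drops this last entry, slicing the final
        // column via [ptr[c], ptr[c+1]) reads past the array.
        System.out.println(java.util.Arrays.toString(ptr));
    }
}
```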
[jira] [Commented] (SPARK-11475) DataFrame API saveAsTable() does not work well for HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-11475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991258#comment-14991258 ] zhangxiongfei commented on SPARK-11475: --- Hi [~rekhajoshm] Thanks for pointing out my wrong Hive meta configuration. This is not a Spark issue. > DataFrame API saveAsTable() does not work well for HDFS HA > -- > > Key: SPARK-11475 > URL: https://issues.apache.org/jira/browse/SPARK-11475 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: Hadoop 2.4 & Spark 1.5.1 >Reporter: zhangxiongfei > Attachments: dataFrame_saveAsTable.txt, hdfs-site.xml, hive-site.xml > > > I was trying to save a DataFrame to Hive using the following code: > {quote} > sqlContext.range(1L,1000L,2L,2).coalesce(1).saveAsTable("dataframeTable") > {quote} > But got the exception below: > {quote} > Warning: there were 1 deprecation warning(s); re-run with -deprecation for > details > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category READ is not supported in state standby > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1610) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1193) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3516) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:785) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo( > {quote} > *My Hive configuration is*: > {quote} > <property> > <name>hive.metastore.warehouse.dir</name> > <value>*/apps/hive/warehouse*</value> > </property> > {quote} > It seems that HDFS HA is not configured, so I then tried the code below: > {quote} >
sqlContext.range(1L,1000L,2L,2).coalesce(1).saveAsParquetFile("hdfs://bitautodmp/apps/hive/warehouse/dataframeTable") > {quote} > I could verify that the API *saveAsParquetFile* worked well with the following > commands: > {quote} > *hadoop fs -ls /apps/hive/warehouse/dataframeTable* > Found 4 items > -rw-r--r-- 3 zhangxf hdfs 0 2015-11-03 17:57 > */apps/hive/warehouse/dataframeTable/_SUCCESS* > -rw-r--r-- 3 zhangxf hdfs 199 2015-11-03 17:57 > */apps/hive/warehouse/dataframeTable/_common_metadata* > -rw-r--r-- 3 zhangxf hdfs 325 2015-11-03 17:57 > */apps/hive/warehouse/dataframeTable/_metadata* > -rw-r--r-- 3 zhangxf hdfs 1098 2015-11-03 17:57 > */apps/hive/warehouse/dataframeTable/part-r-0-a05a9bf3-b2a6-40e5-b180-818efb2a0f54.gz.parquet* > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
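Since the working path above addresses the logical nameservice (`hdfs://bitautodmp/...`) rather than a single NameNode host, a likely-correct configuration would point the warehouse at that nameservice as well. A sketch of the relevant entries, assuming the nameservice is named `bitautodmp` as in the path above (a full HA setup also needs `dfs.ha.namenodes.*` and the per-NameNode RPC addresses in hdfs-site.xml):

```xml
<!-- hive-site.xml: reference the HA nameservice, not a bare path or host -->
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>hdfs://bitautodmp/apps/hive/warehouse</value>
</property>

<!-- hdfs-site.xml (client side): make the nameservice resolvable -->
<property>
  <name>dfs.nameservices</name>
  <value>bitautodmp</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.bitautodmp</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```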
[jira] [Commented] (SPARK-4557) Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a Function<..., Void>
[ https://issues.apache.org/jira/browse/SPARK-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991239#comment-14991239 ] Apache Spark commented on SPARK-4557: - User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/9488 > Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a > Function<..., Void> > --- > > Key: SPARK-4557 > URL: https://issues.apache.org/jira/browse/SPARK-4557 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Alexis Seigneurin >Priority: Minor > Labels: starter > > In *Java*, using Spark Streaming's foreachRDD function is quite verbose. You > have to write: > {code:java} > .foreachRDD(items -> { > ...; > return null; > }); > {code} > Instead of: > {code:java} > .foreachRDD(items -> ...); > {code} > This is because the foreachRDD method accepts a Function<..., Void> > instead of a VoidFunction<...>. It would make sense to change it > to a VoidFunction since, in Spark's API, the foreach method already accepts a > VoidFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
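The API-shape difference the issue describes can be sketched in plain Java, independent of Spark (the `VoidFn` interface below is an illustrative stand-in for Spark's `VoidFunction`, not the real API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class ForeachRddShapes {
    // Shape of the current API: the callback must return Void.
    static <T> void forEach(List<T> items, Function<T, Void> f) {
        for (T t : items) f.apply(t);
    }

    // Shape the issue proposes: a void-returning functional interface,
    // analogous to Spark's VoidFunction (name here is illustrative).
    @FunctionalInterface
    interface VoidFn<T> { void call(T t); }

    static <T> void forEach2(List<T> items, VoidFn<T> f) {
        for (T t : items) f.call(t);
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        // Verbose: the lambda needs a block body and an explicit "return null".
        forEach(List.of("a", "b"), item -> {
            out.add(item);
            return null;
        });
        // Concise: a plain expression lambda, as the issue requests.
        forEach2(List.of("c"), out::add);
        System.out.println(out);
    }
}
```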
[jira] [Assigned] (SPARK-4557) Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a Function<..., Void>
[ https://issues.apache.org/jira/browse/SPARK-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4557: --- Assignee: (was: Apache Spark) > Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a > Function<..., Void> > --- > > Key: SPARK-4557 > URL: https://issues.apache.org/jira/browse/SPARK-4557 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Alexis Seigneurin >Priority: Minor > Labels: starter > > In *Java*, using Spark Streaming's foreachRDD function is quite verbose. You > have to write: > {code:java} > .foreachRDD(items -> { > ...; > return null; > }); > {code} > Instead of: > {code:java} > .foreachRDD(items -> ...); > {code} > This is because the foreachRDD method accepts a Function<..., Void> > instead of a VoidFunction<...>. It would make sense to change it > to a VoidFunction since, in Spark's API, the foreach method already accepts a > VoidFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4557) Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a Function<..., Void>
[ https://issues.apache.org/jira/browse/SPARK-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4557: --- Assignee: Apache Spark > Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a > Function<..., Void> > --- > > Key: SPARK-4557 > URL: https://issues.apache.org/jira/browse/SPARK-4557 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Alexis Seigneurin >Assignee: Apache Spark >Priority: Minor > Labels: starter > > In *Java*, using Spark Streaming's foreachRDD function is quite verbose. You > have to write: > {code:java} > .foreachRDD(items -> { > ...; > return null; > }); > {code} > Instead of: > {code:java} > .foreachRDD(items -> ...); > {code} > This is because the foreachRDD method accepts a Function<..., Void> > instead of a VoidFunction<...>. It would make sense to change it > to a VoidFunction since, in Spark's API, the foreach method already accepts a > VoidFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11523) spark_partition_id() considered invalid function
Simeon Simeonov created SPARK-11523: --- Summary: spark_partition_id() considered invalid function Key: SPARK-11523 URL: https://issues.apache.org/jira/browse/SPARK-11523 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Reporter: Simeon Simeonov {{spark_partition_id()}} works correctly in top-level {{SELECT}} statements but is not recognized in {{SELECT}} statements that define views. It seems DDL processing vs. execution in Spark SQL use two different parsers and/or environments. In the following examples, instead of the {{test_data}} table you can use any defined table name. A top-level statement works: {code} scala> ctx.sql("select spark_partition_id() as partition_id from test_data").show ++ |partition_id| ++ | 0| ... | 0| ++ only showing top 20 rows {code} The same query in a view definition fails with {{Invalid function 'spark_partition_id'}}. {code} scala> ctx.sql("create view test_view as select spark_partition_id() as partition_id from test_data") 15/11/05 01:05:38 INFO ParseDriver: Parsing command: create view test_view as select spark_partition_id() as partition_id from test_data 15/11/05 01:05:38 INFO ParseDriver: Parse Completed 15/11/05 01:05:38 INFO PerfLogger: 15/11/05 01:05:38 INFO PerfLogger: 15/11/05 01:05:38 INFO PerfLogger: 15/11/05 01:05:38 INFO PerfLogger: 15/11/05 01:05:38 INFO ParseDriver: Parsing command: create view test_view as select spark_partition_id() as partition_id from test_data 15/11/05 01:05:38 INFO ParseDriver: Parse Completed 15/11/05 01:05:38 INFO PerfLogger: 15/11/05 01:05:38 INFO PerfLogger: 15/11/05 01:05:38 INFO CalcitePlanner: Starting Semantic Analysis 15/11/05 01:05:38 INFO CalcitePlanner: Creating view default.test_view position=12 15/11/05 01:05:38 INFO HiveMetaStore: 0: get_database: default 15/11/05 01:05:38 INFO audit: ugi=sim ip=unknown-ip-addr cmd=get_database: default 15/11/05 01:05:38 INFO CalcitePlanner: Completed phase 1 of Semantic Analysis 15/11/05 01:05:38 INFO CalcitePlanner: Get 
metadata for source tables 15/11/05 01:05:38 INFO HiveMetaStore: 0: get_table : db=default tbl=test_data 15/11/05 01:05:38 INFO audit: ugi=sim ip=unknown-ip-addr cmd=get_table : db=default tbl=test_data 15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for subqueries 15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for destination tables 15/11/05 01:05:38 INFO Context: New scratch dir is hdfs://localhost:9000/tmp/hive/sim/3fce9b7e-011f-4632-b673-e29067779fa0/hive_2015-11-05_01-05-38_518_4526721093949438849-1 15/11/05 01:05:38 INFO CalcitePlanner: Completed getting MetaData in Semantic Analysis 15/11/05 01:05:38 INFO BaseSemanticAnalyzer: Not invoking CBO because the statement doesn't have QUERY or EXPLAIN as root and not a CTAS; has create view 15/11/05 01:05:38 ERROR Driver: FAILED: SemanticException [Error 10011]: Line 1:32 Invalid function 'spark_partition_id' org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:32 Invalid function 'spark_partition_id' at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:925) at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1265) at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:95) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:79) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:133) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:110) at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:205) at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:149) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:10512) at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:10468) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:3840) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:3619) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:8956) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:8911) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9756) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genP
[jira] [Commented] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI
[ https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991208#comment-14991208 ] Apache Spark commented on SPARK-2533: - User 'jbonofre' has created a pull request for this issue: https://github.com/apache/spark/pull/9117 > Show summary of locality level of completed tasks in the each stage page of > web UI > -- > > Key: SPARK-2533 > URL: https://issues.apache.org/jira/browse/SPARK-2533 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Masayoshi TSUZUKI >Priority: Minor > > When the number of tasks is very large, it is impossible to know how many > tasks were executed under (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL) from the > stage page of web UI. It would be better to show the summary of task locality > level in web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10729) word2vec model save for python
[ https://issues.apache.org/jira/browse/SPARK-10729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991206#comment-14991206 ] Yu Ishikawa commented on SPARK-10729: - Sorry, the cause isn't `@inherit_doc`. I misunderstood. Anyway, we should discuss the documentation. > word2vec model save for python > -- > > Key: SPARK-10729 > URL: https://issues.apache.org/jira/browse/SPARK-10729 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.4.1, 1.5.0 >Reporter: Joseph A Gartner III > Fix For: 1.5.0 > > > The ability to save a word2vec model has not been ported to python, and would > be extremely useful to have given the long training period. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI
[ https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991197#comment-14991197 ] Jean-Baptiste Onofré commented on SPARK-2533: - New clean PR. > Show summary of locality level of completed tasks in the each stage page of > web UI > -- > > Key: SPARK-2533 > URL: https://issues.apache.org/jira/browse/SPARK-2533 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Masayoshi TSUZUKI >Priority: Minor > > When the number of tasks is very large, it is impossible to know how many > tasks were executed under (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL) from the > stage page of web UI. It would be better to show the summary of task locality > level in web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI
[ https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991187#comment-14991187 ] Apache Spark commented on SPARK-2533: - User 'jbonofre' has created a pull request for this issue: https://github.com/apache/spark/pull/9487 > Show summary of locality level of completed tasks in the each stage page of > web UI > -- > > Key: SPARK-2533 > URL: https://issues.apache.org/jira/browse/SPARK-2533 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Masayoshi TSUZUKI >Priority: Minor > > When the number of tasks is very large, it is impossible to know how many > tasks were executed under (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL) from the > stage page of web UI. It would be better to show the summary of task locality > level in web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver
[ https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991160#comment-14991160 ] Jean-Baptiste Onofré commented on SPARK-11193: -- Hi Phil, I'm testing a fix on Kryo right now. I'm testing with different use case. I'm keeping you posted. > Spark 1.5+ Kinesis Streaming - ClassCastException when starting > KinesisReceiver > --- > > Key: SPARK-11193 > URL: https://issues.apache.org/jira/browse/SPARK-11193 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.0, 1.5.1 >Reporter: Phil Kallos > Attachments: screen.png > > > After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis > Spark Streaming application, and am being consistently greeted with this > exception: > java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast > to scala.collection.mutable.SynchronizedMap > at > org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532) > at > org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982) > at > org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Worth noting that I am able to reproduce this issue locally, and also on > Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0). > Also, I am not able to run the included kinesis-asl example. > Built locally using: > git checkout v1.5.1 > mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package > Example run command: > bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector > https://kinesis.us-east-1.amazonaws.com -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
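A minimal pure-Java sketch of the failure mode in the stack trace: an object is downcast to a concrete type it was never constructed as, and the cast site throws `ClassCastException`. (In the actual issue the object is a Scala `HashMap` that the receiver expects to carry the `SynchronizedMap` mixin; the analogy below only shows why such a cast fails at runtime, not the Spark- or Kryo-specific cause.)

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class CastDemo {
    // Returns true when the cast fails: the map was constructed as a
    // plain HashMap, so casting it to a different concrete subtype
    // throws ClassCastException at the cast site, exactly the pattern
    // seen in KinesisReceiver.onStart in the stack trace above.
    static boolean castFails() {
        Map<String, Integer> m = new HashMap<>();
        try {
            LinkedHashMap<String, Integer> l = (LinkedHashMap<String, Integer>) m;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(castFails());
    }
}
```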
[jira] [Resolved] (SPARK-11486) TungstenAggregate may fail when switching to sort-based aggregation when there are string in grouping columns and no aggregation buffer columns
[ https://issues.apache.org/jira/browse/SPARK-11486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-11486. Resolution: Fixed Issue resolved by pull request 9383 [https://github.com/apache/spark/pull/9383] > TungstenAggregate may fail when switching to sort-based aggregation when > there are string in grouping columns and no aggregation buffer columns > --- > > Key: SPARK-11486 > URL: https://issues.apache.org/jira/browse/SPARK-11486 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Josh Rosen >Priority: Blocker > Fix For: 1.6.0 > > > This was discovered by [~davies]: > {code} > java.lang.UnsupportedOperationException > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:193) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(generated.java:40) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:643) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:517) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:779) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:128) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$3.apply(TungstenAggregate.scala:137) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$3.apply(TungstenAggregate.scala:137) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 15/10/28 23:25:08 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3, > runningTasks: 0 > {code} > See discussion at > https://github.com/apache/spark/pull/9383#issuecomment-153466959 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11425) Improve hybrid aggregation (sort-based after hash-based)
[ https://issues.apache.org/jira/browse/SPARK-11425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-11425. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9383 [https://github.com/apache/spark/pull/9383] > Improve hybrid aggregation (sort-based after hash-based) > > > Key: SPARK-11425 > URL: https://issues.apache.org/jira/browse/SPARK-11425 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Davies Liu > Fix For: 1.6.0 > > > After aggregation, the dataset could be smaller than inputs, so it's better > to do hash based aggregation for all inputs, then using sort based > aggregation to merge them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
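The idea behind the improvement above — hash-aggregate every input first, then merge the partial results in sorted key order — can be sketched in a few lines (illustrative Java, not Spark's implementation; real hybrid aggregation switches to the sort-based path only under memory pressure):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HybridAggDemo {
    // Count occurrences per key: a hash-based pass over each partition,
    // followed by a sorted merge of the partial maps (TreeMap stands in
    // for the sort-based merge step).
    static TreeMap<String, Long> hybridCount(List<List<String>> partitions) {
        List<Map<String, Long>> partials = new ArrayList<>();
        for (List<String> part : partitions) {
            Map<String, Long> hash = new HashMap<>();      // hash-based pass
            for (String key : part) hash.merge(key, 1L, Long::sum);
            partials.add(hash);
        }
        TreeMap<String, Long> merged = new TreeMap<>();    // sorted merge
        for (Map<String, Long> p : partials)
            p.forEach((k, v) -> merged.merge(k, v, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(hybridCount(List.of(
            List.of("a", "b", "a"), List.of("b", "c"))));
    }
}
```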
[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-11500: - Description: When executing {{sqlContext.read.option("mergeSchema", "true").parquet(pathOne, pathTwo).printSchema()}} The order of columns is not deterministic, showing up in a different order sometimes. This is because of {{FileStatusCache}} in {{HadoopFsRelation}} (which {{ParquetRelation}} extends as you know). When {{FileStatusCache.listLeafFiles()}} is called, this returns {{Set[FileStatus]}} which messes up the order of {{Array[FileStatus]}}. So, after retrieving the list of leaf files including {{_metadata}} and {{_common_metadata}}, this starts to merge (separately and if necessary) the {{Set}} s of {{_metadata}}, {{_common_metadata}} and part-files in {{ParquetRelation.mergeSchemasInParallel()}}, which ends up in the different column order having the leading columns (of the first file) which the other files do not have. I think this can be resolved by using {{LinkedHashSet}}. in a simple view, If A file has 1,2,3 fields, and B file column 3,4,5, we can not ensure which column shows first since It is not deterministic. 1. Read file list (A and B) 2. Not deterministic order of (A and B or B and A) as I said. 3. It merges by {{reduceOption}} with retrieved schemas of (A and B or B and A), (which maybe also should be {{reduceOptionRight}} or {{reduceOptionLeft}}). 4. The output columns would be 1,2,3,4,5 when A and B, or 3.4.5.1.2 when B and A. was: When executing {{sqlContext.read.option("mergeSchema", "true").parquet(pathOne, pathTwo).printSchema()}} The order of columns is not deterministic, showing up in a different order sometimes. This is because of {{FileStatusCache}} in {{HadoopFsRelation}} (which {{ParquetRelation}} extends as you know). When {{FileStatusCache.listLeafFiles()}} is called, this returns {{Set[FileStatus]}} which messes up the order of {{Array[FileStatus]}}. 
So, after retrieving the list of leaf files including {{_metadata}} and {{_common_metadata}}, this starts to merge (separately and if necessary) the {{Set}} s of {{_metadata}}, {{_common_metadata}} and part-files in {{ParquetRelation.mergeSchemasInParallel()}}, which ends up in the different column order having the leading columns (of the first file) which the other files do not have. I think this can be resolved by using {{LinkedHashSet}}. > Not deterministic order of columns when using merging schemas. > -- > > Key: SPARK-11500 > URL: https://issues.apache.org/jira/browse/SPARK-11500 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Hyukjin Kwon > > When executing > {{sqlContext.read.option("mergeSchema", "true").parquet(pathOne, > pathTwo).printSchema()}} > The order of columns is not deterministic, showing up in a different order > sometimes. > This is because of {{FileStatusCache}} in {{HadoopFsRelation}} (which > {{ParquetRelation}} extends as you know). When > {{FileStatusCache.listLeafFiles()}} is called, this returns > {{Set[FileStatus]}} which messes up the order of {{Array[FileStatus]}}. > So, after retrieving the list of leaf files including {{_metadata}} and > {{_common_metadata}}, this starts to merge (separately and if necessary) the > {{Set}} s of {{_metadata}}, {{_common_metadata}} and part-files in > {{ParquetRelation.mergeSchemasInParallel()}}, which ends up in the different > column order having the leading columns (of the first file) which the other > files do not have. > I think this can be resolved by using {{LinkedHashSet}}. > in a simple view, > If A file has 1,2,3 fields, and B file column 3,4,5, we can not ensure which > column shows first since It is not deterministic. > 1. Read file list (A and B) > 2. Not deterministic order of (A and B or B and A) as I said. > 3. It merges by {{reduceOption}} with retrieved schemas of (A and B or B and > A), (which maybe also should be {{reduceOptionRight}} or > {{reduceOptionLeft}}). 
> 4. The output columns would be 1,2,3,4,5 when A and B, or 3.4.5.1.2 when B > and A. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
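The proposed `LinkedHashSet` fix can be illustrated outside Spark (plain Java, a stand-in for the Scala collections involved): merging column lists through an insertion-ordered set keeps the result deterministic, whereas a plain hash set gives no ordering guarantee at all.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class MergeOrderDemo {
    // Merge two "schemas" (column-name lists) through a LinkedHashSet,
    // which preserves first-insertion order — the behaviour the reporter
    // proposes instead of the unordered Set returned by listLeafFiles().
    static List<String> mergeOrdered(List<String> a, List<String> b) {
        Set<String> merged = new LinkedHashSet<>(a); // keeps first-seen order
        merged.addAll(b);                            // appends only new columns
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<String> a = List.of("1", "2", "3");
        List<String> b = List.of("3", "4", "5");
        // Deterministic [1, 2, 3, 4, 5] no matter which file is listed first
        // in terms of hash iteration quirks; swapping a and b gives 3,4,5,1,2,
        // but consistently so.
        System.out.println(mergeOrdered(a, b));
    }
}
```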
[jira] [Assigned] (SPARK-11514) Pass random seed to spark.ml DecisionTree*
[ https://issues.apache.org/jira/browse/SPARK-11514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11514: Assignee: Apache Spark (was: Yu Ishikawa) > Pass random seed to spark.ml DecisionTree* > -- > > Key: SPARK-11514 > URL: https://issues.apache.org/jira/browse/SPARK-11514 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11514) Pass random seed to spark.ml DecisionTree*
[ https://issues.apache.org/jira/browse/SPARK-11514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991117#comment-14991117 ] Apache Spark commented on SPARK-11514: -- User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/9486 > Pass random seed to spark.ml DecisionTree* > -- > > Key: SPARK-11514 > URL: https://issues.apache.org/jira/browse/SPARK-11514 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yu Ishikawa > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11514) Pass random seed to spark.ml DecisionTree*
[ https://issues.apache.org/jira/browse/SPARK-11514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11514: Assignee: Yu Ishikawa (was: Apache Spark) > Pass random seed to spark.ml DecisionTree* > -- > > Key: SPARK-11514 > URL: https://issues.apache.org/jira/browse/SPARK-11514 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yu Ishikawa > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10838) Repeat to join one DataFrame twice,there will be AnalysisException.
[ https://issues.apache.org/jira/browse/SPARK-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991108#comment-14991108 ] Xiao Li commented on SPARK-10838: - The fix is ready. Writing unit test cases now. > Repeat to join one DataFrame twice,there will be AnalysisException. > --- > > Key: SPARK-10838 > URL: https://issues.apache.org/jira/browse/SPARK-10838 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Yun Zhao > > The detail of exception is: > {quote} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) col_a#1 missing from col_a#0,col_b#2,col_a#3,col_b#4 in operator > !Join Inner, Some((col_b#2 = col_a#1)); > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:908) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:554) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:521) > {quote} > The related codes are: > {quote} > import org.apache.spark.sql.SQLContext > import org.apache.spark.\{SparkContext, SparkConf} > object DFJoinTest extends App \{ 
> case class Foo(col_a: String) > case class Bar(col_a: String, col_b: String) > val sc = new SparkContext(new > SparkConf().setMaster("local").setAppName("DFJoinTest")) > val sqlContext = new SQLContext(sc) > import sqlContext.implicits._ > val df1 = sc.parallelize(Array("1")).map(_.split(",")).map(p => > Foo(p(0))).toDF() > val df2 = sc.parallelize(Array("1,1")).map(_.split(",")).map(p => Bar(p(0), > p(1))).toDF() > val df3 = df1.join(df2, df1("col_a") === df2("col_a")).select(df1("col_a"), > $"col_b") > df3.join(df2, df3("col_b") === df2("col_a")).show() > // val df4 = df2.as("df4") > // df3.join(df4, df3("col_b") === df4("col_a")).show() > // df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show() > sc.stop() > } > {quote} > When using > {quote} > val df4 = df2.as("df4") > df3.join(df4, df3("col_b") === df4("col_a")).show() > {quote} > the error above is thrown, but when using > {quote} > df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show() > {quote} > it works as expected.
[jira] [Created] (SPARK-11522) input_file_name() returns "" for external tables
Simeon Simeonov created SPARK-11522: --- Summary: input_file_name() returns "" for external tables Key: SPARK-11522 URL: https://issues.apache.org/jira/browse/SPARK-11522 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Reporter: Simeon Simeonov Given an external table definition where the data consists of many CSV files, {{input_file_name()}} returns empty strings. Table definition: {code} CREATE EXTERNAL TABLE external_test(page_id INT, impressions INT) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = "\"", "escapeChar" = "\\" ) LOCATION 'file:///Users/sim/spark/test/external_test' {code} Query: {code} sql("SELECT input_file_name() as file FROM external_test").show {code} Output: {code} +----+ |file| +----+ || || ... || +----+ {code}
[jira] [Created] (SPARK-11521) LinearRegressionSummary needs to clarify which metrics are weighted
Joseph K. Bradley created SPARK-11521: - Summary: LinearRegressionSummary needs to clarify which metrics are weighted Key: SPARK-11521 URL: https://issues.apache.org/jira/browse/SPARK-11521 Project: Spark Issue Type: Documentation Components: ML Reporter: Joseph K. Bradley Priority: Critical Some metrics in the summary are weighted (e.g., devianceResiduals), but the ones computed via RegressionMetrics are not. This should be documented very clearly (unless this gets fixed before the next release in [SPARK-11520]). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11520) RegressionMetrics should support instance weights
Joseph K. Bradley created SPARK-11520: - Summary: RegressionMetrics should support instance weights Key: SPARK-11520 URL: https://issues.apache.org/jira/browse/SPARK-11520 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley This will be important to improve LinearRegressionSummary, which currently has a mix of weighted and unweighted metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
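Concretely, supporting instance weights means each residual contributes in proportion to its weight, reducing to today's unweighted metrics when all weights are 1.0. A minimal sketch of the arithmetic in plain Python (this is not Spark's RegressionMetrics API; the function names are illustrative):

```python
# Weighted regression metrics sketch: each squared residual is scaled by
# its instance weight. With uniform weights this reduces to ordinary
# MSE and R^2, which is why mixing weighted and unweighted metrics in
# one summary is confusing.

def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def weighted_mse(labels, preds, weights):
    num = sum(w * (y - p) ** 2 for y, p, w in zip(labels, preds, weights))
    return num / sum(weights)

def weighted_r2(labels, preds, weights):
    mean_y = weighted_mean(labels, weights)
    ss_res = sum(w * (y - p) ** 2 for y, p, w in zip(labels, preds, weights))
    ss_tot = sum(w * (y - mean_y) ** 2 for y, w in zip(labels, weights))
    return 1.0 - ss_res / ss_tot

labels = [1.0, 2.0, 3.0, 4.0]
preds = [1.1, 1.9, 3.2, 3.8]
uniform = [1.0] * 4              # reduces to the unweighted metrics
skewed = [10.0, 1.0, 1.0, 1.0]   # first instance dominates the metric
```

With uniform weights the functions agree with their unweighted counterparts; with skewed weights the metric shifts toward the heavily weighted instances, which is exactly the behavior LinearRegressionSummary would need to document.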
[jira] [Created] (SPARK-11519) Spark MemoryStore with hadoop SequenceFile cache the values is same record.
xukaiqiang created SPARK-11519: -- Summary: Spark MemoryStore with hadoop SequenceFile cache the values is same record. Key: SPARK-11519 URL: https://issues.apache.org/jira/browse/SPARK-11519 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Environment: JDK 1.7.0, Spark 1.1.0, Hadoop 2.3.0 Reporter: xukaiqiang When reading a SequenceFile created via newAPIHadoopFile and caching the resulting RDD in memory, the cache stores the same Java object for every record. Reading the Hadoop file with SequenceFileRecordReader produces a NewHadoopRDD whose key/value pairs look like: [1, com.data.analysis.domain.RecordObject@54cdb594] [2, com.data.analysis.domain.RecordObject@54cdb594] [3, com.data.analysis.domain.RecordObject@54cdb594] Every value is the same Java object, even though the underlying records are different. After caching, the MemoryStore vector holds all the records, but each value is the last value read from the NewHadoopRDD.
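The behavior described above is a known Hadoop RecordReader gotcha rather than a MemoryStore bug: SequenceFileRecordReader reuses one mutable Writable instance for every record, so caching references without copying leaves every cached slot pointing at the same object, which ends up holding the last record read. A plain-Python simulation of the mechanism (no Spark involved; the class and parameter names are illustrative):

```python
# Simulates a record reader that reuses a single mutable object per
# record, as Hadoop's SequenceFileRecordReader does with Writables.

class Record:
    def __init__(self):
        self.value = None

def read_records(raw, reuse=True):
    record = Record()
    for item in raw:
        if not reuse:
            record = Record()  # fresh object per record: the "copy" fix
        record.value = item
        yield record

# Caching references to a reused object: every cached slot is the same
# object, so all of them show the last value read.
cache = list(read_records([1, 2, 3]))
print([r.value for r in cache])  # prints [3, 3, 3]

# Copying (here, a fresh object per record) preserves each value.
cache = list(read_records([1, 2, 3], reuse=False))
print([r.value for r in cache])  # prints [1, 2, 3]
```

In Spark the usual fix is to copy each record before caching, for example by mapping over the RDD and cloning the Writable (or converting it to an immutable value) before calling cache().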
[jira] [Commented] (SPARK-11473) R-like summary statistics with intercept for OLS via normal equation solver
[ https://issues.apache.org/jira/browse/SPARK-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991022#comment-14991022 ] Apache Spark commented on SPARK-11473: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/9485 > R-like summary statistics with intercept for OLS via normal equation solver > --- > > Key: SPARK-11473 > URL: https://issues.apache.org/jira/browse/SPARK-11473 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang > > SPARK-9836 has provided R-like summary statistics for coefficients, we should > also add this statistics for intercept. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11473) R-like summary statistics with intercept for OLS via normal equation solver
[ https://issues.apache.org/jira/browse/SPARK-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11473: Assignee: (was: Apache Spark) > R-like summary statistics with intercept for OLS via normal equation solver > --- > > Key: SPARK-11473 > URL: https://issues.apache.org/jira/browse/SPARK-11473 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang > > SPARK-9836 has provided R-like summary statistics for coefficients, we should > also add this statistics for intercept. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11473) R-like summary statistics with intercept for OLS via normal equation solver
[ https://issues.apache.org/jira/browse/SPARK-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11473: Assignee: Apache Spark > R-like summary statistics with intercept for OLS via normal equation solver > --- > > Key: SPARK-11473 > URL: https://issues.apache.org/jira/browse/SPARK-11473 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Apache Spark > > SPARK-9836 has provided R-like summary statistics for coefficients, we should > also add this statistics for intercept. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991018#comment-14991018 ] yuhao yang commented on SPARK-9273: --- Hi [~avulanov]. I've refactored the CNN in [https://github.com/hhbyyh/mCNN/tree/master/src/communityInterface] according to the ANN interface. I also made an example in [Driver.scala| https://github.com/hhbyyh/mCNN/blob/master/src/communityInterface/Driver.scala] to demonstrate how it can be combined with existing layers to solve MNIST. Right now I'm trying to optimize the convolution operation, referring to [http://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/]. I'd appreciate it if you could shed some light :-). > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang > > Add Convolutional Neural network to Spark MLlib
[jira] [Created] (SPARK-11518) The script spark-submit.cmd can not handle spark directory with space.
Cele Liu created SPARK-11518: Summary: The script spark-submit.cmd can not handle spark directory with space. Key: SPARK-11518 URL: https://issues.apache.org/jira/browse/SPARK-11518 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Cele Liu After unzipping Spark into D:\Program Files\Spark, submitting an app fails with: 'D:\Program' is not recognized as an internal or external command, operable program or batch file. The cause is that spark-submit.cmd does not quote the script path, so the space splits the command: cmd /V /E /C %~dp0spark-submit2.cmd %*
[jira] [Commented] (SPARK-10809) Single-document topicDistributions method for LocalLDAModel
[ https://issues.apache.org/jira/browse/SPARK-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990974#comment-14990974 ] yuhao yang commented on SPARK-10809: working on this. > Single-document topicDistributions method for LocalLDAModel > --- > > Key: SPARK-10809 > URL: https://issues.apache.org/jira/browse/SPARK-10809 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > We could provide a single-document topicDistributions method for > LocalLDAModel to allow for quick queries which avoid RDD operations. > Currently, the user must use an RDD of documents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10809) Single-document topicDistributions method for LocalLDAModel
[ https://issues.apache.org/jira/browse/SPARK-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990975#comment-14990975 ] Apache Spark commented on SPARK-10809: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/9484 > Single-document topicDistributions method for LocalLDAModel > --- > > Key: SPARK-10809 > URL: https://issues.apache.org/jira/browse/SPARK-10809 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > We could provide a single-document topicDistributions method for > LocalLDAModel to allow for quick queries which avoid RDD operations. > Currently, the user must use an RDD of documents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10809) Single-document topicDistributions method for LocalLDAModel
[ https://issues.apache.org/jira/browse/SPARK-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10809: Assignee: (was: Apache Spark) > Single-document topicDistributions method for LocalLDAModel > --- > > Key: SPARK-10809 > URL: https://issues.apache.org/jira/browse/SPARK-10809 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > We could provide a single-document topicDistributions method for > LocalLDAModel to allow for quick queries which avoid RDD operations. > Currently, the user must use an RDD of documents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10809) Single-document topicDistributions method for LocalLDAModel
[ https://issues.apache.org/jira/browse/SPARK-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10809: Assignee: Apache Spark > Single-document topicDistributions method for LocalLDAModel > --- > > Key: SPARK-10809 > URL: https://issues.apache.org/jira/browse/SPARK-10809 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > We could provide a single-document topicDistributions method for > LocalLDAModel to allow for quick queries which avoid RDD operations. > Currently, the user must use an RDD of documents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11517) Calc partitions in parallel for multiple partitions table
[ https://issues.apache.org/jira/browse/SPARK-11517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990962#comment-14990962 ] Apache Spark commented on SPARK-11517: -- User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/9483 > Calc partitions in parallel for multiple partitions table > - > > Key: SPARK-11517 > URL: https://issues.apache.org/jira/browse/SPARK-11517 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: zhichao-li >Priority: Minor > > Currently we calculate the getPartitions for each "hive partition" in > sequence way, it would be faster if we can parallel this on driver side. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11517) Calc partitions in parallel for multiple partitions table
[ https://issues.apache.org/jira/browse/SPARK-11517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11517: Assignee: (was: Apache Spark) > Calc partitions in parallel for multiple partitions table > - > > Key: SPARK-11517 > URL: https://issues.apache.org/jira/browse/SPARK-11517 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: zhichao-li >Priority: Minor > > Currently we calculate the getPartitions for each "hive partition" in > sequence way, it would be faster if we can parallel this on driver side. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11517) Calc partitions in parallel for multiple partitions table
[ https://issues.apache.org/jira/browse/SPARK-11517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11517: Assignee: Apache Spark > Calc partitions in parallel for multiple partitions table > - > > Key: SPARK-11517 > URL: https://issues.apache.org/jira/browse/SPARK-11517 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: zhichao-li >Assignee: Apache Spark >Priority: Minor > > Currently we calculate the getPartitions for each "hive partition" in > sequence way, it would be faster if we can parallel this on driver side. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11517) Calc partitions in parallel for multiple partitions table
zhichao-li created SPARK-11517: -- Summary: Calc partitions in parallel for multiple partitions table Key: SPARK-11517 URL: https://issues.apache.org/jira/browse/SPARK-11517 Project: Spark Issue Type: Improvement Components: SQL Reporter: zhichao-li Priority: Minor Currently we calculate getPartitions for each "hive partition" sequentially; it would be faster to parallelize this on the driver side.
[jira] [Created] (SPARK-11516) Spark application cannot be found from JSON API even though it exists
Matt Cheah created SPARK-11516: -- Summary: Spark application cannot be found from JSON API even though it exists Key: SPARK-11516 URL: https://issues.apache.org/jira/browse/SPARK-11516 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.5.1, 1.4.1 Reporter: Matt Cheah I'm running a Spark standalone cluster on my mac and playing with the JSON API. I start both the master and the worker daemons, then start a Spark shell. When I hit {code} http://localhost:8080/api/v1/applications {code} I get back {code} [ { "id" : "app-20151104181347-", "name" : "Spark shell", "attempts" : [ { "startTime" : "2015-11-05T02:13:47.980GMT", "endTime" : "1969-12-31T23:59:59.999GMT", "sparkUser" : "mcheah", "completed" : false } ] } ] {code} But when I hit {code} http://localhost:8080/api/v1/applications/app-20151104181347-/executors {code} To look for executor data for the job, I get back {code} no such app: app-20151104181347- {code} even though the application exists. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11453) append data to partitioned table will messes up the result
[ https://issues.apache.org/jira/browse/SPARK-11453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990951#comment-14990951 ] Apache Spark commented on SPARK-11453: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/9482 > append data to partitioned table will messes up the result > -- > > Key: SPARK-11453 > URL: https://issues.apache.org/jira/browse/SPARK-11453 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10785) Scale QuantileDiscretizer using distributed binning
[ https://issues.apache.org/jira/browse/SPARK-10785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990948#comment-14990948 ] holdenk commented on SPARK-10785: - So looking at the tree work, it looks like it just did a groupByKey for each column index, which isn't very useful if we've only got a single column (although it's quite possible I misread some of that). I can do something useful with just a single column (sort the RDD and use the sorted RDD for the quantiles) if that sounds like what we are looking for? > Scale QuantileDiscretizer using distributed binning > --- > > Key: SPARK-10785 > URL: https://issues.apache.org/jira/browse/SPARK-10785 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > [SPARK-10064] improves binning in decision trees by distributing the > computation. QuantileDiscretizer should do the same.
[jira] [Commented] (SPARK-10785) Scale QuantileDiscretizer using distributed binning
[ https://issues.apache.org/jira/browse/SPARK-10785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990930#comment-14990930 ] Joseph K. Bradley commented on SPARK-10785: --- Yes, we should sample still. Extensions to multiple input columns is a different issue, and we can do that later if needed. Other than that, this should be analogous to the tree work. > Scale QuantileDiscretizer using distributed binning > --- > > Key: SPARK-10785 > URL: https://issues.apache.org/jira/browse/SPARK-10785 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > [SPARK-10064] improves binning in decision trees by distributing the > computation. QuantileDiscretizer should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
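The single-column approach discussed in these comments (sample the data, sort, then read split points off the sorted sample) can be sketched as follows. This is plain Python rather than Spark code, with illustrative names; in Spark the sort would be a distributed sortByKey and the sample would come from RDD.sample:

```python
import random

def quantile_bin_boundaries(values, num_bins, sample_fraction=1.0, seed=42):
    """Approximate bin boundaries: sample, sort, read off quantile positions.

    Assumes the sample is non-empty; returns num_bins - 1 split points.
    """
    rng = random.Random(seed)  # seeded, per the SPARK-11515 discussion
    sample = [v for v in values if rng.random() < sample_fraction]
    sample.sort()  # stands in for a distributed sortByKey in Spark
    return [sample[int(len(sample) * i / num_bins)] for i in range(1, num_bins)]

data = list(range(100))
print(quantile_bin_boundaries(data, 4))  # prints [25, 50, 75]
```

With sample_fraction below 1.0 the boundaries become approximate, which is the usual trade-off that makes the binning scale: only the (much smaller) sample needs to be sorted and scanned.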
[jira] [Created] (SPARK-11515) QuantileDiscretizer should take random seed
Joseph K. Bradley created SPARK-11515: - Summary: QuantileDiscretizer should take random seed Key: SPARK-11515 URL: https://issues.apache.org/jira/browse/SPARK-11515 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Priority: Minor QuantileDiscretizer takes a random sample to select bins. It currently does not specify a seed for the XORShiftRandom, but it should take a seed by extending the HasSeed Param. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10745) Separate configs between shuffle and RPC
[ https://issues.apache.org/jira/browse/SPARK-10745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10745: Assignee: Apache Spark > Separate configs between shuffle and RPC > > > Key: SPARK-10745 > URL: https://issues.apache.org/jira/browse/SPARK-10745 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Apache Spark > > SPARK-6028 uses network module to implement RPC. However, there are some > configurations named with `spark.shuffle` prefix in the network module. We > should refactor them and make sure the user can control them in shuffle and > RPC separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10745) Separate configs between shuffle and RPC
[ https://issues.apache.org/jira/browse/SPARK-10745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10745: Assignee: (was: Apache Spark) > Separate configs between shuffle and RPC > > > Key: SPARK-10745 > URL: https://issues.apache.org/jira/browse/SPARK-10745 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu > > SPARK-6028 uses network module to implement RPC. However, there are some > configurations named with `spark.shuffle` prefix in the network module. We > should refactor them and make sure the user can control them in shuffle and > RPC separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10745) Separate configs between shuffle and RPC
[ https://issues.apache.org/jira/browse/SPARK-10745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990926#comment-14990926 ] Apache Spark commented on SPARK-10745: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/9481 > Separate configs between shuffle and RPC > > > Key: SPARK-10745 > URL: https://issues.apache.org/jira/browse/SPARK-10745 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu > > SPARK-6028 uses network module to implement RPC. However, there are some > configurations named with `spark.shuffle` prefix in the network module. We > should refactor them and make sure the user can control them in shuffle and > RPC separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7425) spark.ml Predictor should support other numeric types for label
[ https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990917#comment-14990917 ] Joseph K. Bradley commented on SPARK-7425: -- The VectorUDT usage for features should be a separate issue from this JIRA (which is for the label). > spark.ml Predictor should support other numeric types for label > --- > > Key: SPARK-7425 > URL: https://issues.apache.org/jira/browse/SPARK-7425 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > Labels: starter > > Currently, the Predictor abstraction expects the input labelCol type to be > DoubleType, but we should support other numeric types. This will involve > updating the PredictorParams.validateAndTransformSchema method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
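For the label column, the change amounts to widening the schema check from "label must be DoubleType" to "label may be any numeric type", followed by a cast to double. A plain-Python sketch of that validation logic (Spark's actual PredictorParams.validateAndTransformSchema API differs; the type names and schema representation here are illustrative):

```python
# Schema-validation sketch: accept any numeric label type and record the
# cast to double, instead of requiring DoubleType outright. Schemas are
# modeled as simple column-name -> type-name dicts for illustration.

NUMERIC_TYPES = {"byte", "short", "int", "long", "float", "double", "decimal"}

def validate_and_transform_schema(schema, label_col="label"):
    dtype = schema.get(label_col)
    if dtype is None:
        raise ValueError(f"column {label_col!r} not found in schema")
    if dtype not in NUMERIC_TYPES:
        raise TypeError(f"label column must be numeric, got {dtype!r}")
    # Downstream training code can now rely on the label being a double.
    transformed = dict(schema)
    transformed[label_col] = "double"
    return transformed

print(validate_and_transform_schema({"label": "int", "features": "vector"}))
# prints {'label': 'double', 'features': 'vector'}
```

Non-numeric labels are rejected up front, which keeps the error at schema-validation time instead of surfacing later as a cast failure during fitting.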
[jira] [Issue Comment Deleted] (SPARK-9465) Could not read parquet table after recreating it with the same table name
[ https://issues.apache.org/jira/browse/SPARK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-9465: -- Comment: was deleted (was: I can not recreate the issue on 1.5.1 or 1.6.0.. {code} scala> sqlContext.sql("create table parquet_table stored as parquet as select 1") scala> sqlContext.sql("select * from parquet_table").show +---+ |_c0| +---+ | 1| +---+ scala> val df = sqlContext.sql("select * from parquet_a") scala> df.show +---+ |_c0| +---+ | 2| +---+ scala> df.write.mode(SaveMode.Overwrite).saveAsTable("parquet_table") scala> sqlContext.sql("select * from parquet_table").show +---+ |_c0| +---+ | 2| +---+ {code} The data from `parquet_a` overwrites the data in `parquet_table`) > Could not read parquet table after recreating it with the same table name > - > > Key: SPARK-9465 > URL: https://issues.apache.org/jira/browse/SPARK-9465 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: StanZhai > > I'am using SparkSQL in Spark 1.4.1. 
I encounter an error when using parquet > table after recreating it, we can reproduce the error as following: > {code} > // hc is an instance of HiveContext > hc.sql("select * from b").show() // this is ok and b is a parquet > table > val df = hc.sql("select * from a") > df.write.mode(SaveMode.Overwrite).saveAsTable("b") > hc.sql("select * from b").show() // got error > {code} > The error is: > {code} > java.io.FileNotFoundException: File does not exist: > /user/hive/warehouse/test.db/b/part-r-4-3abcbb07-e20a-4b5e-a6e5-59356c3d3149.gz.parquet > > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65) > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1716) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1659) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1639) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1613) > > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497) > > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322) > > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1144) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1132) > at > org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1182) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:218) > > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:214) > > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:214) > > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:20
[jira] [Commented] (SPARK-9465) Could not read parquet table after recreating it with the same table name
[ https://issues.apache.org/jira/browse/SPARK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990909#comment-14990909 ] Xin Wu commented on SPARK-9465: --- I cannot reproduce the issue on 1.5.1 or 1.6.0. {code} scala> sqlContext.sql("create table parquet_table stored as parquet as select 1") scala> sqlContext.sql("select * from parquet_table").show +---+ |_c0| +---+ | 1| +---+ scala> val df = sqlContext.sql("select * from parquet_a") scala> df.show +---+ |_c0| +---+ | 2| +---+ scala> df.write.mode(SaveMode.Overwrite).saveAsTable("parquet_table") scala> sqlContext.sql("select * from parquet_table").show +---+ |_c0| +---+ | 2| +---+ {code} The data from `parquet_a` overwrites the data in `parquet_table`. > Could not read parquet table after recreating it with the same table name > - > > Key: SPARK-9465 > URL: https://issues.apache.org/jira/browse/SPARK-9465 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: StanZhai > > I'm using Spark SQL in Spark 1.4.1. 
I encounter an error when using parquet > table after recreating it, we can reproduce the error as following: > {code} > // hc is an instance of HiveContext > hc.sql("select * from b").show() // this is ok and b is a parquet > table > val df = hc.sql("select * from a") > df.write.mode(SaveMode.Overwrite).saveAsTable("b") > hc.sql("select * from b").show() // got error > {code} > The error is: > {code} > java.io.FileNotFoundException: File does not exist: > /user/hive/warehouse/test.db/b/part-r-4-3abcbb07-e20a-4b5e-a6e5-59356c3d3149.gz.parquet > > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65) > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1716) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1659) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1639) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1613) > > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497) > > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322) > > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1144) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1132) > at > org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1182) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:218) > > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:214) > > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:214) > > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(Dist
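The symptom in the original report (a FileNotFoundException on an old part-file right after an overwrite) is what a stale cached file listing produces: the overwrite deletes the old part-files while a cached relation still points at them. A minimal pure-Scala sketch of that failure mode, with hypothetical names that are not Spark's actual classes:

```scala
// Hypothetical metadata cache illustrating the stale-listing failure mode.
import scala.collection.mutable

final class TableCatalog {
  private val files = mutable.Map.empty[String, List[String]]          // table -> part-files on "disk"
  private val cachedListings = mutable.Map.empty[String, List[String]] // table -> cached listing

  def create(table: String, parts: List[String]): Unit = files(table) = parts

  // The first read caches the listing, like a cached parquet relation would.
  def scan(table: String): List[String] = {
    val listing = cachedListings.getOrElseUpdate(table, files(table))
    listing.map { f =>
      if (!files(table).contains(f)) throw new java.io.FileNotFoundException(f)
      f
    }
  }

  // Overwrite replaces the part-files but, buggily, does not invalidate the cache.
  def overwrite(table: String, parts: List[String]): Unit = files(table) = parts

  // The refreshTable-style fix: drop the cached listing.
  def refresh(table: String): Unit = cachedListings.remove(table)
}

val cat = new TableCatalog
cat.create("b", List("part-1.parquet"))
cat.scan("b")                                  // caches the listing
cat.overwrite("b", List("part-2.parquet"))
val failed =
  try { cat.scan("b"); false }
  catch { case _: java.io.FileNotFoundException => true }
// failed: the stale cache points at a deleted file
cat.refresh("b")
val recovered = cat.scan("b") == List("part-2.parquet")
```

In Spark itself, `HiveContext.refreshTable("b")` is the API for dropping cached table metadata; whether it works around this particular 1.4.1 report would need verification against that code path.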
[jira] [Commented] (SPARK-9722) Pass random seed to spark.ml RandomForest findSplitsBins
[ https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990899#comment-14990899 ] Joseph K. Bradley commented on SPARK-9722: -- Great, thank you! > Pass random seed to spark.ml RandomForest findSplitsBins > > > Key: SPARK-9722 > URL: https://issues.apache.org/jira/browse/SPARK-9722 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yu Ishikawa >Priority: Trivial > Fix For: 1.6.0 > > > Trees use XORShiftRandom when binning continuous features. Currently, they > use a fixed seed of 1. They should accept a random seed param and use that > instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10371) Optimize sequential projections
[ https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10371: Assignee: Apache Spark > Optimize sequential projections > --- > > Key: SPARK-10371 > URL: https://issues.apache.org/jira/browse/SPARK-10371 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark > > In ML pipelines, each transformer/estimator appends new columns to the input > DataFrame. For example, it might produce DataFrames like the following > columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), > and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c > and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used. > It would be nice to detect this pattern and re-use intermediate values. > {code} > val input = sqlContext.range(10) > val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * > 2) > output.explain(true) > == Parsed Logical Plan == > 'Project [*,('x * 2) AS y#254] > Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Analyzed Logical Plan == > id: bigint, x: bigint, y: bigint > Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L] > Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Optimized Logical Plan == > Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Physical Plan == > TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS > y#254L] > Scan PhysicalRDD[id#252L] > Code Generation: true > input: org.apache.spark.sql.DataFrame = [id: bigint] > output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint] > {code} -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10371) Optimize sequential projections
[ https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990894#comment-14990894 ] Apache Spark commented on SPARK-10371: -- User 'nongli' has created a pull request for this issue: https://github.com/apache/spark/pull/9480 > Optimize sequential projections > --- > > Key: SPARK-10371 > URL: https://issues.apache.org/jira/browse/SPARK-10371 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > > In ML pipelines, each transformer/estimator appends new columns to the input > DataFrame. For example, it might produce DataFrames like the following > columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), > and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c > and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used. > It would be nice to detect this pattern and re-use intermediate values. 
> {code} > val input = sqlContext.range(10) > val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * > 2) > output.explain(true) > == Parsed Logical Plan == > 'Project [*,('x * 2) AS y#254] > Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Analyzed Logical Plan == > id: bigint, x: bigint, y: bigint > Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L] > Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Optimized Logical Plan == > Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Physical Plan == > TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS > y#254L] > Scan PhysicalRDD[id#252L] > Code Generation: true > input: org.apache.spark.sql.DataFrame = [id: bigint] > output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
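The duplication is visible in the optimized plan quoted above: `(id#252L + 1)` appears once as `x#253L` and again inside `y#254L`, so an expensive UDF in that position would run twice per row. A pure-Scala sketch of the cost difference (the counter and names are illustrative, not Spark's optimizer):

```scala
// Counts how often the "expensive" step runs when it is inlined into every
// dependent column (as in the optimized plan) versus computed once and reused.
var expensiveCalls = 0
def udfB(a: Long): Long = { expensiveCalls += 1; a + 1 }   // stands in for (id + 1)
def udfC(b: Long): Long = b * 2                            // stands in for (x * 2)

val id = 10L

// Inlined plan: x = udfB(id), y = udfC(udfB(id)) -- udfB evaluated twice.
expensiveCalls = 0
val xInlined = udfB(id)
val yInlined = udfC(udfB(id))
val inlinedCalls = expensiveCalls

// Reused plan: compute x once and derive y from it -- udfB evaluated once.
expensiveCalls = 0
val x = udfB(id)
val y = udfC(x)
val reusedCalls = expensiveCalls
```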
[jira] [Assigned] (SPARK-10371) Optimize sequential projections
[ https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10371: Assignee: (was: Apache Spark) > Optimize sequential projections > --- > > Key: SPARK-10371 > URL: https://issues.apache.org/jira/browse/SPARK-10371 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > > In ML pipelines, each transformer/estimator appends new columns to the input > DataFrame. For example, it might produce DataFrames like the following > columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), > and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c > and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used. > It would be nice to detect this pattern and re-use intermediate values. > {code} > val input = sqlContext.range(10) > val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * > 2) > output.explain(true) > == Parsed Logical Plan == > 'Project [*,('x * 2) AS y#254] > Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Analyzed Logical Plan == > id: bigint, x: bigint, y: bigint > Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L] > Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Optimized Logical Plan == > Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L] > LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30 > == Physical Plan == > TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS > y#254L] > Scan PhysicalRDD[id#252L] > Code Generation: true > input: org.apache.spark.sql.DataFrame = [id: bigint] > output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9465) Could not read parquet table after recreating it with the same table name
[ https://issues.apache.org/jira/browse/SPARK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990890#comment-14990890 ] Xin Wu commented on SPARK-9465: --- I tried both 1.5.1 and 1.6.0, and I cannot reproduce the issue: {code} scala> sqlContext.sql("create table parquet_table stored as parquet as select 1") scala> sqlContext.sql("select * from parquet_table").show +---+ |_c0| +---+ | 1| +---+ scala> sqlContext.sql("create table parquet_d stored as parquet as select 10 ") scala> val df = sqlContext.sql("select * from parquet_d") scala> df.show 15/11/04 17:14:05 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl +---+ |_c0| +---+ | 10| +---+ scala> df.write.mode(SaveMode.Overwrite).saveAsTable("parquet_table") scala> sqlContext.sql("select * from parquet_table").show 15/11/04 17:15:26 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl +---+ |_c0| +---+ | 10| +---+ {code} I first create `parquet_table` with one record of value `1`, then create `parquet_d` with one record of value `10`, and overwrite `parquet_table` with the DataFrame of `parquet_d`. The select from `parquet_table` returns the correct result. > Could not read parquet table after recreating it with the same table name > - > > Key: SPARK-9465 > URL: https://issues.apache.org/jira/browse/SPARK-9465 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: StanZhai > > I'm using Spark SQL in Spark 1.4.1. 
I encounter an error when using parquet > table after recreating it, we can reproduce the error as following: > {code} > // hc is an instance of HiveContext > hc.sql("select * from b").show() // this is ok and b is a parquet > table > val df = hc.sql("select * from a") > df.write.mode(SaveMode.Overwrite).saveAsTable("b") > hc.sql("select * from b").show() // got error > {code} > The error is: > {code} > java.io.FileNotFoundException: File does not exist: > /user/hive/warehouse/test.db/b/part-r-4-3abcbb07-e20a-4b5e-a6e5-59356c3d3149.gz.parquet > > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65) > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1716) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1659) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1639) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1613) > > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497) > > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322) > > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1144) > at > o
[jira] [Resolved] (SPARK-11307) Reduce memory consumption of OutputCommitCoordinator bookkeeping structures
[ https://issues.apache.org/jira/browse/SPARK-11307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-11307. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9274 [https://github.com/apache/spark/pull/9274] > Reduce memory consumption of OutputCommitCoordinator bookkeeping structures > --- > > Key: SPARK-11307 > URL: https://issues.apache.org/jira/browse/SPARK-11307 > Project: Spark > Issue Type: Bug > Components: Scheduler >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.6.0 > > > OutputCommitCoordinator uses a map in a place where an array would suffice, > increasing its memory consumption for result stages with millions of tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
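The fix described above is a classic dense-key specialization: task/partition indices for a stage run from 0 to numPartitions - 1, so a hash map keyed by them can be replaced with an array indexed by them. A hedged sketch of the idea, not the actual OutputCommitCoordinator code (it assumes non-negative attempt ids so -1 can mark "unclaimed"):

```scala
// Per-partition "authorized committer" bookkeeping: which attempt may commit.
import scala.collection.mutable

final class CommitterMap {
  private val m = mutable.Map.empty[Int, Long]              // partition -> attempt
  def canCommit(part: Int, attempt: Long): Boolean =
    m.get(part) match {
      case Some(a) => a == attempt                          // only the first claimant may commit
      case None    => m(part) = attempt; true
    }
}

final class CommitterArray(numPartitions: Int) {
  private val a = Array.fill(numPartitions)(-1L)            // -1 = no committer yet
  def canCommit(part: Int, attempt: Long): Boolean =
    if (a(part) != -1L) a(part) == attempt
    else { a(part) = attempt; true }
}

// Both structures answer identically; the array avoids per-entry boxing and
// hash-table overhead across millions of result-stage partitions.
val byMap = new CommitterMap
val byArray = new CommitterArray(numPartitions = 3)
val trace = for (part <- 0 until 3; attempt <- Seq(5L, 6L, 5L))
  yield (byMap.canCommit(part, attempt), byArray.canCommit(part, attempt))
val equivalent = trace.forall { case (l, r) => l == r }
```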
[jira] [Resolved] (SPARK-11398) unnecessary def dialectClassName in HiveContext, and misleading dialect conf at the start of spark-sql
[ https://issues.apache.org/jira/browse/SPARK-11398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-11398. Resolution: Fixed Fix Version/s: 1.6.0 > unnecessary def dialectClassName in HiveContext, and misleading dialect conf > at the start of spark-sql > -- > > Key: SPARK-11398 > URL: https://issues.apache.org/jira/browse/SPARK-11398 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhenhua Wang >Priority: Minor > Fix For: 1.6.0 > > > 1. def dialectClassName in HiveContext is unnecessary. > In HiveContext, if conf.dialect == "hiveql", getSQLDialect() will return new > HiveQLDialect(this); > else it will use super.getSQLDialect(). Then in super.getSQLDialect(), it > calls dialectClassName, which is overridden in HiveContext and still returns > super.dialectClassName. > So we'll never reach the code "classOf[HiveQLDialect].getCanonicalName" of > def dialectClassName in HiveContext. > 2. When we start bin/spark-sql, the default context is HiveContext, and the > corresponding dialect is hiveql. > However, if we type "set spark.sql.dialect;", the result is "sql", which is > inconsistent with the actual dialect and is misleading. For example, we can > use SQL like "create table" which is only allowed in hiveql, but this dialect > conf shows it's "sql". > Although this problem will not cause any execution error, it's misleading to > Spark SQL users. Therefore I think we should fix it. > In this PR, instead of overriding def dialect in conf of HiveContext, I set > the SQLConf.DIALECT directly as "hiveql", such that the result of "set > spark.sql.dialect;" will be "hiveql", not "sql". After the change, we can > still use "sql" as the dialect in HiveContext through "set > spark.sql.dialect=sql". Then the conf.dialect in HiveContext will become sql. > Because in SQLConf, def dialect = getConf(), and now the dialect in > "settings" becomes "sql". 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
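The inconsistency described above can be sketched without Spark: if the real dialect is only implied by the context type and never written into the settings map, the conf getter falls back to the generic default. A toy model of the before/after behavior (class and key names simplified, not the real SQLConf):

```scala
// Minimal conf with a settings map and a default fallback, like SQLConf.dialect.
final class Conf(defaultDialect: String) {
  private val settings = scala.collection.mutable.Map.empty[String, String]
  def set(k: String, v: String): Unit = settings(k) = v
  def dialect: String = settings.getOrElse("spark.sql.dialect", defaultDialect)
}

// Before the fix: the Hive-flavored context never writes its real dialect into
// settings, so "set spark.sql.dialect;" reports the generic default.
val before = new Conf(defaultDialect = "sql")
val misleading = before.dialect                 // "sql", though hiveql is in effect

// After the fix: write "hiveql" into settings at construction instead of
// overriding the getter, so the reported value matches reality.
val after = new Conf(defaultDialect = "sql")
after.set("spark.sql.dialect", "hiveql")
val truthful = after.dialect                    // "hiveql"
after.set("spark.sql.dialect", "sql")           // users can still switch back
val switched = after.dialect
```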
[jira] [Commented] (SPARK-9722) Pass random seed to spark.ml RandomForest findSplitsBins
[ https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990877#comment-14990877 ] Yu Ishikawa commented on SPARK-9722: [~josephkb] I'll add a seed Param to {{DecisionTreeClassifier}} and {{DecisionTreeRegressor}}. > Pass random seed to spark.ml RandomForest findSplitsBins > > > Key: SPARK-9722 > URL: https://issues.apache.org/jira/browse/SPARK-9722 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yu Ishikawa >Priority: Trivial > Fix For: 1.6.0 > > > Trees use XORShiftRandom when binning continuous features. Currently, they > use a fixed seed of 1. They should accept a random seed param and use that > instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11512) Bucket Join
[ https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990868#comment-14990868 ] Cheng Hao commented on SPARK-11512: --- We need to support "bucket" in the Data Source API. > Bucket Join > --- > > Key: SPARK-11512 > URL: https://issues.apache.org/jira/browse/SPARK-11512 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao > > Sort merge join on two datasets on the file system that have already been > partitioned the same way with the same number of partitions and sorted within > each partition, so we don't need to sort again while joining on the > sorted/partitioned keys > This functionality exists in > - Hive (hive.optimize.bucketmapjoin.sortedmerge) > - Pig (USING 'merge') > - MapReduce (CompositeInputFormat) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11512) Bucket Join
[ https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990867#comment-14990867 ] Cheng Hao commented on SPARK-11512: --- Oh, yes, but SPARK-5292 is only about supporting Hive buckets; more generically, we need to add bucket support to the Data Source API. Anyway, I will add a link to that JIRA issue. > Bucket Join > --- > > Key: SPARK-11512 > URL: https://issues.apache.org/jira/browse/SPARK-11512 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao > > Sort merge join on two datasets on the file system that have already been > partitioned the same way with the same number of partitions and sorted within > each partition, so we don't need to sort again while joining on the > sorted/partitioned keys > This functionality exists in > - Hive (hive.optimize.bucketmapjoin.sortedmerge) > - Pig (USING 'merge') > - MapReduce (CompositeInputFormat) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
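For reference, the per-partition merge that makes bucketed joins cheap: when both sides are already sorted on the join key, an inner join is a linear pass with no shuffle or re-sort. A pure-Scala sketch of that merge step (assumes both inputs are sorted by key; duplicate keys are handled by emitting the cross product of each equal-key run):

```scala
// O(n + m) inner merge join of two inputs already sorted by key -- the core
// of a bucketed sort-merge join, run independently per matching bucket pair.
def mergeJoin[A, B](left: Seq[(Int, A)], right: Seq[(Int, B)]): Seq[(Int, A, B)] = {
  val out = Seq.newBuilder[(Int, A, B)]
  var i = 0; var j = 0
  while (i < left.length && j < right.length) {
    val k = left(i)._1
    if (k < right(j)._1) i += 1
    else if (k > right(j)._1) j += 1
    else {
      // Gather the run of equal keys on each side, emit their cross product.
      val i0 = i; val j0 = j
      while (i < left.length && left(i)._1 == k) i += 1
      while (j < right.length && right(j)._1 == k) j += 1
      for (x <- i0 until i; y <- j0 until j)
        out += ((k, left(x)._2, right(y)._2))
    }
  }
  out.result()
}

val l = Seq(1 -> "a", 2 -> "b", 2 -> "c", 4 -> "d")   // sorted by key
val r = Seq(2 -> 10, 3 -> 20, 4 -> 30)                // sorted by key
val joined = mergeJoin(l, r)
```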
[jira] [Updated] (SPARK-9722) Pass random seed to spark.ml RandomForest findSplitsBins
[ https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9722: - Summary: Pass random seed to spark.ml RandomForest findSplitsBins (was: Pass random seed to spark.ml DecisionTree*) > Pass random seed to spark.ml RandomForest findSplitsBins > > > Key: SPARK-9722 > URL: https://issues.apache.org/jira/browse/SPARK-9722 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yu Ishikawa >Priority: Trivial > Fix For: 1.6.0 > > > Trees use XORShiftRandom when binning continuous features. Currently, they > use a fixed seed of 1. They should accept a random seed param and use that > instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9722) Pass random seed to spark.ml DecisionTree*
[ https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990866#comment-14990866 ] Joseph K. Bradley commented on SPARK-9722: -- [~yuu.ishik...@gmail.com] Thanks for the PR! Sorry I was slow to get to it. Could you please add a seed Param to DecisionTreeClassifier and DecisionTreeRegressor? I'll create a new JIRA and link it here. I hope we can squeeze it into 1.6. If we can't, then we should check to make sure XORShiftRandom behaves nicely when given a seed of 0 (which is the behavior after this PR's JIRA). > Pass random seed to spark.ml DecisionTree* > -- > > Key: SPARK-9722 > URL: https://issues.apache.org/jira/browse/SPARK-9722 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yu Ishikawa >Priority: Trivial > Fix For: 1.6.0 > > > Trees use XORShiftRandom when binning continuous features. Currently, they > use a fixed seed of 1. They should accept a random seed param and use that > instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11514) Pass random seed to spark.ml DecisionTree*
Joseph K. Bradley created SPARK-11514: - Summary: Pass random seed to spark.ml DecisionTree* Key: SPARK-11514 URL: https://issues.apache.org/jira/browse/SPARK-11514 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Assignee: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
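The seed-of-0 caution in the comment above is concrete for xorshift-family generators: a raw xorshift state of 0 is a fixed point, so every output stays 0, which is why implementations typically hash or otherwise adjust the seed before use. A plain-Scala illustration using Marsaglia's xorshift64 step (not Spark's actual XORShiftRandom):

```scala
// Raw xorshift64 step: state 0 maps to 0 forever, so a seed of 0 must be
// avoided (or hashed first, as robust implementations do).
def xorshift64(s0: Long): Long = {
  var x = s0
  x ^= x << 13
  x ^= x >>> 7
  x ^= x << 17
  x
}

def stream(seed: Long, n: Int): Seq[Long] =
  Iterator.iterate(seed)(xorshift64).drop(1).take(n).toSeq

val degenerate = stream(0L, 5)     // all zeros: the generator is stuck
val healthy = stream(1L, 5)        // nonzero, varying outputs
```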
[jira] [Commented] (SPARK-11512) Bucket Join
[ https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990861#comment-14990861 ] Marcelo Vanzin commented on SPARK-11512: Isn't this the same as in SPARK-5292? > Bucket Join > --- > > Key: SPARK-11512 > URL: https://issues.apache.org/jira/browse/SPARK-11512 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao > > Sort merge join on two datasets on the file system that have already been > partitioned the same with the same number of partitions and sorted within > each partition, and we don't need to sort it again while join with the > sorted/partitioned keys > This functionality exists in > - Hive (hive.optimize.bucketmapjoin.sortedmerge) > - Pig (USING 'merge') > - MapReduce (CompositeInputFormat) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11491) Use Scala 2.10.5
[ https://issues.apache.org/jira/browse/SPARK-11491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-11491. - Resolution: Fixed Fix Version/s: 1.6.0 > Use Scala 2.10.5 > > > Key: SPARK-11491 > URL: https://issues.apache.org/jira/browse/SPARK-11491 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > Fix For: 1.6.0 > > > Spark should build against Scala 2.10.5, since that includes a fix for > Scaladoc: https://issues.scala-lang.org/browse/SI-8479 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11513) Remove the internal implicit conversion from LogicalPlan to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-11513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11513: Assignee: Reynold Xin (was: Apache Spark) > Remove the internal implicit conversion from LogicalPlan to DataFrame > - > > Key: SPARK-11513 > URL: https://issues.apache.org/jira/browse/SPARK-11513 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > DataFrame has an internal implicit conversion that turns a LogicalPlan into a > DataFrame. This has been fairly confusing to a few new contributors. Since it > doesn't buy us much, we should just remove that implicit conversion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11513) Remove the internal implicit conversion from LogicalPlan to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-11513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11513: Assignee: Apache Spark (was: Reynold Xin) > Remove the internal implicit conversion from LogicalPlan to DataFrame > - > > Key: SPARK-11513 > URL: https://issues.apache.org/jira/browse/SPARK-11513 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > DataFrame has an internal implicit conversion that turns a LogicalPlan into a > DataFrame. This has been fairly confusing to a few new contributors. Since it > doesn't buy us much, we should just remove that implicit conversion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11513) Remove the internal implicit conversion from LogicalPlan to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-11513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990853#comment-14990853 ] Apache Spark commented on SPARK-11513: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/9479 > Remove the internal implicit conversion from LogicalPlan to DataFrame > - > > Key: SPARK-11513 > URL: https://issues.apache.org/jira/browse/SPARK-11513 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > DataFrame has an internal implicit conversion that turns a LogicalPlan into a > DataFrame. This has been fairly confusing to a few new contributors. Since it > doesn't buy us much, we should just remove that implicit conversion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11513) Remove the internal implicit conversion from LogicalPlan to DataFrame
Reynold Xin created SPARK-11513: --- Summary: Remove the internal implicit conversion from LogicalPlan to DataFrame Key: SPARK-11513 URL: https://issues.apache.org/jira/browse/SPARK-11513 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin DataFrame has an internal implicit conversion that turns a LogicalPlan into a DataFrame. This has been fairly confusing to a few new contributors. Since it doesn't buy us much, we should just remove that implicit conversion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
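The confusion described above is easy to reproduce in miniature: with an implicit conversion in scope, code that produces one type reads as if it produced the other, and the conversion is invisible at the call site. A toy sketch with stand-in types (not Spark's real LogicalPlan/DataFrame):

```scala
import scala.language.implicitConversions

// Toy stand-ins for LogicalPlan and DataFrame.
final case class LogicalPlan(desc: String)
final case class DataFrame(plan: LogicalPlan) { def show(): String = plan.desc }

object WithImplicit {
  // The kind of internal conversion SPARK-11513 removes.
  implicit def toDF(p: LogicalPlan): DataFrame = DataFrame(p)

  def project(p: LogicalPlan): LogicalPlan = LogicalPlan(s"Project(${p.desc})")

  // Reads as if project returned a DataFrame -- the conversion is invisible,
  // which is what trips up new contributors.
  def confusing(): String = project(LogicalPlan("scan")).show()
}

object WithoutImplicit {
  def project(p: LogicalPlan): LogicalPlan = LogicalPlan(s"Project(${p.desc})")
  // After removal, the wrapping is spelled out at each call site.
  def explicit(): String = DataFrame(project(LogicalPlan("scan"))).show()
}

val viaImplicit = WithImplicit.confusing()
val viaExplicit = WithoutImplicit.explicit()
```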
[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990851#comment-14990851 ] Abhishek commented on SPARK-10309: -- Is there any workaround for this issue? We migrated from 1.1 to 1.5 and our jobs depend heavily on joins. I have been trying to get rid of this exception, but no luck. Could someone at least point to where in the code the issue might be? I tried doing a few joins on DataFrames instead of through the SQL context, but that also didn't help. Sometimes the job succeeds (roughly 5% of the time). > Some tasks failed with Unable to acquire memory > --- > > Key: SPARK-10309 > URL: https://issues.apache.org/jira/browse/SPARK-10309 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Davies Liu > > *=== Update ===* > This is caused by a mismatch between > `Runtime.getRuntime.availableProcessors()` and the number of active tasks in > `ShuffleMemoryManager`. A quick reproduction is the following: > {code} > // My machine only has 8 cores > $ bin/spark-shell --master local[32] > scala> val df = sc.parallelize(Seq((1, 1), (2, 2))).toDF("a", "b") > scala> df.as("x").join(df.as("y"), $"x.a" === $"y.a").count() > Caused by: java.io.IOException: Unable to acquire 2097152 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:120) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$2.apply(sort.scala:143) > at > 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$2.apply(sort.scala:143) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.prepare(MapPartitionsWithPreparationRDD.scala:50) > {code} > *=== Original ===* > While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on > executor): > {code} > java.io.IOException: Unable to acquire 33554432 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The task could finish after a retry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
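The mismatch described in the update above can be illustrated with a small, self-contained model. This is plain Python, not Spark's internals: the function names and the safety factor of 4 are assumptions for illustration. The point it demonstrates is that the page size is derived from the number of physical cores, while each task's memory guarantee shrinks with the number of *active* tasks, so over-subscribing cores (e.g. `local[32]` on an 8-core machine) can make the very first page request exceed a task's guaranteed share.

```python
# Hypothetical model of the SPARK-10309 mismatch. Names and the safety
# factor are illustrative assumptions, not Spark's real internals.

def default_page_size(max_memory, cores, safety_factor=4):
    # Page size is derived from *available cores* (availableProcessors).
    return max_memory // (cores * safety_factor)

def per_task_guarantee(max_memory, active_tasks):
    # Each active task is only guaranteed 1/(2N) of the shuffle pool.
    return max_memory // (2 * active_tasks)

POOL = 512 * 1024 * 1024  # a 512 MB shuffle memory pool, for illustration
page = default_page_size(POOL, cores=8)

# 8 tasks on 8 cores: a fresh page fits within the per-task guarantee.
assert page <= per_task_guarantee(POOL, active_tasks=8)

# local[32] on 8 cores: the guarantee shrinks below one page, so the very
# first page acquisition can fail with "Unable to acquire ... bytes".
assert page > per_task_guarantee(POOL, active_tasks=32)
```

The model also explains why retries sometimes succeed: the number of simultaneously active tasks fluctuates, so a retried task may run when fewer peers are competing for the pool.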
[jira] [Created] (SPARK-11512) Bucket Join
Cheng Hao created SPARK-11512: - Summary: Bucket Join Key: SPARK-11512 URL: https://issues.apache.org/jira/browse/SPARK-11512 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Support sort-merge join on two datasets on the file system that have already been partitioned identically (same partitioning, same number of partitions) and sorted within each partition, so that we don't need to sort again while joining on the sorted/partitioned keys. This functionality exists in: - Hive (hive.optimize.bucketmapjoin.sortedmerge) - Pig (USING 'merge') - MapReduce (CompositeInputFormat) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
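The optimization being proposed can be sketched in plain Python (an illustrative sketch, not Spark's implementation): when both inputs are already partitioned the same way and sorted on the join key, each pair of co-located partitions can be joined with a single forward scan and no re-sort.

```python
# Illustrative merge join over two inputs pre-sorted by key, the core of
# what a bucketed sort-merge join avoids re-doing. Plain Python sketch.

def merge_join(left, right):
    """Join two lists of (key, value) pairs, both already sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of the runs sharing this key.
            j0 = j
            while j < len(right) and right[j][0] == lk:
                out.append((lk, left[i][1], right[j][1]))
                j += 1
            i += 1
            if i < len(left) and left[i][0] == lk:
                j = j0  # rewind for the next left row with the same key
    return out
```

Because both sides advance monotonically, the join is a single O(n + m) pass per partition pair; the expensive step a non-bucketed plan pays, sorting each side first, is skipped entirely.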
[jira] [Resolved] (SPARK-11510) Remove some SQL aggregation tests
[ https://issues.apache.org/jira/browse/SPARK-11510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-11510. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9475 [https://github.com/apache/spark/pull/9475] > Remove some SQL aggregation tests > - > > Key: SPARK-11510 > URL: https://issues.apache.org/jira/browse/SPARK-11510 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.6.0 > > > We have some aggregate function tests in both DataFrameAggregateSuite and > SQLQuerySuite. The two have almost the same coverage and we should just > remove the SQL one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10387) Code generation for decision tree
[ https://issues.apache.org/jira/browse/SPARK-10387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990809#comment-14990809 ] holdenk commented on SPARK-10387: - Progress - although I'm a little uncertain what the best API is for this. I'm thinking we need to choose one of these: 1) Pick a threshold number of trees above which we skip codegen 2) Make #1 configurable through the Spark context 3) Provide an explicit codeGen or toCodeGen on the model which the user can call 4) Something else entirely. What are people's thoughts? I'm probably going to proceed with #1 for now and just try to get something working end to end as a starting point, but I'd appreciate people's thoughts on how best to expose this functionality. > Code generation for decision tree > - > > Key: SPARK-10387 > URL: https://issues.apache.org/jira/browse/SPARK-10387 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: DB Tsai > > Provide code generation for decision tree and tree ensembles. Let's first > discuss the design and then create new JIRAs for tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6001) K-Means clusterer should return the assignments of input points to clusters
[ https://issues.apache.org/jira/browse/SPARK-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-6001. -- Resolution: Fixed Assignee: Yu Ishikawa Fix Version/s: 1.5.0 Yep, thanks for pinging! > K-Means clusterer should return the assignments of input points to clusters > --- > > Key: SPARK-6001 > URL: https://issues.apache.org/jira/browse/SPARK-6001 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.1 >Reporter: Derrick Burns >Assignee: Yu Ishikawa >Priority: Minor > Fix For: 1.5.0 > > > The K-Means clusterer returns a KMeansModel that contains the cluster > centers. However, when available, I suggest that the K-Means clusterer also > return an RDD of the assignments of the input data to the clusters. While the > assignments can be computed given the KMeansModel, why not return assignments > if they are available to save re-computation costs. > The K-means implementation at > https://github.com/derrickburns/generalized-kmeans-clustering returns the > assignments when available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
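For context on why returning the assignments saves work: they are just a nearest-center lookup over the fitted centers, which is the same pass the final training iteration already performs. A minimal plain-Python sketch (illustrative, not the MLlib code; `assign` corresponds to what `KMeansModel.predict` computes per point):

```python
# Nearest-center assignment: the computation the reporter wants returned
# alongside the fitted model instead of being recomputed by the caller.

def assign(points, centers):
    """Map each point to the index of its nearest center (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]

centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.0, 11.0), (1.0, -1.0)]
assert assign(points, centers) == [0, 1, 0]
```

Recomputing this after training costs another full pass over the input (and, in Spark, another job over the RDD), which is exactly the re-computation the issue proposes to avoid.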
[jira] [Closed] (SPARK-7332) RpcCallContext.sender has a different name from the original sender's name
[ https://issues.apache.org/jira/browse/SPARK-7332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu closed SPARK-7332. --- Resolution: Won't Fix They are internal APIs and not exposed to the user. > RpcCallContext.sender has a different name from the original sender's name > -- > > Key: SPARK-7332 > URL: https://issues.apache.org/jira/browse/SPARK-7332 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Qiping Li >Assignee: Shixiong Zhu >Priority: Critical > > In the function {{receiveAndReply}} of {{RpcEndpoint}}, we get the sender of > the received message through {{context.sender}}. But this doesn't work > because we don't get the right {{RpcEndpointRef}}. It's name is different > from the original sender's name, so the path is different. > Here is the code to test it: > {code} > case class Greeting(who: String) > class GreetingActor(override val rpcEnv: RpcEnv) extends RpcEndpoint with > Logging { > override def receiveAndReply(context: RpcCallContext) : > PartialFunction[Any, Unit] = { > case Greeting(who) => > logInfo("Hello " + who) > logInfo(s"${context.sender.name}") > } > } > class ToSend(override val rpcEnv: RpcEnv, greeting: RpcEndpointRef) extends > RpcEndpoint with Logging { > override def onStart(): Unit = { > logInfo(s"${self.name}") > greeting.ask(Greeting("Charlie Parker")) > } > } > object RpcEndpointNameTest { > def main(args: Array[String]): Unit = { > val actorSystemName = "driver" > val conf = new SparkConf > val rpcEnv = RpcEnv.create(actorSystemName, "localhost", 0, conf, new > SecurityManager(conf)) > val greeter = rpcEnv.setupEndpoint("greeter", new GreetingActor(rpcEnv)) > rpcEnv.setupEndpoint("toSend", new ToSend(rpcEnv, greeter)) > } > } > {code} > The result was: > {code} > toSend > Hello Charlie Parker > $a > {code} > I test the above code using akka with the following code: > {code} > case class Greeting(who: String) > class GreetingActor extends Actor with 
ActorLogging { > def receive = { > case Greeting(who) => > println("Hello " + who) > println(s"${sender.path} ${sender.path.name}") > } > } > class ToSend(greeting: ActorRef) extends Actor with ActorLogging { > override def preStart(): Unit = { > println(s"${self.path} ${self.path.name}") > greeting ! Greeting("Charlie Parker") > } > def receive = { > case _ => > log.info("here") > } > } > object HelloWorld { > def main(args: Array[String]): Unit = { > val system = ActorSystem("MySystem") > val greeter = system.actorOf(Props[GreetingActor], name = "greeter") > println(s"${greeter.path} ${greeter.path.name}") > val system2 = ActorSystem("MySystem2") > system2.actorOf(Props(classOf[ToSend], greeter), name = "toSend_2") > } > } > {code} > And the result was: > {code} > akka://MySystem/user/greeter greeter > akka://MySystem2/user/toSend_2 toSend_2 > Hello Charlie Parker > akka://MySystem2/user/toSend_2 toSend_2 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990795#comment-14990795 ] Reynold Xin commented on SPARK-11303: - This made it into 1.5.2. > sample (without replacement) + filter returns wrong results in DataFrame > > > Key: SPARK-11303 > URL: https://issues.apache.org/jira/browse/SPARK-11303 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: pyspark local mode, linux. >Reporter: Yuval Tanny >Assignee: Yanbo Liang > Fix For: 1.5.2, 1.6.0 > > > When sampling and then filtering DataFrame from python, we get inconsistent > result when not caching the sampled DataFrame. This bug doesn't appear in > spark 1.4.1. > {code} > d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t']) > d_sampled = d.sample(False, 0.1, 1) > print d_sampled.count() > print d_sampled.filter('t = 1').count() > print d_sampled.filter('t != 1').count() > d_sampled.cache() > print d_sampled.count() > print d_sampled.filter('t = 1').count() > print d_sampled.filter('t != 1').count() > {code} > output: > {code} > 14 > 7 > 8 > 14 > 7 > 7 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11103) Parquet filters push-down may cause exception when schema merging is turned on
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990793#comment-14990793 ] Reynold Xin commented on SPARK-11103: - I think this was included in 1.5.2 > Parquet filters push-down may cause exception when schema merging is turned on > -- > > Key: SPARK-11103 > URL: https://issues.apache.org/jira/browse/SPARK-11103 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Dominic Ricard >Assignee: Hyukjin Kwon >Priority: Blocker > Fix For: 1.5.2, 1.6.0 > > > When evolving a schema in parquet files, spark properly expose all columns > found in the different parquet files but when trying to query the data, it is > not possible to apply a filter on a column that is not present in all files. > To reproduce: > *SQL:* > {noformat} > create table `table1` STORED AS PARQUET LOCATION > 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`; > create table `table2` STORED AS PARQUET LOCATION > 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as > `col2`; > create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path > "hdfs://:/path/to/table"); > select col1 from `table3` where col2 = 2; > {noformat} > The last select will output the following Stack Trace: > {noformat} > An error occurred when executing the SQL command: > select col1 from `table3` where col2 = 2 > [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: > 0, SQL state: TStatus(statusCode:ERROR_STATUS, > infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException: > Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, > most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, > 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not > found in schema! 
> at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59) > at > org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40) > at > org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126) > at > org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46) > at > org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160) > at > org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) > at > org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155) > at > org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Ta
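The failure mode above can be modeled in a few lines of plain Python (illustrative names, not Parquet's actual validator): the merged table schema does contain col2, but the pushed-down predicate is validated against each individual file's schema, and the file behind table1 predates col2.

```python
# Hypothetical model of the SchemaCompatibilityValidator failure: a filter
# column present in the merged schema but absent from one file's schema.

def validate_pushdown(filter_cols, file_schema):
    """Raise, like Parquet's validator, if a filter column is missing."""
    for col in filter_cols:
        if col not in file_schema:
            raise ValueError("Column [%s] was not found in schema!" % col)

file1_schema = {"col1"}          # written before the schema evolved
file2_schema = {"col1", "col2"}  # written after col2 was added

merged_schema = file1_schema | file2_schema
assert "col2" in merged_schema   # the table-level schema does see col2

validate_pushdown({"col2"}, file2_schema)  # fine for the newer file
try:
    validate_pushdown({"col2"}, file1_schema)  # reproduces the exception
    raised = False
except ValueError:
    raised = True
assert raised
```

As a mitigation (not stated in this ticket, but a known Spark SQL knob), filter push-down can be disabled entirely with `spark.sql.parquet.filterPushdown=false`, at the cost of reading more row groups.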
[jira] [Updated] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-11303: Fix Version/s: 1.5.2 > sample (without replacement) + filter returns wrong results in DataFrame > > > Key: SPARK-11303 > URL: https://issues.apache.org/jira/browse/SPARK-11303 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: pyspark local mode, linux. >Reporter: Yuval Tanny >Assignee: Yanbo Liang > Fix For: 1.5.2, 1.6.0 > > > When sampling and then filtering DataFrame from python, we get inconsistent > result when not caching the sampled DataFrame. This bug doesn't appear in > spark 1.4.1. > {code} > d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t']) > d_sampled = d.sample(False, 0.1, 1) > print d_sampled.count() > print d_sampled.filter('t = 1').count() > print d_sampled.filter('t != 1').count() > d_sampled.cache() > print d_sampled.count() > print d_sampled.filter('t = 1').count() > print d_sampled.filter('t != 1').count() > {code} > output: > {code} > 14 > 7 > 8 > 14 > 7 > 7 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6521) executors in the same node read local shuffle file
[ https://issues.apache.org/jira/browse/SPARK-6521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990791#comment-14990791 ] Apache Spark commented on SPARK-6521: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/9478 > executors in the same node read local shuffle file > -- > > Key: SPARK-6521 > URL: https://issues.apache.org/jira/browse/SPARK-6521 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 1.2.0 >Reporter: xukun > > Previously, an executor read another executor's shuffle files on the same node > over the network. This PR makes executors on the same node read shuffle files > locally in sort-based shuffle, which reduces network transport. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10648) Spark-SQL JDBC fails to set a default precision and scale when they are not defined in an oracle schema.
[ https://issues.apache.org/jira/browse/SPARK-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990785#comment-14990785 ] Yin Huai commented on SPARK-10648: -- https://github.com/apache/spark/pull/8780#issuecomment-145598968 and https://github.com/apache/spark/pull/8780#issuecomment-144541760 have the workaround. > Spark-SQL JDBC fails to set a default precision and scale when they are not > defined in an oracle schema. > > > Key: SPARK-10648 > URL: https://issues.apache.org/jira/browse/SPARK-10648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: using oracle 11g, ojdbc7.jar >Reporter: Travis Hegner > > Using oracle 11g as a datasource with ojdbc7.jar. When importing data into a > scala app, I am getting an exception "Overflowed precision". Some times I > would get the exception "Unscaled value too large for precision". > This issue likely affects older versions as well, but this was the version I > verified it on. > I narrowed it down to the fact that the schema detection system was trying to > set the precision to 0, and the scale to -127. > I have a proposed pull request to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
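A hedged sketch of the kind of fix the reporter proposes (plain Python; the fallback of precision 38, scale 10 is an assumption for illustration, not necessarily the pull request's choice): when Oracle reports an undeclared NUMBER column as precision 0 / scale -127 through JDBC, substitute a bounded default rather than constructing an invalid decimal type.

```python
# Illustrative mapping from JDBC-reported (precision, scale) to a valid
# decimal type. The (38, 10) fallback is a hypothetical default.

MAX_PRECISION = 38  # Spark SQL's DecimalType precision limit

def decimal_type(precision, scale):
    """Return a (precision, scale) pair that is always representable."""
    if precision == 0:                 # undeclared Oracle NUMBER column
        return (MAX_PRECISION, 10)     # assumed fallback, for illustration
    return (min(precision, MAX_PRECISION), max(scale, 0))

# The pathological values from the report no longer overflow:
assert decimal_type(0, -127) == (38, 10)
# Ordinary declared columns pass through unchanged:
assert decimal_type(10, 2) == (10, 2)
```

The key invariant is that no input can yield precision 0 or a negative scale, which is what triggered the "Overflowed precision" and "Unscaled value too large for precision" errors.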
[jira] [Commented] (SPARK-7542) Support off-heap sort buffer in UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990761#comment-14990761 ] Apache Spark commented on SPARK-7542: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/9477 > Support off-heap sort buffer in UnsafeExternalSorter > > > Key: SPARK-7542 > URL: https://issues.apache.org/jira/browse/SPARK-7542 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Josh Rosen >Assignee: Davies Liu > > {{UnsafeExternalSorter}}, introduced in SPARK-7081, uses on-heap {{long[]}} > arrays as its sort buffers. When records are small, the sorting array might > be as large as the data pages, so it would be useful to be able to allocate > this array off-heap (using our unsafe LongArray). Unfortunately, we can't > currently do this because TimSort calls {{allocate()}} to create data buffers > but doesn't call any corresponding cleanup methods to free them. > We should look into extending TimSort with buffer freeing methods, then > consider switching to LongArray in UnsafeShuffleSortDataFormat. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7542) Support off-heap sort buffer in UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-7542: - Assignee: Davies Liu > Support off-heap sort buffer in UnsafeExternalSorter > > > Key: SPARK-7542 > URL: https://issues.apache.org/jira/browse/SPARK-7542 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Josh Rosen >Assignee: Davies Liu > > {{UnsafeExternalSorter}}, introduced in SPARK-7081, uses on-heap {{long[]}} > arrays as its sort buffers. When records are small, the sorting array might > be as large as the data pages, so it would be useful to be able to allocate > this array off-heap (using our unsafe LongArray). Unfortunately, we can't > currently do this because TimSort calls {{allocate()}} to create data buffers > but doesn't call any corresponding cleanup methods to free them. > We should look into extending TimSort with buffer freeing methods, then > consider switching to LongArray in UnsafeShuffleSortDataFormat. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
[ https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990746#comment-14990746 ] Andrew Davidson commented on SPARK-11509: - I forgot to mention: on my cluster master I was able to run bin/pyspark --master local[2] without any problems. I was able to access sc without any issues. > ipython notebooks do not work on clusters created using > spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script > -- > > Key: SPARK-11509 > URL: https://issues.apache.org/jira/browse/SPARK-11509 > Project: Spark > Issue Type: Bug > Components: Documentation, EC2, PySpark >Affects Versions: 1.5.1 > Environment: AWS cluster > [ec2-user@ip-172-31-29-60 ~]$ uname -a > Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 > SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Andrew Davidson > > I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local mac. > I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an aws cluster. I am > able to run the java SparkPi example on the cluster; however, I am not able to > run ipython notebooks on the cluster. 
(I connect using an ssh tunnel.) > According to the 1.5.1 getting started doc > http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell > The following should work > PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook > --no-browser --port=7000" /root/spark/bin/pyspark > I am able to connect to the notebook server and start a notebook; however: > bug 1) the default sparkContext does not exist > from pyspark import SparkContext > textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") > textFile.take(3) > --- > NameError Traceback (most recent call last) > in () > 1 from pyspark import SparkContext > > 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") > 3 textFile.take(3) > NameError: name 'sc' is not defined > bug 2) > If I create a SparkContext I get the following Python version mismatch > error > sc = SparkContext("local", "Simple App") > textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") > textFile.take(3) > File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main > ("%d.%d" % sys.version_info[:2], version)) > Exception: Python in worker has different version 2.7 than that in driver > 2.6, PySpark cannot run with different minor versions > I am able to run ipython notebooks on my local Mac as follows. 
(by default > you would get an error that the driver and works are using different version > of python) > $ cat ~/bin/pySparkNotebook.sh > #!/bin/sh > set -x # turn debugging on > #set +x # turn debugging off > export PYSPARK_PYTHON=python3 > export PYSPARK_DRIVER_PYTHON=python3 > IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*$ > I have spent a lot of time trying to debug the pyspark script however I can > not figure out what the problem is > Please let me know if there is something I can do to help > Andy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
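The version mismatch in bug 2 is usually resolved by pointing the driver and the workers at the same interpreter before the SparkContext is created. A minimal sketch of that pattern (the interpreter path is an illustrative assumption; `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` are the environment variables PySpark consults):

```python
# Pin driver and worker interpreters to the same Python before creating
# the SparkContext. The path below is a placeholder for illustration.

import os

def pin_python(interpreter="/usr/bin/python2.7"):
    os.environ["PYSPARK_PYTHON"] = interpreter          # workers
    os.environ["PYSPARK_DRIVER_PYTHON"] = interpreter   # driver / notebook
    return (os.environ["PYSPARK_PYTHON"],
            os.environ["PYSPARK_DRIVER_PYTHON"])

worker, driver = pin_python()
# These are the two versions the exception message compares.
assert worker == driver
```

On an EC2 cluster this only helps if the named interpreter actually exists at the same path on every worker, which matches the reporter's plan to install the same Python on all machines.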
[jira] [Commented] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
[ https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990742#comment-14990742 ] Andrew Davidson commented on SPARK-11509: - Yes, it appears the show-stopper issue I am facing is that the Python versions do not match. However, on my local Mac I was able to figure out how to get everything to match. That technique does not work on a Spark cluster. I tried a lot of hacking but cannot seem to get the versions to match. I plan to install Python 3 on all machines; maybe that will work better. Kind regards, Andy > ipython notebooks do not work on clusters created using > spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script > -- > > Key: SPARK-11509 > URL: https://issues.apache.org/jira/browse/SPARK-11509 > Project: Spark > Issue Type: Bug > Components: Documentation, EC2, PySpark >Affects Versions: 1.5.1 > Environment: AWS cluster > [ec2-user@ip-172-31-29-60 ~]$ uname -a > Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 > SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Andrew Davidson > > I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local mac. > I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an aws cluster. I am > able to run the java SparkPi example on the cluster; however, I am not able to > run ipython notebooks on the cluster. 
(I connect using an ssh tunnel.) > According to the 1.5.1 getting started doc > http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell > The following should work > PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook > --no-browser --port=7000" /root/spark/bin/pyspark > I am able to connect to the notebook server and start a notebook; however: > bug 1) the default sparkContext does not exist > from pyspark import SparkContext > textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") > textFile.take(3) > --- > NameError Traceback (most recent call last) > in () > 1 from pyspark import SparkContext > > 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") > 3 textFile.take(3) > NameError: name 'sc' is not defined > bug 2) > If I create a SparkContext I get the following Python version mismatch > error > sc = SparkContext("local", "Simple App") > textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") > textFile.take(3) > File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main > ("%d.%d" % sys.version_info[:2], version)) > Exception: Python in worker has different version 2.7 than that in driver > 2.6, PySpark cannot run with different minor versions > I am able to run ipython notebooks on my local Mac as follows. 
(by default > you would get an error that the driver and works are using different version > of python) > $ cat ~/bin/pySparkNotebook.sh > #!/bin/sh > set -x # turn debugging on > #set +x # turn debugging off > export PYSPARK_PYTHON=python3 > export PYSPARK_DRIVER_PYTHON=python3 > IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*$ > I have spent a lot of time trying to debug the pyspark script however I can > not figure out what the problem is > Please let me know if there is something I can do to help > Andy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10028) Add Python API for PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-10028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10028. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9469 [https://github.com/apache/spark/pull/9469] > Add Python API for PrefixSpan > - > > Key: SPARK-10028 > URL: https://issues.apache.org/jira/browse/SPARK-10028 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Reporter: Yanbo Liang >Assignee: Yu Ishikawa > Fix For: 1.6.0 > > > Add Python API for mllib.fpm.PrefixSpan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11459) Allow configuring checkpoint dir, filenames
[ https://issues.apache.org/jira/browse/SPARK-11459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990676#comment-14990676 ] Ryan Williams commented on SPARK-11459: --- I'm mostly interested in saving RDDs to disk with kryo-serde ([SPARK-11461|https://issues.apache.org/jira/browse/SPARK-11461]). The existing checkpoint APIs are functionally exactly what I want, but they mandate putting a UUID in the directory name and fixing the basename to the RDD ID, somewhat unnecessarily. Letting the user opt in to specifying the path is a simple way to get at the functionality that I want without having to do something possibly more invasive e.g. for SPARK-11461, and there's not really a danger of it conflicting with existing checkpoint usages. It could even be exposed via a different method on SparkContext/RDD, if overloading the semantics of {{checkpoint}} is the concern. Another, orthogonal option I've worked on a little is basically copy/pasting a bunch of the checkpointing logic into a Spark package that hangs methods off of SparkContext and RDD that do [checkpointing with configurable path naming], but that's an unideal use-case for Spark packages: it is a bunch of code that's already in Spark, that I'd have to keep up to date, etc. > Allow configuring checkpoint dir, filenames > --- > > Key: SPARK-11459 > URL: https://issues.apache.org/jira/browse/SPARK-11459 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > I frequently want to persist some RDDs to disk and choose the names of the > files that they are saved as. 
> Currently, the {{RDD.checkpoint}} flow [writes to a directory with a UUID in > its > name|https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/SparkContext.scala#L2050], > and the file is [always named after the RDD's > ID|https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/ReliableRDDCheckpointData.scala#L96]. > Is there any reason not to allow the user to e.g. pass a string to > {{RDD.checkpoint}} that will set the location that the RDD is checkpointed to? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
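The naming behavior the ticket describes can be sketched in plain Python. This is an illustrative reconstruction of the path logic the two linked source lines describe (a random-UUID subdirectory chosen by {{SparkContext.setCheckpointDir}}, and data always named after the RDD's ID by {{ReliableRDDCheckpointData}}); the function name and structure here are assumptions for illustration, not Spark's actual code:

```python
import os
import uuid

def checkpoint_path(checkpoint_root, rdd_id):
    """Hypothetical mirror of how Spark 1.5 derives a checkpoint location.

    The user configures only checkpoint_root; the UUID segment and the
    "rdd-<id>" basename are chosen by Spark and cannot be overridden.
    """
    # setCheckpointDir appends a randomly generated UUID directory
    uuid_dir = os.path.join(checkpoint_root, str(uuid.uuid4()))
    # the checkpointed data is always named after the RDD's ID
    return os.path.join(uuid_dir, "rdd-%d" % rdd_id)

path = checkpoint_path("/tmp/checkpoints", 42)
print(path)  # e.g. /tmp/checkpoints/<random-uuid>/rdd-42
```

The ticket's request amounts to letting the caller supply the final path directly instead of deriving it this way.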
[jira] [Commented] (SPARK-11499) Spark History Server UI should respect protocol when doing redirection
[ https://issues.apache.org/jira/browse/SPARK-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990674#comment-14990674 ] Lukasz Jastrzebski commented on SPARK-11499: There is also the https://en.wikipedia.org/wiki/X-Forwarded-For header that the History Server could support. > Spark History Server UI should respect protocol when doing redirection > -- > > Key: SPARK-11499 > URL: https://issues.apache.org/jira/browse/SPARK-11499 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Lukasz Jastrzebski > > Use case: > The Spark History Server is behind a load balancer secured with an SSL certificate. > Unfortunately, clicking on an application link redirects to the http protocol, > which may not be exposed by the load balancer. Example flow: > * Trying 52.22.220.1... > * Connected to xxx.yyy.com (52.22.220.1) port 8775 (#0) > * WARNING: SSL: Certificate type not set, assuming PKCS#12 format. > * Client certificate: u...@yyy.com > * TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 > * Server certificate: *.yyy.com > * Server certificate: Entrust Certification Authority - L1K > * Server certificate: Entrust Root Certification Authority - G2 > > GET /history/20151030-160604-3039174572-5951-22401-0004 HTTP/1.1 > > Host: xxx.yyy.com:8775 > > User-Agent: curl/7.43.0 > > Accept: */* > > > < HTTP/1.1 302 Found > < Location: > http://xxx.yyy.com:8775/history/20151030-160604-3039174572-5951-22401-0004 > < Connection: close > < Server: Jetty(8.y.z-SNAPSHOT) > < > * Closing connection 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
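The conventional fix for services behind an SSL-terminating load balancer is to build the redirect's Location from the X-Forwarded-Proto header the balancer sets, rather than from the server's own (plain-http) listener. A minimal sketch of that idea, assuming a dict-like headers object; this is not the actual Jetty or History Server code:

```python
def redirect_location(host, path, headers):
    """Choose the redirect scheme from X-Forwarded-Proto when present
    (set by an SSL-terminating proxy/load balancer), else default to http.
    Illustrative sketch only."""
    scheme = headers.get("X-Forwarded-Proto", "http")
    return "%s://%s%s" % (scheme, host, path)

# Behind the load balancer in the report, the redirect would stay on https:
print(redirect_location("xxx.yyy.com:8775",
                        "/history/20151030-160604-3039174572-5951-22401-0004",
                        {"X-Forwarded-Proto": "https"}))
```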
[jira] [Commented] (SPARK-11511) Creating an InputDStream but not using it throws NPE
[ https://issues.apache.org/jira/browse/SPARK-11511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990639#comment-14990639 ] Apache Spark commented on SPARK-11511: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/9476 > Creating an InputDStream but not using it throws NPE > > > Key: SPARK-11511 > URL: https://issues.apache.org/jira/browse/SPARK-11511 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu > > If an InputDStream is not used, its rememberDuration will be null and > DStreamGraph.getMaxInputStreamRememberDuration will throw an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11511) Creating an InputDStream but not using it throws NPE
[ https://issues.apache.org/jira/browse/SPARK-11511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11511: Assignee: (was: Apache Spark) > Creating an InputDStream but not using it throws NPE > > > Key: SPARK-11511 > URL: https://issues.apache.org/jira/browse/SPARK-11511 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu > > If an InputDStream is not used, its rememberDuration will be null and > DStreamGraph.getMaxInputStreamRememberDuration will throw an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11511) Creating an InputDStream but not using it throws NPE
[ https://issues.apache.org/jira/browse/SPARK-11511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11511: Assignee: Apache Spark > Creating an InputDStream but not using it throws NPE > > > Key: SPARK-11511 > URL: https://issues.apache.org/jira/browse/SPARK-11511 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Apache Spark > > If an InputDStream is not used, its rememberDuration will be null and > DStreamGraph.getMaxInputStreamRememberDuration will throw an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11511) Creating an InputDStream but not using it throws NPE
Shixiong Zhu created SPARK-11511: Summary: Creating an InputDStream but not using it throws NPE Key: SPARK-11511 URL: https://issues.apache.org/jira/browse/SPARK-11511 Project: Spark Issue Type: Bug Components: Streaming Reporter: Shixiong Zhu If an InputDStream is not used, its rememberDuration will be null and DStreamGraph.getMaxInputStreamRememberDuration will throw an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
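The defensive shape of the fix, aggregating over only the initialized remember durations instead of assuming every input stream has one, can be sketched as follows. This is an assumed illustration of the idea behind PR #9476, in Python rather than Scala, with None standing in for Scala's null:

```python
def max_input_stream_remember_duration(remember_durations):
    """Return the largest remember duration among input streams, skipping
    streams whose duration was never initialized (None here, null in Scala).
    Returns None when no stream has a duration, instead of raising."""
    initialized = [d for d in remember_durations if d is not None]
    return max(initialized) if initialized else None

print(max_input_stream_remember_duration([None, 10, 20]))  # 20
print(max_input_stream_remember_duration([None]))          # None
```

The naive version, `max(remember_durations)`, is what trips over the uninitialized (null) entry described in the report.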
[jira] [Assigned] (SPARK-11510) Remove some SQL aggregation tests
[ https://issues.apache.org/jira/browse/SPARK-11510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11510: Assignee: Apache Spark (was: Reynold Xin) > Remove some SQL aggregation tests > - > > Key: SPARK-11510 > URL: https://issues.apache.org/jira/browse/SPARK-11510 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > We have some aggregate function tests in both DataFrameAggregateSuite and > SQLQuerySuite. The two have almost the same coverage and we should just > remove the SQL one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11510) Remove some SQL aggregation tests
[ https://issues.apache.org/jira/browse/SPARK-11510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11510: Assignee: Reynold Xin (was: Apache Spark) > Remove some SQL aggregation tests > - > > Key: SPARK-11510 > URL: https://issues.apache.org/jira/browse/SPARK-11510 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > We have some aggregate function tests in both DataFrameAggregateSuite and > SQLQuerySuite. The two have almost the same coverage and we should just > remove the SQL one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11510) Remove some SQL aggregation tests
[ https://issues.apache.org/jira/browse/SPARK-11510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990615#comment-14990615 ] Apache Spark commented on SPARK-11510: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/9475 > Remove some SQL aggregation tests > - > > Key: SPARK-11510 > URL: https://issues.apache.org/jira/browse/SPARK-11510 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > We have some aggregate function tests in both DataFrameAggregateSuite and > SQLQuerySuite. The two have almost the same coverage and we should just > remove the SQL one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11493) Remove Bitset in BytesToBytesMap
[ https://issues.apache.org/jira/browse/SPARK-11493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-11493. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9452 [https://github.com/apache/spark/pull/9452] > Remove Bitset in BytesToBytesMap > > > Key: SPARK-11493 > URL: https://issues.apache.org/jira/browse/SPARK-11493 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 1.6.0 >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.6.0 > > > Since we have 4 bytes for the number of records at the beginning of a page, > the address cannot be zero, so we do not need the bitset. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
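The idea in the description, a sentinel value replacing a separate occupancy bitset, can be shown with a toy hash-slot array. Because every record address in a page points past the 4-byte record-count header, a valid address is never 0, so 0 itself can mark an empty slot. This class is a simplified illustration under that assumption, not BytesToBytesMap's actual layout:

```python
class SlotArray:
    """Toy open-addressing slot array: 0 marks an empty slot, so no
    separate bitset is needed to track which slots are occupied."""
    def __init__(self, capacity):
        self.slots = [0] * capacity  # all empty

    def put(self, slot, address):
        # valid addresses skip the 4-byte header, so they are never 0
        assert address != 0, "0 is reserved as the empty-slot marker"
        self.slots[slot] = address

    def is_defined(self, slot):
        # occupancy check without any auxiliary bitset
        return self.slots[slot] != 0
```

The saving is one occupancy bit per slot plus the bookkeeping to keep the bitset in sync with the slot array.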
[jira] [Created] (SPARK-11510) Remove some SQL aggregation tests
Reynold Xin created SPARK-11510: --- Summary: Remove some SQL aggregation tests Key: SPARK-11510 URL: https://issues.apache.org/jira/browse/SPARK-11510 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin We have some aggregate function tests in both DataFrameAggregateSuite and SQLQuerySuite. The two have almost the same coverage and we should just remove the SQL one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11459) Allow configuring checkpoint dir, filenames
[ https://issues.apache.org/jira/browse/SPARK-11459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11459: -- Priority: Minor (was: Major) What's the use case for this? You can already control the directory, but what Spark puts in it is an implementation detail you don't generally want to rely on. > Allow configuring checkpoint dir, filenames > --- > > Key: SPARK-11459 > URL: https://issues.apache.org/jira/browse/SPARK-11459 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > I frequently want to persist some RDDs to disk and choose the names of the > files that they are saved as. > Currently, the {{RDD.checkpoint}} flow [writes to a directory with a UUID in > its > name|https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/SparkContext.scala#L2050], > and the file is [always named after the RDD's > ID|https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/ReliableRDDCheckpointData.scala#L96]. > Is there any reason not to allow the user to e.g. pass a string to > {{RDD.checkpoint}} that will set the location that the RDD is checkpointed to? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
[ https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990600#comment-14990600 ] Sean Owen commented on SPARK-11509: --- This ultimately means the initialization failed. In this situation you have to dig in the logs to see why it did, but that's why sc isn't available. But I think you ended up finding the reason there: Python version mismatch, right? > ipython notebooks do not work on clusters created using > spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script > -- > > Key: SPARK-11509 > URL: https://issues.apache.org/jira/browse/SPARK-11509 > Project: Spark > Issue Type: Bug > Components: Documentation, EC2, PySpark >Affects Versions: 1.5.1 > Environment: AWS cluster > [ec2-user@ip-172-31-29-60 ~]$ uname -a > Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 > SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Andrew Davidson > > I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local Mac. > I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an AWS cluster. I am > able to run the Java SparkPi example on the cluster; however, I am not able to > run ipython notebooks on the cluster. (I connect using an ssh tunnel) > According to the 1.5.1 getting started doc > http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell > the following should work: > PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook > --no-browser --port=7000" /root/spark/bin/pyspark > I am able to connect to the notebook server and start a notebook; however: > bug 1) the default SparkContext does not exist > from pyspark import SparkContext > textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") > textFile.take(3) > --- > NameError Traceback (most recent call last) > in () > 1 from pyspark import SparkContext > > 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") > 3 textFile.take(3) > NameError: name 'sc' is not defined > bug 2) > If I create a SparkContext I get the following Python version mismatch error > sc = SparkContext("local", "Simple App") > textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") > textFile.take(3) > File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main > ("%d.%d" % sys.version_info[:2], version)) > Exception: Python in worker has different version 2.7 than that in driver > 2.6, PySpark cannot run with different minor versions > I am able to run ipython notebooks on my local Mac as follows. (by default > you would get an error that the driver and workers are using different versions > of Python) > $ cat ~/bin/pySparkNotebook.sh > #!/bin/sh > set -x # turn debugging on > #set +x # turn debugging off > export PYSPARK_PYTHON=python3 > export PYSPARK_DRIVER_PYTHON=python3 > IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*$ > I have spent a lot of time trying to debug the pyspark script; however, I cannot > figure out what the problem is. > Please let me know if there is something I can do to help. > Andy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
Andrew Davidson created SPARK-11509: --- Summary: ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script Key: SPARK-11509 URL: https://issues.apache.org/jira/browse/SPARK-11509 Project: Spark Issue Type: Bug Components: Documentation, EC2, PySpark Affects Versions: 1.5.1 Environment: AWS cluster [ec2-user@ip-172-31-29-60 ~]$ uname -a Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux Reporter: Andrew Davidson I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local Mac. I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an AWS cluster. I am able to run the Java SparkPi example on the cluster; however, I am not able to run ipython notebooks on the cluster. (I connect using an ssh tunnel) According to the 1.5.1 getting started doc http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell the following should work: PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7000" /root/spark/bin/pyspark I am able to connect to the notebook server and start a notebook; however: bug 1) the default SparkContext does not exist from pyspark import SparkContext textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") textFile.take(3) --- NameError Traceback (most recent call last) in () 1 from pyspark import SparkContext > 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") 3 textFile.take(3) NameError: name 'sc' is not defined bug 2) If I create a SparkContext I get the following Python version mismatch error sc = SparkContext("local", "Simple App") textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") textFile.take(3) File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main ("%d.%d" % sys.version_info[:2], version)) Exception: Python in worker has different version 2.7 than that in driver 2.6, PySpark cannot run with different minor versions I am able to run ipython notebooks on my local Mac as follows. (by default you would get an error that the driver and workers are using different versions of Python) $ cat ~/bin/pySparkNotebook.sh #!/bin/sh set -x # turn debugging on #set +x # turn debugging off export PYSPARK_PYTHON=python3 export PYSPARK_DRIVER_PYTHON=python3 IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*$ I have spent a lot of time trying to debug the pyspark script; however, I cannot figure out what the problem is. Please let me know if there is something I can do to help. Andy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
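The "different minor versions" exception quoted in the report comes from a version check PySpark performs when a worker starts. A sketch of that comparison, reconstructed from the quoted traceback; the function here is an illustration, not the literal pyspark/worker.py code, and the fix on the user's side is to set PYSPARK_PYTHON so driver and workers run the same interpreter:

```python
import sys

def check_python_versions(driver_version, worker_version):
    """Refuse to run when driver and worker Python minor versions differ,
    mirroring the check that produced the error in this report.
    Versions are "major.minor" strings, e.g. "2.7"."""
    if driver_version != worker_version:
        raise Exception(
            "Python in worker has different version %s than that in driver %s, "
            "PySpark cannot run with different minor versions"
            % (worker_version, driver_version))

driver = "%d.%d" % sys.version_info[:2]
check_python_versions(driver, driver)  # same version on both sides: no error
```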
[jira] [Assigned] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10788: Assignee: Apache Spark > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10788: Assignee: (was: Apache Spark) > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990576#comment-14990576 ] Apache Spark commented on SPARK-10788: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/9474 > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
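The subtraction in the issue's pseudomath, {{stats(B,C) = stats(A,B,C) - stats(A)}}, can be demonstrated with toy counts. The stat vectors and numbers below are invented for illustration; only the subtraction rule comes from the issue description:

```python
def right_side_stats(total_stats, left_stats):
    """Derive the stats for the right-hand subset of a split from the
    node-wide totals and the left-hand subset's stats, element-wise.
    Illustrates why only left-hand subsets need to be aggregated."""
    return [t - l for t, l in zip(total_stats, left_stats)]

node_total = [30, 12]  # e.g. per-label counts over categories A, B, C
left_A = [10, 2]       # aggregated counts for the left subset {A}
print(right_side_stats(node_total, left_A))  # counts for {B, C}: [20, 10]
```

With 3 categories this halves the collected statistics: 3 left-hand subsets plus the node totals, instead of all 6 subsets.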