[jira] [Created] (SPARK-3613) Don't record the size of each shuffle block for large jobs
Reynold Xin created SPARK-3613: -- Summary: Don't record the size of each shuffle block for large jobs Key: SPARK-3613 URL: https://issues.apache.org/jira/browse/SPARK-3613 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin MapStatus saves the size of each block (1 byte per block) for a particular map task. This actually means the shuffle metadata is O(M*R), where M = num maps and R = num reduces. If M is greater than a certain size, we should probably just send an average size instead of a whole array. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
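A minimal sketch of the idea, using a hypothetical `ApproxMapStatus` class (illustrative names and threshold, not Spark's real MapStatus API): below a cutoff the per-block sizes are kept exactly; above it only the average survives, shrinking per-map-task metadata from O(R) to O(1).

```java
// Illustrative sketch only: names, threshold and encoding are assumptions,
// not Spark's actual MapStatus implementation.
class ApproxMapStatus {
    static final int THRESHOLD = 2000;   // assumed cutoff for switching to an average
    private final byte[] exactSizes;     // null when approximated
    private final long avgSize;          // used when exactSizes == null

    ApproxMapStatus(long[] blockSizes) {
        if (blockSizes.length <= THRESHOLD) {
            // Small map: keep one (compressed) byte per reduce block.
            exactSizes = new byte[blockSizes.length];
            for (int i = 0; i < blockSizes.length; i++) {
                exactSizes[i] = compress(blockSizes[i]);
            }
            avgSize = 0;
        } else {
            // Large map: keep only the average block size.
            exactSizes = null;
            long sum = 0;
            for (long s : blockSizes) sum += s;
            avgSize = sum / blockSizes.length;
        }
    }

    // Spark packs each size into one byte with a log encoding; here we just
    // saturate at 255 to keep the sketch short.
    private static byte compress(long size) {
        return (byte) Math.min(size, 255);
    }

    long sizeFor(int reduceId) {
        return exactSizes != null ? (exactSizes[reduceId] & 0xFF) : avgSize;
    }

    // Test helper: build an array of n blocks, each of size v.
    static long[] filled(int n, long v) {
        long[] a = new long[n];
        java.util.Arrays.fill(a, v);
        return a;
    }
}
```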
[jira] [Commented] (SPARK-3612) Executor shouldn't quit if heartbeat message fails to reach the driver
[ https://issues.apache.org/jira/browse/SPARK-3612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141807#comment-14141807 ] Reynold Xin commented on SPARK-3612: [~andrewor14] [~sandyryza] any comment on this? I think you guys worked on this code. > Executor shouldn't quit if heartbeat message fails to reach the driver > -- > > Key: SPARK-3612 > URL: https://issues.apache.org/jira/browse/SPARK-3612 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Reynold Xin > > The thread started by Executor.startDriverHeartbeater can actually terminate > the whole executor if AkkaUtils.askWithReply[HeartbeatResponse] throws an > exception. > I don't think we should quit the executor this way. At the very least, we > would want to log a more meaningful exception then simply > {code} > 14/09/20 06:38:12 WARN AkkaUtils: Error sending message in 1 attempts > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379) > 14/09/20 06:38:45 WARN AkkaUtils: Error sending message in 2 attempts > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at 
scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379) > 14/09/20 06:39:18 WARN AkkaUtils: Error sending message in 3 attempts > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379) > 14/09/20 06:39:21 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception > in thread Thread[Driver Heartbeater,5,main] > org.apache.spark.SparkException: Error sending message [message = > Heartbeat(281,[Lscala.Tuple2;@4d9294db,BlockManagerId(281, > ip-172-31-7-55.eu-west-1.compute.internal, 52303))] > at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:190) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 > seconds] > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) > ... 
1 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3612) Executor shouldn't quit if heartbeat message fails to reach the driver
Reynold Xin created SPARK-3612: -- Summary: Executor shouldn't quit if heartbeat message fails to reach the driver Key: SPARK-3612 URL: https://issues.apache.org/jira/browse/SPARK-3612 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin The thread started by Executor.startDriverHeartbeater can actually terminate the whole executor if AkkaUtils.askWithReply[HeartbeatResponse] throws an exception. I don't think we should quit the executor this way. At the very least, we would want to log a more meaningful exception than simply {code} 14/09/20 06:38:12 WARN AkkaUtils: Error sending message in 1 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379) 14/09/20 06:38:45 WARN AkkaUtils: Error sending message in 2 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379) 14/09/20 06:39:18 WARN AkkaUtils: Error sending message in 3 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379) 14/09/20 06:39:21 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Driver Heartbeater,5,main] org.apache.spark.SparkException: Error sending message [message = Heartbeat(281,[Lscala.Tuple2;@4d9294db,BlockManagerId(281, ip-172-31-7-55.eu-west-1.compute.internal, 52303))] at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:190) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) ... 1 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
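One way to address this, sketched below with hypothetical names rather than Spark's actual Executor code, is to catch the send failure inside the heartbeat loop, log it, and keep the thread alive so a transient timeout cannot kill the executor:

```java
// Illustrative sketch: a heartbeat loop that logs failures and keeps running
// instead of letting the exception propagate to the uncaught-exception
// handler (which is what currently brings down the whole executor).
class Heartbeater {
    interface Sender { void send() throws Exception; }

    int failures = 0;
    int beats = 0;

    void run(Sender sender, int iterations) {
        for (int i = 0; i < iterations; i++) {
            try {
                sender.send();
                beats++;
            } catch (Exception e) {
                // Log and retry on the next interval; do not rethrow.
                failures++;
                System.err.println("WARN heartbeat failed: " + e.getMessage());
            }
        }
    }
}
```

With this structure a timeout costs one missed heartbeat, not the executor.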
[jira] [Created] (SPARK-3611) Show number of cores for each executor in application web UI
Matei Zaharia created SPARK-3611: Summary: Show number of cores for each executor in application web UI Key: SPARK-3611 URL: https://issues.apache.org/jira/browse/SPARK-3611 Project: Spark Issue Type: New Feature Components: Web UI Reporter: Matei Zaharia Priority: Minor This number is not always fully known, because e.g. in Mesos your executors can scale up and down in # of CPUs, but it would be nice to show at least the number of cores the machine has in that case, or the # of cores the executor has been configured with if known. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.
[ https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141573#comment-14141573 ] Apache Spark commented on SPARK-3606: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/2469 > Spark-on-Yarn AmIpFilter does not work with Yarn HA. > > > Key: SPARK-3606 > URL: https://issues.apache.org/jira/browse/SPARK-3606 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > > The current IP filter only considers one of the RMs in an HA setup. If the > active RM is not the configured one, you get a "connection refused" error > when clicking on the Spark AM links in the RM UI. > Similar to YARN-1811, but for Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3610) Unable to load app logs for MLLib programs in history server
SK created SPARK-3610: - Summary: Unable to load app logs for MLLib programs in history server Key: SPARK-3610 URL: https://issues.apache.org/jira/browse/SPARK-3610 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: SK Fix For: 1.1.0 The default log files for the MLlib examples use a rather long naming convention that includes special characters like parentheses and commas. For example, one of my log files is named "binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032". When I click on the program on the history server page (at port 18080) to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a Mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
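The sanitization the reporter asks for can be sketched with a small hypothetical helper (Spark's actual event-log naming code may differ): replace every character outside a safe set with a dash.

```java
// Illustrative helper, not Spark's real code: make an application name safe
// to use as an event-log file name by replacing special characters.
class LogNames {
    static String sanitize(String appName) {
        // Keep letters, digits, dot, underscore and dash; replace the rest.
        return appName.replaceAll("[^A-Za-z0-9._-]", "-");
    }
}
```

Applied to the file name from the report, the parentheses and commas all become dashes, which the history server can load.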
[jira] [Commented] (SPARK-2271) Use Hive's high performance Decimal128 to replace BigDecimal
[ https://issues.apache.org/jira/browse/SPARK-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141510#comment-14141510 ] Cheng Lian commented on SPARK-2271: --- [~pwendell] I can't find a Maven artifact for this. From the Hive JIRA Reynold pointed out, the {{Decimal128}} comes from Microsoft PolyBase, which I think is not open source. > Use Hive's high performance Decimal128 to replace BigDecimal > > > Key: SPARK-2271 > URL: https://issues.apache.org/jira/browse/SPARK-2271 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Cheng Lian > > Hive JIRA: https://issues.apache.org/jira/browse/HIVE-6017 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3609) Add sizeInBytes statistics to Limit operator
[ https://issues.apache.org/jira/browse/SPARK-3609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141493#comment-14141493 ] Apache Spark commented on SPARK-3609: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/2468 > Add sizeInBytes statistics to Limit operator > > > Key: SPARK-3609 > URL: https://issues.apache.org/jira/browse/SPARK-3609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The {{sizeInBytes}} statistics of a {{LIMIT}} operator can be estimated > fairly precisely when all output attributes are of native data types, all > native data types except {{StringType}} have fixed size. For {{StringType}}, > we can use a relatively large (say 4K) default size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3609) Add sizeInBytes statistics to Limit operator
Cheng Lian created SPARK-3609: - Summary: Add sizeInBytes statistics to Limit operator Key: SPARK-3609 URL: https://issues.apache.org/jira/browse/SPARK-3609 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian The {{sizeInBytes}} statistics of a {{LIMIT}} operator can be estimated fairly precisely when all output attributes are of native data types, since all native data types except {{StringType}} have a fixed size. For {{StringType}}, we can use a relatively large (say 4K) default size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
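The estimate described can be sketched as follows (hypothetical helper, not Catalyst's actual statistics API; the 4 KB string default is taken from the description above):

```java
// Illustrative sketch: sizeInBytes for LIMIT n is n rows times a per-row
// size, summing fixed widths and a default size for each string column.
class LimitStats {
    static final long STRING_DEFAULT_SIZE = 4096; // assumed 4K default for StringType

    // fixedSizes: byte widths of fixed-size columns (e.g. 4 for INT,
    // 8 for BIGINT/DOUBLE); stringColumns: number of StringType columns.
    static long estimate(long limit, int[] fixedSizes, int stringColumns) {
        long rowSize = 0;
        for (int s : fixedSizes) rowSize += s;
        rowSize += stringColumns * STRING_DEFAULT_SIZE;
        return limit * rowSize;
    }
}
```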
[jira] [Resolved] (SPARK-3485) should check parameter type when find constructors
[ https://issues.apache.org/jira/browse/SPARK-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3485. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2407 [https://github.com/apache/spark/pull/2407] > should check parameter type when find constructors > -- > > Key: SPARK-3485 > URL: https://issues.apache.org/jira/browse/SPARK-3485 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Adrian Wang > Fix For: 1.2.0 > > > In hiveUdfs, we get constructors of primitive types by finding a constructor > that takes only one parameter. This is very dangerous when more than one > constructor matches. As the sequence of primitiveTypes grows, the > problem becomes more likely to occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
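The fix described, matching on parameter type rather than just arity, can be sketched with plain Java reflection (illustrative only; the actual hiveUdfs code is Scala and differs):

```java
import java.lang.reflect.Constructor;

// Illustrative sketch: pick a one-argument constructor whose declared
// parameter type actually accepts the given type, instead of taking the
// first one-argument constructor found (which is ambiguous when several
// such constructors exist).
class Ctors {
    static Constructor<?> findUnary(Class<?> cls, Class<?> paramType) {
        for (Constructor<?> c : cls.getConstructors()) {
            Class<?>[] params = c.getParameterTypes();
            if (params.length == 1 && params[0].isAssignableFrom(paramType)) {
                return c;
            }
        }
        return null; // no compatible unary constructor
    }
}
```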
[jira] [Resolved] (SPARK-3605) Typo in SchemaRDD JavaDoc
[ https://issues.apache.org/jira/browse/SPARK-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3605. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2460 [https://github.com/apache/spark/pull/2460] > Typo in SchemaRDD JavaDoc > - > > Key: SPARK-3605 > URL: https://issues.apache.org/jira/browse/SPARK-3605 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sandy Ryza >Priority: Trivial > Fix For: 1.2.0 > > > "Examples are loading data from Parquet files by using by using the" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3592) applySchema to an RDD of Row
[ https://issues.apache.org/jira/browse/SPARK-3592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3592. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2448 [https://github.com/apache/spark/pull/2448] > applySchema to an RDD of Row > > > Key: SPARK-3592 > URL: https://issues.apache.org/jira/browse/SPARK-3592 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.2.0 > > > Right now, we cannot apply a schema to an RDD of Row; this should be a bug: > {code} > >>> srdd = sqlCtx.jsonRDD(sc.parallelize(["""{"a":2}"""])) > >>> sqlCtx.applySchema(srdd.map(lambda x:x), srdd.schema()) > Traceback (most recent call last): > File "", line 1, in > File "/Users/daviesliu/work/spark/python/pyspark/sql.py", line 1121, > in applySchema > _verify_type(row, schema) > File "/Users/daviesliu/work/spark/python/pyspark/sql.py", line 736, > in _verify_type > % (dataType, type(obj))) > TypeError: StructType(List(StructField(a,IntegerType,true))) can not > accept abject in type > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2594) Add CACHE TABLE AS SELECT ...
[ https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2594. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2397 [https://github.com/apache/spark/pull/2397] > Add CACHE TABLE AS SELECT ... > > > Key: SPARK-2594 > URL: https://issues.apache.org/jira/browse/SPARK-2594 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Michael Armbrust >Priority: Critical > Fix For: 1.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3501) Hive SimpleUDF will create duplicated type cast which cause exception in constant folding
[ https://issues.apache.org/jira/browse/SPARK-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3501. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2368 [https://github.com/apache/spark/pull/2368] > Hive SimpleUDF will create duplicated type cast which cause exception in > constant folding > - > > Key: SPARK-3501 > URL: https://issues.apache.org/jira/browse/SPARK-3501 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Minor > Fix For: 1.2.0 > > > When running a query like: > select datediff(cast(value as timestamp), cast('2002-03-21 00:00:00' as > timestamp)) from src; > Spark SQL will raise an exception: > {panel} > [info] - Cast Timestamp to Timestamp in UDF *** FAILED *** > [info] scala.MatchError: TimestampType (of class > org.apache.spark.sql.catalyst.types.TimestampType$) > [info] at > org.apache.spark.sql.catalyst.expressions.Cast.castToTimestamp(Cast.scala:77) > [info] at > org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:251) > [info] at > org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247) > [info] at > org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263) > [info] at > org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:217) > [info] at > org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:210) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:180) > [info] at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > [info] at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > {panel} -- This message was sent by Atlassian
JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3608) Spark EC2 Script does not correctly break when AWS tagging succeeds.
[ https://issues.apache.org/jira/browse/SPARK-3608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141438#comment-14141438 ] Apache Spark commented on SPARK-3608: - User 'vidaha' has created a pull request for this issue: https://github.com/apache/spark/pull/2466 > Spark EC2 Script does not correctly break when AWS tagging succeeds. > > > Key: SPARK-3608 > URL: https://issues.apache.org/jira/browse/SPARK-3608 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.1.0 >Reporter: Vida Ha >Priority: Critical > > Spark EC2 script will tag 5 times and not break out correctly if things > succeed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3608) Spark EC2 Script does not correctly break when AWS tagging succeeds.
Vida Ha created SPARK-3608: -- Summary: Spark EC2 Script does not correctly break when AWS tagging succeeds. Key: SPARK-3608 URL: https://issues.apache.org/jira/browse/SPARK-3608 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Vida Ha Priority: Critical The Spark EC2 script always retries tagging 5 times instead of breaking out of the retry loop once tagging succeeds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
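The intended behavior, retrying up to 5 times but stopping as soon as tagging succeeds, can be sketched as follows (hypothetical names; the real script is Python in spark_ec2.py):

```java
// Illustrative sketch of the fix: break out of the retry loop on the
// first successful attempt instead of always running all 5 iterations.
class Tagger {
    interface Op { boolean apply(); }

    // Returns how many attempts were actually made.
    static int tagWithRetries(Op tag) {
        int attempts = 0;
        for (int i = 0; i < 5; i++) {
            attempts++;
            if (tag.apply()) {
                break; // success: stop retrying
            }
        }
        return attempts;
    }
}
```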
[jira] [Created] (SPARK-3607) ConnectionManager threads.max configs on the thread pools don't work
Thomas Graves created SPARK-3607: Summary: ConnectionManager threads.max configs on the thread pools don't work Key: SPARK-3607 URL: https://issues.apache.org/jira/browse/SPARK-3607 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Thomas Graves Priority: Minor In the ConnectionManager we have a bunch of thread pools. They have settings for the maximum number of threads for each thread pool (like spark.core.connection.handler.threads.max). Those configs don't work because it's using an unbounded queue. From the ThreadPoolExecutor javadoc page: no more than corePoolSize threads will ever be created. (And the value of the maximumPoolSize therefore doesn't have any effect.) Luckily this doesn't matter too much, as you can work around it by just increasing the minimum, like spark.core.connection.handler.threads.min. These configs aren't documented either, so it's more of an internal thing for when someone is reading the code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
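The cited javadoc behavior is easy to demonstrate: with an unbounded LinkedBlockingQueue, a ThreadPoolExecutor never grows past corePoolSize, so maximumPoolSize (8 below) has no effect.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Demonstrates the bug's root cause: core=1, max=8, but with an unbounded
// queue the extra tasks queue up instead of spawning threads, so the pool
// size stays at corePoolSize.
class PoolDemo {
    static int poolSizeAfterBurst() {
        try {
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    1, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
            CountDownLatch release = new CountDownLatch(1);
            for (int i = 0; i < 20; i++) {
                pool.execute(() -> {
                    try { release.await(); } catch (InterruptedException ignored) {}
                });
            }
            int size = pool.getPoolSize(); // stays at corePoolSize == 1
            release.countDown();
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
            return size;
        } catch (InterruptedException e) {
            return -1;
        }
    }
}
```

This is why raising `...threads.min` works as the report suggests: the core size is the only knob that matters while the queue is unbounded.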
[jira] [Resolved] (SPARK-3491) Use pickle to serialize the data in MLlib Python
[ https://issues.apache.org/jira/browse/SPARK-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3491. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2378 [https://github.com/apache/spark/pull/2378] > Use pickle to serialize the data in MLlib Python > > > Key: SPARK-3491 > URL: https://issues.apache.org/jira/browse/SPARK-3491 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.2.0 > > > Currently, we write the code for serialization/deserialization in Python and > Scala manually, it can not scale to the big number of MLlib API. > If the serialization could be done in pickle (using Pyrolite in JVM) in > extensional way, then it should be much easy to add Python API for MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141382#comment-14141382 ] Matei Zaharia commented on SPARK-3129: -- So Hari, what is the maximum sustainable rate in MB/second? That's the number we should be looking for. I think a latency of 50-100 ms to flush is fine, but we can't be writing just 5 Kbytes/second. > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: SecurityFix.diff, StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1701) Inconsistent naming: "slice" or "partition"
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-1701. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2304 [https://github.com/apache/spark/pull/2304] > Inconsistent naming: "slice" or "partition" > --- > > Key: SPARK-1701 > URL: https://issues.apache.org/jira/browse/SPARK-1701 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core >Reporter: Daniel Darabos >Priority: Minor > Labels: starter > Fix For: 1.2.0 > > > Throughout the documentation and code "slice" and "partition" are used > interchangeably. (Or so it seems to me.) It would avoid some confusion for > new users to settle on one name. I think "partition" is winning, since that > is the name of the class representing the concept. > This should not be much more complicated to do than a search & replace. I can > take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
[ https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141344#comment-14141344 ] Apache Spark commented on SPARK-1853: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/2464 > Show Streaming application code context (file, line number) in Spark Stages UI > -- > > Key: SPARK-1853 > URL: https://issues.apache.org/jira/browse/SPARK-1853 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Tathagata Das >Assignee: Mubarak Seyed > Fix For: 1.2.0 > > Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png > > > Right now, the code context (file, and line number) shown for streaming jobs > in stages UI is meaningless as it refers to internal DStream: > rather than user application file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141343#comment-14141343 ] Sean Owen commented on SPARK-3573: -- It had also not occurred to me to use an abstraction for SQL-like queries over tabular data here. On a practical level it means dragging in Spark SQL to do ML and that brings a fresh set of jar hell issues for apps that otherwise don't need it. The data types that are needed in a data frame-like structure are different. For example, it is important to know if a column will take on 2 or N distinct values and expose that via API. Spark SQL would ignore or hide this kind of info. Float/Double types don't matter per se to ML but matter in SQL. I don't think this means Spark SQL could not be a source of a data frame-like thing but I would not expect to conflate them. The bits used from ML do seem like a distinct and much simpler API that expects to be found in ML or core and not tied to SQL. > Dataset > --- > > Key: SPARK-3573 > URL: https://issues.apache.org/jira/browse/SPARK-3573 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra > ML-specific metadata embedded in its schema. > .Sample code > Suppose we have training events stored on HDFS and user/ad features in Hive, > we want to assemble features for training and then apply decision tree. 
> The proposed pipeline with dataset looks like the following (need more > refinements): > {code} > sqlContext.jsonFile("/path/to/training/events", > 0.01).registerTempTable("event") > val training = sqlContext.sql(""" > SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, > event.action AS label, > user.gender AS userGender, user.country AS userCountry, > user.features AS userFeatures, > ad.targetGender AS targetGender > FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = > ad.id;""").cache() > val indexer = new Indexer() > val interactor = new Interactor() > val fvAssembler = new FeatureVectorAssembler() > val treeClassifer = new DecisionTreeClassifer() > val paramMap = new ParamMap() > .put(indexer.features, Map("userCountryIndex" -> "userCountry")) > .put(indexer.sortByFrequency, true) > .put(iteractor.features, Map("genderMatch" -> Array("userGender", > "targetGender"))) > .put(fvAssembler.features, Map("features" -> Array("genderMatch", > "userCountryIndex", "userFeatures"))) > .put(fvAssembler.dense, true) > .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes > "features" and "label" columns. > val pipeline = Pipeline.create(indexer, interactor, fvAssembler, > treeClassifier) > val model = pipeline.fit(raw, paramMap) > sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event") > val test = sqlContext.sql(""" > SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, > user.gender AS userGender, user.country AS userCountry, > user.features AS userFeatures, > ad.targetGender AS targetGender > FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = > ad.id;""") > val prediction = model.transform(test).select('eventId, 'prediction) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2271) Use Hive's high performance Decimal128 to replace BigDecimal
[ https://issues.apache.org/jira/browse/SPARK-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141336#comment-14141336 ] Patrick Wendell commented on SPARK-2271: From the Hive JIRA it looks like this was originally a Microsoft package. Should we just get the package directly, or is it necessary to get it from Hive? > Use Hive's high performance Decimal128 to replace BigDecimal > > > Key: SPARK-2271 > URL: https://issues.apache.org/jira/browse/SPARK-2271 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Cheng Lian > > Hive JIRA: https://issues.apache.org/jira/browse/HIVE-6017 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD
[ https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141313#comment-14141313 ] Apache Spark commented on SPARK-3604: - User 'harishreedharan' has created a pull request for this issue: https://github.com/apache/spark/pull/2463 > unbounded recursion in getNumPartitions triggers stack overflow for large > UnionRDD > -- > > Key: SPARK-3604 > URL: https://issues.apache.org/jira/browse/SPARK-3604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: linux. Used python, but error is in Scala land. >Reporter: Eric Friedman >Priority: Critical > > I have a large number of parquet files all with the same schema and attempted > to make a UnionRDD out of them. > When I call getNumPartitions(), I get a stack overflow error > that looks like this: > Py4JJavaError: An error occurred while calling o3275.partitions. > : java.lang.StackOverflowError > at > scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:243) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at 
scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.
Marcelo Vanzin created SPARK-3606: - Summary: Spark-on-Yarn AmIpFilter does not work with Yarn HA. Key: SPARK-3606 URL: https://issues.apache.org/jira/browse/SPARK-3606 Project: Spark Issue Type: Bug Components: YARN Reporter: Marcelo Vanzin The current IP filter only considers one of the RMs in an HA setup. If the active RM is not the configured one, you get a "connection refused" error when clicking on the Spark AM links in the RM UI. Similar to YARN-1811, but for Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141271#comment-14141271 ] Xiangrui Meng commented on SPARK-3573: -- [~sandyr] SQL/Streaming/GraphX provide computation frameworks, while MLlib is for building machine learning pipelines. It is natural to leverage the tools we have already built inside Spark. For example, we included streaming machine learning algorithms in 1.1, and we plan to implement LDA using GraphX. I'm not worried about MLlib depending on SQL. MLlib can provide UDFs related to machine learning. It will be an extension to SQL, but SQL doesn't depend on MLlib. There are not many types in ML. One thing we want to add is Vector, and its transformations are supported by MLlib's transformers. With weak types, we cannot prevent users from declaring a string column as numeric, but errors will be generated at runtime. [~epahomov] If we are talking about a single machine learning algorithm, label, feature, and perhaps weight should be sufficient. However, for a data pipeline, we need more flexible operations. I think we should make it easier for users to construct such a pipeline. Libraries like R and Pandas support data frames, which are very similar to SchemaRDD, though the latter also provides an execution plan. Do we need an execution plan? Maybe not in the first stage, but we definitely need it for future optimization. For training, we use label/features, and for prediction, we need id/features. Spark SQL can figure out which columns are needed and optimize access if the underlying storage is in columnar format. One useful thing we can try is to write some sample code that constructs a pipeline with a couple of components and re-applies the pipeline to test data. Then look at the code as users would and see whether it is simple to use. 
At the beginning, I tried to define Instance similar to Weka (https://github.com/mengxr/spark-ml/blob/master/doc/instance.md), but it doesn't address those pipelines well. > Dataset > --- > > Key: SPARK-3573 > URL: https://issues.apache.org/jira/browse/SPARK-3573 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra > ML-specific metadata embedded in its schema. > .Sample code > Suppose we have training events stored on HDFS and user/ad features in Hive, > we want to assemble features for training and then apply decision tree. > The proposed pipeline with dataset looks like the following (need more > refinements): > {code} > sqlContext.jsonFile("/path/to/training/events", > 0.01).registerTempTable("event") > val training = sqlContext.sql(""" > SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, > event.action AS label, > user.gender AS userGender, user.country AS userCountry, > user.features AS userFeatures, > ad.targetGender AS targetGender > FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = > ad.id;""").cache() > val indexer = new Indexer() > val interactor = new Interactor() > val fvAssembler = new FeatureVectorAssembler() > val treeClassifier = new DecisionTreeClassifier() > val paramMap = new ParamMap() > .put(indexer.features, Map("userCountryIndex" -> "userCountry")) > .put(indexer.sortByFrequency, true) > .put(interactor.features, Map("genderMatch" -> Array("userGender", > "targetGender"))) > .put(fvAssembler.features, Map("features" -> Array("genderMatch", > "userCountryIndex", "userFeatures"))) > .put(fvAssembler.dense, true) > .put(treeClassifier.maxDepth, 4) // By default, classifier recognizes > "features" and "label" columns. 
> val pipeline = Pipeline.create(indexer, interactor, fvAssembler, > treeClassifier) > val model = pipeline.fit(training, paramMap) > sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event") > val test = sqlContext.sql(""" > SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, > user.gender AS userGender, user.country AS userCountry, > user.features AS userFeatures, > ad.targetGender AS targetGender > FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = > ad.id;""") > val prediction = model.transform(test).select('eventId, 'prediction) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2175) Null values when using App trait.
[ https://issues.apache.org/jira/browse/SPARK-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141217#comment-14141217 ] Brandon Amos commented on SPARK-2175: - Hi, does the following snippet from the mailing list post linked above still exhibit this behavior? val suffix = "-suffix" val l = sc.parallelize(List("a", "b", "c")) println(l.map(_+suffix).collect().mkString(",")) > Null values when using App trait. > - > > Key: SPARK-2175 > URL: https://issues.apache.org/jira/browse/SPARK-2175 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: Linux >Reporter: Brandon Amos >Priority: Trivial > > See > http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerExceptions-when-using-val-or-broadcast-on-a-standalone-cluster-tc7524.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
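For context on the failure mode being discussed, here is a minimal sketch (object and app names are hypothetical, not from the report) of why extending App can produce nulls: fields of an App object are initialized through DelayedInit, so a closure shipped to executors can capture the singleton before its vals are assigned. A plain main method avoids this:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Risky pattern: vals on an App object are assigned via DelayedInit,
// so `suffix` may still be null when the closure runs on an executor.
object SuffixJobBroken extends App {
  val sc = new SparkContext(new SparkConf().setAppName("suffix"))
  val suffix = "-suffix"
  println(sc.parallelize(List("a", "b", "c")).map(_ + suffix).collect().mkString(","))
}

// Safe pattern: an explicit main method; fields initialize before any closure is built.
object SuffixJobWorking {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("suffix"))
    val suffix = "-suffix"
    println(sc.parallelize(List("a", "b", "c")).map(_ + suffix).collect().mkString(","))
  }
}
```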
[jira] [Commented] (SPARK-951) Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141124#comment-14141124 ] Anant Daksh Asthana commented on SPARK-951: --- [~caizhua] Could you please elaborate a little more on the issue? Right now, 'this code' and the 'input file named Gmm_spark.tbl' are unknown to the reader. > Gaussian Mixture Model > -- > > Key: SPARK-951 > URL: https://issues.apache.org/jira/browse/SPARK-951 > Project: Spark > Issue Type: Story > Components: Examples >Affects Versions: 0.7.3 >Reporter: caizhua >Priority: Critical > Labels: Learning, Machine, Model > > This code includes the code for Gaussian Mixture Model. The input file named > Gmm_spark.tbl is the input for this program. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3536) SELECT on empty parquet table throws exception
[ https://issues.apache.org/jira/browse/SPARK-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3536: Assignee: Ravindra Pesala > SELECT on empty parquet table throws exception > -- > > Key: SPARK-3536 > URL: https://issues.apache.org/jira/browse/SPARK-3536 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Ravindra Pesala > Labels: starter > > Reported by [~matei]. Reproduce as follows: > {code} > scala> case class Data(i: Int) > defined class Data > scala> createParquetFile[Data]("testParquet") > scala> parquetFile("testParquet").count() > 14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due > to exception - job: 0 > java.lang.NullPointerException > at > org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438) > at > parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344) > at > org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141063#comment-14141063 ] Sandy Ryza commented on SPARK-3573: --- Currently SchemaRDD does depend on Catalyst. Are you thinking we'd take that out? I wasn't thinking about any specific drawbacks, unless SQL might need to depend on MLlib as well? I guess I'm thinking about it more from the perspective of what mental model we expect users to have when dealing with Datasets. SchemaRDD brings along baggage like LogicalPlans - do users need to understand what that is? SQL and ML types sometimes line up, sometimes have fuzzy relationships, and sometimes can't be translated. How does the mapping get defined? What stops someone from annotating a String column with "numeric"? > Dataset > --- > > Key: SPARK-3573 > URL: https://issues.apache.org/jira/browse/SPARK-3573 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra > ML-specific metadata embedded in its schema. > .Sample code > Suppose we have training events stored on HDFS and user/ad features in Hive, > we want to assemble features for training and then apply decision tree. 
> The proposed pipeline with dataset looks like the following (need more > refinements): > {code} > sqlContext.jsonFile("/path/to/training/events", > 0.01).registerTempTable("event") > val training = sqlContext.sql(""" > SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, > event.action AS label, > user.gender AS userGender, user.country AS userCountry, > user.features AS userFeatures, > ad.targetGender AS targetGender > FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = > ad.id;""").cache() > val indexer = new Indexer() > val interactor = new Interactor() > val fvAssembler = new FeatureVectorAssembler() > val treeClassifier = new DecisionTreeClassifier() > val paramMap = new ParamMap() > .put(indexer.features, Map("userCountryIndex" -> "userCountry")) > .put(indexer.sortByFrequency, true) > .put(interactor.features, Map("genderMatch" -> Array("userGender", > "targetGender"))) > .put(fvAssembler.features, Map("features" -> Array("genderMatch", > "userCountryIndex", "userFeatures"))) > .put(fvAssembler.dense, true) > .put(treeClassifier.maxDepth, 4) // By default, classifier recognizes > "features" and "label" columns. > val pipeline = Pipeline.create(indexer, interactor, fvAssembler, > treeClassifier) > val model = pipeline.fit(training, paramMap) > sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event") > val test = sqlContext.sql(""" > SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, > user.gender AS userGender, user.country AS userCountry, > user.features AS userFeatures, > ad.targetGender AS targetGender > FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = > ad.id;""") > val prediction = model.transform(test).select('eventId, 'prediction) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3605) Typo in SchemaRDD JavaDoc
[ https://issues.apache.org/jira/browse/SPARK-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141028#comment-14141028 ] Apache Spark commented on SPARK-3605: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/2460 > Typo in SchemaRDD JavaDoc > - > > Key: SPARK-3605 > URL: https://issues.apache.org/jira/browse/SPARK-3605 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sandy Ryza >Priority: Trivial > > "Examples are loading data from Parquet files by using by using the" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3605) Typo in SchemaRDD JavaDoc
Sandy Ryza created SPARK-3605: - Summary: Typo in SchemaRDD JavaDoc Key: SPARK-3605 URL: https://issues.apache.org/jira/browse/SPARK-3605 Project: Spark Issue Type: Bug Components: SQL Reporter: Sandy Ryza Priority: Trivial "Examples are loading data from Parquet files by using by using the" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3583) Spark run slow after unexpected repartition
[ https://issues.apache.org/jira/browse/SPARK-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3583. Resolution: Invalid Hi there, can you start on the user list rather than on JIRA? We use JIRA to track well-described bugs once we've pinpointed issues. Thanks! > Spark run slow after unexpected repartition > --- > > Key: SPARK-3583 > URL: https://issues.apache.org/jira/browse/SPARK-3583 > Project: Spark > Issue Type: Bug >Affects Versions: 0.9.1 >Reporter: ShiShu > Labels: easyfix > Attachments: spark_q_001.jpg, spark_q_004.jpg, spark_q_005.jpg, > spark_q_006.jpg > > > Hi dear all~ > My Spark application sometimes runs much slower than it used to, so I > wonder why this happens. > I found that after an unexpected repartition at stage 17, all tasks go to one > executor. But in my code, I only use repartition at the very beginning. > In my application, before stage 17, every stage runs successfully within 1 > minute, but after stage 17, every stage costs more than 10 minutes. > Normally my application runs successfully and finishes within 9 minutes. > My Spark version is 0.9.1, and my program is written in Scala. > I took some screenshots but don't know how to post them; please tell me if you > need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2175) Null values when using App trait.
[ https://issues.apache.org/jira/browse/SPARK-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140947#comment-14140947 ] Patrick Wendell commented on SPARK-2175: Thanks for reporting this - can someone provide a concise example here to post in the JIRA body? Thanks! > Null values when using App trait. > - > > Key: SPARK-2175 > URL: https://issues.apache.org/jira/browse/SPARK-2175 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: Linux >Reporter: Brandon Amos >Priority: Trivial > > See > http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerExceptions-when-using-val-or-broadcast-on-a-standalone-cluster-tc7524.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD
[ https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3604: --- Target Version/s: 1.2.0 > unbounded recursion in getNumPartitions triggers stack overflow for large > UnionRDD > -- > > Key: SPARK-3604 > URL: https://issues.apache.org/jira/browse/SPARK-3604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: linux. Used python, but error is in Scala land. >Reporter: Eric Friedman >Priority: Critical > > I have a large number of parquet files all with the same schema and attempted > to make a UnionRDD out of them. > When I call getNumPartitions(), I get a stack overflow error > that looks like this: > Py4JJavaError: An error occurred while calling o3275.partitions. > : java.lang.StackOverflowError > at > scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:243) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at 
scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD
[ https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3604: --- Priority: Critical (was: Blocker) > unbounded recursion in getNumPartitions triggers stack overflow for large > UnionRDD > -- > > Key: SPARK-3604 > URL: https://issues.apache.org/jira/browse/SPARK-3604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: linux. Used python, but error is in Scala land. >Reporter: Eric Friedman >Priority: Critical > > I have a large number of parquet files all with the same schema and attempted > to make a UnionRDD out of them. > When I call getNumPartitions(), I get a stack overflow error > that looks like this: > Py4JJavaError: An error occurred while calling o3275.partitions. > : java.lang.StackOverflowError > at > scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:243) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at 
scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD
[ https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140940#comment-14140940 ] Patrick Wendell commented on SPARK-3604: Yeah good catch, we should fix this. > unbounded recursion in getNumPartitions triggers stack overflow for large > UnionRDD > -- > > Key: SPARK-3604 > URL: https://issues.apache.org/jira/browse/SPARK-3604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: linux. Used python, but error is in Scala land. >Reporter: Eric Friedman >Priority: Critical > > I have a large number of parquet files all with the same schema and attempted > to make a UnionRDD out of them. > When I call getNumPartitions(), I get a stack overflow error > that looks like this: > Py4JJavaError: An error occurred while calling o3275.partitions. > : java.lang.StackOverflowError > at > scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:243) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3599) Avoid loading and printing properties file content frequently
[ https://issues.apache.org/jira/browse/SPARK-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140896#comment-14140896 ] Apache Spark commented on SPARK-3599: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/2454 > Avoid loading and printing properties file content frequently > - > > Key: SPARK-3599 > URL: https://issues.apache.org/jira/browse/SPARK-3599 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: WangTaoTheTonic >Priority: Minor > Attachments: too many verbose.txt > > > When I use -v | --verbose in spark-submit, it prints lots of messages about > the contents of the properties file. > After checking the code in SparkSubmit.scala and SparkSubmitArguments.scala, I > found the "getDefaultSparkProperties" method is invoked in three places; every > time we invoke it, we reload the properties from the properties file, and print > them again if -v is used. > We should probably use a value instead of a method for the default properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
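One possible shape of the suggested fix, sketched here with hypothetical class and field names (not the actual pull request): memoize the parsed properties in a lazy val so the file is read, and printed under -v, at most once:

```scala
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._

// Hypothetical sketch: cache the default properties instead of re-reading
// the file on every access.
class SubmitArguments(propertiesFile: String, verbose: Boolean) {
  lazy val defaultSparkProperties: Map[String, String] = {
    val props = new Properties()
    val in = new FileInputStream(propertiesFile)
    try props.load(in) finally in.close()
    val parsed = props.asScala.toMap
    // Printed once, when the lazy val is first forced.
    if (verbose) parsed.foreach { case (k, v) => println(s"Using property: $k=$v") }
    parsed
  }
}
```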
[jira] [Commented] (SPARK-3536) SELECT on empty parquet table throws exception
[ https://issues.apache.org/jira/browse/SPARK-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140898#comment-14140898 ] Apache Spark commented on SPARK-3536: - User 'ravipesala' has created a pull request for this issue: https://github.com/apache/spark/pull/2456 > SELECT on empty parquet table throws exception > -- > > Key: SPARK-3536 > URL: https://issues.apache.org/jira/browse/SPARK-3536 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust > Labels: starter > > Reported by [~matei]. Reproduce as follows: > {code} > scala> case class Data(i: Int) > defined class Data > scala> createParquetFile[Data]("testParquet") > scala> parquetFile("testParquet").count() > 14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due > to exception - job: 0 > java.lang.NullPointerException > at > org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438) > at > parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344) > at > org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3598) cast to timestamp should be the same as hive
[ https://issues.apache.org/jira/browse/SPARK-3598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140900#comment-14140900 ] Apache Spark commented on SPARK-3598: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/2458 > cast to timestamp should be the same as hive > > > Key: SPARK-3598 > URL: https://issues.apache.org/jira/browse/SPARK-3598 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Adrian Wang > > select cast(1000 as timestamp) from src limit 1; > should return 1970-01-01 00:00:01 > also, the current implementation has a bug when the time is before 1970-01-01 > 00:00:00 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
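The expected semantics can be illustrated with a small sketch. Following the expectation stated above (1000 maps to 1970-01-01 00:00:01, i.e. the integral value is taken as milliseconds since the epoch — an assumption drawn from the report, not from the patch), pre-epoch values must also round-trip:

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.TimeZone

// Per the report above, cast(1000 as timestamp) should be one second past
// the epoch, so the value is interpreted as milliseconds.
def castToTimestamp(value: Long): Timestamp = new Timestamp(value)

val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
println(fmt.format(castToTimestamp(1000L)))   // 1970-01-01 00:00:01
println(fmt.format(castToTimestamp(-1000L)))  // negative values fall before the epoch
```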
[jira] [Commented] (SPARK-3250) More Efficient Sampling
[ https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140897#comment-14140897 ] Apache Spark commented on SPARK-3250: - User 'erikerlandson' has created a pull request for this issue: https://github.com/apache/spark/pull/2455 > More Efficient Sampling > --- > > Key: SPARK-3250 > URL: https://issues.apache.org/jira/browse/SPARK-3250 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: RJ Nowling > > Sampling, as currently implemented in Spark, is an O\(n\) operation. A > number of stochastic algorithms achieve speed ups by exploiting O\(k\) > sampling, where k is the number of data points to sample. Examples of such > algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient > Descent with mini batching. > More efficient sampling may be achievable by packing partitions with an > ArrayBuffer or other data structure supporting random access. Since many of > these stochastic algorithms perform repeated rounds of sampling, it may be > feasible to perform a transformation to change the backing data structure > followed by multiple rounds of sampling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
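The idea behind the proposal can be sketched in a few lines: once a partition's data sits in a structure with O(1) random access, drawing k samples with replacement costs O(k) rather than an O(n) scan (a sketch of the sampling step only, not the proposed Spark implementation):

```scala
import scala.util.Random

// O(k) sampling with replacement from a random-access buffer:
// each draw is a single indexed lookup, independent of data.length.
def sampleWithReplacement[T](data: IndexedSeq[T], k: Int, rng: Random): Seq[T] =
  Seq.fill(k)(data(rng.nextInt(data.length)))

// Repeated mini-batch rounds amortize the one-time cost of packing the
// partition into an IndexedSeq.
val buf = Vector.tabulate(1000000)(identity)
val batch = sampleWithReplacement(buf, 100, new Random(42))  // 100 draws, not a full scan
```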
[jira] [Commented] (SPARK-3268) DoubleType should support modulus
[ https://issues.apache.org/jira/browse/SPARK-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140899#comment-14140899 ] Apache Spark commented on SPARK-3268: - User 'gvramana' has created a pull request for this issue: https://github.com/apache/spark/pull/2457 > DoubleType should support modulus > -- > > Key: SPARK-3268 > URL: https://issues.apache.org/jira/browse/SPARK-3268 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Chris Grier >Priority: Minor > > Using the modulus operator (%) on Doubles throws and exception. > eg: > SELECT 1388632775.0 % 60 from tablename LIMIT 1 > Throws: > java.lang.Exception: Type DoubleType does not support numeric operations -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
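For reference, modulus is already defined on Double in Scala and Java, so the fix is largely a matter of wiring it into the SQL numeric-operations machinery. A sketch of the expected semantics for the query above:

```scala
// Expected semantics of modulus on doubles, matching the reported query
// SELECT 1388632775.0 % 60: the remainder after dividing by 60.
val r = 1388632775.0 % 60
println(r)  // 35.0
```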
[jira] [Commented] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD
[ https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140713#comment-14140713 ] Eric Friedman commented on SPARK-3604: -- There are many more frames of the same content as pasted above, of course. Looks like the problem is here: override def getPartitions: Array[Partition] = { val array = new Array[Partition](rdds.map(_.partitions.size).sum) and it should be solved either iteratively or as a tail call. > unbounded recursion in getNumPartitions triggers stack overflow for large > UnionRDD > -- > > Key: SPARK-3604 > URL: https://issues.apache.org/jira/browse/SPARK-3604 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: linux. Used python, but error is in Scala land. >Reporter: Eric Friedman >Priority: Blocker > > I have a large number of parquet files all with the same schema and attempted > to make a UnionRDD out of them. > When I call getNumPartitions(), I get a stack overflow error > that looks like this: > Py4JJavaError: An error occurred while calling o3275.partitions. 
> : java.lang.StackOverflowError > at > scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:243) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
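The iterative rewrite suggested in the comment can be illustrated outside Spark: replace the mutually recursive partitions/getPartitions calls with an explicit work stack, so nesting depth no longer consumes call-stack frames. This is a hypothetical sketch of the technique, not Spark code:

```python
class Node:
    """Stand-in for an RDD: a leaf with some partitions, or a union of child RDDs."""
    def __init__(self, children=(), leaf_partitions=0):
        self.children = list(children)
        self.leaf_partitions = leaf_partitions

def num_partitions_iterative(root):
    """Count partitions with an explicit stack instead of recursion."""
    total, stack = 0, [root]
    while stack:
        node = stack.pop()
        total += node.leaf_partitions
        stack.extend(node.children)
    return total

# A union chain far deeper than any recursion limit: 100,000 levels,
# each unioning the previous chain with a 2-partition leaf.
node = Node(leaf_partitions=1)
for _ in range(100_000):
    node = Node(children=[node, Node(leaf_partitions=2)])
print(num_partitions_iterative(node))  # 200001
```

A naive recursive traversal of the same structure would overflow exactly as the stack trace above shows; the explicit stack keeps memory on the heap.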
[jira] [Created] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD
Eric Friedman created SPARK-3604: Summary: unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD Key: SPARK-3604 URL: https://issues.apache.org/jira/browse/SPARK-3604 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: linux. Used python, but error is in Scala land. Reporter: Eric Friedman Priority: Blocker I have a large number of parquet files all with the same schema and attempted to make a UnionRDD out of them. When I call getNumPartitions(), I get a stack overflow error that looks like this: Py4JJavaError: An error occurred while calling o3275.partitions. : java.lang.StackOverflowError at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239) at scala.collection.TraversableLike$class.map(TraversableLike.scala:243) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
[jira] [Commented] (SPARK-2706) Enable Spark to support Hive 0.13
[ https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140657#comment-14140657 ] Greg Senia commented on SPARK-2706: --- We have been using this fix for a few weeks now against Hive 0.13. The only outstanding issue I see (and this could be something larger) is that the Spark Thrift service doesn't seem to support hive.server2.enable.doAs = true: it doesn't set the proxy user. > Enable Spark to support Hive 0.13 > - > > Key: SPARK-2706 > URL: https://issues.apache.org/jira/browse/SPARK-2706 > Project: Spark > Issue Type: Dependency upgrade > Components: SQL >Affects Versions: 1.0.1 >Reporter: Chunjun Xiao >Assignee: Zhan Zhang > Attachments: hive.diff, spark-2706-v1.txt, spark-2706-v2.txt, > spark-hive.err, v1.0.2.diff > > > It seems Spark cannot work with Hive 0.13 well. > When I compiled Spark with Hive 0.13.1, I got some error messages, as > attached below. > So, when can Spark be enabled to support Hive 0.13? 
> Compiling Error: > {quote} > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180: > type mismatch; > found : String > required: Array[String] > [ERROR] val proc: CommandProcessor = > CommandProcessorFactory.get(tokens(0), hiveconf) > [ERROR] ^ > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264: > overloaded method constructor TableDesc with alternatives: > (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]],x$2: > Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc > > ()org.apache.hadoop.hive.ql.plan.TableDesc > cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], > Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in > value tableDesc)(in value tableDesc)], java.util.Properties) > [ERROR] val tableDesc = new TableDesc( > [ERROR] ^ > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140: > value getPartitionPath is not a member of > org.apache.hadoop.hive.ql.metadata.Partition > [ERROR] val partPath = partition.getPartitionPath > [ERROR]^ > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132: > value appendReadColumnNames is not a member of object > org.apache.hadoop.hive.serde2.ColumnProjectionUtils > [ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, > attributes.map(_.name)) > [ERROR] ^ > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79: > org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor > [ERROR] new HiveDecimal(bd.underlying()) > [ERROR] ^ > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132: > type mismatch; > found : org.apache.hadoop.fs.Path > required: String > [ERROR] > SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf)) > 
[ERROR] ^ > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179: > value getExternalTmpFileURI is not a member of > org.apache.hadoop.hive.ql.Context > [ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation) > [ERROR] ^ > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209: > org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor > [ERROR] case bd: BigDecimal => new HiveDecimal(bd.underlying()) > [ERROR] ^ > [ERROR] 8 errors found > [DEBUG] Compilation failed (CompilerInterface) > [INFO] > > [INFO] Reactor Summary: > [INFO] > [INFO] Spark Project Parent POM .. SUCCESS [2.579s] > [INFO] Spark Project Core SUCCESS [2:39.805s] > [INFO] Spark Project Bagel ... SUCCESS [21.148s] > [INFO] Spark Project GraphX .. SUCCESS [59.950s] > [INFO] Spark Project ML Library .. SUCCESS [1:08.771s] > [INFO] Spark Project Streaming ... SUCCESS [1:17.759s] > [INFO] Spark Project Tools ... SUCCESS [15.405s] > [INFO] Spark Project Catalyst SUCCESS [1:17.405s] > [INFO] Spark Project SQL . SUCCESS
[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140647#comment-14140647 ] Imran Rashid commented on SPARK-2365: - This looks fantastic. I think it will see heavy use outside of GraphX as well. I only have one question about the API -- I think the name `IndexedRDD` is more appropriate for an interface, not this concrete implementation. I can imagine other indexing strategies that would also feel like an `IndexedRDD`. It's always hard to know when it's worth putting in an interface and whether you will actually need more concrete implementations, but this seems like a good candidate to me. You could partially deal with this by having a companion object to the interface, which constructs this particular implementation. (E.g., the way the `IndexedSeq` companion object's `apply()` methods build a `Vector`.) Not that I have any great recommendations for a better name ... `UpdateableHashIndexedRDD` maybe? My other comments on the design can probably be handled in future work, but while they are on my mind: # Is there any way to save & load an `IndexedRDD` from hdfs? It seems like you'll always need to reshuffle the data when you load it if you just do `IndexedRDD(sc.hadoopFile(...))`. It seems we need a way for a hadoop file to be loaded with an "assumed Partitioner" (I thought I opened a ticket for that a while ago, but I can't find it ... I might open another one). Also you might want some way to load the index from disk as well, though I suppose rebuilding that isn't too painful. # I'm wondering whether we should add a "bulk multiget". E.g., say your IndexedRDD has 1B entries, and you want to look up and process 100K of them in parallel. You probably don't want to do a sequential scan ... but you also don't want to call `multiget`, which will pull the records onto the driver. 
Actually, as I'm writing this, I'm realizing there is probably a better way to do this -- you should just make another RDD out of your 100K elements, and then do an inner join. Does that sound right? We could add a convenience method for this -- but maybe I'm the only one who wants this, so it's premature to do anything about it. Again, I think this is a fantastic addition! I'm looking through the code now, but so far it all seems great. > Add IndexedRDD, an efficient updatable key-value store > -- > > Key: SPARK-2365 > URL: https://issues.apache.org/jira/browse/SPARK-2365 > Project: Spark > Issue Type: New Feature > Components: GraphX, Spark Core >Reporter: Ankur Dave >Assignee: Ankur Dave > Attachments: 2014-07-07-IndexedRDD-design-review.pdf > > > RDDs currently provide a bulk-updatable, iterator-based interface. This > imposes minimal requirements on the storage layer, which only needs to > support sequential access, enabling on-disk and serialized storage. > However, many applications would benefit from a richer interface. Efficient > support for point lookups would enable serving data out of RDDs, but it > currently requires iterating over an entire partition to find the desired > element. Point updates similarly require copying an entire iterator. Joins > are also expensive, requiring a shuffle and local hash joins. > To address these problems, we propose IndexedRDD, an efficient key-value > store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key > uniqueness and pre-indexing the entries for efficient joins and point > lookups, updates, and deletions. > It would be implemented by (1) hash-partitioning the entries by key, (2) > maintaining a hash index within each partition, and (3) using purely > functional (immutable and efficiently updatable) data structures to enable > efficient modifications and deletions. > GraphX would be the first user of IndexedRDD, since it currently implements a > limited form of this functionality in VertexRDD. 
We envision a variety of > other uses for IndexedRDD, including streaming updates to RDDs, direct > serving from RDDs, and as an execution strategy for Spark SQL.
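The implementation scheme in the proposal (hash-partition by key, keep a per-partition hash index, join co-partitioned datasets without a shuffle) can be sketched in a few lines of plain Python. Names here are illustrative, not the proposed IndexedRDD API:

```python
NUM_PARTITIONS = 4

def build_index(pairs, num_partitions=NUM_PARTITIONS):
    """(1) hash-partition entries by key, (2) keep a hash index per partition."""
    partitions = [dict() for _ in range(num_partitions)]
    for k, v in pairs:
        partitions[hash(k) % num_partitions][k] = v  # last write wins: keys are unique
    return partitions

def lookup(partitions, key):
    """Point lookup touches exactly one partition instead of scanning them all."""
    return partitions[hash(key) % len(partitions)].get(key)

def inner_join(parts_a, parts_b):
    """Co-partitioned join: zip partitions pairwise, no shuffle needed."""
    return [(k, (va, pb[k]))
            for pa, pb in zip(parts_a, parts_b)
            for k, va in pa.items() if k in pb]

idx = build_index([(1, "a"), (2, "b"), (3, "c")])
idx2 = build_index([(2, 20), (3, 30), (4, 40)])
print(lookup(idx, 2))           # b
print(inner_join(idx, idx2))    # [(2, ('b', 20)), (3, ('c', 30))]
```

The `inner_join` line is exactly the "make another RDD out of your 100K elements, then inner join" pattern from the comment above: because both sides are partitioned by the same hash, matching keys always land in the same partition pair.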
[jira] [Created] (SPARK-3603) InvalidClassException on a Linux VM - probably problem with serialization
Tomasz Dudziak created SPARK-3603: - Summary: InvalidClassException on a Linux VM - probably problem with serialization Key: SPARK-3603 URL: https://issues.apache.org/jira/browse/SPARK-3603 Project: Spark Issue Type: Bug Affects Versions: 1.1.0, 1.0.0 Environment: Linux version 2.6.32-358.32.3.el6.x86_64 (mockbu...@x86-029.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri Jan 17 08:42:31 EST 2014 java version "1.7.0_25" OpenJDK Runtime Environment (rhel-2.3.10.4.el6_4-x86_64) OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode) Spark (either 1.0.0 or 1.1.0) Reporter: Tomasz Dudziak Priority: Critical I have a Scala app connecting to a standalone Spark cluster. It works fine on Windows or on a Linux VM; however, when I try to run the app and the Spark cluster on another Linux VM (the same Linux kernel, Java and Spark - tested for versions 1.0.0 and 1.1.0) I get the below exception. This looks kind of similar to the Big-Endian (IBM Power7) Spark Serialization issue (SPARK-2018), but... my system is definitely little endian and I understand the big endian issue should be already fixed in Spark 1.1.0 anyway. I'd appreciate your help. 
01:34:53.251 WARN [Result resolver thread-0][TaskSetManager] Lost TID 2 (task 1.0:2) 01:34:53.278 WARN [Result resolver thread-0][TaskSetManager] Loss was due to java.io.InvalidClassException java.io.InvalidClassException: scala.reflect.ClassTag$$anon$1; local class incompatible: stream classdesc serialVersionUID = -4937928798201944954, local class serialVersionUID = -8102093212602380348 at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1769) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1891) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1891) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.jav
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140396#comment-14140396 ] Alexander Ulanov commented on SPARK-3403: - Thanks, Sam! Posted to OpenBLAS: https://github.com/xianyi/OpenBLAS/issues/452 > NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) > - > > Key: SPARK-3403 > URL: https://issues.apache.org/jira/browse/SPARK-3403 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2 > Environment: Setup: Windows 7, x64 libraries for netlib-java (as > described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and > MinGW64 precompiled dlls. >Reporter: Alexander Ulanov > Fix For: 1.2.0 > > Attachments: NativeNN.scala > > > Code: > val model = NaiveBayes.train(train) > val predictionAndLabels = test.map { point => > val score = model.predict(point.features) > (score, point.label) > } > predictionAndLabels.foreach(println) > Result: > program crashes with: "Process finished with exit code -1073741819 > (0xC005)" after displaying the first prediction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140319#comment-14140319 ] Sam Halliday commented on SPARK-3403: - Thanks guys. This looks like it's even more upstream of me. Would be good if you can submit to OpenBLAS. I've never seen great gains in OpenBLAS over ATLAS, and certainly the AMD/Intel versions are far superior, so I recommend them if performance is really critical. > NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java) > - > > Key: SPARK-3403 > URL: https://issues.apache.org/jira/browse/SPARK-3403 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2 > Environment: Setup: Windows 7, x64 libraries for netlib-java (as > described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and > MinGW64 precompiled dlls. >Reporter: Alexander Ulanov > Fix For: 1.2.0 > > Attachments: NativeNN.scala > > > Code: > val model = NaiveBayes.train(train) > val predictionAndLabels = test.map { point => > val score = model.predict(point.features) > (score, point.label) > } > predictionAndLabels.foreach(println) > Result: > program crashes with: "Process finished with exit code -1073741819 > (0xC005)" after displaying the first prediction
[jira] [Commented] (SPARK-3536) SELECT on empty parquet table throws exception
[ https://issues.apache.org/jira/browse/SPARK-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140269#comment-14140269 ] Ravindra Pesala commented on SPARK-3536: [~isaias.barroso] I submitted the PR 4 hours ago, but I am not sure why it has not yet been linked to the JIRA. > SELECT on empty parquet table throws exception > -- > > Key: SPARK-3536 > URL: https://issues.apache.org/jira/browse/SPARK-3536 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust > Labels: starter > > Reported by [~matei]. Reproduce as follows: > {code} > scala> case class Data(i: Int) > defined class Data > scala> createParquetFile[Data]("testParquet") > scala> parquetFile("testParquet").count() > 14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due > to exception - job: 0 > java.lang.NullPointerException > at > org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438) > at > parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344) > at > org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > {code}
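Whatever the eventual PR does, the stack trace above shows the common failure shape of a metadata lookup returning null for an empty table and the caller dereferencing it. The defensive pattern, sketched abstractly (hypothetical names, not the actual Spark patch):

```python
def get_splits(file_footers):
    """Treat absent metadata from an empty table as zero splits, not a crash."""
    if file_footers is None:
        return []
    return list(file_footers)

print(len(get_splits(None)))          # 0
print(len(get_splits(["f1", "f2"])))  # 2
```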
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140260#comment-14140260 ] Egor Pakhomov commented on SPARK-3530: -- Nice doc. Parameter passing as part of grid search and pipeline creation is a great and important feature, but it's only one of the features. For me it's more important to see the Estimator abstraction in the Spark code base early: maybe not earlier than introducing the dataset abstraction, but definitely earlier than any work on grid search. When we were thinking about creating such a pipeline framework, we came to the conclusion that transformations in this pipeline are like steps in an Oozie workflow: they should be easily retrievable, be persisted, and have some queue. That is because a transformation can take hours, and rerunning the whole pipeline in case of a step failure is expensive. A pipeline can consist of grid search over parameters, which means there are a lot of parallel executions that need wise scheduling. So I think the pipeline should be executed on some cluster-wide scheduler with some persistence. I'm not saying that it's absolutely necessary now, but it would be great to have the architecture open to such a possibility. > Pipeline and Parameters > --- > > Key: SPARK-3530 > URL: https://issues.apache.org/jira/browse/SPARK-3530 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > This part of the design doc is for pipelines and parameters. I put the design > doc at > https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing > I will copy the proposed interfaces to this JIRA later. Some sample code can > be viewed at: https://github.com/mengxr/spark-ml/ > Please help review the design and post your comments here. Thanks! 
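The Estimator abstraction the comment asks for can be sketched minimally: an Estimator's fit produces a Transformer, and a Pipeline chains stages, fitting each on the output of the previous one. Illustrative Python under those assumptions, not the API from the design doc:

```python
class Transformer:
    def transform(self, dataset):
        raise NotImplementedError

class Estimator:
    def fit(self, dataset):
        """Learn from data and return a fitted Transformer."""
        raise NotImplementedError

class MaxScaler(Estimator):
    """Toy estimator: learns the max of the data and scales values by it."""
    def fit(self, dataset):
        factor = float(max(dataset))
        class Model(Transformer):
            def transform(self, ds):
                return [x / factor for x in ds]
        return Model()

class PipelineModel(Transformer):
    def __init__(self, models):
        self.models = models
    def transform(self, dataset):
        for m in self.models:
            dataset = m.transform(dataset)
        return dataset

class Pipeline(Estimator):
    """A Pipeline is itself an Estimator, so pipelines compose."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, dataset):
        models = []
        for stage in self.stages:
            model = stage.fit(dataset)
            dataset = model.transform(dataset)  # feed the next stage
            models.append(model)
        return PipelineModel(models)

model = Pipeline([MaxScaler()]).fit([1.0, 2.0, 4.0])
print(model.transform([2.0, 4.0]))  # [0.5, 1.0]
```

Making Pipeline an Estimator is also what opens the door to the scheduling and persistence concerns raised above: each stage's fitted model is a discrete, persistable unit, so a failed step need not rerun the whole chain.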
[jira] [Comment Edited] (SPARK-3602) Can't run cassandra_inputformat.py
[ https://issues.apache.org/jira/browse/SPARK-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140236#comment-14140236 ] Frens Jan Rumph edited comment on SPARK-3602 at 9/19/14 9:45 AM: - When running this against the spark-1.1.0-bin-hadoop build I get the following output: {noformat} Spark assembly has been built with Hive, including Datanucleus jars on classpath Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/09/19 11:24:31 WARN Utils: Your hostname, laptop-x resolves to a loopback address: 127.0.0.1; using 192.168.2.2 instead (on interface wlan0) 14/09/19 11:24:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/09/19 11:24:31 INFO SecurityManager: Changing view acls to: frens-jan, 14/09/19 11:24:31 INFO SecurityManager: Changing modify acls to: frens-jan, 14/09/19 11:24:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(frens-jan, ); users with modify permissions: Set(frens-jan, ) 14/09/19 11:24:31 INFO Slf4jLogger: Slf4jLogger started 14/09/19 11:24:31 INFO Remoting: Starting remoting 14/09/19 11:24:32 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@laptop-x.local:44417] 14/09/19 11:24:32 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@laptop-x.local:44417] 14/09/19 11:24:32 INFO Utils: Successfully started service 'sparkDriver' on port 44417. 14/09/19 11:24:32 INFO SparkEnv: Registering MapOutputTracker 14/09/19 11:24:32 INFO SparkEnv: Registering BlockManagerMaster 14/09/19 11:24:32 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140919112432-527c 14/09/19 11:24:32 INFO Utils: Successfully started service 'Connection manager for block manager' on port 44978. 
14/09/19 11:24:32 INFO ConnectionManager: Bound socket to port 44978 with id = ConnectionManagerId(laptop-x.local,44978) 14/09/19 11:24:32 INFO MemoryStore: MemoryStore started with capacity 265.4 MB 14/09/19 11:24:32 INFO BlockManagerMaster: Trying to register BlockManager 14/09/19 11:24:32 INFO BlockManagerMasterActor: Registering block manager laptop-x.local:44978 with 265.4 MB RAM 14/09/19 11:24:32 INFO BlockManagerMaster: Registered BlockManager 14/09/19 11:24:32 INFO HttpFileServer: HTTP File server directory is /tmp/spark-4168e04d-508f-4f3b-92b4-050ecb47dfc7 14/09/19 11:24:32 INFO HttpServer: Starting HTTP Server 14/09/19 11:24:32 INFO Utils: Successfully started service 'HTTP file server' on port 54892. 14/09/19 11:24:32 INFO Utils: Successfully started service 'SparkUI' on port 4040. 14/09/19 11:24:32 INFO SparkUI: Started SparkUI at http://laptop-x.local:4040 14/09/19 11:24:33 INFO SparkContext: Added JAR file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop1/lib/spark-examples-1.1.0-hadoop1.0.4.jar at http://192.168.2.2:54892/jars/spark-examples-1.1.0-hadoop1.0.4.jar with timestamp 148673018 14/09/19 11:24:33 INFO Utils: Copying /home/frens-jan/Desktop/spark-1.1.0-bin-hadoop1/examples/src/main/python/cassandra_inputformat.py to /tmp/spark-be9320ce-82f7-437d-af36-a31b6f7375be/cassandra_inputformat.py 14/09/19 11:24:33 INFO SparkContext: Added file file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop1/examples/src/main/python/cassandra_inputformat.py at http://192.168.2.2:54892/files/cassandra_inputformat.py with timestamp 148673019 14/09/19 11:24:33 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@laptop-x.local:44417/user/HeartbeatReceiver 14/09/19 11:24:33 INFO MemoryStore: ensureFreeSpace(34980) called with curMem=0, maxMem=278302556 14/09/19 11:24:33 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 34.2 KB, free 265.4 MB) 14/09/19 11:24:33 INFO MemoryStore: ensureFreeSpace(34980) called with 
curMem=34980, maxMem=278302556 14/09/19 11:24:33 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 34.2 KB, free 265.3 MB) 14/09/19 11:24:33 INFO Converter: Loaded converter: org.apache.spark.examples.pythonconverters.CassandraCQLKeyConverter 14/09/19 11:24:33 INFO Converter: Loaded converter: org.apache.spark.examples.pythonconverters.CassandraCQLValueConverter 14/09/19 11:24:33 INFO SparkContext: Starting job: first at SerDeUtil.scala:70 14/09/19 11:24:33 INFO DAGScheduler: Got job 0 (first at SerDeUtil.scala:70) with 1 output partitions (allowLocal=true) 14/09/19 11:24:33 INFO DAGScheduler: Final stage: Stage 0(first at SerDeUtil.scala:70) 14/09/19 11:24:33 INFO DAGScheduler: Parents of final stage: List() 14/09/19 11:24:33 INFO DAGScheduler: Missing parents: List() 14/09/19 11:24:33 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at PythonHadoopUtil.scala:185), which has no missing parents 14/09/19 11:24:33 INFO MemoryStore: ensureFreeSpace(2440) called with curM
[jira] [Commented] (SPARK-3536) SELECT on empty parquet table throws exception
[ https://issues.apache.org/jira/browse/SPARK-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140239#comment-14140239 ] Isaias Barroso commented on SPARK-3536: --- Ravindra Pesala, I've fixed it on a branch and run the test suite for Parquet. If no one has started on a fix, I can submit a PR. If that's OK, please assign this issue to me. > SELECT on empty parquet table throws exception > -- > > Key: SPARK-3536 > URL: https://issues.apache.org/jira/browse/SPARK-3536 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust > Labels: starter > > Reported by [~matei]. Reproduce as follows: > {code} > scala> case class Data(i: Int) > defined class Data > scala> createParquetFile[Data]("testParquet") > scala> parquetFile("testParquet").count() > 14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due > to exception - job: 0 > java.lang.NullPointerException > at > org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438) > at > parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344) > at > org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > {code}
[jira] [Commented] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140206#comment-14140206 ] Egor Pakhomov commented on SPARK-3573: -- Can you tell us more about your motivation for using SchemaRDD as the class for Dataset? It seems to me there are some problems with this decision: 1) Dataset has no connection with SQL, rows, or relational concepts. Making these two very different topics coexist in one class is quite dangerous. 2) The only thing I see that SchemaRDD and Dataset have in common is some structure inside the RDD. But this structure is very different: one is about rows, the other about features and labels. Not to mention collaborative filtering algorithms, which could involve a very different structure. I can't see what motivates us to use SchemaRDD as the abstraction for Dataset. > Dataset > --- > > Key: SPARK-3573 > URL: https://issues.apache.org/jira/browse/SPARK-3573 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra > ML-specific metadata embedded in its schema. > .Sample code > Suppose we have training events stored on HDFS and user/ad features in Hive, > we want to assemble features for training and then apply decision tree. 
> The proposed pipeline with dataset looks like the following (need more > refinements): > {code} > sqlContext.jsonFile("/path/to/training/events", > 0.01).registerTempTable("event") > val training = sqlContext.sql(""" > SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, > event.action AS label, > user.gender AS userGender, user.country AS userCountry, > user.features AS userFeatures, > ad.targetGender AS targetGender > FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = > ad.id;""").cache() > val indexer = new Indexer() > val interactor = new Interactor() > val fvAssembler = new FeatureVectorAssembler() > val treeClassifer = new DecisionTreeClassifer() > val paramMap = new ParamMap() > .put(indexer.features, Map("userCountryIndex" -> "userCountry")) > .put(indexer.sortByFrequency, true) > .put(iteractor.features, Map("genderMatch" -> Array("userGender", > "targetGender"))) > .put(fvAssembler.features, Map("features" -> Array("genderMatch", > "userCountryIndex", "userFeatures"))) > .put(fvAssembler.dense, true) > .put(treeClassifer.maxDepth, 4) // By default, classifier recognizes > "features" and "label" columns. > val pipeline = Pipeline.create(indexer, interactor, fvAssembler, > treeClassifier) > val model = pipeline.fit(raw, paramMap) > sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event") > val test = sqlContext.sql(""" > SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, > user.gender AS userGender, user.country AS userCountry, > user.features AS userFeatures, > ad.targetGender AS targetGender > FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = > ad.id;""") > val prediction = model.transform(test).select('eventId, 'prediction) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3602) Can't run cassandra_inputformat.py
Frens Jan Rumph created SPARK-3602: -- Summary: Can't run cassandra_inputformat.py Key: SPARK-3602 URL: https://issues.apache.org/jira/browse/SPARK-3602 Project: Spark Issue Type: Bug Components: Examples, PySpark Affects Versions: 1.1.0 Environment: Ubuntu 14.04 Reporter: Frens Jan Rumph When I execute: {noformat} wget http://apache.cs.uu.nl/dist/spark/spark-1.1.0/spark-1.1.0-bin-hadoop2.4.tgz tar xzf spark-1.1.0-bin-hadoop2.4.tgz cd spark-1.1.0-bin-hadoop2.4/ ./bin/spark-submit --jars lib/spark-examples-1.1.0-hadoop2.4.0.jar examples/src/main/python/cassandra_inputformat.py localhost keyspace cf {noformat} The output is: {noformat} Spark assembly has been built with Hive, including Datanucleus jars on classpath Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/09/19 10:41:10 WARN Utils: Your hostname, laptop-x resolves to a loopback address: 127.0.0.1; using 192.168.2.2 instead (on interface wlan0) 14/09/19 10:41:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/09/19 10:41:10 INFO SecurityManager: Changing view acls to: frens-jan, 14/09/19 10:41:10 INFO SecurityManager: Changing modify acls to: frens-jan, 14/09/19 10:41:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(frens-jan, ); users with modify permissions: Set(frens-jan, ) 14/09/19 10:41:11 INFO Slf4jLogger: Slf4jLogger started 14/09/19 10:41:11 INFO Remoting: Starting remoting 14/09/19 10:41:11 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@laptop-x.local:43790] 14/09/19 10:41:11 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@laptop-x.local:43790] 14/09/19 10:41:11 INFO Utils: Successfully started service 'sparkDriver' on port 43790. 
14/09/19 10:41:11 INFO SparkEnv: Registering MapOutputTracker 14/09/19 10:41:11 INFO SparkEnv: Registering BlockManagerMaster 14/09/19 10:41:11 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140919104111-145e 14/09/19 10:41:11 INFO Utils: Successfully started service 'Connection manager for block manager' on port 45408. 14/09/19 10:41:11 INFO ConnectionManager: Bound socket to port 45408 with id = ConnectionManagerId(laptop-x.local,45408) 14/09/19 10:41:11 INFO MemoryStore: MemoryStore started with capacity 265.4 MB 14/09/19 10:41:11 INFO BlockManagerMaster: Trying to register BlockManager 14/09/19 10:41:11 INFO BlockManagerMasterActor: Registering block manager laptop-x.local:45408 with 265.4 MB RAM 14/09/19 10:41:11 INFO BlockManagerMaster: Registered BlockManager 14/09/19 10:41:11 INFO HttpFileServer: HTTP File server directory is /tmp/spark-5f0289d7-9b20-4bd7-a713-db84c38c4eac 14/09/19 10:41:11 INFO HttpServer: Starting HTTP Server 14/09/19 10:41:11 INFO Utils: Successfully started service 'HTTP file server' on port 36556. 14/09/19 10:41:11 INFO Utils: Successfully started service 'SparkUI' on port 4040. 14/09/19 10:41:11 INFO SparkUI: Started SparkUI at http://laptop-frens-jan.local:4040 14/09/19 10:41:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/09/19 10:41:12 INFO SparkContext: Added JAR file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/lib/spark-examples-1.1.0-hadoop2.4.0.jar at http://192.168.2.2:36556/jars/spark-examples-1.1.0-hadoop2.4.0.jar with timestamp 146072417 14/09/19 10:41:12 INFO Utils: Copying /home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/cassandra_inputformat.py to /tmp/spark-7dbb1b4d-016c-4f8b-858d-f79c9297f58f/cassandra_inputformat.py 14/09/19 10:41:12 INFO SparkContext: Added file file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/cassandra_inputformat.py at http://192.168.2.2:36556/files/cassandra_inputformat.py with timestamp 146072419 14/09/19 10:41:12 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@laptop-frens-jan.local:43790/user/HeartbeatReceiver 14/09/19 10:41:12 INFO MemoryStore: ensureFreeSpace(167659) called with curMem=0, maxMem=278302556 14/09/19 10:41:12 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 163.7 KB, free 265.3 MB) 14/09/19 10:41:12 INFO MemoryStore: ensureFreeSpace(167659) called with curMem=167659, maxMem=278302556 14/09/19 10:41:12 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 163.7 KB, free 265.1 MB) 14/09/19 10:41:12 INFO Converter: Loaded converter: org.apache.spark.examples.pythonconverters.CassandraCQLKeyConverter 14/09/19 10:41:12 INFO Converter: Loaded converter: org.apache.spark.examples.pythonconverters.CassandraCQLValueConverter Traceback (most recent call last): File "/home/frens-jan/Desktop/spark-1.1.0-bin-had
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140160#comment-14140160 ] Gaurav Mishra commented on SPARK-3434: -- A matrix being represented by multiple RDDs of sub-matrices may be helpful when an operation on the matrix requires computation over only a small set of its sub-matrices. However, operations like matrix multiplication require computation over all elements in the matrix (i.e. all elements need to be read). Therefore, at least in the case of matrix multiplication, keeping a single RDD seems to be a better idea. Keeping multiple RDDs in that case will only burden us further with the task of keeping track of all sub matrices. > Distributed block matrix > > > Key: SPARK-3434 > URL: https://issues.apache.org/jira/browse/SPARK-3434 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng > > This JIRA is for discussing distributed matrices stored in block > sub-matrices. The main challenge is the partitioning scheme to allow adding > linear algebra operations in the future, e.g.: > 1. matrix multiplication > 2. matrix factorization (QR, LU, ...) > Let's discuss the partitioning and storage and how they fit into the above > use cases. > Questions: > 1. Should it be backed by a single RDD that contains all of the sub-matrices > or many RDDs with each contains only one sub-matrix? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
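The argument above — that multiplication must read every sub-matrix of both operands — can be illustrated with a small local model. This is a hedged sketch, not a proposed Spark API: a plain `HashMap` keyed by "i,j" stands in for the single RDD of blocks, and the block count `N` and block size `B` are arbitrary choices for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Local model of block-matrix multiplication over sub-matrices keyed by
// block coordinates (i, j). A HashMap stands in for the single RDD of
// sub-matrices discussed above. Note the inner loop over k: computing
// C(i,j) reads the blocks A(i,k) and B(k,j) for every k, so no block of
// either operand can be skipped.
public class BlockMatrixSketch {
    static final int N = 2;  // blocks per dimension (hypothetical)
    static final int B = 2;  // rows/cols per block (hypothetical)

    static String key(int i, int j) { return i + "," + j; }

    static Map<String, double[][]> multiply(Map<String, double[][]> a,
                                            Map<String, double[][]> b) {
        Map<String, double[][]> c = new HashMap<>();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double[][] acc = new double[B][B];
                for (int k = 0; k < N; k++) {  // every block pair (i,k),(k,j) is read
                    double[][] ab = a.get(key(i, k));
                    double[][] bb = b.get(key(k, j));
                    for (int r = 0; r < B; r++)
                        for (int s = 0; s < B; s++)
                            for (int t = 0; t < B; t++)
                                acc[r][s] += ab[r][t] * bb[t][s];
                }
                c.put(key(i, j), acc);
            }
        return c;
    }
}
```

In a distributed setting the same access pattern would become a join/shuffle over the block keys, which is why keeping all blocks in one RDD simplifies bookkeeping for this operation.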
[jira] [Commented] (SPARK-3536) SELECT on empty parquet table throws exception
[ https://issues.apache.org/jira/browse/SPARK-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140145#comment-14140145 ] Ravindra Pesala commented on SPARK-3536: Parquet returns null metadata when querying an empty Parquet file while calculating splits. Adding a null check and returning empty splits solves the issue. > SELECT on empty parquet table throws exception > -- > > Key: SPARK-3536 > URL: https://issues.apache.org/jira/browse/SPARK-3536 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust > Labels: starter > > Reported by [~matei]. Reproduce as follows: > {code} > scala> case class Data(i: Int) > defined class Data > scala> createParquetFile[Data]("testParquet") > scala> parquetFile("testParquet").count() > 14/09/15 14:34:17 WARN scheduler.DAGScheduler: Creating new stage failed due > to exception - job: 0 > java.lang.NullPointerException > at > org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:438) > at > parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344) > at > org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
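The null-guard pattern suggested in the comment above can be sketched as follows. This is an illustrative simplification, not the actual Spark or parquet-mr source: `Footer`, `Split`, and the `getSplits` signature are hypothetical stand-ins for the real Parquet/Hadoop types.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the proposed fix: when the Parquet metadata (footers) for an
// empty table is null, return empty splits instead of dereferencing it.
// Footer and Split are hypothetical stand-ins for the real types.
public class EmptyParquetGuard {
    static class Footer { }
    static class Split { }

    static List<Split> getSplits(List<Footer> footers) {
        if (footers == null) {
            // Empty table: no metadata, so no splits -- avoids the NPE.
            return Collections.emptyList();
        }
        List<Split> splits = new ArrayList<>();
        for (Footer f : footers) {
            splits.add(new Split()); // one split per footer, schematically
        }
        return splits;
    }
}
```

The same guard placed at the top of `FilteringParquetRowInputFormat.getSplits` (the frame at `ParquetTableOperations.scala:438` in the trace) would make a count on an empty table return 0 rather than throw.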
[jira] [Updated] (SPARK-3601) Kryo NPE for output operations on Avro complex Objects even after registering.
[ https://issues.apache.org/jira/browse/SPARK-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mohan gaddam updated SPARK-3601: Description: The Kryo serializer works well when Avro objects contain simple data, but when the same Avro object contains complex data (like unions/arrays), Kryo fails during output operations, although mappings work fine. Note that I have registered all the Avro-generated classes with Kryo. I'm using Java as the programming language. Using a complex message throws an NPE; stack trace as follows: == ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: value (xyz.Datum) data (xyz.ResMsg) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at 
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) In the above exception, Datum and ResMsg are project specific classes generated by avro using the below avdl snippet: == record KeyValueObject { union{boolean, int, long, float, double, bytes, string} name; union {boolean, int, long, float, double, bytes, string, array, KeyValueObject} value; } record Datum { union {boolean, int, long, float, double, bytes, string, array, KeyValueObject} value; } record ResMsg { string version; string sequence; string resourceGUID; string GWID; string GWTimestamp; union {Datum, array} data; } avro message samples are as follows: 1) {"version": "01", "sequence": "1", "resourceGUID": "001", "GWID": "002", "GWTimestamp": "1409823150737", "data": {"value": "30"}} 2) {"version": "01", "sequence": "1", "resource": "sensor-001", "controller": "002", "controllerTimestamp": "1411038710358", "data": {"value": [ {"name": "Temperature", "value": "30"},{"name": "Speed", "value": "60"},{"name": "Location", "value": ["+401213.1", "-0750015.1"]},{"name": "Timestamp", "value": "2014-09-09T08:15:25-05:00"}]}} both 1 and 2 adhere to the avro schema, so decoder is able to convert them into avro objects in spark streaming api. BTW the messages were pulled from kafka source, and decoded by using kafka decoder. was: Kryo serializer works well when avro objects has simple data. but when the same avro object has complex data(like unions/arrays) kryo fails while output operations. but mappings are good. 
Note that i have registered all the Avro generated classes with kryo. Im using Java as programming language. when used complex message throws NPE, stack trace as follows: == ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: value (xyz.Datum) data (xyz.ResMsg) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.schedul
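For context on the registration step the reporter mentions, a Spark 1.1-era Kryo configuration in Java typically looks like the sketch below. This is a configuration sketch under stated assumptions: `Datum`, `ResMsg`, and `KeyValueObject` are the Avro-generated classes from the report, and the `xyz` package name mirrors the one in the stack trace but is otherwise hypothetical.

```java
import com.esotericsoftware.kryo.Kryo;
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.KryoRegistrator;

// Registrator that tells Kryo about the Avro-generated classes.
public class MyKryoRegistrator implements KryoRegistrator {
    @Override
    public void registerClasses(Kryo kryo) {
        kryo.register(xyz.Datum.class);
        kryo.register(xyz.ResMsg.class);
        kryo.register(xyz.KeyValueObject.class);
    }
}

// Wiring it into the SparkConf (Spark 1.1 style):
// SparkConf conf = new SparkConf()
//     .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//     .set("spark.kryo.registrator", "xyz.MyKryoRegistrator");
```

Registering the top-level classes alone may not cover the nested union/array field types that Avro generates, which is consistent with the report that simple records serialize fine while complex ones fail; a custom Kryo serializer for Avro records is a common workaround in that situation.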
[jira] [Updated] (SPARK-3601) Kryo NPE for output operations on Avro complex Objects even after registering.
[ https://issues.apache.org/jira/browse/SPARK-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mohan gaddam updated SPARK-3601: Description: Kryo serializer works well when avro objects has simple data. but when the same avro object has complex data(like unions/arrays) kryo fails while output operations. but mappings are good. Note that i have registered all the Avro generated classes with kryo. Im using Java as programming language. when used complex message throws NPE, stack trace as follows: == ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: value (xyz.Datum) data (xyz.ResMsg) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at 
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) In the above exception, Datum and ResMsg are project specific classes generated by avro using the below avdl snippet: == record KeyValueObject { union{boolean, int, long, float, double, bytes, string} name; union {boolean, int, long, float, double, bytes, string, array, KeyValueObject} value; } record Datum { union {boolean, int, long, float, double, bytes, string, array, KeyValueObject} value; } record ResMsg { string version; string sequence; string resourceGUID; string GWID; string GWTimestamp; union {Datum, array} data; } avro message samples are as follows: 1) {"version": "01", "sequence": "1", "resourceGUID": "001", "GWID": "002", "GWTimestamp": "1409823150737", "data": {"value": "30"}} 2) {"version": "01", "sequence": "1", "resource": "sensor-001", "controller": "002", "controllerTimestamp": "1411038710358", "data": {"value": [ {"name": "Temperature", "value": "30"}, {"name": "Speed", "value": "60"}, {"name": "Location", "value": ["+401213.1", "-0750015.1"]}, {"name": "Timestamp", "value": "2014-09-09T08:15:25-05:00"}]}} both 1 and 2 adhere to the avro schema, so decoder is able to convert them into avro objects in spark streaming api. BTW the messages were pulled from kafka source, and decoded by using kafka decoder. was: Kryo serializer works well when avro objects has simple data. but when the same avro object has complex data(like unions/arrays) kryo fails while output operations. but mappings are good. 
Note that i have registered all the Avro generated classes with kryo. Im using Java as programming language. when used complex message throws NPE, stack trace as follows: == ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: value (xyz.Datum) data (xyz.ResMsg) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.sche
[jira] [Updated] (SPARK-3601) Kryo NPE for output operations on Avro complex Objects even after registering.
[jira] [Updated] (SPARK-3601) Kryo NPE for output operations on Avro complex Objects even after registering.
[ https://issues.apache.org/jira/browse/SPARK-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mohan gaddam updated SPARK-3601:
Description:
The Kryo serializer works well when an Avro object carries simple data, but when the same Avro object carries complex data (unions/arrays), Kryo fails during output operations; map operations are fine. Note that all of the Avro-generated classes have been registered with Kryo, and the application is written in Java. A complex message throws an NPE with the following stack trace:
{code}
ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
value (xyz.Datum)
data (xyz.ResMsg)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}
In the above exception, Datum and ResMsg are project-specific classes generated by Avro from the avdl snippet below:
{code}
record KeyValueObject {
  union {boolean, int, long, float, double, bytes, string} name;
  union {boolean, int, long, float, double, bytes, string, array, KeyValueObject} value;
}
record Datum {
  union {boolean, int, long, float, double, bytes, string, array, KeyValueObject} value;
}
record ResMsg {
  string version;
  string sequence;
  string resourceGUID;
  string GWID;
  string GWTimestamp;
  union {Datum, array} data;
}
{code}
Sample Avro messages:
1) {"version": "01", "sequence": "1", "resourceGUID": "001", "GWID": "002", "GWTimestamp": "1409823150737", "data": {"value": "30"}}
2) {"version": "01", "sequence": "1", "resource": "sensor-001", "controller": "002", "controllerTimestamp": "1411038710358", "data": {"value": [{"name": "Temperature", "value": "30"}, {"name": "Speed", "value": "60"}, {"name": "Location", "value": ["+401213.1", "-0750015.1"]}, {"name": "Timestamp", "value": "2014-09-09T08:15:25-05:00"}]}}
Both samples adhere to the Avro schema, so the decoder is able to convert them into Avro objects in the Spark Streaming API. The messages were pulled from a Kafka source and decoded with a Kafka decoder.
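For reference, in Spark 1.x registering classes with Kryo, as the report describes, is typically done through a custom `KryoRegistrator` named in the Spark configuration. A minimal sketch follows; the `xyz.*` class names come from the report's stack trace and are otherwise hypothetical:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Sketch only: register the Avro-generated classes from the report so Kryo
// does not fall back to writing unregistered class names.
class AvroRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[xyz.Datum])
    kryo.register(classOf[xyz.ResMsg])
    kryo.register(classOf[xyz.KeyValueObject])
  }
}

// Wire the registrator into the Spark configuration.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "AvroRegistrator")
```

Note that registration alone leaves Kryo's default FieldSerializer in charge of the Avro objects; the NPE reported here suggests that default serialization of the union-typed fields is where the failure occurs.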
[jira] [Created] (SPARK-3601) Kryo NPE for output operations on Avro complex Objects even after registering.
mohan gaddam created SPARK-3601:
Summary: Kryo NPE for output operations on Avro complex Objects even after registering.
Key: SPARK-3601
URL: https://issues.apache.org/jira/browse/SPARK-3601
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.0
Environment: local, standalone cluster
Reporter: mohan gaddam

The Kryo serializer works well when an Avro object carries simple data, but when the same Avro object carries complex data (unions/arrays), Kryo fails during output operations; map operations are fine. Note that all of the Avro-generated classes have been registered with Kryo, and the application is written in Java. A complex message throws an NPE with the following stack trace:
{code}
ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
value (xyz.Datum)
data (xyz.ResourceMessage)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}
In the above exception, Datum and ResourceMessage are project-specific classes generated by Avro from the avdl snippet below:
{code}
record KeyValueObject {
  union {boolean, int, long, float, double, bytes, string} name;
  union {boolean, int, long, float, double, bytes, string, array, KeyValueObject} value;
}
record Datum {
  union {boolean, int, long, float, double, bytes, string, array, KeyValueObject} value;
}
record ResourceMessage {
  string version;
  string sequence;
  string resourceGUID;
  string GWID;
  string GWTimestamp;
  union {Datum, array} data;
}
{code}
Sample Avro messages:
1) {"version": "01", "sequence": "1", "resourceGUID": "001", "GWID": "002", "GWTimestamp": "1409823150737", "data": {"value": "30"}}
2) {"version": "01", "sequence": "1", "resource": "sensor-001", "controller": "002", "controllerTimestamp": "1411038710358", "data": {"value": [{"name": "Temperature", "value": "30"}, {"name": "Speed", "value": "60"}, {"name": "Location", "value": ["+401213.1", "-0750015.1"]}, {"name": "Timestamp", "value": "2014-09-09T08:15:25-05:00"}]}}
Both samples adhere to the Avro schema, so the decoder is able to convert them into Avro objects in the Spark Streaming API. The messages were pulled from a Kafka source and decoded with a Kafka decoder.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138829#comment-14138829 ] Alexander Ulanov edited comment on SPARK-3403 at 9/19/14 7:16 AM:
Thank you, your answers are really helpful. Should I submit this issue to OpenBLAS (https://github.com/xianyi/OpenBLAS) or netlib-java (https://github.com/fommil/netlib-java)? I thought the latter has the JNI implementation. Is it ok to submit it as is?
> NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
>
> Key: SPARK-3403
> URL: https://issues.apache.org/jira/browse/SPARK-3403
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.0.2
> Environment: Setup: Windows 7, x64 libraries for netlib-java (as described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and MinGW64 precompiled dlls.
> Reporter: Alexander Ulanov
> Fix For: 1.2.0
>
> Attachments: NativeNN.scala
>
> Code:
> val model = NaiveBayes.train(train)
> val predictionAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> predictionAndLabels.foreach(println)
> Result:
> program crashes with: "Process finished with exit code -1073741819 (0xC0000005)" after displaying the first prediction
[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
[ https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140128#comment-14140128 ] Alexander Ulanov commented on SPARK-3403:
Posted to netlib-java: https://github.com/fommil/netlib-java/issues/72
[jira] [Commented] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140124#comment-14140124 ] Ravindra Pesala commented on SPARK-3298:
I think we should add an API such as *SqlContext.isTableExists(tableName)* to check whether a table already exists. Using this API, a user could check for the table's existence before registering it. The current API *SqlContext.table(tableName)* throws an exception if the table is not present, so it cannot be used for this purpose. Please comment.
> [SQL] registerAsTable / registerTempTable overwrites old tables
>
> Key: SPARK-3298
> URL: https://issues.apache.org/jira/browse/SPARK-3298
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.0.2
> Reporter: Evan Chan
> Priority: Minor
> Labels: newbie
>
> At least in Spark 1.0.2, calling registerAsTable("a") when "a" had been registered before does not cause an error. However, there is no way to access the old table, even though it may be cached and taking up space.
> How about at least throwing an error?
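Until an `isTableExists`-style API exists, the behavior the comment describes can be approximated by wrapping the exception-throwing `table` call. A rough sketch of the workaround, not an official API:

```scala
import scala.util.Try
import org.apache.spark.sql.SQLContext

// SQLContext.table throws when the name is not registered, so probe
// existence by catching that failure. tableExists is a hypothetical helper.
def tableExists(sqlContext: SQLContext, name: String): Boolean =
  Try(sqlContext.table(name)).isSuccess

// Usage: register only when the name is not already taken, so an existing
// (possibly cached) table is not silently overwritten.
// if (!tableExists(sqlContext, "a")) schemaRDD.registerTempTable("a")
```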
[jira] [Updated] (SPARK-3600) RDD[Double] doesn't use primitive arrays for caching
[ https://issues.apache.org/jira/browse/SPARK-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3600: - Target Version/s: (was: 1.1.1, 1.2.0)
> RDD[Double] doesn't use primitive arrays for caching
>
> Key: SPARK-3600
> URL: https://issues.apache.org/jira/browse/SPARK-3600
> Project: Spark
> Issue Type: Improvement
> Affects Versions: 1.1.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
>
> RDD's classTag is not passed in through CacheManager. So RDD[Double] uses
> object arrays for caching, which leads to huge overhead. However, we need to
> send the classTag down many levels to make it work.
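The overhead comes from type erasure: without a ClassTag, a cached partition of doubles becomes an array of boxed java.lang.Double objects, while a ClassTag lets the runtime allocate a primitive double[]. A small illustration in plain Scala, independent of Spark's CacheManager:

```scala
import scala.reflect.ClassTag

// With a ClassTag in scope, Array.ofDim can allocate a primitive array;
// with only erased type information, elements end up as boxed objects.
def alloc[T: ClassTag](n: Int): Array[T] = Array.ofDim[T](n)

val primitive = alloc[Double](4)  // backed by a primitive double[]
val boxed     = new Array[Any](4) // elements stored as boxed java.lang.Double

println(primitive.getClass.getSimpleName) // "double[]"
```

Each boxed Double carries an object header and a pointer indirection, which is the "huge overhead" the issue describes for large cached RDD[Double] partitions.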
[jira] [Updated] (SPARK-3600) RDD[Double] doesn't use primitive arrays for caching
[ https://issues.apache.org/jira/browse/SPARK-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3600: - Assignee: (was: Xiangrui Meng)
[jira] [Updated] (SPARK-3600) RDD[Double] doesn't use primitive arrays for caching
[ https://issues.apache.org/jira/browse/SPARK-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3600: - Component/s: (was: MLlib)
[jira] [Updated] (SPARK-3600) RDD[Double] doesn't use primitive arrays for caching
[ https://issues.apache.org/jira/browse/SPARK-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3600: - Issue Type: Improvement (was: Bug)
[jira] [Updated] (SPARK-3600) RDD[Double] doesn't use primitive arrays for caching
[ https://issues.apache.org/jira/browse/SPARK-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3600: - Description: RDD's classTag is not passed in through CacheManager. So RDD[Double] uses object arrays for caching, which leads to huge overhead. However, we need to send the classTag down many levels to make it work. (was: RandomDataGenerator doesn't have a classTag or @specialized. So the generated RDDs are RDDs of objects, which causes huge storage overhead.)
[jira] [Updated] (SPARK-3600) RDD[Double] doesn't use primitive arrays for caching
[ https://issues.apache.org/jira/browse/SPARK-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3600: - Summary: RDD[Double] doesn't use primitive arrays for caching (was: RandomRDDs doesn't create primitive typed RDDs)