[jira] [Commented] (SPARK-5947) First class partitioning support in data sources API
[ https://issues.apache.org/jira/browse/SPARK-5947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336556#comment-14336556 ]

Philippe Girolami commented on SPARK-5947:
-------------------------------------------

For some workloads it can make more sense to use SKEWED ON rather than PARTITION, to avoid creating thousands of tiny partitions just to handle a few large ones. As far as I can tell, these two cases can't be inferred from a directory layout, so maybe it would make sense to make PARTITION & SKEW part of Spark too, and to rely on metadata defined by the application rather than on directory discovery?

> First class partitioning support in data sources API
> -----------------------------------------------------
>
>                 Key: SPARK-5947
>                 URL: https://issues.apache.org/jira/browse/SPARK-5947
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Cheng Lian
>
> For file system based data sources, implementing Hive-style partitioning
> support can be complex and error prone. To be specific, partitioning support
> includes:
> # Partition discovery: given a directory organized like Hive partitions,
> discover the directory structure and partitioning information automatically,
> including partition column names, data types, and values.
> # Reading from partitioned tables
> # Writing to partitioned tables
> It would be good to have first class partitioning support in the data sources
> API. For example, add a {{FileBasedScan}} trait with callbacks and default
> implementations for these features.
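To make the discovery step in the description concrete, here is a minimal, purely illustrative Scala sketch of recovering partition columns from a Hive-style directory layout. The object and method names are hypothetical, not part of the proposal; real discovery would also have to infer data types and merge specs across files.
{code}
// Hypothetical sketch: parse partition column names and (string) values out
// of a Hive-style path such as /warehouse/logs/year=2015/month=02/part-00000.
object PartitionDiscovery {
  def partitionSpec(path: String): Seq[(String, String)] =
    path.split("/").toSeq
      .filter(_.contains("="))          // keep only column=value segments
      .map { segment =>
        val Array(column, value) = segment.split("=", 2)
        column -> value
      }
}

// PartitionDiscovery.partitionSpec("/warehouse/logs/year=2015/month=02/part-00000")
// returns Seq(("year", "2015"), ("month", "02"))
{code}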
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334727#comment-14334727 ]

Philippe Girolami commented on SPARK-1867:
-------------------------------------------

[~srowen] I'm only getting this issue when building from source myself using Maven. Every snapshot, RC & release I've downloaded has worked just fine, so for me it's a packaging issue when building locally.

> Spark Documentation Error causes java.lang.IllegalStateException: unread
> block data
> -------------------------------------------------------------------------
>
>                 Key: SPARK-1867
>                 URL: https://issues.apache.org/jira/browse/SPARK-1867
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: sam
>
> I've employed two System Administrators on a contract basis (for quite a bit
> of money), and both contractors have independently hit the following
> exception. What we are doing is:
> 1. Installing Spark 0.9.1 according to the documentation on the website,
> along with the CDH4 (and, on another cluster, CDH5) distros of hadoop/hdfs.
> 2. Building a fat jar with a Spark app with sbt, then trying to run it on
> the cluster.
> I've also included code snippets and sbt deps at the bottom.
> When I've Googled this, there seem to be two somewhat vague responses:
> a) Mismatching Spark versions on nodes/user code
> b) Need to add more jars to the SparkConf
> Now I know that (b) is not the problem, having successfully run the same code
> on other clusters while only including one jar (it's a fat jar).
> But I have no idea how to check for (a) - it appears Spark doesn't have any
> version checks or anything - it would be nice if it checked versions and
> threw a "mismatching version exception: you have user code using version X
> and node Y has version Z".
> I would be very grateful for advice on this.
> The exception:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task
> 0.0:1 failed 32 times (most recent failure: Exception failure:
> java.lang.IllegalStateException: unread block data)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>         at scala.Option.foreach(Option.scala:236)
>         at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to
> java.lang.IllegalStateException: unread block data [duplicate 59]
> My code snippet:
> val conf = new SparkConf()
>   .setMaster(clusterMaster)
>   .setAppName(appName)
>   .setSparkHome(sparkHome)
>   .setJars(SparkContext.jarOfClass(this.getClass))
> println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
> My SBT dependencies:
> // relevant
> "org.apache.spark" % "spark-core_2.10" % "0.9.1",
> "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",
> // standard, probably unrelated
> "com.github.seratch" %% "awscala" % "[0.2,)",
> "org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
> "org.specs2" %% "specs2" % "1.14" % "test",
> "org.scala-lang" % "scala-reflect" % "2.10.3",
> "org.scalaz" %% "scalaz-core" % "7.0.5",
> "net.minidev" % "json-smart" % "1.2"
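The "mismatching version exception" the reporter asks for amounts to a fail-fast guard. A minimal sketch of the idea, with hypothetical names (this is not an existing Spark API):
{code}
// Hypothetical sketch: compare the Spark version baked into the user's fat
// jar with the version running on a node, and fail with an explicit message
// instead of an opaque "unread block data" error during deserialization.
def checkSparkVersion(userCodeVersion: String, nodeVersion: String): Unit =
  if (userCodeVersion != nodeVersion)
    throw new IllegalStateException(
      s"Mismatching version exception: user code uses Spark $userCodeVersion " +
      s"but this node runs Spark $nodeVersion")
{code}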
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321747#comment-14321747 ]

Philippe Girolami commented on SPARK-1867:
-------------------------------------------

I tried again this afternoon using the 1.3 snapshot and I still run into it. I have checked that the Java version I compiled with is the same as the one I'm running:
{code}
Philippes-MacBook-Air-3:spark Philippe$ java -version
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)

Philippes-MacBook-Air-3:spark Philippe$ mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T22:58:10+02:00)
Maven home: /Users/Philippe/Documents/apache-maven-3.2.3
Java version: 1.7.0_75, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_75.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.9.5", arch: "x86_64", family: "mac"
{code}
Maybe another JVM gets chosen when I run spark-shell? Is there a log somewhere that I could check?
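One way to answer that question without hunting for a log is to ask the running REPL directly; {{System.getProperty}} is standard Java, so this should work in any spark-shell:
{code}
// Run inside spark-shell: shows which JVM is actually executing the shell,
// to compare against the JVM Maven reported at build time (1.7.0_75 above).
scala> System.getProperty("java.version")
scala> System.getProperty("java.home")
{code}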
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307719#comment-14307719 ]

Philippe Girolami commented on SPARK-1867:
-------------------------------------------

[~srowen] I'm unfortunately reporting this bug. To mitigate SPARK-5557, I've reverted my working branch to commit cd5da42 until it gets sorted out. I should have included the stack trace. I think someone could easily verify this by doing a clean clone, checking out cd5da42, and building the way I describe; then it's simply a matter of launching spark-shell. If that works for you, then I agree it's on my side, but I can't imagine how it could be, given the steps I describe to reproduce it.
{code}
Philippes-MacBook-Air-3:spark Philippe$ bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/02/05 19:13:15 INFO SecurityManager: Changing view acls to: Philippe
15/02/05 19:13:15 INFO SecurityManager: Changing modify acls to: Philippe
15/02/05 19:13:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Philippe); users with modify permissions: Set(Philippe)
15/02/05 19:13:15 INFO HttpServer: Starting HTTP Server
15/02/05 19:13:16 INFO Utils: Successfully started service 'HTTP class server' on port 61040.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
15/02/05 19:13:21 INFO SparkContext: Running Spark version 1.3.0-SNAPSHOT
15/02/05 19:13:21 INFO SecurityManager: Changing view acls to: Philippe
15/02/05 19:13:21 INFO SecurityManager: Changing modify acls to: Philippe
15/02/05 19:13:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Philippe); users with modify permissions: Set(Philippe)
15/02/05 19:13:22 INFO Slf4jLogger: Slf4jLogger started
15/02/05 19:13:22 INFO Remoting: Starting remoting
15/02/05 19:13:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.31:61043]
15/02/05 19:13:22 INFO Utils: Successfully started service 'sparkDriver' on port 61043.
15/02/05 19:13:22 INFO SparkEnv: Registering MapOutputTracker
15/02/05 19:13:22 INFO SparkEnv: Registering BlockManagerMaster
15/02/05 19:13:22 INFO DiskBlockManager: Created local directory at /var/folders/8r/0ty24ys52kvdvx8r6nz2cdc0gn/T/spark-local-20150205191322-7e22
15/02/05 19:13:22 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/02/05 19:13:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/05 19:13:23 INFO HttpFileServer: HTTP File server directory is /var/folders/8r/0ty24ys52kvdvx8r6nz2cdc0gn/T/spark-8400830a-a7fc-4909-ae37-ee4b48e3ff88
15/02/05 19:13:23 INFO HttpServer: Starting HTTP Server
15/02/05 19:13:23 INFO Utils: Successfully started service 'HTTP file server' on port 61044.
15/02/05 19:13:23 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
15/02/05 19:13:23 INFO Utils: Successfully started service 'SparkUI' on port 4041.
15/02/05 19:13:23 INFO SparkUI: Started SparkUI at http://192.168.1.31:4041
15/02/05 19:13:23 INFO Executor: Using REPL class URI: http://192.168.1.31:61040
15/02/05 19:13:23 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.1.31:61043/user/HeartbeatReceiver
15/02/05 19:13:23 INFO NettyBlockTransferService: Server created on 61046
15/02/05 19:13:23 INFO BlockManagerMaster: Trying to register BlockManager
15/02/05 19:13:23 INFO BlockManagerMasterActor: Registering block manager localhost:61046 with 265.4 MB RAM, BlockManagerId(<driver>, localhost, 61046)
15/02/05 19:13:23 INFO BlockManagerMaster: Registered BlockManager
15/02/05 19:13:23 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> val source = sc.textFile("/tmp/test")
15/02/05 19:13:27 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=278302556
15/02/05 19:13:27 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 265.3 MB)
15/02/05 19:13:27 INFO MemoryStore: ensureFreeSpace(22736) called with curMem=163705, maxMem=278302556
15/02/05 19:13:27 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 265.2 MB)
15/02/05 19:13:27 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:61046 (size: 22.2 KB, free: 265.4 MB)
15/02/05 19:13:27 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/02/05 19:13:27 INFO SparkContext: Created broad
{code}
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307580#comment-14307580 ]

Philippe Girolami commented on SPARK-1867:
-------------------------------------------

Has anyone figured this out? I'm seeing this happen when running spark-shell off the master branch (at cd5da42), using the same example as [~ansonism]. It works fine in 1.2.0, downloaded from the website.
{code}
val source = sc.textFile("/tmp/testfile.txt")
source.saveAsTextFile("/tmp/test_spark_output")
{code}
I built master using
{code}
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -Pbigtop-dist -DskipTests clean package install
{code}
on MacOS using Sun Java 7:
{quote}
java version "1.7.0_60"
Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
{quote}
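As a further cross-check between build time and run time, the JDK Maven used is usually recorded in the assembly jar's manifest under the {{Build-Jdk}} attribute (written by maven-archiver). A small sketch, runnable from spark-shell; the jar path is illustrative:
{code}
// Hypothetical sketch: read the Build-Jdk entry from the assembly manifest
// and compare it with the JVM running right now.
import java.util.jar.JarFile
val jar = new JarFile("assembly/target/spark-assembly.jar") // illustrative path
val buildJdk = Option(jar.getManifest).map(_.getMainAttributes.getValue("Build-Jdk"))
println(s"built with JDK $buildJdk, running on ${System.getProperty("java.version")}")
{code}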