[jira] [Commented] (SPARK-5947) First class partitioning support in data sources API

2015-02-25 Thread Philippe Girolami (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336556#comment-14336556
 ] 

Philippe Girolami commented on SPARK-5947:
--

For some workloads it can make more sense to use SKEWED BY rather than 
PARTITIONED BY, to avoid creating thousands of tiny partitions just to handle a 
few large ones.
As far as I can tell, these two cases can't be inferred from a directory layout, 
so maybe it would make sense to make both partitioning and skew first-class in 
Spark too, and rely on metadata defined by the application rather than on 
directory discovery?
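
For concreteness, here is roughly the Hive DDL I'm contrasting (table and 
column names are made up, and this assumes the statements get handed through to 
Hive, e.g. via a {{HiveContext}}): PARTITIONED BY creates one directory per 
distinct value, whereas SKEWED BY keeps a single table and just lists the heavy 
values.
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // assumes an existing SparkContext `sc`

// PARTITIONED BY: one directory per distinct customer_id, so thousands of
// small customers mean thousands of tiny partition directories.
hiveContext.sql("""
  CREATE TABLE events_partitioned (id BIGINT, payload STRING)
  PARTITIONED BY (customer_id STRING)
""")

// SKEWED BY: a single table; only the explicitly listed heavy values get
// special treatment (optionally their own subdirectories).
hiveContext.sql("""
  CREATE TABLE events_skewed (id BIGINT, payload STRING, customer_id STRING)
  SKEWED BY (customer_id) ON ('big_customer_1', 'big_customer_2')
  STORED AS DIRECTORIES
""")
{code}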

> First class partitioning support in data sources API
> 
>
> Key: SPARK-5947
> URL: https://issues.apache.org/jira/browse/SPARK-5947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Lian
>
> For file-system-based data sources, implementing Hive-style partitioning 
> support can be complex and error-prone. To be specific, partitioning support 
> includes:
> # Partition discovery:  Given a directory organized similar to Hive 
> partitions, discover the directory structure and partitioning information 
> automatically, including partition column names, data types, and values.
> # Reading from partitioned tables
> # Writing to partitioned tables
> It would be good to have first class partitioning support in the data sources 
> API. For example, add a {{FileBasedScan}} trait with callbacks and default 
> implementations for these features.
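> As an illustration (the paths and column names below are made up), a 
> Hive-style layout encodes partition columns in directory names, and discovery 
> means recovering those columns, their types, and their values from the paths 
> alone. The 1.3 Parquet data source already does something like this; the point 
> is to make the same machinery reusable by any file-based source:
> {code}
> // Hypothetical layout a data source would need to decode:
> //   /data/events/year=2015/month=02/part-00000.parquet
> //   /data/events/year=2015/month=03/part-00000.parquet
> // Discovery infers the partition columns (year: Int, month: Int) and their
> // values from the directory names and appends them to every row read.
> val sqlContext = new org.apache.spark.sql.SQLContext(sc) // assumes an existing SparkContext `sc`
> val events = sqlContext.parquetFile("/data/events")
> events.printSchema() // should list the discovered `year` and `month` columns
> {code}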



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2015-02-24 Thread Philippe Girolami (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334727#comment-14334727
 ] 

Philippe Girolami commented on SPARK-1867:
--

[~srowen] I'm only getting this issue when building from source myself with 
Maven. Every snapshot, RC, and release I've downloaded has worked just fine, so 
for me it's a packaging issue when building locally.


> Spark Documentation Error causes java.lang.IllegalStateException: unread 
> block data
> ---
>
> Key: SPARK-1867
> URL: https://issues.apache.org/jira/browse/SPARK-1867
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: sam
>
> I've employed two System Administrators on a contract basis (for quite a bit 
> of money), and both contractors have independently hit the following 
> exception.  What we are doing is:
> 1. Installing Spark 0.9.1 according to the documentation on the website, 
> along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs.
> 2. Building a fat jar of a Spark app with sbt, then trying to run it on the 
> cluster
> I've also included code snippets, and sbt deps at the bottom.
> When I've Googled this, there seem to be two somewhat vague responses:
> a) Mismatching spark versions on nodes/user code
> b) Need to add more jars to the SparkConf
> Now I know that (b) is not the problem, having successfully run the same code 
> on other clusters while only including one jar (it's a fat jar).
> But I have no idea how to check for (a) - it appears Spark doesn't have any 
> version checks or anything - it would be nice if it checked versions and 
> threw a "mismatching version exception: you have user code using version X 
> and node Y has version Z".
> I would be very grateful for advice on this.
> The exception:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 
> 0.0:1 failed 32 times (most recent failure: Exception failure: 
> java.lang.IllegalStateException: unread block data)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to 
> java.lang.IllegalStateException: unread block data [duplicate 59]
> My code snippet:
> val conf = new SparkConf()
>.setMaster(clusterMaster)
>.setAppName(appName)
>.setSparkHome(sparkHome)
>.setJars(SparkContext.jarOfClass(this.getClass))
> println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
> My SBT dependencies:
> // relevant
> "org.apache.spark" % "spark-core_2.10" % "0.9.1",
> "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",
> // standard, probably unrelated
> "com.github.seratch" %% "awscala" % "[0.2,)",
> "org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
> "org.specs2" %% "specs2" % "1.14" % "test",
> "org.scala-lang" % "scala-reflect" % "2.10.3",
> "org.scalaz" %% "scalaz-core" % "7.0.5",
> "net.minidev" % "json-smart" % "1.2"




[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2015-02-14 Thread Philippe Girolami (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321747#comment-14321747
 ] 

Philippe Girolami commented on SPARK-1867:
--

I tried again this afternoon using the 1.3 snapshot and I still run into it. I 
have checked that the Java version I compiled with is the same as the one I'm 
running:
{code}
Philippes-MacBook-Air-3:spark Philippe$ java -version
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)
Philippes-MacBook-Air-3:spark Philippe$ mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 
2014-08-11T22:58:10+02:00)
Maven home: /Users/Philippe/Documents/apache-maven-3.2.3
Java version: 1.7.0_75, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_75.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.9.5", arch: "x86_64", family: "mac"
{code}

Maybe another JVM gets chosen when I run spark-shell? Is there a log somewhere 
that I could check?
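
One way I can check from inside spark-shell itself, rather than hunting for a 
log (these are just the standard JVM system properties, nothing Spark-specific):
{code}
// Run in the spark-shell REPL; prints which JVM the shell actually picked up.
println(System.getProperty("java.home"))    // e.g. .../jdk1.7.0_75.jdk/Contents/Home/jre
println(System.getProperty("java.version")) // should match the JDK used for the Maven build
println(System.getProperty("java.vm.name"))
{code}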


[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2015-02-05 Thread Philippe Girolami (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307719#comment-14307719
 ] 

Philippe Girolami commented on SPARK-1867:
--

[~srowen] I'm unfortunately reporting this bug. To mitigate SPARK-5557, I've 
reverted my working branch to commit cd5da42 until it gets sorted out. I should 
have included the stack trace. I think someone could easily verify by doing a 
clean clone, checking out cd5da42, and building the way I describe; then it's 
simply a matter of launching spark-shell. If that works for you, then I agree 
it's on my side, but I can't imagine how it could be, given the steps I describe 
to reproduce it.

{code}
Philippes-MacBook-Air-3:spark Philippe$ bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/02/05 19:13:15 INFO SecurityManager: Changing view acls to: Philippe
15/02/05 19:13:15 INFO SecurityManager: Changing modify acls to: Philippe
15/02/05 19:13:15 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(Philippe); users 
with modify permissions: Set(Philippe)
15/02/05 19:13:15 INFO HttpServer: Starting HTTP Server
15/02/05 19:13:16 INFO Utils: Successfully started service 'HTTP class server' 
on port 61040.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
15/02/05 19:13:21 INFO SparkContext: Running Spark version 1.3.0-SNAPSHOT
15/02/05 19:13:21 INFO SecurityManager: Changing view acls to: Philippe
15/02/05 19:13:21 INFO SecurityManager: Changing modify acls to: Philippe
15/02/05 19:13:21 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(Philippe); users 
with modify permissions: Set(Philippe)
15/02/05 19:13:22 INFO Slf4jLogger: Slf4jLogger started
15/02/05 19:13:22 INFO Remoting: Starting remoting
15/02/05 19:13:22 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@192.168.1.31:61043]
15/02/05 19:13:22 INFO Utils: Successfully started service 'sparkDriver' on 
port 61043.
15/02/05 19:13:22 INFO SparkEnv: Registering MapOutputTracker
15/02/05 19:13:22 INFO SparkEnv: Registering BlockManagerMaster
15/02/05 19:13:22 INFO DiskBlockManager: Created local directory at 
/var/folders/8r/0ty24ys52kvdvx8r6nz2cdc0gn/T/spark-local-20150205191322-7e22
15/02/05 19:13:22 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/02/05 19:13:22 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/02/05 19:13:23 INFO HttpFileServer: HTTP File server directory is 
/var/folders/8r/0ty24ys52kvdvx8r6nz2cdc0gn/T/spark-8400830a-a7fc-4909-ae37-ee4b48e3ff88
15/02/05 19:13:23 INFO HttpServer: Starting HTTP Server
15/02/05 19:13:23 INFO Utils: Successfully started service 'HTTP file server' 
on port 61044.
15/02/05 19:13:23 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
15/02/05 19:13:23 INFO Utils: Successfully started service 'SparkUI' on port 
4041.
15/02/05 19:13:23 INFO SparkUI: Started SparkUI at http://192.168.1.31:4041
15/02/05 19:13:23 INFO Executor: Using REPL class URI: http://192.168.1.31:61040
15/02/05 19:13:23 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkDriver@192.168.1.31:61043/user/HeartbeatReceiver
15/02/05 19:13:23 INFO NettyBlockTransferService: Server created on 61046
15/02/05 19:13:23 INFO BlockManagerMaster: Trying to register BlockManager
15/02/05 19:13:23 INFO BlockManagerMasterActor: Registering block manager 
localhost:61046 with 265.4 MB RAM, BlockManagerId(, localhost, 61046)
15/02/05 19:13:23 INFO BlockManagerMaster: Registered BlockManager
15/02/05 19:13:23 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> val source = sc.textFile("/tmp/test")
15/02/05 19:13:27 INFO MemoryStore: ensureFreeSpace(163705) called with 
curMem=0, maxMem=278302556
15/02/05 19:13:27 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 159.9 KB, free 265.3 MB)
15/02/05 19:13:27 INFO MemoryStore: ensureFreeSpace(22736) called with 
curMem=163705, maxMem=278302556
15/02/05 19:13:27 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 22.2 KB, free 265.2 MB)
15/02/05 19:13:27 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
localhost:61046 (size: 22.2 KB, free: 265.4 MB)
15/02/05 19:13:27 INFO BlockManagerMaster: Updated info of block 
broadcast_0_piece0
15/02/05 19:13:27 INFO SparkContext: Created broad

[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2015-02-05 Thread Philippe Girolami (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307580#comment-14307580
 ] 

Philippe Girolami commented on SPARK-1867:
--

Has anyone figured this out? I'm seeing this happen when running spark-shell 
off the master branch (at cd5da42), using the same example as [~ansonism]. It 
works fine in 1.2.0, downloaded from the website.
{code}
val source = sc.textFile("/tmp/testfile.txt")
source.saveAsTextFile("/tmp/test_spark_output")
{code}

I built master using
{code}
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver 
-Pbigtop-dist -DskipTests clean package install
{code}
on Mac OS X using Sun Java 7:
{quote}
java version "1.7.0_60"
Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
{quote}
