Broadcast variables: when should I use them?
Hello. I have a number of static Arrays and Maps in my Spark Streaming driver program. They are simple collections, initialized with integer values and strings directly in the code. There is no RDD/DStream involvement here. I do not expect them to contain more than 100 entries each. They are used in several subsequent parallel operations. The question is: should I convert them into broadcast variables? Thanks and regards. -Bob
Re: Broadcast variables: when should I use them?
Hi, yes, if they are not big, it's good practice to broadcast them so they are not serialized again each time they are captured in a closure. Paolo Sent from my Windows Phone From: frodo777 (roberto.vaquer...@bitmonlab.com) Sent: 26/01/2015 14:34 To: user@spark.apache.org Subject: Broadcast variables: when should I use them? Hello. I have a number of static Arrays and Maps in my Spark Streaming driver program. They are simple collections, initialized with integer values and strings directly in the code. There is no RDD/DStream involvement here. I do not expect them to contain more than 100 entries each. They are used in several subsequent parallel operations. The question is: should I convert them into broadcast variables? Thanks and regards. -Bob
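A minimal sketch of the pattern Paolo describes, assuming a SparkContext named sc and an input RDD named ids; the collection contents are placeholders, not taken from the thread:

    // Driver side: wrap the small static Map in a broadcast variable once.
    val lookupTable: Map[Int, String] = Map(1 -> "low", 2 -> "high")   // up to ~100 entries
    val lookupBc = sc.broadcast(lookupTable)

    // Worker side: reference the broadcast value inside closures instead of the raw Map,
    // so the data is shipped to each executor once rather than serialized with every task's closure.
    val labelled = ids.map(id => lookupBc.value.getOrElse(id, "unknown"))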
spark context not picking up default hadoop filesystem
Hi all, I am trying to create a Spark context programmatically, using org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the Hadoop config that is created during the process is not picking up core-site.xml, so it defaults back to the local file system. I have set HADOOP_CONF_DIR in spark-env.sh and also put core-site.xml in the conf folder. The whole thing works if it is executed through the Spark shell. Just wondering where Spark picks up the Hadoop config path from? Many thanks,
HW imbalance
Hi, is it possible to mix hosts with (significantly) different specs within a cluster (without wasting the extra resources)? For example, I have 10 nodes with 36GB RAM / 10 CPUs and am now trying to add 3 hosts with 128GB / 10 CPUs. Is there a way for the Spark executors to utilize the extra memory (my understanding is that all Spark executors must have the same memory)? Thanks, Antony.
Re: No AMI for Spark 1.2 using ec2 scripts
Thanks. Turns out this is a proxy problem somehow. Sorry to bother you. /Håkan On Mon Jan 26 2015 at 11:02:18 AM Franc Carter franc.car...@rozettatech.com wrote: AMI's are specific to an AWS region, so the ami-id of the spark AMI in us-west will be different if it exists. I can't remember where but I have a memory of seeing somewhere that the AMI was only in us-east cheers On Mon, Jan 26, 2015 at 8:47 PM, Håkan Jonsson haj...@gmail.com wrote: Thanks, I also use Spark 1.2 with prebuilt for Hadoop 2.4. I launch both 1.1 and 1.2 with the same command: ./spark-ec2 -k foo -i bar.pem launch mycluster By default this launches in us-east-1. I tried changing the the region using: -r us-west-1 but that had the same result: Could not resolve AMI at: https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm Looking up https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm in a browser results in the same AMI ID as yours. If I search for ami-7a320f3f AMI in AWS, I can't find any such image. I tried searching in all regions I could find in the AWS console. The AMI for 1.1 is spark.ami.pvm.v9 (ami-5bb18832). I can find that AMI in us-west-1. Strange. Not sure what to do. /Håkan On Mon Jan 26 2015 at 9:02:42 AM Charles Feduke charles.fed...@gmail.com wrote: I definitely have Spark 1.2 running within EC2 using the spark-ec2 scripts. I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later. What parameters are you using when you execute spark-ec2? I am launching in the us-west-1 region (ami-7a320f3f) which may explain things. On Mon Jan 26 2015 at 2:40:01 AM hajons haj...@gmail.com wrote: Hi, When I try to launch a standalone cluster on EC2 using the scripts in the ec2 directory for Spark 1.2, I get the following error: Could not resolve AMI at: https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm It seems there is not yet any AMI available on EC2. Any ideas when there will be one? This works without problems for version 1.1. Starting up a cluster using these scripts is so simple and straightforward, so I am really missing it on 1.2. /Håkan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-AMI-for-Spark-1-2-using-ec2-scripts-tp21362.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- *Franc Carter* | Systems Architect | Rozetta Technology franc.car...@rozettatech.com franc.car...@rozettatech.com| www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
[SQL] Self join with ArrayType columns problems
Using Spark 1.2.0, we are facing some weird behaviour when performing a self join on a table with an ArrayType field (potential bug?). I have set up a minimal non-working example here: https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f In a nutshell, if the ArrayType column used for the pivot is created manually in the StructType definition, everything works as expected. However, if the ArrayType pivot column is obtained by a SQL query (be it by using an array wrapper, or a collect_list operator, for instance), then the results are completely off. Could anyone have a look, as this really is a blocking issue? Thanks! Cheers, P.
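For readers without access to the gist, a rough sketch of the variant the poster says works (the ArrayType column declared up front, here via a case class); the table, column names and data are invented for illustration and are not the code from the gist:

    import org.apache.spark.sql.hive.HiveContext

    case class Record(id: String, tags: Seq[String])    // Seq[String] becomes ArrayType(StringType)

    val hc = new HiveContext(sc)
    import hc.createSchemaRDD                           // implicit RDD -> SchemaRDD conversion (Spark 1.2)

    val records = sc.parallelize(Seq(Record("a", Seq("x", "y")), Record("b", Seq("y", "z"))))
    records.registerTempTable("records")

    // A self join returning the ArrayType columns behaves as expected in this setup.
    hc.sql("SELECT r1.id, r1.tags, r2.tags FROM records r1 JOIN records r2 ON (r1.id = r2.id)").collect()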
Re: spark context not picking up default hadoop filesystem
Ah, I think when running locally you should give the full HDFS URL, like: val logs = sc.textFile("hdfs://akhldz:9000/sigmoid/logs") Thanks Best Regards On Mon, Jan 26, 2015 at 9:36 PM, Tamas Jambor jambo...@gmail.com wrote: thanks for the reply. I have tried to add SPARK_CLASSPATH, I got a warning that it was deprecated (didn't solve the problem), also tried to run with --driver-class-path, which did not work either. I am trying this locally. On Mon Jan 26 2015 at 15:04:03 Akhil Das ak...@sigmoidanalytics.com wrote: You can also try adding the core-site.xml in the SPARK_CLASSPATH, btw are you running the application locally? or in standalone mode? Thanks Best Regards On Mon, Jan 26, 2015 at 7:37 PM, jamborta jambo...@gmail.com wrote: hi all, I am trying to create a spark context programmatically, using org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the hadoop config that is created during the process is not picking up core-site.xml, so it defaults back to the local file-system. I have set HADOOP_CONF_DIR in spark-env.sh, also core-site.xml in the conf folder. The whole thing works if it is executed through spark shell. Just wondering where spark is picking up the hadoop config path from? many thanks,
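A small sketch of the two usual options when creating the context programmatically (host, port and paths below are placeholders, and this assumes core-site.xml is otherwise not on the classpath):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("my-app"))

    // Option 1: point the Hadoop configuration at the cluster explicitly.
    sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://namenode-host:8020")
    val logs1 = sc.textFile("/sigmoid/logs")          // now resolved against HDFS

    // Option 2: skip the default filesystem entirely and use a fully qualified URL.
    val logs2 = sc.textFile("hdfs://namenode-host:8020/sigmoid/logs")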
Re: Issues when combining Spark and a third party java library
Its more like, Spark is not able to find the hadoop jars. Try setting the HADOOP_CONF_DIR and also make sure *-site.xml are available in the CLASSPATH/SPARK_CLASSPATH. Thanks Best Regards On Mon, Jan 26, 2015 at 7:28 PM, Staffan staffan.arvids...@gmail.com wrote: I'm using Maven and Eclipse to build my project. I'm letting Maven download all the things I need for running everything, which has worked fine up until now. I need to use the CDK library (https://github.com/egonw/cdk, http://sourceforge.net/projects/cdk/) and once I add the dependencies to my pom.xml Spark starts to complain (this is without calling any function or importing any new library into my code, only by introducing new dependencies to the pom.xml). Trying to set up a SparkContext give me the following errors in the log: [main] DEBUG org.apache.spark.rdd.HadoopRDD - SplitLocationInfo and other new Hadoop classes are unavailable. Using the older Hadoop location info code. java.lang.ClassNotFoundException: org.apache.hadoop.mapred.InputSplitWithLocationInfo at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:191) at org.apache.spark.rdd.HadoopRDD$SplitInfoReflections.init(HadoopRDD.scala:381) at org.apache.spark.rdd.HadoopRDD$.liftedTree1$1(HadoopRDD.scala:391) at org.apache.spark.rdd.HadoopRDD$.init(HadoopRDD.scala:390) at org.apache.spark.rdd.HadoopRDD$.clinit(HadoopRDD.scala) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:159) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) at org.apache.spark.rdd.RDD.foreach(RDD.scala:765) later in the log: [Executor task launch worker-0] DEBUG org.apache.spark.deploy.SparkHadoopUtil - Couldn't find method for retrieving thread-level FileSystem input data java.lang.NoSuchMethodException: org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics() at java.lang.Class.getDeclaredMethod(Class.java:2009) at org.apache.spark.util.Utils$.invoke(Utils.scala:1733) at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178) at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178) at org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:138) at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:220) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) There has also been issues related to HADOOP_HOME not being set etc., but which seems to be intermittent and
Re: Can Spark benefit from Hive-like partitions?
You can create a partitioned hive table using Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables On Mon, Jan 26, 2015 at 5:40 AM, Danny Yates da...@codeaholics.org wrote: Hi, I've got a bunch of data stored in S3 under directories like this: s3n://blah/y=2015/m=01/d=25/lots-of-files.csv In Hive, if I issue a query WHERE y=2015 AND m=01, I get the benefit that it only scans the necessary directories for files to read. As far as I can tell from searching and reading the docs, the right way of loading this data into Spark is to use sc.textFile(s3n://blah/*/*/*/) 1) Is there any way in Spark to access y, m and d as fields? In Hive, you declare them in the schema, but you don't put them in the CSV files - their values are extracted from the path. 2) Is there any way to get Spark to use the y, m and d fields to minimise the files it transfers from S3? Thanks, Danny.
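If the HiveContext route is acceptable after all, a hedged sketch of what that could look like for the layout in the quoted question (table and column definitions are illustrative only):

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)

    // y/m/d are partition columns: their values come from the directory names, not the CSV files.
    hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS events (line STRING)
              PARTITIONED BY (y INT, m INT, d INT)
              LOCATION 's3n://blah/'""")

    // Register each partition directory (or use MSCK REPAIR TABLE where supported).
    hc.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (y=2015, m=1, d=25) LOCATION 's3n://blah/y=2015/m=01/d=25/'")

    // Filters on y/m/d should then prune to the matching directories.
    hc.sql("SELECT COUNT(*) FROM events WHERE y = 2015 AND m = 1").collect()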
Re: spark context not picking up default hadoop filesystem
thanks for the reply. I have tried to add SPARK_CLASSPATH, I got a warning that it was deprecated (didn't solve the problem), also tried to run with --driver-class-path, which did not work either. I am trying this locally. On Mon Jan 26 2015 at 15:04:03 Akhil Das ak...@sigmoidanalytics.com wrote: You can also trying adding the core-site.xml in the SPARK_CLASSPATH, btw are you running the application locally? or in standalone mode? Thanks Best Regards On Mon, Jan 26, 2015 at 7:37 PM, jamborta jambo...@gmail.com wrote: hi all, I am trying to create a spark context programmatically, using org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the hadoop config that is created during the process is not picking up core-site.xml, so it defaults back to the local file-system. I have set HADOOP_CONF_DIR in spark-env.sh, also core-site.xml in in the conf folder. The whole thing works if it is executed through spark shell. Just wondering where spark is picking up the hadoop config path from? many thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-context-not-picking-up-default-hadoop-filesystem-tp21368.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Lost task - connection closed
Here is the first error I get at the executors: 15/01/26 17:27:04 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[handle-message-executor-16,5,main] java.lang.StackOverflowError at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1840) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1533) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) If you have any pointers for me on how to debug this, that would be very useful. I tried running with both spark 1.2.0 and 1.1.1, getting the same error. 
Re: spark context not picking up default hadoop filesystem
You can also try adding core-site.xml to the SPARK_CLASSPATH. By the way, are you running the application locally or in standalone mode? Thanks Best Regards On Mon, Jan 26, 2015 at 7:37 PM, jamborta jambo...@gmail.com wrote: hi all, I am trying to create a spark context programmatically, using org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the hadoop config that is created during the process is not picking up core-site.xml, so it defaults back to the local file-system. I have set HADOOP_CONF_DIR in spark-env.sh, also core-site.xml in the conf folder. The whole thing works if it is executed through spark shell. Just wondering where spark is picking up the hadoop config path from? many thanks,
Re: HW imbalance
should have said I am running as yarn-client. all I can see is specifying the generic executor memory that is then to be used in all containers. On Monday, 26 January 2015, 16:48, Charles Feduke charles.fed...@gmail.com wrote: You should look at using Mesos. This should abstract away the individual hosts into a pool of resources and make the different physical specifications manageable. I haven't tried configuring Spark Standalone mode to have different specs on different machines but based on spark-env.sh.template: # - SPARK_WORKER_CORES, to set the number of cores to use on this machine# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. -Dx=y) it looks like you should be able to mix. (Its not clear to me whether SPARK_WORKER_MEMORY is uniform across the cluster or for the machine where the config file resides.) On Mon Jan 26 2015 at 8:07:51 AM Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, is it possible to mix hosts with (significantly) different specs within a cluster (without wasting the extra resources)? for example having 10 nodes with 36GB RAM/10CPUs now trying to add 3 hosts with 128GB/10CPUs - is there a way to utilize the extra memory by spark executors (as my understanding is all spark executors must have same memory). thanks,Antony.
Re: cannot run spark-shell interactively against cluster from remote host - confusing memory warnings
When you say remote cluster you need to make sure a few things like: - No firewall/network is blocking any connection (Simply ping from localmachine to remote ip and vice versa) - Make sure all ports (unless you specify them manually) are open. You can also refer this discussion, http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-td16989.html Hope it helps. Thanks Best Regards On Sun, Jan 25, 2015 at 2:40 AM, Joseph Lust jl...@mc10inc.com wrote: I’ve setup a Spark cluster in the last few weeks and everything is working, but * I cannot run spark-shell interactively against the cluster from a remote host* - Deploy .jar to cluster from remote (laptop) spark-submit and have it run – Check - Run .jar on spark-shell locally – Check - Run same .jar on spark-shell on master server – Check - Run spark-shell interactively against cluster on master server – Check - Run spark-shell interactively from remote (laptop) against cluster – *FAIL* It seems other people have faced this same issue: http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-working-local-but-not-remote-td19727.html I’m getting the same warnings about memory, despite plenty of memory being available for the job to run (see above working cases) WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory” Some have suggested it has to do with conflicts of Jars on the class path and that Spark is providing spurious memory error messages while the problem is really class path conflicts. http://apache-spark-user-list.1001560.n3.nabble.com/WARN-ClusterScheduler-Initial-job-has-not-accepted-any-resources-check-your-cluster-UI-to-ensure-thay-td374.html#a396 Details: - Cluster: 1 master, 3 workers on 4GB/4 core Ubuntu 14.04 LTS - Local (aka remote laptop) MacBook Pro 10.10.1 - All running HotSpot Java (build 1.8.0_31-b13 and uild 1.8.0_25-b17) - All running spark-1.2.0-bin-hadoop2.4 - Using Standalone cluster manager Cluster UI: * Even when I clamp down to the most restrictive amounts, 1 core, 1 executor, 128m (of 3G available), it still says I don’t have the resources: Start Console example $ spark-shell --executor-memory 128m --total-executor-cores 1 --driver-cores 1 --master spark://:7077 15/01/24 15:57:29 INFO SparkILoop: Created spark context.. Spark context available as sc. scala val rdd = sc.parallelize(1 to 1000); rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at console:12 scala rdd.count 15/01/24 15:58:20 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0 15/01/24 15:58:20 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:838 15/01/24 15:58:20 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (ParallelCollectionRDD[0] at parallelize at console:12) 15/01/24 15:58:20 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 15/01/24 15:58:35 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory End console example So, can anyone tell me if remote interactive spark-shell on a Standalone cluster even works? Thanks for your help. Cluster UI below showing job is running on cluster, is using a driver app and worker, and that there are plenty of cores and GB of memory free. Sincerely, Joe Lust
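One additional thing worth checking, offered as a guess rather than a confirmed diagnosis: in standalone mode the executors must be able to open connections back to the driver, so a laptop behind NAT or a local firewall can produce exactly this "Initial job has not accepted any resources" symptom even when memory and cores are plentiful. Pinning the driver endpoint to an address and port reachable from the workers, for example with something like ./bin/spark-shell --master spark://master:7077 --conf spark.driver.host=<laptop-ip-reachable-from-cluster> --conf spark.driver.port=51000 (and opening that port), is one way to test this.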
Error when cache partitioned Parquet table
Hi all, I meet below error when I cache a partitioned Parquet table. It seems that, Spark is trying to extract the partitioned key in the Parquet file, so it is not found. But other query could run successfully, even request the partitioned key. Is it a bug in SparkSQL? Is there any workaround for it? Thank you! java.util.NoSuchElementException: key not found: querydate at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:58) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$3.apply(ParquetTableOperations.scala:142) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$3.apply(ParquetTableOperations.scala:142) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:142) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:127) at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:247) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) -- 郑旭东 ZHENG, Xu-dong
Re: HW imbalance
You should look at using Mesos. This should abstract away the individual hosts into a pool of resources and make the different physical specifications manageable. I haven't tried configuring Spark Standalone mode to have different specs on different machines but based on spark-env.sh.template: # - SPARK_WORKER_CORES, to set the number of cores to use on this machine # - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g) # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. -Dx=y) it looks like you should be able to mix. (Its not clear to me whether SPARK_WORKER_MEMORY is uniform across the cluster or for the machine where the config file resides.) On Mon Jan 26 2015 at 8:07:51 AM Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, is it possible to mix hosts with (significantly) different specs within a cluster (without wasting the extra resources)? for example having 10 nodes with 36GB RAM/10CPUs now trying to add 3 hosts with 128GB/10CPUs - is there a way to utilize the extra memory by spark executors (as my understanding is all spark executors must have same memory). thanks, Antony.
Re: [SQL] Self join with ArrayType columns problems
It seems likely that there is some sort of bug related to the reuse of array objects that are returned by UDFs. Can you open a JIRA? I'll also note that the sql method on HiveContext does run HiveQL (configured by spark.sql.dialect) and the hql method has been deprecated since 1.1 (and will probably be removed in 1.3). The errors are probably because array and collect set are hive UDFs and thus not available in a SQLContext. On Mon, Jan 26, 2015 at 5:44 AM, Dean Wampler deanwamp...@gmail.com wrote: You are creating a HiveContext, then using the sql method instead of hql. Is that deliberate? The code doesn't work if you replace HiveContext with SQLContext. Lots of exceptions are thrown, but I don't have time to investigate now. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Mon, Jan 26, 2015 at 7:17 AM, Pierre B pierre.borckm...@realimpactanalytics.com wrote: Using Spark 1.2.0, we are facing some weird behaviour when performing self join on a table with some ArrayType field. (potential bug ?) I have set up a minimal non working example here: https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f In a nutshell, if the ArrayType column used for the pivot is created manually in the StructType definition, everything works as expected. However, if the ArrayType pivot column is obtained by a sql query (be it by using a array wrapper, or using a collect_list operator for instance), then results are completely off. Could anyone have a look as this really is a blocking issue. Thanks! Cheers P. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Self-join-with-ArrayType-columns-problems-tp21364.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Can Spark benefit from Hive-like partitions?
Hi, I've got a bunch of data stored in S3 under directories like this: s3n://blah/y=2015/m=01/d=25/lots-of-files.csv In Hive, if I issue a query WHERE y=2015 AND m=01, I get the benefit that it only scans the necessary directories for files to read. As far as I can tell from searching and reading the docs, the right way of loading this data into Spark is to use sc.textFile(s3n://blah/*/*/*/) 1) Is there any way in Spark to access y, m and d as fields? In Hive, you declare them in the schema, but you don't put them in the CSV files - their values are extracted from the path. 2) Is there any way to get Spark to use the y, m and d fields to minimise the files it transfers from S3? Thanks, Danny.
Re: [SQL] Self join with ArrayType columns problems
You are creating a HiveContext, then using the sql method instead of hql. Is that deliberate? The code doesn't work if you replace HiveContext with SQLContext. Lots of exceptions are thrown, but I don't have time to investigate now. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Mon, Jan 26, 2015 at 7:17 AM, Pierre B pierre.borckm...@realimpactanalytics.com wrote: Using Spark 1.2.0, we are facing some weird behaviour when performing self join on a table with some ArrayType field. (potential bug ?) I have set up a minimal non working example here: https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f In a nutshell, if the ArrayType column used for the pivot is created manually in the StructType definition, everything works as expected. However, if the ArrayType pivot column is obtained by a sql query (be it by using a array wrapper, or using a collect_list operator for instance), then results are completely off. Could anyone have a look as this really is a blocking issue. Thanks! Cheers P. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Self-join-with-ArrayType-columns-problems-tp21364.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Issues when combining Spark and a third party java library
I'm using Maven and Eclipse to build my project. I'm letting Maven download all the things I need for running everything, which has worked fine up until now. I need to use the CDK library (https://github.com/egonw/cdk, http://sourceforge.net/projects/cdk/) and once I add the dependencies to my pom.xml Spark starts to complain (this is without calling any function or importing any new library into my code, only by introducing new dependencies to the pom.xml). Trying to set up a SparkContext give me the following errors in the log: [main] DEBUG org.apache.spark.rdd.HadoopRDD - SplitLocationInfo and other new Hadoop classes are unavailable. Using the older Hadoop location info code. java.lang.ClassNotFoundException: org.apache.hadoop.mapred.InputSplitWithLocationInfo at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:191) at org.apache.spark.rdd.HadoopRDD$SplitInfoReflections.init(HadoopRDD.scala:381) at org.apache.spark.rdd.HadoopRDD$.liftedTree1$1(HadoopRDD.scala:391) at org.apache.spark.rdd.HadoopRDD$.init(HadoopRDD.scala:390) at org.apache.spark.rdd.HadoopRDD$.clinit(HadoopRDD.scala) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:159) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) at org.apache.spark.rdd.RDD.foreach(RDD.scala:765) later in the log: [Executor task launch worker-0] DEBUG org.apache.spark.deploy.SparkHadoopUtil - Couldn't find method for retrieving thread-level FileSystem input data java.lang.NoSuchMethodException: org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics() at java.lang.Class.getDeclaredMethod(Class.java:2009) at org.apache.spark.util.Utils$.invoke(Utils.scala:1733) at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178) at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178) at 
org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:138) at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:220) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) There has also been issues related to HADOOP_HOME not being set etc., but which seems to be intermittent and only occur sometimes. After testing different versions of both CDK and Spark, I've found out that the Spark version 0.9.1 and earlier DO NOT have this problem, so there is something in the newer versions of Spark that do not play well with others... However, I need the functionality in the later versions of Spark so this do not solve my problem. Anyone willing
Re: Can Spark benefit from Hive-like partitions?
Currently no if you don't want to use Spark SQL's HiveContext. But we're working on adding partitioning support to the external data sources API, with which you can create, for example, partitioned Parquet tables without using Hive. Cheng On 1/26/15 8:47 AM, Danny Yates wrote: Thanks Michael. I'm not actually using Hive at the moment - in fact, I'm trying to avoid it if I can. I'm just wondering whether Spark has anything similar I can leverage? Thanks - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SVD in pyspark ?
Hi Andreas, There unfortunately is not a Python API yet for distributed matrices or their operations. Here's the JIRA to follow to stay up-to-date on it: https://issues.apache.org/jira/browse/SPARK-3956 There are internal wrappers (used to create the Python API), but they are not really public APIs. The bigger challenge is creating/storing the distributed matrix in Python. Joseph On Sun, Jan 25, 2015 at 11:32 AM, Chip Senkbeil chip.senkb...@gmail.com wrote: Hi Andreas, With regard to the notebook interface, you can use the Spark Kernel ( https://github.com/ibm-et/spark-kernel) as the backend for an IPython 3.0 notebook. The kernel is designed to be the foundation for interactive applications connecting to Apache Spark and uses the IPython 5.0 message protocol - used by IPython 3.0 - to communicate. See the getting started section here: https://github.com/ibm-et/spark-kernel/wiki/Getting-Started-with-the-Spark-Kernel It discusses getting IPython connected to a Spark Kernel. If you have any more questions, feel free to ask! Signed, Chip Senkbeil IBM Emerging Technologies Software Engineer On Sun Jan 25 2015 at 1:12:32 PM Andreas Rhode m.a.rh...@gmail.com wrote: Is the distributed SVD functionality exposed to Python yet? Seems it's only available to scala or java, unless I am missing something, looking for a pyspark equivalent to org.apache.spark.mllib.linalg.SingularValueDecomposition In case it's not there yet, is there a way to make a wrapper to call from python into the corresponding java/scala code? The reason for using python instead of just directly scala is that I like to take advantage of the notebook interface for visualization. As a side, is there a inotebook like interface for the scala based REPL? Thanks Andreas -- View this message in context: http://apache-spark-user-list. 1001560.n3.nabble.com/SVD-in-pyspark-tp21356.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
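For reference, the Scala API that a future Python wrapper would need to mirror; a minimal sketch assuming a SparkContext named sc:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)))

    val mat = new RowMatrix(rows)

    // Top-k singular values/vectors: U is a distributed RowMatrix, s and V are local.
    val svd = mat.computeSVD(2, computeU = true)
    val s = svd.s      // singular values (local Vector)
    val V = svd.V      // right singular vectors (local Matrix)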
Re: Can Spark benefit from Hive-like partitions?
Good to hear there will be partitioning support. I've had some success loading partitioned data specified with Unix globbing format, i.e.: sc.textFile("s3://bucket/directory/dt=2014-11-{2[4-9],30}T00-00-00") would load dates 2014-11-24 through 2014-11-30. Not the most ideal solution, but it seems to work for loading data from a range. Best, Chris On Jan 26, 2015, at 10:55 AM, Cheng Lian lian.cs@gmail.com wrote: Currently no if you don't want to use Spark SQL's HiveContext. But we're working on adding partitioning support to the external data sources API, with which you can create, for example, partitioned Parquet tables without using Hive. Cheng On 1/26/15 8:47 AM, Danny Yates wrote: Thanks Michael. I'm not actually using Hive at the moment - in fact, I'm trying to avoid it if I can. I'm just wondering whether Spark has anything similar I can leverage? Thanks
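Until that partitioning support lands, one hedged workaround for question (1) in the original post is to build one RDD per partition directory, attach the y/m/d values taken from the path, and union them (paths and tuple layout are illustrative):

    // Enumerate the (y, m, d) combinations of interest, load each directory,
    // and tag every record with the partition values parsed from its path.
    val days = Seq((2015, 1, 24), (2015, 1, 25))
    val tagged = days.map { case (y, m, d) =>
      val path = f"s3n://blah/y=$y%04d/m=$m%02d/d=$d%02d/"
      sc.textFile(path).map(line => (y, m, d, line))
    }.reduce(_ union _)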
Spark (Streaming?) holding on to Mesos resources
Hi, We are observing with certain regularity that our Spark jobs, as Mesos framework, are hoarding resources and not releasing them, resulting in resource starvation to all jobs running on the Mesos cluster. For example: This is a job that has spark.cores.max = 4 and spark.executor.memory=3g
ID | Framework | Host | CPUs | Mem
…5050-16506-1146497 | FooStreaming | dnode-4.hdfs.private | 7 | 13.4 GB
…5050-16506-1146495 | FooStreaming | dnode-0.hdfs.private | 1 | 6.4 GB
…5050-16506-1146491 | FooStreaming | dnode-5.hdfs.private | 7 | 11.9 GB
…5050-16506-1146449 | FooStreaming | dnode-3.hdfs.private | 7 | 4.9 GB
…5050-16506-1146247 | FooStreaming | dnode-1.hdfs.private | 0.5 | 5.9 GB
…5050-16506-1146226 | FooStreaming | dnode-2.hdfs.private | 3 | 7.9 GB
…5050-16506-1144069 | FooStreaming | dnode-3.hdfs.private | 1 | 8.7 GB
…5050-16506-1133091 | FooStreaming | dnode-5.hdfs.private | 1 | 1.7 GB
…5050-16506-1133090 | FooStreaming | dnode-2.hdfs.private | 5 | 5.2 GB
…5050-16506-1133089 | FooStreaming | dnode-1.hdfs.private | 6.5 | 6.3 GB
…5050-16506-1133088 | FooStreaming | dnode-4.hdfs.private | 1 | 251 MB
…5050-16506-1133087 | FooStreaming | dnode-0.hdfs.private | 6.4 | 6.8 GB
The only way to release the resources is by manually finding the process in the cluster and killing it. The jobs are often streaming but also batch jobs show this behavior. We have more streaming jobs than batch, so stats are biased. Any ideas of what's up here? Hopefully some very bad ugly bug that has been fixed already and that will urge us to upgrade our infra? Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0 -kr, Gerard.
Re: HW imbalance
Hi Antony, Unfortunately, all executors for any single Spark application must have the same amount of memory. It's possibly to configure YARN with different amounts of memory for each host (using yarn.nodemanager.resource.memory-mb), so other apps might be able to take advantage of the extra memory. -Sandy On Mon, Jan 26, 2015 at 8:34 AM, Michael Segel msegel_had...@hotmail.com wrote: If you’re running YARN, then you should be able to mix and max where YARN is managing the resources available on the node. Having said that… it depends on which version of Hadoop/YARN. If you’re running Hortonworks and Ambari, then setting up multiple profiles may not be straight forward. (I haven’t seen the latest version of Ambari) So in theory, one profile would be for your smaller 36GB of ram, then one profile for your 128GB sized machines. Then as your request resources for your spark job, it should schedule the jobs based on the cluster’s available resources. (At least in theory. I haven’t tried this so YMMV) HTH -Mike On Jan 26, 2015, at 4:25 PM, Antony Mayi antonym...@yahoo.com.INVALID wrote: should have said I am running as yarn-client. all I can see is specifying the generic executor memory that is then to be used in all containers. On Monday, 26 January 2015, 16:48, Charles Feduke charles.fed...@gmail.com wrote: You should look at using Mesos. This should abstract away the individual hosts into a pool of resources and make the different physical specifications manageable. I haven't tried configuring Spark Standalone mode to have different specs on different machines but based on spark-env.sh.template: # - SPARK_WORKER_CORES, to set the number of cores to use on this machine # - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g) # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. -Dx=y) it looks like you should be able to mix. (Its not clear to me whether SPARK_WORKER_MEMORY is uniform across the cluster or for the machine where the config file resides.) On Mon Jan 26 2015 at 8:07:51 AM Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, is it possible to mix hosts with (significantly) different specs within a cluster (without wasting the extra resources)? for example having 10 nodes with 36GB RAM/10CPUs now trying to add 3 hosts with 128GB/10CPUs - is there a way to utilize the extra memory by spark executors (as my understanding is all spark executors must have same memory). thanks, Antony.
Re: Spark webUI - application details page
Where is the history server running? Is it running on the same node as the logs directory?
Re: Error when cache partitioned Parquet table
Hi Xu-dong, Thats probably because your table's partition path don't look like hdfs://somepath/key=value/*.parquet. Spark is trying to extract the partition key's value from the path while caching and hence the exception is being thrown since it can't find one. On Mon, Jan 26, 2015 at 10:45 AM, ZHENG, Xu-dong dong...@gmail.com wrote: Hi all, I meet below error when I cache a partitioned Parquet table. It seems that, Spark is trying to extract the partitioned key in the Parquet file, so it is not found. But other query could run successfully, even request the partitioned key. Is it a bug in SparkSQL? Is there any workaround for it? Thank you! java.util.NoSuchElementException: key not found: querydate at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:58) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$3.apply(ParquetTableOperations.scala:142) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$3.apply(ParquetTableOperations.scala:142) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:142) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:127) at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:247) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) -- 郑旭东 ZHENG, Xu-dong
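To make the point about the path layout concrete (paths invented for illustration), a partition directory that the scan can map back to the partition key would look like hdfs://namenode/warehouse/mytable/querydate=2015-01-26/part-r-00001.parquet, i.e. each partition directory is named key=value for the partition column (querydate here), so the value can be parsed back out of the path when the cached scan is built.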
Re: Can Spark benefit from Hive-like partitions?
Thanks Michael. I'm not actually using Hive at the moment - in fact, I'm trying to avoid it if I can. I'm just wondering whether Spark has anything similar I can leverage? Thanks
Re: Lost task - connection closed
It looks like something weird is going on with your object serialization, perhaps a funny form of self-reference which is not detected by ObjectOutputStream's typical loop avoidance. That, or you have some data structure like a linked list with a parent pointer and you have many thousand elements. Assuming the stack trace is coming from an executor, it is probably a problem with the objects you're sending back as results, so I would carefully examine these and maybe try serializing some using ObjectOutputStream manually. If your program looks like foo.map { row = doComplexOperation(row) }.take(10) you can also try changing it to foo.map { row = doComplexOperation(row); 1 }.take(10) to avoid serializing the result of that complex operation, which should help narrow down where exactly the problematic objects are coming from. On Mon, Jan 26, 2015 at 8:31 AM, octavian.ganea octavian.ga...@inf.ethz.ch wrote: Here is the first error I get at the executors: 15/01/26 17:27:04 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[handle-message-executor-16,5,main] java.lang.StackOverflowError at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1840) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1533) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) If you have any pointers for me on how to debug this, that would be very useful. I tried running with both spark 1.2.0 and 1.1.1, getting the same error. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Lost-task-connection-closed-tp21361p21371.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
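A small way to try the manual ObjectOutputStream check Aaron suggests, outside of Spark (suspectObject is a stand-in for one element of whatever your closures return):

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    // Serializing one sample result by hand: a StackOverflowError here confirms the problem
    // is the object graph itself (deep recursion or unexpected self-references), not Spark.
    val bytes = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bytes)
    oos.writeObject(suspectObject)
    oos.close()
    println(s"serialized size: ${bytes.size()} bytes")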
large data set to get rid of exceeds Integer.MAX_VALUE error
Hi, This seems to be a known issue (see here: http://apache-spark-user-list.1001560.n3.nabble.com/ALS-failure-with-size-gt-Integer-MAX-VALUE-td19982.html) The data set is about 1.5 T bytes. There are 14 region servers. I am not sure how many regions there are for this data set. But very likely each region will have much more than 2g data. In this case, repartition seems also a very expensive action (I would guess), if possible in my cluster at all. Could any one give some suggestions to make this job done? Thanks! platform: spark 1.2.0, cdh5.3.0. The error is like, py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 34, node007): java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517) at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:307) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:124) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:97) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:91) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:156) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:93) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at
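For illustration, a hedged sketch of the repartitioning workaround discussed in the linked thread: split the data into enough partitions that no single block handed to the block manager approaches 2 GB. It is written in Scala (the same repartition call exists on PySpark RDDs); the RDD name, partition count, and JobConf are placeholders, not values from the original job.

    // Hedged sketch: `rdd` stands for the pair RDD being saved; the idea is simply
    // to raise the partition count so each partition stays well under 2 GB.
    // With ~1.5 TB of data, a few thousand partitions is a reasonable starting point.
    val repartitioned = rdd.repartition(4000)   // placeholder count; tune for the cluster
    repartitioned.saveAsHadoopDataset(jobConf)  // jobConf: the same JobConf used before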
Spark (Streaming?) holding on to Mesos Resources
(looks like the list didn't like a HTML table on the previous email. My excuses for any duplicates) Hi, We are observing with certain regularity that our Spark jobs, as Mesos framework, are hoarding resources and not releasing them, resulting in resource starvation to all jobs running on the Mesos cluster. For example: This is a job that has spark.cores.max = 4 and spark.executor.memory=3g

| ID                  | Framework    | Host                 | CPUs | Mem     |
| …5050-16506-1146497 | FooStreaming | dnode-4.hdfs.private | 7    | 13.4 GB |
| …5050-16506-1146495 | FooStreaming | dnode-0.hdfs.private | 1    | 6.4 GB  |
| …5050-16506-1146491 | FooStreaming | dnode-5.hdfs.private | 7    | 11.9 GB |
| …5050-16506-1146449 | FooStreaming | dnode-3.hdfs.private | 7    | 4.9 GB  |
| …5050-16506-1146247 | FooStreaming | dnode-1.hdfs.private | 0.5  | 5.9 GB  |
| …5050-16506-1146226 | FooStreaming | dnode-2.hdfs.private | 3    | 7.9 GB  |
| …5050-16506-1144069 | FooStreaming | dnode-3.hdfs.private | 1    | 8.7 GB  |
| …5050-16506-1133091 | FooStreaming | dnode-5.hdfs.private | 1    | 1.7 GB  |
| …5050-16506-1133090 | FooStreaming | dnode-2.hdfs.private | 5    | 5.2 GB  |
| …5050-16506-1133089 | FooStreaming | dnode-1.hdfs.private | 6.5  | 6.3 GB  |
| …5050-16506-1133088 | FooStreaming | dnode-4.hdfs.private | 1    | 251 MB  |
| …5050-16506-1133087 | FooStreaming | dnode-0.hdfs.private | 6.4  | 6.8 GB  |

The only way to release the resources is by manually finding the process in the cluster and killing it. The jobs are often streaming but also batch jobs show this behavior. We have more streaming jobs than batch, so stats are biased. Any ideas of what's up here? Hopefully some very bad ugly bug that has been fixed already and that will urge us to upgrade our infra? Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0 -kr, Gerard.
Spark and S3 server side encryption
We are trying to create a Spark job that writes out a file to S3 that leverage S3's server side encryption for sensitive data. Typically this is accomplished by setting the appropriate header on the put request, but it isn't clear whether this capability is exposed in the Spark/Hadoop APIs. Does anyone have any suggestions? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-S3-server-side-encryption-tp21377.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
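One hedged possibility, not confirmed in this thread: newer Hadoop s3a connectors expose server-side encryption as a filesystem property rather than a per-request header, so it can be set on the Hadoop configuration Spark uses. A minimal sketch, assuming a Hadoop build whose s3a connector supports the property; the bucket and path are placeholders:

    // Ask the S3A connector to send the SSE-S3 (AES256) header on every write.
    sc.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")
    rdd.saveAsTextFile("s3a://my-bucket/encrypted-output")   // placeholder bucket/path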
SaveAsTextFile to S3 bucket
Does anyone know if I can save an RDD as a text file to a pre-created directory in an S3 bucket? I have a directory created in the S3 bucket: //nexgen-software/dev When I tried to save an RDD as a text file in this directory: rdd.saveAsTextFile(s3n://nexgen-software/dev/output); I got the following exception at runtime: Exception in thread main org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' - ResponseCode=403, ResponseMessage=Forbidden I have verified /dev has write permission. However, if I grant the bucket //nexgen-software write permission, I don't get the exception. But the output is not created under dev. Rather, a different /dev/output directory is created directly in the bucket (//nexgen-software). Is this how saveAsTextFile behaves in S3? Is there any way I can have the output created under a pre-defined directory? Thanks in advance.
Re: HDFS Namenode in safemode when I turn off my EC2 instance
Hello Sean and Akhil, I shut down the services on Cloudera Manager. I shut them down in the appropriate order and then stopped all services of CM. I then shut down my instances. I then turned my instances back on, but I am getting the same error. 1) I tried hadoop fs -safemode leave and it said -safemode is an unknown command, but it does recognize hadoop fs 2) I also noticed I can't ping my instances from my personal laptop and I can't ping google.com from my instances. However, I can still run my Kafka Zookeeper/server/console producer/consumer. I know this is the spark thread, but thought that might be relevant. Thank you for any suggestions! Best, Su On Thu, Jan 22, 2015 at 2:41 AM, Sean Owen so...@cloudera.com wrote: If you are using CDH, you would be shutting down services with Cloudera Manager. I believe you can do it manually using Linux 'services' if you do the steps correctly across your whole cluster. I'm not sure if the stock stop-all.sh script is supposed to work. Certainly, if you are using CM, by far the easiest is to start/stop all of these things in CM. On Wed, Jan 21, 2015 at 6:08 PM, Su She suhsheka...@gmail.com wrote: Hello Sean Akhil, I tried running the stop-all.sh script on my master and I got this message: localhost: Permission denied (publickey,gssapi-keyex,gssapi-with-mic). chown: changing ownership of `/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/logs': Operation not permitted no org.apache.spark.deploy.master.Master to stop I am running Spark (Yarn) via Cloudera Manager. I tried stopping it from Cloudera Manager first, but it looked like it was only stopping the history server, so I started Spark again and tried ./stop-all.sh and got the above message. Also, what is the command for shutting down storage or can I simply stop hdfs in Cloudera Manager? Thank you for the help! On Sat, Jan 17, 2015 at 12:58 PM, Su She suhsheka...@gmail.com wrote: Thanks Akhil and Sean for the responses. I will try shutting down spark, then storage and then the instances. Initially, when hdfs was in safe mode, I waited for 1 hour and the problem still persisted. I will try this new method. Thanks! On Sat, Jan 17, 2015 at 2:03 AM, Sean Owen so...@cloudera.com wrote: You would not want to turn off storage underneath Spark. Shut down Spark first, then storage, then shut down the instances. Reverse the order when restarting. HDFS will be in safe mode for a short time after being started before it becomes writeable. I would first check that it's not just that. Otherwise, find out why the cluster went into safe mode from the logs, fix it, and then leave safe mode. On Sat, Jan 17, 2015 at 9:03 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Safest way would be to first shutdown HDFS and then shutdown Spark (call stop-all.sh would do) and then shutdown the machines. You can execute the following command to disable safe mode: hadoop fs -safemode leave Thanks Best Regards On Sat, Jan 17, 2015 at 8:31 AM, Su She suhsheka...@gmail.com wrote: Hello Everyone, I am encountering trouble running Spark applications when I shut down my EC2 instances. Everything else seems to work except Spark. When I try running a simple Spark application, like sc.parallelize() I get the message that hdfs name node is in safemode. Has anyone else had this issue? Is there a proper protocol I should be following to turn off my spark nodes? Thank you!
Re: saving rdd to multiple files named by the key
This might be helpful: http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job On Tue Jan 27 2015 at 07:45:18 Sharon Rapoport sha...@plaid.com wrote: Hi, I have an rdd of [k,v] pairs. I want to save each [v] to a file named [k]. I got them by combining many [k,v] by [k]. I could then save to file by partitions, but that still doesn't allow me to choose the name, and leaves me stuck with foo/part-... Any tips? Thanks, Sharon
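For reference, a hedged sketch of the approach from the linked Stack Overflow answer: use the old Hadoop API's MultipleTextOutputFormat so each value lands in a file named after its key. Here rdd is assumed to be an RDD[(String, String)] and the output path is a placeholder.

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed before Spark 1.3)

    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      // keep only the value in the file contents
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
      // use the key itself as the file name
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String]
    }

    rdd.partitionBy(new HashPartitioner(rdd.partitions.size))   // co-locate each key's values
       .saveAsHadoopFile("/placeholder/output/path", classOf[String], classOf[String],
                         classOf[KeyBasedOutput])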
Mathematical functions in spark sql
Hello everyone! I try to execute select 2/3 and I get 0. Is there any way to cast between int and double, or something similar? Also, it would be nice to get a list of the functions supported by Spark SQL. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mathematical-functions-in-spark-sql-tp21383.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Mathematical functions in spark sql
Have you tried floor() or ceil() functions ? According to http://spark.apache.org/sql/, Spark SQL is compatible with Hive SQL. Cheers On Mon, Jan 26, 2015 at 8:29 PM, 1esha alexey.romanc...@gmail.com wrote: Hello everyone! I try execute select 2/3 and I get 0.. Is there any way to cast double to int or something similar? Also it will be cool to get list of functions supported by spark sql. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mathematical-functions-in-spark-sql-tp21383.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark on Yarn: java.lang.IllegalArgumentException: Invalid rule
All, I recently try to build Spark-1.2 on my enterprise server (which has Hadoop 2.3 with YARN). Here're the steps I followed for the build: $ mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package $ export SPARK_HOME=/path/to/spark/folder $ export HADOOP_CONF_DIR=/etc/hadoop/conf However, when I try to work with this installation either locally or on YARN, I get the following error: Exception in thread main java.lang.ExceptionInInitializerError at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784) at org.apache.spark.storage.BlockManager.init(BlockManager.scala:105) at org.apache.spark.storage.BlockManager.init(BlockManager.scala:180) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159) at org.apache.spark.SparkContext.init(SparkContext.scala:232) at water.MyDriver$.main(MyDriver.scala:19) at water.MyDriver.main(MyDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:360) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.spark.SparkException: Unable to load YARN support at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:199) at org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:194) at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala) ... 15 more Caused by: java.lang.IllegalArgumentException: Invalid rule: L RULE:[2:$1@$0](.*@XXXCOMPANY.COM)s/(.*)@XXXCOMPANY.COM/$1/L DEFAULT at org.apache.hadoop.security.authentication.util.KerberosName.parseRules(KerberosName.java:321) at org.apache.hadoop.security.authentication.util.KerberosName.setRules(KerberosName.java:386) at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:75) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:247) at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283) at org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:43) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.init(YarnSparkHadoopUtil.scala:45) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at java.lang.Class.newInstance(Class.java:374) at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:196) ... 17 more I noticed that when I unset HADOOP_CONF_DIR, I'm able to work in the local mode without any errors. I'm able to work with pre-installed Spark 1.0, locally and on yarn, without any issues. It looks like I may be missing a configuration step somewhere. Any thoughts on what may be causing this? NR -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-java-lang-IllegalArgumentException-Invalid-rule-tp21382.html Sent from the Apache Spark User List mailing list archive at Nabble.com. 
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SaveAsTextFile to S3 bucket
Your output folder specifies rdd.saveAsTextFile(s3n://nexgen-software/dev/output); So it will try to write to /dev/output which is as expected. If you create the directory /dev/output upfront in your bucket, and try to save it to that (empty) directory, what is the behaviour? On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin kevin.c...@neustar.biz wrote: Does anyone know if I can save a RDD as a text file to a pre-created directory in S3 bucket? I have a directory created in S3 bucket: //nexgen-software/dev When I tried to save a RDD as text file in this directory: rdd.saveAsTextFile(s3n://nexgen-software/dev/output); I got following exception at runtime: Exception in thread main org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' - ResponseCode=403, ResponseMessage=Forbidden I have verified /dev has write permission. However, if I grant the bucket //nexgen-software write permission, I don't get exception. But the output is not created under dev. Rather, a different /dev/output directory is created directory in the bucket (//nexgen-software). Is this how saveAsTextFile behalves in S3? Is there anyway I can have output created under a pre-defied directory. Thanks in advance.
saving rdd to multiple files named by the key
Hi, I have an rdd of [k,v] pairs. I want to save each [v] to a file named [k]. I got them by combining many [k,v] by [k]. I could then save to file by partitions, but that still doesn't allow me to choose the name, and leaves me stuck with foo/part-... Any tips? Thanks, Sharon
Re: spark 1.2 ec2 launch script hang
Try using an absolute path to the pem file On Jan 26, 2015, at 8:57 PM, ey-chih chow eyc...@hotmail.com wrote: Hi, I used the spark-ec2 script of spark 1.2 to launch a cluster. I have modified the script according to https://github.com/grzegorz-dubicki/spark/commit/5dd8458d2ab9753aae939b3bb33be953e2c13a70 But the script was still hung at the following message: Waiting for cluster to enter 'ssh-ready' state. Any additional thing I should do to make it succeed? Thanks. Ey-Chih Chow -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-ec2-launch-script-hang-tp21381.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: HDFS Namenode in safemode when I turn off my EC2 instance
Command would be: hadoop dfsadmin -safemode leave If you are not able to ping your instances, it may be because you are blocking all ICMP requests. I'm not quite sure why you are not able to ping google.com from your instances. Make sure the internal IP (ifconfig) is correct in the file /etc/hosts. And regarding Kafka, I'm assuming that you are running on a single instance and hence you are not having any issues (mostly, it binds to localhost in that case). On 27 Jan 2015 07:25, Su She suhsheka...@gmail.com wrote: Hello Sean and Akhil, I shut down the services on Cloudera Manager. I shut them down in the appropriate order and then stopped all services of CM. I then shut down my instances. I then turned my instances back on, but I am getting the same error. 1) I tried hadoop fs -safemode leave and it said -safemode is an unknown command, but it does recognize hadoop fs 2) I also noticed I can't ping my instances from my personal laptop and I can't ping google.com from my instances. However, I can still run my Kafka Zookeeper/server/console producer/consumer. I know this is the spark thread, but thought that might be relevant. Thank you for any suggestions! Best, Su On Thu, Jan 22, 2015 at 2:41 AM, Sean Owen so...@cloudera.com wrote: If you are using CDH, you would be shutting down services with Cloudera Manager. I believe you can do it manually using Linux 'services' if you do the steps correctly across your whole cluster. I'm not sure if the stock stop-all.sh script is supposed to work. Certainly, if you are using CM, by far the easiest is to start/stop all of these things in CM. On Wed, Jan 21, 2015 at 6:08 PM, Su She suhsheka...@gmail.com wrote: Hello Sean Akhil, I tried running the stop-all.sh script on my master and I got this message: localhost: Permission denied (publickey,gssapi-keyex,gssapi-with-mic). chown: changing ownership of `/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/logs': Operation not permitted no org.apache.spark.deploy.master.Master to stop I am running Spark (Yarn) via Cloudera Manager. I tried stopping it from Cloudera Manager first, but it looked like it was only stopping the history server, so I started Spark again and tried ./stop-all.sh and got the above message. Also, what is the command for shutting down storage or can I simply stop hdfs in Cloudera Manager? Thank you for the help! On Sat, Jan 17, 2015 at 12:58 PM, Su She suhsheka...@gmail.com wrote: Thanks Akhil and Sean for the responses. I will try shutting down spark, then storage and then the instances. Initially, when hdfs was in safe mode, I waited for 1 hour and the problem still persisted. I will try this new method. Thanks! On Sat, Jan 17, 2015 at 2:03 AM, Sean Owen so...@cloudera.com wrote: You would not want to turn off storage underneath Spark. Shut down Spark first, then storage, then shut down the instances. Reverse the order when restarting. HDFS will be in safe mode for a short time after being started before it becomes writeable. I would first check that it's not just that. Otherwise, find out why the cluster went into safe mode from the logs, fix it, and then leave safe mode. On Sat, Jan 17, 2015 at 9:03 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Safest way would be to first shutdown HDFS and then shutdown Spark (call stop-all.sh would do) and then shutdown the machines.
You can execute the following command to disable safe mode: hadoop fs -safemode leave Thanks Best Regards On Sat, Jan 17, 2015 at 8:31 AM, Su She suhsheka...@gmail.com wrote: Hello Everyone, I am encountering trouble running Spark applications when I shut down my EC2 instances. Everything else seems to work except Spark. When I try running a simple Spark application, like sc.parallelize() I get the message that hdfs name node is in safemode. Has anyone else had this issue? Is there a proper protocol I should be following to turn off my spark nodes? Thank you!
Re: SaveAsTextFile to S3 bucket
When spark saves rdd to a text file, the directory must not exist upfront. It will create a directory and write the data to part- under that directory. In my use case, I create a directory dev in the bucket ://nexgen-software/dev . I expect it creates output direct under dev and a part- under output. But it gave me exception as I only give write permission to dev not the bucket. If I open up write permission to bucket, it worked. But it did not create output directory under dev, it rather creates another dev/output directory under bucket. I just want to know if it is possible to have output directory created under dev directory I created upfront. From: Nick Pentreath nick.pentre...@gmail.commailto:nick.pentre...@gmail.com Date: Monday, January 26, 2015 9:15 PM To: user@spark.apache.orgmailto:user@spark.apache.org user@spark.apache.orgmailto:user@spark.apache.org Subject: Re: SaveAsTextFile to S3 bucket Your output folder specifies rdd.saveAsTextFile(s3n://nexgen-software/dev/output); So it will try to write to /dev/output which is as expected. If you create the directory /dev/output upfront in your bucket, and try to save it to that (empty) directory, what is the behaviour? On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin kevin.c...@neustar.bizmailto:kevin.c...@neustar.biz wrote: Does anyone know if I can save a RDD as a text file to a pre-created directory in S3 bucket? I have a directory created in S3 bucket: //nexgen-software/dev When I tried to save a RDD as text file in this directory: rdd.saveAsTextFile(s3n://nexgen-software/dev/output); I got following exception at runtime: Exception in thread main org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' - ResponseCode=403, ResponseMessage=Forbidden I have verified /dev has write permission. However, if I grant the bucket //nexgen-software write permission, I don't get exception. But the output is not created under dev. Rather, a different /dev/output directory is created directory in the bucket (//nexgen-software). Is this how saveAsTextFile behalves in S3? Is there anyway I can have output created under a pre-defied directory. Thanks in advance.
spark sqlContext udaf
Hi, can anyone show me some examples of using a UDAF with the Spark sqlContext?
Re: Mathematical functions in spark sql
I have tried select ceil(2/3), but got key not found: floor On Tue, Jan 27, 2015 at 11:05 AM, Ted Yu yuzhih...@gmail.com wrote: Have you tried floor() or ceil() functions ? According to http://spark.apache.org/sql/, Spark SQL is compatible with Hive SQL. Cheers On Mon, Jan 26, 2015 at 8:29 PM, 1esha alexey.romanc...@gmail.com wrote: Hello everyone! I try execute select 2/3 and I get 0.. Is there any way to cast double to int or something similar? Also it will be cool to get list of functions supported by spark sql. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mathematical-functions-in-spark-sql-tp21383.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
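On the casting part of the question, a hedged sketch that assumes a HiveContext; the simple SQL parser in the plain SQLContext of 1.2 is more limited, which may be why floor() could not be resolved above:

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)
    hiveCtx.sql("SELECT CAST(2 AS DOUBLE) / 3").collect()   // 0.6666...
    hiveCtx.sql("SELECT floor(4.7), ceil(4.2)").collect()   // 4, 5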
spark 1.2 ec2 launch script hang
Hi, I used the spark-ec2 script of spark 1.2 to launch a cluster. I have modified the script according to https://github.com/grzegorz-dubicki/spark/commit/5dd8458d2ab9753aae939b3bb33be953e2c13a70 But the script was still hung at the following message: Waiting for cluster to enter 'ssh-ready' state. Any additional thing I should do to make it succeed? Thanks. Ey-Chih Chow -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-ec2-launch-script-hang-tp21381.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SaveAsTextFile to S3 bucket
By default, the files will be created under the path provided as the argument for saveAsTextFile. This argument is considered as a folder in the bucket and actual files are created in it with the naming convention part-n, where n is the number of output partition. On Mon, Jan 26, 2015 at 9:15 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Your output folder specifies rdd.saveAsTextFile(s3n://nexgen-software/dev/output); So it will try to write to /dev/output which is as expected. If you create the directory /dev/output upfront in your bucket, and try to save it to that (empty) directory, what is the behaviour? On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin kevin.c...@neustar.biz wrote: Does anyone know if I can save a RDD as a text file to a pre-created directory in S3 bucket? I have a directory created in S3 bucket: //nexgen-software/dev When I tried to save a RDD as text file in this directory: rdd.saveAsTextFile(s3n://nexgen-software/dev/output); I got following exception at runtime: Exception in thread main org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' - ResponseCode=403, ResponseMessage=Forbidden I have verified /dev has write permission. However, if I grant the bucket //nexgen-software write permission, I don't get exception. But the output is not created under dev. Rather, a different /dev/output directory is created directory in the bucket (//nexgen-software). Is this how saveAsTextFile behalves in S3? Is there anyway I can have output created under a pre-defied directory. Thanks in advance.
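As an illustration of the part-n naming described above, a small sketch of the keys one would expect for the call in question (whether a _SUCCESS marker also appears depends on the output committer):

    rdd.saveAsTextFile("s3n://nexgen-software/dev/output")
    // expected objects under the bucket:
    //   dev/output/part-00000
    //   dev/output/part-00001
    //   ...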
Re: spark 1.2 - Writing parque fails for timestamp with Unsupported datatype TimestampType
Awesome ! That would be great !! On Mon, Jan 26, 2015 at 3:18 PM, Michael Armbrust mich...@databricks.com wrote: I'm aiming for 1.3. On Mon, Jan 26, 2015 at 3:05 PM, Manoj Samel manojsamelt...@gmail.com wrote: Thanks Michael. I am sure there have been many requests for this support. Any release targeted for this? Thanks, On Sat, Jan 24, 2015 at 11:47 AM, Michael Armbrust mich...@databricks.com wrote: Those annotations actually don't work because the timestamp is SQL has optional nano-second precision. However, there is a PR to add support using parquets INT96 type: https://github.com/apache/spark/pull/3820 On Fri, Jan 23, 2015 at 12:08 PM, Manoj Samel manojsamelt...@gmail.com wrote: Looking further at the trace and ParquetTypes.scala, it seems there is no support for Timestamp and Date in fromPrimitiveDataType(ctype: DataType): Option[ParquetTypeInfo]. Since Parquet supports these type with some decoration over Int ( https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md), any reason why Date / Timestamp are not supported right now ? Thanks, Manoj On Fri, Jan 23, 2015 at 11:40 AM, Manoj Samel manojsamelt...@gmail.com wrote: Using Spark 1.2 Read a CSV file, apply schema to convert to SchemaRDD and then schemaRdd.saveAsParquetFile If the schema includes Timestamptype, it gives following trace when doing the save Exception in thread main java.lang.RuntimeException: Unsupported datatype TimestampType at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply( ParquetTypes.scala:343) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply( ParquetTypes.scala:292) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType( ParquetTypes.scala:291) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply( ParquetTypes.scala:363) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply( ParquetTypes.scala:362) at scala.collection.TraversableLike$$anonfun$map$1.apply( TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply( TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach( ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map( TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes( ParquetTypes.scala:361) at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData( ParquetTypes.scala:407) at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty( ParquetRelation.scala:166) at org.apache.spark.sql.parquet.ParquetRelation$.create( ParquetRelation.scala:145) at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply( SparkStrategies.scala:204) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply( QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply( QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply( QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute( SQLContext.scala:418) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan( SQLContext.scala:416) at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute( SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan( SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute( SQLContext.scala:425) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd( SQLContext.scala:425) at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile( SchemaRDDLike.scala:76) at org.apache.spark.sql.SchemaRDD.saveAsParquetFile( SchemaRDD.scala:108) at bdrt.MyTest$.createParquetWithDate(MyTest.scala:88) at bdrt.MyTest$delayedInit$body.apply(MyTest.scala:54) at scala.Function0$class.apply$mcV$sp(Function0.scala:40) at scala.runtime.AbstractFunction0.apply$mcV$sp( AbstractFunction0.scala:12) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.generic.TraversableForwarder$class.foreach( TraversableForwarder.scala:32) at scala.App$class.main(App.scala:71) at bdrt.MyTest$.main(MyTest.scala:10)
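Until that lands, one hedged workaround, not taken from this thread: store the timestamp as an epoch-millisecond Long before applying the schema, since LongType writes to Parquet without trouble in 1.2. The row layout, field names, and path below are assumptions:

    import org.apache.spark.sql._

    // rowRDD: RDD[Row] assumed, with a java.sql.Timestamp in column 2
    val withLongTs = rowRDD.map { r =>
      Row(r(0), r(1), r(2).asInstanceOf[java.sql.Timestamp].getTime)
    }
    val schema = StructType(Seq(
      StructField("a", StringType, true),          // placeholder fields
      StructField("b", IntegerType, true),
      StructField("ts_millis", LongType, true)))
    sqlContext.applySchema(withLongTs, schema).saveAsParquetFile("/placeholder/path")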
Re: Spark 1.2 – How to change Default (Random) port ….
Thanks. But after setting spark.shuffle.blockTransferService to nio application fails with Akka Client disassociation. 15/01/27 13:38:11 ERROR TaskSchedulerImpl: Lost executor 3 on wynchcs218.wyn.cnw.co.nz: remote Akka client disassociated 15/01/27 13:38:11 INFO TaskSetManager: Re-queueing tasks for 3 from TaskSet 0.0 15/01/27 13:38:11 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 7, wynchcs218.wyn.cnw.co.nz): ExecutorLostFailure (executor lost) 15/01/27 13:38:11 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job 15/01/27 13:38:11 WARN TaskSetManager: Lost task 1.3 in stage 0.0 (TID 6, wynchcs218.wyn.cnw.co.nz): ExecutorLostFailure (executor lost) 15/01/27 13:38:11 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 15/01/27 13:38:11 INFO TaskSchedulerImpl: Cancelling stage 0 15/01/27 13:38:11 INFO DAGScheduler: Failed to run count at RowMatrix.scala:71 Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, wynchcs218.wyn.cnw.co.nz): ExecutorLostFailure (executor lost) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 15/01/27 13:38:11 INFO DAGScheduler: Executor lost: 3 (epoch 3) 15/01/27 13:38:11 INFO BlockManagerMasterActor: Trying to remove executor 3 from BlockManagerMaster. 15/01/27 13:38:11 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor On Mon, Jan 26, 2015 at 6:34 PM, Aaron Davidson ilike...@gmail.com wrote: This was a regression caused by Netty Block Transfer Service. The fix for this just barely missed the 1.2 release, and you can see the associated JIRA here: https://issues.apache.org/jira/browse/SPARK-4837 Current master has the fix, and the Spark 1.2.1 release will have it included. 
If you don't want to rebuild from master or wait, then you can turn it off by setting spark.shuffle.blockTransferService to nio. On Sun, Jan 25, 2015 at 6:28 PM, Shailesh Birari sbirar...@gmail.com wrote: Can anyone please let me know ? I don't want to open all ports on n/w. So, am interested in the property by which this new port I can configure. Shailesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-How-to-change-Default-Random-port-tp21306p21360.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
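Coming back to the original question in this thread, a hedged sketch of the properties documented for pinning Spark's otherwise random listening ports in 1.x; the port numbers are placeholders and the exact property set can differ between versions:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.port",          "7001")
      .set("spark.fileserver.port",      "7002")
      .set("spark.broadcast.port",       "7003")
      .set("spark.replClassServer.port", "7004")   // only used by spark-shell
      .set("spark.blockManager.port",    "7005")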
Re: No AMI for Spark 1.2 using ec2 scripts
AMI's are specific to an AWS region, so the ami-id of the spark AMI in us-west will be different if it exists. I can't remember where but I have a memory of seeing somewhere that the AMI was only in us-east cheers On Mon, Jan 26, 2015 at 8:47 PM, Håkan Jonsson haj...@gmail.com wrote: Thanks, I also use Spark 1.2 with prebuilt for Hadoop 2.4. I launch both 1.1 and 1.2 with the same command: ./spark-ec2 -k foo -i bar.pem launch mycluster By default this launches in us-east-1. I tried changing the the region using: -r us-west-1 but that had the same result: Could not resolve AMI at: https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm Looking up https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm in a browser results in the same AMI ID as yours. If I search for ami-7a320f3f AMI in AWS, I can't find any such image. I tried searching in all regions I could find in the AWS console. The AMI for 1.1 is spark.ami.pvm.v9 (ami-5bb18832). I can find that AMI in us-west-1. Strange. Not sure what to do. /Håkan On Mon Jan 26 2015 at 9:02:42 AM Charles Feduke charles.fed...@gmail.com wrote: I definitely have Spark 1.2 running within EC2 using the spark-ec2 scripts. I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later. What parameters are you using when you execute spark-ec2? I am launching in the us-west-1 region (ami-7a320f3f) which may explain things. On Mon Jan 26 2015 at 2:40:01 AM hajons haj...@gmail.com wrote: Hi, When I try to launch a standalone cluster on EC2 using the scripts in the ec2 directory for Spark 1.2, I get the following error: Could not resolve AMI at: https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm It seems there is not yet any AMI available on EC2. Any ideas when there will be one? This works without problems for version 1.1. Starting up a cluster using these scripts is so simple and straightforward, so I am really missing it on 1.2. /Håkan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-AMI-for-Spark-1-2-using-ec2-scripts-tp21362.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- *Franc Carter* | Systems Architect | Rozetta Technology franc.car...@rozettatech.com franc.car...@rozettatech.com| www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA
Re: SparkSQL tasks spend too much time to finish.
Hi San, You need to provide more information to diagnose this problem, for example: 1. What kind of SQL did you execute? 2. If there is a group operation in this SQL, could you gather some statistics on how many unique group keys there are in this case? On 1/26/15 17:01, luohui20...@sina.com wrote: Hi there, When running a SQL query, I found abnormal time cost of tasks, as in the attached pic. It runs fast in the first few tasks, but extremely slow in later tasks, which take 100x more time than the early ones. However, they are dealing with the same size of data, 128 MB each. Standalone pseudo-distributed cluster, executor memory: 40g, driver memory: 5g. Any advice will be appreciated. Thanks & Best regards! 罗辉 San.Luo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
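Going back to point 2 above (rows per group key), a hedged sketch; the table and column names are placeholders, and it assumes the query's input is registered as a temp table and the SQL dialect in use supports GROUP BY/ORDER BY (a HiveContext does):

    // A handful of very large keys would explain a few tasks taking 100x longer than the rest.
    sqlContext.sql(
      """SELECT group_col, COUNT(*) AS cnt
        |FROM my_table
        |GROUP BY group_col
        |ORDER BY cnt DESC
        |LIMIT 20""".stripMargin).collect().foreach(println)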
Re: Eclipse on spark
I use this: http://scala-ide.org/ I also use Maven with this archetype: https://github.com/davidB/scala-archetype-simple. To be frank though, you should be fine using SBT. On Sat, Jan 24, 2015 at 6:33 PM, riginos samarasrigi...@gmail.com wrote: How to compile a Spark project in Scala IDE for Eclipse? I got many scala scripts and i no longer want to load them from scala-shell what can i do? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Eclipse-on-spark-tp21350.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: [GraphX] Integration with TinkerPop3/Gremlin
TinkerPop has become an Apache Incubator project and seems to have Spark in mind in their proposal https://wiki.apache.org/incubator/TinkerPopProposal. That's good news! I hope there will be nice collaborations between the communities. On Wed, Jan 7, 2015 at 11:31 AM, Nicolas Colson nicolas.col...@gmail.com wrote: Saw it. It's a great read, thanks a lot! On Wed, Jan 7, 2015 at 11:27 AM, Corey Nolet cjno...@gmail.com wrote: Looking a little closer at the first link i posted, it appears the bottom of the page does contain links to all the replies. On Wed, Jan 7, 2015 at 6:16 AM, Corey Nolet cjno...@gmail.com wrote: There was a fairly lengthy conversation about this on the dev list and i believe a bunch of work has been logged against it as well. [1] contains a link to the initial conversation but I'm having trouble tracking down the link to the whole discussion. Also see [2] for code. [1] https://www.mail-archive.com/dev@spark.apache.org/msg06231.html [2] https://github.com/kellrott/spark-gremlin On Wed, Jan 7, 2015 at 6:03 AM, Nicolas Colson nicolas.col...@gmail.com wrote: Hi Spark/GraphX community, I'm wondering if you have TinkerPop3/Gremlin on your radar? (github https://github.com/tinkerpop/tinkerpop3, doc http://www.tinkerpop.com/docs/3.0.0-SNAPSHOT) They've done an amazing work refactoring their stack recently and Gremlin is a very nice DSL to work with graphs. They even have a scala client https://github.com/mpollmeier/gremlin-scala. So far, they've used Hadoop for MapReduce tasks and I think GraphX could nicely dig in. Any view? Thanks, Nicolas
Re: No AMI for Spark 1.2 using ec2 scripts
Thanks, I also use Spark 1.2 with prebuilt for Hadoop 2.4. I launch both 1.1 and 1.2 with the same command: ./spark-ec2 -k foo -i bar.pem launch mycluster By default this launches in us-east-1. I tried changing the the region using: -r us-west-1 but that had the same result: Could not resolve AMI at: https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm Looking up https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-west-1/pvm in a browser results in the same AMI ID as yours. If I search for ami-7a320f3f AMI in AWS, I can't find any such image. I tried searching in all regions I could find in the AWS console. The AMI for 1.1 is spark.ami.pvm.v9 (ami-5bb18832). I can find that AMI in us-west-1. Strange. Not sure what to do. /Håkan On Mon Jan 26 2015 at 9:02:42 AM Charles Feduke charles.fed...@gmail.com wrote: I definitely have Spark 1.2 running within EC2 using the spark-ec2 scripts. I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later. What parameters are you using when you execute spark-ec2? I am launching in the us-west-1 region (ami-7a320f3f) which may explain things. On Mon Jan 26 2015 at 2:40:01 AM hajons haj...@gmail.com wrote: Hi, When I try to launch a standalone cluster on EC2 using the scripts in the ec2 directory for Spark 1.2, I get the following error: Could not resolve AMI at: https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm It seems there is not yet any AMI available on EC2. Any ideas when there will be one? This works without problems for version 1.1. Starting up a cluster using these scripts is so simple and straightforward, so I am really missing it on 1.2. /Håkan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-AMI-for-Spark-1-2-using-ec2-scripts-tp21362.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Worker never used by our Spark applications
Hello, we are running Spark 1.2.0 standalone on a cluster made up of 4 machines, each of them running one Worker and one of them also running the Master; they are all connected to the same HDFS instance. Until a few days ago, they were all configured with SPARK_WORKER_MEMORY = 18G and the jobs running on our cluster were making use of all of them. A few days ago though, we added a new machine to the cluster, set up one Worker on that machine, and reconfigured the machines as follows:

| machine  | SPARK_WORKER_MEMORY |
| #1       | 16G |
| #2       | 18G |
| #3       | 24G |
| #4       | 18G |
| #5 (new) | 36G |

Ever since we introduced this configuration change, our applications running on the cluster are not using the Worker running on machine #1 anymore, even though it is regularly registered to the cluster. I would be very grateful if anybody could explain how Spark chooses which workers to use and why that one is not used anymore. Regards, Federico Ragona - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: RE: Shuffle to HDFS
I have also thought that Hadoop mapper output result is saved on HDFS, at least if the job only has Mapper but doesn't have Reducer. If there is reducer, then the map output will be saved on local disk? From: Shao, Saisai Date: 2015-01-26 15:23 To: Larry Liu CC: u...@spark.incubator.apache.org Subject: RE: Shuffle to HDFS Hey Larry, I don’t think Hadoop will put shuffle output in HDFS, instead it’s behavior is the same as what Spark did, store mapper output (shuffle) data on local disks. You might misunderstood something J. Thanks Jerry From: Larry Liu [mailto:larryli...@gmail.com] Sent: Monday, January 26, 2015 3:03 PM To: Shao, Saisai Cc: u...@spark.incubator.apache.org Subject: Re: Shuffle to HDFS Hi,Jerry Thanks for your reply. The reason I have this question is that in Hadoop, mapper intermediate output (shuffle) will be stored in HDFS. I think the default location for spark is /tmp I think. Larry On Sun, Jan 25, 2015 at 9:44 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi Larry, I don’t think current Spark’s shuffle can support HDFS as a shuffle output. Anyway, is there any specific reason to spill shuffle data to HDFS or NFS, this will severely increase the shuffle time. Thanks Jerry From: Larry Liu [mailto:larryli...@gmail.com] Sent: Sunday, January 25, 2015 4:45 PM To: u...@spark.incubator.apache.org Subject: Shuffle to HDFS How to change shuffle output to HDFS or NFS?
Re: RE: Shuffle to HDFS
If there is no Reducer, there is no shuffle. The Mapper output goes to HDFS, yes. But the question here is about shuffle files, right? Those are written by the Mapper to local disk. Reducers load them from the Mappers over the network then. Shuffle files do not go to HDFS. On Mon, Jan 26, 2015 at 10:01 AM, bit1...@163.com bit1...@163.com wrote: I have also thought that Hadoop mapper output result is saved on HDFS, at least if the job only has Mapper but doesn't have Reducer. If there is reducer, then the map output will be saved on local disk? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
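To the original question in this thread: shuffle files always go to local directories, and the closest knob is spark.local.dir, which could in principle point at an NFS mount, with the performance caveat already raised. A minimal sketch; the path is a placeholder:

    import org.apache.spark.SparkConf

    // spark.local.dir accepts a comma-separated list of scratch directories;
    // fast local disks are the usual choice, an NFS mount would slow the shuffle down.
    val conf = new SparkConf().set("spark.local.dir", "/mnt/nfs/spark-scratch")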
Re: No AMI for Spark 1.2 using ec2 scripts
I definitely have Spark 1.2 running within EC2 using the spark-ec2 scripts. I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later. What parameters are you using when you execute spark-ec2? I am launching in the us-west-1 region (ami-7a320f3f) which may explain things. On Mon Jan 26 2015 at 2:40:01 AM hajons haj...@gmail.com wrote: Hi, When I try to launch a standalone cluster on EC2 using the scripts in the ec2 directory for Spark 1.2, I get the following error: Could not resolve AMI at: https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm It seems there is not yet any AMI available on EC2. Any ideas when there will be one? This works without problems for version 1.1. Starting up a cluster using these scripts is so simple and straightforward, so I am really missing it on 1.2. /Håkan -- View this message in context: http://apache-spark-user-list. 1001560.n3.nabble.com/No-AMI-for-Spark-1-2-using-ec2-scripts-tp21362.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Eclipse on spark
I am using SBT On 26 Jan 2015 15:54, Luke Wilson-Mawer lukewilsonma...@gmail.com wrote: I use this: http://scala-ide.org/ I also use Maven with this archetype: https://github.com/davidB/scala-archetype-simple. To be frank though, you should be fine using SBT. On Sat, Jan 24, 2015 at 6:33 PM, riginos samarasrigi...@gmail.com wrote: How to compile a Spark project in Scala IDE for Eclipse? I got many scala scripts and i no longer want to load them from scala-shell what can i do? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Eclipse-on-spark-tp21350.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Pairwise Processing of a List
AFAIK ordering is not strictly guaranteed unless the RDD is the product of a sort. I think that in practice, you'll never find elements of a file read in some random order, for example (although see the recent issue about partition ordering potentially depending on how the local file system lists them). Likewise I can't imagine you encounter elements from one Kafka partition out of order. One receiver hears one partition and create one block per block interval. What I'm not 100% clear on is whether you get undefined ordering when you have multiple threads listening in one receiver. You can always sort RDDs by a timestamp of some sort to be sure, although that has overheads. I'm also curious about what if anything is guaranteed here without a sort. On Mon, Jan 26, 2015 at 1:33 AM, Tobias Pfeiffer t...@preferred.jp wrote: Sean, On Mon, Jan 26, 2015 at 10:28 AM, Sean Owen so...@cloudera.com wrote: Note that RDDs don't really guarantee anything about ordering though, so this only makes sense if you've already sorted some upstream RDD by a timestamp or sequence number. Speaking of order, is there some reading on guarantees and non-guarantees about order in RDDs? For example, when reading a file and doing zipWithIndex, can I assume that the lines are numbered in order? Does this hold for receiving data from Kafka, too? Tobias - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
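Tying this back to the original pairwise question, a hedged sketch of the "sort by an explicit key, then pair consecutive elements" idea discussed above; the Event class and its timestamp field are assumptions:

    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed before Spark 1.3)

    case class Event(timestamp: Long, payload: String)   // assumed shape of the data

    // events: RDD[Event] assumed. Impose an explicit order, then join each element
    // with its successor so downstream code sees (event_i, event_i+1) pairs.
    val indexed = events.sortBy(_.timestamp).zipWithIndex().map(_.swap)   // (index, event)
    val shifted = indexed.map { case (i, e) => (i - 1, e) }               // successor keyed by predecessor index
    val consecutivePairs = indexed.join(shifted).values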
Re: Spark (Streaming?) holding on to Mesos Resources
Hi, What do your jobs do? Ideally post source code, but some description would already helpful to support you. Memory leaks can have several reasons - it may not be Spark at all. Thank you. Le 26 janv. 2015 22:28, Gerard Maas gerard.m...@gmail.com a écrit : (looks like the list didn't like a HTML table on the previous email. My excuses for any duplicates) Hi, We are observing with certain regularity that our Spark jobs, as Mesos framework, are hoarding resources and not releasing them, resulting in resource starvation to all jobs running on the Mesos cluster. For example: This is a job that has spark.cores.max = 4 and spark.executor.memory=3g | ID |Framework |Host|CPUs |Mem …5050-16506-1146497 FooStreaming dnode-4.hdfs.private 7 13.4 GB …5050-16506-1146495 FooStreamingdnode-0.hdfs.private 1 6.4 GB …5050-16506-1146491 FooStreamingdnode-5.hdfs.private 7 11.9 GB …5050-16506-1146449 FooStreamingdnode-3.hdfs.private 7 4.9 GB …5050-16506-1146247 FooStreamingdnode-1.hdfs.private 0.5 5.9 GB …5050-16506-1146226 FooStreamingdnode-2.hdfs.private 3 7.9 GB …5050-16506-1144069 FooStreamingdnode-3.hdfs.private 1 8.7 GB …5050-16506-1133091 FooStreamingdnode-5.hdfs.private 1 1.7 GB …5050-16506-1133090 FooStreamingdnode-2.hdfs.private 5 5.2 GB …5050-16506-1133089 FooStreamingdnode-1.hdfs.private 6.5 6.3 GB …5050-16506-1133088 FooStreamingdnode-4.hdfs.private 1 251 MB …5050-16506-1133087 FooStreamingdnode-0.hdfs.private 6.4 6.8 GB The only way to release the resources is by manually finding the process in the cluster and killing it. The jobs are often streaming but also batch jobs show this behavior. We have more streaming jobs than batch, so stats are biased. Any ideas of what's up here? Hopefully some very bad ugly bug that has been fixed already and that will urge us to upgrade our infra? Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0 -kr, Gerard.
Re: Can Spark benefit from Hive-like partitions?
I'm not actually using Hive at the moment - in fact, I'm trying to avoid it if I can. I'm just wondering whether Spark has anything similar I can leverage? Let me clarify, you do not need to have Hive installed, and what I'm suggesting is completely self-contained in Spark SQL. We support the Hive Query Language for expressing partitioned tables when you are using a HiveContext, but the execution will be done using RDDs. If you don't manually configure a hive installation, Spark will just create a local metastore in the current directory. In the future we are planning to support non-HiveQL mechanisms for expressing partitioning.
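A hedged sketch of what that looks like in practice with a HiveContext, as described above; the table, columns, and location are placeholders:

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)   // creates a local metastore if none is configured
    hiveCtx.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS events (id INT, payload STRING)
                   PARTITIONED BY (dt STRING)
                   LOCATION '/placeholder/warehouse/events'""")
    hiveCtx.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2015-01-26')")
    // a filter on the partition column only reads the matching directory
    hiveCtx.sql("SELECT COUNT(*) FROM events WHERE dt = '2015-01-26'").collect()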
Re: Spark (Streaming?) holding on to Mesos Resources
Hi Jörn, A memory leak on the job would be contained within the resources reserved for it, wouldn't it? And the job holding resources is not always the same. Sometimes it's one of the Streaming jobs, sometimes it's a heavy batch job that runs every hour. Looks to me that whatever is causing the issue, it's participating in the resource offer protocol of Mesos and my first suspect would be the Mesos scheduler in Spark. (The table above is the tab Offers from the Mesos UI. Are there any other factors involved in the offer acceptance/rejection between Mesos and a scheduler? What do you think? -kr, Gerard. On Mon, Jan 26, 2015 at 11:23 PM, Jörn Franke jornfra...@gmail.com wrote: Hi, What do your jobs do? Ideally post source code, but some description would already helpful to support you. Memory leaks can have several reasons - it may not be Spark at all. Thank you. Le 26 janv. 2015 22:28, Gerard Maas gerard.m...@gmail.com a écrit : (looks like the list didn't like a HTML table on the previous email. My excuses for any duplicates) Hi, We are observing with certain regularity that our Spark jobs, as Mesos framework, are hoarding resources and not releasing them, resulting in resource starvation to all jobs running on the Mesos cluster. For example: This is a job that has spark.cores.max = 4 and spark.executor.memory=3g | ID |Framework |Host|CPUs |Mem …5050-16506-1146497 FooStreaming dnode-4.hdfs.private 7 13.4 GB …5050-16506-1146495 FooStreamingdnode-0.hdfs.private 1 6.4 GB …5050-16506-1146491 FooStreamingdnode-5.hdfs.private 7 11.9 GB …5050-16506-1146449 FooStreamingdnode-3.hdfs.private 7 4.9 GB …5050-16506-1146247 FooStreamingdnode-1.hdfs.private 0.5 5.9 GB …5050-16506-1146226 FooStreamingdnode-2.hdfs.private 3 7.9 GB …5050-16506-1144069 FooStreamingdnode-3.hdfs.private 1 8.7 GB …5050-16506-1133091 FooStreamingdnode-5.hdfs.private 1 1.7 GB …5050-16506-1133090 FooStreamingdnode-2.hdfs.private 5 5.2 GB …5050-16506-1133089 FooStreamingdnode-1.hdfs.private 6.5 6.3 GB …5050-16506-1133088 FooStreamingdnode-4.hdfs.private 1 251 MB …5050-16506-1133087 FooStreamingdnode-0.hdfs.private 6.4 6.8 GB The only way to release the resources is by manually finding the process in the cluster and killing it. The jobs are often streaming but also batch jobs show this behavior. We have more streaming jobs than batch, so stats are biased. Any ideas of what's up here? Hopefully some very bad ugly bug that has been fixed already and that will urge us to upgrade our infra? Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0 -kr, Gerard.
Re: spark 1.2 - Writing parque fails for timestamp with Unsupported datatype TimestampType
I'm aiming for 1.3. On Mon, Jan 26, 2015 at 3:05 PM, Manoj Samel manojsamelt...@gmail.com wrote: Thanks Michael. I am sure there have been many requests for this support. Any release targeted for this? Thanks,
Re: spark 1.2 - Writing parquet fails for timestamp with Unsupported datatype TimestampType
Thanks Michael. I am sure there have been many requests for this support. Any release targeted for this? Thanks,

On Sat, Jan 24, 2015 at 11:47 AM, Michael Armbrust mich...@databricks.com wrote: Those annotations actually don't work because the timestamp in SQL has optional nanosecond precision. However, there is a PR to add support using Parquet's INT96 type: https://github.com/apache/spark/pull/3820

On Fri, Jan 23, 2015 at 12:08 PM, Manoj Samel manojsamelt...@gmail.com wrote: Looking further at the trace and ParquetTypes.scala, it seems there is no support for Timestamp and Date in fromPrimitiveDataType(ctype: DataType): Option[ParquetTypeInfo]. Since Parquet supports these types with some decoration over Int (https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md), is there any reason why Date / Timestamp are not supported right now? Thanks, Manoj

On Fri, Jan 23, 2015 at 11:40 AM, Manoj Samel manojsamelt...@gmail.com wrote: Using Spark 1.2. Read a CSV file, apply a schema to convert it to a SchemaRDD, and then call schemaRdd.saveAsParquetFile. If the schema includes TimestampType, the save fails with the following trace:

Exception in thread "main" java.lang.RuntimeException: Unsupported datatype TimestampType
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:343)
at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:292)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291)
at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:363)
at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:362)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:361)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:407)
at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:166)
at org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:145)
at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:204)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76)
at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:108)
at bdrt.MyTest$.createParquetWithDate(MyTest.scala:88)
at bdrt.MyTest$delayedInit$body.apply(MyTest.scala:54)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at bdrt.MyTest$.main(MyTest.scala:10)
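Until that PR is merged, one workaround is to keep the column as a primitive type that the 1.2 Parquet writer does handle, for example epoch milliseconds as a Long, and convert back to java.sql.Timestamp after reading. Below is a minimal sketch of that idea, assuming the Spark 1.2 SchemaRDD API; the Event case class, CSV layout and file paths are made up for illustration:

import java.sql.Timestamp
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record: the event time is kept as epoch millis (Long),
// so the Parquet writer never sees TimestampType.
case class Event(id: String, eventTimeMillis: Long)

object TimestampAsLongExample extends App {
  val sc = new SparkContext(new SparkConf().setAppName("timestamp-as-long"))
  val sqlContext = new SQLContext(sc)
  import sqlContext._  // createSchemaRDD implicit in Spark 1.2

  // Parse the CSV; f(1) is assumed to be "yyyy-MM-dd HH:mm:ss" so that
  // Timestamp.valueOf can parse it.
  val events = sc.textFile("events.csv").map(_.split(",")).map { f =>
    Event(f(0), Timestamp.valueOf(f(1)).getTime)
  }

  // The case-class RDD converts implicitly to a SchemaRDD; the write
  // succeeds because the schema only contains StringType and LongType.
  events.saveAsParquetFile("events.parquet")

  // On read, rebuild the Timestamp from the Long column.
  val restored = sqlContext.parquetFile("events.parquet")
    .map(r => (r.getString(0), new Timestamp(r.getLong(1))))
  restored.take(5).foreach(println)

  sc.stop()
}

The same idea should work for DateType (store days since epoch in an Int, or millis in a Long) until native support lands.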
Re: Can Spark benefit from Hive-like partitions?
Ah, well that is interesting. I'll experiment further tomorrow. Thank you for the info!
Re: Spark webUI - application details page
Hi, I don't have any history server running. As SK already pointed out in a previous post, the history server seems to be required only in Mesos or YARN mode, not in standalone mode. From https://spark.apache.org/docs/1.1.1/monitoring.html: "If Spark is run on Mesos or YARN, it is still possible to reconstruct the UI of a finished application through Spark's history server, provided that the application's event logs exist." -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p21379.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
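For completeness: in either case the completed-applications view (whether served by the standalone Master UI or by a history server) depends on the application writing event logs. A minimal sketch of turning that on from the driver, assuming a Spark 1.x SparkConf; the log directory is only an example and must be readable by whichever process rebuilds the UI:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("event-log-example")
  // Write event logs so the UI of this app can be reconstructed after it finishes.
  .set("spark.eventLog.enabled", "true")
  // Example location; use a path the Master / history server can read (HDFS or a shared filesystem).
  .set("spark.eventLog.dir", "hdfs:///tmp/spark-events")

val sc = new SparkContext(conf)
// ... run the job ...
sc.stop()

The same two properties can also be set in conf/spark-defaults.conf instead of in code.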