Re: parallelize for a large Seq is extremely slow.
Using reduceByKey(_+_).countByKey instead of countByKey alone seems to be fast.
Re: parallelize for a large Seq is extremely slow.
parallelize is still so slow.

package com.semi.nlp

import org.apache.spark._
import SparkContext._
import scala.io.Source
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Map[String,Int]])
    kryo.register(classOf[Map[String,Long]])
    kryo.register(classOf[Seq[(String,Long)]])
    kryo.register(classOf[Seq[(String,Int)]])
  }
}

object WFilter2 {
  def initspark(name: String) = {
    val conf = new SparkConf()
      .setMaster("yarn-standalone")
      .setAppName(name)
      .setSparkHome(System.getenv("SPARK_HOME"))
      .setJars(SparkContext.jarOfClass(this.getClass))
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      //.set("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", "256")
      .set("spark.kryo.registrator", "com.semi.nlp.MyRegistrator")
      .set("spark.cores.max", "30")
    new SparkContext(conf)
  }

  def main(args: Array[String]) {
    val spark = initspark("word filter mapping")
    val stopset = spark broadcast Source.fromURL(this.getClass.getResource("/stoplist.txt")).getLines.map(_.trim).toSet
    val file = spark.textFile("hdfs://ns1/nlp/wiki.splited")
    val tf_map = spark broadcast file.flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).countByKey
    val df_map = spark broadcast file.flatMap(x => Set(x.split("\t"): _*).toBuffer).map((_,1)).reduceByKey(_+_).countByKey
    val word_mapping = spark broadcast Map(df_map.value.keys.zipWithIndex.toBuffer: _*)
    def w_filter(w: String) = if (tf_map.value(w) < 8 || df_map.value(w) < 4 || (stopset.value contains w)) false else true
    val mapped = file.map(_.split("\t").filter(w_filter).map(w => word_mapping.value(w)).mkString("\t"))
    spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping")
    mapped.saveAsTextFile("hdfs://ns1/nlp/lda/wiki.docs")
    spark.stop()
  }
}
how to get subArray without copy
Hi, all. I want to do the following operations:
(1) each partition does some operations on the partition data in Array format
(2) split the array into subarrays, and combine each subarray with an id
(3) do a shuffle according to the id

Here is the pseudo code:

/* pseudo code */
case class MyObject(val id: Int, val arr: Array[T])

val b = a.mapPartitions { itr =>
  val c = itr.toArray
  /* some operations on c */
  val d = new Array[MyObject](2)
  d(0) = MyObject(0, c(index0 to index1)) // line 10
  d(1) = MyObject(1, c(index2 to index3)) // line 11
  d.toIterator
}
b.groupBy(id...)

My question is how to get the subarray memory-efficiently in lines 10 and 11; I don't want val d to occupy extra memory. Is there a way to do something like a pointer reference in C?

Array.slice does a copy, and it consumes 4x the memory of the original array; I don't know the reason. Is it related to Java autoboxing?

Array.view returns an IndexedSeqView. If you convert it to an Array right in line 10,

d(0) = MyObject(0, c.view(index0, index1).toArray) // line 10

it's the same as Array.slice. If you convert it to an Array after b.groupBy(id...), an error occurs since it's not serializable:

ERROR executor.Executor: Exception in task ID 1
java.io.NotSerializableException: scala.collection.mutable.IndexedSeqView$$anon$2
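Not from the thread, but one possible way to avoid the copy is to keep a single reference to the backing array plus explicit offsets, rather than materializing sub-arrays. A minimal sketch under that assumption (names and sizes are illustrative; a is assumed to be an RDD[Int]):

// Each "slice" just remembers where it starts and ends in the shared backing array.
case class ArraySlice(id: Int, backing: Array[Int], from: Int, until: Int) {
  def length: Int = until - from
  def apply(i: Int): Int = backing(from + i)
}

val a = sc.parallelize(1 to 1000000, 4)   // sc: an existing SparkContext
val b = a.mapPartitions { itr =>
  val c = itr.toArray
  // some operations on c ...
  // no element copy here: both slices point into the same array
  Iterator(ArraySlice(0, c, 0, c.length / 2),
           ArraySlice(1, c, c.length / 2, c.length))
}

Case classes are serializable, so this also sidesteps the IndexedSeqView serialization error. The catch is that once the slices cross a shuffle (b.groupBy(...)), each slice serializes its whole backing array, so this only saves memory before the shuffle.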
Re: Running out of memory Naive Bayes
I'm just wondering: are the Spark vector calculations really taking the sparsity into account, or just converting to dense?

On Fri, Apr 25, 2014 at 10:06 PM, John King usedforprinting...@gmail.com wrote:
I've been trying to use the Naive Bayes classifier. Each example in the dataset has about 2 million features, only about 20-50 of which are non-zero, so the vectors are very sparse. I keep running out of memory though, even for about 1000 examples on 30 GB of RAM, while the entire dataset is 4 million examples. And I would also like to note that I'm using the sparse vector class.
Re: Parquet-SPARK-PIG integration.
Figured out how to do it, so I thought of sharing in case someone is interested.

import parquet.column.ColumnReader
import parquet.filter.ColumnRecordFilter._
import parquet.filter.ColumnPredicates._
import parquet.hadoop.{ParquetOutputFormat, ParquetInputFormat}
import org.apache.hadoop.mapred.JobConf
import parquet.bytes.BytesInput
import parquet.pig.TupleReadSupport
import org.apache.pig.data.Tuple

val conf = new JobConf()
conf.set("parquet.pig.schema", "id:int,name:chararray")
ParquetInputFormat.setReadSupportClass(conf, classOf[TupleReadSupport])
val file = sc.newAPIHadoopFile("path/part-m-0.parquet", classOf[ParquetInputFormat[Tuple]], classOf[Void], classOf[Tuple], conf)
  .map(x => (x._2.get(0), x._2.get(1)))
  .collect

Regards,
SB

On Sat, Apr 26, 2014 at 3:31 PM, suman bharadwaj suman@gmail.com wrote:
Hi All,

We have written Pig jobs which output the data in Parquet format. For example:

register parquet-column-1.3.1.jar;
register parquet-common-1.3.1.jar;
register parquet-format-2.0.0.jar;
register parquet-hadoop-1.3.1.jar;
register parquet-pig-1.3.1.jar;
register parquet-encoding-1.3.1.jar;

A = load 'path' using PigStorage('\t') as (in:int, name:chararray);
store A into 'output_path' using parquet.pig.ParquetStorer();

Now how do I read this Parquet file in Spark? Thanks in advance.

Regards,
SB
Re: parallelize for a large Seq is extremely slow.
Could it be that the default number of partitions used by parallelize() is too small in this case? Try something like spark.parallelize(word_mapping.value.toSeq, 60). (Given your setup, it should already be 30, but perhaps that's not the case in YARN mode...)

On Fri, Apr 25, 2014 at 11:38 PM, Earthson earthson...@gmail.com wrote:
parallelize is still so slow.
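A minimal sketch of that suggestion, assuming the SparkContext named spark from the code quoted earlier; the second argument to parallelize is the number of partitions, and 60 is just the value proposed in the reply:

val seq = word_mapping.value.toSeq
val byDefault = spark.parallelize(seq)      // no count given: uses the scheduler's default parallelism
val explicit  = spark.parallelize(seq, 60)  // explicitly request 60 partitions
println(byDefault.partitions.size + " vs " + explicit.partitions.size)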
Using Spark in IntelliJ Scala Console
Hi all,

TL;DR: running Spark locally through the IntelliJ IDEA Scala Console results in java.lang.ClassNotFoundException.

Long version:

I'm an algorithms developer at SupersonicAds - an ad network. We are building a major new big data project and we are now in the process of selecting our tech stack tools. I'm new to Spark, but I'm very excited about it. It is my opinion that Spark can be a great tool for us, and that we might be able to build most of our toolchain on top of it.

We currently develop in Scala and we are using IntelliJ IDEA as our IDE (we love it). One of the features I love about IDEA is the Scala Console, which lets me work interactively with all of my project's code available and all of the IDE's convenience features. That is as opposed to the Scala Shell / Spark Shell, which I dislike because it is based on JLine and doesn't behave like a good shell would (I can't even Ctrl-C to abort a line without crashing the whole thing). Of course, as an algo guy, having a good REPL is crucial to me.

To get started, I added the following line to build.sbt:

"org.apache.spark" %% "spark-core" % "0.9.1"

Then, added the following main class:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Main extends App {
  val sc = new SparkContext("local", "myApp")
  val r = sc.parallelize(1 to 1000)
  println("r.filter(_ % 2 == 1).first() = " + r.filter(_ % 2 == 1).first())
  println("r.filter(_ % 2 == 1).count() = " + r.filter(_ % 2 == 1).count())
}

Make, Run - works perfectly. Next, I try running the same in the Scala Console. Bad news - the last line throws an exception:

ERROR executor.Executor: Exception in task ID 0
java.lang.ClassNotFoundException: $line5.$read$$iw$$iw$$iw$$iw$$anonfun$2

It is my guess that for some reason Spark is not able to find the anonymous function (_ % 2 == 1). Please note I'm running locally, so I did not provide any jars. For some reason it works when using first() instead of count(). Needless to say, it also works in the Spark Shell, but as I stated, working with it is not an option.

This issue brings much sadness to my heart, and I could not find a solution on the mailing list archives or elsewhere. I am hoping someone here might offer some help.

Thanks,
Jon
Re: Using Spark in IntelliJ Scala Console
The spark-shell is a special version of the Scala REPL that serves the classes created for each line over HTTP. Do you know if the IntelliJ Spark console is just the normal Scala REPL in a GUI wrapper, or if it is something else entirely? If it's the former, perhaps it might be possible to tell IntelliJ to bring up the Spark version instead.

On Sat, Apr 26, 2014 at 10:47 AM, Jonathan Chayat jonatha...@supersonicads.com wrote:
Hi all, TL;DR: running Spark locally through the IntelliJ IDEA Scala Console results in java.lang.ClassNotFoundException.
Re: Question about Transforming huge files from Local to HDFS
1) When I tried to read a huge file from local and used Avro + Parquet to transform it into Parquet format and stored it to HDFS using the API saveAsNewAPIHadoopFile, the JVM would run out of memory, because the file is too large to fit in memory.

How much memory are you giving the JVM, and how many cores are you giving Spark? You do not need to be able to fit the entire dataset in memory, but since Parquet is a columnar format you are going to need a pretty large buffer space when writing out a lot of data. I believe this is especially true if you have a lot of columns. Options include increasing the amount of memory you give Spark executors (spark.executor.memory) or decreasing the number of parallel tasks (spark.cores.max). You might also try asking on the Parquet mailing list. There are probably options for configuring how much buffer space it allocates. However, a lot of the performance benefits of formats like Parquet come from large buffers, so this may not be the best option.

2) When I tried to read a fraction of them and write to HDFS in Parquet format using the API saveAsNewAPIHadoopFile, I found that each loop would generate a directory with a list of files; namely, it would be treated as several independent outputs, which was not what I would like and would cause some problems when I tried to process them in the future.

This is slightly different, but in Spark SQL (coming with the 1.0 release) there is experimental support for creating a virtual table that is backed by a Parquet file. You can do many insertions into this table, and they will all be read by a single job pointing to that directory.
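Not from the thread, but a rough sketch of the Spark SQL route mentioned above, written against the Spark 1.0 API as I recall it (parquetFile / registerAsTable); the path and column name are made up:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext

// Point Spark SQL at a directory of Parquet files and expose it as a table.
val events = sqlContext.parquetFile("hdfs://ns1/warehouse/events.parquet")
events.registerAsTable("events")

// One job reads everything under that directory.
val sample = sqlContext.sql("SELECT id FROM events LIMIT 10").collect()

Whether the repeated-insert part behaves exactly this way in 1.0 I can't say for sure; the read path above is the part I'm reasonably confident about.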
Re: Spark and HBase
Thank you for sharing. Phoenix for realtime queries and Spark for more complex batch processing seems like a potentially good combo. I wonder if Spark's future will include support for the same kinds of workloads that Phoenix is being built for. This little tidbit (http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html) about the future of Spark SQL seems to suggest just that (noting for others reading that Phoenix is basically a SQL skin over HBase):

Look for future blog posts on the following topics:
- ...
- Reading and writing data using other formats and systems, including Avro and HBase

It would certainly be nice to have one big data framework to rule them all.

Nick

On Sat, Apr 26, 2014 at 10:00 AM, Josh Mahonin jmaho...@filetrek.com wrote:
We're still in the infancy stages of the architecture for the project I'm on, but presently we're investigating an HBase / Phoenix data store for its realtime query abilities, and being able to expose data over a JDBC connector is attractive for us. Much of our data is event based, and many of the reports we'd like to do can be accomplished using simple SQL queries on that data - assuming they are performant. Thus far, the evidence is showing that it is, even across many millions of rows. However, there are a number of models we have that today exist as a combination of Pig and Python batch jobs that I'd like to replace with Spark, which thus far has shown to be more than adequate for what we're doing today.

As far as using Phoenix as an endpoint for a batch load, the only real advantage I see over using straight HBase is that I can specify a query to prefilter the data before attaching it to an RDD. I haven't run the numbers yet to see how this compares to more traditional methods though. The only worry I have is that the Phoenix input format doesn't adequately split the data across multiple nodes, so that's something I will need to look at further.

Josh

On Apr 25, 2014, at 6:33 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Josh, is there a specific use pattern you think is served well by Phoenix + Spark? Just curious.

On Fri, Apr 25, 2014 at 3:17 PM, Josh Mahonin jmaho...@filetrek.com wrote:
Phoenix generally presents itself as an endpoint using JDBC, which in my testing seems to play nicely with JdbcRDD. However, a few days ago a patch was made against Phoenix to implement support via Pig using a custom Hadoop InputFormat, which means now it has Spark support too. Here's a code snippet that sets up an RDD for a specific query:

--
val phoenixConf = new PhoenixPigConfiguration(new Configuration())
phoenixConf.setSelectStatement("SELECT EVENTTYPE,EVENTTIME FROM EVENTS WHERE EVENTTYPE = 'some_type'")
phoenixConf.setSelectColumns("EVENTTYPE,EVENTTIME")
phoenixConf.configure("servername", "EVENTS", 100L)
val phoenixRDD = sc.newAPIHadoopRDD(
  phoenixConf.getConfiguration(),
  classOf[PhoenixInputFormat],
  classOf[NullWritable],
  classOf[PhoenixRecord])
--

I'm still very new at Spark and even less experienced with Phoenix, but I'm hoping there's an advantage over the JdbcRDD in terms of partitioning. The JdbcRDD seems to implement partitioning based on a query predicate that is user defined, but I think Phoenix's InputFormat is able to figure out the splits itself, which Spark is able to leverage. I don't really know how to verify if this is the case or not though, so if anyone else is looking into this, I'd love to hear their thoughts.
Josh

On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Just took a quick look at the overview here (http://phoenix.incubator.apache.org/) and the quick start guide here (http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html). It looks like Apache Phoenix aims to provide flexible SQL access to data, both for transactional and analytic purposes, and at interactive speeds.

Nick

On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang binwang...@gmail.com wrote:
First, I have not tried it myself. However, from what I have heard it has some basic SQL features, so you can query your HBase table much like querying content on HDFS using Hive. So it is not just querying a simple column; I believe you can do joins and other SQL queries. Maybe you can wrap up an EMR cluster with HBase preconfigured and give it a try. Sorry I cannot provide a more detailed explanation and help.

On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier pomperma...@okkam.it wrote:
Thanks for the quick reply Bin. Phoenix is something I'm going to try for sure, but it seems somehow useless if I can use Spark. Probably, as you said, since Phoenix uses a dedicated data structure within each HBase table it has more effective memory usage, but if I need to deserialize data stored in an HBase cell I still have to read that object into memory, and thus I need Spark. From what I understood Phoenix is
Re: Using Spark in IntelliJ Scala Console
This is a little bit of a hack, but might work for you. You'll need to be on sbt 0.13.2.

connectInput in run := true

outputStrategy in run := Some(StdoutOutput)

console := {
  (runMain in Compile).toTask(" org.apache.spark.repl.Main -usejavacp").value
}

On Sat, Apr 26, 2014 at 1:05 PM, Jonathan Chayat jonatha...@supersonicads.com wrote:
Hi Michael, thanks for your prompt reply.

It seems like the IntelliJ Scala Console actually runs the Scala REPL (they print the same stuff when starting up). It is probably the SBT console. When I tried the same code in the Scala REPL of my project using sbt console, it didn't work either. It only worked in the Spark project's bin/spark-shell.

Is there a way to customize the SBT console of a project listing Spark as a dependency?

Thx,
Jon

On Sat, Apr 26, 2014 at 9:42 PM, Michael Armbrust mich...@databricks.com wrote:
The spark-shell is a special version of the Scala REPL that serves the classes created for each line over HTTP.
Re: Using Spark in IntelliJ Scala Console
You'll also need:

libraryDependencies += "org.apache.spark" %% "spark-repl" % "<spark version>"

On Sat, Apr 26, 2014 at 3:32 PM, Michael Armbrust mich...@databricks.com wrote:
This is a little bit of a hack, but might work for you. You'll need to be on sbt 0.13.2.
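Pulling the two replies together, a minimal build.sbt sketch of the suggested setup (sbt 0.13.2; the version strings are whatever matches your Spark build - 0.9.1 here only because that is what the original post used):

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"

libraryDependencies += "org.apache.spark" %% "spark-repl" % "0.9.1"

connectInput in run := true

outputStrategy in run := Some(StdoutOutput)

// Replace the default `sbt console` task with the Spark REPL.
console := {
  (runMain in Compile).toTask(" org.apache.spark.repl.Main -usejavacp").value
}

As the next message in the thread shows, this can still fail with compiler-mirror errors, so treat it as a starting point rather than a known-good recipe.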
Re: how to get subArray without copy
The only way I can find is to use a 2-D array, if the split has regularity.
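A minimal sketch of that idea, under the assumption that each partition splits into fixed-size chunks (the chunk size is made up, and a is assumed to be an RDD[Int] as in the original question):

val chunkSize = 1000
val b = a.mapPartitions { itr =>
  // Build the rows of the 2-D structure directly from the iterator, instead of
  // first materializing the whole partition as one flat array and then slicing it.
  itr.grouped(chunkSize).map(_.toArray).zipWithIndex.map { case (row, id) => (id, row) }
}
b.groupBy(_._1)

This avoids keeping the whole partition as a flat array next to sliced copies; only one chunk is buffered at a time while its row is built.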
Re: Running out of memory Naive Bayes
Which version of MLlib are you using? For Spark 1.0, MLlib will support sparse feature vectors, which will improve performance a lot when computing the distance between points and a centroid.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Sat, Apr 26, 2014 at 5:49 AM, John King usedforprinting...@gmail.com wrote:
I'm just wondering: are the Spark vector calculations really taking the sparsity into account, or just converting to dense?
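Not from the thread, but for reference, a minimal sketch of feeding sparse vectors to MLlib's Naive Bayes with the Spark 1.0 API (the indices, values and labels below are made up):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// 2 million features, but only the listed positions are non-zero.
val numFeatures = 2000000
val examples = Seq(
  LabeledPoint(1.0, Vectors.sparse(numFeatures, Array(3, 417, 1999999), Array(1.0, 2.0, 1.0))),
  LabeledPoint(0.0, Vectors.sparse(numFeatures, Array(12, 90210), Array(4.0, 1.0))))

val training = sc.parallelize(examples)   // sc: an existing SparkContext
val model = NaiveBayes.train(training, lambda = 1.0)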
questions about debugging a spark application
Hi, all. I have some questions about debugging in Spark:

1) When the application finishes, the application UI is shut down and I cannot see the details about the app, like shuffle sizes, durations, stage information... There is not sufficient information in the master UI. Do I need to keep the application hanging on?

2) How do I get details about each task an executor runs, like memory usage?

3) Since I'm not familiar with the JVM: do I need to run the program step by step, or keep the program hanging on, to use JVM utilities like jstack, jmap...?
Re: Running out of memory Naive Bayes
How many labels does your dataset have? -Xiangrui

On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai dbt...@stanford.edu wrote:
Which version of MLlib are you using? For Spark 1.0, MLlib will support sparse feature vectors, which will improve performance a lot when computing the distance between points and a centroid.
Re: Using Spark in IntelliJ Scala Console
In IntelliJ, nothing changed. In the SBT console I got this error:

$ sbt console
[info] Running org.apache.spark.repl.Main -usejavacp
14/04/27 08:29:44 INFO spark.HttpServer: Starting HTTP Server
14/04/27 08:29:44 INFO server.Server: jetty-7.6.8.v20121106
14/04/27 08:29:44 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:50966
Failed to initialize compiler: object scala.annotation.Annotation in compiler mirror not found.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programatically, settings.usejavacp.value = true.
14/04/27 08:29:44 WARN repl.SparkILoop$SparkILoopInterpreter: Warning: compiler accessed before init set up. Assuming no postInit code.
Failed to initialize compiler: object scala.annotation.Annotation in compiler mirror not found.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programatically, settings.usejavacp.value = true.
[error] (run-main-0) java.lang.AssertionError: assertion failed: null
java.lang.AssertionError: assertion failed: null
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:197)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:924)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:881)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:881)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:881)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:973)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
[trace] Stack trace suppressed: run last compile:runMain for the full output.
java.lang.RuntimeException: Nonzero exit code: 1
    at scala.sys.package$.error(package.scala:27)
[trace] Stack trace suppressed: run last compile:runMain for the full output.
[error] (compile:runMain) Nonzero exit code: 1
[error] Total time: 1 s, completed Apr 27, 2014 8:29:44 AM

On Sun, Apr 27, 2014 at 1:42 AM, Michael Armbrust mich...@databricks.com wrote:
You'll also need:

libraryDependencies += "org.apache.spark" %% "spark-repl" % "<spark version>"