Re: parallelize for a large Seq is extreamly slow.

2014-04-26 Thread Earthson
Using reduceByKey(_+_).countByKey instead of countByKey alone seems to be much faster.
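For reference, a rough sketch of the two variants (assuming spark is the SparkContext from the thread; an illustration, not the exact job):

val words = spark.textFile("hdfs://ns1/nlp/wiki.splited").flatMap(_.split("\t"))

// countByKey is an action: it builds per-partition count maps and merges them at
// the driver, which can get expensive with millions of distinct keys
val counts1 = words.map((_, 1)).countByKey()

// pre-aggregating with reduceByKey does the summing on the executors first; note
// that after reduceByKey each key occurs exactly once, so to keep the summed
// counts you would collect the reduced RDD itself, e.g. with collectAsMap()
val counts2 = words.map((_, 1)).reduceByKey(_ + _).collectAsMap()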





Re: parallelize for a large Seq is extreamly slow.

2014-04-26 Thread Earthson
parallelize is still so slow. 



package com.semi.nlp

import org.apache.spark._
import SparkContext._
import scala.io.Source
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Map[String,Int]])
    kryo.register(classOf[Map[String,Long]])
    kryo.register(classOf[Seq[(String,Long)]])
    kryo.register(classOf[Seq[(String,Int)]])
  }
}

object WFilter2 {
  def initspark(name: String) = {
    val conf = new SparkConf()
      .setMaster("yarn-standalone")
      .setAppName(name)
      .setSparkHome(System.getenv("SPARK_HOME"))
      .setJars(SparkContext.jarOfClass(this.getClass))
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      //.set("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", "256")
      .set("spark.kryo.registrator", "com.semi.nlp.MyRegistrator")
      .set("spark.cores.max", "30")
    new SparkContext(conf)
  }

  def main(args: Array[String]) {
    val spark = initspark("word filter mapping")
    val stopset = spark broadcast Source.fromURL(this.getClass.getResource("/stoplist.txt")).getLines.map(_.trim).toSet
    val file = spark.textFile("hdfs://ns1/nlp/wiki.splited")
    val tf_map = spark broadcast file.flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).countByKey
    val df_map = spark broadcast file.flatMap(x => Set(x.split("\t"): _*).toBuffer).map((_,1)).reduceByKey(_+_).countByKey
    val word_mapping = spark broadcast Map(df_map.value.keys.zipWithIndex.toBuffer: _*)
    def w_filter(w: String) = if (tf_map.value(w) < 8 || df_map.value(w) < 4 || (stopset.value contains w)) false else true
    val mapped = file.map(_.split("\t").filter(w_filter).map(w => word_mapping.value(w)).mkString("\t"))

    spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping")
    mapped.saveAsTextFile("hdfs://ns1/nlp/lda/wiki.docs")
    spark.stop()
  }
}





how to get subArray without copy

2014-04-26 Thread wxhsdp
Hi, all
  I want to do the following operations:
  (1) each partition does some operations on the partition data in Array
format
  (2) split the array into subArrays, and combine each subArray with an id
  (3) do a shuffle according to the id

  here is the pseudo code
  /*pseudo code*/

  case class MyObject(val id: Int, val arr: Array[T])

  val b = a.mapPartitions{ itr =>
val c = itr.toArray

/*some operations on c*/

val d = new Array[MyObject](2)

d(0) = (0, c(index0 to index1)) //line 10
d(1) = (1, c(index2 to index3)) //line 11

d.toIterator
  }

  b.groupBy(id...)

  My question is how to get the subArray with memory efficiency in lines 10
and 11; I don't want val d to
  occupy extra memory. Is there a way to do something like a pointer reference in C?

   Array.slice does a copy, and it consumes 4x the memory of the original one; I
don't know the reason. Is it
  related to Java autoboxing?

   Array.view returns an IndexedSeqView; if you convert it to Array right in
line 10,
   d(0) = (0, c.view(index0, index1).toArray) //line 10
   it's the same as Array.slice.

   If you convert it to Array after b.groupBy(id...), an error occurs since
it's not serializable:

ERROR executor.Executor: Exception in task ID 1
java.io.NotSerializableException:
scala.collection.mutable.IndexedSeqView$$anon$2
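
One workaround that might be worth trying (a rough, untested sketch, reusing a from the pseudo code above and assuming Int data for concreteness): keep the backing array and only carry the index range, materializing a real copy just when it is needed. The caveat is that every emitted record still references the whole backing array, so a shuffle serializes that array once per record.

// case classes are serializable, so this survives the shuffle (unlike IndexedSeqView)
case class ArraySlice(backing: Array[Int], from: Int, until: Int) {
  def length: Int = until - from
  def apply(i: Int): Int = backing(from + i)
  def materialize: Array[Int] = backing.slice(from, until)   // copy only on demand
}

val b = a.mapPartitions { itr =>
  val c = itr.toArray
  /* some operations on c */
  val mid = c.length / 2
  Iterator((0, ArraySlice(c, 0, mid)), (1, ArraySlice(c, mid, c.length)))
}
b.groupByKey()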






Re: Running out of memory Naive Bayes

2014-04-26 Thread John King
I'm just wondering: are the SparkVector calculations really taking the
sparsity into account, or just converting to dense?


On Fri, Apr 25, 2014 at 10:06 PM, John King usedforprinting...@gmail.com wrote:

 I've been trying to use the Naive Bayes classifier. Each example in the
 dataset has about 2 million features, only about 20-50 of which are
 non-zero, so the vectors are very sparse. I keep running out of memory
 though, even for about 1000 examples on 30 GB of RAM, while the entire dataset
 is 4 million examples. I would also like to note that I'm using the
 sparse vector class.



Re: Parquet-SPARK-PIG integration.

2014-04-26 Thread suman bharadwaj
Figured out how to do it. Hence thought of sharing in case someone is
interested.

import parquet.column.ColumnReader
import parquet.filter.ColumnRecordFilter._
import parquet.filter.ColumnPredicates._
import parquet.hadoop.{ParquetOutputFormat, ParquetInputFormat}
import org.apache.hadoop.mapred.JobConf
import parquet.bytes.BytesInput
import parquet.pig.TupleReadSupport
import org.apache.pig.data.Tuple

val conf = new JobConf()
conf.set("parquet.pig.schema", "id:int,name:chararray")
ParquetInputFormat.setReadSupportClass(conf, classOf[TupleReadSupport])
val file = sc.newAPIHadoopFile("path/part-m-0.parquet", classOf[ParquetInputFormat[Tuple]], classOf[Void], classOf[Tuple], conf)
  .map(x => (x._2.get(0), x._2.get(1)))
  .collect

Regards,
SB


On Sat, Apr 26, 2014 at 3:31 PM, suman bharadwaj suman@gmail.com wrote:

 Hi All,

 We have written PIG Jobs which outputs the data in parquet format.

 For eg:

 register parquet-column-1.3.1.jar;
 register parquet-common-1.3.1.jar;
 register parquet-format-2.0.0.jar;
 register parquet-hadoop-1.3.1.jar;
 register parquet-pig-1.3.1.jar;
 register parquet-encoding-1.3.1.jar;

 A =load 'path' using PigStorage('\t') as (in:int,name:chararray);
 store A into 'output_path' using parquet.pig.ParquetStorer();

 Now how do i read this parquet file in SPARK ?

 Thanks in advance.

 Regards,
 SB



Re: parallelize for a large Seq is extreamly slow.

2014-04-26 Thread Aaron Davidson
Could it be that the default number of partitions of parallelize() you're
using is too small in this case? Try something like
spark.parallelize(word_mapping.value.toSeq, 60). (Given your setup, it
should already be 30, but perhaps that's not the case in YARN mode...)
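
For reference, a minimal sketch of the suggestion (reusing the names from the thread, so spark is the SparkContext and word_mapping is the broadcast variable):

val rdd = spark.parallelize(word_mapping.value.toSeq, 60)   // second argument = number of slices
println("partitions = " + rdd.partitions.size)

// the default comes from spark.default.parallelism, which can also be set up front:
// new SparkConf().set("spark.default.parallelism", "60")
rdd.saveAsTextFile("hdfs://ns1/nlp/word_mapping")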


On Fri, Apr 25, 2014 at 11:38 PM, Earthson earthson...@gmail.com wrote:

 parallelize is still so slow.



Using Spark in IntelliJ Scala Console

2014-04-26 Thread Jonathan Chayat
Hi all,

TLDR: running spark locally through IntelliJ IDEA Scala Console results
in java.lang.ClassNotFoundException

Long version:

I'm an algorithms developer in SupersonicAds - an ad network. We are
building a major new big data project and we are now in the process of
selecting our tech stack & tools.

I'm new to Spark, but I'm very excited about it. It is my opinion that
Spark can be a great tool for us, and that we might be able to build most
of our toolchain on top of it.

We currently develop in Scala and we are using IntelliJ IDEA as our IDE (we
love it). One of the features I love about IDEA is the Scala Console, which
lets me work interactively with all of my project's code available and all
of the IDE's features & convenience. That is as opposed to the Scala Shell
& Spark Shell, which I dislike because it is based on JLine and doesn't
behave like a good shell would (I can't even Ctrl-C to abort a line without
crashing the whole thing). Of course, as an algo guy, having a good REPL is
crucial to me.

To get started, I added the following line to build.sbt:

 "org.apache.spark" %% "spark-core" % "0.9.1"


Then, added the following main class:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object Main extends App {
  val sc = new SparkContext("local", "myApp")
  val r = sc.parallelize(1 to 1000)
  println("r.filter(_ % 2 == 1).first() = " + r.filter(_ % 2 == 1).first())
  println("r.filter(_ % 2 == 1).count() = " + r.filter(_ % 2 == 1).count())
}


Make, Run, Works perfectly.

Next, I try running the same in the scala console.
Bad news - the last line throws an exception:

 ERROR executor.Executor: Exception in task ID 0
 java.lang.ClassNotFoundException: $line5.$read$$iw$$iw$$iw$$iw$$anonfun$2


It is my guess that for some reason Spark is not able to find the anonymous
function (_ % 2 == 1). Please note I'm running locally so I did not provide
any jars. For some reason when using first() instead of count() it works.
Needless to say it also works in Spark Shell but as I stated, working with
it is not an option.

This issue brings much sadness to my heart, and I could not find a solution
on the mailing list archives or elsewhere. I am hoping someone here might
offer some help.

Thanks,
Jon


Re: Using Spark in IntelliJ Scala Console

2014-04-26 Thread Michael Armbrust
The spark-shell is a special version of the Scala REPL that serves the
classes created for each line over HTTP.  Do you know if the IntelliJ Spark
console is just the normal Scala REPL in a GUI wrapper, or if it is
something else entirely?  If it's the former, perhaps it might be possible
to tell IntelliJ to bring up the Spark version instead.


On Sat, Apr 26, 2014 at 10:47 AM, Jonathan Chayat 
jonatha...@supersonicads.com wrote:

 Hi all,

 TLDR: running spark locally through IntelliJ IDEA Scala Console results
 in java.lang.ClassNotFoundException




Re: Question about Transforming huge files from Local to HDFS

2014-04-26 Thread Michael Armbrust
 1) When I tried to read a huge file from local and used Avro + Parquet to
 transform it into Parquet format and store it to HDFS using the API
 saveAsNewAPIHadoopFile, the JVM would run out of memory, because the file
 is too large to fit in memory.


How much memory are you giving the JVM and how many cores are you giving
spark?  You do not need to be able to fit the entire dataset in memory, but
since parquet is a columnar format you are going to need a pretty large
buffer space when writing out a lot of data.  I believe this is especially
true if you have a lot of columns.

Options include increasing the amount of memory you give Spark executors
(spark.executor.memory) or decreasing the number of parallel tasks
(spark.cores.max).
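
For example, both knobs can be set on the SparkConf before creating the context (a sketch; the right values depend on your cluster and schema):

val conf = new SparkConf()
  .setAppName("parquet-write")
  .set("spark.executor.memory", "4g")   // more heap per executor for Parquet's write buffers
  .set("spark.cores.max", "8")          // fewer concurrent tasks, so fewer buffers alive at once
val sc = new SparkContext(conf)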

You might also try asking on the parquet mailing list.  There are probably
options for configuring how much buffer space it allocates.  However, a lot
of the performance benefits of formats like parquet come from large buffers
so this may not be the best option.


 2) When I tried to read a fraction of them and write to HDFS as Parquet
 format using the API saveAsNewAPIHadoopFile, I found that for each loop,
 it would generate a directory with a list of files, namely, it would be
 deemed as several independent outputs, which was not what I would like and
 would cause some problems when I tried to process them in the future.


This is slightly different, but in Spark SQL (coming with the 1.0 release)
there is experimental support for creating a virtual table that is backed
by a parquet file.  You can do many insertions into this table and they
will all be read by a single job pointing to that directory.
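
For the curious, a rough sketch of what that might look like with the experimental API (the names and the path below are assumptions and may shift before the 1.0 release):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// point a table at an existing Parquet directory
val events = sqlContext.parquetFile("hdfs://ns1/warehouse/events.parquet")   // hypothetical path
events.registerAsTable("events")

// SchemaRDDs produced by later jobs could be appended into the same table, e.g.
// someOtherSchemaRdd.insertInto("events")

// and a single query then reads everything under that directory
val total = sqlContext.sql("SELECT COUNT(*) FROM events").collect()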


Re: Spark and HBase

2014-04-26 Thread Nicholas Chammas
Thank you for sharing. Phoenix for realtime queries and Spark for more
complex batch processing seems like a potentially good combo.

I wonder if Spark's future will include support for the same kinds of
workloads that Phoenix is being built for. This little tidbit
(http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html)
about the future of Spark SQL seems to suggest just that (noting for others
reading that Phoenix is basically a SQL skin over HBase):

Look for future blog posts on the following topics:

- ...


- Reading and writing data using other formats and systems, including
Avro and HBase

It would certainly be nice to have one big data framework to rule them all.

Nick



On Sat, Apr 26, 2014 at 10:00 AM, Josh Mahonin jmaho...@filetrek.com wrote:

 We're still in the infancy stages of the architecture for the project I'm
 on, but presently we're investigating the HBase / Phoenix data store for its
 realtime query abilities, and being able to expose data over a JDBC
 connector is attractive for us.

 Much of our data is event based, and many of the reports we'd like to do
 can be accomplished using simple SQL queries on that data - assuming they
 are performant. Thus far, the evidence shows that it is, even across
 many millions of rows.

 However, there are a number of models we have that today exist as a
 combination of PIG and python batch jobs that I'd like to replace with
 Spark, which thus far has shown to be more than adequate for what we're
 doing today.

 As far as using Phoenix as an endpoint for a batch load, the only real
 advantage I see over using straight HBase is that I can specify a query to
 prefilter the data before attaching it to an RDD. I haven't run the numbers
 yet to see how this compares to more traditional methods, though.

 The only worry I have is that the Phoenix input format doesn't adequately
 split the data across multiple nodes, so that's something I will need to
 look at further.

 Josh



 On Apr 25, 2014, at 6:33 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Josh, is there a specific use pattern you think is served well by Phoenix
 + Spark? Just curious.


 On Fri, Apr 25, 2014 at 3:17 PM, Josh Mahonin jmaho...@filetrek.com wrote:

 Phoenix generally presents itself as an endpoint using JDBC, which in my
 testing seems to play nicely using JdbcRDD.

 However, a few days ago a patch was made against Phoenix to implement
 support via PIG using a custom Hadoop InputFormat, which means now it has
 Spark support too.

 Here's a code snippet that sets up an RDD for a specific query:

 --
 val phoenixConf = new PhoenixPigConfiguration(new Configuration())
 phoenixConf.setSelectStatement("SELECT EVENTTYPE,EVENTTIME FROM EVENTS WHERE EVENTTYPE = 'some_type'")
 phoenixConf.setSelectColumns("EVENTTYPE,EVENTTIME")
 phoenixConf.configure("servername", "EVENTS", 100L)

 val phoenixRDD = sc.newAPIHadoopRDD(
   phoenixConf.getConfiguration(),
   classOf[PhoenixInputFormat],
   classOf[NullWritable],
   classOf[PhoenixRecord])
 --

 I'm still very new at Spark and even less experienced with Phoenix, but
 I'm hoping there's an advantage over the JdbcRDD in terms of partitioning.
 The JdbcRDD seems to implement partitioning based on a query predicate that
 is user defined, but I think Phoenix's InputFormat is able to figure out
 the splits which Spark is able to leverage. I don't really know how to
 verify if this is the case or not though, so if anyone else is looking into
 this, I'd love to hear their thoughts.

 Josh


 On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Just took a quick look at the overview here
 (http://phoenix.incubator.apache.org/) and the quick start guide here
 (http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html).

 It looks like Apache Phoenix aims to provide flexible SQL access to
 data, both for transactional and analytic purposes, and at interactive
 speeds.

 Nick


 On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang binwang...@gmail.com wrote:

 First, I have not tried it myself. However, from what I have heard it has
 some basic SQL features, so you can query your HBase table much like you
 query content on HDFS using Hive.
 So it is not just querying a simple column; I believe you can do joins and
 other SQL queries. Maybe you can wrap up an EMR cluster with HBase
 preconfigured and give it a try.

 Sorry I cannot provide a more detailed explanation or help.



 On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier 
 pomperma...@okkam.it wrote:

 Thanks for the quick reply Bin. Phoenix is something I'm going to try
 for sure, but it seems somewhat useless if I can use Spark.
 Probably, as you said, since Phoenix uses a dedicated data structure
 within each HBase table it has more effective memory usage, but if I need to
 deserialize data stored in an HBase cell I still have to read that
 object into memory, and thus I need Spark. From what I understood Phoenix is 

Re: Using Spark in IntelliJ Scala Console

2014-04-26 Thread Michael Armbrust
This is a little bit of a hack, but might work for you.  You'll need to be
on sbt 0.13.2.

connectInput in run := true

outputStrategy in run := Some(StdoutOutput)

console := {
  (runMain in Compile).toTask(" org.apache.spark.repl.Main -usejavacp").value
}


On Sat, Apr 26, 2014 at 1:05 PM, Jonathan Chayat 
jonatha...@supersonicads.com wrote:

 Hi Michael, thanks for your prompt reply.

 It seems like IntelliJ Scala Console actually runs the Scala REPL (they
 print the same stuff when starting up).
 It is probably the SBT console.

 When I tried the same code in the Scala REPL of my project using sbt
 console it didn't work either.
 It only worked in spark project's bin/spark-shell

 Is there a way to customize the SBT console of a project listing spark as
 a dependency?

 Thx,
 Jon








Re: Using Spark in IntelliJ Scala Console

2014-04-26 Thread Michael Armbrust
You'll also need:

libraryDependencies += "org.apache.spark" %% "spark-repl" % "<spark version>"





Re: how to get subArray without copy

2014-04-26 Thread wxhsdp
The way I can find is to use a 2-D Array, if the split has regularity.
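
For what it's worth, a rough sketch of that idea (untested; reusing a from the earlier pseudo code and assuming an RDD[Int] with a fixed chunk size): build each chunk exactly once as its own row of a 2-D layout, so no second slice/copy is needed afterwards.

val chunkSize = 100   // hypothetical fixed split size
val b = a.mapPartitions { itr =>
  itr.grouped(chunkSize).zipWithIndex.map { case (chunk, id) =>
    (id, chunk.toArray)   // each row is materialized only once
  }
}
b.groupByKey()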





Re: Running out of memory Naive Bayes

2014-04-26 Thread DB Tsai
Which version of MLlib are you using? For Spark 1.0, MLlib will
support sparse feature vectors, which will improve performance a lot
when computing the distance between points and centroids.
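
For reference, a sketch of what a sparse example looks like (a minimal illustration assuming the org.apache.spark.mllib.linalg API in 1.0; only the non-zero indices and values are stored):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.NaiveBayes

// ~2 million features, but only the handful of non-zero entries are kept in memory
val numFeatures = 2000000
val example = LabeledPoint(1.0,
  Vectors.sparse(numFeatures, Array(3, 1024, 1999999), Array(1.0, 2.0, 1.0)))

// training then takes an RDD[LabeledPoint], e.g.
// val model = NaiveBayes.train(trainingRdd)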

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai






questions about debugging a spark application

2014-04-26 Thread wxhsdp
Hi, all
  I have some questions about debugging in Spark:
  1) When the application has finished, the application UI is shut down and I
cannot see the details about the app, like
  shuffle size, duration time, stage information... there is not
sufficient information in the master UI.
 Do I need to keep the application hanging on?

  2) How do I get details about each task the executor runs, like memory
usage...?

  3) Since I'm not familiar with the JVM, do I need to run the program step by
step or keep the program hanging on
  to use JVM utilities like jstack, jmap...?
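
(Regarding question 1, one thing that may help, as an untested sketch assuming the
event-logging settings that are arriving with Spark 1.0, is to persist the UI events
so the history server can replay them after the application exits:)

val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs://ns1/spark-events")   // hypothetical log directory
val sc = new SparkContext(conf)
// after the job finishes, the run can be browsed via sbin/start-history-server.sh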





Re: Running out of memory Naive Bayes

2014-04-26 Thread Xiangrui Meng
How many labels does your dataset have? -Xiangrui





Re: Using Spark in IntelliJ Scala Console

2014-04-26 Thread Jonathan Chayat
In IntelliJ, nothing changed. In SBT console I got this error:

$ sbt

> console

[info] Running org.apache.spark.repl.Main -usejavacp

14/04/27 08:29:44 INFO spark.HttpServer: Starting HTTP Server

14/04/27 08:29:44 INFO server.Server: jetty-7.6.8.v20121106

14/04/27 08:29:44 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:50966


Failed to initialize compiler: object scala.annotation.Annotation in
compiler mirror not found.

** Note that as of 2.8 scala does not assume use of the java classpath.

** For the old behavior pass -usejavacp to scala, or if using a Settings

** object programatically, settings.usejavacp.value = true.

14/04/27 08:29:44 WARN repl.SparkILoop$SparkILoopInterpreter: Warning:
compiler accessed before init set up.  Assuming no postInit code.


Failed to initialize compiler: object scala.annotation.Annotation in
compiler mirror not found.

** Note that as of 2.8 scala does not assume use of the java classpath.

** For the old behavior pass -usejavacp to scala, or if using a Settings

** object programatically, settings.usejavacp.value = true.

[error] (run-main-0) java.lang.AssertionError: assertion failed: null

java.lang.AssertionError: assertion failed: null

at scala.Predef$.assert(Predef.scala:179)

at
org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:197)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:924)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:881)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:881)

at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)

at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:881)

at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:973)

at org.apache.spark.repl.Main$.main(Main.scala:31)

at org.apache.spark.repl.Main.main(Main.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:483)

[trace] Stack trace suppressed: run last compile:runMain for the full
output.

java.lang.RuntimeException: Nonzero exit code: 1

at scala.sys.package$.error(package.scala:27)

[trace] Stack trace suppressed: run last compile:runMain for the full
output.

[error] (compile:runMain) Nonzero exit code: 1

[error] Total time: 1 s, completed Apr 27, 2014 8:29:44 AM

