Re: Running out of memory Naive Bayes

2014-04-28 Thread DB Tsai
Our customer asked us a year ago to implement a Naive Bayes that should at least be able to train on news20, and we implemented it for them in Hadoop, using the distributed cache to store the model. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Running out of memory Naive Bayes

2014-04-28 Thread Matei Zaharia
Not sure if this is always ideal for Naive Bayes, but you could also hash the features into a lower-dimensional space (e.g. reduce it to 50,000 features). For each feature, simply take MurmurHash3(featureID) % 50000, for example. Matei On Apr 27, 2014, at 11:24 PM, DB Tsai dbt...@stanford.edu
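For reference, a minimal Scala sketch of the hashing trick described above, assuming the features arrive as (featureId, value) pairs; the 50,000-bucket size and the collision handling (summing colliding values) are illustrative assumptions, not from the thread:

    import scala.util.hashing.MurmurHash3

    val numBuckets = 50000  // assumed target dimensionality

    // Hash each feature id into a fixed, lower-dimensional index space;
    // features that collide in a bucket are simply summed together.
    def hashFeatures(features: Seq[(String, Double)]): Map[Int, Double] =
      features
        .map { case (featureId, value) =>
          val h = MurmurHash3.stringHash(featureId)
          val bucket = ((h % numBuckets) + numBuckets) % numBuckets  // non-negative index
          (bucket, value)
        }
        .groupBy(_._1)
        .mapValues(_.map(_._2).sum)
        .toMap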

Spark with Parquet

2014-04-28 Thread Sai Prasanna
Hi All, I want to store a CSV text file in Parquet format in HDFS and then do some processing on it in Spark. My search for how to do this has so far been futile; most of what I found covers Parquet with Impala. Any guidance here? Thanks !!
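One hedged sketch of the kind of thing being asked for, assuming Spark SQL (introduced in Spark 1.0) with its saveAsParquetFile/parquetFile methods; the CSV layout, paths, and the Person case class are made up for illustration:

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

    // Parse the CSV into a typed RDD, then write it out as Parquet.
    val people = sc.textFile("hdfs:///data/people.csv")
      .map(_.split(","))
      .map(fields => Person(fields(0), fields(1).trim.toInt))
    people.saveAsParquetFile("hdfs:///data/people.parquet")

    // Read the Parquet data back for further processing in Spark.
    val parquetPeople = sqlContext.parquetFile("hdfs:///data/people.parquet")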

Shuffle Spill Issue

2014-04-28 Thread Liu, Raymond
Hi, I am running a simple word count program on a Spark standalone cluster. The cluster is made up of 6 nodes; each runs 4 workers, and each worker owns 10G of memory and 16 cores, for a total of 96 cores and 240G of memory. (It was also previously configured as 1 worker with 40G of memory on each node.)
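For context, the word count being described is presumably of the standard form below (paths are placeholders); the shuffle discussed in this thread is the one introduced by reduceByKey:

    // Standard Spark word count; reduceByKey is what produces the shuffle.
    val counts = sc.textFile("hdfs:///data/random-text")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///data/wordcount-out")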

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Sean Owen
On Mon, Apr 28, 2014 at 8:22 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote: e.g. something like rdd.mapPartition((rows : Iterator[String]) => { var idx = 0 rows.map((row: String) => { val valueMap = SparkWorker.getMemoryContent(valMap) val prevVal = valueMap(idx) idx +=

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Sung Hwan Chung
Actually, I do not know how to do something like this or whether this is possible - thus my suggestive statement. Can you already declare persistent memory objects per worker? I tried something like constructing a singleton object within map functions, but that didn't work as it seemed to

Spark 1.0 run job fail

2014-04-28 Thread Shihaoliang (Shihaoliang)
Hi all, I got the Spark 1.0 snapshot code from git and compiled it using the command: mvn -Pbigtop-dist -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests package -e. On the cluster, I added [export SPARK_YARN_MODE=true] to spark-env.sh and ran the HdfsTest example; I got an error. Has anyone

Re: what does broadcast_0 stand for

2014-04-28 Thread wxhsdp
Thank you for your help, Sourav. I found the broadcast_0 binary file in the /tmp directory. Its size is 33.4 kB, which is not equal to the estimated size of 135.6 KB. I opened it and found that its content has no relation to the file I read in. I guess broadcast_0 is a Spark config file; is that right?

NoSuchMethodError from Spark Java

2014-04-28 Thread Jared Rodriguez
I am seeing the following exception from a very basic test project when it runs on spark local. java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2; The project is built with Java 1.6, Scala 2.10.3 and spark 0.9.1

Re: what does broadcast_0 stand for

2014-04-28 Thread Sourav Chandra
Apart from user-defined broadcast variables, there are others that are created by Spark itself; this could be one of those. As I mentioned, you can write a small program where you create a broadcast variable, check the broadcast variable id (say it is x), and then go to /tmp and open broadcast_x

Cannot compile SIMR with Spark 0.9.1

2014-04-28 Thread lukas nalezenec
Hi, I am trying to recompile SIMR with Spark 0.9.1 but it fails on an incompatible method: [error] /home/lukas/src/simr/src/main/scala/org/apache/spark/simr/RelayServer.scala:213: not enough arguments for method createActorSystem: (name: String, host: String, port: Int, indestructible: Boolean, conf:

Java Spark Streaming - SparkFlumeEvent

2014-04-28 Thread Kulkarni, Vikram
Hi Spark-users, Within my Spark Streaming program, I am able to ingest data sent by my Flume Avro Client. I configured a 'spooling directory source' to write data to a Flume Avro Sink (the Spark Streaming Driver program in this case). The default deserializer i.e. LINE is used to parse the

Re: Strange lookup behavior. Possible bug?

2014-04-28 Thread Daniel Darabos
That is quite mysterious, and I do not think we have enough information to answer. JavaPairRDD<String, Tuple2>.lookup() works fine on a remote Spark cluster: $ MASTER=spark://localhost:7077 bin/spark-shell scala> val rdd = org.apache.spark.api.java.JavaPairRDD.fromRDD(sc.makeRDD(0 until 10, 3).map(x

Re: questions about debugging a spark application

2014-04-28 Thread Daniel Darabos
Good question! I am also new to the JVM and would appreciate some tips. On Sun, Apr 27, 2014 at 5:19 AM, wxhsdp wxh...@gmail.com wrote: Hi all, I have some questions about debugging in Spark: 1) when the application finishes, the application UI is shut down and I cannot see the details about the app,

Re: Strange lookup behavior. Possible bug?

2014-04-28 Thread Yadid Ayzenberg
Thanks for your answer. I tried running on a single machine - master and worker on one host. I get exactly the same results. Very little CPU activity on the machine in question. The web UI shows a single task and its state is RUNNING; it will remain so indefinitely. I have a single partition,

Running parallel jobs in the same driver with Futures?

2014-04-28 Thread Ian Ferreira
I recall asking about this, and I think Matei suggested it was, but is the scheduler thread-safe? I am running MLlib libraries as futures in the same driver, using the same dataset as input, and I get this error: 14/04/28 08:29:48 ERROR TaskSchedulerImpl: Exception in statusUpdate

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Ian O'Connell
A mutable map in an object should do what you're looking for then, I believe. You just reference the object as an object in your closure, so it won't be swept up when your closure is serialized, and you can then reference the object's variables on the remote host. e.g.: object MyObject { val mmap =
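A minimal sketch of that suggestion, with made-up names; note that, as Tom reports later in this thread, doing this from the REPL can hit a "task not serializable" error, whereas in a compiled application the object is only referenced by name and is initialized once per worker JVM:

    import scala.collection.mutable

    object WorkerState {
      val mmap = mutable.Map[Int, Double]()  // lives once per worker JVM
    }

    val rdd = sc.parallelize(1 to 100, 4)
    val updated = rdd.mapPartitions { iter =>
      iter.map { i =>
        // Only the object's name is captured by the closure; the map itself
        // stays on the remote JVM and persists across tasks run there.
        val prev = WorkerState.mmap.getOrElse(i, 0.0)
        WorkerState.mmap(i) = prev + i
        (i, WorkerState.mmap(i))
      }
    }
    updated.count()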

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
As to your last line: I've used RDD zipping to avoid GC since MyBaseData is large and doesn't change. I think this is a very good solution to what is being asked for. On Mon, Apr 28, 2014 at 10:44 AM, Ian O'Connell i...@ianoconnell.com wrote: A mutable map in an object should do what your

Re: Strange lookup behavior. Possible bug?

2014-04-28 Thread Yadid Ayzenberg
Could this be related to the size of the lookup result? I tried to recreate a similar scenario on the spark shell which causes an exception: scala> val rdd = org.apache.spark.api.java.JavaPairRDD.fromRDD(sc.makeRDD(0 until 4, 3).map(x => ( ( 0,52fb9b1a3004f07d1a87c8f3 ),

RE: K-means with large K

2014-04-28 Thread Buttler, David
One thing I have used this for was to create codebooks for SIFT features in images. It is a common, though fairly naïve, method for converting high-dimensional features into simple word-like features. Thus, if you have 200 SIFT features for an image, you can reduce that to 200 ‘words’ that

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
I'm not sure what I said came through. RDD zip is not hacky at all, as it only depends on a user not changing the partitioning. Basically, you would keep your losses as an RDD[Double], zip those with the RDD of examples, and update the losses. You're doing a copy (and GC) on the RDD of
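A minimal sketch of the zipping approach under discussion, with a placeholder loss update; zip is safe here because the losses RDD is derived from the examples RDD by a map, so both share the same partitioning and per-partition ordering:

    case class Example(features: Array[Double], label: Double)

    // Large, immutable training data: cached once.
    val examples = sc.parallelize(Seq.fill(1000)(Example(Array(1.0, 2.0), 1.0))).cache()

    // Small, per-iteration state kept as a separate RDD[Double].
    var losses = examples.map(_ => 0.0)

    def updateLoss(ex: Example, prev: Double): Double = 0.9 * prev + ex.label  // placeholder

    for (iter <- 1 to 10) {
      losses = examples.zip(losses).map { case (ex, prev) => updateLoss(ex, prev) }
    }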

Re: K-means with large K

2014-04-28 Thread Matei Zaharia
Try turning on the Kryo serializer as described at http://spark.apache.org/docs/latest/tuning.html. Also, are there any exceptions in the driver program’s log before this happens? Matei On Apr 28, 2014, at 9:19 AM, Buttler, David buttl...@llnl.gov wrote: Hi, I am trying to run the K-means
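For reference, a sketch of turning Kryo on as the tuning guide describes; the registrator class and the registered type are placeholders for your own classes:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.serializer.KryoRegistrator
    import com.esotericsoftware.kryo.Kryo

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[Array[Double]])  // register your own frequently shuffled classes here
      }
    }

    val conf = new SparkConf()
      .setAppName("kmeans-large-k")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
    val sc = new SparkContext(conf)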

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
Right: they are zipped at each iteration. On Mon, Apr 28, 2014 at 11:56 AM, Chester Chen chesterxgc...@yahoo.com wrote: Tom, Are you suggesting two RDDs, one with the loss and another for the rest of the info, using zip to tie them together, but doing the update on the loss RDD (copy)? Chester Sent from

running SparkALS

2014-04-28 Thread Diana Carroll
Hi everyone. I'm trying to run some of the Spark example code, and most of it appears to be undocumented (unless I'm missing something). Can someone help me out? I'm particularly interested in running SparkALS, which wants the parameters: M U F iter slices. What are these variables? They appear to

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
Ian, I tried playing with your suggestion, but I get a task not serializable error (and some obvious things didn't fix it). Can you get that working? On Mon, Apr 28, 2014 at 10:58 AM, Tom Vacek minnesota...@gmail.com wrote: As to your last line: I've used RDD zipping to avoid GC since

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Sung Hwan Chung
That might be a good alternative to what we are looking for, but I wonder if it would be as efficient as we want. For instance, will RDDs of the same size usually get partitioned to the same machines, thus not triggering any cross-machine aligning, etc.? We'll explore it, but I would still

Re: running SparkALS

2014-04-28 Thread Xiangrui Meng
Hi Diana, SparkALS is an example implementation of ALS. It doesn't call the ALS algorithm implemented in MLlib. M, U, and F are used to generate synthetic data. I'm updating the examples. In the meantime, you can take a look at the updated MLlib guide:

Re: running SparkALS

2014-04-28 Thread Sean Owen
Yeah you'd have to look at the source code in this case; it's not explained in the scaladoc or usage message as far as I can see either. The args refer specifically to the example of recommending Movies to Users. This example makes up a bunch of ratings and then makes recommendations using ALS. M

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
If you create your auxiliary RDD as a map from the examples, the partitioning will be inherited. On Mon, Apr 28, 2014 at 12:38 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: That might be a good alternative to what we are looking for. But I wonder if this would be as efficient as we want

Re: running SparkALS

2014-04-28 Thread Li Pu
http://spark.apache.org/docs/0.9.0/mllib-guide.html#collaborative-filtering-1 One thing which is undocumented: the integers representing users and items have to be positive; otherwise it throws exceptions. Li On Apr 28, 2014, at 10:30, Diana Carroll dcarr...@cloudera.com wrote: Hi everyone.
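For anyone following the MLlib route instead of the standalone example, a minimal sketch along the lines of that guide (the file path, rank, iteration count, and lambda are illustrative; per Li's note, user and product ids must be positive integers):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Each input line is assumed to look like: userId,productId,rating
    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    val model = ALS.train(ratings, /* rank = */ 10, /* iterations = */ 20, /* lambda = */ 0.01)
    val predicted = model.predict(42, 17)  // predicted rating of product 17 by user 42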

Re: running SparkALS

2014-04-28 Thread Diana Carroll
Thanks, Deb. But I'm looking at org.apache.spark.examples.SparkALS, which is not in the mllib examples, and does not take any file parameters. I don't see the class you refer to in the examples... However, if I did want to run that example, where would I find the file in question? It would be

Read from list of files in parallel

2014-04-28 Thread Pat Ferrel
Warning noob question: The sc.textFile(URI) method seems to support reading from files in parallel but you have to supply some wildcard URI, which greatly limits how the storage is structured. Is there a simple way to pass in a URI list or is it an exercise left for the student?

Re: K-means with large K

2014-04-28 Thread Dean Wampler
You might also investigate other clustering algorithms, such as canopy clustering and nearest neighbors. Some of them are less accurate, but more computationally efficient. Often they are used to compute approximate clusters followed by k-means (or a variant thereof) for greater accuracy. dean

RE: help

2014-04-28 Thread Laird, Benjamin
Joe, Do you have your SPARK_HOME variable set correctly in the spark-env.sh script? I was getting that error when I was first setting up my cluster; it turned out I had to make some changes in the spark-env script to get things working correctly. Ben -Original Message- From: Joe L

Re: running SparkALS

2014-04-28 Thread Debasish Das
Don't use SparkALS...that's the first version of the code and does not scale... Li is right...you have to do the dictionary generation on users and products and then generate an indexed file... I wrote some utilities, but it looks like this is application-dependent... the indexed Netflix format is more

Re: running SparkALS

2014-04-28 Thread Diana Carroll
Should I file a JIRA to remove the example? I think it is confusing to include example code without an explanation of how to run it, and it sounds like this one isn't worth running or reviewing anyway. On Mon, Apr 28, 2014 at 2:34 PM, Debasish Das debasish.da...@gmail.com wrote: Don't use

Re: pySpark memory usage

2014-04-28 Thread Patrick Wendell
Hey Jim, This IOException thing is a general issue that we need to fix, and your observation is spot-on. There is actually a JIRA for it that I created a few days ago: https://issues.apache.org/jira/browse/SPARK-1579 Aaron is assigned to that one but not actively working on it, so we'd welcome a

Spark 0.9.1 -- assembly fails?

2014-04-28 Thread kamatsuoka
After upgrading to Spark 0.9.1, sbt assembly is failing. I'm trying to fix it with a merge strategy, etc., but is anyone else seeing this?

Re: Spark 0.9.1 -- assembly fails?

2014-04-28 Thread kamatsuoka
So, good news and bad news. I have a customized Build.scala http://apache-spark-user-list.1001560.n3.nabble.com/libraryDependencies-configuration-is-different-for-sbt-assembly-vs-sbt-run-tt565.html#a1542 that allows me to use the 'run' and 'assembly' commands in sbt without toggling the

Re: Java Spark Streaming - SparkFlumeEvent

2014-04-28 Thread Tathagata Das
You can get the internal AvroFlumeEvent inside the SparkFlumeEvent using SparkFlumeEvent.event. That should probably give you all the original text data. On Mon, Apr 28, 2014 at 5:46 AM, Kulkarni, Vikram vikram.kulka...@hp.com wrote: Hi Spark-users, Within my Spark Streaming program, I am
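A minimal sketch of that unwrapping in Scala (the thread is using Java, but the same fields apply); flumeStream is assumed to come from FlumeUtils.createStream, and the event body is assumed to be plain text:

    // Each SparkFlumeEvent wraps an AvroFlumeEvent; its body is a ByteBuffer.
    val lines = flumeStream.map { sparkEvent =>
      val avroEvent = sparkEvent.event
      new String(avroEvent.getBody.array(), "UTF-8")
    }
    lines.print()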

Re: Spark 0.9.1 -- assembly fails?

2014-04-28 Thread kamatsuoka
Um. When I updated the spark dependency, I unintentionally deleted the 'provided' attribute. Oops. Nothing to see here . . .

Re: Running a spark-submit compatible app in spark-shell

2014-04-28 Thread Roger Hoover
Matei, thank you. That seemed to work, but I'm not able to import a class from my jar. Using the verbose options, I can see that my jar should be included: Parsed arguments: ... jars /Users/rhoover/Work/spark-etl/target/scala-2.10/spark-etl_2.10-1.0.jar And I see the class I want to load in

Re: Running a spark-submit compatible app in spark-shell

2014-04-28 Thread Roger Hoover
A couple of issues: 1) the jar doesn't show up on the classpath even though SparkSubmit had it in the --jars option. I tested this by running :cp in spark-shell. 2) After adding it to the classpath using :cp /Users/rhoover/Work/spark-etl/target/scala-2.10/spark-etl_2.10-1.0.jar, it still fails.

Re: launching concurrent jobs programmatically

2014-04-28 Thread Andrew Ash
For the second question, you can submit multiple jobs through the same SparkContext via different threads and this is a supported way of interacting with Spark. From the documentation: Second, *within* each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they
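A minimal sketch of that pattern using Scala futures; the RDDs and actions are placeholders, and both actions go through the same SparkContext from different threads:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val rddA = sc.parallelize(1 to 1000000)
    val rddB = sc.parallelize(1 to 1000000)

    // Each Future submits its action from its own thread; the scheduler can
    // run both jobs concurrently if the cluster has free cores.
    val countA = Future { rddA.map(_ * 2).count() }
    val countB = Future { rddB.filter(_ % 3 == 0).count() }

    val a = Await.result(countA, 10.minutes)
    val b = Await.result(countB, 10.minutes)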

Re: how to declare tuple return type

2014-04-28 Thread wxhsdp
you need to import org.apache.spark.rdd.RDD to include RDD. http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD Here are some examples you can learn from: https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib SK wrote: I am a new user of

Re: questions about debugging a spark application

2014-04-28 Thread wxhsdp
Thanks for your reply, Daniel. What do you mean by "the logs contain everything to reconstruct the same data"? I have also spent time looking into the logs, but got only a little out of them. As far as I can see, they log the flow of running the application, but there are no more details about each task; for example, see the

Re: processing s3n:// files in parallel

2014-04-28 Thread Nicholas Chammas
It would be useful to have some way to open multiple files at once into a single RDD (e.g. sc.textFile(iterable_over_uris)). Logically, it would be equivalent to opening a single file which is made by concatenating the various files together. This would only be useful, of course, if the source

Re: processing s3n:// files in parallel

2014-04-28 Thread Matei Zaharia
Actually wildcards work too, e.g. s3n://bucket/file1*, and I believe so do comma-separated lists (e.g. s3n://file1,s3n://file2). These are all inherited from FileInputFormat in Hadoop. Matei On Apr 28, 2014, at 6:05 PM, Andrew Ash and...@andrewash.com wrote: This is already possible with the
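Putting the earlier question and this answer together, a minimal sketch (bucket and file names are placeholders):

    // Wildcards and comma-separated lists are both expanded by Hadoop's
    // FileInputFormat underneath sc.textFile.
    val byWildcard = sc.textFile("s3n://bucket/logs/2014-04-*")
    val byList     = sc.textFile("s3n://bucket/file1,s3n://bucket/file2")

    // An arbitrary list of URIs can be joined into the comma-separated form.
    val uris = Seq("s3n://bucket/a.txt", "s3n://bucket/b.txt", "s3n://bucket/c.txt")
    val combined = sc.textFile(uris.mkString(","))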

How to declare Tuple return type for a function

2014-04-28 Thread SK
Hi, I am a new user of Spark. I have a class that defines a function as follows; it returns a tuple: (Int, Int, Int). class Sim extends VectorSim { override def input(master:String): (Int,Int,Int) = { sc = new SparkContext(master, "Test") val ratings =
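Since the original snippet is cut off, here is a small self-contained sketch of the pattern being asked about, i.e. a method whose declared return type is the tuple (Int, Int, Int); the class name and body are illustrative, not SK's actual code:

    import org.apache.spark.SparkContext

    class Sim {
      def input(master: String): (Int, Int, Int) = {
        val sc = new SparkContext(master, "Test")
        val nums = sc.parallelize(1 to 100)
        val result = (nums.count().toInt,
                      nums.filter(_ % 2 == 0).count().toInt,
                      nums.reduce((a, b) => math.max(a, b)))
        sc.stop()
        result  // returned as a Tuple3[Int, Int, Int]
      }
    }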

question on setup() and cleanup() methods for map() and reduce()

2014-04-28 Thread Parsian, Mahmoud
In classic MapReduce/Hadoop, you may optionally define setup() and cleanup() methods. They (setup() and cleanup()) are called for each task, so if you have 20 mappers running, setup/cleanup will be called for each one. What is the equivalent of these in Spark? Thanks, best regards,

Re: Running a spark-submit compatible app in spark-shell

2014-04-28 Thread Patrick Wendell
What about if you run ./bin/spark-shell --driver-class-path=/path/to/your/jar.jar? I think either this or the --jars flag should work, but it's possible there is a bug with the --jars flag when calling the REPL. On Mon, Apr 28, 2014 at 4:30 PM, Roger Hoover roger.hoo...@gmail.com wrote: A

Re: question on setup() and cleanup() methods for map() and reduce()

2014-04-28 Thread Ameet Kini
I don't think there is a setup() or cleanup() in Spark but you can usually achieve the same using mapPartitions and having the setup code at the top of the mapPartitions and cleanup at the end. The reason why this usually works is that in Hadoop map/reduce, each map task runs over an input split.
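A minimal sketch of that pattern; the Connection class stands in for whatever per-task resource the setup()/cleanup() would have managed, and the per-record work is a placeholder:

    // Hypothetical resource, standing in for a DB connection, socket, etc.
    class Connection { def close(): Unit = () }

    val input = sc.textFile("hdfs:///data/input")
    val output = input.mapPartitions { records =>
      val connection = new Connection()                 // "setup": once per partition
      val results = records.map(r => r.toUpperCase).toList  // per-record work (would use the connection)
      connection.close()                                // "cleanup": once per partition
      results.iterator
    }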

Re: File list read into single RDD

2014-04-28 Thread Nicholas Chammas
Not that I know of. We were discussing it on another thread and it came up. I think if you look up the Hadoop FileInputFormat API (which Spark uses) you'll see it mentioned there in the docs. http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html But that's not

Why does Spark require this object to be serializable?

2014-04-28 Thread Earthson
The problem is that this object can't be Serializable: it holds an RDD field and a SparkContext. But Spark shows an error that it needs serialization. The order of my debug output is really strange. ~ Training Start! Round 0 Hehe? Hehe? started? failed? Round 1 Hehe? ~ here is my code 69

Re: question on setup() and cleanup() methods for map() and reduce()

2014-04-28 Thread Parsian, Mahmoud
Thank you very much Ameet! Can you please point me to an example? Best, Mahmoud Sent from my iPhone On Apr 28, 2014, at 6:32 PM, Ameet Kini ameetk...@gmail.com wrote: I don't think there is a setup() or cleanup() in Spark but you can usually achieve the same using

Re: launching concurrent jobs programmatically

2014-04-28 Thread Patrick Wendell
In general, as Andrew points out, it's possible to submit jobs from multiple threads, and many Spark applications do this. One thing to check out is the job server from Ooyala; this is an application on top of Spark that has an automated submission API: https://github.com/ooyala/spark-jobserver

Re: Running a spark-submit compatible app in spark-shell

2014-04-28 Thread Roger Hoover
Patrick, thank you for replying. That didn't seem to work either. I see the option parsed when using verbose mode: Parsed arguments: ... driverExtraClassPath /Users/rhoover/Work/spark-etl/target/scala-2.10/spark-etl_2.10-1.0.jar But the jar still doesn't show up if I run :cp in the REPL, and

Re: Why does Spark require this object to be serializable?

2014-04-28 Thread Earthson
I've moved the SparkContext and RDD to be parameters of train(), and now it tells me that the SparkContext needs to be serialized! I think the problem is that the RDD is trying to make itself lazy, and some broadcast objects need to be generated dynamically, so the closure has the SparkContext inside it, so the task complete
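One common way out of this, sketched below with an illustrative Trainer class: keep the SparkContext and RDD in @transient fields so they are never pulled into a serialized closure, and copy any plain values a closure needs into local vals so the closure captures those values instead of `this`:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    class Trainer(@transient val sc: SparkContext,
                  @transient val data: RDD[Double]) extends Serializable {
      var weight = 0.0

      def train(rounds: Int): Unit = {
        for (round <- 0 until rounds) {
          val w = weight                       // local copy: the closure captures w, not `this`
          weight = data.map(x => x * w).sum()  // no SparkContext or RDD inside the closure
        }
      }
    }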

RE: Shuffle Spill Issue

2014-04-28 Thread Liu, Raymond
Hi Patrick, I am just doing a simple word count; the data is generated by the Hadoop random text writer. This does not seem to me to be related to compression. If I turn off compression on shuffle, the metrics are something like below for the smaller 240MB dataset. Executor ID Address

Re: Why does Spark require this object to be serializable?

2014-04-28 Thread DB Tsai
Your code is unformatted. Can you paste the whole file in a gist so I can take a look for you? On Apr 28, 2014 10:42 PM, Earthson earthson...@gmail.com wrote: I've moved the SparkContext and RDD to be parameters of train(), and now it tells me that the SparkContext needs to be serialized! I think the problem is