Re: Cassandra examples don't work for me

2014-06-05 Thread Nick Pentreath
You need Cassandra 1.2.6 for the Spark examples — Sent from Mailbox On Thu, Jun 5, 2014 at 12:02 AM, Tim Kellogg t...@2lemetry.com wrote: Hi, I’m following the directions to run the cassandra example “org.apache.spark.examples.CassandraTest” and I get this error Exception in thread main

Re: Logistic Regression MLLib Slow

2014-06-05 Thread DB Tsai
Hi Krishna, Also, the default optimizer with SGD converges really slowly. If you are willing to write Scala code, there is a full working example for training Logistic Regression with L-BFGS (a quasi-Newton method) in Scala. It converges way faster than SGD. See
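For reference, a minimal sketch of training logistic regression against MLlib's low-level L-BFGS optimizer in the Spark 1.0 era (the input path, hyperparameter values, and the choice to skip the intercept term are illustrative assumptions, not taken from the thread):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
    import org.apache.spark.mllib.util.MLUtils

    // Assumes an existing SparkContext `sc` and a LIBSVM-format training file.
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    val training = data.map(p => (p.label, p.features)).cache()
    val numFeatures = training.first()._2.size

    // Run L-BFGS directly against the optimizer API (no intercept term, for brevity).
    val (weights, lossHistory) = LBFGS.runLBFGS(
      training,
      new LogisticGradient(),
      new SquaredL2Updater(),
      10,     // numCorrections
      1e-4,   // convergenceTol
      50,     // maxNumIterations
      0.1,    // regParam
      Vectors.dense(new Array[Double](numFeatures)))  // zero initial weights

The returned lossHistory makes it easy to compare convergence against the SGD-based trainer.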

Re: Logistic Regression MLLib Slow

2014-06-05 Thread DB Tsai
Hi Krishna, It should work, and we use it in production with great success. However, the constructor of LogisticRegressionModel is private[mllib], so you have to write your own code and place it under the org.apache.spark.mllib package instead of using the Scala console. Sincerely, DB Tsai
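A hedged illustration of the packaging trick DB describes (the object and file names below are invented for the example; only the package placement matters):

    // file: src/main/scala/org/apache/spark/mllib/MyLRModelBuilder.scala
    // Being inside org.apache.spark.mllib grants access to the
    // private[mllib] constructor of LogisticRegressionModel.
    package org.apache.spark.mllib

    import org.apache.spark.mllib.classification.LogisticRegressionModel
    import org.apache.spark.mllib.linalg.Vector

    object MyLRModelBuilder {
      def fromWeights(weights: Vector, intercept: Double): LogisticRegressionModel =
        new LogisticRegressionModel(weights, intercept)
    }

Compile this as part of your own jar; as noted above, it will not work from the Scala console.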

Re: Join : Giving incorrect result

2014-06-05 Thread Ajay Srivastava
Sorry for replying late. It was night here. Lian/Matei, Here is the code snippet - sparkConf.set("spark.executor.memory", "10g") sparkConf.set("spark.cores.max", "5") val sc = new SparkContext(sparkConf) val accId2LocRDD =

Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread lmk
Hi Cheng, Sorry again. In this method, I see that the values for a <- positions.iterator and b <- positions.iterator always remain the same. When I tried a <- positions.iterator.next, it throws an error: value filter is not a member of (Double, Double). Is there something I

Re: Re: mismatched hdfs protocol

2014-06-05 Thread bluejoe2008
OK, I see. I imported the wrong jar files, which only work with the default Hadoop version. 2014-06-05 bluejoe2008 From: prabeesh k Date: 2014-06-05 16:14 To: user Subject: Re: Re: mismatched hdfs protocol If you are not setting the Spark hadoop version, Spark built using default hadoop

Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread Christopher Nguyen
Lakshmi, this is orthogonal to your question, but in case it's useful. It sounds like you're trying to determine the home location of a user, or something similar. If that's the problem statement, the data pattern may suggest a far more computationally efficient approach. For example, first map

Re: Can't seem to link external/twitter classes from my own app

2014-06-05 Thread Jeremy Lee
I shan't be far. I'm committed now. Spark and I are going to have a very interesting future together, but hopefully future messages will be about the algorithms and modules, and less "how do I run make?". I suspect doing this at the exact moment of the 0.9 -> 1.0.0 transition hasn't helped me. (I

Re: Unable to run a Standalone job

2014-06-05 Thread prabeesh k
Try the sbt clean command before building the app, or delete the .ivy2 and .sbt folders (not a good method). Then try to rebuild the project. On Thu, Jun 5, 2014 at 11:45 AM, Sean Owen so...@cloudera.com wrote: I think this is SPARK-1949 again: https://github.com/apache/spark/pull/906 I think this

Re: Error related to serialisation in spark streaming

2014-06-05 Thread nilmish
Thanks a lot for your reply. I can see the Kryo serializer in the UI. I have one other query: I wanted to know the meaning of the following log message when running a spark streaming job : [spark-akka.actor.default-dispatcher-18] INFO org.apache.spark.streaming.scheduler.JobScheduler - Total

Native library can not be loaded when using Mllib PCA

2014-06-05 Thread yangliuyu
Hi, We're using MLlib (1.0.0 release version) on a k-means clustering problem. We want to reduce the matrix column size before sending the points to the k-means solver. It works on my Mac in local mode: spark-test-run-assembly-1.0.jar contains my application code, com.github.fommil, netlib code

How to shut down Spark Streaming with Kafka properly?

2014-06-05 Thread Tobias Pfeiffer
Hi, I am trying to use Spark Streaming with Kafka, which works like a charm -- except for shutdown. When I run my program with sbt run-main, sbt will never exit, because there are two non-daemon threads left that don't die. I created a minimal example at

Spark Kafka streaming - ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaReceiver

2014-06-05 Thread Gaurav Dasgupta
Hi, I have written my own custom Spark streaming code which connects to the Kafka server and fetches data. I have tested the code in local mode and it is working fine. But when I execute the same code in YARN mode, I get a KafkaReceiver class-not-found exception. I am providing the Spark

Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread lmk
Hi Cheng, Thanks a lot. That solved my problem. Thanks again for the quick response and solution. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7047.html Sent from the Apache Spark User List mailing

Re: Spark not working with mesos

2014-06-05 Thread praveshjain1991
Hi Ajatix. Yes, HADOOP_HOME is set on the nodes and I did update the bash profile. As I said, adding MESOS_HADOOP_HOME did not work. But what is causing the original error: Java.lang.Error: java.io.IOException: failure to login? -- Thanks -- View this message in context:

Serialization problem in Spark

2014-06-05 Thread Vibhor Banga
Hi, I am trying to do something like following in Spark: JavaPairRDD<byte[], MyObject> eventRDD = hBaseRDD.map(new PairFunction<Tuple2<ImmutableBytesWritable, Result>, byte[], MyObject>() { @Override public Tuple2<byte[], MyObject> call(Tuple2<ImmutableBytesWritable, Result

Problem with serialization and deserialization

2014-06-05 Thread ANEESH .V.V
Hi, I have a JTree. I want to serialize it using sc.saveAsObjectFile(path). I could save it in some location. The real problem is that when I deserialize it back using sc.objectFile(), I am not getting the JTree. Can anyone please help me with this? Thanks

Re: Problem with serialization and deserialization

2014-06-05 Thread Stefan van Wouw
Dear Aneesh, Your particular use case of using Swing GUI components with Spark is a bit unclear to me. Assuming that you want Spark to operate on a tree object, you could use an implementation of the TreeModel ( http://docs.oracle.com/javase/8/docs/api/javax/swing/tree/DefaultTreeModel.html

Re: Better line number hints for logging?

2014-06-05 Thread Daniel Darabos
On Wed, Jun 4, 2014 at 10:39 PM, Matei Zaharia matei.zaha...@gmail.com wrote: That’s a good idea too, maybe we can change CallSiteInfo to do that. I've filed an issue: https://issues.apache.org/jira/browse/SPARK-2035 Matei On Jun 4, 2014, at 8:44 AM, Daniel Darabos

Re: Serialization problem in Spark

2014-06-05 Thread Vibhor Banga
Any inputs on this will be helpful. Thanks, -Vibhor On Thu, Jun 5, 2014 at 3:41 PM, Vibhor Banga vibhorba...@gmail.com wrote: Hi, I am trying to do something like following in Spark: JavaPairRDD<byte[], MyObject> eventRDD = hBaseRDD.map(new PairFunction<Tuple2<ImmutableBytesWritable, Result>,

Re: Spark Streaming not processing file with particular number of entries

2014-06-05 Thread praveshjain1991
The same issue persists in spark-1.0.0 as well (was using 0.9.1 earlier). Any suggestions are welcomed. -- Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-not-processing-file-with-particular-number-of-entries-tp6694p7056.html Sent

spark worker and yarn memory

2014-06-05 Thread Xu (Simon) Chen
I am slightly confused about the --executor-memory setting. My yarn cluster has a maximum container memory of 8192MB. When I specify --executor-memory 8G in my spark-shell, no container can be started at all. It only works when I lower the executor memory to 7G. But then, on yarn, I see 2

Loading Python libraries into Spark

2014-06-05 Thread mrm
Hi, I am new to Spark (and almost-new in python!). How can I download and install a Python library in my cluster so I can just import it later? Any help would be much appreciated. Thanks! -- View this message in context:

compress in-memory cache?

2014-06-05 Thread Xu (Simon) Chen
I have a working set larger than available memory, thus I am hoping to turn on rdd compression so that I can store more in-memory. Strangely it made no difference. The number of cached partitions, fraction cached, and size in memory remain the same. Any ideas? I confirmed that rdd compression

Re: compress in-memory cache?

2014-06-05 Thread Nick Pentreath
Have you set the persistence level of the RDD to MEMORY_ONLY_SER ( http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)? If you're calling cache, the default persistence level is MEMORY_ONLY so that setting will have no impact. On Thu, Jun 5, 2014 at 4:41 PM, Xu (Simon)
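For completeness, a small sketch of the combination being discussed (app name and input path are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // spark.rdd.compress only applies to serialized storage levels, so pair it
    // with MEMORY_ONLY_SER rather than the MEMORY_ONLY used by plain cache().
    val conf = new SparkConf()
      .setAppName("compressed-cache")
      .set("spark.rdd.compress", "true")
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("hdfs:///some/large/input")
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    rdd.count()  // materialize the cache so the storage tab reflects the change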

Scala By the Bay Developer Conference and Training Registration

2014-06-05 Thread Alexy Khrabrov
Scala by the Bay registration and training is now open! We are assembling a great two-day program for Scala By the Bay www.scalabythebay.org -- the yearly SF Scala developer conference. This year the conference itself is on August 8-9 in Fort Mason, near the Golden Gate bridge, with the Scala

Re: Unable to run a Standalone job([NOT FOUND ] org.eclipse.jetty.orbit#javax.mail.glassfish;1.4.1.v201005082020)

2014-06-05 Thread Shrikar archak
Hi Prabeesh/ Sean, I tried both the steps you guys mentioned looks like its not able to resolve it. [warn] [NOT FOUND ] org.eclipse.jetty.orbit#javax.transaction;1.1.1.v201105210645!javax.transaction.orbit (131ms) [warn] public: tried [warn]

Re: reuse hadoop code in Spark

2014-06-05 Thread Wei Tan
Thanks Matei. Using your pointers I can import data from HDFS. What I want to do now is something like this in Spark: --- import myown.mapper rdd.map(mapper.map) --- The reason why I want this: myown.mapper is a java class I already developed. I used

Re: Native library can not be loaded when using Mllib PCA

2014-06-05 Thread Xiangrui Meng
For standalone and yarn mode, you need to install native libraries on all nodes. The best solution is installing them to /usr/lib/libblas.so.3 and /usr/lib/liblapack.so.3 . If your matrix is sparse, the native libraries cannot help because they are for dense linear algebra. You can create RDD
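A hedged sketch of the sparse-input route mentioned above (dimensions, indices, and the number of components are illustrative):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Assumes an existing SparkContext `sc`. Each row is a sparse vector:
    // (dimension, indices of the non-zero entries, their values).
    val rows = sc.parallelize(Seq(
      Vectors.sparse(1000, Array(1, 40, 999), Array(1.0, 2.0, 3.0)),
      Vectors.sparse(1000, Array(3, 500), Array(4.0, 5.0))))

    val mat = new RowMatrix(rows)
    // Project onto the top 20 principal components before running k-means.
    val pc = mat.computePrincipalComponents(20)
    val projected = mat.multiply(pc)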

Re: compress in-memory cache?

2014-06-05 Thread Xu (Simon) Chen
Thanks.. it works now. -Simon On Thu, Jun 5, 2014 at 10:47 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Have you set the persistence level of the RDD to MEMORY_ONLY_SER ( http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)? If you're calling cache, the default

Re: Loading Python libraries into Spark

2014-06-05 Thread mrm
Hi Andrei, Thank you for your help! Just to make sure I understand, when I run this command sc.addPyFile(/path/to/yourmodule.py), I need to be already logged into the master node and have my python files somewhere, is that correct? -- View this message in context:

Re: reuse hadoop code in Spark

2014-06-05 Thread Matei Zaharia
Use RDD.mapPartitions to go over all the items in a partition with one Mapper object. It will look something like this: rdd.mapPartitions(iterator => val mapper = new myown.Mapper() mapper.configure(conf) val output = // {{create an OutputCollector that stores stuff in an ArrayBuffer}}
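Filling in that sketch a bit, a hedged version of what the full call might look like. Here myown.Mapper is assumed to be an old-API Hadoop Mapper[LongWritable, Text, Text, IntWritable] and rdd an RDD[(LongWritable, Text)]; swap in whatever types your mapper really uses:

    import scala.collection.mutable.ArrayBuffer
    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapred.{JobConf, OutputCollector, Reporter}

    val jobConf = new JobConf()  // whatever configuration the existing mapper expects

    val mapped = rdd.mapPartitions { iterator =>
      val mapper = new myown.Mapper()
      mapper.configure(jobConf)
      val buffer = new ArrayBuffer[(Text, IntWritable)]()
      // OutputCollector shim that just accumulates the mapper's output in memory.
      val collector = new OutputCollector[Text, IntWritable] {
        override def collect(k: Text, v: IntWritable): Unit = buffer += ((k, v))
      }
      iterator.foreach { case (k, v) => mapper.map(k, v, collector, Reporter.NULL) }
      buffer.iterator
    }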

Re: Loading Python libraries into Spark

2014-06-05 Thread Andrei
In my answer I assumed you run your program with pyspark command (e.g. pyspark mymainscript.py, pyspark should be on your path). In this case workflow is as follows: 1. You create SparkConf object that simply contains your app's options. 2. You create SparkContext, which initializes your

creating new ami image for spark ec2 commands

2014-06-05 Thread Matt Work Coarr
How would I go about creating a new AMI image that I can use with the spark ec2 commands? I can't seem to find any documentation. I'm looking for a list of steps that I'd need to perform to make an Amazon Linux image ready to be used by the spark ec2 tools. I've been reading through the spark

Examples

2014-06-05 Thread Tim Kellogg
Hi, I’m still having trouble running the CassandraTest example from the Spark-1.0.0 binary package. I’ve made a Stackoverflow question for it so you can get some street cred for helping me :) http://stackoverflow.com/q/24069039/503826 Thanks! Tim Kellogg Sr. Software Engineer, Protocols

Setting executor memory when using spark-shell

2014-06-05 Thread Oleg Proudnikov
Hi All, Please help me set Executor JVM memory size. I am using Spark shell and it appears that the executors are started with a predefined JVM heap of 512m as soon as Spark shell starts. How can I change this setting? I tried setting SPARK_EXECUTOR_MEMORY before launching Spark shell: export

Re: Setting executor memory when using spark-shell

2014-06-05 Thread Andrew Ash
Hi Oleg, I set the size of my executors on a standalone cluster when using the shell like this: ./bin/spark-shell --master $MASTER --total-executor-cores $CORES_ACROSS_CLUSTER --driver-java-options -Dspark.executor.memory=$MEMORY_PER_EXECUTOR It doesn't seem particularly clean, but it works.

Spark Streaming, download a s3 file to run a script shell on it

2014-06-05 Thread Gianluca Privitera
Hi, I've got a weird question but maybe someone has already dealt with it. My Spark Streaming application needs to - download a file from a S3 bucket, - run a script with the file as input, - create a DStream from this script output. I've already got the second part done with the rdd.pipe() API
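One possible shape for the missing download step, sketched with the AWS Java SDK (bucket, key, and local paths are assumptions; credentials are picked up from the environment). The pipe() call is only shown for context, since that part is already working:

    import java.io.File
    import com.amazonaws.services.s3.AmazonS3Client
    import com.amazonaws.services.s3.model.GetObjectRequest

    // Fetch the S3 object to a local file before handing it to the script.
    val s3 = new AmazonS3Client()
    s3.getObject(new GetObjectRequest("my-bucket", "path/to/input.dat"),
      new File("/tmp/input.dat"))

    // Feed the downloaded file through the external script.
    val processed = sc.textFile("/tmp/input.dat").pipe("/path/to/script.sh")

Note that in a cluster the downloaded file has to be visible to the executors that run the pipe, which is part of what makes this awkward inside a streaming job.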

Re: Setting executor memory when using spark-shell

2014-06-05 Thread Oleg Proudnikov
Thank you, Andrew, I am using Spark 0.9.1 and tried your approach like this: bin/spark-shell --driver-java-options -Dspark.executor.memory=$MEMORY_PER_EXECUTOR I get bad option: '--driver-java-options' There must be something different in my setup. Any ideas? Thank you again, Oleg On 5

Re: implicit ALS dataSet

2014-06-05 Thread redocpot
Thank you for your quick reply. As far as I know, the update does not require negative observations, because the update rule X_u = (Y^T C_u Y + λI)^-1 Y^T C_u p(u) can be simplified by taking advantage of its algebraic structure, so negative observations are not needed. This is what I think at the

Re: implicit ALS dataSet

2014-06-05 Thread Sean Owen
On Thu, Jun 5, 2014 at 10:38 PM, redocpot julien19890...@gmail.com wrote: can be simplified by taking advantage of its algebraic structure, so negative observations are not needed. This is what I think at the first time I read the paper. Correct, a big part of the reason that is efficient is
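For readers following the thread, the algebraic shortcut being discussed (from the Hu/Koren/Volinsky implicit-feedback ALS paper) can be written as:

    x_u = (Y^T C_u Y + \lambda I)^{-1} Y^T C_u p(u),
    where  Y^T C_u Y = Y^T Y + Y^T (C_u - I) Y.

Y^T Y is computed once per sweep and shared across all users, and (C_u - I) is zero for every unobserved item, so only the items a user actually interacted with contribute to that user's solve; the negative (unobserved) entries never need to be materialized.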

Re: spark worker and yarn memory

2014-06-05 Thread Sandy Ryza
Hi Xu, As crazy as it might sound, this all makes sense. There are a few different quantities at play here: * the heap size of the executor (controlled by --executor-memory) * the amount of memory spark requests from yarn (the heap size plus 384 mb to account for fixed memory costs outside of
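A hedged back-of-the-envelope illustration of those quantities (the overhead constant and the YARN rounding rule are assumptions about the 1.0-era defaults):

    // Container size = executor heap + off-heap overhead, rounded up to a
    // multiple of the scheduler's minimum allocation.
    val executorMemoryMb = 7 * 1024   // --executor-memory 7g
    val overheadMb = 384              // fixed overhead Spark adds to the request
    val minAllocationMb = 1024        // yarn.scheduler.minimum-allocation-mb (assumed)

    val requested = executorMemoryMb + overheadMb                                   // 7552 MB
    val container = ((requested + minAllocationMb - 1) / minAllocationMb) * minAllocationMb
    // container == 8192 MB, which just fits the 8192 MB maximum;
    // with --executor-memory 8g the request would be 8576 MB and be rejected.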

Re: Join : Giving incorrect result

2014-06-05 Thread Matei Zaharia
Hey Ajay, thanks for reporting this. There was indeed a bug, specifically in the way join tasks spill to disk (which happened when you had more concurrent tasks competing for memory). I’ve posted a patch for it here: https://github.com/apache/spark/pull/986. Feel free to try that if you’d like;

When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL. When these happen things get extremely slow. Does this mean that the executor got terminated and restarted? Is there a way to prevent this from happening
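For context, the fallback from PROCESS_LOCAL to less-local levels is governed by the scheduler's locality-wait timeouts, so one related knob is to make the scheduler wait longer before giving up on the preferred location. A hedged sketch (the values are illustrative, given in milliseconds as the 0.9/1.0-era configs expect):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.locality.wait", "10000")          // default wait for all levels
      .set("spark.locality.wait.process", "10000")  // before falling back from PROCESS_LOCAL
      .set("spark.locality.wait.node", "10000")     // before falling back from NODE_LOCAL
      .set("spark.locality.wait.rack", "10000")     // before falling back from RACK_LOCAL

This does not by itself explain executors being lost, only why tasks get scheduled at a less-local level when a preferred slot is not free in time.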

Re: Setting executor memory when using spark-shell

2014-06-05 Thread Andrew Ash
Oh my apologies that was for 1.0 For Spark 0.9 I did it like this: MASTER=spark://mymaster:7077 SPARK_MEM=8g ./bin/spark-shell -c $CORES_ACROSS_CLUSTER The downside of this though is that SPARK_MEM also sets the driver's JVM to be 8g, rather than just the executors. I think this is the reason

Re: Join : Giving incorrect result

2014-06-05 Thread Andrew Ash
Hi Ajay, Can you please try running the same code with spark.shuffle.spill=false and see if the numbers turn out correctly? That parameter controls whether or not the buggy code that Matei fixed in ExternalAppendOnlyMap is used. FWIW I saw similar issues in 0.9.0 but no longer in 0.9.1 after I

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Sung Hwan Chung
On a related note, I'd also minimize any kind of executor movement. I.e., once an executor is spawned and data cached in the executor, I want that executor to live all the way till the job is finished, or the machine fails in a fatal manner. What would be the best way to ensure that this is the

Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Roger Hoover
Hi Aaron, When you say that sorting is being worked on, can you elaborate a little more please? In particular, I want to sort the items within each partition (not globally) without necessarily bringing them all into memory at once. Thanks, Roger On Sat, May 31, 2014 at 11:10 PM, Aaron

Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Roger Hoover
I think it would be very handy to be able to specify that you want sorting during a partitioning stage. On Thu, Jun 5, 2014 at 4:42 PM, Roger Hoover roger.hoo...@gmail.com wrote: Hi Aaron, When you say that sorting is being worked on, can you elaborate a little more please? In particular, I

Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Andrew Ash
Hi Roger, You should be able to sort within partitions using the rdd.mapPartitions() method, and that shouldn't require holding all data in memory at once. It does require holding the entire partition in memory though. Do you need the partition to never be held in memory all at once? As far as
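A minimal sketch of the per-partition sort Andrew describes (the data is illustrative; note that, per his caveat, each partition is still held in memory while being sorted):

    // Assumes an existing SparkContext `sc`.
    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)

    // Sort each partition independently -- no global ordering across partitions.
    val partitionSorted = rdd.mapPartitions(
      iter => iter.toArray.sorted.iterator,
      preservesPartitioning = true)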

Re: How to shut down Spark Streaming with Kafka properly?

2014-06-05 Thread Tobias Pfeiffer
Sean, your patch fixes the issue, thank you so much! (This is the second time within one week I run into network libraries not shutting down threads properly, I'm really glad your code fixes the issue.) I saw your pull request is closed, but not merged yet. Can I do anything to get your fix into

Re: Setting executor memory when using spark-shell

2014-06-05 Thread hassan
just use -Dspark.executor.memory= -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Setting-executor-memory-when-using-spark-shell-tp7082p7103.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Twitter feed options?

2014-06-05 Thread Jeremy Lee
Me again, Things have been going well, actually. I've got my build chain sorted, 1.0.0 and streaming is working reliably. I managed to turn off the INFO messages by messing with every log4j properties file on the system. :-) One thing I would like to try now is some natural language processing on

RE: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Liu, Raymond
If some task has no locality preference, it will also show up as PROCESS_LOCAL, yet, I think we probably need to name it NO_PREFER to make it more clear. Not sure if this is your case. Best Regards, Raymond Liu From: coded...@gmail.com [mailto:coded...@gmail.com] On Behalf Of Sung Hwan Chung

Re: spark worker and yarn memory

2014-06-05 Thread Xu (Simon) Chen
Nice explanation... Thanks! On Thu, Jun 5, 2014 at 5:50 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Xu, As crazy as it might sound, this all makes sense. There are a few different quantities at play here: * the heap size of the executor (controlled by --executor-memory) * the amount

Re: Twitter feed options?

2014-06-05 Thread Jeremy Lee
Nope, sorry, nevermind! I looked at the source, and it was pretty obvious that it didn't implement that yet, so I've ripped the classes out and am mutating them into new receivers right now... ... starting to get the hang of this. On Fri, Jun 6, 2014 at 1:07 PM, Jeremy Lee

KryoException: Unable to find class

2014-06-05 Thread Justin Yip
Hello, I have been using Externalizer from Chill as a serialization wrapper. It appears to me that Spark has some classloader conflict with Chill. I have the following (simplified) program: import java.io._ import com.twitter.chill.Externalizer class X(val i: Int) {

Spark Streaming NetworkReceiver problems

2014-06-05 Thread zzzzzqf12345
hi, here is the problem description: I wrote a custom NetworkReceiver to receive image data from a camera, and I have confirmed that all the data is received correctly. 1) When data is received, only the NetworkReceiver node runs at full speed, while the other nodes stay idle; my spark cluster has 6 nodes. 2) And every