Re: Spark performance for small queries

2015-01-22 Thread Saumitra Shahapure (Vizury)
Hello, We were comparing performance of some of our production hive queries between Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both Spark 0.9 and 1.1. We could see that the performance gains have been good in Spark. We tried a very simple query, select count(*) from T where

Re: what is the roadmap for Spark SQL dialect in the coming releases?

2015-01-22 Thread Niranda Perera
Hi, would like to know if there is an update on this? rgds On Mon, Jan 12, 2015 at 10:44 AM, Niranda Perera niranda.per...@gmail.com wrote: Hi, I found out that SparkSQL supports only a relatively small subset of SQL dialect currently. I would like to know the roadmap for the coming

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-22 Thread TJ Klein
Seems like it is a bug rather than a feature. I filed a bug report: https://issues.apache.org/jira/browse/SPARK-5363 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-slow-working-Spark-1-2-fast-freezing-tp21278p21317.html Sent from the Apache Spark

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
and post the code (if possible). In a nutshell, your processing time > batch interval, resulting in an ever-increasing delay that will end up in a crash. 3 secs to process 14 messages looks like a lot. Curious what the job logic is. -kr, Gerard. On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hi Gerard, Thanks for the response. The messages get deserialised from msgpack format, and one of the strings is deserialised to JSON. Certain fields are checked to decide if further processing is required. If so, it goes through a series of in-mem filters to check if more processing is

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Tathagata Das
This is not normal. It's a huge scheduling delay!! Can you tell me more about the application? - cluster setup, number of receivers, what's the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump? Would really appreciate input

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Pankaj
http://spark.apache.org/docs/latest/ Follow this. It's easy to get started. Use the prebuilt version of Spark as of now :D On Thu, Jan 22, 2015 at 5:06 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Apache-Spark team , What are the system requirements for installing Hadoop and Apache

Re: KNN for large data set

2015-01-22 Thread DEVAN M.S.
Thanks Xiangrui Meng, will try this. And, found this https://github.com/kaushikranjan/knnJoin also. Will this work with double data? Can we find out the z value of *Vector(10.3,4.5,3,5)*? On Thu, Jan 22, 2015 at 12:25 AM, Xiangrui Meng men...@gmail.com wrote: For large datasets, you need

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some docs that would tell me if these are expected or not. From: as...@live.com To: user@spark.apache.org Subject: Are these numbers abnormal for spark streaming? Date: Wed, 21 Jan 2015

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hi TD, Here's some information: 1. The cluster has one standalone master, 4 workers. Workers are co-hosted with Apache Cassandra. The master is set up with external Zookeeper. 2. Each machine has 2 cores and 4 GB of RAM. This is for testing. All machines are VMware VMs. Spark has 2 GB dedicated to it on

Fwd: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Sudipta Banerjee
Hi Apache-Spark team, What are the system requirements for installing Hadoop and Apache Spark? I have attached the screenshot of GParted. Thanks and regards, Sudipta -- Sudipta Banerjee Consultant, Business Analytics and Cloud Based Architecture Call me +919019578099

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
Hi, Let me reword your request so you understand how (too) generic your question is: Hi, I have $10,000, please find me some means of transportation so I can get to work. Please provide (a lot) more details. If you can't, consider using one of the pre-built express VMs from either

Re: Trying to run SparkSQL over Spark Streaming

2015-01-22 Thread nirandap
Hi, I'm also trying to use the insertInto method, but end up getting the assertion error Is there any workaround to this?? rgds -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Trying-to-run-SparkSQL-over-Spark-Streaming-tp12530p21316.html Sent from the

Re: Is Apache Spark less accurate than Scikit Learn?

2015-01-22 Thread Robin East
Hi There are many different variants of gradient descent mostly dealing with what the step size is and how it might be adjusted as the algorithm proceeds. Also if it uses a stochastic variant (as opposed to batch descent) then there are variations there too. I don’t know off-hand what MLlib’s

spark-shell has syntax error on windows.

2015-01-22 Thread Vladimir Protsenko
I have a problem with running the Spark shell on Windows 7. I took the following steps: 1. downloaded and installed Scala 2.11.5 2. downloaded Spark 1.2.0 by git clone git://github.com/apache/spark.git 3. ran dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean package (in git bash)

unable to write SequenceFile using saveAsNewAPIHadoopFile

2015-01-22 Thread Skanda
Hi All, I'm using the saveAsNewAPIHadoopFile API to write SequenceFiles but I'm getting the following runtime exception: *Exception in thread main org.apache.spark.SparkException: Task not serializable* at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)

GraphX: ShortestPaths does not terminate on a grid graph

2015-01-22 Thread NicolasC
Hello, I try to execute a simple program that runs the ShortestPaths algorithm (org.apache.spark.graphx.lib.ShortestPaths) on a small grid graph. I use Spark 1.2.0 downloaded from spark.apache.org. The program's code is the following: object GraphXGridSP { def main(args : Array[String])

Re: unable to write SequenceFile using saveAsNewAPIHadoopFile

2015-01-22 Thread Sean Owen
First as an aside I am pretty sure you cannot reuse one Text and IntWritable object here. Spark does not necessarily finish with one's value before the next call(). Although it should not be directly related to the serialization problem I suspect it is. Your function is not serializable since it
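
A minimal sketch of the fix described above, assuming the RDD holds (String, Int) pairs (the data and output path here are hypothetical): build fresh Writable objects inside the map instead of reusing two shared instances, and keep the function free of non-serializable outer references.

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

    val counts = sc.parallelize(Seq(("spark", 3), ("hadoop", 2)))   // hypothetical data
    // New Text/IntWritable per record; nothing mutable is shared across calls.
    val writables = counts.map { case (word, n) => (new Text(word), new IntWritable(n)) }
    writables.saveAsNewAPIHadoopFile(
      "hdfs:///tmp/word-counts",                                    // hypothetical path
      classOf[Text],
      classOf[IntWritable],
      classOf[SequenceFileOutputFormat[Text, IntWritable]])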

Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Petar Zecevic
Ok, thanks for the clarifications. I didn't know this list has to remain as the only official list. Nabble is really not the best solution in the world, but we're stuck with it, I guess. That's it from me on this subject. Petar On 22.1.2015. 3:55, Nicholas Chammas wrote: I think a few

Re: unable to write SequenceFile using saveAsNewAPIHadoopFile

2015-01-22 Thread Skanda
Yeah, it worked like a charm!! Thank you! On Thu, Jan 22, 2015 at 2:28 PM, Sean Owen so...@cloudera.com wrote: First as an aside I am pretty sure you cannot reuse one Text and IntWritable object here. Spark does not necessarily finish with one's value before the next call(). Although it should

Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Petar Zecevic
But voting is done on dev list, right? That could stay there... Overlay might be a fine solution, too, but that still gives two user lists (SO and Nabble+overlay). On 22.1.2015. 10:42, Sean Owen wrote: Yes, there is some project business like votes of record on releases that needs to be

Re: Spark on YARN: java.lang.ClassCastException SerializedLambda to org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1

2015-01-22 Thread thanhtien522
Update: I deployed a standalone Spark on localhost, then set the master as spark://localhost:7077, and it hit the same issue. Don't know how to solve it. -- View this message in context:

Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Sean Owen
Yes, there is some project business, like votes of record on releases, that needs to be carried on in a standard, simply accessible place, and SO is not at all suitable. Nobody is stuck with Nabble. The suggestion is to enable a different overlay on the existing list. SO remains a place you can ask

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Sudipta Banerjee
Hi Ashic Mahtab, The Cassandra and the Zookeeper - are they installed as part of the YARN architecture, or are they installed in a separate layer with Apache Spark? Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab as...@live.com wrote: Hi Guys, So I changed the interval

Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Gerard Maas
I have been contributing to SO for a while now. Here are a few observations I'd like to contribute to the discussion: The level of questions on SO is often more entry-level. Harder questions (that require expertise in a certain area) remain unanswered for a while. Same questions here on the

Spark performance for small queries

2015-01-22 Thread Saumitra Shahapure (Vizury)
Hello, We were comparing performance of some of our production hive queries between Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both Spark 0.9 and 1.1. We could see that the performance gains have been good in Spark. We tried a very simple query, select count(*) from T where

Re: HDFS Namenode in safemode when I turn off my EC2 instance

2015-01-22 Thread Sean Owen
If you are using CDH, you would be shutting down services with Cloudera Manager. I believe you can do it manually using Linux 'services' if you do the steps correctly across your whole cluster. I'm not sure if the stock stop-all.sh script is supposed to work. Certainly, if you are using CM, by far

Re: [mllib] Decision Tree - prediction probabilites of label classes

2015-01-22 Thread Sean Owen
You are right that this isn't implemented. I presume you could propose a PR for this. The impurity calculator implementations already receive category counts. The only drawback I see is having to store N probabilities at each leaf, not 1. On Wed, Jan 21, 2015 at 3:36 PM, Zsolt Tóth

Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Nicholas Chammas
I agree with Sean that a Spark-specific Stack Exchange likely won't help and almost certainly won't make it out of Area 51. The idea certainly sounds nice from our perspective as Spark users, but it doesn't mesh with the structure of Stack Exchange or the criteria for creating new sites. On Thu

RE: sparkcontext.objectFile return thousands of partitions

2015-01-22 Thread Wang, Ningjun (LNG-NPV)
Sean, you said: If you know that this number is too high you can request a number of partitions when you read it. How do I do that? Can you give a code snippet? I want to read it into 8 partitions, so I do val rdd2 = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8)

Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Nicholas Chammas
we could implement some ‘load balancing’ policies: I think Gerard’s suggestions are good. We need some “official” buy-in from the project’s maintainers and heavy contributors and we should move forward with them. I know that at least Josh Rosen, Sean Owen, and Tathagata Das, who are active on

Re: spark-shell has syntax error on windows.

2015-01-22 Thread Yana Kadiyska
I am not sure if you get the same exception as I do -- spark-shell2.cmd works fine for me. Windows 7 as well. I've never bothered looking to fix it as it seems spark-shell just calls spark-shell2 anyway... On Thu, Jan 22, 2015 at 3:16 AM, Vladimir Protsenko protsenk...@gmail.com wrote: I have a

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Another quick question... I've got 4 nodes with 2 cores each. I've assigned the streaming app 4 cores. It seems to be using one per node. I imagine forwarding from the receivers to the executors is causing unnecessary processing. Is there a way to specify that I want 2 cores from the same

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Sean Owen
Yes, this isn't a well-formed question, and got maybe the response it deserved, but the tone is veering off the rails. I just got a much ruder reply from Sudipta privately, which I will not forward. Sudipta, I suggest you take the responses you've gotten so far as about as much answer as can be

Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-22 Thread Adrian Mocanu
Hi I get this exception when I run a Spark test case on my local machine: An exception or error caused a run to abort:

Re: How to 'Pipe' Binary Data in Apache Spark

2015-01-22 Thread Frank Austin Nothaft
Venkat, No problem! So, creating a custom InputFormat or using sc.binaryFiles alone is not the right solution. We also need the modified version of RDD.pipe to support binary data? Is my understanding correct? Yep! That is correct. The custom InputFormat allows Spark to load binary
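
As a point of reference, a minimal sketch of what sc.binaryFiles (available as an experimental API since Spark 1.2) gives you on its own, with a hypothetical input path; as noted above, piping those raw bytes to an external program would still need a binary-capable variant of RDD.pipe.

    // Each whole file becomes one (path, PortableDataStream) record.
    val files = sc.binaryFiles("hdfs:///data/pcaps")
    // Materialise the bytes per file; reasonable only when individual files are small.
    val sizes = files.mapValues(_.toArray().length)
    sizes.collect().foreach { case (path, nBytes) => println(s"$path: $nBytes bytes") }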

Re: spark streaming with checkpoint

2015-01-22 Thread Balakrishnan Narendran
Thank you Jerry, Does the window operation create new RDDs for each slide duration? I am asking this because I see a constant increase in memory even when there are no logs received. If not checkpoint, is there any alternative that you would suggest? On Tue, Jan 20, 2015 at 7:08 PM,

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hi Sudipta, Standalone spark master. Separate Zookeeper cluster. 4 worker nodes with cassandra + spark on each. No hadoop / hdfs / yarn. Regards, Ashic. Date: Thu, 22 Jan 2015 20:42:43 +0530 Subject: Re: Are these numbers abnormal for spark streaming? From: asudipta.baner...@gmail.com To:

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
Given that the process, and in particular the setup of connections, is bound to the number of partitions (in x.foreachPartition{ x => ??? }), I think it would be worth trying to reduce them. Increasing the 'spark.streaming.blockInterval' will do the trick (you can read the tuning details here:
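
A minimal sketch of that tuning knob (values hypothetical): with receiver-based streaming, partitions per batch is roughly batch interval / block interval, so raising the block interval lowers the partition count and the per-partition setup cost.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // 15 s batches with 1000 ms blocks => roughly 15 partitions per batch per receiver.
    val conf = new SparkConf()
      .setAppName("streaming-tuning-sketch")
      .set("spark.streaming.blockInterval", "1000")   // milliseconds in Spark 1.x
    val ssc = new StreamingContext(conf, Seconds(15))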

RE: How to 'Pipe' Binary Data in Apache Spark

2015-01-22 Thread Venkat, Ankam
How much time would it take to port it? Spark committers: Please let us know your thoughts. Regards, Venkat From: Frank Austin Nothaft [mailto:fnoth...@berkeley.edu] Sent: Thursday, January 22, 2015 9:08 AM To: Venkat, Ankam Cc: Nick Allen; user@spark.apache.org Subject: Re: How to 'Pipe' Binary

Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Kane Kim
I'm trying to process a large dataset, mapping/filtering works ok, but as long as I try to reduceByKey, I get out of memory errors: http://pastebin.com/70M5d0Bn Any ideas how I can fix that? Thanks. - To unsubscribe, e-mail:

Re: PySpark Client

2015-01-22 Thread Chris Beavers
Hey Andrew, Thanks for the response. Is this the issue you're referring to (the duplicate linked there has an associated patch): https://issues.apache.org/jira/browse/SPARK-5162 ? Just to confirm that I understand this: with this patch, Python jobs can be submitted to YARN, and a node from the

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hi Guys, So I changed the interval to 15 seconds. There's obviously a lot more messages per batch, but (I think) it looks a lot healthier. Can you see any major warning signs? I think that with 2 second intervals, the setup / teardown per partition was what was causing the delays.

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Sudipta Banerjee
Hi Marco, Thanks for the confirmation. Please let me know what the 'lot more detail' is that you need to answer a very specific question: WHAT IS THE MINIMUM HARDWARE CONFIGURATION REQUIRED TO BUILD HDFS + MAPREDUCE + SPARK + YARN on a system? Please let me know if you need any further information and if

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in 4.761 seconds. I think there's evidence that setup costs are quite high in this case and increasing the batch interval is helping. On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Ashic

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Yup...looks like it. I can do some tricks to reduce setup costs further, but this is much better than where I was yesterday. Thanks for your awesome input :) -Ashic. From: gerard.m...@gmail.com Date: Thu, 22 Jan 2015 16:34:38 +0100 Subject: Re: Are these numbers abnormal for spark streaming?

RE: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Babu, Prashanth
Sudipta, Use the Docker image [1] and play around with Hadoop and Spark in the VM for a while. Decide on your use case(s) and then you can move ahead for installing on a cluster, etc. This Docker image has all you want [HDFS + MapReduce + Spark + YARN]. All the best! [1]:

RE: How to 'Pipe' Binary Data in Apache Spark

2015-01-22 Thread Venkat, Ankam
Thanks Frank for your response. So, creating a custom InputFormat or using sc.binaryFiles alone is not the right solution. We also need the modified version of RDD.pipe to support binary data? Is my understanding correct? If yes, this can be added as new enhancement Jira request? Nick:

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Jerry Lam
Hi Sudipta, I would also like to suggest to ask this question in Cloudera mailing list since you have HDFS, MAPREDUCE and Yarn requirements. Spark can work with HDFS and YARN but it is more like a client to those clusters. Cloudera can provide services to answer your question more clearly. I'm

RE: spark 1.1.0 save data to hdfs failed

2015-01-22 Thread ey-chih chow
I looked into the namenode log and found this message: 2015-01-22 22:18:39,441 WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 10.33.140.233:53776 got version 9 expected version 4 What should I do to fix this? Thanks. Ey-Chih From: eyc...@hotmail.com To:

Re: Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Sean McNamara
Hi Kane- http://spark.apache.org/docs/latest/tuning.html has excellent information that may be helpful. In particular increasing the number of tasks may help, as well as confirming that you don’t have more data than you're expecting landing on a key. Also, if you are using spark 1.2.0,
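
A minimal sketch of the "increase the number of tasks" suggestion (data hypothetical): reduceByKey accepts an explicit partition count, which spreads the shuffle over more, smaller tasks.

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))   // stand-in for the large dataset
    // The second argument raises the number of reduce tasks / shuffle partitions.
    val summed = pairs.reduceByKey(_ + _, 400)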

Re: reducing number of output files

2015-01-22 Thread Sean Owen
One output file is produced per partition. If you want fewer, use coalesce() before saving the RDD. On Thu, Jan 22, 2015 at 10:46 PM, Kane Kim kane.ist...@gmail.com wrote: How I can reduce number of output files? Is there a parameter to saveAsTextFile? Thanks.
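
A minimal sketch (RDD and output path hypothetical): coalesce to the number of part-files you want before saving.

    val data = sc.parallelize(1 to 1000)
    // One partition => a single part-00000 file in the output directory.
    data.coalesce(1).saveAsTextFile("hdfs:///tmp/single-file-output")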

Results never return to driver | Spark Custom Reader

2015-01-22 Thread Harihar Nahak
Hi All, I wrote a custom reader to read a DB, and it is able to return key and value as expected, but after it finished it never returned to the driver. Here is the output of the worker log: 15/01/23 15:51:38 INFO worker.ExecutorRunner: Launch command: java -cp

processing large dataset

2015-01-22 Thread Kane Kim
I'm trying to process 5TB of data, not doing anything fancy, just map/filter and reduceByKey. Spent whole day today trying to get it processed, but never succeeded. I've tried to deploy to ec2 with the script provided with spark on pretty beefy machines (100 r3.2xlarge nodes). Really frustrated

Re: KNN for large data set

2015-01-22 Thread Sudipta Banerjee
Hi Devan and Xiangrui, Can you please explain the cost and optimization function of the KNN algorithm that is being used? Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 6:59 PM, DEVAN M.S. msdeva...@gmail.com wrote: Thanks Xiangrui Meng will try this. And, found this

Exception in parsley pyspark cassandra hadoop connector

2015-01-22 Thread Nishant Sinha
I am following the repo on github about the pyspark cassandra connector at https://github.com/Parsely/pyspark-cassandra On executing the line: ./run_script.py src/main/python/pyspark_cassandra_hadoop_example.py run test It ends up with an exception: ERROR Executor: Exception in task 9.0 in

Re: spark 1.1.0 save data to hdfs failed

2015-01-22 Thread Sean Owen
It means your client app is using Hadoop 2.x and your HDFS is Hadoop 1.x. On Thu, Jan 22, 2015 at 10:32 PM, ey-chih chow eyc...@hotmail.com wrote: I looked into the namenode log and found this message: 2015-01-22 22:18:39,441 WARN org.apache.hadoop.ipc.Server: Incorrect header or version

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-22 Thread Ankur Dave
At 2015-01-22 02:06:37 -0800, NicolasC nicolas.ch...@inria.fr wrote: I try to execute a simple program that runs the ShortestPaths algorithm (org.apache.spark.graphx.lib.ShortestPaths) on a small grid graph. I use Spark 1.2.0 downloaded from spark.apache.org. This program runs more than 2

save a histogram to a file

2015-01-22 Thread SK
Hi, histogram() returns an object that is a pair of Arrays. There appears to be no saveAsTextFile() for this paired object. Currently I am using the following to save the output to a file: val hist = a.histogram(10) val arr1 = sc.parallelize(hist._1).saveAsTextFile(file1) val arr2 =
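
One possible sketch (data, bucket count, and output path hypothetical): zip each bucket range with its count and write a single text file, instead of saving the two arrays separately.

    val values = sc.parallelize(Seq(1.0, 2.5, 3.7, 8.2))
    val (buckets, counts) = values.histogram(10)             // 11 boundaries, 10 counts
    val lines = buckets.sliding(2).toArray.zip(counts).map {
      case (Array(lo, hi), c) => s"$lo\t$hi\t$c"             // one line per bucket
    }
    sc.parallelize(lines, 1).saveAsTextFile("file:///tmp/histogram")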

RE: spark streaming with checkpoint

2015-01-22 Thread Shao, Saisai
Hi, A new RDD will be created in each slide duration; if there’s no data coming, an empty RDD will be generated. I’m not sure there’s a way to alleviate your problem from the Spark side. Does your application design have to build such a large window? Can you change your implementation if it is easy

Re: reading a csv dynamically

2015-01-22 Thread Imran Rashid
Spark can definitely process data with optional fields. It kinda depends on what you want to do with the results -- it's more of an object design / knowing Scala types question. E.g., Scala has a built-in type Option specifically for handling optional data, which works nicely in pattern matching
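
A minimal sketch of the Option idea, with a made-up column layout of three required fields plus one optional field (malformed lines would need their own case):

    case class Record(a: String, b: String, c: String, extra: Option[String])

    val records = sc.textFile("file:///tmp/input.csv").map { line =>   // hypothetical path
      line.split(",", -1) match {
        case Array(a, b, c)        => Record(a, b, c, None)
        case Array(a, b, c, extra) => Record(a, b, c, Some(extra))
      }
    }
    // Pattern matching then selects rows where the optional column was present.
    val withExtra = records.filter {
      case Record(_, _, _, Some(_)) => true
      case _                        => false
    }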

Missing output partition file in S3

2015-01-22 Thread Nicolas Mai
Hi, My team is using Spark 1.0.1 and the project we're working on needs to compute exact numbers, which are then saved to S3, to be reused later in other Spark jobs to compute other numbers. The problem we noticed yesterday: one of the output partition files in S3 was missing :/ (some

Re: How to make spark partition sticky, i.e. stay with node?

2015-01-22 Thread mingyu
Also, Setting spark.locality.wait=100 did not work for me. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-spark-partition-sticky-i-e-stay-with-node-tp21322p21325.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: sparkcontext.objectFile return thousands of partitions

2015-01-22 Thread Imran Rashid
I think you should also just be able to provide an input format that never splits the input data. This has come up before on the list, but I couldn't find it.* I think this should work, but I can't try it out at the moment. Can you please try and let us know if it works? class
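
A minimal, untested sketch of that idea against the old mapred API (the one sc.sequenceFile/objectFile use underneath): subclass the input format and refuse to split, so each HDFS file becomes a single partition. It could then be handed to sc.hadoopFile in place of the default format.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.mapred.SequenceFileInputFormat

    class NonSplittableSequenceFileInputFormat[K, V] extends SequenceFileInputFormat[K, V] {
      // Never split files, regardless of HDFS block size.
      override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
    }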

RE: spark 1.1.0 save data to hdfs failed

2015-01-22 Thread ey-chih chow
Thanks. But after I replace the maven dependency from <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-client</artifactId> <version>2.5.0-cdh5.2.0</version>

Re: reducing number of output files

2015-01-22 Thread DEVAN M.S.
rdd.coalesce(1) will coalesce the RDD and give only one output file; coalesce(2) will give 2, and so on. On Jan 23, 2015 4:58 AM, Sean Owen so...@cloudera.com wrote: One output file is produced per partition. If you want fewer, use coalesce() before saving the RDD. On Thu, Jan 22, 2015 at 10:46

Re: Using third party libraries in pyspark

2015-01-22 Thread Davies Liu
You need to install these libraries on all the slaves, or submit via spark-submit: spark-submit --py-files xxx On Thu, Jan 22, 2015 at 11:23 AM, Mohit Singh mohit1...@gmail.com wrote: Hi, I might be asking something very trivial, but what's the recommended way of using third-party libraries.

Re: processing large dataset

2015-01-22 Thread Jörn Franke
Did you try it with a smaller subset of the data first? On 23 Jan 2015 05:54, Kane Kim kane.ist...@gmail.com wrote: I'm trying to process 5TB of data, not doing anything fancy, just map/filter and reduceByKey. Spent whole day today trying to get it processed, but never succeeded. I've

Re: sparkcontext.objectFile return thousands of partitions

2015-01-22 Thread Sean Owen
Yes, that second argument is what I was referring to, but yes it's a *minimum*, oops, right. OK, you will want to coalesce then, indeed. On Thu, Jan 22, 2015 at 6:51 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Ø If you know that this number is too high you can request a
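
Putting the two halves of this thread together, a minimal sketch with a hypothetical path: pass the minimum partition count to objectFile, then coalesce down to exactly the number you want.

    import org.apache.spark.mllib.regression.LabeledPoint

    // The second argument is only a minimum, so coalesce to pin the count at 8.
    val rdd = sc.objectFile[LabeledPoint]("file:///tmp/mydir", 8).coalesce(8)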

How to make spark partition sticky, i.e. stay with node?

2015-01-22 Thread mingyu
I posted an question on stackoverflow and haven't gotten any answer yet. http://stackoverflow.com/questions/28079037/how-to-make-spark-partition-sticky-i-e-stay-with-node Is there a way to make a partition stay with a node in Spark Streaming? I need these since I have to load large amount

Using third party libraries in pyspark

2015-01-22 Thread Mohit Singh
Hi, I might be asking something very trivial, but what's the recommended way of using third-party libraries? I am using the tables package to read hdf5 format files. And here is the error trace: print rdd.take(2) File /tmp/spark/python/pyspark/rdd.py, line , in take res =

Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Marcelo Vanzin
On Thu, Jan 22, 2015 at 10:21 AM, Sean Owen so...@cloudera.com wrote: I think a Spark site would have a lot less traffic. One annoyance is that people can't figure out when to post on SO vs Data Science vs Cross Validated. Another is that a lot of the discussions we see on the Spark users list

Apache Spark broadcast error: Error sending message as driverActor is null [message = UpdateBlockInfo(BlockManagerId(4)

2015-01-22 Thread Zijing Guo
Hi, I'm using Apache Spark 1.1.0 and I'm currently having an issue with the broadcast method. When I call the broadcast function on a small dataset on a 5-node cluster, I'm experiencing the Error sending message as driverActor is null error after broadcasting the variables several times (apps running under JBoss).

Durablility of Spark Streaming Applications

2015-01-22 Thread Wang, Daniel
I deployed Spark Streaming applications to a standalone cluster; after a cluster restart, all the deployed applications are gone and I could not see any applications through the Spark Web UI. How do I make the Spark Streaming applications durable so they auto-restart after a cluster restart?

Re: spark streaming with checkpoint

2015-01-22 Thread Jörn Franke
Maybe you use a wrong approach - try something like HyperLogLog or bitmap structures as you can find them, for instance, in Redis. They are much smaller. On 22 Jan 2015 17:19, Balakrishnan Narendran balu.na...@gmail.com wrote: Thank you Jerry, Does the window operation create new

Re: How to 'Pipe' Binary Data in Apache Spark

2015-01-22 Thread Silvio Fiorito
Nick, Have you tried https://github.com/kaitoy/pcap4j? I’ve used this in a Spark app already and didn’t have any issues. My use case was slightly different than yours, but you should give it a try. From: Nick Allen n...@nickallen.org Date: Friday, January 16, 2015 at

Installing Spark Standalone to a Cluster

2015-01-22 Thread riginos
I have downloaded spark-1.2.0.tgz on each of my nodes and executed ./sbt/sbt assembly on each of them. So I execute ./sbin/start-master.sh on my master and ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT. Although when I go to http://localhost:8080 I cannot see any

Re: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-22 Thread Sean Owen
NoSuchMethodError almost always means that you have compiled some code against one version of a library but are running against another. I wonder if you are including different versions of Spark in your project, or running against a cluster on an older version? On Thu, Jan 22, 2015 at 3:57 PM,

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Nicos
Folks, Just a gentle reminder we owe to ourselves: - this is a public forum and we need to behave accordingly; it is not a place to vent frustration in a rude way - getting attention here is an earned privilege and not an entitlement - this is not a “Platinum Support” department of your vendor

Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread pierred
Love it! There is a reason why SO is so effective and popular. Search is excellent, you can quickly find very thoughtful answers about sometimes thorny problems, and it is easy to contribute, format code, etc. Perhaps the most useful feature is that the best answers naturally bubble up to the

Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Sean Owen
FWIW I am a moderator for datascience.stackexchange.com, and even that hasn't really achieved the critical mass that SE sites are supposed to: http://area51.stackexchange.com/proposals/55053/data-science I think a Spark site would have a lot less traffic. One annoyance is that people can't figure

RE: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-22 Thread Adrian Mocanu
I use spark 1.1.0-SNAPSHOT and the test I'm running is in local mode. My test case uses org.apache.spark.streaming.TestSuiteBase val spark = "org.apache.spark" %% "spark-core" % "1.1.0-SNAPSHOT" % "provided" excludeAll( val sparkStreaming = "org.apache.spark" % "spark-streaming_2.10" % "1.1.0-SNAPSHOT" %

RE: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-22 Thread Adrian Mocanu
I use spark 1.1.0-SNAPSHOT val spark = "org.apache.spark" %% "spark-core" % "1.1.0-SNAPSHOT" % "provided" excludeAll( -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: January-22-15 11:39 AM To: Adrian Mocanu Cc: u...@spark.incubator.apache.org Subject: Re: Exception:

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Sudipta Banerjee
Hi Nicos, Taking forward your argument, please be a smart a$$ and don't use unprofessional language just for the sake of being a moderator. Paco Nathan is respected for the dignity he carries in sharing his knowledge and making it available free for a$$es like us, right! So just mind your tongue next

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Sudipta Banerjee
Thank you very much Marco! Really appreciate your support. On Thu, Jan 22, 2015 at 10:57 PM, Marco Shaw marco.s...@gmail.com wrote: (Starting over...) The best place to look for the requirements would be at the individual pages of each technology. As for absolute minimum requirements, I

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
Sudipta - Please don't ever come here or post here again. On Thu, Jan 22, 2015 at 1:25 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Nicos, Taking forward your argument,please be a smart a$$ and dont use unprofessional language just for the sake of being a moderator. Paco Nathan

Re: Installing Spark Standalone to a Cluster

2015-01-22 Thread Yana Kadiyska
You can do ./sbin/start-slave.sh --master spark://IP:PORT. I believe you're missing --master. In addition, it's a good idea to pass with --master exactly the spark master's endpoint as shown on your UI under http://localhost:8080. But that should do it. If that's not working, you can look at the

Would Join on PairRDD's result in co-locating data by keys?

2015-01-22 Thread Ankur Srivastava
Hi, I wanted to understand how the join on two pair RDDs works. Would it result in shuffling data from both RDDs with the same key into the same partition? If that is the case, would it be better to use the partitionBy function to partition (by the join attribute) the RDD at creation for lesser
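
A minimal sketch with hypothetical pair RDDs: when both sides are partitioned with the same partitioner (and cached), the join can reuse that partitioning instead of reshuffling both inputs.

    import org.apache.spark.HashPartitioner

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))
    val partitioner = new HashPartitioner(8)

    // Co-partition both sides once, keep them cached, then join.
    val leftPart  = left.partitionBy(partitioner).cache()
    val rightPart = right.partitionBy(partitioner).cache()
    val joined    = leftPart.join(rightPart)   // no additional shuffle of either side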

Apache Spark broadcast error: Error sending message as driverActor is null [message = UpdateBlockInfo(BlockManagerId

2015-01-22 Thread Edwin
I'm using Apache Spark 1.1.0 and I'm currently having an issue with the broadcast method. When I call the broadcast function on a small dataset on a 5-node cluster, I'm experiencing the Error sending message as driverActor is null error after broadcasting the variables several times (apps running under JBoss). Any

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Joseph Ottinger
Sudipta, with all due respect... don't respond to me if you don't like what I say is not the same as not being a jerk about it. One earns social capital, by being respectful and by respecting the social norms during interaction; by everything I've seen, you've been demanding and disrespectful

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
(Starting over...) The best place to look for the requirements would be at the individual pages of each technology. As for absolute minimum requirements, I would suggest 50GB of disk space and at least 8GB of memory. This is the absolute minimum. Architecting a solution like you are looking

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Lukas Nalezenec
+1 On 22.1.2015 18:30, Marco Shaw wrote: Sudipta - Please don't ever come here or post here again. On Thu, Jan 22, 2015 at 1:25 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Nicos, Taking forward your argument, please be a smart a$$ and

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Sudipta Banerjee
Don't ever reply to my queries :D On Thu, Jan 22, 2015 at 11:02 PM, Lukas Nalezenec lukas.naleze...@firma.seznam.cz wrote: +1 On 22.1.2015 18:30, Marco Shaw wrote: Sudipta - Please don't ever come here or post here again. On Thu, Jan 22, 2015 at 1:25 PM, Sudipta Banerjee

Re: Using third party libraries in pyspark

2015-01-22 Thread Felix C
Python couldn't find your module. Do you have that on each worker node? You will need to have that on each one --- Original Message --- From: Davies Liu dav...@databricks.com Sent: January 22, 2015 9:12 PM To: Mohit Singh mohit1...@gmail.com Cc: user@spark.apache.org Subject: Re: Using third

Re: processing large dataset

2015-01-22 Thread Russell Jurney
Often when this happens to me, it is actually an exception parsing a few messages. Easy to miss this, as error messages aren't always informative. I would be blaming spark, but in reality it was missing fields in a CSV file. As has been said, make a file with a few records and see if your job