Re: Getting data into Spark Streaming

2015-05-08 Thread Akhil Das
I don't think you can use rawSocketStream since the RSVP is from a web server and you will have to send a GET request first to initialize the communication. You are better off writing a custom receiver https://spark.apache.org/docs/latest/streaming-custom-receivers.html for your use case. For a
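A minimal sketch of such a custom receiver, assuming a line-oriented HTTP endpoint; the class name and URL below are illustrative, not from the thread:

    import java.net.URL
    import scala.io.Source
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Hypothetical receiver that issues the GET request itself and pushes
    // each received line into Spark via store().
    class HttpLineReceiver(url: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
      def onStart(): Unit = {
        new Thread("HTTP Line Receiver") { override def run(): Unit = receive() }.start()
      }
      def onStop(): Unit = {} // the receive loop checks isStopped() and exits on its own
      private def receive(): Unit = {
        while (!isStopped()) {
          val source = Source.fromURL(new URL(url))
          try source.getLines().foreach(store) finally source.close()
        }
      }
    }
    // usage: val lines = ssc.receiverStream(new HttpLineReceiver("http://example.com/stream"))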

Re: Troubling Logging w/Simple Example (spark-1.2.2-bin-hadoop2.4)...

2015-05-06 Thread Akhil Das
You have an issue with your cluster setup. Can you paste your conf/spark-env.sh and the conf/slaves files here? The reason your job is running fine is that you set the master inside the job as local[*], which runs in local mode (not in standalone cluster mode). Thanks Best Regards On

Re: com.datastax.spark % spark-streaming_2.10 % 1.1.0 in my build.sbt ??

2015-05-06 Thread Akhil Das
I don't see a spark-streaming dependency at com.datastax.spark http://mvnrepository.com/artifact/com.datastax.spark, but it does have a kafka-streaming dependency though. Thanks Best Regards On Tue, May 5, 2015 at 12:42 AM, Eric Ho eric...@intel.com wrote: Can I specify this in my build file ?

Re: Spark Mongodb connection

2015-05-06 Thread Akhil Das
Here's a complete example https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html Thanks Best Regards On Mon, May 4, 2015 at 12:57 PM, Yasemin Kaya godo...@gmail.com wrote: Hi! I am new at Spark and I want to begin Spark with simple wordCount example in Java. But I want to give

Re: java.io.IOException: No space left on device while doing repartitioning in Spark

2015-05-05 Thread Akhil Das
It could be filling up your /tmp directory. You need to set your spark.local.dir or you can also specify SPARK_WORKER_DIR to another location which has sufficient space. Thanks Best Regards On Mon, May 4, 2015 at 7:27 PM, shahab shahab.mok...@gmail.com wrote: Hi, I am getting No space left

Re: Problem in Standalone Mode

2015-05-04 Thread Akhil Das
Can you paste the complete stacktrace? It looks like you are having a version incompatibility with hadoop. Thanks Best Regards On Sat, May 2, 2015 at 4:36 PM, drarse drarse.a...@gmail.com wrote: When I run my program with spark-submit everything is ok. But when I try to run it in standalone mode I

Re: spark filestrea problem

2015-05-04 Thread Akhil Das
With fileStream you can actually pass a filter parameter to avoid loading up .tmp files/directories. Also, when you move/rename a file, the file creation date doesn't change and hence Spark won't detect them, I believe. Thanks Best Regards On Sat, May 2, 2015 at 9:37 PM, Evo Eftimov
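A small sketch of that filter parameter, assuming a text input directory (the path and extension are placeholders):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Skip anything ending in .tmp so half-written files are never picked up.
    val filter = (path: Path) => !path.getName.endsWith(".tmp")
    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat]("/data/incoming", filter, newFilesOnly = true)
      .map(_._2.toString)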

Re: Remoting warning when submitting to cluster

2015-05-04 Thread Akhil Das
Looks like a version incompatibility, just make sure you have the proper version of Spark. Also look further in the stacktrace to see what is causing the Futures timed out (it could also be a network issue if the ports aren't opened properly). Thanks Best Regards On Sat, May 2, 2015 at 12:04 AM,

Re: Hardware requirements

2015-05-04 Thread Akhil Das
500GB of data will have nearly 3900 partitions, and if you can have nearly that many cores and around 500GB of memory then things will be lightning fast. :) Thanks Best Regards On Sun, May 3, 2015 at 12:49 PM, sherine ahmed sherine.sha...@hotmail.com wrote: I need to use spark to

Re: Hardware requirements

2015-05-04 Thread Akhil Das
and block sizes are the same, shouldn't we end up with 8k partitions? On 4 May 2015 17:49, Akhil Das ak...@sigmoidanalytics.com wrote: 500GB of data will have nearly 3900 partitions and if you can have nearly that many cores and around 500GB of memory then things will be lightning fast

Re: Exiting driver main() method...

2015-05-02 Thread Akhil Das
It used to exit without any problem for me. You can basically check in the driver UI (that runs on 4040) and see what exactly it's doing. Thanks Best Regards On Fri, May 1, 2015 at 6:22 PM, James Carman ja...@carmanconsulting.com wrote: In all the examples, it seems that the spark application

Re: spark.logConf with log4j.rootCategory=WARN

2015-05-02 Thread Akhil Das
It could be. Thanks Best Regards On Fri, May 1, 2015 at 9:11 PM, roy rp...@njit.edu wrote: Hi, I have recently enabled log4j.rootCategory=WARN, console in the spark configuration, but after that spark.logConf=True has become ineffective. So just want to confirm if this is because

Re: Spark Streaming Kafka Avro NPE on deserialization of payload

2015-05-02 Thread Akhil Das
There was a similar discussion over here http://mail-archives.us.apache.org/mod_mbox/spark-user/201411.mbox/%3ccakz4c0s_cuo90q2jxudvx9wc4fwu033kx3-fjujytxxhr7p...@mail.gmail.com%3E Thanks Best Regards On Fri, May 1, 2015 at 7:12 PM, Todd Nist tsind...@gmail.com wrote: *Resending as I do not

Re: how to pass configuration properties from driver to executor?

2015-05-02 Thread Akhil Das
In fact, sparkConf.set("spark.whateverPropertyYouWant", "Value") gets shipped to the executors. Thanks Best Regards On Fri, May 1, 2015 at 2:55 PM, Michael Ryabtsev mich...@totango.com wrote: Hi, We've had a similar problem, but with the log4j properties file. The only working way we've found, was

Re: Spark worker error on standalone cluster

2015-05-02 Thread Akhil Das
Just make sure you are using the same version of Spark in your cluster and in the project's build file. Thanks Best Regards On Fri, May 1, 2015 at 2:43 PM, Michael Ryabtsev (Totango) mich...@totango.com wrote: Hi everyone, I have a spark application that works fine on a standalone Spark

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-02 Thread Akhil Das
-memory 12g --executor-cores 4 12G is the limit imposed by the YARN cluster, I can't go beyond this. Any suggestions? Regards, Deepak On Thu, Apr 30, 2015 at 6:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Did not work. Same problem. On Thu, Apr 30, 2015 at 1:28 PM, Akhil Das ak

Re: default number of reducers

2015-04-30 Thread Akhil Das
This is the Spark mailing list :/ Yes, you can configure the following in mapred-site.xml for that: <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>4</value> </property> Thanks Best Regards On Tue, Apr 28, 2015 at 11:00 PM, Shushant Arora shushantaror...@gmail.com wrote: In

Re: Performance advantage by loading data from local node over S3.

2015-04-30 Thread Akhil Das
If the data is huge and is in S3, that'll be a lot of network traffic; instead, if the data is available in HDFS (with proper replication) then it will be faster, as most of the time the data will be available as PROCESS_LOCAL/NODE_LOCAL to the executor. Thanks Best Regards On Wed, Apr

Re: Spark - Timeout Issues - OutOfMemoryError

2015-04-30 Thread Akhil Das
You could try increasing your heap space explicitly, like export _JAVA_OPTIONS=-Xmx10g; it's not the correct approach, but give it a try. Thanks Best Regards On Tue, Apr 28, 2015 at 10:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have a SparkApp that completes in 45 mins for 5 files (5*750MB

Re: rdd.count with 100 elements taking 1 second to run

2015-04-30 Thread Akhil Das
Does this speed up? val rdd = sc.parallelize(1 to 100, 30); rdd.count Thanks Best Regards On Wed, Apr 29, 2015 at 1:47 AM, Anshul Singhle ans...@betaglide.com wrote: Hi, I'm running the following code in my cluster (standalone mode) via spark shell - val rdd = sc.parallelize(1 to

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-30 Thread Akhil Das
Have a look at KafkaRDD https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kafka/KafkaRDD.html Thanks Best Regards On Wed, Apr 29, 2015 at 10:04 AM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, I'm wondering about the use-case where you're not doing continuous,
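A hedged sketch of a one-shot read with KafkaUtils.createRDD (Spark 1.3+); the broker, topic, offsets and output path are placeholders you would compute yourself:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    // Read exactly the offsets you want, once, with no long-running streaming job.
    val offsetRanges = Array(OffsetRange("mytopic", 0, 0L, 1000L))
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](sc, kafkaParams, offsetRanges)
    rdd.map(_._2).saveAsTextFile("hdfs:///tmp/kafka-dump")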

Re: How to run customized Spark on EC2?

2015-04-30 Thread Akhil Das
This is how I used to do it: - Login to the ec2 cluster (master) - Make changes to Spark and build it - Stop the old installation of Spark (sbin/stop-all.sh) - Copy the old installation's conf/* to the modified version's conf/ - Rsync the modified version to all slaves - Run sbin/start-all.sh from the

Re: How to run self-build spark on EC2?

2015-04-30 Thread Akhil Das
You can replace your cluster's assembly jar (on master and workers) with your custom-built assembly jar. Thanks Best Regards On Tue, Apr 28, 2015 at 9:45 PM, Bo Fu b...@uchicago.edu wrote: Hi all, I have an issue. I added some timestamps in the Spark source code and built it using: mvn package

Re: External Application Run Status

2015-04-30 Thread Akhil Das
One thing you could try: inside the map, you can have a synchronized thread and block the map till the thread finishes processing. Thanks Best Regards On Wed, Apr 29, 2015 at 9:38 AM, Nastooh Avessta (navesta) nave...@cisco.com wrote: Hi In a multi-node setup, I am

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-29 Thread Akhil Das
It is possible to access the filename, though it's a bit tricky. val fstream = ssc.fileStream[LongWritable, IntWritable, SequenceFileInputFormat[LongWritable, IntWritable]]("/home/akhld/input/") fstream.foreach(x => { //You can get it with this object.

Re: Spark 1.3.1 JavaStreamingContext - fileStream compile error

2015-04-28 Thread Akhil Das
How about: JavaPairDStream<LongWritable, Text> input = jssc.fileStream(inputDirectory, LongWritable.class, Text.class, TextInputFormat.class); See the complete example over here

Re: Understanding Spark's caching

2015-04-28 Thread Akhil Das
Option B would be fine, as in the SO itself the answer says, Since RDD transformations merely build DAG descriptions without execution, in Option A by the time you call unpersist, you still only have job descriptions and not a running execution. Also note, In Option A, you are not specifying any

Re: Spark timeout issue

2015-04-27 Thread Akhil Das
You need to look deeper into your worker logs; if you look closely you may find GC errors, IO exceptions etc. that are triggering the timeout. Thanks Best Regards On Mon, Apr 27, 2015 at 3:18 AM, Deepak Gopalakrishnan dgk...@gmail.com wrote: Hello Patrick, Sure. I've posted this on user as

Re: Understand the running time of SparkSQL queries

2015-04-27 Thread Akhil Das
Isn't it already available on the driver UI (that runs on 4040)? Thanks Best Regards On Mon, Apr 27, 2015 at 9:55 AM, Wenlei Xie wenlei@gmail.com wrote: Hi, I am wondering how should we understand the running time of SparkSQL queries? For example the physical query plan and the running

Re: Convert DStream[Long] to Long

2015-04-25 Thread Akhil Das
Like this? messages.foreachRDD(rdd => { if (rdd.count() > 0) //Do whatever you want. }) Thanks Best Regards On Fri, Apr 24, 2015 at 11:20 PM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: Hi, I need to compare the count of messages received, whether it is 0 or not, but messages.count() returns a

Re: DAG

2015-04-25 Thread Akhil Das
May be this will give you a good start https://github.com/apache/spark/pull/2077 Thanks Best Regards On Sat, Apr 25, 2015 at 1:29 AM, Giovanni Paolo Gibilisco gibb...@gmail.com wrote: Hi, I would like to know if it is possible to build the DAG before actually executing the application. My

Re: StreamingContext.textFileStream issue

2015-04-25 Thread Akhil Das
Make sure you have >= 2 cores for your streaming application. Thanks Best Regards On Sat, Apr 25, 2015 at 3:02 AM, Yang Lei genia...@gmail.com wrote: I hit the same issue, as if the directory has no files at all, when running the sample examples/src/main/python/streaming/hdfs_wordcount.py

Re: problem writing to s3

2015-04-24 Thread Akhil Das
, 2015 at 1:27 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try writing to a different S3 bucket and confirm that? Thanks Best Regards On Thu, Apr 23, 2015 at 12:11 AM, Daniel Mahler dmah...@gmail.com wrote: Hi Akhil, It works fine when outprefix is a hdfs:///localhost/... url

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Akhil Das
The directory in ZooKeeper to store recovery state (default: /spark). -Jeff From: Sean Owen so...@cloudera.com To: Akhil Das ak...@sigmoidanalytics.com Cc: Michal Klos michal.klo...@gmail.com, User user@spark.apache.org Date: Wed, 22 Apr 2015 11:05:46 +0100 Subject: Re: Multiple HA spark

Re: sparksql - HiveConf not found during task deserialization

2015-04-22 Thread Akhil Das
are in that dir. For me the most confusing thing is that the executor can actually create HiveConf objects, but then cannot find the class when the task deserializer is at work. On 20 April 2015 at 14:18, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try sc.addJar(/path/to/your/hive/jar), i

Re: Not able run multiple tasks in parallel, spark streaming

2015-04-22 Thread Akhil Das
You can enable this flag to run multiple jobs concurrently. It might not be production ready, but you can give it a try: sparkConf.set("spark.streaming.concurrentJobs", "2") Refer to TD's answer here

Re: Spark and accumulo

2015-04-21 Thread Akhil Das
You can simply use a custom InputFormat (AccumuloInputFormat) with the hadoop RDDs (sc.newAPIHadoopFile etc.) for that; all you need to do is pass the jobConf. Here's a pretty clean discussion:
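A rough sketch of that wiring, using sc.newAPIHadoopRDD since Accumulo exposes a table rather than a file path; the connector/table setters are left as comments because their exact signatures vary across Accumulo versions:

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
    import org.apache.accumulo.core.data.{Key, Value}
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance()
    // Configure connector info, input table name and ZooKeeper instance on the job here,
    // e.g. AccumuloInputFormat.setConnectorInfo(...), setInputTableName(...), etc.
    val accumuloRdd = sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[AccumuloInputFormat],
      classOf[Key],
      classOf[Value])
    println(s"rows: ${accumuloRdd.count()}")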

Re: Understanding the build params for spark with sbt.

2015-04-21 Thread Akhil Das
With maven you could do it like: mvn -Dhadoop.version=2.3.0 -DskipTests clean package -pl core Thanks Best Regards On Mon, Apr 20, 2015 at 8:10 PM, Shiyao Ma i...@introo.me wrote: Hi. My usage is only about the spark core and hdfs, so no spark sql or mlib or other components involved. I saw

Re: meet weird exception when studying rdd caching

2015-04-21 Thread Akhil Das
It could be a similar issue as https://issues.apache.org/jira/browse/SPARK-4300 Thanks Best Regards On Tue, Apr 21, 2015 at 8:09 AM, donhoff_h 165612...@qq.com wrote: Hi, I am studying the RDD Caching function and write a small program to verify it. I run the program in a Spark1.3.0

Re: Custom paritioning of DSTream

2015-04-21 Thread Akhil Das
I think DStream.transform is the one that you are looking for. Thanks Best Regards On Mon, Apr 20, 2015 at 9:42 PM, Evo Eftimov evo.efti...@isecc.com wrote: Is the only way to implement a custom partitioning of DStream via the foreach approach so to gain access to the actual RDDs comprising

Re: Running spark over HDFS

2015-04-21 Thread Akhil Das
Your spark master should be spark://swetha:7077 :) Thanks Best Regards On Mon, Apr 20, 2015 at 2:44 PM, madhvi madhvi.gu...@orkash.com wrote: PFA screenshot of my cluster UI Thanks On Monday 20 April 2015 02:27 PM, Akhil Das wrote: Are you seeing your task being submitted to the UI

Re: Running spark over HDFS

2015-04-20 Thread Akhil Das
2015 12:28 PM, Akhil Das wrote: In your eclipse, while you create your SparkContext, set the master uri as shown in the web UI's top left corner like: spark://someIPorHost:7077 and it should be fine. Thanks Best Regards On Mon, Apr 20, 2015 at 12:22 PM, madhvi madhvi.gu...@orkash.com

Re: NEWBIE/not able to connect to postgresql using jdbc

2015-04-20 Thread Akhil Das
Try doing a sc.addJar("path\to\your\postgres\jar") Thanks Best Regards On Mon, Apr 20, 2015 at 12:26 PM, shashanksoni shashankso...@gmail.com wrote: I am using a spark 1.3 standalone cluster on my local windows machine and trying to load data from one of our servers. Below is my code - import os

Re: sparksql - HiveConf not found during task deserialization

2015-04-20 Thread Akhil Das
was suspecting some foul play with classloaders. On 20 April 2015 at 12:20, Akhil Das ak...@sigmoidanalytics.com wrote: Looks like a missing jar, try to print the classpath and make sure the hive jar is present. Thanks Best Regards On Mon, Apr 20, 2015 at 11:52 AM, Manku Timma manku.tim

Re: How to run spark programs in eclipse like mapreduce

2015-04-20 Thread Akhil Das
Why not build the project and submit the built jar with spark-submit? If you want to run it within eclipse, then all you have to do is create a SparkContext pointing to your cluster, do a sc.addJar("/path/to/your/project/jar") and then you can hit the run button to run the job (note that network

Re: sparksql - HiveConf not found during task deserialization

2015-04-20 Thread Akhil Das
Looks like a missing jar, try to print the classpath and make sure the hive jar is present. Thanks Best Regards On Mon, Apr 20, 2015 at 11:52 AM, Manku Timma manku.tim...@gmail.com wrote: I am using spark-1.3 with hadoop-provided and hive-provided and hive-0.13.1 profiles. I am running a

Re: SparkStreaming onStart not being invoked on CustomReceiver attached to master with multiple workers

2015-04-20 Thread Akhil Das
Would be good, if you can paste your custom receiver code and the code that you used to invoke it. Thanks Best Regards On Mon, Apr 20, 2015 at 9:43 AM, Ankit Patel patel7...@hotmail.com wrote: I am experiencing problem with SparkStreaming (Spark 1.2.0), the onStart method is never called on

Re: shuffle.FetchFailedException in spark on YARN job

2015-04-20 Thread Akhil Das
Which version of Spark are you using? Did you try using spark.shuffle.blockTransferService=nio Thanks Best Regards On Sat, Apr 18, 2015 at 11:14 PM, roy rp...@njit.edu wrote: Hi, My spark job is failing with following error message org.apache.spark.shuffle.FetchFailedException:

Re: Running spark over HDFS

2015-04-20 Thread Akhil Das
In your eclipse, while you create your SparkContext, set the master uri as shown in the web UI's top left corner like: spark://someIPorHost:7077 and it should be fine. Thanks Best Regards On Mon, Apr 20, 2015 at 12:22 PM, madhvi madhvi.gu...@orkash.com wrote: Hi All, I am new to spark and

Re: Spark 1.3 saveAsTextFile with codec gives error - works with Spark 1.2

2015-04-17 Thread Akhil Das
Not sure if this will help, but try clearing your jar cache (for sbt ~/.ivy2 and for maven ~/.m2) directories. Thanks Best Regards On Wed, Apr 15, 2015 at 9:33 PM, Manoj Samel manojsamelt...@gmail.com wrote: Env - Spark 1.3 Hadoop 2.3, Kerbeos xx.saveAsTextFile(path, codec) gives following

Re: Distinct is very slow

2015-04-17 Thread Akhil Das
, Akhil Das ak...@sigmoidanalytics.com wrote: Can you paste your complete code? Did you try repartitioning/increasing the level of parallelism to speed up the processing? Since you have 16 cores, and I'm assuming your 400k records isn't bigger than a 10G dataset. Thanks Best Regards On Thu, Apr 16

Re: SparkR: Server IPC version 9 cannot communicate with client version 4

2015-04-17 Thread Akhil Das
There's a version incompatibility between your hadoop jars. You need to make sure you build your spark with Hadoop 2.5.0-cdh5.3.1 version. Thanks Best Regards On Fri, Apr 17, 2015 at 5:17 AM, lalasriza . lala.s.r...@gmail.com wrote: Dear everyone, right now I am working with SparkR on

SparkStreaming 1.3.0 fileNotFound Exception while using WAL Checkpoints

2015-04-17 Thread Akhil Das
Hi With SparkStreaming on 1.3.0 version when I'm using WAL and checkpoints, sometimes, I'm hitting fileNotFound exceptions. Here's the complete stacktrace: https://gist.github.com/akhld/126b945f7fef408a525e The application simply reads data from Kafka and does a simple wordcount over it. Batch

Re: Streaming problems running 24x7

2015-04-16 Thread Akhil Das
I used to hit this issue when my processing time exceeded the batch duration. Here are a few workarounds: - Use storage level MEMORY_AND_DISK - Enable WAL and checkpointing The above two will slow things down a little bit. If you want low latency, what you can try is: - Use storage level as
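A sketch of the first two workarounds; the input source, checkpoint path and batch interval are placeholders, and the WAL property name assumes Spark 1.2+:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("long-running-stream")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true") // WAL for received blocks
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")             // checkpointing, also required by the WAL
    // Receiver-based input persisted to memory and disk instead of memory only.
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK)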

Re: MLLib SVMWithSGD : java.lang.OutOfMemoryError: Java heap space

2015-04-16 Thread Akhil Das
Try increasing your driver memory. Thanks Best Regards On Thu, Apr 16, 2015 at 6:09 PM, sarath sarathkrishn...@gmail.com wrote: Hi, I'm trying to train an SVM on KDD2010 dataset (available from libsvm). But I'm getting java.lang.OutOfMemoryError: Java heap space error. The dataset is

Re: Distinct is very slow

2015-04-16 Thread Akhil Das
...@gmail.com wrote: I already checked and GC is taking 1 sec for each task. Is this too much? If yes, how to avoid this? On 16 April 2015 at 21:58, Akhil Das ak...@sigmoidanalytics.com wrote: Open the driver UI and see which stage is taking time, you can look at whether it's adding any GC time etc

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Akhil Das
You could try repartitioning your RDD using a custom partitioner (HashPartitioner etc.) and caching the dataset in memory to speed up the joins. Thanks Best Regards On Tue, Apr 14, 2015 at 8:10 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I have an RDD that contains
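A minimal sketch of that idea with toy data (the keys and values are placeholders):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._

    val partitioner = new HashPartitioner(16)  // roughly match your core count
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(partitioner).cache()
    val right = sc.parallelize(Seq((1, "x"), (3, "y"))).partitionBy(partitioner)
    // Both sides share the same partitioner, so the join avoids a full shuffle.
    val joined = left.join(right)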

Re: Distinct is very slow

2015-04-16 Thread Akhil Das
Open the driver UI and see which stage is taking time; you can look at whether it's adding any GC time etc. Thanks Best Regards On Thu, Apr 16, 2015 at 9:56 PM, Jeetendra Gangele gangele...@gmail.com wrote: Hi All, I have the below code where distinct is running for a long time. blockingRdd is the

Re: custom input format in spark

2015-04-16 Thread Akhil Das
You can simply override the isSplitable method in your custom inputformat class and make it return false. Here's a sample code snippet: http://stackoverflow.com/questions/17875277/reading-file-as-single-record-in-hadoop#answers-header Thanks Best Regards On Thu, Apr 16, 2015 at 4:18 PM,
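A short Scala sketch of that override, assuming the new (mapreduce) Hadoop API:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.JobContext
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Every file is read as a single split, i.e. one unsplittable record source per file.
    class NonSplittableTextInputFormat extends TextInputFormat {
      override def isSplitable(context: JobContext, file: Path): Boolean = false
    }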

Re: custom input format in spark

2015-04-16 Thread Akhil Das
You can plug in the native hadoop input formats with Spark's sc.newAPIHadoopFile etc., which takes in the InputFormat. Thanks Best Regards On Thu, Apr 16, 2015 at 10:15 PM, Shushant Arora shushantaror...@gmail.com wrote: Is it for spark? On Thu, Apr 16, 2015 at 10:05 PM, Akhil Das ak

Re: Execption while using kryo with broadcast

2015-04-15 Thread Akhil Das
Is it working without kryo? Thanks Best Regards On Wed, Apr 15, 2015 at 6:38 PM, Jeetendra Gangele gangele...@gmail.com wrote: Hi All, I am getting the below exception while using Kryo serialization with a broadcast variable. I am broadcasting a hashmap with the below line. Map<Long, MatcherReleventData>

Re: Saving RDDs as custom output format

2015-04-15 Thread Akhil Das
You can try using ORCOutputFormat with yourRDD.saveAsNewAPIHadoopFile Thanks Best Regards On Tue, Apr 14, 2015 at 9:29 PM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, Is it possible to store RDDs as custom output formats, For example ORC? Thanks, Daniel

Re: spark streaming printing no output

2015-04-15 Thread Akhil Das
Just make sure you have at least 2 cores available for processing. You can try launching it in local[2] and make sure it's working fine. Thanks Best Regards On Tue, Apr 14, 2015 at 11:41 PM, Shushant Arora shushantaror...@gmail.com wrote: Hi, I am running a spark streaming application but on

Re: Running Spark on Gateway - Connecting to Resource Manager Retries

2015-04-15 Thread Akhil Das
Make sure your yarn service is running on 8032. Thanks Best Regards On Tue, Apr 14, 2015 at 12:35 PM, Vineet Mishra clearmido...@gmail.com wrote: Hi Team, I am running Spark Word Count example( https://github.com/sryza/simplesparkapp), if I go with master as local it works fine. But when

Re: Running beyond physical memory limits

2015-04-15 Thread Akhil Das
Did you try reducing your spark.executor.memory? Thanks Best Regards On Wed, Apr 15, 2015 at 2:29 PM, Brahma Reddy Battula brahmareddy.batt...@huawei.com wrote: Hello Sparkers I am newbie to spark and need help.. We are using spark 1.2, we are getting the following error and executor is

Re: spark streaming with kafka

2015-04-15 Thread Akhil Das
Once you start your streaming application to read from Kafka, it will launch receivers on the executor nodes. And you can see them on the streaming tab of your driver UI (runs on 4040). These receivers will be fixed till the end of your pipeline (unless they crash etc.)

Re: Re: spark streaming with kafka

2015-04-15 Thread Akhil Das
, Akhil, I would ask a question here: Assume Receiver-0 crashed; will it be restarted on other worker nodes (in your picture, there would be 2 receivers on the same node) or will it start on the same node? -- bit1...@163.com From: Akhil Das ak

Re: RDD generated on every query

2015-04-14 Thread Akhil Das
You can use Tachyon-based storage for that, and every time the client queries, you just get it from there. Thanks Best Regards On Mon, Apr 6, 2015 at 6:01 PM, Siddharth Ubale siddharth.ub...@syncoms.com wrote: Hi, In the Spark Web Application the RDD is generated every time the client is
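A tiny sketch of that, assuming Spark 1.x where OFF_HEAP storage is backed by Tachyon (the RDD below is a stand-in for the per-query result):

    import org.apache.spark.storage.StorageLevel

    // Persist the query result off-heap so later client queries reuse it
    // instead of regenerating the RDD inside the web application.
    val queryResultRdd = sc.parallelize(1 to 1000)
    queryResultRdd.persist(StorageLevel.OFF_HEAP)
    queryResultRdd.count() // materialize once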

Re: set spark.storage.memoryFraction to 0 when no cached RDD and memory area for broadcast value?

2015-04-14 Thread Akhil Das
You could try leaving all the configuration values at their defaults and running your application to see if you are still hitting the heap issue; if so, try adding swap space to the machines, which will definitely help. Another way would be to set the heap space manually (export _JAVA_OPTIONS=-Xmx5g)

Re: Seeing message about receiver not being de-registered on invoking Streaming context stop

2015-04-14 Thread Akhil Das
When you say done fetching documents, does it mean that you are stopping the streamingContext? (ssc.stop) or you meant completed fetching documents for a batch? If possible, you could paste your custom receiver code so that we can have a look at it. Thanks Best Regards On Tue, Apr 7, 2015 at

Re: start-slave.sh not starting

2015-04-14 Thread Akhil Das
Why are you not using sbin/start-all.sh? Thanks Best Regards On Wed, Apr 8, 2015 at 10:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to start the worker by: sbin/start-slave.sh spark://ip-10-241-251-232:7077 In the logs it's complaining about: Master must be a URL of the

Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-04-14 Thread Akhil Das
One hack you can put in would be to bring the Result class http://grepcode.com/file_/repository.cloudera.com/content/repositories/releases/com.cloudera.hbase/hbase/0.89.20100924-28/org/apache/hadoop/hbase/client/Result.java/?v=source locally, serialize it (implement Serializable) and use it.
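A different, commonly used workaround (not the serialization hack above) is to map each Result into plain values right after the scan; hbaseRdd and the column family/qualifier names below are placeholders:

    import org.apache.hadoop.hbase.util.Bytes

    // Extract ordinary Strings immediately, so the non-serializable Result
    // never needs to be shuffled, cached or sent back to the driver.
    val rows = hbaseRdd.map { case (rowKey, result) =>
      val value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))
      (Bytes.toString(rowKey.get()), Bytes.toString(value))
    }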

Re: value reduceByKeyAndWindow is not a member of org.apache.spark.streaming.dstream.DStream[(String, Int)]

2015-04-14 Thread Akhil Das
Just make sure you import the following: import org.apache.spark.SparkContext._ import org.apache.spark.StreamingContext._ Thanks Best Regards On Wed, Apr 8, 2015 at 6:38 AM, Su She suhsheka...@gmail.com wrote: Hello Everyone, I am trying to implement this example (Spark Streaming with
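With those imports in place, a usage sketch (lines is a placeholder DStream[String]):

    import org.apache.spark.SparkContext._
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.StreamingContext._

    val pairs  = lines.flatMap(_.split(" ")).map((_, 1))
    // Sum counts over a 30-second window, sliding every 10 seconds.
    val counts = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
    counts.print()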

Re: Cannot change the memory of workers

2015-04-14 Thread Akhil Das
If you want to use 2g of memory on each worker, you can simply export SPARK_WORKER_MEMORY=2g inside your spark-env.sh on every machine in the cluster. Thanks Best Regards On Wed, Apr 8, 2015 at 7:27 AM, Jia Yu jia...@asu.edu wrote: Hi guys, Currently I am running a Spark program on Amazon EC2.

Re: Exception in thread main java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds] when create context

2015-04-14 Thread Akhil Das
Can you share a bit more information on the type of application that you are running? From the stacktrace I can only say that, for some reason, your connection timed out (probably a GC pause or a network issue). Thanks Best Regards On Wed, Apr 8, 2015 at 9:48 PM, Shuai Zheng szheng.c...@gmail.com wrote:

Re: SparkSQL + Parquet performance

2015-04-14 Thread Akhil Das
That totally depends on your disk IO and the number of CPUs that you have in the cluster. For example, if you have a disk IO of 100MB/s and a handful of CPUs (say 40 cores, on 10 machines), then it could take you to ~1GB/s, I believe. Thanks Best Regards On Tue, Apr 7, 2015 at 2:48 AM,

Re: streamSQL - is it available or is it in POC ?

2015-04-14 Thread Akhil Das
We have a similar version (Sigstream), you could find more over here https://sigmoid.com/ Thanks Best Regards On Wed, Apr 8, 2015 at 9:25 AM, haopu hw...@qilinsoft.com wrote: I'm also interested in this project. Do you have any update on it? Is it still active? Thank you! -- View this

Re: save as text file throwing null pointer error.

2015-04-14 Thread Akhil Das
Where exactly is it throwing the null pointer exception? Are you starting your program from another program or something? It looks like you are invoking ProcessBuilder etc. Thanks Best Regards On Thu, Apr 9, 2015 at 6:46 PM, Somnath Pandeya somnath_pand...@infosys.com wrote: JavaRDD<String>

Re: Spark Streaming not picking current date properly

2015-04-14 Thread Akhil Das
You can try something like this: eventsDStream.foreachRDD(rdd => { val curdate = new DateTime() val fmt = DateTimeFormat.forPattern("dd_MM_"); rdd.saveAsTextFile("s3n://bucket_name/test/events_" + fmt.print(curdate) + "/events") }) Thanks Best Regards On Fri, Apr 10, 2015 at 4:22

Re: Spark Data Formats ?

2015-04-14 Thread Akhil Das
There's sc.objectFile also. Thanks Best Regards On Tue, Apr 14, 2015 at 2:59 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Can you please share the native support of data formats available with Spark. Two i can see are parquet and textFile sc.parquetFile sc.textFile I see that Hadoop

Re: Sending RDD object over the network

2015-04-06 Thread Akhil Das
Are you expecting to receive the 1 to 100 values in your second program? RDD is just an abstraction; you would need to do something like: num.foreach(x => send(x)) Thanks Best Regards On Mon, Apr 6, 2015 at 1:56 AM, raggy raghav0110...@gmail.com wrote: For a class project, I am trying to utilize 2 spark

Re: Learning Spark

2015-04-06 Thread Akhil Das
We had few sessions at Sigmoid, you could go through the meetup page for details: http://www.meetup.com/Real-Time-Data-Processing-and-Cloud-Computing/ On 6 Apr 2015 18:01, Abhideep Chakravarty abhideep.chakrava...@mindtree.com wrote: Hi all, We are here planning to setup a Spark learning

Re: Spark streaming with Kafka- couldnt find KafkaUtils

2015-04-05 Thread Akhil Das
How are you submitting the application? Use a standard build tool like maven or sbt to build your project, it will download all the dependency jars, when you submit your application (if you are using spark-submit, then use --jars option to add those jars which are causing classNotFoundException).

Re: Spark Streaming FileStream Nested File Support

2015-04-04 Thread Akhil Das
We've a custom version/build of Spark Streaming doing the nested S3 lookups faster (uses native S3 APIs). You can find the source code over here: https://github.com/sigmoidanalytics/spark-modified, in particular the changes from here

Re: Delaying failed task retries + giving failing tasks to different nodes

2015-04-03 Thread Akhil Das
I think these are the configurations that you are looking for: spark.locality.wait: Number of milliseconds to wait to launch a data-local task before giving up and launching it on a less-local node. The same wait will be used to step through multiple locality levels (process-local,
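A small sketch of setting it; the second property is a related knob added here for illustration, not mentioned in the thread:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.locality.wait", "30000")  // wait up to 30s for a data-local slot before falling back
      .set("spark.task.maxFailures", "8")   // tolerate more per-task failures before aborting the stage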

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Akhil Das
= SchemaRDD[5] at RDD at SchemaRDD.scala:108 == Query Plan == == Physical Plan == java.lang.ClassNotFoundException: json_tuple Any other suggestions or am I doing something else wrong here? -Todd On Thu, Apr 2, 2015 at 2:00 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Try adding all

Re: Spark Streaming Worker runs out of inodes

2015-04-03 Thread Akhil Das
Did you try these? - Disable shuffle spill: spark.shuffle.spill=false - Enable log rotation: sparkConf.set("spark.executor.logs.rolling.strategy", "size").set("spark.executor.logs.rolling.size.maxBytes", "1024").set("spark.executor.logs.rolling.maxRetainedFiles", "3") Thanks Best Regards On Fri, Apr 3, 2015

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-03 Thread Akhil Das
:34 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try building Spark https://spark.apache.org/docs/1.2.0/building-spark.html#building-with-hive-and-jdbc-support%23building-with-hive-and-jdbc-support with hive support? Before that try to run the following: ./bin/spark-shell --master

Re: Which OS for Spark cluster nodes?

2015-04-03 Thread Akhil Das
There isn't any specific Linux distro, but I would prefer Ubuntu for a beginner as it's very easy to apt-get install things on it. Thanks Best Regards On Fri, Apr 3, 2015 at 4:58 PM, Horsmann, Tobias tobias.horsm...@uni-due.de wrote: Hi, Are there any recommendations for operating systems

Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Akhil Das
This thread might give you some insights http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E Thanks Best Regards On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: My Spark Job

Re: Spark Job Failed - Class not serializable

2015-04-03 Thread Akhil Das
iPhone On 03-Apr-2015, at 5:36 pm, Deepak Jain deepuj...@gmail.com wrote: I was able to write a record that extends SpecificRecord (avro); this class was not auto-generated. Do we need to do something extra for auto-generated classes? Sent from my iPhone On 03-Apr-2015, at 5:06 pm, Akhil Das ak

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-02 Thread Akhil Das
Try adding all the jars in your $HIVE/lib directory. If you want the specific jar, you could look for the jackson or json serde jars in it. Thanks Best Regards On Thu, Apr 2, 2015 at 12:49 AM, Todd Nist tsind...@gmail.com wrote: I have a feeling I'm missing a Jar that provides the support or could this

Re: SparkStreaming batch processing time question

2015-04-01 Thread Akhil Das
It will add scheduling delay for the new batch. The new batch's data will be processed after the previous batch finishes; when the delay gets too high, it will sometimes throw fetch failures, as the batch data could get removed from memory. Thanks Best Regards On Wed, Apr 1, 2015 at 11:35 AM,

Re: --driver-memory parameter doesn't work for spark-submmit on yarn?

2015-04-01 Thread Akhil Das
Once you submit the job, do a ps aux | grep spark-submit and see how much heap space is allocated to the process (the -Xmx params); if you are seeing a lower value you could try increasing it yourself with: export _JAVA_OPTIONS=-Xmx5g Thanks Best Regards On Wed, Apr 1, 2015 at 1:57 AM, Shuai

Re: Spark throws rsync: change_dir errors on startup

2015-04-01 Thread Akhil Das
Error 23 is defined as a partial transfer and might be caused by filesystem incompatibilities, such as different character sets or access control lists. In this case it could be caused by the double slashes (// at the end of sbin). You could try editing your sbin/spark-daemon.sh file, look for

Re: How to configure SparkUI to use internal ec2 ip

2015-03-31 Thread Akhil Das
You can add an internal ip to public hostname mapping in your /etc/hosts file, if your forwarding is proper then it wouldn't be a problem there after. Thanks Best Regards On Tue, Mar 31, 2015 at 9:18 AM, anny9699 anny9...@gmail.com wrote: Hi, For security reasons, we added a server between

Re: How to setup a Spark Cluter?

2015-03-31 Thread Akhil Das
It's pretty simple: pick one machine as master (say machine A), and let's call the workers B, C, and D. Login to A: - Enable passwordless authentication (ssh-keygen) - Add A's ~/.ssh/id_rsa.pub to B, C, and D's ~/.ssh/authorized_keys files - Download the spark binary (that supports your hadoop

Re: How to configure SparkUI to use internal ec2 ip

2015-03-31 Thread Akhil Das
the spark-env.sh file? Thanks! Anny On Mon, Mar 30, 2015 at 11:15 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can add an internal ip to public hostname mapping in your /etc/hosts file, if your forwarding is proper then it wouldn't be a problem there after. Thanks Best Regards On Tue

Re: java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-30 Thread Akhil Das
What happens when you do: sc.textFile("hdfs://path/to/the_file.txt") Thanks Best Regards On Mon, Mar 30, 2015 at 11:04 AM, Nick Travers n.e.trav...@gmail.com wrote: Hi List, I'm following this example here https://github.com/databricks/learning-spark/tree/master/mini-complete-example

Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread Akhil Das
Do you have enough messages in Kafka to consume? Can you make sure your Kafka setup is working with your console consumer? Also try this example https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala Thanks
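The linked DirectKafkaWordCount example boils down to something like this (the broker, topic and ssc are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    // Direct (receiver-less) stream over the chosen topic.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))
    // Classic word count over the message values.
    stream.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()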
