Re: Spark error value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]

2015-06-09 Thread Akhil Das
On Tue, Jun 9, 2015 at 1:07 PM, amit tewari amittewar...@gmail.com wrote: Thanks Akhil, as you suggested, I will have to go the keyBy() route as I need the columns intact. But will keyBy() accept multiple fields (e.g. x(0), x(1))? Thanks Amit On Tue, Jun 9, 2015 at 12:26 PM, Akhil Das ak

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Akhil Das
downloaded the application jar but not the other jars specified by "--jars". Or do I misunderstand the usage of "--jars", and the jars should already be on every worker, so the driver will not download them? Are there some useful docs? Thanks Dong Lei *From:* Akhil Das [mailto:ak

Re: Re: How to decrease the time of storing block in memory

2015-06-09 Thread Akhil Das
Regards On Tue, Jun 9, 2015 at 2:09 PM, luohui20...@sina.com wrote: Only 1 minor GC, 0.07s. Thanks & Best regards! San.Luo - Original Message - From: Akhil Das ak...@sigmoidanalytics.com To: 罗辉 luohui20...@sina.com Cc: user user@spark.apache.org Subject: Re: How

Re: Spark error value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]

2015-06-09 Thread Akhil Das
Try this way: scala> val input1 = sc.textFile("/test7").map(line => line.split(",").map(_.trim)); scala> val input2 = sc.textFile("/test8").map(line => line.split(",").map(_.trim)); scala> val input11 = input1.map(x => ((x(0) + x(1)), x(2), x(3))) scala> val input22 = input2.map(x => ((x(0) + x(1)), x(2), x(3))) scala
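A minimal sketch of why this resolves the original error (the four-column CSV layout is taken from the thread; the variable names and the choice to keep x(2)/x(3) as a tuple are illustrative assumptions): join is only defined on pair RDDs, i.e. RDD[(K, V)], so keying on the first two columns and grouping the remaining columns into the value makes join available:

    val keyed1 = input1.map(x => ((x(0), x(1)), (x(2), x(3))))  // RDD[((String, String), (String, String))]
    val keyed2 = input2.map(x => ((x(0), x(1)), (x(2), x(3))))
    val joined = keyed1.join(keyed2)                            // join compiles because keyed1/keyed2 are pair RDDs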

Re: Saving compressed textFiles from a DStream in Scala

2015-06-09 Thread Akhil Das
like this? myDStream.foreachRDD(rdd => rdd.saveAsTextFile("/sigmoid/", codec)) Thanks Best Regards On Mon, Jun 8, 2015 at 8:06 PM, Bob Corsaro rcors...@gmail.com wrote: It looks like saveAsTextFiles doesn't support the compression parameter of RDD.saveAsTextFile. Is there a way to add the
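A minimal sketch of that workaround, assuming Gzip as the codec and an illustrative output prefix (any org.apache.hadoop.io.compress.CompressionCodec subclass can be passed):

    import org.apache.hadoop.io.compress.GzipCodec

    myDStream.foreachRDD { rdd =>
      // each batch is written to its own compressed directory; the timestamp suffix is illustrative
      rdd.saveAsTextFile(s"/sigmoid/batch-${System.currentTimeMillis}", classOf[GzipCodec])
    }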

Re: How to decrease the time of storing block in memory

2015-06-09 Thread Akhil Das
May be you should check in your driver UI and see if there's any GC time involved etc. Thanks Best Regards On Mon, Jun 8, 2015 at 5:45 PM, luohui20...@sina.com wrote: hi there I am trying to decrease my app's running time in the worker node. I checked the log and found the most

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Akhil Das
Once you submit the application, you can check the driver UI (running on port 4040), Environment tab, to see whether the jars you added got shipped or not. If they were shipped and you are still getting NoClassDef exceptions, then it means you have a jar conflict, which you can resolve

Re: Driver crash at the end with InvocationTargetException when running SparkPi

2015-06-08 Thread Akhil Das
Can you look in your worker logs for a more detailed stack-trace? If it's about winutils.exe you can look at these links to get it resolved. - http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7 - https://issues.apache.org/jira/browse/SPARK-2356 Thanks Best Regards On Mon, Jun 8,

Re: spark ssh to slave

2015-06-08 Thread Akhil Das
it just lets me straight in. On Mon, Jun 8, 2015 at 11:58 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you do *ssh -v 192.168.1.16* from the Master machine and make sure it's able to log in without a password? Thanks Best Regards On Mon, Jun 8, 2015 at 2:51 PM, James King jakwebin

Re: spark ssh to slave

2015-06-08 Thread Akhil Das
Can you do *ssh -v 192.168.1.16* from the Master machine and make sure it's able to log in without a password? Thanks Best Regards On Mon, Jun 8, 2015 at 2:51 PM, James King jakwebin...@gmail.com wrote: I have two hosts 192.168.1.15 (Master) and 192.168.1.16 (Worker) These two hosts have

Re: Monitoring Spark Jobs

2015-06-07 Thread Akhil Das
It could be a CPU, IO, or network bottleneck; you need to figure out where exactly it's choking. You can use certain monitoring utilities (like top) to understand it better. Thanks Best Regards On Sun, Jun 7, 2015 at 4:07 PM, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi All, I have a Spark

Re: Accumulator map

2015-06-07 Thread Akhil Das
Another approach would be to use ZooKeeper. If you have ZooKeeper running somewhere in the cluster you can simply create a path like */dynamic-list* in it and then write objects/values to it; you can even create/access nested objects. Thanks Best Regards On Fri, Jun 5, 2015 at 7:06 PM,

Re: Spark Streaming Stuck After 10mins Issue...

2015-06-07 Thread Akhil Das
Which consumer are you using? If you can paste the complete code then maybe I can try reproducing it. Thanks Best Regards On Sun, Jun 7, 2015 at 1:53 AM, EH eas...@gmail.com wrote: And here is the Thread Dump, where it seems every worker is waiting for Executor #6 Thread 95:

Re: Setting S3 output file grantees for spark output files

2015-06-05 Thread Akhil Das
You could try adding the configuration in the spark-defaults.conf file. And once you run the application you can actually check on the driver UI (runs on 4040) Environment tab to see if the configuration is set properly. Thanks Best Regards On Thu, Jun 4, 2015 at 8:40 PM, Justin Steigel

Re: Saving calculation to single local file

2015-06-05 Thread Akhil Das
You can simply do rdd.repartition(1).saveAsTextFile(...); it might not be efficient if your output data is huge, since one task will be doing the whole writing. Thanks Best Regards On Fri, Jun 5, 2015 at 3:46 PM, marcos rebelo ole...@gmail.com wrote: Hi all I'm running spark in a single local
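A short sketch of that suggestion (the output path is illustrative; coalesce(1) is an alternative that avoids a full shuffle and is mentioned here as an assumption, not from the thread):

    val single = rdd.coalesce(1)            // or rdd.repartition(1)
    single.saveAsTextFile("/tmp/output")    // produces a single part-00000 file inside /tmp/output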

Re: StreamingListener, anyone?

2015-06-04 Thread Akhil Das
Hi Here's a working example: https://gist.github.com/akhld/b10dc491aad1a2007183 Thanks Best Regards On Wed, Jun 3, 2015 at 10:09 PM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, I've got a Spark Streaming driver job implemented and in it, I register a streaming

Re: Python Image Library and Spark

2015-06-04 Thread Akhil Das
Replace this line: img_data = sc.parallelize( list(im.getdata()) ) With: img_data = sc.parallelize( list(im.getdata()), 3 * <number of cores you have> ) Thanks Best Regards On Thu, Jun 4, 2015 at 1:57 AM, Justin Spargur jmspar...@gmail.com wrote: Hi all, I'm playing around with

Re: Adding new Spark workers on AWS EC2 - access error

2015-06-04 Thread Akhil Das
That's because you need to add the master's public key (~/.ssh/id_rsa.pub) to the newly added slaves' ~/.ssh/authorized_keys. I add slaves this way: - Launch a new instance by clicking on the slave instance and choose *launch more like this * *- *Once it's launched, ssh into it and add the master

Re: Spark Client

2015-06-03 Thread Akhil Das
is, Is there an alternate api through which a spark application can be launched which can return an exit status back to the caller as opposed to initiating JVM halt. On Wed, Jun 3, 2015 at 12:58 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Run it as a standalone application. Create an sbt project

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Akhil Das
You need to look into your executor/worker logs to see what's going on. Thanks Best Regards On Wed, Jun 3, 2015 at 12:01 PM, patcharee patcharee.thong...@uni.no wrote: Hi, What can be the cause of this ERROR cluster.YarnScheduler: Lost executor? How can I fix it? Best, Patcharee

Re: Scripting with groovy

2015-06-03 Thread Akhil Das
I think when you do an ssc.stop it will stop your entire application, and by "update a transformation function" do you mean modifying the driver program? In that case even if you checkpoint your application, it won't be able to recover from its previous state. A simpler approach would be to add certain

Re: Spark Client

2015-06-03 Thread Akhil Das
Run it as a standalone application. Create an sbt project and do sbt run? Thanks Best Regards On Wed, Jun 3, 2015 at 11:36 AM, pavan kumar Kolamuri pavan.kolam...@gmail.com wrote: Hi guys, I am new to spark. I am using spark-submit to submit spark jobs. But for my use case I don't want it

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Akhil Das
) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ... 1 more Best, Patcharee On 03 June 2015 09:21, Akhil Das wrote: You need to look into your executor/worker logs to see what's going on. Thanks Best Regards On Wed, Jun 3, 2015 at 12:01 PM, patcharee

Re: using pyspark with standalone cluster

2015-06-02 Thread Akhil Das
If you want to submit applications to a remote cluster where your port 7077 is opened publicly, then you would need to set *spark.driver.host* (with the public IP of your laptop) and *spark.driver.port* (optional, if there's no firewall between your laptop and the remote cluster). Keeping
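A hedged sketch of those two settings (shown in Scala for consistency with the other snippets here; the same properties can be set from pyspark's SparkConf, and the IPs/port are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://<cluster-public-ip>:7077")
      .set("spark.driver.host", "<your-laptop-public-ip>")  // must be reachable from the cluster nodes
      .set("spark.driver.port", "51000")                    // optional; pin it only when a firewall needs a known port
    val sc = new SparkContext(conf)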

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread Akhil Das
You could try rdd.persist(MEMORY_AND_DISK/DISK_ONLY).flatMap(...); I think StorageLevel MEMORY_AND_DISK means spark will try to keep the data in memory, and if there isn't sufficient space then it will be spilled to disk. Thanks Best Regards On Mon, Jun 1, 2015 at 11:02 PM, octavian.ganea
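A minimal sketch of that suggestion (the storage level names come from the Spark API; expensiveExpansion is a hypothetical stand-in for the poster's flatMap function and must return a collection per record):

    import org.apache.spark.storage.StorageLevel

    val persisted = rdd.persist(StorageLevel.MEMORY_AND_DISK)   // partitions spill to disk when memory runs out
    val expanded  = persisted.flatMap(record => expensiveExpansion(record))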

Re: HDFS Rest Service not available

2015-06-02 Thread Akhil Das
It says your namenode is down (connection refused on 8020); you can restart your HDFS by going into the hadoop directory and typing sbin/stop-dfs.sh and then sbin/start-dfs.sh Thanks Best Regards On Tue, Jun 2, 2015 at 5:03 AM, Su She suhsheka...@gmail.com wrote: Hello All, A bit scared I did

Re: What is shuffle read and what is shuffle write ?

2015-06-02 Thread Akhil Das
I found an interesting presentation http://www.slideshare.net/colorant/spark-shuffle-introduction and go through this thread also http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html Thanks Best Regards On Tue, Jun 2, 2015 at 3:06 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)

Re: Spark 1.3.1 bundle does not build - unresolved dependency

2015-06-02 Thread Akhil Das
You can try to skip the tests, try with: mvn -Dhadoop.version=2.4.0 -Pyarn *-DskipTests* clean package Thanks Best Regards On Tue, Jun 2, 2015 at 2:51 AM, Stephen Boesch java...@gmail.com wrote: I downloaded the 1.3.1 distro tarball $ll ../spark-1.3.1.tar.gz -rw-r-@ 1 steve staff

Re: Shared / NFS filesystems

2015-06-02 Thread Akhil Das
You can run/submit your code from one of the workers which has access to the file system, and it should be fine I think. Give it a try. Thanks Best Regards On Tue, Jun 2, 2015 at 3:22 PM, Pradyumna Achar pradyumna.ac...@gmail.com wrote: Hello! I have Spark running in standalone mode, and there

Re: How to read sequence File.

2015-06-02 Thread Akhil Das
Basically, you need to convert it to a serializable format before doing the collect/take. You can fire up a spark shell and paste this: val sFile = sc.sequenceFile[LongWritable, Text]("/home/akhld/sequence/sigmoid").map(_._2.toString) sFile.take(5).foreach(println) Use the
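The same idea as a self-contained sketch (the HDFS path is the one from the thread; the imports are added here):

    import org.apache.hadoop.io.{LongWritable, Text}

    val sFile = sc.sequenceFile[LongWritable, Text]("/home/akhld/sequence/sigmoid")
      .map(_._2.toString)            // Text is not serializable, so convert to String before collect/take
    sFile.take(5).foreach(println)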

Re: SparkSQL can't read S3 path for hive external table

2015-06-01 Thread Akhil Das
This thread http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application has various methods on accessing S3 from spark, it might help you. Thanks Best Regards On Sun, May 24, 2015 at 8:03 AM, ogoh oke...@gmail.com wrote: Hello, I am

Re: RDD boundaries and triggering processing using tags in the data

2015-06-01 Thread Akhil Das
May be you can make use of the Window operations https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#window-operations. Also, another approach would be to keep your incoming data in an HBase/Redis/Cassandra kind of database and then whenever you need to average it, you just query the

Re: Cassanda example

2015-06-01 Thread Akhil Das
Here's a more detailed documentation https://github.com/datastax/spark-cassandra-connector from Datastax. You can also shoot an email directly to their mailing list http://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user since it's more related to their code. Thanks Best

Re: Adding slaves on spark standalone on ec2

2015-05-28 Thread Akhil Das
I do it this way: - Launch a new instance by clicking on the slave instance and choose *launch more like this * *- *Once it's launched, ssh into it and add the master's public key to .ssh/authorized_keys - Add the slave's internal IP to the master's conf/slaves file - do sbin/start-all.sh and it will

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Akhil Das
MEMORY_AND_DISK_SER_2. Besides, the driver's memory is also growing. I don't think Kafka messages will be cached in driver. On Thu, May 28, 2015 at 12:24 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Are you using the createStream or createDirectStream api? If its the former, you can try setting

Re: Adding slaves on spark standalone on ec2

2015-05-28 Thread Akhil Das
aggressive? Let's say I have 20 slaves up, and I want to add one more; why should we stop the entire cluster for this? thanks, nizan On Thu, May 28, 2015 at 10:19 AM, Akhil Das ak...@sigmoidanalytics.com wrote: I do it this way: - Launch a new instance by clicking on the slave instance

Re: Get all servers in security group in bash(ec2)

2015-05-28 Thread Akhil Das
You can use the python boto library for that; in fact the spark-ec2 script uses it underneath. Here's the call https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L706 spark-ec2 is making to get all machines under a given security group. Thanks Best Regards On Thu, May 28, 2015 at 2:22 PM,

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Akhil Das
= rdd.foreachPartition { iter => iter.foreach(count => logger.info(count.toString)) } } It receives messages from Kafka, parses the JSON, filters and counts the records, and then prints them to the logs. Thanks. On Thu, May 28, 2015 at 3:07 PM, Akhil Das ak

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-27 Thread Akhil Das
After submitting the job, if you do a ps aux | grep spark-submit then you can see all the JVM params. Are you using the high-level consumer (receiver based) for receiving data from Kafka? In that case if your throughput is high and the processing delay exceeds the batch interval then you will hit this

Re: How to give multiple directories as input ?

2015-05-27 Thread Akhil Das
How about creating two RDDs and unioning them [ sc.union(first, second) ]? Thanks Best Regards On Wed, May 27, 2015 at 11:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have this piece sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](
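A minimal sketch of the union approach with plain text inputs (paths are illustrative; the poster's newAPIHadoopFile/Avro variant follows the same pattern):

    val first  = sc.textFile("hdfs:///data/dir1")
    val second = sc.textFile("hdfs:///data/dir2")
    val both   = sc.union(Seq(first, second))    // or first.union(second)

As an aside, sc.textFile also accepts a comma-separated list of paths ("hdfs:///data/dir1,hdfs:///data/dir2"), which avoids the union entirely.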

Re: Re: Re: Re: how to distributed run a bash shell in spark

2015-05-26 Thread Akhil Das
files like result1.txt,result2.txt...result21.txt. Sorry for not adding some comments for my code. Thanks & Best regards! San.Luo - Original Message - From: Akhil Das ak...@sigmoidanalytics.com To: 罗辉 luohui20...@sina.com Cc: user user@spark.apache.org Subject: Re: Re

Re: Remove COMPLETED applications and shuffle data

2015-05-26 Thread Akhil Das
Try these: - Disable shuffle spill: spark.shuffle.spill=false (it might end up in OOM) - Enable log rotation: sparkConf.set("spark.executor.logs.rolling.strategy", "size").set("spark.executor.logs.rolling.size.maxBytes", "1024").set("spark.executor.logs.rolling.maxRetainedFiles", "3") You can also look into

Re: Re: Re: how to distributed run a bash shell in spark

2015-05-25 Thread Akhil Das
Best regards! San.Luo - Original Message - From: madhu phatak phatak@gmail.com To: luohui20...@sina.com Cc: Akhil Das ak...@sigmoidanalytics.com, user user@spark.apache.org Subject: Re: Re: how to distributed run a bash shell in spark Date: 2015-05-25 14:11 Hi, You can use the pipe operator, if you

Re: Using Log4j for logging messages inside lambda functions

2015-05-25 Thread Akhil Das
Try this way: object Holder extends Serializable { @transient lazy val log = Logger.getLogger(getClass.getName) } val someRdd = spark.parallelize(List(1, 2, 3)) someRdd.map { element => Holder.log.info(s"$element will be processed") element + 1
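The same pattern as a self-contained sketch (assuming log4j, which the thread's subject names; the RDD contents are illustrative):

    import org.apache.log4j.Logger

    object Holder extends Serializable {
      // @transient + lazy: the logger is not captured in the serialized closure,
      // it is re-created on each executor the first time it is used
      @transient lazy val log: Logger = Logger.getLogger(getClass.getName)
    }

    val someRdd = sc.parallelize(List(1, 2, 3))
    val result = someRdd.map { element =>
      Holder.log.info(s"$element will be processed")
      element + 1
    }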

Re: How to use zookeeper in Spark Streaming

2015-05-25 Thread Akhil Das
If you want to notify after every batch is completed, then you can simply implement the StreamingListener https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener interface, which has methods like onBatchCompleted, onBatchStarted etc in which
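A minimal sketch of that interface (the method and field names are from the Spark Streaming API; the listener body is illustrative):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class BatchNotifier extends StreamingListener {
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val info = batchCompleted.batchInfo
        // notify whatever system you like here; println is just a placeholder
        println(s"Batch ${info.batchTime} done, processing took ${info.processingDelay.getOrElse(-1L)} ms")
      }
    }

    ssc.addStreamingListener(new BatchNotifier)   // register before ssc.start()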

Re: IPv6 support

2015-05-25 Thread Akhil Das
Hi Kevin, Did you try adding a host name for the ipv6? I have a few ipv6 boxes; spark failed for me when I used just the ipv6 addresses, but it works fine when I use the host names. Here's an entry in my /etc/hosts: 2607:5300:0100:0200::::0a4d hacked.work My spark-env.sh file:

Re: how to distributed run a bash shell in spark

2015-05-24 Thread Akhil Das
You mean you want to execute some shell commands from spark? Here's something I tried a while back. https://github.com/akhld/spark-exploit Thanks Best Regards On Sun, May 24, 2015 at 4:53 PM, luohui20...@sina.com wrote: hello there I am trying to run an app in which part of it needs to

Re: Trying to connect to many topics with several DirectConnect

2015-05-24 Thread Akhil Das
I used to hit an NPE when I didn't add all the dependency jars to my context while running it in standalone mode. Can you try adding all these dependencies to your context? sc.addJar("/home/akhld/.ivy2/cache/org.apache.spark/spark-streaming-kafka_2.10/jars/spark-streaming-kafka_2.10-1.3.1.jar")

Re: Spark Memory management

2015-05-22 Thread Akhil Das
You can look at the logic for offloading data from Memory by looking at ensureFreeSpace https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L416 call. And dropFromMemory

Re: java program Get Stuck at broadcasting

2015-05-21 Thread Akhil Das
without problem. Best Regards, Allan On 21 May 2015 at 01:30, Akhil Das ak...@sigmoidanalytics.com wrote: This is more like an issue with your HDFS setup, can you check in the datanode logs? Also try putting a new file in HDFS and see if that works. Thanks Best Regards On Wed, May 20

Re: How to set the file size for parquet Part

2015-05-21 Thread Akhil Das
How many part files do you have? Did you try re-partitioning to a smaller number so that you have fewer, bigger files? Thanks Best Regards On Wed, May 20, 2015 at 3:06 AM, Richard Grossman richie...@gmail.com wrote: Hi I'm using spark 1.3.1 and now I can't set the size of

Re: Read multiple files from S3

2015-05-21 Thread Akhil Das
textFile does read all files in a directory. We have modified the sparkstreaming code base to read nested files from S3; you can check this function

Re: rdd.saveAsTextFile problem

2015-05-21 Thread Akhil Das
This thread happened a year back; can you please share what issue you are facing? Which version of spark are you using? What is your system environment? Exception stack-trace? Thanks Best Regards On Thu, May 21, 2015 at 12:19 PM, Keerthi keerthi.reddy1...@gmail.com wrote: Hi , I had tried

Re: rdd.saveAsTextFile problem

2015-05-21 Thread Akhil Das
* and edit the *Path* variable to add the *bin* directory of *HADOOP_HOME* (say *C:\hadoop\bin*). This fixed the issue in my env 2015-05-21 9:55 GMT+03:00 Akhil Das ak...@sigmoidanalytics.com: This thread happened a year back; can you please share what issue you are facing? which version of spark you are using

Re: java program got Stuck at broadcasting

2015-05-21 Thread Akhil Das
Can you try commenting out the saveAsTextFile and doing a simple count()? If it's a broadcast issue, then it would throw up the same error. On 21 May 2015 14:21, allanjie allanmcgr...@gmail.com wrote: Sure, the code is very simple. I think you guys can understand it from the main function. public class

Re: spark streaming doubt

2015-05-20 Thread Akhil Das
-packages.org/package/dibbhatt/kafka-spark-consumer Regards, Dibyendu On Tue, May 19, 2015 at 9:00 PM, Akhil Das ak...@sigmoidanalytics.com wrote: On Tue, May 19, 2015 at 8:10 PM, Shushant Arora shushantaror...@gmail.com wrote: So for Kafka+spark streaming, Receiver based streaming used

Re: Reading Binary files in Spark program

2015-05-20 Thread Akhil Das
of JavaPairRDD is as expected. It is when we call the collect() or toArray() methods that the exception comes, something to do with the Text class even though I haven't used it in the program. Regards Tapan On Tue, May 19, 2015 at 6:26 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Try something

Re: spark streaming doubt

2015-05-20 Thread Akhil Das
the rate with spark.streaming.kafka.maxRatePerPartition)​ Read more here https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md​ On Wed, May 20, 2015 at 12:36 PM, Akhil Das ak...@sigmoidanalytics.com wrote: One receiver basically runs on 1 core, so if your single node is having

Re: Reading Binary files in Spark program

2015-05-20 Thread Akhil Das
-files Regards Tapan On Wed, May 20, 2015 at 12:42 PM, Akhil Das ak...@sigmoidanalytics.com wrote: If you can share the complete code and a sample file, may be i can try to reproduce it on my end. Thanks Best Regards On Wed, May 20, 2015 at 7:00 AM, Tapan Sharma tapan.sha

Re: java program Get Stuck at broadcasting

2015-05-20 Thread Akhil Das
This is more like an issue with your HDFS setup; can you check in the datanode logs? Also try putting a new file in HDFS and see if that works. Thanks Best Regards On Wed, May 20, 2015 at 11:47 AM, allanjie allanmcgr...@gmail.com wrote: Hi All, The variable I need to broadcast is just 468

Re: Spark users

2015-05-20 Thread Akhil Das
Yes, this is the user group. Feel free to ask your questions in this list. Thanks Best Regards On Wed, May 20, 2015 at 5:58 AM, Ricardo Goncalves da Silva ricardog.si...@telefonica.com wrote: Hi I'm learning spark focused on data and machine learning. Migrating from SAS. There is a group

Re: TwitterUtils on Windows

2015-05-19 Thread Akhil Das
Hi Justin, Can you try with sbt, may be that will help. - Install sbt for windows http://www.scala-sbt.org/0.13/tutorial/Installing-sbt-on-Windows.html - Create a lib directory in your project directory - Place these jars in it: - spark-streaming-twitter_2.10-1.3.1.jar -

Re: group by and distinct performance issue

2015-05-19 Thread Akhil Das
Hi Peer, If you open the driver UI (running on port 4040) you can see the stages and the tasks happening inside it. The best way to identify the bottleneck for a stage is to see if there's any time spent on GC, and how many tasks there are per stage (it should be a number at least the total # of cores to achieve

Re: org.apache.spark.shuffle.FetchFailedException :: Migration from Spark 1.2 to 1.3

2015-05-19 Thread Akhil Das
There were some similar discussion happened on JIRA https://issues.apache.org/jira/browse/SPARK-3633 may be that will give you some insights. Thanks Best Regards On Mon, May 18, 2015 at 10:49 PM, zia_kayani zia.kay...@platalytics.com wrote: Hi, I'm getting this exception after shifting my code

Re: spark streaming doubt

2015-05-19 Thread Akhil Das
It will be a single job running at a time by default (you can also configure spark.streaming.concurrentJobs to run jobs in parallel, which is not recommended for production). Now, with your batch duration being 1 sec and processing time being 2 minutes, if you are using a receiver based

Re: spark streaming doubt

2015-05-19 Thread Akhil Das
not be started at its desired interval. And what's the difference and usage of receiver vs non-receiver based streaming? Is there any documentation for that? On Tue, May 19, 2015 at 1:35 PM, Akhil Das ak...@sigmoidanalytics.com wrote: It will be a single job running at a time by default (you can also

Re: Reading Real Time Data only from Kafka

2015-05-19 Thread Akhil Das
in result. Deciding where to save offsets (or not) is up to you. You can checkpoint, or store them yourself. On Mon, May 18, 2015 at 12:00 PM, Akhil Das ak...@sigmoidanalytics.com wrote: I have played a bit with the directStream kafka api. Good work cody. These are my findings and also can you

Re: spark streaming doubt

2015-05-19 Thread Akhil Das
specify the number of receiver that you want to spawn for consuming the messages. On Tue, May 19, 2015 at 2:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: spark.streaming.concurrentJobs takes an integer value, not boolean. If you set it as 2 then 2 jobs will run parallel. Default value

Re: Reading Binary files in Spark program

2015-05-19 Thread Akhil Das
Try something like: JavaPairRDD<IntWritable, Text> output = sc.newAPIHadoopFile(inputDir, org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class, IntWritable.class, Text.class, new Job().getConfiguration()); With the type of input format that you require. Thanks Best

Re: Spark Streaming and reducing latency

2015-05-18 Thread Akhil Das
assure you that at least as of Spark Streaming 1.2.0, as Evo says, Spark Streaming DOES crash in an "unceremonious way" when the free RAM available for in-memory cached RDDs gets exhausted *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com] *Sent:* Monday, May 18, 2015 2:03 PM *To:* Evo Eftimov

Re: Spark Streaming and reducing latency

2015-05-18 Thread Akhil Das
of the condition is: Loss was due to java.lang.Exception java.lang.Exception: *Could not compute split, block* *input-4-1410542878200 not found* *From:* Evo Eftimov [mailto:evo.efti...@isecc.com] *Sent:* Monday, May 18, 2015 12:13 PM *To:* 'Dmitry Goldenberg'; 'Akhil Das' *Cc:* 'user@spark.apache.org

Re: Reading Real Time Data only from Kafka

2015-05-18 Thread Akhil Das
are processed in order (and offsets commit in order), etc. So whoever uses whichever consumer needs to study the pros and cons of both approaches before taking a call. Regards, Dibyendu On Tue, May 12, 2015 at 8:10 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Cody, I was just

RE: Spark Streaming and reducing latency

2015-05-18 Thread Akhil Das
Streaming does "NOT" crash UNCEREMONIOUSLY – please maintain responsible and objective communication and facts *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com] *Sent:* Monday, May 18, 2015 2:28 PM *To:* Evo Eftimov *Cc:* Dmitry Goldenberg; user@spark.apache.org *Subject:* Re: Spark

Re: Spark streaming over a rest API

2015-05-18 Thread Akhil Das
Why not use sparkstreaming to do the computation and dump the result somewhere in a DB perhaps and take it from there? Thanks Best Regards On Mon, May 18, 2015 at 7:51 PM, juandasgandaras juandasganda...@gmail.com wrote: Hello, I would like to use spark streaming over a REST api to get

Re: number of executors

2015-05-17 Thread Akhil Das
Did you try --executor-cores param? While you submit the job, do a ps aux | grep spark-submit and see the exact command parameters. Thanks Best Regards On Sat, May 16, 2015 at 12:31 PM, xiaohe lan zombiexco...@gmail.com wrote: Hi, I have a 5 nodes yarn cluster, I used spark-submit to submit
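For reference, a typical invocation with those flags might look like the following (flag names are standard spark-submit options; the values and the class/jar names are placeholders): spark-submit --master yarn-client --num-executors 5 --executor-cores 4 --executor-memory 4g --class com.example.MyApp myapp.jar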

Re: Forbidded : Error Code: 403

2015-05-17 Thread Akhil Das
I think you can try this way also: DataFrame df = sqlContext.load("s3n://ACCESS-KEY:SECRET-KEY@bucket-name/file.avro", "com.databricks.spark.avro"); Thanks Best Regards On Sat, May 16, 2015 at 2:02 AM, Mohammad Tariq donta...@gmail.com wrote: Thanks for the suggestion Steve. I'll try that out.

Re: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-17 Thread Akhil Das
Why not just trigger your batch job with that event? If you really need streaming, then you can create a custom receiver and make the receiver sleep till the event has happened. That will obviously run your streaming pipelines without having any data to process. Thanks Best Regards On Fri, May

Re: textFileStream Question

2015-05-17 Thread Akhil Das
With the file timestamp. You can actually see the finding-new-files logic here https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L172 Thanks Best Regards On Fri, May 15, 2015 at 2:25 AM, Vadim Bichutskiy

Re: Spark Streaming and reducing latency

2015-05-17 Thread Akhil Das
With receiver based streaming, you can actually specify spark.streaming.blockInterval which is the interval at which the receiver will fetch data from the source. Default value is 200ms and hence if your batch duration is 1 second, it will produce 5 blocks of data. And yes, with sparkstreaming
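A small sketch of that relationship (the property name is spark.streaming.blockInterval; the app name, master, and numbers are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("blocks")
      .setMaster("local[2]")
      .set("spark.streaming.blockInterval", "200")   // milliseconds
    val ssc = new StreamingContext(conf, Seconds(1))
    // with a 1 s batch and a 200 ms block interval, each receiver produces
    // roughly 1000 / 200 = 5 blocks (and hence up to 5 tasks) per batch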

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Akhil Das
What do you mean by not detected? Maybe you forgot to trigger some action on the stream to get it executed. Like: val list_join_action_stream = ssc.fileStream[LongWritable, Text, TextInputFormat](gc.input_dir, (t: Path) => true, false).map(_._2.toString) list_join_action_stream.count().print()

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-14 Thread Akhil Das
Did you happen to have a look at the spark job server? https://github.com/ooyala/spark-jobserver Someone wrote a python wrapper https://github.com/wangqiang8511/spark_job_manager around it; give it a try. Thanks Best Regards On Thu, May 14, 2015 at 11:10 AM, MEETHU MATHEW

Re: spark-streaming whit flume error

2015-05-14 Thread Akhil Das
Can you share the client code that you used to send the data? May be this discussion would give you some insights http://apache-avro.679487.n3.nabble.com/Avro-RPC-Python-to-Java-isn-t-working-for-me-td4027454.html Thanks Best Regards On Thu, May 14, 2015 at 8:44 AM, 鹰 980548...@qq.com wrote:

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Akhil Das
at 1:04 PM, lisendong lisend...@163.com wrote: I have an action on the DStream, because when I put a text file into the hdfs, it runs normally, but if I put a lz4 file, it does nothing. On 14 May 2015, at 3:32 PM, Akhil Das ak...@sigmoidanalytics.com wrote: What do you mean by not detected? Maybe you forgot

Re: Unsubscribe

2015-05-14 Thread Akhil Das
Have a look https://spark.apache.org/community.html Send an email to user-unsubscr...@spark.apache.org Thanks Best Regards On Thu, May 14, 2015 at 1:08 PM, Saurabh Agrawal saurabh.agra...@markit.com wrote: How do I unsubscribe from this mailing list please? Thanks!! Regards,

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Akhil Das
: LzoTextInputFormat where is this class? What is the maven dependency? On 14 May 2015, at 3:40 PM, Akhil Das ak...@sigmoidanalytics.com wrote: That's because you are using TextInputFormat I think, try with LzoTextInputFormat like: val list_join_action_stream = ssc.fileStream[LongWritable, Text

Re: force the kafka consumer process to different machines

2015-05-13 Thread Akhil Das
With this lowlevel Kafka API https://github.com/dibbhatt/kafka-spark-consumer/, you can actually specify how many receivers you want to spawn, and most of the time it spawns them evenly; usually you can put a sleep just after creating the context so the executors can connect to the driver, and then

Re: How to speed up data ingestion with Spark

2015-05-12 Thread Akhil Das
This article http://www.virdata.com/tuning-spark/ gives you a pretty good start on the Spark streaming side. And this article https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines is for kafka; it has a nice explanation of how message size and

Re: how to load some of the files in a dir and monitor new file in that dir in spark streaming without missing?

2015-05-12 Thread Akhil Das
I believe fileStream would pickup the new files (may be you should increase the batch duration). You can see the implementation details for finding new files from here

Re: Spark and RabbitMQ

2015-05-12 Thread Akhil Das
I found two examples Java version https://github.com/deepakkashyap/Spark-Streaming-with-RabbitMQ-/blob/master/example/Spark_project/CustomReceiver.java, and Scala version. https://github.com/d1eg0/spark-streaming-toy Thanks Best Regards On Tue, May 12, 2015 at 2:31 AM, dgoldenberg

Re: Master HA

2015-05-12 Thread Akhil Das
Mesos has a HA option (of course it includes zookeeper) Thanks Best Regards On Tue, May 12, 2015 at 4:53 PM, James King jakwebin...@gmail.com wrote: I know that it is possible to use Zookeeper and File System (not for production use) to achieve HA. Are there any other options now or in the

Re: TwitterPopularTags Long Processing Delay

2015-05-12 Thread Akhil Das
Are you using checkpointing/WAL etc? If yes, then it could be blocking on disk IO. Thanks Best Regards On Mon, May 11, 2015 at 10:33 PM, Seyed Majid Zahedi zah...@cs.duke.edu wrote: Hi, I'm running TwitterPopularTags.scala on a single node. Everything works fine for a while (about 30min),

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Akhil Das
Yep, you can try this lowlevel Kafka receiver https://github.com/dibbhatt/kafka-spark-consumer. Its much more flexible/reliable than the one comes with Spark. Thanks Best Regards On Tue, May 12, 2015 at 5:15 PM, James King jakwebin...@gmail.com wrote: What I want is if the driver dies for some

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Akhil Das
before that, only took it down to change code. http://tinypic.com/r/2e4vkht/8 Regarding flexibility, both of the apis available in spark will do what James needs, as I described. On Tue, May 12, 2015 at 8:55 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Cody, If you are so sure

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Akhil Das
wrote: Very nice! will try and let you know, thanks. On Tue, May 12, 2015 at 2:25 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Yep, you can try this lowlevel Kafka receiver https://github.com/dibbhatt/kafka-spark-consumer. Its much more flexible/reliable than the one comes with Spark

Re: Is it possible to set the akka specify properties (akka.extensions) in spark

2015-05-11 Thread Akhil Das
Try sparkConf.set("spark.akka.extensions", "Whatever"); underneath I think spark won't ship properties which don't start with spark.* to the executors. Thanks Best Regards On Mon, May 11, 2015 at 8:33 AM, Terry Hole hujie.ea...@gmail.com wrote: Hi all, I'd like to monitor the akka using kamon,

Re: Cassandra number of Tasks

2015-05-11 Thread Akhil Das
Did you try repartitioning? You might end up with a lot of time spent on GC though. Thanks Best Regards On Fri, May 8, 2015 at 11:59 PM, Vijay Pawnarkar vijaypawnar...@gmail.com wrote: I am using the Spark Cassandra connector to work with a table with 3 million records. Using .where() API

Re: EVent generation

2015-05-11 Thread Akhil Das
Have a look over here https://storm.apache.org/community.html Thanks Best Regards On Sun, May 10, 2015 at 3:21 PM, anshu shukla anshushuk...@gmail.com wrote: http://stackoverflow.com/questions/30149868/generate-events-tuples-using-csv-file-with-timestamps -- Thanks Regards, Anshu Shukla

Re: AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-08 Thread Akhil Das
Have a look at this SO http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application question, it has discussion on various ways of accessing S3. Thanks Best Regards On Fri, May 8, 2015 at 1:21 AM, in4maniac sa...@skimlinks.com wrote: Hi
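One of the approaches discussed in that thread, as a hedged sketch (the property names are the standard Hadoop s3n keys; the bucket path and credentials are placeholders):

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
    val lines = sc.textFile("s3n://your-bucket/path/to/input")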

Re: Master node memory usage question

2015-05-08 Thread Akhil Das
What's your use case and what are you trying to achieve? Maybe there's a better way of doing it. Thanks Best Regards On Fri, May 8, 2015 at 10:20 AM, Richard Alex Hofer rho...@andrew.cmu.edu wrote: Hi, I'm working on a project in Spark and am trying to understand what's going on. Right now to

Re: (no subject)

2015-05-08 Thread Akhil Das
Since it's loading 24 records, it could be that your CSV is corrupted (maybe the newline char isn't \n but \r\n if it comes from a Windows environment; you can check this with *cat -v yourcsvfile.csv | more*). Thanks Best Regards On Fri, May 8, 2015 at 11:23 AM, luohui20...@sina.com wrote:
