Re: Tools to manage workflows on Spark

2015-02-28 Thread Mayur Rustagi
Sorry, not really. Spork is a way to migrate your existing Pig scripts to Spark, or to write new Pig jobs that can execute on Spark. For orchestration you are better off using Oozie, especially if you are using other execution engines/systems besides Spark. Regards, Mayur Rustagi Ph: +1 (760) 203 3257

Re: Tools to manage workflows on Spark

2015-02-28 Thread Mayur Rustagi
We do maintain it, but in the Apache repo itself. However, Pig cannot do orchestration for you. I am not sure what you are looking for from Pig in this context. Regards, Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoid.com http://www.sigmoidanalytics.com/ @mayur_rustagi http://www.twitter.com

Re: Can spark job server be used to visualize streaming data?

2015-02-13 Thread Mayur Rustagi
Frankly, there is no good/standard way to visualize streaming data. So far I have found HBase to be a good intermediate store for data from streams; visualize it with a Play-based framework and d3.js. Regards Mayur On Fri Feb 13 2015 at 4:22:58 PM Kevin (Sangwoo) Kim kevin...@apache.org wrote: I'm not very

Re: Modifying an RDD in forEach

2014-12-06 Thread Mayur Rustagi
You'll benefit from viewing Matei's talk at Yahoo on Spark internals and how it optimizes execution of iterative jobs. The simple answer is: 1. Spark doesn't materialize an RDD when you do an iteration, but lazily captures the transformation functions in the RDD (only the function and closure, no data operation

Re: Joined RDD

2014-11-13 Thread Mayur Rustagi
First of all, any action is only performed when you trigger a collect. When you trigger collect, at that point it retrieves data from disk, joins the datasets together, and delivers it to you. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com

Re: Communication between Driver and Executors

2014-11-13 Thread Mayur Rustagi
heavily; in a Spark-native application it's rarely required. Do you have a use case like that? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Nov 14, 2014 at 10:28 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi

Re: flatMap followed by mapPartitions

2014-11-12 Thread Mayur Rustagi
flatMap would have to shuffle data only if the output RDD is expected to be partitioned by some key. RDD[X].flatMap(X => RDD[Y]) If it doesn't have to shuffle it should be local. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Nov 13

Re: Using partitioning to speed up queries in Shark

2014-11-07 Thread Mayur Rustagi
- dev list, + user list Shark is not officially supported anymore, so you are better off moving to Spark SQL. Shark doesn't support Hive partitioning logic anyway; it has its own version of partitioning on in-memory blocks, but that is independent of whether you partition your data in Hive or not. Mayur

Re: Why RDD is not cached?

2014-10-28 Thread Mayur Rustagi
What is the partition count of the RDD? It's possible that you don't have enough memory to store the whole RDD on a single machine. Can you try forcibly repartitioning the RDD and then caching? Regards Mayur On Tue Oct 28 2014 at 1:19:09 AM shahab shahab.mok...@gmail.com wrote: I used Cache
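A minimal sketch of what "forcibly repartitioning and then caching" could look like; the path, partition count, and storage level here are illustrative assumptions, not from the thread:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Spread the data across more, smaller partitions so each fits in one
// executor's memory, then cache and force materialization with count().
def repartitionAndCache(sc: SparkContext, path: String) = {
  val rdd = sc.textFile(path).repartition(64) // 64 is an arbitrary example
  rdd.persist(StorageLevel.MEMORY_AND_DISK)
  rdd.count() // triggers a job so the cache is actually populated
  rdd
}
```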

Re: input split size

2014-10-18 Thread Mayur Rustagi
Does it retain the order if it's pulling from the hdfs blocks? Meaning, if file1 = a, b, c partitions in order and I convert to a 2-partition read, will it map to ab, c or a, bc, or can it also be a, cb? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https

Re: rule engine based on spark

2014-10-14 Thread Mayur Rustagi
We are developing something similar on top of Streaming. Could you detail some of the rule functionality you are looking for? We are developing a DSL for data processing on top of streaming as well as static data, enabled on Spark. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http

Re: JavaPairDStream saveAsTextFile

2014-10-09 Thread Mayur Rustagi
That's a cryptic way to say there should be a Jira for it :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Oct 9, 2014 at 11:46 AM, Sean Owen so...@cloudera.com wrote: Yeah it's not there. I imagine it was simply

Re: Setup/Cleanup for RDD closures?

2014-10-03 Thread Mayur Rustagi
The current approach is to use mapPartitions: initialize the connection at the beginning, iterate through the data, and close off the connector. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Oct 3, 2014 at 10:16 AM, Stephen
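The mapPartitions pattern described above can be sketched as follows; `openConnection` and its `lookup`/`close` methods are hypothetical stand-ins for a real connector:

```scala
// Setup/cleanup once per partition rather than once per record.
rdd.mapPartitions { records =>
  val conn = openConnection()            // setup: one connection per partition
  // Materialize results before closing: map on an iterator is lazy and
  // would otherwise run after conn.close().
  val results = records.map(r => conn.lookup(r)).toList
  conn.close()                           // cleanup
  results.iterator
}
```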

Re: Spark Streaming for time consuming job

2014-10-01 Thread Mayur Rustagi
on it. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Sep 30, 2014 at 3:22 PM, Eko Susilo eko.harmawan.sus...@gmail.com wrote: Hi All, I have a problem that i would like to consult about spark streaming. I have

Re: Processing multiple request in cluster

2014-09-25 Thread Mayur Rustagi
For 2, you can use the fair scheduler so that application tasks can be scheduled more fairly. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Sep 25, 2014 at 12:32 PM, Akhil Das ak...@sigmoidanalytics.com

Re: Serving data

2014-09-13 Thread Mayur Rustagi
You can cache data in memory and query it using Spark Job Server.  Most folks dump data down to a queue/db for retrieval.  You can batch up data and store it into parquet partitions as well, and query it using another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1, I believe.  -- Regards, Mayur

Re: single worker vs multiple workers on each machine

2014-09-12 Thread Mayur Rustagi
Another aspect to keep in mind is that a JVM above 8-10GB starts to misbehave. It is typically better to split up into ~15GB intervals; if you are choosing machines, 10GB/core is an approximate ratio to maintain. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com

Re: Network requirements between Driver, Master, and Slave

2014-09-12 Thread Mayur Rustagi
into the embedded driver model in YARN, where the driver will also run inside the cluster, hence reliability and connectivity are a given.  -- Regards, Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, Sep 12, 2014 at 6:46 PM, Jim Carroll jimfcarr...@gmail.com wrote: Hi

Re: Spark caching questions

2014-09-10 Thread Mayur Rustagi
Cached RDDs do not survive SparkContext deletion (they are scoped on a per-SparkContext basis). I am not sure what you mean by disk-based cache eviction; if you cache more RDDs than you have disk space for, the result will not be very pretty :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-10 Thread Mayur Rustagi
I think she is checking for blanks? But if the RDD is blank then nothing will happen, no db connections etc. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Sep 8, 2014 at 1:32 PM, Tobias Pfeiffer t...@preferred.jp

Re: Records - Input Byte

2014-09-09 Thread Mayur Rustagi
What do you mean by “control your input”? Are you trying to pace your Spark Streaming by number of words? If so, that is not supported as of now; you can only control the time, and consume all files within that time period.  -- Regards, Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-07 Thread Mayur Rustagi
updates to mysql may cause data corruption. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sun, Sep 7, 2014 at 11:54 AM, jchen jc...@pivotal.io wrote: Hi, Has someone tried using Spark Streaming with MySQL

Re: Q: About scenarios where driver execution flow may block...

2014-09-07 Thread Mayur Rustagi
is not expected to alter anything apart from the RDD it is created upon, hence Spark may not realize this dependency and may try to parallelize the two operations, causing errors. Bottom line: as long as you make all your dependencies explicit in RDDs, Spark will take care of the magic. Mayur Rustagi Ph: +1 (760

Re: Array and RDDs

2014-09-07 Thread Mayur Rustagi
also create a (node, bytearray) combo and join the two RDDs together. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sat, Sep 6, 2014 at 10:51 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have an input file which

Re: how to choose right DStream batch interval

2014-09-07 Thread Mayur Rustagi
with 5 sec processing; the alternative is to process data in two pipelines (.5 and 5) in two Spark Streaming jobs and overwrite the results of one with the other. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sat, Sep 6, 2014 at 12:39 AM

Update on Pig on Spark initiative

2014-08-27 Thread Mayur Rustagi
(Sigmoid Analytics) Not to mention the Spark and Pig communities. Regards Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

Re: Spark Streaming Output to DB

2014-08-26 Thread Mayur Rustagi
I would suggest you use the JDBC connector in mapPartitions instead of map, as JDBC connections are costly and can really impact your performance. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Aug 26, 2014 at 6:45 PM
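A sketch of writing a DStream through one JDBC connection per partition instead of one per record; the JDBC URL, table, and record shape are hypothetical:

```scala
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One connection and one batched statement per partition.
    val conn = java.sql.DriverManager.getConnection("jdbc:mysql://host/db") // hypothetical URL
    val stmt = conn.prepareStatement("INSERT INTO events(value) VALUES (?)")
    records.foreach { r =>
      stmt.setString(1, r.toString)
      stmt.addBatch()
    }
    stmt.executeBatch() // flush the batch once per partition
    conn.close()
  }
}
```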

Re: DStream start a separate DStream

2014-08-22 Thread Mayur Rustagi
Why don't you directly use the DStream created as output of the windowing process? Any reason? Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Aug 21, 2014 at 8:38 PM, Josh J joshjd...@gmail.com wrote: Hi, I

Re: Mapping with extra arguments

2014-08-21 Thread Mayur Rustagi
: Int, number: Int) : Int = { if (number == 1) return accumulator else factorialWithAccumulator(accumulator * number, number - 1) } factorialWithAccumulator(1, number) } MyRDD.map(factorial(5)) Mayur Rustagi Ph: +1 (760) 203 3257 http
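The truncated snippet above boils down to passing an extra argument through a curried function so that `map` still receives a one-argument function. A self-contained sketch (names and values are illustrative):

```scala
// The extra argument `factor` is fixed up front; the closure captures it,
// so map still sees a plain Int => Int function.
def multiply(factor: Int)(x: Int): Int = x * factor

val rdd = sc.parallelize(1 to 5)
val scaled = rdd.map(multiply(10) _) // each element is multiplied by 10
```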

Re: DStream cannot write to text file

2014-08-21 Thread Mayur Rustagi
provide the full path of where to write (like hdfs:// etc.) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Aug 21, 2014 at 8:29 AM, cuongpham92 cuongpha...@gmail.com wrote: Hi, I tried to write to text file from

Re: Accessing to elements in JavaDStream

2014-08-21 Thread Mayur Rustagi
transform your way :) MyDStream.transform(rdd => rdd.map(wordChanger)) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Aug 20, 2014 at 1:25 PM, cuongpham92 cuongpha...@gmail.com wrote: Hi, I am a newbie to Spark
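Spelled out a little more, assuming `wordChanger` is some `String => String` function defined elsewhere:

```scala
// transform lets you apply arbitrary RDD operations to each batch RDD
// of the DStream, element access included.
val changed = myDStream.transform { rdd =>
  rdd.map(wordChanger) // runs wordChanger on every element of the batch
}
```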

Re: spark - reading hfds files every 5 minutes

2014-08-21 Thread Mayur Rustagi
) teenagers.saveAsParquetFile("people.parquet") }) You can also try the insertInto API instead of registerAsTable, but I haven't used it myself... also you need to dynamically change the parquet file name for every DStream... Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

Re: DStream cannot write to text file

2014-08-21 Thread Mayur Rustagi
Is your HDFS running, and can Spark access it? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Aug 21, 2014 at 1:15 PM, cuongpham92 cuongpha...@gmail.com wrote: I'm sorry, I just forgot /data after hdfs://localhost

Re: Question regarding spark data partition and coalesce. Need info on my use case.

2014-08-16 Thread Mayur Rustagi
the task overhead of scheduling so many tasks mostly kills the performance. import org.apache.spark.RangePartitioner; var file = sc.textFile("my local path") var partitionedFile = file.map(x => (x, 1)) var data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile)) Mayur Rustagi Ph: +1 (760) 203

Re: Shared variable in Spark Streaming

2014-08-08 Thread Mayur Rustagi
You can also use the updateStateByKey interface to store this shared variable. As for the count, you can use foreachRDD to run counts on the RDD and then store that as another RDD, or put it in updateStateByKey. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com
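A sketch of the updateStateByKey suggestion, assuming a DStream of (key, count) pairs named `pairs` and a checkpoint directory already set on the streaming context:

```scala
// Keep a running total per key across batches; the Option state is None
// the first time a key is seen.
val runningCounts = pairs.updateStateByKey[Int] {
  (newValues: Seq[Int], state: Option[Int]) =>
    Some(state.getOrElse(0) + newValues.sum)
}
```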

Re: Low Performance of Shark over Spark.

2014-08-08 Thread Mayur Rustagi
predicates smartly and hence get better performance (similar to Impala) 2. cache data at a partition level from Hive and operate on those instead. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Aug 8, 2014

Re: Configuration setup and Connection refused

2014-08-05 Thread Mayur Rustagi
that spark cannot access Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Aug 5, 2014 at 11:33 PM, alamin.ishak alamin.is...@gmail.com wrote: Hi, Anyone? Any input would be much appreciated Thanks, Amin On 5 Aug 2014

Re: Configuration setup and Connection refused

2014-08-05 Thread Mayur Rustagi
Then don't specify hdfs when you read the file. Also, the community is quite active in responding in general, just be a little patient. Also, if possible, look at the Spark training videos from Spark Summit 2014 and/or AMPLab's training on the Spark website. Mayur Rustagi Ph: +1 (760) 203 3257 http

Re: Installing Spark 0.9.1 on EMR Cluster

2014-08-01 Thread Mayur Rustagi
Have you tried https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 ? There is also a 0.9.1 version they talked about in one of the meetups. Check out the s3 bucket in the guide; it should have a 0.9.1 version as well. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

Re: How to read from OpenTSDB using PySpark (or Scala Spark)?

2014-08-01 Thread Mayur Rustagi
What is the use case you are looking at? TSDB is not designed for you to query data directly from HBase; ideally you should use the REST API if you are looking to do thin analysis. Are you looking to do whole reprocessing of TSDB? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

Re: RDD to DStream

2014-08-01 Thread Mayur Rustagi
into the folder; it's a hack, but much less headache. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Aug 1, 2014 at 10:21 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi everyone I haven't been receiving

Re: Accumulator and Accumulable vs classic MR

2014-08-01 Thread Mayur Rustagi
The only blocker is that an accumulator can only be added to from the slaves and only read on the master. If that constraint fits you well, you can fire away. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Aug 1, 2014 at 7:38 AM, Julien
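The add-only-on-slaves / read-only-on-master constraint looks like this in the 1.x accumulator API; `isError` is a hypothetical predicate:

```scala
val errorCount = sc.accumulator(0)
rdd.foreach { record =>
  if (isError(record)) errorCount += 1 // tasks can only add to it
}
println(errorCount.value)              // only the driver can read the value
```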

Re: How to read from OpenTSDB using PySpark (or Scala Spark)?

2014-08-01 Thread Mayur Rustagi
The HTTP API would be the best bet; I assume by graph you mean the charts created by TSDB frontends. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Aug 1, 2014 at 4:48 PM, bumble123 tc1...@att.com wrote: I'm trying

Re: The function of ClosureCleaner.clean

2014-07-28 Thread Mayur Rustagi
objects inside the class, so you may want to send across those objects but not the whole parent class. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Jul 28, 2014 at 8:28 PM, Wang, Jensen jensen.w...@sap.com wrote

Spark as a application library vs infra

2014-07-27 Thread Mayur Rustagi
Based on some discussions with my application users, I have been trying to come up with a standard way to deploy applications built on Spark: 1. Bundle the version of Spark with your application and ask users to store it in HDFS before referring to it in YARN to boot your application 2. Provide ways

Re: persistent HDFS instance for cluster restarts/destroys

2014-07-23 Thread Mayur Rustagi
Yes, you lose the data. You can add machines, but it will require you to restart the cluster. Also, adding is manual; you add the nodes yourself. Regards Mayur On Wednesday, July 23, 2014, durga durgak...@gmail.com wrote: Hi All, I have a question, For my company , we are planning to use spark-ec2 scripts to

Re: What if there are large, read-only variables shared by all map functions?

2014-07-23 Thread Mayur Rustagi
Have a look at broadcast variables. On Tuesday, July 22, 2014, Parthus peng.wei@gmail.com wrote: Hi there, I was wondering if anybody could help me find an efficient way to make a MapReduce program like this: 1) For each map function, it need access some huge files, which is around
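A minimal broadcast-variable sketch for the large read-only data case; here `bigLookup` stands in for the large structure, loaded once on the driver:

```scala
// Ship the large read-only structure to each node once, instead of
// serializing it with every task's closure.
val bc = sc.broadcast(bigLookup) // bigLookup: e.g. a large Map[String, String]
val enriched = rdd.map { key =>
  (key, bc.value.getOrElse(key, "unknown")) // read-only access on workers
}
```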

Re: Filtering data during the read

2014-07-09 Thread Mayur Rustagi
Hi, Spark does that out of the box for you :) It compresses down the execution steps as much as possible. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jul 9, 2014 at 3:15 PM, Konstantin Kudryavtsev

Re: Spark job tracker.

2014-07-09 Thread Mayur Rustagi
var sem = 0 sc.addSparkListener(new SparkListener { override def onTaskStart(taskStart: SparkListenerTaskStart) { sem += 1 } }) sc = spark context Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

Re: Pig 0.13, Spark, Spork

2014-07-07 Thread Mayur Rustagi
Hi, We have fixed many major issues around Spork and are deploying it with some customers. Would be happy to provide a working version to you to try out. We are looking for more folks to try it out and submit bugs. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

Re: Pig 0.13, Spark, Spork

2014-07-07 Thread Mayur Rustagi
That version is old :). We are not forking Pig but cleanly separating out the Pig execution engine. Let me know if you are willing to give it a go. Also, would love to know what features of Pig you are using. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

Re: Is the order of messages guaranteed in a DStream?

2014-07-07 Thread Mayur Rustagi
If you receive data through multiple receivers across the cluster, I don't think any order can be guaranteed. Order in distributed systems is tough. On Tuesday, July 8, 2014, Yan Fang yanfang...@gmail.com wrote: I know the order of processing DStream is guaranteed. Wondering if the order of

Re: window analysis with Spark and Spark streaming

2014-07-05 Thread Mayur Rustagi
The key idea is to simulate your app time as you enter data. So you can connect Spark Streaming to a queue and insert data into it spaced by time. Easier said than done :). What are the parallelism issues you are hitting with your static approach? On Friday, July 4, 2014, alessandro finamore

Re: Spark memory optimization

2014-07-04 Thread Mayur Rustagi
to work, so you may be hitting some of those walls too. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Jul 4, 2014 at 2:36 PM, Igor Pernek i...@pernek.net wrote: Hi all! I have a folder with 150 G of txt

Re: LIMIT with offset in SQL queries

2014-07-04 Thread Mayur Rustagi
What I typically do is use a row_number subquery and filter based on that. It works out pretty well and reduces the iteration. I think an offset solution based on windowing directly would be useful. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com

Re: Visualize task distribution in cluster

2014-07-04 Thread Mayur Rustagi
You'll get most of that information from the Mesos interface. You may not get data-transfer information in particular. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Jul 3, 2014 at 6:28 AM, Tobias Pfeiffer t

Re: Spark job tracker.

2014-07-04 Thread Mayur Rustagi
The application server doesn't provide a JSON API, unlike the cluster interface (8080). If you are okay with patching Spark, you can use our patch to create a JSON API, or you can use the SparkListener interface in your application to get that info out. Mayur Rustagi Ph: +1 (760) 203 3257 http

Re: Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition

2014-07-02 Thread Mayur Rustagi
Ideally you should be converting the RDD to a SchemaRDD? You are creating a UnionRDD to join across DStream RDDs? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Jul 1, 2014 at 3:11 PM, Honey Joshi honeyjo...@ideata

Re: spark streaming counter metrics

2014-07-02 Thread Mayur Rustagi
You may be able to mix StreamingListener and SparkListener to get meaningful information about your task; however, you need to connect a lot of pieces to make sense of the flow. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

Re: Callbacks on freeing up of RDDs

2014-07-02 Thread Mayur Rustagi
A lot of RDDs that you create in code may not even be constructed, as the task layer is optimized in the DAG scheduler. The closest is onUnpersistRDD in SparkListener. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon

Re: Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition

2014-07-02 Thread Mayur Rustagi
Two job contexts cannot share data; are you collecting the data to the master and then sending it to the other context? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jul 2, 2014 at 11:57 AM, Honey Joshi honeyjo

Re: Serializer or Out-of-Memory issues?

2014-07-02 Thread Mayur Rustagi
Your executors are going out of memory, and then subsequent tasks scheduled on the scheduler are also failing, hence the lost TID (task id). Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Jun 30, 2014 at 7:47 PM, Sguj

Re: Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Mayur Rustagi
.. or in the worker log @ 192.168.222.164 or any of the machines where the crash log is displayed. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jul 2, 2014 at 7:51 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: A lot of things

Re: Distribute data from Kafka evenly on cluster

2014-06-28 Thread Mayur Rustagi
How about this? https://groups.google.com/forum/#!topic/spark-users/ntPQUZFJt4M Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sat, Jun 28, 2014 at 10:19 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, I have

Re: Map with filter on JavaRdd

2014-06-27 Thread Mayur Rustagi
It happens in a single operation itself. You may write them separately, but the stages are performed together if possible. You will see only one task in the output of your application. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com

Google Cloud Engine adds out of the box Spark/Shark support

2014-06-26 Thread Mayur Rustagi
https://groups.google.com/forum/#!topic/gcp-hadoop-announce/EfQms8tK5cE I suspect they are using their own builds... has anybody had a chance to look at it? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

Re: Spark job tracker.

2014-06-26 Thread Mayur Rustagi
You can use the SparkListener interface to track the tasks; another option is to use the JSON patch (https://github.com/apache/spark/pull/882) and track tasks with the JSON API. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Jun 27, 2014

Re: Persistent Local Node variables

2014-06-24 Thread Mayur Rustagi
it for quite a long time you can - Simplistically, store it in HDFS and load it each time - Either store that in a table and try to pull it with SparkSQL every time (experimental) - Use Ooyala Jobserver to cache the data and do all processing using that. Regards Mayur Mayur Rustagi Ph: +1 (760) 203

Re: Kafka Streaming - Error Could not compute split

2014-06-24 Thread Mayur Rustagi
I have seen this when I prevent spilling of shuffle data to disk. Can you change the shuffle memory fraction? Is your data spilling to disk? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Jun 23, 2014 at 12:09 PM

Re: Problems running Spark job on mesos in fine-grained mode

2014-06-24 Thread Mayur Rustagi
Hi Sebastien, Are you using PySpark by any chance, and is that working for you (post the patch)? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Jun 23, 2014 at 1:51 PM, Fedechicco fedechi...@gmail.com wrote: I'm

Re: Serialization problem in Spark

2014-06-24 Thread Mayur Rustagi
Did you try to register the class with the Kryo serializer? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Jun 23, 2014 at 7:00 PM, rrussell25 rrussel...@gmail.com wrote: Thanks for pointer...tried Kryo and ran
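Registering classes with Kryo in the 1.x-era configuration looks roughly like this; `MyClass` and the registrator's package name are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Register every class Kryo will serialize.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: com.esotericsoftware.kryo.Kryo) {
    kryo.register(classOf[MyClass])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyRegistrator") // hypothetical FQCN
```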

Re: Efficiently doing an analysis with Cartesian product (pyspark)

2014-06-24 Thread Mayur Rustagi
is in those. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Jun 24, 2014 at 3:33 AM, Aaron aaron.doss...@target.com wrote: Sorry, I got my sample outputs wrong (1,1) - 400 (1,2) - 500 (2,2)- 600 On Jun 23, 2014

Re: balancing RDDs

2014-06-24 Thread Mayur Rustagi
This would be really useful, especially for Shark, where a shift of partitioning affects all subsequent queries unless task scheduling time beats spark.locality.wait. It can cause overall low performance for all subsequent tasks. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

Re: How to Reload Spark Configuration Files

2014-06-24 Thread Mayur Rustagi
Not really. You are better off using a cluster manager like Mesos or Yarn for this. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Jun 24, 2014 at 11:35 AM, Sirisha Devineni sirisha_devin...@persistent.co.in wrote

Re: Questions regarding different spark pre-built packages

2014-06-24 Thread Mayur Rustagi
The HDFS driver keeps changing and breaking compatibility, hence all the build versions. If you don't use HDFS/YARN then you can safely ignore it. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Jun 24, 2014 at 12:16 PM

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Mayur Rustagi
To be clear, the number of map tasks is determined by the number of partitions inside the RDD, hence the suggestion by Nicholas. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jun 25, 2014 at 4:17 AM, Nicholas Chammas
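The partition count, and hence the number of map tasks, can be inspected and changed like this; the path and counts are arbitrary examples:

```scala
val rdd = sc.textFile("hdfs:///some/path")  // hypothetical path
println(rdd.partitions.size)    // current partition count == number of map tasks
val wider  = rdd.repartition(32) // full shuffle into 32 partitions
val fewer  = rdd.coalesce(4)     // merge down to 4, avoiding a full shuffle
```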

Re: ElasticSearch enrich

2014-06-24 Thread Mayur Rustagi
is safely serializable. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jun 25, 2014 at 4:12 AM, boci boci.b...@gmail.com wrote: Hi guys, I have a small question. I want to create a Worker class which using ElasticClient

Re: ElasticSearch enrich

2014-06-24 Thread Mayur Rustagi
. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jun 25, 2014 at 4:52 AM, Peng Cheng pc...@uow.edu.au wrote: I'm afraid persisting connection across two tasks is a dangerous act as they can't be guaranteed

Re: Running Spark alongside Hadoop

2014-06-20 Thread Mayur Rustagi
from same machines etc. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Jun 20, 2014 at 3:41 PM, Sameer Tilak ssti...@live.com wrote: Dear Spark users, I have a small 4 node Hadoop cluster. Each node is a VM -- 4

Re: Spark and RDF

2014-06-20 Thread Mayur Rustagi
You are looking to create Shark operators for RDF? Since the Shark backend is shifting to SparkSQL, it would be slightly hard; a much better effort would be to shift Gremlin to Spark (though a much beefier one :) ) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: Spark and RDF

2014-06-20 Thread Mayur Rustagi
or a separate RDD for SPARQL operations à la SchemaRDD... operators for SPARQL can be defined there... not a bad idea :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Jun 20, 2014 at 3:56 PM, andy petrella andy.petre

Re: Set the number/memory of workers under mesos

2014-06-20 Thread Mayur Rustagi
You should be able to configure it in the SparkContext in the Spark shell: spark.cores.max and memory. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Jun 20, 2014 at 4:30 PM, Shuo Xiang shuoxiang...@gmail.com wrote
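Setting those properties when constructing the context could look like this; the master URL and values are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://host:5050")      // hypothetical Mesos master
  .set("spark.cores.max", "8")         // cap total cores for this application
  .set("spark.executor.memory", "4g")  // per-executor memory
val sc = new SparkContext(conf)
```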

Re: list of persisted rdds

2014-06-13 Thread Mayur Rustagi
val myRdds = sc.getPersistentRDDs assert(myRdds.size === 1) It'll return a map. It's pretty old, available from 0.8.0 onwards. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Jun 13, 2014 at 9:42 AM, mrm ma

Re: Master not seeing recovered nodes(Got heartbeat from unregistered worker ....)

2014-06-13 Thread Mayur Rustagi
I have also had trouble with workers joining the working set. I have typically moved to a Mesos-based setup. Frankly, for high availability you are better off using a cluster manager. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

Re: multiple passes in mapPartitions

2014-06-13 Thread Mayur Rustagi
Sorry if this is a dumb question, but why not several calls to mapPartitions sequentially? Are you looking to avoid function serialization, or is your function damaging partitions? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

Re: specifying fields for join()

2014-06-13 Thread Mayur Rustagi
You can resolve the columns to create keys using them, then join. Is that what you did? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Jun 12, 2014 at 9:24 PM, SK skrishna...@gmail.com wrote: This issue
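Keying each RDD by the join column and then joining can be sketched as follows; the record layouts and field name are hypothetical:

```scala
// Key both sides by the column to join on, then use the pair-RDD join.
val left   = rddA.map(r => (r.userId, r))
val right  = rddB.map(r => (r.userId, r))
val joined = left.join(right) // RDD[(userId, (A, B))]
```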

Re: Writing data to HBase using Spark

2014-06-12 Thread Mayur Rustagi
Are you able to use the Hadoop input/output reader for HBase with the new Hadoop API reader? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Jun 12, 2014 at 7:49 AM, gaurav.dasgupta gaurav.d...@gmail.com wrote: Is there anyone

Re: Spark Streaming, download a s3 file to run a script shell on it

2014-06-07 Thread Mayur Rustagi
, Mayur Rustagi wrote: You can look to create a DStream directly from the S3 location using file stream. If you want to use any specific logic you can rely on QueueStream: read data yourself from S3, process it, and push it into the RDD queue. Mayur Rustagi Ph: +1 (760) 203 3257 http

Re: Spark Streaming, download a s3 file to run a script shell on it

2014-06-07 Thread Mayur Rustagi
The QueueStream example is in the Spark Streaming examples: http://www.boyunjian.com/javasrc/org.spark-project/spark-examples_2.9.3/0.7.2/_/spark/streaming/examples/QueueStream.scala Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi
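For the shape of the pattern, here is a pure-Scala sketch (no Spark dependency; the S3 fetch is stubbed and the key name is invented): a producer pushes batches into a queue and the streaming side drains one batch per interval, much as `ssc.queueStream(rddQueue)` consumes RDDs pushed into a queue.

```scala
import scala.collection.mutable

// Queue of batches; in Spark Streaming this would be a Queue[RDD[String]].
val batchQueue = mutable.Queue[Seq[String]]()

// Producer: a real job would download the S3 object here and split it.
def fetchFromS3(key: String): Seq[String] =
  Seq(s"first line of $key", s"second line of $key")

batchQueue.enqueue(fetchFromS3("logs/2014-06-07.txt"))

// Consumer: one "batch interval" of processing.
val processed = batchQueue.dequeue().map(_.toUpperCase)
```

The real version just replaces the stub with your S3 download logic and wraps each batch in an RDD before enqueueing.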

Re: Spark Streaming, download a s3 file to run a script shell on it

2014-06-06 Thread Mayur Rustagi
You can look to create a DStream directly from the S3 location using fileStream. If you want to apply any specific logic you can rely on queueStream: read the data yourself from S3, process it, and push it into an RDD queue. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https

Re: Serialization problem in Spark

2014-06-06 Thread Mayur Rustagi
Where are you getting the serialization error? It's likely to be a different problem. Which class is not getting serialized? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Jun 5, 2014 at 6:32 PM, Vibhor Banga vibhorba

Re: Using mongo with PySpark

2014-06-06 Thread Mayur Rustagi
It may or may not work for you depending on the internals of the MongoDB client. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jun 4, 2014 at 10:27 PM, Samarth Mailinglist mailinglistsama...@gmail.com wrote: Thanks a lot, sorry

Re: stage kill link is awfully close to the stage name

2014-06-06 Thread Mayur Rustagi
And then an "Are you sure?" after that :) On 7 Jun 2014 06:59, Mikhail Strebkov streb...@gmail.com wrote: Nick Chammas wrote I think it would be better to have the kill link flush right, leaving a large amount of space between it and the stage detail link. I think even better would be to have a

Re: Error related to serialisation in spark streaming

2014-06-04 Thread Mayur Rustagi
cleaner doesn't find it and can't clean it properly. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jun 4, 2014 at 4:18 PM, nilmish nilmish@gmail.com wrote: The error is resolved. I was using a comparator which

Re: Error related to serialisation in spark streaming

2014-06-03 Thread Mayur Rustagi
So are you using Java 7 or 8? Java 7 doesn't clean closures properly, so you need to define a static class as a function and then call that in your operations. Otherwise it'll try to send the whole enclosing class along with the function. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi
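The advice can be demonstrated without Spark: a standalone, serializable function/comparator survives Java serialization, whereas an anonymous inner class can drag a non-serializable enclosing class into the closure. A minimal sketch (the comparator and helper are made-up examples):

```scala
import java.io._

// A standalone comparator defined at top level, explicitly Serializable,
// so it carries no reference to any enclosing class.
object ByLength extends Ordering[String] with Serializable {
  def compare(a: String, b: String): Int = a.length - b.length
}

// Serialize and deserialize an object; throws NotSerializableException
// if anything in the object graph is not serializable.
def roundTrip[T <: AnyRef](obj: T): T = {
  val out = new ByteArrayOutputStream()
  new ObjectOutputStream(out).writeObject(obj)
  new ObjectInputStream(new ByteArrayInputStream(out.toByteArray))
    .readObject().asInstanceOf[T]
}

val restored = roundTrip[Ordering[String]](ByLength) // succeeds: no enclosing state
val sorted = List("stream", "rdd", "spark").sorted(restored)
```

This is the same check Spark performs when shipping a closure to executors, which is why the standalone definition avoids the error.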

Re: Reg: Add/Remove slave nodes spark-ec2

2014-06-03 Thread Mayur Rustagi
You'll have to restart the cluster: create a copy of your existing slave, add it to the slaves file on the master, and restart the cluster. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Jun 3, 2014 at 4:30 PM, Sirisha Devineni
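A hedged sketch of the procedure (the hostname is a placeholder, and a temp directory stands in for `$SPARK_HOME/conf` so the snippet is safe to run anywhere; the actual restart commands are shown as comments):

```shell
# Use the real conf dir if set, otherwise a throwaway dir for demonstration.
CONF_DIR="${SPARK_CONF_DIR:-$(mktemp -d)}"
touch "$CONF_DIR/slaves"

# 1. Add the new worker's hostname to the slaves file on the master.
echo "ec2-new-worker.example.com" >> "$CONF_DIR/slaves"
sort -u "$CONF_DIR/slaves" -o "$CONF_DIR/slaves"   # keep entries unique

# 2. Then restart the standalone cluster from the master, e.g.:
#    $SPARK_HOME/sbin/stop-all.sh && $SPARK_HOME/sbin/start-all.sh
cat "$CONF_DIR/slaves"
```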

Re: WebUI's Application count doesn't get updated

2014-06-03 Thread Mayur Rustagi
Did you use Docker or plain LXC, specifically? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Jun 3, 2014 at 1:40 PM, MrAsanjar . afsan...@gmail.com wrote: thanks guys, that fixed my problem. As you might have

Re: Using Spark on Data size larger than Memory size

2014-05-31 Thread Mayur Rustagi
Clearly there will be an impact on performance, but frankly it depends on what you are trying to achieve with the dataset. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba

Re: Failed to remove RDD error

2014-05-31 Thread Mayur Rustagi
You can increase your Akka timeout; that should give you some more life. Are you running out of memory by any chance? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sat, May 31, 2014 at 6:52 AM, Michael Chang m
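For reference, a sketch of the relevant setting in `conf/spark-defaults.conf` (the property name is the one used in Spark 1.x; the value is illustrative, not a recommendation, so verify against the configuration docs for your version):

```
# Raise the Akka communication timeout (in seconds; the 1.x default was 100).
spark.akka.timeout    300
```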

Re: Reading bz2 files that do not end with .bz2

2014-05-28 Thread Mayur Rustagi
You can use the Hadoop API's provided input/output reader with a Hadoop configuration file to read the data. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, May 28, 2014 at 7:22 PM, Laurent T laurent.thou
