Sorry, not really. Spork is a way to migrate your existing Pig scripts to
Spark or write new Pig jobs that can then execute on Spark.
For orchestration you are better off using Oozie, especially if you are
using other execution engines/systems besides Spark.
Regards,
Mayur Rustagi
Ph: +1 (760) 203 3257
We do maintain it, but in the Apache repo itself. However, Pig cannot do
orchestration for you. I am not sure what you are looking for from Pig in
this context.
Regards,
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoid.com http://www.sigmoidanalytics.com/
@mayur_rustagi http://www.twitter.com
Frankly, there is no good/standard way to visualize streaming data. So far I have
found HBase to be a good intermediate store for data from streams;
visualize it with a Play-based framework and d3.js.
Regards
Mayur
On Fri Feb 13 2015 at 4:22:58 PM Kevin (Sangwoo) Kim kevin...@apache.org
wrote:
I'm not very
You'll benefit from viewing Matei's talk at Yahoo on Spark internals and how
it optimizes execution of iterative jobs.
The simple answer is:
1. Spark doesn't materialize an RDD when you do an iteration, but lazily
captures the transformation functions in the RDD (only the function and closure,
no data operation
First of all, any action is only performed when you trigger a collect.
When you trigger collect, at that point Spark retrieves the data from disk, joins
the datasets together, and delivers the result to you; a minimal sketch follows.
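A minimal sketch of that laziness (file paths and names are hypothetical):

// nothing is read yet; only the lineage of functions is captured
val a = sc.textFile("hdfs:///data/input")
val b = a.map(_.split(",")(0))
val c = b.filter(_.nonEmpty)
// collect is the action: only now does Spark read, map, filter, and
// deliver the result to the driver
val out = c.collect()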
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com
heavily; in a Spark-native application
it's rarely required. Do you have a use case like that?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Nov 14, 2014 at 10:28 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi
flatMap would have to shuffle data only if the output RDD is expected to be
partitioned by some key:
RDD[X].flatMap(X => RDD[Y])
Unless it has to shuffle, it should be local.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Nov 13
- dev list + user list
Shark is not officially supported anymore, so you are better off moving to
Spark SQL.
Shark doesn't support Hive partitioning logic anyway; it has its own version of
partitioning on in-memory blocks, but that is independent of whether you
partition your data in Hive or not.
Mayur
What is the partition count of the RDD? It's possible that you don't have
enough memory to store the whole RDD on a single machine. Can you try
forcibly repartitioning the RDD and then caching it, as in the sketch below?
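A minimal sketch, assuming a hypothetical RDD named rdd and a target of 100 partitions:

// spread the data across the cluster so no single machine must hold it all
val repartitioned = rdd.repartition(100)
repartitioned.cache()
repartitioned.count()  // action to force the cached partitions to materialize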
Regards
Mayur
On Tue Oct 28 2014 at 1:19:09 AM shahab shahab.mok...@gmail.com wrote:
I used Cache
Does it retain the order if it's pulling from the HDFS blocks? Meaning,
if file1 = a, b, c partitions in order,
and I convert to a 2-partition read, will it map to ab, c or a, bc, or can it
also be a, cb?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https
We are developing something similar on top of Streaming. Could you detail
the rule functionality you are looking for?
We are developing a DSL for data processing on top of streaming as well as
static data, enabled on Spark.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http
That's a cryptic way to say there should be a JIRA for it :)
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Oct 9, 2014 at 11:46 AM, Sean Owen so...@cloudera.com wrote:
Yeah it's not there. I imagine it was simply
The current approach is to use mapPartitions: initialize the connection at the
beginning, iterate through the data, and close off the connector at the end. A minimal sketch follows.
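A sketch of the pattern, with a hypothetical Connection API standing in for your real client library:

val results = rdd.mapPartitions { iter =>
  val conn = Connection.open("host:port")  // one connection per partition
  // materialize the partition before closing the connection,
  // since the iterator is lazy
  val out = iter.map(record => conn.lookup(record)).toList
  conn.close()
  out.iterator
}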
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Oct 3, 2014 at 10:16 AM, Stephen
on it.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Sep 30, 2014 at 3:22 PM, Eko Susilo eko.harmawan.sus...@gmail.com
wrote:
Hi All,
I have a problem that i would like to consult about spark streaming.
I have
For 2, you can use the fair scheduler so that application tasks are
scheduled more fairly; see the sketch below.
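A minimal sketch of enabling it (the pool name is a hypothetical example):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-scheduling")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
// optionally assign jobs submitted from this thread to a named pool
sc.setLocalProperty("spark.scheduler.pool", "interactive")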
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Sep 25, 2014 at 12:32 PM, Akhil Das ak...@sigmoidanalytics.com
You can cache data in memory and query it using Spark Job Server.
Most folks dump data down to a queue/DB for retrieval.
You can also batch up the data and store it into Parquet partitions, then query it
using another Spark SQL shell; the JDBC driver in Spark SQL is part of 1.1, I believe.
--
Regards,
Mayur
Another aspect to keep in mind is that the JVM starts to misbehave above 8-10 GB of heap.
It is typically better to split it up at ~15 GB intervals.
If you are choosing machines, 10 GB/core is an approximate ratio to maintain.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com
into the embedded driver model
in YARN, where the driver will also run inside the cluster, hence reliability
and connectivity are a given.
--
Regards,
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
On Fri, Sep 12, 2014 at 6:46 PM, Jim Carroll jimfcarr...@gmail.com
wrote:
Hi
Cached RDDs do not survive SparkContext deletion (they are scoped on a
per-SparkContext basis).
I am not sure what you mean by disk-based cache eviction; if you cache more
RDDs than you have disk space, the result will not be very pretty :)
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
I think she is checking for blanks?
But if the RDD is blank then nothing will happen: no DB connections, etc.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Sep 8, 2014 at 1:32 PM, Tobias Pfeiffer t...@preferred.jp
What do you mean by "control your input"? Are you trying to pace your Spark
Streaming by number of words? If so, that is not supported as of now; you can
only control the time window and consume all files within that time period.
--
Regards,
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
updates to MySQL may cause data corruption.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Sun, Sep 7, 2014 at 11:54 AM, jchen jc...@pivotal.io wrote:
Hi,
Has someone tried using Spark Streaming with MySQL
is not
expected to alter anything apart from the RDD it is created upon, hence
Spark may not realize this dependency and may try to parallelize the two
operations, causing the error. Bottom line: as long as you make all your
dependencies explicit in RDDs, Spark will take care of the magic.
Mayur Rustagi
Ph: +1 (760
also create a (node,
bytearray) combo and join the two RDDs together.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Sat, Sep 6, 2014 at 10:51 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
I have an input file which
with 5-sec
processing; the alternative is to process the data in two pipelines (0.5 s and 5 s)
in two Spark Streaming jobs and overwrite the results of one with the other.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Sat, Sep 6, 2014 at 12:39 AM
(Sigmoid Analytics)
Not to mention the Spark and Pig communities.
Regards
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
I would suggest you use the JDBC connector in mapPartitions instead of map,
as JDBC connections are costly and can really impact your performance.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Aug 26, 2014 at 6:45 PM
Why don't you directly use the DStream created as output of the windowing process?
Any reason?
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Aug 21, 2014 at 8:38 PM, Josh J joshjd...@gmail.com wrote:
Hi,
I
def factorial(number: Int): Int = {
  def factorialWithAccumulator(accumulator: Int, number: Int): Int = {
    if (number == 1)
      accumulator
    else
      factorialWithAccumulator(accumulator * number, number - 1)
  }
  factorialWithAccumulator(1, number)
}
// apply to each element, assuming MyRDD is an RDD[Int]
MyRDD.map(factorial)
Mayur Rustagi
Ph: +1 (760) 203 3257
http
provide the full path of where to write (like hdfs:// etc.)
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Aug 21, 2014 at 8:29 AM, cuongpham92 cuongpha...@gmail.com wrote:
Hi,
I tried to write to text file from
transform your way :)
MyDStream.transform(rdd => rdd.map(wordChanger))
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Aug 20, 2014 at 1:25 PM, cuongpham92 cuongpha...@gmail.com wrote:
Hi,
I am a newbie to Spark
)
teenagers.saveAsParquetFile("people.parquet")
})
You can also try the insertInto API instead of registerAsTable, but I haven't used
it myself.
Also, you need to dynamically change the Parquet file name for every DStream...
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
Is your HDFS running, and can Spark access it?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Aug 21, 2014 at 1:15 PM, cuongpham92 cuongpha...@gmail.com wrote:
I'm sorry, I just forgot /data after hdfs://localhost
the task overhead of scheduling so many tasks mostly kills the performance.
import org.apache.spark.RangePartitioner

val file = sc.textFile("my local path")
val partitionedFile = file.map(x => (x, 1))
val data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))
Mayur Rustagi
Ph: +1 (760) 203
You can also use the updateStateByKey interface to store this shared variable. As
for the count, you can use foreachRDD to run counts on each RDD and then store that as
another RDD, or put it in updateStateByKey; a sketch follows.
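A minimal sketch of updateStateByKey keeping a running count per key; words is a
hypothetical DStream[String], and the StreamingContext needs a checkpoint
directory set for stateful operations:

import org.apache.spark.streaming.StreamingContext._  // pair-DStream functions

val counts = words.map(w => (w, 1)).updateStateByKey[Int] {
  (newValues: Seq[Int], state: Option[Int]) =>
    // add this batch's values to the running total
    Some(state.getOrElse(0) + newValues.sum)
}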
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com
predicates smartly and hence get
better performance (similar to Impala).
2. Cache data at a partition level from Hive and operate on those instead.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Aug 8, 2014
that Spark cannot access
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Aug 5, 2014 at 11:33 PM, alamin.ishak alamin.is...@gmail.com
wrote:
Hi,
Anyone? Any input would be much appreciated
Thanks,
Amin
On 5 Aug 2014
Then don't specify hdfs:// when you read the file.
Also, the community is generally quite active in responding; just be a little
patient.
Also, if possible, look at the Spark training videos from Spark Summit 2014
and/or AMPLab's training on the Spark website.
Mayur Rustagi
Ph: +1 (760) 203 3257
http
Have you tried
https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923
There is also a 0.9.1 version they talked about in one of the meetups.
Check out the S3 bucket in the guide; it should have a 0.9.1 version as
well.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
What is the use case you are looking at?
TSDB is not designed for you to query data directly from HBase; ideally you
should use the REST API if you are looking to do thin analysis. Are you looking
to do whole reprocessing of TSDB?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
into the folder; it's a hack but much less headache.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Aug 1, 2014 at 10:21 AM, Aniket Bhatnagar
aniket.bhatna...@gmail.com wrote:
Hi everyone
I haven't been receiving
The only blocker is that an accumulator can only be added to from the slaves and only read
on the master. If that constraint fits you well, you can fire away. A minimal sketch follows.
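A sketch of that constraint; rdd is a hypothetical RDD[String]:

val blankLines = sc.accumulator(0)  // created on the driver

rdd.foreach { line =>
  if (line.isEmpty) blankLines += 1  // tasks on the slaves may only add
}

println(blankLines.value)  // .value is only readable on the driver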
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Aug 1, 2014 at 7:38 AM, Julien
The HTTP API would be the best bet; I assume by graph you mean the charts
created by TSDB frontends.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Aug 1, 2014 at 4:48 PM, bumble123 tc1...@att.com wrote:
I'm trying
objects inside the class, so you may want to send across those
objects but not the whole parent class.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Jul 28, 2014 at 8:28 PM, Wang, Jensen jensen.w...@sap.com wrote
Based on some discussions with my application users, I have been trying to come
up with a standard way to deploy applications built on Spark:
1. Bundle the version of Spark with your application and ask users to store it in
HDFS before referring to it in YARN to boot your application.
2. Provide ways
Yes, you lose the data.
You can add machines, but it will require you to restart the cluster. Also,
adding is manual; it is on you to add the nodes.
Regards
Mayur
On Wednesday, July 23, 2014, durga durgak...@gmail.com wrote:
Hi All,
I have a question,
For my company , we are planning to use spark-ec2 scripts to
Have a look at broadcast variables; a minimal sketch follows.
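A sketch; the lookup table and rdd are hypothetical. The table is shipped to each
node once instead of with every task, which matters when it is huge:

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))  // huge in practice
val enriched = rdd.map(key => (key, lookup.value.getOrElse(key, 0)))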
On Tuesday, July 22, 2014, Parthus peng.wei@gmail.com wrote:
Hi there,
I was wondering if anybody could help me find an efficient way to make a
MapReduce program like this:
1) For each map function, it need access some huge files, which is around
Hi,
Spark does that out of the box for you :)
It compresses the execution steps down as much as possible.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Jul 9, 2014 at 3:15 PM, Konstantin Kudryavtsev
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}

// sc is the SparkContext
var sem = 0  // must be a var, since the listener increments it
sc.addSparkListener(new SparkListener {
  override def onTaskStart(taskStart: SparkListenerTaskStart) {
    sem += 1
  }
})
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
Hi,
We have fixed many major issues around Spork deploying it with some
customers. Would be happy to provide a working version to you to try out.
We are looking for more folks to try it out submit bugs.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
That version is old :).
We are not forking Pig but cleanly separating out the Pig execution engine. Let
me know if you are willing to give it a go.
Also, I would love to know what features of Pig you are using.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
If you receive data through multiple receivers across the cluster, I don't
think any order can be guaranteed. Order in distributed systems is tough.
On Tuesday, July 8, 2014, Yan Fang yanfang...@gmail.com wrote:
I know the order of processing DStream is guaranteed. Wondering if the
order of
The key idea is to simulate your app time as you enter data. So you can
connect Spark Streaming to a queue and insert data into it spaced by time.
Easier said than done :). What are the parallelism issues you are hitting
with your static approach?
On Friday, July 4, 2014, alessandro finamore
to work, so you
may be hitting some of those walls too.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Jul 4, 2014 at 2:36 PM, Igor Pernek i...@pernek.net wrote:
Hi all!
I have a folder with 150 G of txt
What I typically do is use a row_number subquery and filter based on that.
It works out pretty well and reduces the iteration. I think an offset solution
based on windowing directly would be useful.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com
You'll get most of that information from the Mesos interface. You may not get
data-transfer information in particular.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Jul 3, 2014 at 6:28 AM, Tobias Pfeiffer t
The application server doesn't provide a JSON API, unlike the cluster
interface (8080).
If you are okay with patching Spark, you can use our patch to create a JSON API,
or you can use the SparkListener interface in your application to get that info
out.
Mayur Rustagi
Ph: +1 (760) 203 3257
http
Ideally you should be converting the RDD to a SchemaRDD, no?
Are you creating a UnionRDD to join across DStream RDDs?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Jul 1, 2014 at 3:11 PM, Honey Joshi honeyjo...@ideata
You may be able to mix StreamingListener and SparkListener to get meaningful
information about your task; however, you need to connect a lot of pieces to
make sense of the flow.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
A lot of the RDDs that you create in code may not even be constructed, as the
task layer is optimized in the DAG scheduler. The closest is onUnpersistRDD
in SparkListener.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon
Two job contexts cannot share data. Are you collecting the data to the
master and then sending it to the other context?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Jul 2, 2014 at 11:57 AM, Honey Joshi
honeyjo
Your executors are going out of memory, and subsequent tasks scheduled on
them are then also failing, hence the lost TID (task ID).
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Jun 30, 2014 at 7:47 PM, Sguj
.. or in the worker log at 192.168.222.164,
or on any of the machines where the crash log is displayed.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Jul 2, 2014 at 7:51 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
A lot of things
How about this?
https://groups.google.com/forum/#!topic/spark-users/ntPQUZFJt4M
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Sat, Jun 28, 2014 at 10:19 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
I have
It happens within a single operation. You may write the steps separately, but
the stages are performed together when possible. You will see only one
task in the output of your application; see the sketch below.
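A minimal sketch of that pipelining; the path is hypothetical. Although map and
filter are written as separate steps, Spark fuses them into a single stage, so
each element flows through both functions in one pass:

val result = sc.textFile("hdfs:///input/data.txt")
  .map(_.toUpperCase)
  .filter(_.startsWith("A"))
  .count()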
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com
https://groups.google.com/forum/#!topic/gcp-hadoop-announce/EfQms8tK5cE
I suspect they are using their own builds. Has anybody had a chance to look
at it?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
You can use the SparkListener interface to track the tasks; another option is to use
the JSON patch (https://github.com/apache/spark/pull/882) and track tasks with
the JSON API.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Jun 27, 2014
it for quite a long time, you can:
- Simplistically store it in HDFS and load it each time
- Store it in a table and try to pull it with Spark SQL every
time (experimental)
- Use the Ooyala Jobserver to cache the data and do all processing using that.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203
I have seen this when I prevent spilling of shuffle data to disk. Can you
change the shuffle memory fraction (see the sketch below)? Is your data spilling to disk?
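A minimal sketch, assuming the Spark 1.x property names (values are hypothetical):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.spill", "true")          // allow spilling to disk
  .set("spark.shuffle.memoryFraction", "0.3")  // raise from the 0.2 default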
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Jun 23, 2014 at 12:09 PM
Hi Sebastien,
Are you using PySpark by any chance? Is that working for you (post the
patch)?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Jun 23, 2014 at 1:51 PM, Fedechicco fedechi...@gmail.com wrote:
I'm
Did you try registering the class with the Kryo serializer? A minimal sketch follows.
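A sketch of Kryo registration in Spark 1.x; MyClass and the registrator's fully
qualified name are hypothetical:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

case class MyClass(id: Int)  // the class that failed to serialize

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyClass])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyRegistrator")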
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Jun 23, 2014 at 7:00 PM, rrussell25 rrussel...@gmail.com wrote:
Thanks for pointer...tried Kryo and ran
is in those.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Jun 24, 2014 at 3:33 AM, Aaron aaron.doss...@target.com wrote:
Sorry, I got my sample outputs wrong
(1,1) - 400
(1,2) - 500
(2,2) - 600
On Jun 23, 2014
This would be really useful, especially for Shark, where a shift of
partitioning affects all subsequent queries unless task scheduling time
beats spark.locality.wait. It can cause overall low performance for all
subsequent tasks.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
Not really. You are better off using a cluster manager like Mesos or Yarn
for this.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Jun 24, 2014 at 11:35 AM, Sirisha Devineni
sirisha_devin...@persistent.co.in wrote
The HDFS driver keeps changing and breaking compatibility, hence all the build
versions. If you don't use HDFS/YARN, you can safely ignore it.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Jun 24, 2014 at 12:16 PM
To be clear, the number of map tasks is determined by the number of partitions
inside the RDD, hence the suggestion by Nicholas.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Jun 25, 2014 at 4:17 AM, Nicholas Chammas
is safely serializable.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Jun 25, 2014 at 4:12 AM, boci boci.b...@gmail.com wrote:
Hi guys,
I have a small question. I want to create a Worker class which using
ElasticClient
.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Jun 25, 2014 at 4:52 AM, Peng Cheng pc...@uow.edu.au wrote:
I'm afraid persisting connection across two tasks is a dangerous act as
they
can't be guaranteed to be from the same machines, etc.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Jun 20, 2014 at 3:41 PM, Sameer Tilak ssti...@live.com wrote:
Dear Spark users,
I have a small 4 node Hadoop cluster. Each node is a VM -- 4
Are you looking to create Shark operators for RDF? Since the Shark backend is
shifting to Spark SQL, that would be slightly hard; a much better effort would
be to shift Gremlin to Spark (though a much beefier one :) )
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
or a separate RDD for SPARQL operations à la SchemaRDD; operators for
SPARQL can be defined there. Not a bad idea :)
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Jun 20, 2014 at 3:56 PM, andy petrella andy.petre
You should be able to configure it in the SparkContext in the Spark shell:
spark.cores.max and the memory settings; a sketch follows.
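A minimal sketch of those settings (values are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.cores.max", "8")         // cap total cores for the app
  .set("spark.executor.memory", "4g")  // memory per executor
val sc = new SparkContext(conf)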
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Jun 20, 2014 at 4:30 PM, Shuo Xiang shuoxiang...@gmail.com wrote
val myRdds = sc.getPersistentRDDs
assert(myRdds.size == 1)
It'll return a map. It's pretty old: available from 0.8.0 onwards.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Jun 13, 2014 at 9:42 AM, mrm ma
I have also had trouble with workers joining the working set. I have typically
moved to a Mesos-based setup. Frankly, for high availability you are better
off using a cluster manager.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
Sorry if this is a dumb question, but why not several calls to
mapPartitions sequentially? Are you looking to avoid function
serialization, or is your function damaging partitions?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
You can resolve the columns to create keys from them, and then join. Is that
what you did?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Jun 12, 2014 at 9:24 PM, SK skrishna...@gmail.com wrote:
This issue
Are you able to use the Hadoop input/output reader for HBase through the new Hadoop API?
A minimal sketch follows.
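A sketch of reading HBase through the new Hadoop API; the table name is hypothetical:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")
val hbaseRdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])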
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Jun 12, 2014 at 7:49 AM, gaurav.dasgupta gaurav.d...@gmail.com
wrote:
Is there anyone
, Mayur Rustagi wrote:
You can look to create a DStream directly from the S3 location using file
stream. If you want to use any specific logic you can rely on QueueStream:
read the data yourself from S3, process it, and push it into the RDD queue.
Mayur Rustagi
Ph: +1 (760) 203 3257
http
The QueueStream example is in the Spark Streaming examples:
http://www.boyunjian.com/javasrc/org.spark-project/spark-examples_2.9.3/0.7.2/_/spark/streaming/examples/QueueStream.scala
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
You can look to create a DStream directly from the S3 location using file
stream. If you want to use any specific logic you can rely on QueueStream:
read the data yourself from S3, process it, and push it into the RDD queue. A minimal sketch follows.
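A sketch of the queue-backed stream; ssc is a StreamingContext, and the S3 fetch
is left as a stub you would implement yourself:

import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD

val rddQueue = new Queue[RDD[String]]()
val stream = ssc.queueStream(rddQueue)

// from a polling thread: fetch a batch from S3, apply your logic, then
rddQueue += sc.makeRDD(Seq("record1", "record2"))  // hypothetical batch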
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https
Where are you getting the serialization error? It's likely to be a different
problem. Which class is not getting serialized?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Jun 5, 2014 at 6:32 PM, Vibhor Banga vibhorba
or may not work for you depending on the internals of the MongoDB client.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Jun 4, 2014 at 10:27 PM, Samarth Mailinglist
mailinglistsama...@gmail.com wrote:
Thanks a lot, sorry
And then an "are you sure" after that :)
On 7 Jun 2014 06:59, Mikhail Strebkov streb...@gmail.com wrote:
Nick Chammas wrote
I think it would be better to have the kill link flush right, leaving a
large amount of space between it and the stage detail link.
I think even better would be to have a
cleaner doesn't
find it and can't clean it properly.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Jun 4, 2014 at 4:18 PM, nilmish nilmish@gmail.com wrote:
The error is resolved. I was using a comparator which
So, are you using Java 7 or 8?
Java 7 doesn't clean closures properly, so you need to define a static class as a
function and then call that in your operations. Otherwise it'll try to send the
whole class along with the function.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
You'll have to restart the cluster: create a copy of your existing slave,
add it to the slaves file on the master, and restart the cluster.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Jun 3, 2014 at 4:30 PM, Sirisha Devineni
Did you use Docker or plain LXC specifically?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Jun 3, 2014 at 1:40 PM, MrAsanjar . afsan...@gmail.com wrote:
thanks guys, that fixed my problem. As you might have
Clearly there will be an impact on performance, but frankly it depends on what you
are trying to achieve with the dataset.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba
You can increase your Akka timeout; that should give you some more life (a sketch
follows). Are you running out of memory by any chance?
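A minimal sketch, assuming the Spark 1.x property name (the value, in seconds, is hypothetical):

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.akka.timeout", "300")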
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Sat, May 31, 2014 at 6:52 AM, Michael Chang m
You can use the Hadoop API: provide an input/output reader and a Hadoop
configuration file to read the data.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, May 28, 2014 at 7:22 PM, Laurent T laurent.thou