need to coalesce or repartition.
--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001
On Thu, Oct 26, 2017 at 11:31 AM, lucas.g...@gmail.com <lucas.g...@gmail.com
> wrote:
> Thanks Daniel!
>
> I've been wondering that fo
-configuration-options
I have no idea why it defaults to a fixed 200 (while default parallelism
defaults to a number scaled to your number of cores), or why there are two
separate configuration properties.
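For anyone hitting this, a minimal sketch of working with both properties (assuming Spark 2.x with a SparkSession named spark; the values are only illustrative):

spark.conf.set("spark.sql.shuffle.partitions", "64")   // applies to DataFrame/SQL shuffles
// spark.default.parallelism only takes effect at context startup, e.g.:
//   spark-submit --conf spark.default.parallelism=64 ...   // applies to RDD shuffles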
--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New
On Thu, Sep 28, 2017 at 7:23 AM, Gourav Sengupta
wrote:
>
> I will be very surprised if someone tells me that a 1 GB CSV text file is
> automatically split and read by multiple executors in SPARK. It does not
> matter whether it stays in HDFS, S3 or any other system.
>
> Can you kindly explain how Spark uses parallelism for a bigger (say 1 GB)
> text file? Does it use InputFormat to create multiple splits and create one
> partition per split? Also, in the case of S3 or NFS, how does the input split
> work? I understand for HDFS files are already pre-split so Spark can
> no matter what you do and how many nodes you start, in case you have a
> single text file, it will not use parallelism.
>
This is not true, unless the file is small or is gzipped (gzipped files
cannot be split).
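A quick way to check this (a sketch, assuming an uncompressed file at a hypothetical path):

val lines = sc.textFile("s3a://my-bucket/big-file.csv")   // hypothetical path
println(lines.getNumPartitions)   // normally well above 1 for a ~1 GB uncompressed file
// the same file stored as .csv.gz would come back as a single partition, since gzip is not splittable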
not to
enable it, but I haven't had any problem with it.
--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001
On Sat, May 20, 2017 at 9:14 PM, Kabeer Ahmed <kab...@gmx.co.uk> wrote:
> Thank you Takeshi.
>
> As far as I se
Google was not helpful.
--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001
As you noted, Accumulators do not guarantee accurate results except in
specific situations. I recommend never using them.
This article goes into some detail on the problems with accumulators:
http://imranrashid.com/posts/Spark-Accumulators/
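To illustrate the failure mode the article describes, a hedged sketch (Spark 2.x API; data is a hypothetical RDD):

val acc = sc.longAccumulator("records seen")
val mapped = data.map { x => acc.add(1); x }   // counting inside a transformation
mapped.count()
mapped.count()   // the map runs again, so acc roughly doubles
// retried or speculatively executed tasks inflate the value the same way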
On Wed, Mar 1, 2017 at 7:26 AM, Charles O. Bajomo <
I am not too familiar with Spark Standalone, so unfortunately I cannot give
you any definite answer. I do want to clarify something though.
The properties spark.sql.shuffle.partitions and spark.default.parallelism
affect how your data is split up, which will determine the *total* number
of tasks,
to disable it.
--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001
On Thu, Dec 22, 2016 at 11:09 AM, Kristina Rogale Plazonic <kpl...@gmail.com
> wrote:
> Hi,
>
> I write a randomly generated 30,000-row dataf
Accumulators are generally unreliable and should not be used. The answer to
(2) and (4) is yes. The answer to (3) is both.
Here's a more in-depth explanation:
http://imranrashid.com/posts/Spark-Accumulators/
On Sun, Dec 11, 2016 at 11:27 AM, Sudev A C wrote:
> Please
Did you try unioning the datasets for each CSV into a single dataset? You
may need to put the directory name into a column so you can partition by it.
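A rough sketch of that approach (Spark 2.x DataFrame API; paths and column names are hypothetical):

import org.apache.spark.sql.functions.lit
val dirs = Seq("2016-11-01", "2016-11-02")   // hypothetical input directories
val combined = dirs
  .map(d => spark.read.option("header", "true").csv(s"/data/csv/$d").withColumn("source_dir", lit(d)))
  .reduce(_ union _)
combined.write.partitionBy("source_dir").parquet("/data/parquet")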
On Tue, Nov 15, 2016 at 8:44 AM, benoitdr
wrote:
> Hello,
>
> I'm trying to convert a bunch of csv files to parquet,
.
Personally, I would just use a separate JSON library (e.g. json4s) to parse
this metadata into an object, rather than trying to read it in through
Spark.
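A minimal json4s sketch of that idea (the Metadata fields here are made up for illustration):

import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods.parse

case class Metadata(name: String, version: Int)   // hypothetical shape of the metadata
implicit val formats: Formats = DefaultFormats
val meta = parse(metadataJson).extract[Metadata]   // metadataJson is the raw JSON string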
--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001
I think it's fine to read animal types locally because there are only 70 of
them. It's just that you want to execute the Spark actions in parallel. The
easiest way to do that is to have only a single action.
Instead of grabbing the result right away, I would just add a column for
the animal type
access to the S3 bucket in the EMR cluster's AWS account.
Is there any way for Spark to access S3 buckets in multiple accounts? If
not, is there any best practice for how to work around this?
--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York
?
--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001
Yes, you can use Spark for ETL, as well as feature engineering, training,
and scoring.
~Daniel Siegmann
On Tue, Aug 2, 2016 at 3:29 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:
> Hi,
>
> If I may say, if you spend sometime going through this mailing list in
>
On Tue, Jun 7, 2016 at 11:43 PM, Francois Le Roux
wrote:
> 1. Should I use dataframes to ‘pull the source data? If so, do I do
> a groupby and order by as part of the SQL query?
>
Seems reasonable. If you use Scala you might want to define a case class
and convert
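The advice is cut off here, but a minimal sketch of the case-class conversion it points toward (table and field names are hypothetical; Spark 1.6+/2.x Dataset API):

case class SourceRow(id: Long, category: String, amount: Double)   // hypothetical schema
import spark.implicits._
val ds = spark.sql("SELECT id, category, amount FROM source ORDER BY category, id").as[SourceRow]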
I don't believe there's any way to output files of a specific size. What you
can do is partition your data into a number of partitions such that the
amount of data they each contain is around 1 GB.
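A hedged sketch of that sizing approach (the size estimate and paths are placeholders):

val targetBytes = 1024L * 1024 * 1024                      // ~1 GB per output file
val estimatedTotalBytes = 40L * 1024 * 1024 * 1024         // however you estimate your data size
val numPartitions = math.max(1, (estimatedTotalBytes / targetBytes).toInt)
df.repartition(numPartitions).write.parquet("/out/path")   // df is the DataFrame being written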
On Thu, Jun 9, 2016 at 7:51 AM, Ankur Jain wrote:
> Hello Team,
>
>
>
> I
fitIntercept)
> res27: String = fitIntercept: whether to fit an intercept term (default:
> true)
>
> On Mon, 11 Apr 2016 at 21:59 Daniel Siegmann <daniel.siegm...@teamaol.com>
> wrote:
>
>> I'm trying to understand how I can add a bias when training in Spark. I
>> h
must be part of the model.
~Daniel Siegmann
if there
are multiple attempts. You can also see it in the Spark history server
(under incomplete applications, if the second attempt is still running).
~Daniel Siegmann
On Mon, Mar 21, 2016 at 9:58 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> Can you provide a bit more information ?
>
> R
parse weight vectors currently. There are potential
> solutions to these but they haven't been implemented as yet.
>
> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <daniel.siegm...@teamaol.com>
> wrote:
>
>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <nick.pentre...@gmail.com
>&g
On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath
wrote:
> Would you mind letting us know the # training examples in the datasets?
> Also, what do your features look like? Are they text, categorical etc? You
> mention that most rows only have a few features, and all rows
extreme for a 20 million
> size dense weight vector (which should only be a few 100MB memory), so
> perhaps something else is going on.
>
> Nick
>
> On Tue, 8 Mar 2016 at 22:55 Daniel Siegmann <daniel.siegm...@teamaol.com>
> wrote:
>
>> Just for the heck of
>>
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
>>
>> Only downside is that you can't use the pipeline framework from spark ml.
>>
>> Cheers,
>> Devin
>>
>>
>>
>> On Mon, Mar 7, 2016 at 4:54 PM, Danie
be appreciated.
~Daniel Siegmann
I have confirmed this is fixed in Spark 1.6.1 RC 1. Thanks.
On Tue, Feb 23, 2016 at 1:32 PM, Daniel Siegmann <
daniel.siegm...@teamaol.com> wrote:
> Yes, I will test once 1.6.1 RC1 is released. Thanks.
>
> On Mon, Feb 22, 2016 at 6:24 PM, Michael Armbrust <mich...@databricks.co
In the past I have seen this happen when I filled up HDFS and some core
nodes became unhealthy. There was no longer anywhere to replicate the data.
From your command it looks like you should have 1 master and 2 core nodes
in your cluster. Can you verify both the core nodes are healthy?
On Wed,
How many core nodes does your cluster have?
On Tue, Mar 1, 2016 at 4:15 AM, Oleg Ruchovets wrote:
> Hi, I installed EMR 4.3.0 with Spark. I tried to enter the spark shell but
> it looks like it doesn't work and throws exceptions.
> Please advise:
>
> [hadoop@ip-172-31-39-37
Yes, I will test once 1.6.1 RC1 is released. Thanks.
On Mon, Feb 22, 2016 at 6:24 PM, Michael Armbrust <mich...@databricks.com>
wrote:
> I think this will be fixed in 1.6.1. Can you test when we post the first
> RC? (hopefully later today)
>
> On Mon, Feb 22, 2016 at 1:51 P
During testing you will typically be using some finite data. You want the
stream to shut down automatically when that data has been consumed so your
test shuts down gracefully.
Of course once the code is running in production you'll want it to keep
waiting for new records. So whether the stream
support serializing
arbitrary Seq values in datasets, or must everything be converted to Array?
~Daniel Siegmann
With EMR supporting Spark, I don't see much reason to use the spark-ec2
script unless it is important for you to be able to launch clusters using
the bleeding edge version of Spark. EMR does seem to do a pretty decent job
of keeping up to date - the latest version (4.3.0) supports the latest
Spark
Will there continue to be monthly releases on the 1.6.x branch during the
additional time for bug fixes and such?
On Tue, Jan 26, 2016 at 11:28 PM, Koert Kuipers wrote:
> thanks thats all i needed
>
> On Tue, Jan 26, 2016 at 6:19 PM, Sean Owen wrote:
>
>>
As I understand it, your initial number of partitions will always depend on
the initial data. I'm not aware of any way to change this, other than
changing the configuration of the underlying data store.
Have you tried reading the data in several data frames (e.g. one data frame
per day),
RDD has methods to zip with another RDD or with an index, but there's no
equivalent for data frames. Anyone know a good way to do this?
I thought I could just convert to RDD, do the zip, and then convert back,
but ...
1. I don't see a way (outside developer API) to convert RDD[Row]
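For context, the workaround usually suggested looks roughly like this (a sketch, not necessarily what this thread settled on; Spark 1.x sqlContext shown):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val indexedRows = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
val schema = StructType(df.schema.fields :+ StructField("row_index", LongType, nullable = false))
val indexed = sqlContext.createDataFrame(indexedRows, schema)   // spark.createDataFrame(...) in 2.x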
DataFrames are a higher level API for working with tabular data - RDDs are
used underneath. You can use either and easily convert between them in your
code as necessary.
DataFrames provide a nice abstraction for many cases, so it may be easier
to code against them. Though if you're used to
Each node can have any number of partitions. Spark will try to have a node
process partitions which are already on the node for best performance (if
you look at the list of tasks in the UI, look under the locality level
column).
As a rule of thumb, you probably want 2-3 times the number of
On Fri, Jul 10, 2015 at 1:41 PM, Naveen Madhire vmadh...@umail.iu.edu
wrote:
I want to write junit test cases in scala for testing spark application.
Is there any guide or link which I can refer.
https://spark.apache.org/docs/latest/programming-guide.html#unit-testing
Typically I create test
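A minimal sketch of a local-mode test (assuming ScalaTest is on the test classpath; names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class WordCountSpec extends FunSuite with BeforeAndAfterAll {
  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
  }

  override def afterAll(): Unit = sc.stop()

  test("counts words") {
    val counts = sc.parallelize(Seq("a", "b", "a")).map((_, 1)).reduceByKey(_ + _).collectAsMap()
    assert(counts("a") == 2)
  }
}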
To set up Eclipse for Spark you should install the Scala IDE plugins:
http://scala-ide.org/download/current.html
Define your project in Maven with Scala plugins configured (you should be
able to find documentation online) and import as an existing Maven project.
The source code should be in
If the number of items is very large, have you considered using
probabilistic counting? The HyperLogLogPlus
https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java
class from stream-lib https://github.com/addthis/stream-lib
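A rough sketch of how that might be used with an RDD (assuming stream-lib is on the classpath and the sketch object serializes; the precision and names are placeholders):

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus

val approx = items.mapPartitions { iter =>   // items: a hypothetical RDD of the values to count
  val hll = new HyperLogLogPlus(14)          // precision p = 14, a common default
  iter.foreach(hll.offer)
  Iterator(hll)
}.reduce { (a, b) => a.addAll(b); a }        // merge the per-partition sketches
println(approx.cardinality())                // approximate distinct count

Spark's own countApproxDistinct on RDDs (and approxCountDistinct in Spark SQL) gives similar results without the extra dependency.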
(hive.metastore.warehouse.dir, warehousePath.toString)
}
Cheers
On Wed, Apr 8, 2015 at 1:07 PM, Daniel Siegmann
daniel.siegm...@teamaol.com wrote:
I am trying to unit test some code which takes an existing HiveContext
and uses it to execute a CREATE TABLE query (among other things).
Unfortunately I've
I am trying to unit test some code which takes an existing HiveContext and
uses it to execute a CREATE TABLE query (among other things). Unfortunately
I've run into some hurdles trying to unit test this, and I'm wondering if
anyone has a good approach.
The metastore DB is automatically created in
You shouldn't need to do anything special. Are you using a named context?
I'm not sure those work with SparkSqlJob.
By the way, there is a forum on Google groups for the Spark Job Server:
https://groups.google.com/forum/#!forum/spark-jobserver
On Thu, Apr 2, 2015 at 5:10 AM, Harika
On Thu, Mar 12, 2015 at 1:45 AM, raghav0110...@gmail.com wrote:
In your response you say “When you call reduce and *similar *methods,
each partition can be reduced in parallel. Then the results of that can be
transferred across the network and reduced to the final result”. By similar
methods
Join causes a shuffle (sending data across the network). I expect it will
be better to filter before you join, so you reduce the amount of data which
is sent across the network.
Note this would be true for *any* transformation which causes a shuffle. It
would not be true if you're combining RDDs
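A tiny sketch of the ordering (rdd1, rdd2, and keep are hypothetical):

// rdd1: RDD[(K, V)], rdd2: RDD[(K, W)]
val joined = rdd1.filter { case (_, v) => keep(v) }.join(rdd2)   // filter first: less data goes through the shuffle
// rather than: rdd1.join(rdd2).filter { case (_, (v, _)) => keep(v) }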
An RDD is a Resilient *Distributed* Data set. The partitioning and
distribution of the data happens in the background. You'll occasionally
need to concern yourself with it (especially to get good performance), but
from an API perspective it's mostly invisible (some methods do allow you to
specify
OK, good to know data frames are still experimental. Thanks Michael.
On Mon, Mar 2, 2015 at 12:37 PM, Michael Armbrust mich...@databricks.com
wrote:
We have been using Spark SQL in production for our customers at Databricks
for almost a year now. We also know of some very large production
for network shuffle, in reduceByKey after map +
combine are done, I would like to filter the keys based on some threshold...
Is there a way to get the key, value after map+combine stages so that I
can run a filter on the keys ?
Thanks.
Deb
--
Daniel Siegmann, Software Developer
Velos
Thanks for the replies. Hopefully this will not be too difficult to fix.
Why not support multiple paths by overloading the parquetFile method to
take a collection of strings? That way we don't need an appropriate
delimiter.
On Thu, Dec 25, 2014 at 3:46 AM, Cheng, Hao hao.ch...@intel.com wrote:
= [(1, 3), (2, 4)]
and
y = [(3, 5), (4, 7)]
and I want to have
z = [(1, 3), (2, 4), (3, 5), (4, 7)]
How can I achieve this? I know you can use outerJoin followed by map to
achieve this, but is there a more direct way to do this?
--
Daniel Siegmann, Software Developer
Velos
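The archived reply is cut off, but for reference the direct way to concatenate two RDDs is union (no shuffle, duplicates preserved):

val z = x.union(y)   // or x ++ y
z.collect()          // Array((1,3), (2,4), (3,5), (4,7))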
be
to define my own equivalent of PairRDDFunctions which works with my class,
does type conversions to Tuple2, and delegates to PairRDDFunctions.
Does anyone know a better way? Anyone know if there will be a significant
performance penalty with that approach?
--
Daniel Siegmann, Software Developer
:45 PM, Michael Armbrust mich...@databricks.com
wrote:
I think you should also be able to get away with casting it back and forth
in this case using .asInstanceOf.
On Wed, Nov 19, 2014 at 4:39 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
I have a class which is a subclass of Tuple2
string key get same
numeric consecutive key?
Any hints?
best,
/Shahab
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io
I've never used Mesos, sorry.
On Fri, Nov 14, 2014 at 5:30 PM, Steve Lewis lordjoe2...@gmail.com wrote:
The cluster runs Mesos and I can see the tasks in the Mesos UI but most
are not doing much - any hints about that UI
On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann
daniel.siegm
.
So…
What’s the real difference between an accumulator/accumulable and
aggregating an RDD? When is one method of aggregation preferred over the
other?
Thanks,
Nate
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E
)? Is there a mechanism similar to MR where we can ensure each
partition is assigned some amount of data by size, by setting some block
size parameter?
On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia
mchett
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io
, 2014 at 10:11 AM, Rishi Yadav ri...@infoobjects.com wrote:
If your data is in hdfs and you are reading as textFile and each file is
less than block size, my understanding is it would always have one
partition per file.
On Thursday, November 13, 2014, Daniel Siegmann daniel.siegm...@velos.io
other action I am trying to perform inside the map
statement. I am failing to understand what I am doing wrong.
Can anyone help with this?
Thanks,
Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54
val rdds = paths.map { path =>
  sc.textFile(path).map(myFunc)
}
val completeRdd = sc.union(rdds)
Does that make any sense?
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io
without destroying the RDD for subsequent
processing. persist will do this but these are big and persist seems
expensive and I am unsure of which StorageLevel is needed. Is there a way
to clone a JavaRDD or does anyone have good ideas on how to do this?
--
Daniel Siegmann, Software Developer
?
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine
for
your Play app.
Thanks,
Mohammed
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
)
}
}
val sparkInvoker = new SparkJobInvoker(sparkContext,
trainingDatasetLoader)
when(inputRDD.mapPartitions(transformerFunction)).thenReturn(classificationResultsRDD)
sparkInvoker.invoke(inputRDD)
Thanks,
Saket
--
Daniel Siegmann, Software Developer
Velos
/reduce applications from within Eclipse and
debug and learn.
thanks
sanjay
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
like zipPartitions but for arbitrarily many
RDD's, is there any such functionality or how would I approach this problem?
Cheers,
Johan
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W
.
***
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
.
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E
.
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
--
Daniel Siegmann, Software Developer
Velos
Accelerating
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE
. If you want more parallelism,
I think you just need more cores in your cluster--that is, bigger nodes, or
more nodes.
Daniel,
Have you been able to get around this limit?
Nick
On Fri, Aug 1, 2014 at 11:49 AM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
Sorry, but I haven't used
available.
I'm fairly new with Spark so maybe I'm just missing or misunderstanding
something fundamental. Any help would be appreciated.
Thanks.
Darin.
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
behavior, that should be equivalent to:
sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect
Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
Any ideas why this doesn't work?
-kr, Gerard.
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine
Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
to just allocate
one task per core, and so runs out of memory on the node. Is there any way
to give the scheduler a hint that the task uses lots of memory and cores so
it spreads it out more evenly?
Thanks,
Ravi Pandya
Microsoft Research
--
Daniel Siegmann, Software Developer
Velos
* thing you run on the cluster, you could also
configure the Workers to only report one core by manually launching the
spark.deploy.worker.Worker process with that flag (see
http://spark.apache.org/docs/latest/spark-standalone.html).
Matei
On Jul 14, 2014, at 1:59 PM, Daniel Siegmann
. From the data
injector and Streaming tab of web ui, it's running well.
However, I see quite a lot of active stages in the web UI even though some of them
have all of their tasks completed.
I attach a screenshot for your reference.
Do you ever see this kind of behavior?
--
Daniel Siegmann, Software
Scalding. It's built on top of Cascading. If you have a huge dataset or
if you consider using map/reduce engine for your job, for any reason, you
can try Scalding.
PS Crunch also has a Scala API called Scrunch. And Crunch can run its jobs
on Spark too, not just M/R.
--
Daniel Siegmann, Software
...@gmail.com
wrote:
Daniel,
Do you mind sharing the size of your cluster and the production data
volumes ?
Thanks
Soumya
On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
From a development perspective, I vastly prefer Spark to MapReduce. The
MapReduce API is very
, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com
wrote:
When you say large data sets, how large?
Thanks
On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
From a development perspective, I vastly prefer Spark to MapReduce. The
MapReduce API is very constrained; Spark's API feels
cluster, we had 15 nodes. Each node had 24 cores and
2 workers each. Each executor got 14 GB of memory.
-Suren
On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com
wrote:
When you say large data sets, how large?
Thanks
On 07/07/2014 01:39 PM, Daniel Siegmann wrote
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
--
Daniel Siegmann, Software Developer
Velos
Accelerating
.
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
on
the map() operation?
thanks!
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
.
- Patrick
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io