@Sachin
>>The partition key is very important if you need to run multiple instances
of the streams application, with certain instances processing certain
partitions only.
Again, depending on the partition key is optional. It's actually a feature
enabler, so we can use local state stores to improve
@Sachin
>>is not elastic. You need to anticipate beforehand the volume of data you
will have. It is very difficult to add or reduce topic partitions later on.
Why do you say so, Sachin? Kafka Streams will readjust once we add more
partitions to the Kafka topic. And when we add more machines,
Pls be aware that Accumulators involve communication back with the driver
and may not be efficient. I think the OP wants some way to extract the stats
from the SQL plan if it is being stored in some internal data structure.
Regards
Sab
On 5 Nov 2016 9:42 p.m., "Deepak Sharma"
Can't you just reduce the amount of data you insert by applying a filter so
that only a small set of idpartitions is selected? You could have multiple
such inserts to cover all idpartitions. Does that help?
Regards
Sab
On 22 May 2016 1:11 pm, "swetha kasireddy"
wrote:
d ofcourse something that
> Amazon strongly suggests that we do not use. Please use roles and you will
> not have to worry about security.
>
> Regards,
> Gourav Sengupta
>
> On Tue, Mar 15, 2016 at 2:38 PM, Sabarish Sasidharan <
> sabarish@gmail.com> wrote:
>
You have a slash before the bucket name. It should be @.
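For example (hypothetical placeholder keys; note the @ before the bucket name, not a slash):

// substitute your own access and secret keys
val lines = sc.textFile("s3n://ACCESS_KEY:SECRET_KEY@yasemindeneme/deneme.txt")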
Regards
Sab
On 15-Mar-2016 4:03 pm, "Yasemin Kaya" wrote:
> Hi,
>
> I am using Spark 1.6.0 standalone and I want to read a txt file from S3
> bucket named yasemindeneme and my file name is deneme.txt. But I am getting
>
It will compress only RDDs persisted with serialization enabled (the _SER
storage levels). So you could skip the _SER modes for the RDDs you don't want
compressed. Not perfect, but something.
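A minimal sketch of what I mean (assuming spark.rdd.compress is enabled in your conf; rddToCompress and otherRdd are placeholders):

import org.apache.spark.storage.StorageLevel

// spark.rdd.compress only applies to serialized storage levels
rddToCompress.persist(StorageLevel.MEMORY_ONLY_SER)  // gets compressed
otherRdd.persist(StorageLevel.MEMORY_ONLY)           // plain objects, not compressed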
On 15-Mar-2016 4:33 pm, "Nirav Patel" wrote:
> Hi,
>
> I see that there's following spark config to compress an RDD.
PM, Prabhu Joseph <
>> prabhujose.ga...@gmail.com> wrote:
>>
>>> It is a Spark-SQL and the version used is Spark-1.2.1.
>>>
>>> On Mon, Mar 14, 2016 at 2:16 PM, Sabarish Sasidharan <
>>> sabarish.sasidha...@manthan.com> wrote:
>>>
caching anything. You could simply swap the
fractions in your case.
Regards
Sab
On Mon, Mar 14, 2016 at 2:20 PM, Prabhu Joseph <prabhujose.ga...@gmail.com>
wrote:
> It is a Spark-SQL and the version used is Spark-1.2.1.
>
> On Mon, Mar 14, 2016 at 2:16 PM, Sabarish Sasidharan <
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 March 2016 at 08:06, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> Which version of Spark are you using? The configuration varies by version.
>>
>> Regards
>> Sab
Which version of Spark are you using? The configuration varies by version.
Regards
Sab
On Mon, Mar 14, 2016 at 10:53 AM, Prabhu Joseph
wrote:
> Hi All,
>
> A Hive Join query which runs fine and faster in MapReduce takes a lot of
> time with Spark and finally fails
I believe you need to co-locate your Zeppelin on the same node where Spark
is installed. You need to specify SPARK_HOME. The master I used was
YARN.
Zeppelin exposes a notebook interface. A notebook can have many paragraphs.
You run the paragraphs. You can mix multiple contexts in the same
You can use S3's listKeys API and do a diff between consecutive listKeys to
identify what's new.
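A rough sketch of that diff (assuming the AWS Java SDK; bucket name is hypothetical and pagination is ignored):

import com.amazonaws.services.s3.AmazonS3Client
import scala.collection.JavaConverters._

val s3 = new AmazonS3Client()  // picks up credentials from the environment/instance profile
def listKeys(bucket: String): Set[String] =
  s3.listObjects(bucket).getObjectSummaries.asScala.map(_.getKey).toSet

val previous = listKeys("my-bucket")  // snapshot from the last run
val current = listKeys("my-bucket")   // snapshot now
val newKeys = current -- previous     // whatever got added in between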
Are there multiple files in each zip? Single-file archives are processed
just like text as long as they are in one of the supported compression formats.
Regards
Sab
On Wed, Mar 9, 2016 at 10:33 AM,
If all you want is Spark standalone then it's as simple as installing the
binaries and calling spark-submit with your main class. I would advise
against running Hadoop on Windows, it's a bit of trouble. But yes, you
can do it if you want to.
Regards
Sab
BeanInfo?
On 01-Mar-2016 6:25 am, "Steve Lewis" wrote:
> I have a relatively complex Java object that I would like to use in a
> dataset
>
> if I say
>
> Encoder evidence = Encoders.kryo(MyType.class);
>
> JavaRDD rddMyType= generateRDD(); // some code
>
> Dataset
Zeppelin?
Regards
Sab
On 01-Mar-2016 12:27 am, "Mich Talebzadeh"
wrote:
> Hi,
>
> Is there such a thing as Spark for clients, much like RDBMS clients that have
> a cut-down version of their big brother, useful for client connectivity but
> cannot be used as a server.
>
> Thanks
This is because Hadoop Writables are being reused. Just map them to some
custom type and then do further operations, including cache(), on that.
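A minimal sketch of the idea (assuming a SequenceFile of Text keys and values; the point is to copy out of the reused Writable before caching):

import org.apache.hadoop.io.Text

val raw = sc.sequenceFile("hdfs:///path/to/input", classOf[Text], classOf[Text])
// materialize new objects so the reused Writable instances are not captured
val safe = raw.map { case (k, v) => (k.toString, v.toString) }
safe.cache()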
Regards
Sab
On 27-Feb-2016 9:11 am, "Yan Yang" wrote:
> Hi
>
> I am pretty new to Spark, and after experimentation on our pipelines. I
I have tested up to 3 billion. ALS scales, you just need to scale your
cluster accordingly. More than building the model, it's getting the final
recommendations that won't scale as nicely, especially when the number of
products is huge. This is the case when you are generating recommendations
in a
I don't have a proper answer to this. But as a workaround, if you have 2
independent Spark jobs, you could update one while the other is serving
reads. But it's still not scalable for incessant updates.
Regards
Sab
On 25-Feb-2016 7:19 pm, "Udbhav Agarwal" wrote:
> Hi,
>
Like Robin said, pls explore Pregel. You could do it without Pregel but it
might be laborious. I have a simple outline below. You will need more
iterations if the number of levels is higher.
a-b
b-c
c-d
b-e
e-f
f-c
flatMapToPair:
a -> (a-b)
b -> (a-b)
b -> (b-c)
c -> (b-c)
c -> (c-d)
d -> (c-d)
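A rough Scala sketch of just that flatMap step in the outline above, assuming the edges sit in an RDD of (String, String) pairs:

// emit each edge once per endpoint, keyed by the vertex
val edges = sc.parallelize(Seq(("a","b"), ("b","c"), ("c","d"), ("b","e"), ("e","f"), ("f","c")))
val byVertex = edges.flatMap { case (src, dst) =>
  Seq((src, (src, dst)), (dst, (src, dst)))
}
// byVertex: a -> (a,b), b -> (a,b), b -> (b,c), c -> (b,c), ...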
I believe the ALS algo expects the ratings to be aggregated (A). I don't
see why you have to use decimals for rating.
Regards
Sab
On Thu, Feb 25, 2016 at 4:50 PM, Hiroyuki Yamada wrote:
> Hello.
>
> I just started working on CF in MLlib.
> I am using trainImplicit because I
There is no execution plan for FP. Execution plans exist for SQL.
Regards
Sab
On 24-Feb-2016 2:46 pm, "Ashok Kumar" wrote:
> Gurus,
>
> Is there anything like explain in Spark to see the execution plan in
> functional programming?
>
> warm regards
>
ntext does not need a hive instance to run.
> On Feb 24, 2016 19:15, "Sabarish Sasidharan" <
> sabarish.sasidha...@manthan.com> wrote:
>
>> Yes
>>
>> Regards
>> Sab
>> On 24-Feb-2016 9:15 pm, "Koert Kuipers" <ko...@tresata.com
("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
> sys.exit
>
> *Results*
>
> Started at [24/02/2016 08:52:27.27]
> res1: org.apache.spark.sql.DataFrame = [result: string]
> s: org.apache.spark.sql.DataFrame = [AMOUNT_S
> ").collect.foreach(println)
> sys.exit
>
> *Results*
>
> Started at [24/02/2016 08:52:27.27]
> res1: org.apache.spark.sql.DataFrame = [result: string]
> s: org.apache.spark.sql.DataFrame = [AMOUNT_SOLD: decimal(10,0), TIME_ID:
> timestamp, CHANNEL_ID: bigint]
> c:
Yes
Regards
Sab
On 24-Feb-2016 9:15 pm, "Koert Kuipers" <ko...@tresata.com> wrote:
> are you saying that HiveContext.sql(...) runs on hive, and not on spark
> sql?
>
> On Wed, Feb 24, 2016 at 1:27 AM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com>
And yes, storage grows on demand. No issues with that.
Regards
Sab
On 24-Feb-2016 6:57 am, "Andy Davidson"
wrote:
> Currently our stream apps write results to hdfs. We are running into
> problems with HDFS becoming corrupted and running out of space. It seems
>
When using SQL your full query, including the joins, was executed in
Hive (or the RDBMS) and only the results were brought into the Spark cluster. In
the FP case, the data for the 3 tables is first pulled into the Spark
cluster and then the join is executed.
Thus the time difference.
It's not
Your logs are getting archived in your logs bucket in S3.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html
Regards
Sab
On Mon, Feb 22, 2016 at 12:14 PM, HARSH TAKKAR
wrote:
> Hi
>
> In am using an EMR cluster for running my
EMR does cost more than vanilla EC2. Using spark-ec2 can result in savings
with large clusters, though that is not everybody's cup of tea.
Regards
Sab
On 19-Feb-2016 7:55 pm, "Daniel Siegmann"
wrote:
> With EMR supporting Spark, I don't see much reason to use the
I don't think that's a good idea, even if it wasn't in Spark. I am trying
to understand the benefits you gain by separating.
I would rather use a Composite pattern approach wherein adding to the
composite cascades the additive operations to the children. Thereby your
foreach code doesn't have to
You can setup SSH tunneling.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html
Regards
Sab
On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot
wrote:
> Hi,
> I have hadoop cluster set up in EC2.
> I am unable to view application logs in
When running in YARN, you can use the YARN Resource Manager UI to get to
the ApplicationMaster URL, irrespective of client or cluster mode.
Regards
Sab
On 15-Feb-2016 10:10 am, "Divya Gehlot" wrote:
> Hi,
> I have Hortonworks 2.3.4 cluster on EC2 and Have spark jobs as
I believe you will gain more understanding if you look at or use
mapPartitions()
Regards
Sab
On 15-Feb-2016 8:38 am, "Christopher Brady"
wrote:
> I tried it without the cache, but it didn't change anything. The reason
> for the cache is that other actions will be
Looks like your executors are running out of memory. YARN is not kicking
them out. Just increase the executor memory. Also consider increasing
the parallelism, i.e. the number of partitions.
Regards
Sab
On 11-Feb-2016 5:46 am, "Nirav Patel" wrote:
> In Yarn we have
Make sure you are using an S3 bucket in the same region. Also, I would access my
bucket this way: s3n://bucketname/foldername.
You can test privileges using the s3 cmd line client.
Also, if you are using instance profiles you don't need to specify access
and secret keys. No harm in specifying though.
Yes you can look at using the capacity scheduler or the fair scheduler with
YARN. Both allow using full cluster when idle. And both allow considering
cpu plus memory when allocating resources which is sort of necessary with
Spark.
Regards
Sab
On 13-Feb-2016 10:11 pm, "Eugene Morozov"
The Hive context can be used instead of the SQL context even when you are
accessing data from non-Hive sources like MySQL or Postgres, for example. It has
better SQL support than the SQLContext as it uses the HiveQL parser.
Regards
Sab
On 15-Feb-2016 3:07 am, "Mich Talebzadeh" wrote:
You could always construct a new MatrixFactorizationModel with your
filtered set of user features and product features. I believe it's just a
stateless wrapper around the actual RDDs.
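Something along these lines (a sketch; filteredUsers and filteredProducts are hypothetical pre-filtered RDD[(Int, Array[Double])]):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

val filteredModel = new MatrixFactorizationModel(model.rank, filteredUsers, filteredProducts)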
Regards
Sab
On Wed, Feb 3, 2016 at 5:28 AM, Roberto Pagliari
wrote:
> When using
The most efficient way to determine the number of columns would be to do a
take(1) and split in the driver.
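For example, something like this (assuming a comma-delimited RDD of strings):

// look at just the first line on the driver to count the columns
val numColumns = rdd.take(1).head.split(",", -1).length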
Regards
Sab
On 19-Jan-2016 8:48 pm, "Richard Siebeling" wrote:
> Hi,
>
> what is the most efficient way to split columns and know how many columns
> are created.
>
>
You could generate as many duplicates as you need, each with a tag/sequence. And then use a
custom partitioner that uses that tag/sequence in addition to the key to do
the partitioning.
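A rough sketch of such a partitioner (assuming you turn each key into a (originalKey, tag) pair first):

import org.apache.spark.Partitioner

// both the original key and the tag drive the partition choice,
// so copies of the same key land in different partitions
class KeyAndTagPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case (k, tag: Int) => math.abs((k.hashCode * 31 + tag) % numPartitions)
    case other         => math.abs(other.hashCode % numPartitions)
  }
}

// usage (hypothetical): rdd.flatMap { case (k, v) => (0 until copies).map(t => ((k, t), v)) }
//   .partitionBy(new KeyAndTagPartitioner(numPartitions))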
Regards
Sab
On 12-Jan-2016 12:21 am, "Daniel Imberman"
wrote:
> Hi all,
>
> I'm looking for a way to
One option could be to store them as blobs in a cache like Redis and then
read + broadcast them from the driver. Or you could store them in HDFS and
read + broadcast from the driver.
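A sketch of the HDFS variant (path is hypothetical):

// read the blob on the driver side and broadcast it to all executors
val lookupData = sc.textFile("hdfs:///path/to/lookup").collect()
val lookupBc = sc.broadcast(lookupData)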
Regards
Sab
On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg wrote:
> We have a
If you are on EMR, these can go into your hdfs site config. And will work
with Spark on YARN by default.
Regards
Sab
On 11-Jan-2016 5:16 pm, "Krishna Rao" wrote:
> Hi all,
>
> Is there a method for reading from s3 without having to hard-code keys?
> The only 2 ways I've
You can just repartition by the id, if the final objective is to have all
data for the same key in the same partition.
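In 1.6 that would be something like this (column name is hypothetical):

// hash-partitions the DataFrame so rows with the same id land in the same partition
val partitioned = df.repartition(df("id"))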
Regards
Sab
On Wed, Jan 6, 2016 at 11:02 AM, Gavin Yue wrote:
> I found that in 1.6 dataframe could do repartition.
>
> Should I still need to do
collect() will bring everything to driver and is costly. Instead of using
collect() + parallelize, you could use rdd1.checkpoint() along with a more
efficient action like rdd1.count(). This you can do within the for loop.
Hopefully you are using the Kryo serializer already.
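A minimal sketch of that pattern (checkpoint directory, rdd1 and the per-iteration step are placeholders):

sc.setCheckpointDir("hdfs:///tmp/checkpoints")

var rdd1 = sc.textFile("hdfs:///input")
for (i <- 1 to 10) {
  rdd1 = rdd1.map(line => line)  // your per-iteration transformation goes here
  rdd1.checkpoint()              // truncates the lineage
  rdd1.count()                   // cheap action that materializes the checkpoint
}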
Regards
Sab
On Mon,
If all you want to do is to load data into Hive, you don't need to use
Spark.
For subsequent query performance you would want to convert to ORC or
Parquet when loading into Hive.
Regards
Sab
On 16-Dec-2015 7:34 am, "Divya Gehlot" wrote:
> Hi,
> I am new bee to Spark
#2: If using HDFS, it's on the disks. You can use the HDFS command line to
browse your data. And then use s3distcp or simply distcp to copy data from
HDFS to S3. Or even use hdfs get commands to copy to local disk and then
use the S3 CLI to copy to S3.
#3: Cost of accessing data in S3 from EC2 nodes,
You could use the number of input files to determine the number of output
partitions. This assumes your input file sizes are deterministic.
Else, you could also persist the RDD and then determine its size using the
APIs.
Regards
Sab
On 26-Nov-2015 11:13 pm, "Nezih Yigitbasi"
First of all, select * is not a useful SQL to evaluate. Very rarely would a
user require all 362K records for visual analysis.
Second, collect() forces movement of all data from executors to the driver.
Instead write it out to some other table or to HDFS.
Also Spark is more beneficial when you
If YARN has only 50 cores then it can support at most 49 executors plus 1
driver/application master.
Regards
Sab
On 24-Nov-2015 1:58 pm, "谢廷稳" wrote:
> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>
> I have ran it again, the command to run it is:
> ./bin/spark-submit
Those are empty partitions. I don't see the number of partitions specified
in code. That then implies the default parallelism config is being used and
is set to a very high number, the sum of empty + non empty files.
Regards
Sab
On 21-Nov-2015 11:59 pm, "Andy Davidson"
The stack trace is clear enough:
Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
protocol method: #method(reply-code=406,
reply-text=PRECONDITION_FAILED - inequivalent arg 'durable' for queue
'hello1' in vhost '/': received 'true' but current is 'false', class-id=50,
You can also do rdd.toJavaRDD(). Pls check the API docs
Regards
Sab
On 18-Nov-2015 3:12 am, "Bryan Cutler" wrote:
> Hi Ivan,
>
> Since Spark 1.4.1 there is a Java-friendly function in LDAModel to get the
> topic distributions called javaTopicDistributions() that returns a
>
Spark will use the number of executors you specify in spark-submit. Are you
saying that Spark is not able to use more executors after you modify it in
spark-submit? Are you using dynamic allocation?
Regards
Sab
On Mon, Nov 16, 2015 at 5:54 PM, dineshranganathan <
dineshranganat...@gmail.com>
You can write your own custom partitioner to achieve this
Regards
Sab
On 17-Nov-2015 1:11 am, "prateek arora" wrote:
> Hi
>
> I have an RDD with 30 records (key/value pairs) and am running 30 executors. I
> want to repartition this RDD into 30 partitions so every
How are you writing it out? Can you post some code?
Regards
Sab
On 14-Nov-2015 5:21 am, "Rok Roskar" wrote:
> I'm not sure what you mean? I didn't do anything specifically to partition
> the columns
> On Nov 14, 2015 00:38, "Davies Liu" wrote:
>
>>
You are probably trying to access the Spring context from the executors
after initializing it at the driver, and running into serialization issues.
You could instead use mapPartitions() and initialize the Spring context
from within that.
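A rough sketch of the mapPartitions() approach (assuming a beans.xml on the classpath; MyService and the bean name are hypothetical):

import org.springframework.context.support.ClassPathXmlApplicationContext

val processed = input.mapPartitions { iter =>
  // initialize the Spring context on the executor, once per partition
  val ctx = new ClassPathXmlApplicationContext("beans.xml")
  val service = ctx.getBean("myService").asInstanceOf[MyService]
  iter.map(record => service.process(record))
}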
That said I don't think that will solve all of your issues
We have done this by blocking but without using BlockMatrix. We used our
own blocking mechanism because BlockMatrix didn't exist in Spark 1.2. What
is the size of your block? How much memory are you giving to the executors?
I assume you are running on YARN, if so you would want to make sure your
The reserved cores are to prevent starvation, so that user B can run jobs
when user A's job is already running and using almost all of the cluster.
You can change your scheduler configuration to use more cores.
Regards
Sab
On 13-Nov-2015 6:56 pm, "Parin Choganwala" wrote:
>
I have seen some failures in our workloads with Kryo; one I remember is a
scenario with very large arrays. We could not get Kryo to work despite
using the different configuration properties. Switching to the Java serializer was
what worked.
Regards
Sab
On Tue, Nov 10, 2015 at 11:43 AM, Hitoshi Ozawa
Theoretically the executor is a long-lived container. So you could use some
simple caching library or a simple singleton to cache the data in your
executors, once they load it from MySQL. But note that with lots of
executors you might choke your MySQL.
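A simple sketch of the singleton approach (JDBC URL, credentials and query are hypothetical):

import java.sql.DriverManager
import scala.collection.mutable

object LookupCache {
  // lazily loaded once per executor JVM, then reused across tasks
  lazy val data: Map[Int, String] = {
    val conn = DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass")
    try {
      val rs = conn.createStatement().executeQuery("SELECT id, name FROM lookup")
      val m = mutable.Map[Int, String]()
      while (rs.next()) m += rs.getInt("id") -> rs.getString("name")
      m.toMap
    } finally conn.close()
  }
}

// usage inside a transformation: rdd.map(id => LookupCache.data.getOrElse(id, "unknown"))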
Regards
Sab
On 05-Nov-2015 7:03 pm, "Kay-Uwe
Qubole uses YARN.
Regards
Sab
On 06-Nov-2015 8:31 am, "Jerry Lam" <chiling...@gmail.com> wrote:
> Does Qubole use Yarn or Mesos for resource management?
>
> Sent from my iPhone
>
> > On 5 Nov, 2015, at 9:02 pm, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
> >
> > Qubole
>
Qubole is one option where you can use spot instances and get a couple of other
benefits. We use Qubole at Manthan for our Spark workloads.
For ensuring all the nodes are ready, you could use the
spark.scheduler.minRegisteredResourcesRatio config property to ensure the execution
doesn't start till the requisite containers
If you are writing to S3, also make sure that you are using the direct
output committer. I don't have streaming jobs but it helps in my machine
learning jobs. Also, though more partitions help in processing faster, they
do slow down writes to S3. So you might want to coalesce before writing to
S3.
Hi
You cannot use a PairRDD but you can use a JavaRDD. So in your case, to
make it work with the least change, you would call run(transactions.values()).
Each MLlib implementation typically has its own data structure and you
would have to convert from your data structure before you invoke. For example, if
you
Did you try sc.binaryFiles(), which gives you an RDD of
PortableDataStream that wraps the underlying bytes?
On Tue, Oct 27, 2015 at 10:23 PM, Balachandar R.A. wrote:
> Hello,
>
>
> I have developed a hadoop based solution that process a binary file. This
>
How many rows are you joining? How many rows in the output?
Regards
Sab
On 24-Oct-2015 2:32 am, "pratik khadloya" wrote:
> Actually the groupBy is not taking a lot of time.
> The join that i do later takes the most (95 %) amount of time.
> Also, the grouping i am doing is
You can point to your custom HADOOP_CONF_DIR in your spark-env.sh
Regards
Sab
On 01-Oct-2015 5:22 pm, "Vinoth Sankar" wrote:
> Hi,
>
> I'm new to Spark. For my application I need to overwrite Hadoop
> configurations (Can't change Configurations in Hadoop as it might affect
A little caution is needed as one executor per node may not always be ideal,
especially when your nodes have lots of RAM. But yes, using a smaller number of
executors has benefits like more efficient broadcasts.
Regards
Sab
On 24-Sep-2015 2:57 pm, "Adrian Tanase" wrote:
> RE: # because
If by Java class objects you mean your custom Java objects, then yes, of course
that will work.
Regards
Sab
On 24-Sep-2015 3:36 pm, "Ramkumar V" wrote:
> Hi,
>
> I want to know whether grouping by java class objects is possible or not
> in java Spark.
>
> I have Tuple2<
la to java. To make it
>> work, we have to define 'rdd' as JavaRDD<Tuple2<Tuple2<Object,
>> Object>, Matrix>>
>>
>> As Yanbo has mentioned, I think a Java friendly constructor is still in
>> demand.
>>
>> 2015-09-23 13:14 GMT+08:00 Pulasthi
You can't obtain that from the model. But you can always ask the model to
predict the cluster for your vectors by calling predict().
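For example (vectors being your RDD[Vector] of input points):

// assigns each vector to its nearest cluster and returns the cluster index
val clusterIndices = model.predict(vectors)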
Regards
Sab
On Wed, Sep 23, 2015 at 7:24 PM, Tapan Sharma
wrote:
> Hi All,
>
> In the KMeans example provided under mllib, it
Hi Pulasthi
You can always use JavaRDD.rdd() to get the scala rdd. So in your case,
new BlockMatrix(rdd.rdd(), 2, 2)
should work.
Regards
Sab
On Tue, Sep 22, 2015 at 10:50 PM, Pulasthi Supun Wickramasinghe <
pulasthi...@gmail.com> wrote:
> Hi Yanbo,
>
> Thanks for the reply. I thought i
How many executors are you using when using Spark SQL?
On Fri, Aug 28, 2015 at 12:12 PM, Sabarish Sasidharan
sabarish.sasidha...@manthan.com wrote:
I see that you are not reusing the same mapper instance in the Scala
snippet.
Regards
Sab
On Fri, Aug 28, 2015 at 9:38 AM, Gavin Yue
MEMORY_ONLY will simply skip caching (and later recompute) partitions that don't fit in
memory, whereas MEMORY_AND_DISK will spill them to disk.
Regards
Sab
On Tue, Aug 18, 2015 at 12:45 PM, Harsha HN 99harsha.h@gmail.com
wrote:
Hello Sparkers,
I would like to understand difference btw these Storage levels for a RDD
portion that doesn't
It is still in memory for future RDD transformations and actions. What you
get in the driver is a copy of the data.
Regards
Sab
On Tue, Aug 18, 2015 at 12:02 PM, praveen S mylogi...@gmail.com wrote:
When I do an rdd.collect(), does the data move back to the driver or is it still
held in memory across the
Are you running on top of YARN? Plus pls provide your infrastructure
details.
Regards
Sab
On 28-Jun-2015 8:47 am, Ayman Farahat ayman.fara...@yahoo.com.invalid
wrote:
Hello;
I tried to adjust the number of blocks by repartitioning the input.
Here is how I do it (I am partitioning by users):
Did you try increasing the perm gen for the driver?
Regards
Sab
On 24-Jun-2015 4:40 pm, Anders Arpteg arp...@spotify.com wrote:
When reading large (and many) datasets with the Spark 1.4.0 DataFrames
parquet reader (the org.apache.spark.sql.parquet format), the following
exceptions are thrown:
64GB in parquet could be many billions of rows because of the columnar
compression. And count distinct by itself is an expensive operation. This
is not just on Spark; even on Presto/Impala you would see performance dip
with count distincts. And the cluster is not that powerful either.
The one
Nick is right. I too have implemented it this way and it works just fine. In
my case, there can be even more products. You simply broadcast blocks of
products to userFeatures.mapPartitions() and BLAS-multiply in there to get
recommendations. In my case 10K products form one block. Note that you
would
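Anyway, a very rough sketch of that broadcast plus mapPartitions scoring (block contents are hypothetical; a plain dot product stands in for a tuned BLAS call):

// productBlock: Array[(Int, Array[Double])] of ~10K (productId, factors) pairs
val productBlockBc = sc.broadcast(productBlock)

val scores = userFeatures.mapPartitions { users =>
  val products = productBlockBc.value
  users.flatMap { case (userId, userVec) =>
    products.iterator.map { case (productId, productVec) =>
      var dot = 0.0
      var i = 0
      while (i < userVec.length) { dot += userVec(i) * productVec(i); i += 1 }
      (userId, (productId, dot))
    }
  }
}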
Whatever you write in bolts would be the logic you want to apply on your
events. In Spark, that logic would be coded in map() or similar such
transformations and/or actions. Spark doesn't enforce a structure for
capturing your processing logic like Storm does.
Regards
Sab
Probably overloading the
My bad. This was an out-of-memory error disguised as something else.
Regards
Sab
On Sun, Mar 22, 2015 at 1:53 AM, Sabarish Sasidharan
sabarish.sasidha...@manthan.com wrote:
I am consistently running into this ArrayIndexOutOfBoundsException issue
when using trainImplicit. I have tried changing
Unlike a map(), wherein your task acts on a row at a time, with
mapPartitions() the task is passed the entire content of the partition in
an iterator. You can then return another iterator as the output. I
don't do Scala, but from what I understand from your code snippet... The
iterator
clusters of similar rows with euclidean
distance.
Best,
Reza
On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan
sabarish.sasidha...@manthan.com wrote:
Hi Reza
I see that ((int, int), double) pairs are generated for any combination
that meets the criteria controlled
I am trying to compute column similarities on a 30x1000 RowMatrix of
DenseVectors. The size of the input RDD is 3.1MB and it's all in one
partition. I am running on a single node of 15G and giving the driver 1G
and the executor 9G. This is on a single-node Hadoop. In the first attempt
the
Sorry, I actually meant 30 x 1 matrix (missed a 0)
Regards
Sab
Hi Reza
I see that ((int, int), double) pairs are generated for any combination
that meets the criteria controlled by the threshold. But assuming a simple
1x10K matrix, that means I would need at least 12GB memory per executor for
the flat map, just for these pairs, excluding any other overhead.