Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sabarish Sasidharan
@Sachin >>The partition key is very important if you need to run multiple instances of the streams application, with certain instances processing certain partitions only. Again, depending on the partition key is optional. It's actually a feature enabler, so we can use local state stores to improve…

Re: Kafka streams vs Spark streaming

2017-10-11 Thread Sabarish Sasidharan
@Sachin >>is not elastic. You need to anticipate beforehand the volume of data you will have. It is very difficult to add or reduce topic partitions later on. Why do you say so, Sachin? Kafka Streams will readjust once we add more partitions to the Kafka topic. And when we add more machines,…

Re: Optimized way to use spark as db to hdfs etl

2016-11-06 Thread Sabarish Sasidharan
Please be aware that accumulators involve communication back to the driver and may not be efficient. I think the OP wants some way to extract the stats from the SQL plan, if they are being stored in some internal data structure. Regards Sab On 5 Nov 2016 9:42 p.m., "Deepak Sharma"

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Sabarish Sasidharan
Can't you just reduce the amount of data you insert by applying a filter, so that only a small set of idpartitions is selected? You could have multiple such inserts to cover all the idpartitions. Does that help? Regards Sab On 22 May 2016 1:11 pm, "swetha kasireddy" wrote:
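
A minimal sketch of that suggestion, assuming hypothetical `events_staging` and `events` tables partitioned by `idpartition` (none of these names are from the thread): each statement inserts only a small batch of partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object PartitionedInsertSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partitioned-insert-sketch"))
    val hc = new HiveContext(sc)
    // Dynamic partition inserts need these set; assumes the partition column is
    // the last column selected from the staging table.
    hc.setConf("hive.exec.dynamic.partition", "true")
    hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    // Insert a few idpartitions per statement instead of all of them at once.
    val batches = Seq(Seq(1, 2, 3), Seq(4, 5, 6))
    batches.foreach { ids =>
      hc.sql(
        s"""INSERT INTO TABLE events PARTITION (idpartition)
           |SELECT * FROM events_staging
           |WHERE idpartition IN (${ids.mkString(",")})""".stripMargin)
    }
    sc.stop()
  }
}
```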

Re: reading file from S3

2016-03-15 Thread Sabarish Sasidharan
…and of course something that Amazon strongly suggests that we do not use. Please use roles and you will not have to worry about security. Regards, Gourav Sengupta. On Tue, Mar 15, 2016 at 2:38 PM, Sabarish Sasidharan <sabarish@gmail.com> wrote: …

Re: reading file from S3

2016-03-15 Thread Sabarish Sasidharan
You have a slash before the bucket name; it should be an @ immediately before the bucket name. Regards Sab On 15-Mar-2016 4:03 pm, "Yasemin Kaya" wrote: > Hi, I am using Spark 1.6.0 standalone and I want to read a txt file from an S3 bucket named yasemindeneme; my file name is deneme.txt. But I am getting >…
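
For illustration, a minimal spark-shell sketch of the URI shape being discussed (keys are placeholders, and `sc` is the shell's SparkContext): the credentials are followed directly by `@bucket`, with no slash before the bucket name.

```scala
// Wrong:  s3n://ACCESS_KEY:SECRET_KEY/yasemindeneme/deneme.txt  (slash before the bucket)
// Right:  credentials, then @, then the bucket name
val path = "s3n://ACCESS_KEY:SECRET_KEY@yasemindeneme/deneme.txt"
val lines = sc.textFile(path)
println(lines.count())
```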

Re: Compress individual RDD

2016-03-15 Thread Sabarish Sasidharan
It will compress only RDDs persisted with serialization enabled (the _SER storage levels). So you could skip the _SER modes for your other RDDs. Not perfect, but something. On 15-Mar-2016 4:33 pm, "Nirav Patel" wrote: > Hi, > I see that there's the following spark config to compress an RDD.
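
A minimal sketch of that combination: spark.rdd.compress only applies to serialized persistence, so the RDD you want compressed gets a _SER level and the others keep plain MEMORY_ONLY. Paths and names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("rdd-compress-sketch")
  .set("spark.rdd.compress", "true")   // compresses serialized RDD partitions only
val sc = new SparkContext(conf)

val big = sc.textFile("hdfs:///data/big")      // hypothetical input
big.persist(StorageLevel.MEMORY_ONLY_SER)      // serialized, so it gets compressed

val small = sc.parallelize(1 to 100)
small.persist(StorageLevel.MEMORY_ONLY)        // deserialized, left uncompressed
```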

Re: Hive Query on Spark fails with OOM

2016-03-15 Thread Sabarish Sasidharan
…PM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote: >>> It is a Spark-SQL and the version used is Spark-1.2.1. >>> On Mon, Mar 14, 2016 at 2:16 PM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote: …

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Sabarish Sasidharan
…caching anything. You could simply swap the fractions in your case. Regards Sab On Mon, Mar 14, 2016 at 2:20 PM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote: > It is a Spark-SQL and the version used is Spark-1.2.1. > On Mon, Mar 14, 2016 at 2:16 PM, Sabarish Sasidharan <…

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Sabarish Sasidharan
…> http://talebzadehmich.wordpress.com > On 14 March 2016 at 08:06, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote: >> Which version of Spark are you using? The configuration varies by version. >> Regards >> Sab…

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Sabarish Sasidharan
Which version of Spark are you using? The configuration varies by version. Regards Sab On Mon, Mar 14, 2016 at 10:53 AM, Prabhu Joseph wrote: > Hi All, > > A Hive Join query which runs fine and faster in MapReduce takes lot of > time with Spark and finally fails

Re: Zeppelin Integration

2016-03-10 Thread Sabarish Sasidharan
I believe you need to co-locate your Zeppelin on the same node where Spark is installed. You need to specify the SPARK HOME. The master I used was YARN. Zeppelin exposes a notebook interface. A notebook can have many paragraphs. You run the paragraphs. You can mix multiple contexts in the same

Re: S3 Zip File Loading Advice

2016-03-09 Thread Sabarish Sasidharan
You can use S3's listKeys API and do a diff between consecutive listKeys to identify what's new. Are there multiple files in each zip? Single file archives are processed just like text as long as it is one of the supported compression formats. Regards Sab On Wed, Mar 9, 2016 at 10:33 AM,

Re: Spark on Windows platform

2016-03-01 Thread Sabarish Sasidharan
If all you want is Spark standalone, then it's as simple as installing the binaries and calling spark-submit with your main class. I would advise against running Spark on Hadoop on Windows; it's a bit of trouble. But yes, you can do it if you want to. Regards Sab On 29-Feb-2016 6:58 pm,

Re: DataSet Evidence

2016-03-01 Thread Sabarish Sasidharan
BeanInfo? On 01-Mar-2016 6:25 am, "Steve Lewis" wrote: > I have a relatively complex Java object that I would like to use in a > dataset > > if I say > > Encoder evidence = Encoders.kryo(MyType.class); > > JavaRDD rddMyType= generateRDD(); // some code > > Dataset

Re: Spark for client

2016-02-29 Thread Sabarish Sasidharan
Zeppelin? Regards Sab On 01-Mar-2016 12:27 am, "Mich Talebzadeh" wrote: > Hi, > > Is there such thing as Spark for client much like RDBMS client that have > cut down version of their big brother useful for client connectivity but > cannot be used as server. > > Thanks

Re: .cache() changes contents of RDD

2016-02-27 Thread Sabarish Sasidharan
This is because Hadoop writables are being reused. Just map it to some custom type and then do further operations including cache() on it. Regards Sab On 27-Feb-2016 9:11 am, "Yan Yang" wrote: > Hi > > I am pretty new to Spark, and after experimentation on our pipelines. I
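
A minimal sketch of that fix, assuming a spark-shell session (`sc`) and a Hadoop input read via hadoopFile: copy the reused Writables into plain objects before caching.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/in")  // hypothetical path
// The same Writable instances are reused for every record, so materialize copies
// before cache(); afterwards each cached record owns its own data.
val safe = raw.map { case (offset, text) => (offset.get, text.toString) }
safe.cache()
println(safe.count())
```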

Re: ALS trainImplicit performance

2016-02-25 Thread Sabarish Sasidharan
I have tested up to 3 billion. ALS scales; you just need to scale your cluster accordingly. More than building the model, it's getting the final recommendations that won't scale as nicely, especially when the number of products is huge. This is the case when you are generating recommendations in a…

Re: Multiple user operations in spark.

2016-02-25 Thread Sabarish Sasidharan
I don't have a proper answer to this. But to circumvent it, if you have 2 independent Spark jobs, you could update one while the other is serving reads. It's still not scalable for incessant updates, though. Regards Sab On 25-Feb-2016 7:19 pm, "Udbhav Agarwal" wrote: > Hi, >

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Sabarish Sasidharan
Like Robin said, please explore Pregel. You could do it without Pregel but it might be laborious. I have a simple outline below. You will need more iterations if the number of levels is higher. Edges: a-b, b-c, c-d, b-e, e-f, f-c. flatMapToPair: a -> (a-b), b -> (a-b), b -> (b-c), c -> (b-c), c -> (c-d), d -> (c-d)
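
A minimal Scala sketch of just the first step of that outline (a plain flatMap standing in for the Java flatMapToPair, `sc` assumed from spark-shell): each edge is emitted under both of its endpoints, and later iterations would join or group on the vertex key.

```scala
val edges = sc.parallelize(Seq(("a", "b"), ("b", "c"), ("c", "d"), ("b", "e"), ("e", "f"), ("f", "c")))
val vertexToEdge = edges.flatMap { case (u, v) =>
  Seq((u, (u, v)), (v, (u, v)))   // a -> (a-b), b -> (a-b), b -> (b-c), ...
}
val byVertex = vertexToEdge.groupByKey()   // all edges touching each vertex
byVertex.collect().foreach(println)
```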

Re: which is a more appropriate form of ratings ?

2016-02-25 Thread Sabarish Sasidharan
I believe the ALS algo expects the ratings to be aggregated (A). I don't see why you have to use decimals for rating. Regards Sab On Thu, Feb 25, 2016 at 4:50 PM, Hiroyuki Yamada wrote: > Hello. > > I just started working on CF in MLlib. > I am using trainImplicit because I

Re: Execution plan in spark

2016-02-24 Thread Sabarish Sasidharan
There is no execution plan for FP; an execution plan exists only for SQL. Regards Sab On 24-Feb-2016 2:46 pm, "Ashok Kumar" wrote: > Gurus, > Is there anything like explain in Spark to see the execution plan in functional programming? > warm regards >

Re: Using functional programming rather than SQL

2016-02-24 Thread Sabarish Sasidharan
…ntext does not need a hive instance to run. > On Feb 24, 2016 19:15, "Sabarish Sasidharan" <sabarish.sasidha...@manthan.com> wrote: >> Yes >> Regards >> Sab >> On 24-Feb-2016 9:15 pm, "Koert Kuipers" <ko...@tresata.com>…

Re: Using Spark functional programming rather than SQL, Spark on Hive tables

2016-02-24 Thread Sabarish Sasidharan
("SELECT > FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') > ").collect.foreach(println) > sys.exit > > *Results* > > Started at [24/02/2016 08:52:27.27] > res1: org.apache.spark.sql.DataFrame = [result: string] > s: org.apache.spark.sql.DataFrame = [AMOUNT_S

Re: Using Spark functional programming rather than SQL, Spark on Hive tables

2016-02-24 Thread Sabarish Sasidharan
> ").collect.foreach(println) > sys.exit > > *Results* > > Started at [24/02/2016 08:52:27.27] > res1: org.apache.spark.sql.DataFrame = [result: string] > s: org.apache.spark.sql.DataFrame = [AMOUNT_SOLD: decimal(10,0), TIME_ID: > timestamp, CHANNEL_ID: bigint] > c:

Re: Using functional programming rather than SQL

2016-02-24 Thread Sabarish Sasidharan
Yes Regards Sab On 24-Feb-2016 9:15 pm, "Koert Kuipers" <ko...@tresata.com> wrote: > are you saying that HiveContext.sql(...) runs on hive, and not on spark > sql? > > On Wed, Feb 24, 2016 at 1:27 AM, Sabarish Sasidharan < > sabarish.sasidha...@manthan.com>

Re: streaming spark is writing results to S3 a good idea?

2016-02-23 Thread Sabarish Sasidharan
And yes, storage grows on demand. No issues with that. Regards Sab On 24-Feb-2016 6:57 am, "Andy Davidson" wrote: > Currently our stream apps write results to hdfs. We are running into > problems with HDFS becoming corrupted and running out of space. It seems >

Re: Using functional programming rather than SQL

2016-02-23 Thread Sabarish Sasidharan
When using SQL, your full query, including the joins, was executed in Hive (or the RDBMS) and only the results were brought into the Spark cluster. In the FP case, the data for the 3 tables is first pulled into the Spark cluster and then the join is executed there. Hence the time difference. It's not…

Re: [Please Help] Log redirection on EMR

2016-02-21 Thread Sabarish Sasidharan
Your logs are getting archived in your logs bucket in S3. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html Regards Sab On Mon, Feb 22, 2016 at 12:14 PM, HARSH TAKKAR wrote: > Hi > > In am using an EMR cluster for running my

Re: Is this likely to cause any problems?

2016-02-19 Thread Sabarish Sasidharan
EMR does cost more than vanilla EC2. Using spark-ec2 can result in savings with large clusters, though that is not everybody's cup of tea. Regards Sab On 19-Feb-2016 7:55 pm, "Daniel Siegmann" wrote: > With EMR supporting Spark, I don't see much reason to use the

Re: Running multiple foreach loops

2016-02-17 Thread Sabarish Sasidharan
I don't think that's a good idea, even outside Spark. I am trying to understand the benefits you gain by separating them. I would rather use a Composite pattern approach, wherein adding to the composite cascades the additive operations to the children. That way your foreach code doesn't have to…

Re: Need help :Does anybody has HDP cluster on EC2?

2016-02-15 Thread Sabarish Sasidharan
You can setup SSH tunneling. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html Regards Sab On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot wrote: > Hi, > I have hadoop cluster set up in EC2. > I am unable to view application logs in

Re: which master option to view current running job in Spark UI

2016-02-14 Thread Sabarish Sasidharan
When running in YARN, you can use the YARN Resource Manager UI to get to the ApplicationMaster url, irrespective of client or cluster mode. Regards Sab On 15-Feb-2016 10:10 am, "Divya Gehlot" wrote: > Hi, > I have Hortonworks 2.3.4 cluster on EC2 and Have spark jobs as

Re: coalesce and executor memory

2016-02-14 Thread Sabarish Sasidharan
I believe you will gain more understanding if you look at or use mapPartitions() Regards Sab On 15-Feb-2016 8:38 am, "Christopher Brady" wrote: > I tried it without the cache, but it didn't change anything. The reason > for the cache is that other actions will be

Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-14 Thread Sabarish Sasidharan
Looks like your executors are running out of memory. YARN is not kicking them out. Just increase the executor memory. Also consider increasing the parallelism, i.e. the number of partitions. Regards Sab On 11-Feb-2016 5:46 am, "Nirav Patel" wrote: > In Yarn we have

Re: newbie unable to write to S3 403 forbidden error

2016-02-14 Thread Sabarish Sasidharan
Make sure you are using an S3 bucket in the same region. Also, I would access my bucket this way: s3n://bucketname/foldername. You can test privileges using the S3 command line client. Also, if you are using instance profiles you don't need to specify access and secret keys. No harm in specifying them though.

Re: Best practises of share Spark cluster over few applications

2016-02-14 Thread Sabarish Sasidharan
Yes you can look at using the capacity scheduler or the fair scheduler with YARN. Both allow using full cluster when idle. And both allow considering cpu plus memory when allocating resources which is sort of necessary with Spark. Regards Sab On 13-Feb-2016 10:11 pm, "Eugene Morozov"

RE: Trying to join a registered Hive table as temporary with two Oracle tables registered as temporary in Spark

2016-02-14 Thread Sabarish Sasidharan
The Hive context can be used instead of sql context even when you are accessing data from non-Hive sources like mysql or postgres for ex. It has better sql support than the sqlcontext as it uses the HiveQL parser. Regards Sab On 15-Feb-2016 3:07 am, "Mich Talebzadeh" wrote:

Re: recommendProductsForUser for a subset of user

2016-02-03 Thread Sabarish Sasidharan
You could always construct a new MatrixFactorizationModel with your filtered set of user features and product features. I believe it's just a stateless wrapper around the actual RDDs. Regards Sab On Wed, Feb 3, 2016 at 5:28 AM, Roberto Pagliari wrote: > When using
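
A minimal sketch of that idea, assuming an already trained `model: MatrixFactorizationModel` and a spark-shell `sc` (the id set is hypothetical): wrap the filtered user factors in a new model and recommend only for those users.

```scala
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

val userIdsToKeep = Set(1, 42, 99)                      // hypothetical subset of users
val keepB = sc.broadcast(userIdsToKeep)
val filteredUsers = model.userFeatures.filter { case (id, _) => keepB.value.contains(id) }

// The model is just a wrapper around the factor RDDs, so this is cheap to build.
val subModel = new MatrixFactorizationModel(model.rank, filteredUsers, model.productFeatures)
val recs = subModel.recommendProductsForUsers(10)       // top 10 products per remaining user
```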

Re: Split columns in RDD

2016-01-19 Thread Sabarish Sasidharan
The most efficient way to determine the number of columns would be to do a take(1) and split in the driver. Regards Sab On 19-Jan-2016 8:48 pm, "Richard Siebeling" wrote: > Hi, > what is the most efficient way to split columns and know how many columns are created? >
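
A minimal sketch, assuming a comma-delimited text RDD and a spark-shell `sc`: pull one line to the driver to count the columns, then split the rest.

```scala
val lines = sc.textFile("hdfs:///data/input.csv")            // hypothetical path
val numColumns = lines.take(1).head.split(",", -1).length    // split in the driver, no full scan
val columns = lines.map(_.split(",", -1))
println(s"columns: $numColumns, rows: ${columns.count()}")
```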

Re:

2016-01-12 Thread Sabarish Sasidharan
You could generate as many duplicates as you need, each with a tag/sequence. And then use a custom partitioner that uses that tag/sequence in addition to the key to do the partitioning. Regards Sab On 12-Jan-2016 12:21 am, "Daniel Imberman" wrote: > Hi all, > I'm looking for a way to…
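
A minimal sketch of one way to do this (not the poster's code; names and partition counts are made up): key each duplicate by (originalKey, tag) and let the partitioner use the tag to spread copies of the same key across partitions.

```scala
import org.apache.spark.Partitioner

class TaggedPartitioner(partitions: Int, replicas: Int) extends Partitioner {
  require(partitions > 0 && replicas > 0)
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (k, tag: Int) =>
      val base = (k.hashCode % numPartitions + numPartitions) % numPartitions
      (base + tag % replicas) % numPartitions   // same key, different tag => different partition
    case other =>
      (other.hashCode % numPartitions + numPartitions) % numPartitions
  }
}

val replicas = 3
val data = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
val duplicated = data.flatMap { case (k, v) => (0 until replicas).map(tag => ((k, tag), v)) }
val spread = duplicated.partitionBy(new TaggedPartitioner(12, replicas))
```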

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Sabarish Sasidharan
One option could be to store them as blobs in a cache like Redis and then read + broadcast them from the driver. Or you could store them in HDFS and read + broadcast from the driver. Regards Sab On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg wrote: > We have a

Re: Read from AWS s3 with out having to hard-code sensitive keys

2016-01-11 Thread Sabarish Sasidharan
If you are on EMR, these can go into your hdfs site config. And will work with Spark on YARN by default. Regards Sab On 11-Jan-2016 5:16 pm, "Krishna Rao" wrote: > Hi all, > > Is there a method for reading from s3 without having to hard-code keys? > The only 2 ways I've

Re: How to concat few rows into a new column in dataframe

2016-01-06 Thread Sabarish Sasidharan
You can just repartition by the id, if the final objective is to have all data for the same key in the same partition. Regards Sab On Wed, Jan 6, 2016 at 11:02 AM, Gavin Yue wrote: > I found that in 1.6 dataframe could do repartition. > > Should I still need to do
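
A short sketch against a hypothetical DataFrame `df` with an `id` column (Spark 1.6+): repartitioning on the column hashes all rows with the same id into the same partition.

```scala
import org.apache.spark.sql.functions.col

val byId = df.repartition(col("id"))          // default number of shuffle partitions
val byId200 = df.repartition(200, col("id"))  // or with an explicit partition count
```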

Re: [Beg for help] spark job with very low efficiency

2015-12-21 Thread Sabarish Sasidharan
collect() will bring everything to the driver and is costly. Instead of using collect() + parallelize, you could use rdd1.checkpoint() along with a more efficient action like rdd1.count(). This you can do within the for loop. Hopefully you are using the Kryo serializer already. Regards Sab On Mon,
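
A minimal sketch of that suggestion, with a made-up transformation and checkpoint directory: truncate the lineage inside the loop with checkpoint() plus a cheap action, instead of collect() followed by parallelize().

```scala
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // hypothetical directory
var rdd1 = sc.parallelize(1 to 1000000)
for (i <- 1 to 10) {
  rdd1 = rdd1.map(_ + 1)                         // stand-in for the real per-iteration work
  if (i % 3 == 0) {
    rdd1.checkpoint()
    rdd1.count()                                 // cheap action that materializes the checkpoint
  }
}
```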

Re: Pros and cons -Saving spark data in hive

2015-12-15 Thread Sabarish Sasidharan
If all you want to do is to load data into Hive, you don't need to use Spark. For subsequent query performance you would want to convert to ORC or Parquet when loading into Hive. Regards Sab On 16-Dec-2015 7:34 am, "Divya Gehlot" wrote: > Hi, > I am new bee to Spark

Re: newbie best practices: is spark-ec2 intended to be used to manage long-lasting infrastructure ?

2015-12-04 Thread Sabarish Sasidharan
#2: if using hdfs it's on the disks. You can use the HDFS command line to browse your data. And then use s3distcp or simply distcp to copy data from hdfs to S3. Or even use hdfs get commands to copy to local disk and then use S3 cli to copy to s3 #3. Cost of accessing data in S3 from Ec2 nodes,

Re: question about combining small parquet files

2015-11-30 Thread Sabarish Sasidharan
You could use the number of input files to determine the number of output partitions. This assumes your input file sizes are deterministic. Else, you could also persist the RDD and then determine its size using the APIs. Regards Sab On 26-Nov-2015 11:13 pm, "Nezih Yigitbasi"

Re: Conversely, Hive is performing better than Spark-Sql

2015-11-24 Thread Sabarish Sasidharan
First of all, select * is not a useful SQL to evaluate. Very rarely would a user require all 362K records for visual analysis. Second, collect() forces movement of all data from executors to the driver. Instead write it out to some other table or to HDFS. Also Spark is more beneficial when you

Re: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Alloction

2015-11-24 Thread Sabarish Sasidharan
If yarn has only 50 cores then it can support max 49 executors plus 1 driver application master. Regards Sab On 24-Nov-2015 1:58 pm, "谢廷稳" wrote: > OK, yarn.scheduler.maximum-allocation-mb is 16384. > > I have ran it again, the command to run it is: > ./bin/spark-submit

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-21 Thread Sabarish Sasidharan
Those are empty partitions. I don't see the number of partitions specified in code. That then implies the default parallelism config is being used and is set to a very high number, the sum of empty + non empty files. Regards Sab On 21-Nov-2015 11:59 pm, "Andy Davidson"

Re: Reading from RabbitMq via Apache Spark Streaming

2015-11-19 Thread Sabarish Sasidharan
The stack trace is clear enough: Caused by: com.rabbitmq.client.ShutdownSignalException: channel error; protocol method: #method(reply-code=406, reply-text=PRECONDITION_FAILED - inequivalent arg 'durable' for queue 'hello1' in vhost '/': received 'true' but current is 'false', class-id=50,

Re: Working with RDD from Java

2015-11-17 Thread Sabarish Sasidharan
You can also do rdd.toJavaRDD(). Please check the API docs. Regards Sab On 18-Nov-2015 3:12 am, "Bryan Cutler" wrote: > Hi Ivan, > Since Spark 1.4.1 there is a Java-friendly function in LDAModel to get the topic distributions called javaTopicDistributions() that returns a >

Re: Spark Expand Cluster

2015-11-16 Thread Sabarish Sasidharan
Spark will use the number of executors you specify in spark-submit. Are you saying that Spark is not able to use more executors after you modify it in spark-submit? Are you using dynamic allocation? Regards Sab On Mon, Nov 16, 2015 at 5:54 PM, dineshranganathan < dineshranganat...@gmail.com>

Re: how can evenly distribute my records in all partition

2015-11-16 Thread Sabarish Sasidharan
You can write your own custom partitioner to achieve this. Regards Sab On 17-Nov-2015 1:11 am, "prateek arora" wrote: > Hi > I have an RDD with 30 records (key/value pairs) and am running 30 executors. I want to repartition this RDD into 30 partitions so every…

Re: very slow parquet file write

2015-11-14 Thread Sabarish Sasidharan
How are you writing it out? Can you post some code? Regards Sab On 14-Nov-2015 5:21 am, "Rok Roskar" wrote: > I'm not sure what you mean? I didn't do anything specifically to partition > the columns > On Nov 14, 2015 00:38, "Davies Liu" wrote: > >>

Re: Spark and Spring Integrations

2015-11-14 Thread Sabarish Sasidharan
You are probably trying to access the spring context from the executors after initializing it at the driver. And running into serialization issues. You could instead use mapPartitions() and initialize the spring context from within that. That said I don't think that will solve all of your issues

Re: large, dense matrix multiplication

2015-11-13 Thread Sabarish Sasidharan
We have done this by blocking but without using BlockMatrix. We used our own blocking mechanism because BlockMatrix didn't exist in Spark 1.2. What is the size of your block? How much memory are you giving to the executors? I assume you are running on YARN, if so you would want to make sure your

Re: Stack Overflow Question

2015-11-13 Thread Sabarish Sasidharan
The reserved cores are to prevent starvation, so that user B can run jobs when user A's job is already running and using almost all of the cluster. You can change your scheduler configuration to use more cores. Regards Sab On 13-Nov-2015 6:56 pm, "Parin Choganwala" wrote: >

Re: Why is Kryo not the default serializer?

2015-11-09 Thread Sabarish Sasidharan
I have seen some failures in our workloads with Kryo, one I remember is a scenario with very large arrays. We could not get Kryo to work despite using the different configuration properties. Switching to java serde was what worked. Regards Sab On Tue, Nov 10, 2015 at 11:43 AM, Hitoshi Ozawa

Re: How to use data from Database and reload every hour

2015-11-05 Thread Sabarish Sasidharan
Theoretically the executor is a long-lived container. So you could use some simple caching library or a simple singleton to cache the data in your executors, once they load it from MySQL. But note that with lots of executors you might choke your MySQL. Regards Sab On 05-Nov-2015 7:03 pm, "Kay-Uwe
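
A minimal sketch of such an executor-side singleton (connection string, query and TTL are all placeholders, not from the thread): each executor JVM loads the lookup data lazily from MySQL and refreshes it after an hour.

```scala
import java.sql.DriverManager

object LookupCache {
  @volatile private var data: Map[String, String] = Map.empty
  @volatile private var loadedAt: Long = 0L
  private val ttlMs = 60 * 60 * 1000L

  def get: Map[String, String] = synchronized {
    if (data.isEmpty || System.currentTimeMillis() - loadedAt > ttlMs) {
      val conn = DriverManager.getConnection("jdbc:mysql://db-host/mydb", "user", "pass")
      try {
        val rs = conn.createStatement().executeQuery("SELECT k, v FROM lookup")
        val buf = scala.collection.mutable.Map[String, String]()
        while (rs.next()) buf += rs.getString("k") -> rs.getString("v")
        data = buf.toMap
        loadedAt = System.currentTimeMillis()
      } finally conn.close()
    }
    data
  }
}

// Used from executor-side code, e.g.
// rdd.mapPartitions { iter => val lookup = LookupCache.get; iter.map(k => (k, lookup.get(k))) }
```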

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Sabarish Sasidharan
Qubole uses yarn. Regards Sab On 06-Nov-2015 8:31 am, "Jerry Lam" <chiling...@gmail.com> wrote: > Does Qubole use Yarn or Mesos for resource management? > > Sent from my iPhone > > > On 5 Nov, 2015, at 9:02 pm, Sabarish Sasidharan < > sabarish.sasidha...@manthan.com> wrote: > > > > Qubole >

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Sabarish Sasidharan
Qubole is one option where you can use spot instances and get a couple of other benefits. We use Qubole at Manthan for our Spark workloads. For ensuring all the nodes are ready, you could use the spark.scheduler.minRegisteredResourcesRatio config property to ensure the execution doesn't start till the requisite containers…

RE: Spark/Kafka Streaming Job Gets Stuck

2015-10-29 Thread Sabarish Sasidharan
If you are writing to S3, also make sure that you are using the direct output committer. I don't have streaming jobs but it helps in my machine learning jobs. Also, though more partitions help in processing faster, they do slow down writes to S3. So you might want to coalesce before writing to S3.

Re: Running FPGrowth over a JavaPairRDD?

2015-10-29 Thread Sabarish Sasidharan
Hi You cannot use PairRDD but you can use JavaRDD. So in your case, to make it work with least change, you would call run(transactions.values()). Each MLLib implementation has its own data structure typically and you would have to convert from your data structure before you invoke. For ex if you

Re: Using Hadoop Custom Input format in Spark

2015-10-27 Thread Sabarish Sasidharan
Did you try sc.binaryFiles(), which gives you an RDD of PortableDataStream that wraps the underlying bytes? On Tue, Oct 27, 2015 at 10:23 PM, Balachandar R.A. wrote: > Hello, > > I have developed a hadoop based solution that process a binary file. This >
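
A minimal spark-shell sketch (`sc` assumed, path hypothetical): binaryFiles() yields (path, PortableDataStream) pairs, and the stream hands you the raw bytes for your own parser.

```scala
val binaries = sc.binaryFiles("hdfs:///data/binfiles")
val parsed = binaries.map { case (path, stream) =>
  val bytes: Array[Byte] = stream.toArray()   // whole file content for this record
  (path, bytes.length)                        // stand-in for real binary parsing
}
parsed.collect().foreach(println)
```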

Re: Huge shuffle data size

2015-10-24 Thread Sabarish Sasidharan
How many rows are you joining? How many rows in the output? Regards Sab On 24-Oct-2015 2:32 am, "pratik khadloya" wrote: > Actually the groupBy is not taking a lot of time. > The join that i do later takes the most (95 %) amount of time. > Also, the grouping i am doing is

Re: Calrification on Spark-Hadoop Configuration

2015-10-01 Thread Sabarish Sasidharan
You can point to your custom HADOOP_CONF_DIR in your spark-env.sh Regards Sab On 01-Oct-2015 5:22 pm, "Vinoth Sankar" wrote: > Hi, > > I'm new to Spark. For my application I need to overwrite Hadoop > configurations (Can't change Configurations in Hadoop as it might affect

Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-24 Thread Sabarish Sasidharan
A little caution is needed, as one executor per node may not always be ideal, especially when your nodes have lots of RAM. But yes, using a smaller number of executors has benefits like more efficient broadcasts. Regards Sab On 24-Sep-2015 2:57 pm, "Adrian Tanase" wrote: > RE: # because

Re: GroupBy Java objects in Java Spark

2015-09-24 Thread Sabarish Sasidharan
If by Java class objects you mean your custom Java objects, then yes, of course that will work. Regards Sab On 24-Sep-2015 3:36 pm, "Ramkumar V" wrote: > Hi, > I want to know whether grouping by java class objects is possible or not in java Spark. > I have Tuple2<

Re: Creating BlockMatrix with java API

2015-09-24 Thread Sabarish Sasidharan
…la to java. To make it work, we have to define 'rdd' as JavaRDD<Tuple2<Tuple2<Object, Object>, Matrix>>. As Yanbo has mentioned, I think a Java-friendly constructor is still in demand. 2015-09-23 13:14 GMT+08:00 Pulasthi…

Re: K Means Explanation

2015-09-23 Thread Sabarish Sasidharan
You can't obtain that from the model. But you can always ask the model which cluster each of your vectors falls in by calling predict(), and then look up the corresponding cluster center. Regards Sab On Wed, Sep 23, 2015 at 7:24 PM, Tapan Sharma wrote: > Hi All, > In the KMeans example provided under mllib, it…
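
A minimal sketch with toy data (`sc` from spark-shell): predict() gives the cluster index for each vector, and clusterCenters turns that into the actual center.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
val model = KMeans.train(data, 2, 20)          // k = 2, 20 iterations

val clusterIds = model.predict(data)           // RDD[Int], one cluster index per vector
val centers = model.clusterCenters             // Array[Vector] on the driver
val assigned = clusterIds.map(i => centers(i)) // the center each vector maps to
assigned.collect().foreach(println)
```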

Re: Creating BlockMatrix with java API

2015-09-22 Thread Sabarish Sasidharan
Hi Pulasthi You can always use JavaRDD.rdd() to get the scala rdd. So in your case, new BlockMatrix(rdd.rdd(), 2, 2) should work. Regards Sab On Tue, Sep 22, 2015 at 10:50 PM, Pulasthi Supun Wickramasinghe < pulasthi...@gmail.com> wrote: > Hi Yanbo, > > Thanks for the reply. I thought i

Re: How to increase the Json parsing speed

2015-08-28 Thread Sabarish Sasidharan
How many executors are you using when using Spark SQL? On Fri, Aug 28, 2015 at 12:12 PM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: I see that you are not reusing the same mapper instance in the Scala snippet. Regards Sab On Fri, Aug 28, 2015 at 9:38 AM, Gavin Yue

Re: Difference btw MEMORY_ONLY and MEMORY_AND_DISK

2015-08-18 Thread Sabarish Sasidharan
MEMORY_ONLY will simply not cache (and will recompute) partitions that don't fit in memory, but MEMORY_AND_DISK will spill them to disk. Regards Sab On Tue, Aug 18, 2015 at 12:45 PM, Harsha HN 99harsha.h@gmail.com wrote: Hello Sparkers, I would like to understand difference btw these Storage levels for a RDD portion that doesn't
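
A minimal sketch (`sc` from spark-shell, path hypothetical) showing how to pick the level; with MEMORY_AND_DISK, partitions that don't fit in memory are written to local disk instead of being dropped.

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///data/large")
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()   // materializes the cache; the Storage tab of the UI shows the memory/disk split
```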

Re: Regarding rdd.collect()

2015-08-18 Thread Sabarish Sasidharan
It is still in memory for future RDD transformations and actions. What you get in the driver is a copy of the data. Regards Sab On Tue, Aug 18, 2015 at 12:02 PM, praveen S mylogi...@gmail.com wrote: When I do an rdd.collect(), the data moves back to the driver. Or is it still held in memory across the

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS : Same error after the re-partition

2015-06-27 Thread Sabarish Sasidharan
On Jun 27, 2015, at 8:50 PM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: Are you running on top of YARN? Plus pls provide your infrastructure details. Regards Sab On 28-Jun-2015 8:47 am, Ayman Farahat ayman.fara...@yahoo.com.invalid wrote: Hello; I tried to adjust the number

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS : Same error after the re-partition

2015-06-27 Thread Sabarish Sasidharan
Are you running on top of YARN? Plus pls provide your infrastructure details. Regards Sab On 28-Jun-2015 8:47 am, Ayman Farahat ayman.fara...@yahoo.com.invalid wrote: Hello; I tried to adjust the number of blocks by repartitioning the input. Here is How I do it; (I am partitioning by users )

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS : Same error after the re-partition

2015-06-27 Thread Sabarish Sasidharan
Are you running on top of YARN? Plus pls provide your infrastructure details. Regards Sab On 28-Jun-2015 9:20 am, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: Are you running on top of YARN? Plus pls provide your infrastructure details. Regards Sab On 28-Jun-2015 8:47 am

Re: Parquet problems

2015-06-25 Thread Sabarish Sasidharan
Did you try increasing the perm gen for the driver? Regards Sab On 24-Jun-2015 4:40 pm, Anders Arpteg arp...@spotify.com wrote: When reading large (and many) datasets with the Spark 1.4.0 DataFrames parquet reader (the org.apache.spark.sql.parquet format), the following exceptions are thrown:

Re: Help optimising Spark SQL query

2015-06-23 Thread Sabarish Sasidharan
64GB in parquet could be many billions of rows because of the columnar compression. And count distinct by itself is an expensive operation. This is not just on Spark, even on Presto/Impala, you would see performance dip with count distincts. And the cluster is not that powerful either. The one

Re: Matrix Multiplication and mllib.recommendation

2015-06-17 Thread Sabarish Sasidharan
Nick is right. I too have implemented this way and it works just fine. In my case, there can be even more products. You simply broadcast blocks of products to userFeatures.mapPartitions() and BLAS multiply in there to get recommendations. In my case 10K products form one block. Note that you would
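
A self-contained sketch of that pattern with tiny made-up factors, using a plain dot product in place of the BLAS call mentioned above: a block of product factors is broadcast and every user in a partition is scored against it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BlockRecommendSketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("block-recommend").setMaster("local[*]"))
    // Hypothetical latent factors: (id, factor vector)
    val userFeatures = sc.parallelize(Seq((1, Array(0.1, 0.2)), (2, Array(0.3, 0.4))))
    val productBlock = sc.broadcast(Array((101, Array(0.5, 0.6)), (102, Array(0.7, 0.8))))

    val topN = 1
    val recs = userFeatures.mapPartitions { users =>
      users.map { case (userId, uf) =>
        // Score this user against every product in the broadcast block, keep the top N.
        val scored = productBlock.value.map { case (pid, pf) => (pid, dot(uf, pf)) }
        (userId, scored.sortBy(-_._2).take(topN).toSeq)
      }
    }
    recs.collect().foreach(println)
    sc.stop()
  }
}
```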

Re: Spark or Storm

2015-06-17 Thread Sabarish Sasidharan
Whatever you write in bolts would be the logic you want to apply on your events. In Spark, that logic would be coded in map() or similar such transformations and/or actions. Spark doesn't enforce a structure for capturing your processing logic like Storm does. Regards Sab Probably overloading the

Re: ArrayIndexOutOfBoundsException in ALS.trainImplicit

2015-03-22 Thread Sabarish Sasidharan
My bad. This was an outofmemory disguised as something else. Regards Sab On Sun, Mar 22, 2015 at 1:53 AM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: I am consistently running into this ArrayIndexOutOfBoundsException issue when using trainImplicit. I have tried changing

Re: mapPartitions - How Does it Works

2015-03-18 Thread Sabarish Sasidharan
Unlike a map() wherein your task is acting on a row at a time, with mapPartitions(), the task is passed the entire content of the partition in an iterator. You can then return back another iterator as the output. I don't do scala, but from what I understand from your code snippet... The iterator
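
A minimal spark-shell sketch of that iterator-in, iterator-out contract (`sc` assumed): the function sees the whole partition once and lazily returns a new iterator.

```scala
val nums = sc.parallelize(1 to 10, 2)
val result = nums.mapPartitions { iter =>
  // per-partition setup (e.g. opening a connection) would go here, once per partition
  iter.map(_ * 2)   // transform elements lazily, without materializing the partition
}
result.collect().foreach(println)
```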

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Sabarish Sasidharan
…clusters of similar rows with euclidean distance. Best, Reza On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote: Hi Reza, I see that ((int, int), double) pairs are generated for any combination that meets the criteria controlled…

Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
I am trying to compute column similarities on a 30x1000 RowMatrix of DenseVectors. The size of the input RDD is 3.1MB and its all in one partition. I am running on a single node of 15G and giving the driver 1G and the executor 9G. This is on a single node hadoop. In the first attempt the

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
Sorry, I actually meant 30 x 1 matrix (missed a 0) Regards Sab

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
Hi Reza, I see that ((int, int), double) pairs are generated for any combination that meets the criteria controlled by the threshold. But assuming a simple 1x10K matrix, that means I would need at least 12GB of memory per executor for the flat map, just for these pairs, excluding any other overhead.