Re: Accessing AWS S3 in Frankfurt (v4 only - AWS4-HMAC-SHA256)

2015-03-20 Thread Gourav Sengupta
Hi Ralf, using secret keys and authorization details is a strict NO for AWS, they are major security lapses and should be avoided at any cost. Have you tried starting the clusters using ROLES, they are wonderful way to start clusters or EC2 nodes and you do not have to copy and paste any

Re: JAVA for SPARK certification

2015-05-05 Thread Gourav Sengupta
Is it that, since Spark is written in Scala, having done it in Scala will be OK for certification? If someone who has done the certification can confirm. Thanks, Kartik On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote: Hi, how important is JAVA for Spark certification? Will learning only

JAVA for SPARK certification

2015-05-05 Thread Gourav Sengupta
Hi, how important is JAVA for Spark certification? Will learning only Python and Scala not work? Regards, Gourav

Re: blogs/articles/videos on how to analyse spark performance

2015-08-19 Thread Gourav Sengupta
Excellent resource: http://www.oreilly.com/pub/e/3330 And more amazing is the fact that the presenter actually responds to your questions. Regards, Gourav Sengupta On Wed, Aug 19, 2015 at 4:12 PM, Todd bit1...@163.com wrote: Hi, I would ask if there are some blogs/articles/videos on how

Re: Is there any tool that i can prove to customer that spark is faster then hive ?

2015-08-12 Thread Gourav Sengupta
Regards, Gourav Sengupta On Wed, Aug 12, 2015 at 1:01 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Perhaps you could time the end-to-end runtime for each pipeline, and each stage? Though I'd be fairly confident that Spark will outperform hive/mahout on MR, that's not the only

Re: Java 8 vs Scala

2015-07-15 Thread Gourav Sengupta
Why would you create a class and then instantiate it to store data and change the class every time you have to add a new element? In OOPS terminology a class represents an object, and an object has states - does it not? Purely from a data warehousing perspective - one of the fundamental

Re: Exception when S3 path contains colons

2015-08-25 Thread Gourav Sengupta
I am not quite sure about this but should the notation not be s3n://redactedbucketname/* instead of s3a://redactedbucketname/* The best way is to use s3://bucketname/path/* Regards, Gourav On Tue, Aug 25, 2015 at 10:35 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You can change the names,
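The scheme confusion above (s3:// vs s3n:// vs s3a://) comes up repeatedly in these threads. As a self-contained illustration only (the helper below is my own, not a Spark or Hadoop API), switching a bucket URI between the three Hadoop S3 schemes is a one-line rewrite:

```python
from urllib.parse import urlparse, urlunparse

# The three Hadoop S3 filesystem schemes debated in the thread.
S3_SCHEMES = {"s3", "s3n", "s3a"}

def with_s3_scheme(uri: str, scheme: str) -> str:
    """Rewrite an S3 URI to use a different Hadoop S3 scheme (s3/s3n/s3a)."""
    parsed = urlparse(uri)
    if parsed.scheme not in S3_SCHEMES or scheme not in S3_SCHEMES:
        raise ValueError(f"not an S3 scheme: {parsed.scheme!r} -> {scheme!r}")
    return urlunparse(parsed._replace(scheme=scheme))
```

Which scheme actually works depends on the Hadoop/Spark build and the jars on the classpath, which is exactly what the thread is arguing about.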

Re: Directly reading data from S3 to EC2 with PySpark

2015-09-15 Thread Gourav Sengupta
Hi, If you start your EC2 nodes with correct roles (default in most cases depending on your needs) you should be able to work on S3 and all other AWS resources without giving any keys. I have been doing that for some time now and I have not faced any issues yet. Regards, Gourav On Tue, Sep

Re: AWS_CREDENTIAL_FILE

2015-09-22 Thread Gourav Sengupta
Hi, I think that it is a very bad practice to use your keys in nodes. Please start EC2 nodes/ EMR Clusters with proper roles and you do not have to worry about any keys at all. Kindly refer to AWS documentation for further details. Regards, Gourav On Mon, Sep 21, 2015 at 4:34 PM, Michel Lemay
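Gourav's advice relies on the S3A connector picking up the instance role automatically. As a sketch only (the property and class names come from hadoop-aws and the AWS SDK; whether they are available depends on your Hadoop version, so verify before relying on them), the credentials provider can be pinned so that no keys ever appear in configuration:

```properties
# conf/spark-defaults.conf -- make s3a:// use the EC2 instance profile
# instead of embedded keys (hadoop-aws property; check your version's docs).
spark.hadoop.fs.s3a.aws.credentials.provider  com.amazonaws.auth.InstanceProfileCredentialsProvider
```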

Re: newbie how to upgrade a spark-ec2 cluster?

2015-12-02 Thread Gourav Sengupta
Hi, And so you have the money to keep a SPARK cluster up and running? The way I make it work is test the code in local system with a localised spark installation and then create data pipeline triggered by lambda which starts SPARK cluster and processes the data via SPARK steps and then terminates

HiveContext Self join not reading from cache

2015-12-16 Thread Gourav Sengupta
Hi, This is how the data can be created: 1. TableA : cached() 2. TableB : cached() 3. TableC: TableA inner join TableB cached() 4. TableC join TableC does not take the data from cache but starts reading the data for TableA and TableB from disk. Does this sound like a bug? The self join between

Re: HiveContext Self join not reading from cache

2015-12-18 Thread Gourav Sengupta
hi, I think that people have reported the same issue elsewhere, and this should be registered as a bug in SPARK https://forums.databricks.com/questions/2142/self-join-in-spark-sql.html Regards, Gourav On Thu, Dec 17, 2015 at 10:52 AM, Gourav Sengupta <gourav.sengu...@gmail.com > wrote:

Re: HiveContext Self join not reading from cache

2015-12-18 Thread Gourav Sengupta
in SPARK > > https://forums.databricks.com/questions/2142/self-join-in-spark-sql.html > > > Regards, > Gourav > > On Thu, Dec 17, 2015 at 10:52 AM, Gourav Sengupta < > gourav.sengu...@gmail.com> wrote: > >> Hi Ted, >> >> The self join works fi

Re: HiveContext Self join not reading from cache

2015-12-18 Thread Gourav Sengupta
Hi, the attached DAG shows that for the same table (self join) SPARK is unnecessarily getting data from S3 for one side of the join, whereas it's able to use the cache for the other side. Regards, Gourav On Fri, Dec 18, 2015 at 10:29 AM, Gourav Sengupta <gourav.sengu...@gmail.com > wrote:

hiveContext: storing lookup of partitions

2015-12-15 Thread Gourav Sengupta
Hi, I have a HIVE table with few thousand partitions (based on date and time). It takes a long time to run if for the first time and then subsequently it is fast. Is there a way to store the cache of partition lookups so that every time I start a new SPARK instance (cannot keep my personal

Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Gourav Sengupta
Hi Jeff, sadly that does not resolve the issue. I am sure that the memory mapping to physical files locations can be saved and recovered in SPARK. Regards, Gourav Sengupta On Wed, Dec 16, 2015 at 12:13 PM, Jeff Zhang <zjf...@gmail.com> wrote: > oh, you are using S3. As I remember

Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Gourav Sengupta
I guess you mean the stage of getting the split info. I suspect it might > be your cluster issue (or metadata store); usually it won't take such a > long time for splitting. > > On Wed, Dec 16, 2015 at 8:06 AM, Gourav Sengupta < > gourav.sengu...@gmail.com> wrote: > >> H

Re: Stuck with DataFrame df.select("select * from table");

2015-12-27 Thread Gourav Sengupta
registering it as a table? I think we should be using hivecontext or sqlcontext to run queries on a registered table. Regards, Gourav Sengupta On Sat, Dec 26, 2015 at 6:27 PM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote: > Chris, thanks. That'd be great to try =) > > -- > Be w

Re: HiveContext Self join not reading from cache

2015-12-17 Thread Gourav Sengupta
4,c#90], Some(d) >+- Sort [c#253 ASC], false, 0 > +- TungstenExchange hashpartitioning(c#253,200), None > +- InMemoryColumnarTableScan [c#253], InMemoryRelation > [b#246,c#253], true, 1, StorageLevel(true, true, false, true, 1), > Project [b#4,c#90], Some(d) >

Re: ImportError: No module named numpy

2016-06-04 Thread Gourav Sengupta
by including the following: PYSPARK_PYTHON=<>/anaconda2/bin/python2.7 PATH=$PATH:<>/anaconda/bin <>/pyspark :) In case you are using it in EMR the solution is a bit tricky. Just let me know in case you want any further help. Regards, Gourav Sengupta On Thu, Jun 2, 2016 at 7:59 PM
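The environment setup Gourav outlines can be sketched as follows (the paths are placeholders standing in for an actual Anaconda install, not values from the thread):

```shell
# Make PySpark workers use the Anaconda interpreter that has numpy.
# Paths below are placeholders -- substitute your Anaconda location.
export PYSPARK_PYTHON=/opt/anaconda2/bin/python2.7
export PATH="$PATH:/opt/anaconda2/bin"
pyspark   # workers now resolve numpy from the Anaconda environment
```

As he notes, on EMR the equivalent setup is trickier because the variables have to reach every node, typically via cluster configuration rather than a shell profile.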

HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gourav Sengupta
on (A.PK = B.FK) where B.FK is not null; This query takes 4.5 mins in SPARK Regards, Gourav Sengupta

Re: HIVE Query 25x faster than SPARK Query

2016-06-10 Thread Gourav Sengupta
will surely be excited to see if I am going wrong here and post the results of sql.describe(). Thanks a ton once again. Hi Ted, Is there anyway you can throw some light on this before I post this in a blog? Regards, Gourav Sengupta On Fri, Jun 10, 2016 at 7:22 PM, Gavin Yue <yue.yu

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gourav Sengupta
ava...@gmail.com> wrote: > ooc are the tables partitioned on a.pk and b.fk? Hive might be using > copartitioning in that case: it is one of hive's strengths. > > 2016-06-09 7:28 GMT-07:00 Gourav Sengupta <gourav.sengu...@gmail.com>: > >> Hi Mich, >> >

Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-09 Thread Gourav Sengupta
Hi, are you using EC2 instances or local cluster behind firewall. Regards, Gourav Sengupta On Wed, Jun 8, 2016 at 4:34 PM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > > I'm trying to create a table on s3a but I keep hitting the following error: > > Exce

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gourav Sengupta
t; > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.c

Re: HIVE Query 25x faster than SPARK Query

2016-06-10 Thread Gourav Sengupta
gt; Could you print out the sql execution plan? My guess is about broadcast > join. > > > > On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> > wrote: > > Hi, > > Query1 is almost 25x faster in HIVE than in SPARK. What is happening here >

Re: Accessing s3a files from Spark

2016-06-01 Thread Gourav Sengupta
Regards, Gourav Sengupta On Tue, May 31, 2016 at 12:22 PM, Mayuresh Kunjir <mayur...@cs.duke.edu> wrote: > How do I use it? I'm accessing s3a from Spark's textFile API. > > On Tue, May 31, 2016 at 7:16 AM, Deepak Sharma <deepakmc...@gmail.com> > wrote: > >> Hi Mayuresh >

Re: Accessing s3a files from Spark

2016-05-31 Thread Gourav Sengupta
Hi, And on another note, is it required to use s3a? Why not use s3:// only? I prefer to use s3a:// only while writing files to S3 from EMR. Regards, Gourav Sengupta On Tue, May 31, 2016 at 12:04 PM, Gourav Sengupta <gourav.sengu...@gmail.com > wrote: > Hi, > > Is your spark

Re: Accessing s3a files from Spark

2016-05-31 Thread Gourav Sengupta
Hi, Is your spark cluster running in EMR or via self created SPARK cluster using EC2 or from a local cluster behind firewall? What is the SPARK version you are using? Regards, Gourav Sengupta On Sun, May 29, 2016 at 10:55 PM, Mayuresh Kunjir <mayur...@cs.duke.edu> wrote: > I'

Re: Secondary Indexing?

2016-05-30 Thread Gourav Sengupta
Hi, have you tried using partitioning and parquet format? It works super fast in SPARK. Regards, Gourav On Mon, May 30, 2016 at 5:08 PM, Michael Segel wrote: > I'm not sure where to post this since it's a bit of a philosophical > question in terms of design and

Re: FullOuterJoin on Spark

2016-06-22 Thread Gourav Sengupta
+1 for the guidance from Nirvan. Also it would be better to repartition and store the data in parquet format in case you are planning to do the joins more than once or with other data sources. Parquet with SPARK works like a charm. Over S3 I have seen its performance being quite close to cached

Re: HIVE Query 25x faster than SPARK Query

2016-06-16 Thread Gourav Sengupta
use case. Spark in local mode will be way faster compared to SPARK running on HADOOP. I have a system with 64 GB RAM and SSD and its performance on local cluster SPARK is way better. Did your join include the same number of columns and rows for the dimension table? Regards, Gourav Sengupta

Re: Spark UI shows finished when job had an error

2016-06-17 Thread Gourav Sengupta
Hi, Can you please see the query plan (in case you are using a query)? There is a very high chance that the query was broken into multiple steps and only a subsequent step failed. Regards, Gourav Sengupta On Fri, Jun 17, 2016 at 2:49 PM, Sumona Routh <sumos...@gmail.com> wrote:

storing query object

2016-01-19 Thread Gourav Sengupta
Regards, Gourav Sengupta

Re: HIVE Query 25x faster than SPARK Query

2016-06-15 Thread Gourav Sengupta
sec for 1 gb of data whereas in Spark, it is taking 4 mins > of time. > On 6/9/2016 3:19 PM, Gavin Yue wrote: > > Could you print out the sql execution plan? My guess is about broadcast > join. > > > > On Jun 9, 2016, at 07:14, Gourav Sengupta < <gourav.sengu...

Using SPARK packages in Spark Cluster

2016-02-12 Thread Gourav Sengupta
when starting a cluster as mentioned above. Regards, Gourav Sengupta

Re: Is there a way to save csv file fast ?

2016-02-10 Thread Gourav Sengupta
Hi, The writes, in terms of number of records written simultaneously, can be increased if you increased the number of partitions. You can try to increase the number of partitions and check out how it works. There is though an upper cap (the one that I faced in Ubuntu) on the number of parallel
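The point above, that more partitions means more simultaneous writers up to an OS-level cap, can be illustrated without Spark. This sketch is my own stand-in (the function name and layout are illustrative, not Spark's API): it splits rows into partitions and writes one part-file per partition concurrently, the way Spark's csv writer emits part-00000, part-00001, and so on:

```python
import csv
import os
from concurrent.futures import ThreadPoolExecutor

def write_partitioned_csv(rows, num_partitions, out_dir):
    """Split rows into num_partitions round-robin chunks and write each
    chunk as its own part-file concurrently (mimics Spark emitting one
    part-NNNNN file per partition of the DataFrame)."""
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]

    def write_part(idx_chunk):
        idx, chunk = idx_chunk
        path = os.path.join(out_dir, f"part-{idx:05d}.csv")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(chunk)
        return path

    # One writer per partition runs at the same time, up to whatever
    # limit the OS places on open files / threads.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(write_part, enumerate(partitions)))
```

Raising `num_partitions` raises the number of concurrent writers, which is exactly the knob `DataFrame.repartition(n)` turns before a Spark write.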

Re: Using SPARK packages in Spark Cluster

2016-02-13 Thread Gourav Sengupta
> constantly going to use them. > > Best, > Burak > > > > On Fri, Feb 12, 2016 at 4:22 AM, Gourav Sengupta < > gourav.sengu...@gmail.com> wrote: > >> Hi, >> >> I am creating sparkcontext in a SPARK standalone cluster as mentioned >> here: h

Re: Accessing Web UI

2016-02-23 Thread Gourav Sengupta
> On 22 Feb 2016, at 11:00, Kayode Odeyemi <drey...@gmail.com> wrote: >>>> >>>> Try http://localhost:4040 >>>> >>>> On Mon, Feb 22, 2016 at 8:23 AM, Vasanth Bhat <vasb...@gmail.com> >>>> wrote: >>>> >>>

Re: pandas dataframe to spark csv

2016-02-23 Thread Gourav Sengupta
Hi, The solution is here: https://github.com/databricks/spark-csv Using the above solution you can read CSV directly into a dataframe as well. Regards, Gourav On Tue, Feb 23, 2016 at 12:03 PM, Devesh Raj Singh wrote: > Hi, > > I have imported spark csv dataframe in

Re: Spark standalone peer2peer network

2016-02-23 Thread Gourav Sengupta
path that you mention exists or is available only in one system. Regards, Gourav Sengupta On Tue, Feb 23, 2016 at 8:39 PM, Robineast <robin.e...@xense.co.uk> wrote: > Hi Thomas > > I can confirm that I have had this working in the past. I'm pretty sure you > don't need p

Re: s3 access through proxy

2016-02-26 Thread Gourav Sengupta
the files in a s3://bucket/ or s3://bucket/key/ to your local system. And then you can point your spark cluster to the local data store and run the queries. Of course that depends on the data volume as well. Regards, Gourav Sengupta On Fri, Feb 26, 2016 at 7:29 PM, Joshua Buss <joshua.b...@gma

Re: storing query object

2016-01-22 Thread Gourav Sengupta
: > https://issues.apache.org/jira/browse/SPARK-8125 > > You can also look at parent issue. > > Which Spark release are you using ? > > > On Jan 22, 2016, at 1:08 AM, Gourav Sengupta <gourav.sengu...@gmail.com> > wrote: > > > > > > Hi, > > > >

Fwd: storing query object

2016-01-22 Thread Gourav Sengupta
Regards, Gourav Sengupta

Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-26 Thread Gourav Sengupta
Hi, Are you creating RDD's using textfile option? Can you please let me know the following: 1. Number of partitions 2. Number of files 3. Time taken to create the RDD's Regards, Gourav Sengupta On Tue, Jan 26, 2016 at 1:12 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-26 Thread Gourav Sengupta
Hi, are you creating RDD's out of the data? Regards, Gourav On Tue, Jan 26, 2016 at 12:45 PM, aecc wrote: > Sorry, I have not been able to solve the issue. I used speculation mode as > workaround to this. > > > > -- > View this message in context: >

Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-27 Thread Gourav Sengupta
from S3, but right now I upgraded to >>> spark 1.5.2 and seems like reading from S3 works fine (first succeeded task >>> in the screenshot attached, which takes 42 s). >>> >>> But then it gets stuck. The screenshot attached shows 24 running tasks >>

Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Gourav Sengupta
Hi, So far no one is able to get my question at all. I know what it takes to load packages via SPARK shell or SPARK submit. How do I load packages when starting a SPARK cluster, as mentioned here http://spark.apache.org/docs/latest/spark-standalone.html ? Regards, Gourav Sengupta On Mon

Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Gourav Sengupta
Hi, How to we include the following package: https://github.com/databricks/spark-csv while starting a SPARK standalone cluster as mentioned here: http://spark.apache.org/docs/latest/spark-standalone.html Thanks and Regards, Gourav Sengupta On Mon, Feb 15, 2016 at 10:32 AM, Ramanathan R

Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Gourav Sengupta
wrote: > $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.3.0 > > > > It will download everything for you and register into your JVM. If you > want to use it in your Prod just package it with maven. > > On 15/02/2016, at 12:14, Gourav Sengupta <g
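Besides passing --packages on every invocation as suggested above, a standalone setup can pin the package in spark-defaults.conf. The property name below is a real Spark configuration key, but support for it varies by Spark version, so treat this as a sketch to verify rather than a guaranteed answer to the thread's question:

```properties
# conf/spark-defaults.conf -- resolve spark-csv from Maven when any
# SparkContext starts (spark.jars.packages; confirm for your version).
spark.jars.packages  com.databricks:spark-csv_2.10:1.3.0
```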

Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Gourav Sengupta
cluster in local mode kindly do not attempt in answering this question. My question is how to use packages like https://github.com/databricks/spark-csv when I am using SPARK cluster in local mode. Regards, Gourav Sengupta On Mon, Feb 15, 201

Re: Stored proc with spark

2016-02-16 Thread Gourav Sengupta
Hi Gaurav, do you mean stored proc that returns a table? Regards, Gourav On Tue, Feb 16, 2016 at 9:04 AM, Gaurav Agarwal wrote: > Hi > Can I load the data into spark from oracle storedproc > > Thanks >

Re: Scala from Jupyter

2016-02-16 Thread Gourav Sengupta
Apache Zeppelin will be the right solution with in built plugins for python and visualizations as well. Are you planning to use this in EMR? Regards, Gourav On Tue, Feb 16, 2016 at 12:04 PM, Rajeev Reddy wrote: > Hello, > > Let me understand your query correctly. >

Re: Scala from Jupyter

2016-02-16 Thread Gourav Sengupta
take a look here as well http://zeppelin-project.org/ it executes Scala and Python and Markup documents in the same notebook and draws beautiful visualisations as well. It comes built into AWS EMR as well. Regards, Gourav On Tue, Feb 16, 2016 at 12:43 PM, Aleksandr Modestov <

Re: Reading CSV file using pyspark

2016-02-18 Thread Gourav Sengupta
as there are some write issues which 2.11 resolves. Hopefully you are using the latest release of SPARK. $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0 Regards, Gourav Sengupta On Thu, Feb 18, 2016 at 11:05 AM, Teng Qiu <teng...@gmail.com> wrote: > downloa

Re: Is this likely to cause any problems?

2016-02-18 Thread Gourav Sengupta
Hi, Just out of sheer curiosity, why are you not using EMR to start your SPARK cluster? Regards, Gourav On Thu, Feb 18, 2016 at 12:23 PM, Ted Yu wrote: > Have you seen this ? > > HADOOP-10988 > > Cheers > > On Thu, Feb 18, 2016 at 3:39 AM, James Hammerton

Re: Is this likely to cause any problems?

2016-02-18 Thread Gourav Sengupta
interesting. And I am almost sure that none of the EMR hosted services of HADOOP, SPARK, Zeppelin, etc are exposed to the external IP addresses even if you are using the classical setting. Regards, Gourav Sengupta On Thu, Feb 18, 2016 at 2:25 PM, Teng Qiu <teng...@gmail.com> wrote: > EMR

Re: Is this likely to cause any problems?

2016-02-18 Thread Gourav Sengupta
Regards, Gourav Sengupta On Thu, Feb 18, 2016 at 2:30 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Please see the last 3 posts on this thread: > > http://search-hadoop.com/m/q3RTtTorTf2o3UGK1=Re+spark+ec2+vs+EMR > > FYI > > On Thu, Feb 18, 2016 at 6:25 AM, Teng Qiu <teng

Re: Why no computations run on workers/slaves in cluster mode?

2016-02-18 Thread Gourav Sengupta
one system and not the other, then the workers will only run from that system. Regards, Gourav Sengupta On Wed, Feb 17, 2016 at 4:20 PM, Junjie Qian <qian.jun...@outlook.com> wrote: > Hi all, > > I am new to Spark, and have one problem that, no computations run on > workers/slave_servers in

Re: Accessing Web UI

2016-02-19 Thread Gourav Sengupta
can you please try localhost:8080? Regards, Gourav Sengupta On Fri, Feb 19, 2016 at 11:18 AM, vasbhat <vasb...@gmail.com> wrote: > Hi, > >I have installed the spark1.6 and trying to start the master > (start-master.sh) and access the webUI. > > I get the f

Re: Spark Job Hanging on Join

2016-02-21 Thread Gourav Sengupta
know. From what I reckon joins like yours should not take more than a few milliseconds. Regards, Gourav Sengupta On Fri, Feb 19, 2016 at 5:31 PM, Tamara Mendt <t...@hellofresh.com> wrote: > Hi all, > > I am running a Spark job that gets stuck attempting to join two > datafram

Re: Spark Job Hanging on Join

2016-02-21 Thread Gourav Sengupta
Sorry, please include the following questions to the list above: the SPARK version? whether you are using RDD or DataFrames? is the code run locally or in SPARK Cluster mode or in AWS EMR? Regards, Gourav Sengupta On Sun, Feb 21, 2016 at 7:37 PM, Gourav Sengupta <gourav.sengu...@gmail.

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-01 Thread Gourav Sengupta
Regards, Gourav Sengupta On Tue, Mar 1, 2016 at 9:15 AM, Oleg Ruchovets <oruchov...@gmail.com> wrote: > Hi, I installed EMR 4.3.0 with spark. I tried to enter the spark shell but > it looks like it doesn't work and throws exceptions. > Please advise: > > [hadoop@ip-172-31-39-3

Re: Spark UI standalone "crashes" after an application finishes

2016-03-01 Thread Gourav Sengupta
or Scala (see Apache Toree) or use Zeppelin. Regards, Gourav Sengupta On Mon, Feb 29, 2016 at 11:48 PM, Sumona Routh <sumos...@gmail.com> wrote: > Hi there, > I've been doing some performance tuning of our Spark application, which is > using Spark 1.2.1 standalone. I have been

Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
security. Regards, Gourav Sengupta On Tue, Mar 15, 2016 at 2:38 PM, Sabarish Sasidharan <sabarish@gmail.com > wrote: > You have a slash before the bucket name. It should be @. > > Regards > Sab > On 15-Mar-2016 4:03 pm, "Yasemin Kaya" <godo...@gmail.com> w

Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
Hi, Try starting your clusters with roles, and you will not have to configure, hard code anything at all. Let me know in case you need any help with this. Regards, Gourav Sengupta On Tue, Mar 15, 2016 at 11:32 AM, Yasemin Kaya <godo...@gmail.com> wrote: > Hi Safak, > > I c

Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
Understand what you > suggested is an appropriate way of doing it, which I myself have proposed > before, but that doesn't solve the OP's problem at hand. > > Regards > Sab > On 15-Mar-2016 8:27 pm, "Gourav Sengupta" <gourav.sengu...@gmail.com> > wrote:

Re: S3n performance (@AaronDavidson)

2016-04-13 Thread Gourav Sengupta
Hi, I have stopped working on s3n for a long time now. In case you are working with parquet and writing files s3a is the only alternative to failures. Otherwise why not use just s3://? Regards, Gourav On Wed, Apr 13, 2016 at 12:17 PM, Steve Loughran wrote: > > On 12

Re: Sqoop on Spark

2016-04-08 Thread Gourav Sengupta
modules in SPARK Local Server mode, please let me know. Regards, Gourav Sengupta On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote: > Good to know that. > > That is why Sqoop has this "direct" mode, to utilize the vendor specific > feature. >

Re: Weird error while serialization

2016-04-09 Thread Gourav Sengupta
Hi, why are you not using data frames and SPARK CSV? Regards, Gourav On Sat, Apr 9, 2016 at 10:00 PM, SURAJ SHETH wrote: > Hi, > I am using Spark 1.5.2 > > The file contains 900K rows each with twelve fields (tab separated): > The first 11 fields are Strings with a maximum

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Gourav Sengupta
why not use AWS Lambda? Regards, Gourav On Fri, Apr 8, 2016 at 8:14 PM, Benjamin Kim wrote: > Has anyone monitored an S3 bucket or directory using Spark Streaming and > pulled any new files to process? If so, can you provide basic Scala coding > help on this? > > Thanks, >

Re: PYSPARK_PYTHON doesn't work in spark worker

2016-03-07 Thread Gourav Sengupta
hi, how are you running your SPARK cluster (is it in local mode or distributed mode). Do you have pyspark installed in anaconda? Regards, Gourav Sengupta On Mon, Mar 7, 2016 at 9:28 AM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Hi all > I had following c

Re: Sorting the dataframe

2016-03-04 Thread Gourav Sengupta
optimization. Regards, Gourav Sengupta On Fri, Mar 4, 2016 at 8:35 AM, Mohammad Tariq <donta...@gmail.com> wrote: > You could try DataFrame.sort() to sort your data based on a column. > > > > [image: http://] > > Tariq, Mohammad > about.me/mti > [image

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-06 Thread Gourav Sengupta
messages all over the place for another 20 mins after which we killed jupyter application. Regards, Gourav Sengupta On Sun, Mar 6, 2016 at 11:48 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Gourav: > For the 3rd paragraph, did you mean the job seemed to be idle for about 5 > minut

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-06 Thread Gourav Sengupta
which is not running for lower than 5 minutes. Regards, Gourav Sengupta On Fri, Mar 4, 2016 at 8:59 PM, Jelez Raditchkov <je...@hotmail.com> wrote: > Working on a streaming job with DirectParquetOutputCommitter to S3 > I need to use PartitionBy and hence SaveMode.Append >

Re: Spark SQL drops the HIVE table in "overwrite" mode while writing into table

2016-03-06 Thread Gourav Sengupta
hi, is the table that you are trying to overwrite an external table or temporary table created in hivecontext? Regards, Gourav Sengupta On Sat, Mar 5, 2016 at 3:01 PM, Dhaval Modi <dhavalmod...@gmail.com> wrote: > Hi Team, > > I am facing a issue while writing dataframe bac

Re: How can I pass a Data Frame from object to another class

2016-03-06 Thread Gourav Sengupta
Regards, Gourav Sengupta On Sun, Mar 6, 2016 at 10:57 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > Thanks for this tip > > The way I do it is to pass SparkContext "sc" to the method > firstquery.firstquerym by calling the following > >

Re: Is Spark right for us?

2016-03-06 Thread Gourav Sengupta
Hi, That depends on a lot of things, but as a starting point I would ask whether you are planning to store your data in JSON format? Regards, Gourav Sengupta On Sun, Mar 6, 2016 at 5:17 PM, Laumegui Deaulobi < guillaume.bilod...@gmail.com> wrote: > Our problem space is survey analyti

Re: Is Spark right for us?

2016-03-06 Thread Gourav Sengupta
data projects (like any other BI projects) do not deliver value or turn extremely expensive to maintain because the approach is that tools solve the problem. Regards, Gourav Sengupta On Sun, Mar 6, 2016 at 5:25 PM, Guillaume Bilodeau < guillaume.bilod...@gmail.com> wrote: > The data is

Re: Is Spark right for us?

2016-03-06 Thread Gourav Sengupta
Hi, once again that is all about tooling. Regards, Gourav Sengupta On Sun, Mar 6, 2016 at 7:52 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > Hi, > > > > What is the current size of your relational database? > > > > Are we talking about

Re: SPARK SQL HiveContext Error

2016-03-01 Thread Gourav Sengupta
> > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 1 March 2016

Re: Spark UI standalone "crashes" after an application finishes

2016-03-01 Thread Gourav Sengupta
port 4040 will no longer be available > after your spark app finishes. you should go to spark master's UI > (port 8080), and take a look at "completed applications"... > > refer to doc: http://spark.apache.org/docs/latest/monitoring.html > read the first "note that" :) >

Re: Spark sql query taking long time

2016-03-03 Thread Gourav Sengupta
Hi, why not read the table into a dataframe directly using the SPARK CSV package? You are trying to solve the problem the roundabout way. Regards, Gourav Sengupta On Thu, Mar 3, 2016 at 12:33 PM, Sumedh Wale <sw...@snappydata.io> wrote: > On Thursday 03 March 2016 11:03 AM, Angel An

Re: Using Spark SQL / Hive on AWS EMR

2016-03-03 Thread Gourav Sengupta
Hi, Why are you trying to load data into HIVE and then access it via hiveContext? (by the way hiveContext tables are not visible in the sqlContext). Please read the data directly into a SPARK dataframe and then register it as a temp table to run queries on it. Regards, Gourav On Thu, Mar 3,

Re: Spark sql query taking long time

2016-03-03 Thread Gourav Sengupta
Hi, using dataframes you can use SQL, and SQL has an option of JOIN, BETWEEN, IN and LIKE OPERATIONS. Why would someone use a dataframe and then use them as RDD's? :) Regards, Gourav Sengupta On Thu, Mar 3, 2016 at 4:28 PM, Sumedh Wale <sw...@snappydata.io> wrote: > On Thursday 03 M

Re: SPARK SQL HiveContext Error

2016-03-01 Thread Gourav Sengupta
in the library folder than the ones which are usually supplied with the SPARK distribution: 1. ojdbc7.jar 2. spark-csv***jar file Regards, Gourav Sengupta On Tue, Mar 1, 2016 at 5:19 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: > Hi, > > I am getting the error "*java.la

Fwd: Starting SPARK application in cluster mode from an IDE

2016-03-01 Thread Gourav Sengupta
Hi, I will be grateful if someone could kindly respond back to this query. Thanks and Regards, Gourav Sengupta -- Forwarded message -- From: Gourav Sengupta <gourav.sengu...@gmail.com> Date: Sat, Feb 27, 2016 at 12:39 AM Subject: Starting SPARK application in cluster mod

SPARK SQL HiveContext Error

2016-03-01 Thread Gourav Sengupta
Francisco", 12, 44.52, true), Row("Palo Alto", 12, 22.33, false), Row("Munich", 8, 3.14, true))) val hiveContext = new HiveContext(sc) //val sqlContext = new org.apache.spark.sql.SQLContext(sc) } } - Regards, Gourav Sengupta

Re: Unable to execute query on SAPHANA using SPARK

2016-03-29 Thread Gourav Sengupta
Hi Reena, Why would you want to run SPARK off data in SAP HANA? Is not SAP HANA already an in-memory, columnar storage, SAP bells-and-whistles, super-duper expensive way of doing what poor people do in SPARK sans SAP ERP integration layers? I am just trying to understand the use case here.

Re: Reading from Amazon S3

2016-04-28 Thread Gourav Sengupta
Why would you use JAVA (create a problem and then try to solve it)? Have you tried using Scala or Python or even R? Regards, Gourav On Thu, Apr 28, 2016 at 10:07 AM, Steve Loughran wrote: > > On 26 Apr 2016, at 18:49, Ted Yu wrote: > > Looking at

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
6 12:11:18 -0700 >> Subject: Re: Weird results with Spark SQL Outer joins >> To: gourav.sengu...@gmail.com >> CC: user@spark.apache.org >> >> >> Gourav, >> >> I wish that was the case, but I have done a select count on each of the two >> tables in

Re: Error from reading S3 in Scala

2016-05-03 Thread Gourav Sengupta
Hi, The best thing to do is start the EMR clusters with proper permissions in the roles; that way you do not need to worry about the keys at all. Another thing, why are we using s3a:// instead of s3:// ? Besides that you can increase s3 speeds using the instructions mentioned here:

Re: Reading from Amazon S3

2016-05-02 Thread Gourav Sengupta
JAVA does not easily parallelize, JAVA is verbose, uses different classes for serializing, and on top of that you are using RDD's instead of dataframes. Should a senior project not have an implied understanding that it should be technically superior? Why not use SCALA? Regards, Gourav On Mon,

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
Spark version: 1.6 > Result from spark shell > OS: Linux version 2.6.32-431.20.3.el6.x86_64 ( > mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat > 4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014 > > Thanks, > > KP > > On Mon, May 2, 20

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
Hi, As always, can you please write down details regarding your SPARK cluster - the version, OS, IDE used, etc? Regards, Gourav Sengupta On Mon, May 2, 2016 at 5:58 PM, kpeng1 <kpe...@gmail.com> wrote: > Hi All, > > I am running into a weird result with Spark SQL Outer join

Re: SparkSQL with large result size

2016-05-02 Thread Gourav Sengupta
Hi, I have worked on 300GB of data by querying it from CSV (using SPARK CSV), writing it out in Parquet format, and then querying the Parquet data to partition it and write out individual csv files, all without any issues on a single node SPARK cluster installation. Are you trying to

Re: Reading from Amazon S3

2016-05-02 Thread Gourav Sengupta
would be better to > support him with the problem, because Spark supports Java. Java and Scala > run on the same underlying JVM. > > On 02 May 2016, at 17:42, Gourav Sengupta <gourav.sengu...@gmail.com> > wrote: > > JAVA does not easily parallelize, JAVA is verbose, uses differen

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Gourav Sengupta
This shows that both the tables have matching records and no mismatches. Therefore obviously you have the same results irrespective of whether you use right or left join. I think that there is no problem here, unless I am missing something. Regards, Gourav On Mon, May 2, 2016 at 7:48 PM, kpeng1
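Gourav's argument, that when every key in one table appears in the other the left and right outer joins return identical rows, can be checked with a tiny stand-in. This is pure Python of my own construction, not Spark SQL, and the dict-based join is a deliberate simplification (one value per key):

```python
def outer_join(left, right, how):
    """Join two {key: value} dicts, keeping unmatched keys from the
    'left' side ("left" join) or the 'right' side ("right" join).
    Missing matches appear as None, like NULLs in SQL outer joins."""
    if how == "left":
        keys = left.keys()
    elif how == "right":
        keys = right.keys()
    else:
        raise ValueError(how)
    return {k: (left.get(k), right.get(k)) for k in keys}

a = {1: "a1", 2: "a2", 3: "a3"}
b = {1: "b1", 2: "b2", 3: "b3"}  # same key set as a

# With fully matching keys, left and right outer joins coincide,
# which is why the reported results looked identical.
assert outer_join(a, b, "left") == outer_join(a, b, "right")
```

Only when one side has keys the other lacks do the two join directions diverge, which is the condition the thread goes on to test for.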

Re: Spark on AWS

2016-05-02 Thread Gourav Sengupta
://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html In case you are trying to load enough data into the spark Master node for graphing or exploratory analysis using Matlab, seaborn or bokeh, it's better to increase the driver memory by recreating the spark context. Regards, Gourav Sengupta

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Gourav Sengupta
Hi Kevin, Having given it a first look I do think that you have hit something here and this does not look quite fine. I have to work on the multiple AND conditions in ON and see whether that is causing any issues. Regards, Gourav Sengupta On Tue, May 3, 2016 at 8:28 AM, Kevin Peng <
