Hi Ralf,
using hardcoded secret keys and authorization details is a strict NO for AWS;
they are a major security lapse and should be avoided at any cost.
Have you tried starting the clusters using ROLES? They are a wonderful way to
start clusters or EC2 nodes, and you do not have to copy and paste any keys.
Since Spark is written in Scala, having done it in Scala should be fine for
the certification. It would be great if someone who has done the
certification could confirm.
Thanks,
Kartik
On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com
wrote:
Hi,
how important is JAVA for Spark certification? Will learning only Python
and Scala not work?
Regards,
Gourav
Excellent resource: http://www.oreilly.com/pub/e/3330
Even more amazing is the fact that the presenter actually responds to your
questions.
Regards,
Gourav Sengupta
On Wed, Aug 19, 2015 at 4:12 PM, Todd bit1...@163.com wrote:
Hi,
I would like to ask if there are any blogs/articles/videos on how ...
Regards,
Gourav Sengupta
On Wed, Aug 12, 2015 at 1:01 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:
Perhaps you could time the end-to-end runtime for each pipeline, and each
stage?
Though I'd be fairly confident that Spark will outperform Hive/Mahout on
MR, that's not the only
Why would you create a class and then instantiate it to store data and
change the class every time you have to add a new element? In OOP
terminology a class represents an object, and an object has states - does
it not?
Purely from a data warehousing perspective - one of the fundamental
I am not quite sure about this, but should the notation not be
s3n://redactedbucketname/*
instead of
s3a://redactedbucketname/* ?
The best way is to use s3://bucketname/path/*.
Regards,
Gourav
On Tue, Aug 25, 2015 at 10:35 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You can change the names,
Hi,
If you start your EC2 nodes with correct roles (default in most cases
depending on your needs) you should be able to work on S3 and all other AWS
resources without giving any keys.
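For instance, a minimal sketch (the bucket, path and the explicit provider
setting here are illustrative; recent s3a versions pick up the instance role
automatically):
// no access keys configured anywhere; the role attached to the node is used
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider",
  "com.amazonaws.auth.InstanceProfileCredentialsProvider")
val df = sqlContext.read.parquet("s3a://my-bucket/events/") // placeholder path
df.count()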
I have been doing that for some time now and I have not faced any issues
yet.
Regards,
Gourav
On Tue, Sep
Hi,
I think that it is a very bad practice to put your keys on the nodes.
Please start EC2 nodes/ EMR Clusters with proper roles and you do not have
to worry about any keys at all.
Kindly refer to AWS documentation for further details.
Regards,
Gourav
On Mon, Sep 21, 2015 at 4:34 PM, Michel Lemay
Hi,
And so you have the money to keep a SPARK cluster up and running? The way I
make it work is to test the code on a local system with a local Spark
installation, and then create a data pipeline, triggered by Lambda, which
starts the SPARK cluster, processes the data via SPARK steps, and then
terminates the cluster.
Hi,
This is how the data is created:
1. TableA : cached()
2. TableB : cached()
3. TableC: TableA inner join TableB, cached()
4. TableC join TableC does not take the data from cache but starts reading
the data for TableA and TableB from disk (see the sketch below).
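A rough sketch of the scenario above (table and column names are made up):
import org.apache.spark.sql.functions.col
val tableA = sqlContext.table("TableA").cache()
val tableB = sqlContext.table("TableB").cache()
val tableC = tableA.join(tableB, "id").cache()
tableC.count() // materialises the cache for TableC
// expected: both sides of the self join served from the cache;
// observed: one side re-reads TableA and TableB from S3
tableC.as("l").join(tableC.as("r"), col("l.id") === col("r.id")).count()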
Does this sound like a bug? The self join between
hi,
I think that people have reported the same issue elsewhere, and this should
be registered as a bug in SPARK
https://forums.databricks.com/questions/2142/self-join-in-spark-sql.html
Regards,
Gourav
> On Thu, Dec 17, 2015 at 10:52 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Hi Ted,
>>
>> The self join works fi
Hi,
the attached DAG shows that for the same table (self join) SPARK is
unnecessarily getting data from S3 for one side of the join, whereas it is
able to use the cache for the other side.
Regards,
Gourav
On Fri, Dec 18, 2015 at 10:29 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
Hi,
I have a HIVE table with a few thousand partitions (based on date and time).
It takes a long time to run the first time, and subsequently it is fast.
Is there a way to store the cache of partition lookups so that every time I
start a new SPARK instance (cannot keep my personal
Hi Jeff,
sadly that does not resolve the issue. I am sure that the in-memory mapping
to physical file locations can be saved and recovered in SPARK.
Regards,
Gourav Sengupta
On Wed, Dec 16, 2015 at 12:13 PM, Jeff Zhang <zjf...@gmail.com> wrote:
> oh, you are using S3. As I remember
I guess you mean the stage of getting the split info. I suspect it might
> be your cluster issue (or metadata store); usually it won't take such a
> long time for splitting.
>
> On Wed, Dec 16, 2015 at 8:06 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> H
ing it
as a table? I think we should be using hivecontext or sqlcontext to run
queries on a registered table.
Regards,
Gourav Sengupta
On Sat, Dec 26, 2015 at 6:27 PM, Eugene Morozov <evgeny.a.moro...@gmail.com>
wrote:
> Chris, thanks. That'd be great to try =)
>
> --
> Be w
4,c#90], Some(d)
>+- Sort [c#253 ASC], false, 0
> +- TungstenExchange hashpartitioning(c#253,200), None
> +- InMemoryColumnarTableScan [c#253], InMemoryRelation
> [b#246,c#253], true, 1, StorageLevel(true, true, false, true, 1),
> Project [b#4,c#90], Some(d)
>
by including the following:
PYSPARK_PYTHON=<>/anaconda2/bin/python2.7 PATH=$PATH:<>/anaconda/bin <>/pyspark
:)
In case you are using it in EMR the solution is a bit tricky. Just let me
know in case you want any further help.
Regards,
Gourav Sengupta
On Thu, Jun 2, 2016 at 7:59 PM
on (A.PK = B.FK)
where B.FK is not null;
This query takes 4.5 mins in SPARK
Regards,
Gourav Sengupta
will surely be excited to see if I am going wrong here and post the
results of sql.describe(). Thanks a ton once again.
Hi Ted,
Is there any way you can throw some light on this before I post this in a
blog?
Regards,
Gourav Sengupta
On Fri, Jun 10, 2016 at 7:22 PM, Gavin Yue <yue.yuava...@gmail.com> wrote:
> ooc are the tables partitioned on a.pk and b.fk? Hive might be using
> copartitioning in that case: it is one of hive's strengths.
>
> 2016-06-09 7:28 GMT-07:00 Gourav Sengupta <gourav.sengu...@gmail.com>:
>
>> Hi Mich,
>>
>
Hi,
are you using EC2 instances or a local cluster behind a firewall?
Regards,
Gourav Sengupta
On Wed, Jun 8, 2016 at 4:34 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
>
> I'm trying to create a table on s3a but I keep hitting the following error:
>
> Exce
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.c
> Could you print out the sql execution plan? My guess is about broadcast
> join.
>
>
>
> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> Hi,
>
> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here
>
Regards,
Gourav Sengupta
On Tue, May 31, 2016 at 12:22 PM, Mayuresh Kunjir <mayur...@cs.duke.edu>
wrote:
> How do I use it? I'm accessing s3a from Spark's textFile API.
>
> On Tue, May 31, 2016 at 7:16 AM, Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> Hi Mayuresh
Hi,
And on another note, is it required to use s3a? Why not use s3:// only? I
prefer to use s3:// only while writing files to S3 from EMR.
Regards,
Gourav Sengupta
On Tue, May 31, 2016 at 12:04 PM, Gourav Sengupta <gourav.sengu...@gmail.com
> wrote:
> Hi,
>
> Is your spark
Hi,
Is your spark cluster running in EMR, or is it a self-created SPARK cluster
using EC2, or a local cluster behind a firewall? What is the SPARK version
you are using?
Regards,
Gourav Sengupta
On Sun, May 29, 2016 at 10:55 PM, Mayuresh Kunjir <mayur...@cs.duke.edu>
wrote:
> I'
Hi,
have you tried using partitioning and the parquet format? It works super fast
in SPARK.
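For example (a rough sketch only; the bucket, path and partition column are
made up):
// written once, partitioned by a column; reads filtering on that column
// touch only the matching directories
df.write.partitionBy("date").parquet("s3a://my-bucket/events_parquet/")
sqlContext.read.parquet("s3a://my-bucket/events_parquet/")
  .filter("date = '2016-05-30'")
  .count()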
Regards,
Gourav
On Mon, May 30, 2016 at 5:08 PM, Michael Segel
wrote:
> I'm not sure where to post this since it's a bit of a philosophical
> question in terms of design and
+1 for the guidance from Nirvan. Also, it would be better to repartition and
store the data in parquet format in case you are planning to do the joins
more than once or with other data sources. Parquet with SPARK works like a
charm. Over S3 I have seen its performance being quite close to the cached
use case.
Spark in local mode will be way faster compared to SPARK running on HADOOP.
I have a system with 64 GB RAM and an SSD, and Spark's performance there in
local mode is way better.
Did your join include the same number of columns and rows for the dimension
table?
Regards,
Gourav Sengupta
Hi,
Can you please see the query plan (in case you are using a query)?
There is a very high chance that the query was broken into multiple steps
and only a subsequent step failed.
Regards,
Gourav Sengupta
On Fri, Jun 17, 2016 at 2:49 PM, Sumona Routh <sumos...@gmail.com> wrote:
Regards,
Gourav Sengupta
sec for 1 gb of data whereas in Spark, it is taking 4 mins
> of time.
> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>
> Could you print out the sql execution plan? My guess is about broadcast
> join.
>
>
>
> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...
n starting a cluster as mentioned above.
Regards,
Gourav Sengupta
Hi,
The writes, in terms of the number of records written simultaneously, can be
increased if you increase the number of partitions. You can try to increase
the number of partitions and check out how it works (see the sketch below).
There is, though, an upper cap (the one that I faced in Ubuntu) on the number
of parallel
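For example (a sketch only; the partition count and path are arbitrary):
// each partition is written by its own task, so more partitions
// means more simultaneous writers
df.repartition(200).write.parquet("s3a://my-bucket/out/")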
> constantly going to use them.
>
> Best,
> Burak
>
>
>
> On Fri, Feb 12, 2016 at 4:22 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> I am creating sparkcontext in a SPARK standalone cluster as mentioned
>> here: h
> On 22 Feb 2016, at 11:00, Kayode Odeyemi <drey...@gmail.com> wrote:
>>>>
>>>> Try http://localhost:4040
>>>>
>>>> On Mon, Feb 22, 2016 at 8:23 AM, Vasanth Bhat <vasb...@gmail.com>
>>>> wrote:
>>>>
>>>
Hi,
The solution is here: https://github.com/databricks/spark-csv
Using the above solution you can read CSV directly into a dataframe as well.
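For example, after launching with --packages
com.databricks:spark-csv_2.10:1.3.0 (a sketch only; the path is made up):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // first line is a header
  .option("inferSchema", "true") // sample the data to guess column types
  .load("/path/to/file.csv")
df.printSchema()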
Regards,
Gourav
On Tue, Feb 23, 2016 at 12:03 PM, Devesh Raj Singh
wrote:
> Hi,
>
> I have imported spark csv dataframe in
h that you mention exists or is
available only in one system.
Regards,
Gourav Sengupta
On Tue, Feb 23, 2016 at 8:39 PM, Robineast <robin.e...@xense.co.uk> wrote:
> Hi Thomas
>
> I can confirm that I have had this working in the past. I'm pretty sure you
> don't need p
the files in a s3://bucket/ or s3://bucket/key/ to your local system. And
then you can point your spark cluster to the local data store and run the
queries. Of course, that depends on the data volume as well.
Regards,
Gourav Sengupta
On Fri, Feb 26, 2016 at 7:29 PM, Joshua Buss <joshua.b...@gma
:
> https://issues.apache.org/jira/browse/SPARK-8125
>
> You can also look at parent issue.
>
> Which Spark release are you using ?
>
> > On Jan 22, 2016, at 1:08 AM, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
> >
> >
> > Hi,
> >
> >
Regards,
Gourav Sengupta
Hi,
Are you creating RDD's using the textFile option? Can you please let me know
the following (the sketch below shows one way to check):
1. Number of partitions
2. Number of files
3. Time taken to create the RDD's
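One way to check points 1 and 3 (a sketch only; the path is made up):
val rdd = sc.textFile("s3a://my-bucket/data/*")
println(rdd.partitions.length)           // 1. number of partitions
val start = System.nanoTime
rdd.count()                              // forces the full read
println((System.nanoTime - start) / 1e9) // 3. rough creation time in seconds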
Regards,
Gourav Sengupta
On Tue, Jan 26, 2016 at 1:12 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:
Hi,
are you creating RDD's out of the data?
Regards,
Gourav
On Tue, Jan 26, 2016 at 12:45 PM, aecc wrote:
> Sorry, I have not been able to solve the issue. I used speculation mode as
> workaround to this.
>
>
>
> --
> View this message in context:
>
rom S3, but right now I upgraded to
>>> spark 1.5.2 and seems like reading from S3 works fine (first succeeded task
>>> in the screenshot attached, which takes 42 s).
>>>
>>> But than it gets stuck. The screenshot attached shows 24 running tasks
>>
Hi,
So far no one has understood my question at all. I know what it takes to
load packages via the SPARK shell or SPARK submit.
How do I load packages when starting a SPARK cluster, as mentioned here
http://spark.apache.org/docs/latest/spark-standalone.html ?
Regards,
Gourav Sengupta
On Mon
Hi,
How do we include the following package:
https://github.com/databricks/spark-csv while starting a SPARK standalone
cluster as mentioned here:
http://spark.apache.org/docs/latest/spark-standalone.html
Thanks and Regards,
Gourav Sengupta
On Mon, Feb 15, 2016 at 10:32 AM, Ramanathan R
wrote:
> $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
>
>
>
> It will download everything for you and register it into your JVM. If you
> want to use it in prod, just package it with Maven.
>
> On 15/02/2016, at 12:14, Gourav Sengupta <g
luster in local mode, kindly do not attempt
to answer this question.
My question is how to use packages like
https://github.com/databricks/spark-csv when I am using a SPARK cluster in
local mode.
Regards,
Gourav Sengupta
On Mon, Feb 15, 201
Hi Gaurav,
do you mean a stored proc that returns a table?
Regards,
Gourav
On Tue, Feb 16, 2016 at 9:04 AM, Gaurav Agarwal
wrote:
> Hi
> Can I load the data into spark from oracle storedproc
>
> Thanks
>
Apache Zeppelin will be the right solution, with built-in plugins for Python
and visualizations as well.
Are you planning to use this in EMR?
Regards,
Gourav
On Tue, Feb 16, 2016 at 12:04 PM, Rajeev Reddy
wrote:
> Hello,
>
> Let me understand your query correctly.
>
take a look here as well: http://zeppelin-project.org/ It executes Scala,
Python, and markup documents in the same notebook and draws beautiful
visualisations as well. It comes built into AWS EMR too.
Regards,
Gourav
On Tue, Feb 16, 2016 at 12:43 PM, Aleksandr Modestov <
as there are some write issues which 2.11 resolves.
Hopefully you are using the latest release of SPARK.
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0
Regards,
Gourav Sengupta
On Thu, Feb 18, 2016 at 11:05 AM, Teng Qiu <teng...@gmail.com> wrote:
> downloa
Hi,
Just out of sheer curiosity, why are you not using EMR to start your SPARK
cluster?
Regards,
Gourav
On Thu, Feb 18, 2016 at 12:23 PM, Ted Yu wrote:
> Have you seen this ?
>
> HADOOP-10988
>
> Cheers
>
> On Thu, Feb 18, 2016 at 3:39 AM, James Hammerton
interesting.
And I am almost sure that none of the EMR hosted services (HADOOP, SPARK,
Zeppelin, etc.) are exposed to external IP addresses even if you are using
the classical setting.
Regards,
Gourav Sengupta
On Thu, Feb 18, 2016 at 2:25 PM, Teng Qiu <teng...@gmail.com> wrote:
> EMR
Regards,
Gourav Sengupta
On Thu, Feb 18, 2016 at 2:30 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Please see the last 3 posts on this thread:
>
> http://search-hadoop.com/m/q3RTtTorTf2o3UGK1=Re+spark+ec2+vs+EMR
>
> FYI
>
> On Thu, Feb 18, 2016 at 6:25 AM, Teng Qiu <teng
one system and not the other, then
the workers will only run from that system.
Regards,
Gourav Sengupta
On Wed, Feb 17, 2016 at 4:20 PM, Junjie Qian <qian.jun...@outlook.com>
wrote:
> Hi all,
>
> I am new to Spark, and have one problem that, no computations run on
> workers/slave_servers in
can you please try localhost:8080?
Regards,
Gourav Sengupta
On Fri, Feb 19, 2016 at 11:18 AM, vasbhat <vasb...@gmail.com> wrote:
> Hi,
>
>I have installed the spark1.6 and trying to start the master
> (start-master.sh) and access the webUI.
>
> I get the f
know. From what I reckon joins like yours should not take more than a few
milliseconds.
Regards,
Gourav Sengupta
On Fri, Feb 19, 2016 at 5:31 PM, Tamara Mendt <t...@hellofresh.com> wrote:
> Hi all,
>
> I am running a Spark job that gets stuck attempting to join two
> datafram
Sorry,
please add the following questions to the list above:
the SPARK version?
whether you are using RDD or DataFrames?
is the code run locally or in SPARK Cluster mode or in AWS EMR?
Regards,
Gourav Sengupta
On Sun, Feb 21, 2016 at 7:37 PM, Gourav Sengupta <gourav.sengu...@gmail.
.
Regards,
Gourav Sengupta
On Tue, Mar 1, 2016 at 9:15 AM, Oleg Ruchovets <oruchov...@gmail.com> wrote:
> Hi, I installed EMR 4.3.0 with spark. I tried to enter the spark shell but
> it looks like it doesn't work and throws exceptions.
> Please advice:
>
> [hadoop@ip-172-31-39-3
or Scala (see Apache Toree) or use Zeppelin.
Regards,
Gourav Sengupta
On Mon, Feb 29, 2016 at 11:48 PM, Sumona Routh <sumos...@gmail.com> wrote:
> Hi there,
> I've been doing some performance tuning of our Spark application, which is
> using Spark 1.2.1 standalone. I have been
security.
Regards,
Gourav Sengupta
On Tue, Mar 15, 2016 at 2:38 PM, Sabarish Sasidharan <sabarish@gmail.com
> wrote:
> You have a slash before the bucket name. It should be @.
>
> Regards
> Sab
> On 15-Mar-2016 4:03 pm, "Yasemin Kaya" <godo...@gmail.com> w
Hi,
Try starting your clusters with roles, and you will not have to configure or
hard-code anything at all.
Let me know in case you need any help with this.
Regards,
Gourav Sengupta
On Tue, Mar 15, 2016 at 11:32 AM, Yasemin Kaya <godo...@gmail.com> wrote:
> Hi Safak,
>
> I c
. Understand what you
> suggested is an appropriate way of doing it, which I myself have proposed
> before, but that doesn't solve the OP's problem at hand.
>
> Regards
> Sab
> On 15-Mar-2016 8:27 pm, "Gourav Sengupta" <gourav.sengu...@gmail.com>
> wrote:
Hi,
I stopped using s3n a long time ago. In case you are working with parquet
and writing files, s3a is the only way to avoid failures.
Otherwise, why not just use s3://?
Regards,
Gourav
On Wed, Apr 13, 2016 at 12:17 PM, Steve Loughran
wrote:
>
> On 12
modules in SPARK Local Server mode, please let me know.
Regards,
Gourav Sengupta
On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote:
> Good to know that.
>
> That is why Sqoop has this "direct" mode, to utilize the vendor specific
> feature.
Hi,
why are you not using data frames and SPARK CSV?
Regards,
Gourav
On Sat, Apr 9, 2016 at 10:00 PM, SURAJ SHETH wrote:
> Hi,
> I am using Spark 1.5.2
>
> The file contains 900K rows each with twelve fields (tab separated):
> The first 11 fields are Strings with a maximum
why not use AWS Lambda?
Regards,
Gourav
On Fri, Apr 8, 2016 at 8:14 PM, Benjamin Kim wrote:
> Has anyone monitored an S3 bucket or directory using Spark Streaming and
> pulled any new files to process? If so, can you provide basic Scala coding
> help on this?
>
> Thanks,
>
hi,
how are you running your SPARK cluster (is it in local mode or distributed
mode)? Do you have pyspark installed in anaconda?
Regards,
Gourav Sengupta
On Mon, Mar 7, 2016 at 9:28 AM, guoqing0...@yahoo.com.hk <
guoqing0...@yahoo.com.hk> wrote:
> Hi all
> I had following c
optimization.
Regards,
Gourav Sengupta
On Fri, Mar 4, 2016 at 8:35 AM, Mohammad Tariq <donta...@gmail.com> wrote:
> You could try DataFrame.sort() to sort your data based on a column.
>
>
>
> [image: http://]
>
> Tariq, Mohammad
> about.me/mti
> [image
messages all over the place for another 20 mins,
after which we killed the Jupyter application.
Regards,
Gourav Sengupta
On Sun, Mar 6, 2016 at 11:48 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> Gourav:
> For the 3rd paragraph, did you mean the job seemed to be idle for about 5
> minut
which does not run for less
than 5 minutes.
Regards,
Gourav Sengupta
On Fri, Mar 4, 2016 at 8:59 PM, Jelez Raditchkov <je...@hotmail.com> wrote:
> Working on a streaming job with DirectParquetOutputCommitter to S3
> I need to use PartitionBy and hence SaveMode.Append
>
hi,
is the table that you are trying to overwrite an external table or a
temporary table created in hiveContext?
Regards,
Gourav Sengupta
On Sat, Mar 5, 2016 at 3:01 PM, Dhaval Modi <dhavalmod...@gmail.com> wrote:
> Hi Team,
>
> I am facing an issue while writing dataframe bac
.
Regards,
Gourav Sengupta
On Sun, Mar 6, 2016 at 10:57 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:
> Thanks for this tip
>
> The way I do it is to pass SparckContext "sc" to method
> firstquery.firstquerym by calling the following
>
>
Hi,
That depends on a lot of things, but as a starting point I would ask
whether you are planning to store your data in JSON format.
Regards,
Gourav Sengupta
On Sun, Mar 6, 2016 at 5:17 PM, Laumegui Deaulobi <
guillaume.bilod...@gmail.com> wrote:
> Our problem space is survey analyti
data projects (like any other BI projects) do not deliver
value, or turn out to be extremely expensive to maintain, because of the
assumption that tools solve the problem.
Regards,
Gourav Sengupta
On Sun, Mar 6, 2016 at 5:25 PM, Guillaume Bilodeau <
guillaume.bilod...@gmail.com> wrote:
> The data is
Hi,
once again that is all about tooling.
Regards,
Gourav Sengupta
On Sun, Mar 6, 2016 at 7:52 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:
> Hi,
>
>
>
> What is the current size of your relational database?
>
>
>
> Are we talking about
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 1 March 2016
port 4040 will no longer be available
> after your spark app finishes. You should go to the spark master's UI
> (port 8080) and take a look at "completed applications"...
>
> refer to doc: http://spark.apache.org/docs/latest/monitoring.html
> read the first "note that" :)
Hi,
why not read the table into a dataframe directly using the SPARK CSV package?
You are trying to solve the problem the roundabout way.
Regards,
Gourav Sengupta
On Thu, Mar 3, 2016 at 12:33 PM, Sumedh Wale <sw...@snappydata.io> wrote:
> On Thursday 03 March 2016 11:03 AM, Angel An
Hi,
Why are you trying to load data into HIVE and then access it via
hiveContext? (By the way, hiveContext tables are not visible in the
sqlContext.)
Please read the data directly into a SPARK dataframe and then register it
as a temp table to run queries on it.
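A minimal sketch (the format, path and table name here are made up):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/path/to/data.csv")
df.registerTempTable("my_table") // visible to this context only; no HIVE load needed
sqlContext.sql("SELECT COUNT(*) FROM my_table").show()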
Regards,
Gourav
On Thu, Mar 3,
Hi,
using dataframes you can use SQL, and SQL supports JOIN, BETWEEN, IN and
LIKE operations (see the sketch below). Why would someone take a dataframe
and then use it as an RDD? :)
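Illustrative only (the two dataframes and their columns are made up):
val joined = df.join(lookup, df("id") === lookup("id"))        // JOIN
joined.filter("city LIKE 'San%' AND visits BETWEEN 10 AND 20") // LIKE, BETWEEN
  .filter("country IN ('US', 'DE')")                           // IN
  .show()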
Regards,
Gourav Sengupta
On Thu, Mar 3, 2016 at 4:28 PM, Sumedh Wale <sw...@snappydata.io> wrote:
> On Thursday 03 M
in the library
folder than the ones which are usually supplied with the SPARK distribution:
1. ojdbc7.jar
2. spark-csv***jar file
Regards,
Gourav Sengupta
On Tue, Mar 1, 2016 at 5:19 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:
> Hi,
>
> I am getting the error "*java.la
Hi,
I will be grateful if someone could kindly respond back to this query.
Thanks and Regards,
Gourav Sengupta
-- Forwarded message --
From: Gourav Sengupta <gourav.sengu...@gmail.com>
Date: Sat, Feb 27, 2016 at 12:39 AM
Subject: Starting SPARK application in cluster mod
Francisco", 12, 44.52, true),
Row("Palo Alto", 12, 22.33, false),
Row("Munich", 8, 3.14, true)))
val hiveContext = new HiveContext(sc)
//val sqlContext = new org.apache.spark.sql.SQLContext(sc)
}
}
-
Regards,
Gourav Sengupta
Hi Reena,
Why would you want to run SPARK off data in SAP HANA? Is not SAP HANA
already an in-memory, columnar-storage, SAP bells-and-whistles, super-duper
expensive way of doing what poor people do in SPARK, sans the SAP ERP
integration layers?
I am just trying to understand the use case here.
Why would you use JAVA (create a problem and then try to solve it)? Have
you tried using Scala or Python or even R?
Regards,
Gourav
On Thu, Apr 28, 2016 at 10:07 AM, Steve Loughran
wrote:
>
> On 26 Apr 2016, at 18:49, Ted Yu wrote:
>
> Looking at
6 12:11:18 -0700
>> Subject: Re: Weird results with Spark SQL Outer joins
>> To: gourav.sengu...@gmail.com
>> CC: user@spark.apache.org
>>
>>
>> Gourav,
>>
>> I wish that was case, but I have done a select count on each of the two
>> tables in
Hi,
The best thing to do is start the EMR clusters with proper permissions in
the roles that way you do not need to worry about the keys at all.
Another thing, why are we using s3a// instead of s3:// ?
Besides that you can increase s3 speeds using the instructions mentioned
here:
JAVA does not easily parallelize, JAVA is verbose, uses different classes
for serializing, and on top of that you are using RDD's instead of
dataframes.
Should a senior project not have an implied understanding that it should be
technically superior?
Why not use SCALA?
Regards,
Gourav
On Mon,
Spark version: 1.6
> Result from spark shell
> OS: Linux version 2.6.32-431.20.3.el6.x86_64 (
> mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat
> 4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014
>
> Thanks,
>
> KP
>
> On Mon, May 2, 20
Hi,
As always, can you please write down details regarding your SPARK cluster -
the version, OS, IDE used, etc?
Regards,
Gourav Sengupta
On Mon, May 2, 2016 at 5:58 PM, kpeng1 <kpe...@gmail.com> wrote:
> Hi All,
>
> I am running into a weird result with Spark SQL Outer join
Hi,
I have worked with 300GB of data by querying it from CSV (using SPARK CSV)
and writing it to Parquet format, and then querying the Parquet data to
partition it and write out individual CSV files, without any issues on a
single-node SPARK installation.
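Roughly (the buckets, paths and partition column here are made up):
val raw = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("s3a://my-bucket/raw/*.csv")
raw.write.partitionBy("date").parquet("s3a://my-bucket/parquet/") // one-off conversion
sqlContext.read.parquet("s3a://my-bucket/parquet/")
  .filter("date = '2016-05-01'") // now hits columnar, splittable files
  .write.format("com.databricks.spark.csv")
  .save("s3a://my-bucket/csv_out/2016-05-01/")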
Are you trying to
ld be better to
> support him with the problem, because Spark supports Java. Java and Scala
> run on the same underlying JVM.
>
> On 02 May 2016, at 17:42, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> JAVA does not easily parallelize, JAVA is verbose, uses differen
This shows that both the tables have matching records and no mismatches.
Therefore obviously you have the same results irrespective of whether you
use right or left join.
I think that there is no problem here, unless I am missing something.
Regards,
Gourav
On Mon, May 2, 2016 at 7:48 PM, kpeng1
://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html
In case you are trying to load a lot of data into the spark Master node for
graphing or exploratory analysis using Matlab, seaborn or bokeh, it's better
to increase the driver memory by recreating the spark context.
Regards
Gourav Sengupta
Hi Kevin,
Having given it a first look, I do think that you have hit something here,
and this does not look quite right. I have to work on the multiple AND
conditions in the ON clause and see whether that is causing any issues.
Regards,
Gourav Sengupta
On Tue, May 3, 2016 at 8:28 AM, Kevin Peng <