Hi Danellis
For point 1, Spark Streaming is something to look at.
For point 2, you can create a DAO from Cassandra on each batch of stream
processing. This may be a costly operation, but to do real-time processing
of data, you have to live with it.
Point 3 is covered in point 2 above.
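If it helps, a rough sketch of a per-batch Cassandra lookup (untested,
assuming the DataStax spark-cassandra-connector is on the classpath;
keyspace, table and hosts are hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.cql.CassandraConnector

val conf = new SparkConf()
  .setAppName("StreamWithCassandraLookup")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // assumption: Cassandra reachable here
val ssc = new StreamingContext(conf, Seconds(10))        // 10-second micro batches
val connector = CassandraConnector(conf)

val keys = ssc.socketTextStream("localhost", 9999)       // any input DStream of keys
keys.foreachRDD { rdd =>
  rdd.foreachPartition { part =>
    // one session per partition rather than per record, to keep the lookup cost down
    connector.withSessionDo { session =>
      part.foreach { key =>
        session.execute("SELECT * FROM my_ks.lookup WHERE id = ?", key)  // hypothetical table
      }
    }
  }
}
ssc.start()
ssc.awaitTermination()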
Since you are
I am doing a join over one DataFrame and an empty DataFrame.
The first DataFrame has almost 50k records.
This operation never returns and runs indefinitely.
Is there any solution to get around this?
--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
ll config to overcome this.
> Tried almost everything i could after searching online.
>
> Any help from the mailing list would be appreciated.
>
> On Thu, Aug 4, 2016 at 7:43 AM, Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> I am facing the same issue with spark 1.5
I am facing the same issue with Spark 1.5.2.
If the file being processed by Spark is 10-12 MB in size, it throws out of
memory.
But if the same file is within the 5 MB limit, it runs fine.
I am using a Spark configuration with 7 GB of memory and 3 cores for
executors in a cluster of 8.
Yes. I am using Spark for ETL, and I am sure there are a lot of other
companies who are using Spark for ETL as well.
Thanks
Deepak
On 2 Aug 2016 11:40 pm, "Rohit L" wrote:
> Does anyone use Spark for ETL?
>
> On Tue, Aug 2, 2016 at 1:24 PM, Sonal Goyal wrote:
I am using a DAO in a Spark application to write the final computation to
Cassandra and it performs well.
What kinds of issues do you foresee using a DAO for HBase?
Thanks
Deepak
On 19 Jul 2016 10:04 pm, "Yu Wei" wrote:
> Hi guys,
>
>
> I write spark application and want to store
In Spark Streaming, you have to decide the duration of the micro batches to
run.
Once you get the micro batch, transform it as per your logic, and then you
can use saveAsTextFiles on the resulting DStream to write it to HDFS.
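A rough sketch of what that looks like (untested; batch duration, input
source and output path are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchToHdfs")
val ssc = new StreamingContext(conf, Seconds(30))             // 30-second micro batches

val lines = ssc.socketTextStream("localhost", 9999)           // any input DStream
val transformed = lines.filter(_.nonEmpty).map(_.toUpperCase) // your own logic goes here
transformed.saveAsTextFiles("hdfs:///user/spark/out/batch")   // one directory per micro batch

ssc.start()
ssc.awaitTermination()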
Thanks
Deepak
On 20 Jul 2016 9:49 am, wrote:
Hi Phil
I guess for() is executed on the driver, while foreach() will execute in
parallel on the executors.
You can try both, without collecting the RDD.
foreach in this case would print on the executors, and you would not see
anything on the driver console.
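Something like this in the spark-shell shows the difference (a small
sketch; the RDD contents are arbitrary):

val rdd = sc.parallelize(1 to 10)

// Runs on the driver: collect() pulls the data back first, so the output
// shows up on the driver console.
for (x <- rdd.collect()) println(x)

// Runs on the executors: each partition prints locally, so the output lands
// in the executor logs rather than the driver console.
rdd.foreach(x => println(x))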
Thanks
Deepak
On Tue, Jul 12, 2016 at 9:28 PM,
Yes. You can do something like this:
.map(x => mapfunction(x))
Thanks
Deepak
On 9 Jul 2016 9:22 am, "charles li" wrote:
>
> hi, guys, is there a way to dynamic load files within the map function.
>
> i.e.
>
> Can I code as bellow:
>
>
>
>
> thanks a lot.
>
>
>
> --
>
You have to distribute the files on some distributed file system like HDFS.
Or else copy the files to every executor's local file system, and make sure
to mention the file scheme in the URI explicitly.
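For example (a sketch; paths are placeholders):

// Read from HDFS, which every executor can see:
val fromHdfs = sc.textFile("hdfs:///data/lookup.txt")

// Or, if the file has been copied to the same local path on every executor,
// make the scheme explicit so Spark does not assume the default filesystem:
val fromLocal = sc.textFile("file:///data/lookup.txt")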
Thanks
Deepak
On Thu, Jul 7, 2016 at 7:13 PM, Balachandar R.A.
wrote:
atic
> write a size properly for what I already set in Alluxio 512MB per block.
>
>
> On Jul 1, 2016, at 11:01 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>
> Before writing, coalesce your RDD to 1.
> It will create only 1 output file.
> Multiple part file happens
Hi Ajay
Looking at your Spark code, I can see you used HiveContext.
Can you try using SQLContext instead of HiveContext there?
Thanks
Deepak
On Mon, Jun 13, 2016 at 10:15 PM, Ajay Chander wrote:
> Hi Mohit,
>
> Thanks for your time. Please find my response below.
>
> Did
I am not sure if Spark provides any support for incremental extracts
inherently.
But you can maintain a file, e.g. extractRange.conf, in HDFS: read the end
range from it, and have the Spark job update it with the new end range
before it finishes, so the relevant ranges are available for the next run.
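A minimal sketch of that idea (untested; the path and range format are
hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val rangePath = new Path("hdfs:///etl/extractRange.conf")    // hypothetical location
val fs = rangePath.getFileSystem(new Configuration())

// Read the end range written by the previous run.
val lastEnd = {
  val in = fs.open(rangePath)
  try scala.io.Source.fromInputStream(in).mkString.trim finally in.close()
}

// ... run the extract for records newer than lastEnd ...

// Overwrite the file with the new end range for the next run.
val newEnd = "2016-08-04T00:00:00"                           // placeholder value
val out = fs.create(rangePath, true)
try out.write(newEnd.getBytes("UTF-8")) finally out.close()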
On
Hi Mayuresh
Instead of s3a, have you tried the https:// URI for the same S3 bucket?
HTH
Deepak
On Tue, May 31, 2016 at 4:41 PM, Mayuresh Kunjir
wrote:
>
>
> On Tue, May 31, 2016 at 5:29 AM, Steve Loughran
> wrote:
>
>> which s3 endpoint?
>>
>>
>
Hi Saurabh
You can have a Hadoop cluster running YARN as the scheduler.
Configure Spark to run with the same YARN setup.
Then you need R on only 1 node, and connect to the cluster using SparkR.
Thanks
Deepak
On Mon, May 30, 2016 at 12:12 PM, Jörn Franke wrote:
>
> Well if
Hi
I am reading a text file with 16 fields.
All the placeholders for the values of this text file have been defined in,
say, 2 different case classes:
Case1 and Case2
How do I map the values read from the text file, so that my function in
Scala is able to return 2 different RDDs, with each RDD of
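To illustrate the setup being described, a small sketch (the field names
and split logic are made up; the real file has 16 fields):

case class Case1(id: String, name: String)          // placeholder subset of the fields
case class Case2(id: String, amount: Double)        // placeholder subset of the fields

val lines  = sc.textFile("hdfs:///data/input.txt")  // hypothetical path
val fields = lines.map(_.split(","))

val rdd1 = fields.map(f => Case1(f(0), f(1)))
val rdd2 = fields.map(f => Case2(f(0), f(2).toDouble))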
Hi
I have a Scala program consisting of Spark Core and Spark Streaming APIs.
Is there any open source tool that I can use to debug the program for
performance reasons?
My primary interest is to find the blocks of code that would be executed on
the driver and what would go to the executors.
Is there JMX
dead and it shuts down abruptly.
>> Could this issue be related to yarn? I see correct behavior locally. I
>> did "yarn kill " to kill the job.
>>
>>
>> On Thu, May 12, 2016 at 12:28 PM Deepak Sharma <deepakmc...@gmail.com>
>> wrote:
>>
(Marketing Platform-BLR) <
rakes...@flipkart.com> wrote:
> Yes, it seems to be the case.
> In this case executors should have continued logging values till 300, but
> they are shutdown as soon as i do "yarn kill .."
>
> On Thu, May 12, 2016 at 12:11 PM Deepak Sharma
er$: VALUE -> 205
> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 206
>
>
>
>
>
>
> On Thu, May 12, 2016 at 11:45 AM Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> Hi Rakesh
>> Did you tried se
Hi Rakesh
Did you try setting spark.streaming.stopGracefullyOnShutdown to true for
your Spark configuration instance?
If not, try this, and let us know if it helps.
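For reference, setting it looks like this (a sketch; app name and batch
duration are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("GracefulShutdownJob")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")  // finish in-flight batches before stopping
val ssc = new StreamingContext(conf, Seconds(10))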
Thanks
Deepak
On Thu, May 12, 2016 at 11:42 AM, Rakesh H (Marketing Platform-BLR) <
rakes...@flipkart.com> wrote:
> Issue i
Since you are registering workers from the same node, do you have enough
cores and RAM (in this case >= 9 cores and >= 24 GB) on this node
(11.14.224.24)?
Thanks
Deepak
On Wed, May 11, 2016 at 9:08 PM, شجاع الرحمن بیگ
wrote:
> Hi All,
>
> I need to set same memory and
then apply
> compression codec on it, save the rdd to another Hadoop cluster?
>
> Thank you,
> Ajay
>
> On Tuesday, May 10, 2016, Deepak Sharma <deepakmc...@gmail.com> wrote:
>
>> Hi Ajay
>> You can look at wholeTextFiles method of rdd[string,string] and
Hi Ajay
You can look at the wholeTextFiles method, which returns an
RDD[(String, String)], and then map each entry and saveAsTextFile.
This will serve the purpose.
I don't think anything like distcp exists in Spark by default.
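A rough sketch of that approach (cluster addresses and paths are
placeholders; note it does not preserve the original file names):

// Read every file under the source directory as (path, content) pairs...
val files = sc.wholeTextFiles("hdfs://source-nn:8020/data/in")

// ...and write the contents out to the target cluster. Everything lands as
// part-* files, which is usually fine for a one-off copy.
files.values.saveAsTextFile("hdfs://target-nn:8020/data/out")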
Thanks
Deepak
On 10 May 2016 11:27 pm, "Ajay Chander" wrote:
> Hi
Spark 2.0 is yet to come out as a public release.
I am waiting to get my hands on it as well.
Please do let me know if I can download the source and build Spark 2.0 from
GitHub.
Thanks
Deepak
On Fri, May 6, 2016 at 9:51 PM, Sunita Arvind wrote:
> Hi All,
>
> We are evaluating a
With Structured Streaming, Spark would provide APIs over the Spark SQL
engine.
Once you have the structured stream and a DataFrame created out of it, you
can do ad-hoc querying on the DF, which means you are actually querying the
stream without having to store or transform it.
I have not used
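For what it's worth, a rough sketch of how that might look on the 2.0 API
(untested, and the API may change before release; the socket source and
query are just illustrations):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StructuredStreamingSketch").getOrCreate()
import spark.implicits._

// A streaming DataFrame read from a socket source.
val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

// Ad-hoc style query directly on the streaming DataFrame.
val counts = events.as[String].groupBy("value").count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()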
Hi Tapan
I would suggest an architecture where you have separate storage and data
serving layers.
Spark is still best for batch processing of data.
So what I am suggesting here is that you can have your data stored as-is in
some HDFS raw layer, run your ELT in Spark on this raw data and
Hi all,
I am looking for an architecture to ingest 10 million messages in micro
batches of seconds.
If anyone has worked on a similar kind of architecture, can you please
point me to any documentation around the same, like what the architecture
should be, and which components/big data
as trying to
> run big data stuff on windows. Have run in so much of issues that I could
> just throw the laptop with windows out.
>
> Your view - Redhat, Ubuntu or Centos.
> Does Redhat give a one year licence on purchase etc?
>
> Thanks
>
> On Mon, Apr 18, 2016 at
re Galore on Spark.
> Since I am starting afresh, what would you advice?
>
> On Mon, Apr 18, 2016 at 5:45 PM, Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> Binary for Spark means it's Spark built against hadoop 2.6
>> It will not have any hadoop executables.
binary format or will have to build it?
> 3) Is there a basic tutorial for Hadoop on windows for the basic needs of
> Spark.
>
> Thanks in Advance !
>
> On Mon, Apr 18, 2016 at 5:35 PM, Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> Once you download hadoop
Once you download Hadoop and format the namenode, you can use start-dfs.sh
to start HDFS.
Then use 'jps' to see if the datanode/namenode services are up and running.
Thanks
Deepak
On Mon, Apr 18, 2016 at 5:18 PM, My List wrote:
> Hi ,
>
> I am a newbie on Spark.I wanted to
Hello All,
I am looking for a use case where anyone has used Spark Streaming
integration with LinkedIn.
--
Thanks
Deepak
Hi Rafael
If you are using YARN as the engine, you can always use the RM UI to see
the application's progress.
Thanks
Deepak
On Tue, Apr 5, 2016 at 12:18 PM, Rafael Barreto
wrote:
> Hello,
>
> I have a driver deployed using `spark-submit` in supervised cluster mode.
>
There is a Spark action defined for Oozie workflows.
Though I am not sure if it supports only Java Spark jobs or Scala jobs as
well.
https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html
Thanks
Deepak
On Mon, Mar 7, 2016 at 2:44 PM, Divya Gehlot
wrote:
> Hi,
>
Yes, you can do it unless the method is marked static/final.
Most of the methods in SparkContext are marked static, so you definitely
can't override them; otherwise, overriding would usually work.
Thanks
Deepak
On Fri, Jan 8, 2016 at 12:06 PM, yuliya Feldman wrote:
>
Invalid jobj 2. If SparkR was restarted, Spark operations need to be
> re-executed.
>
>
> Not sure what is causing this? Any leads or ideas? I am using rstudio.
>
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> Hi Sandee
Hi Sandeep
I am not sure if ORC can be read directly in R.
But there can be a workaround: first create a Hive table on top of the ORC
files, and then access the Hive table in R.
Thanks
Deepak
On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana
wrote:
> Hello
>
> I need to read an ORC
I have never tried this, but there are YARN client APIs that you can use in
your Spark program to get the application ID.
Here is the link to the YarnClient Javadoc:
http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/yarn/client/api/YarnClient.html
getApplications() is the method for your
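A rough sketch of using it (untested; the name filter is just an example):

import scala.collection.JavaConverters._
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

val yarnClient = YarnClient.createYarnClient()
yarnClient.init(new YarnConfiguration())
yarnClient.start()

// List the applications known to the ResourceManager and pick out the one
// we are interested in (matching on name here is only illustrative).
val apps = yarnClient.getApplications.asScala
apps.filter(_.getName == "my-spark-app")          // hypothetical application name
    .foreach(app => println(app.getApplicationId))

yarnClient.stop()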
An approach I can think of is using the Ambari Metrics Service (AMS).
Using these metrics, you can decide whether the cluster is low on
resources.
If yes, call the Ambari management API to add a node to the cluster.
Thanks
Deepak
On Mon, Dec 14, 2015 at 2:48 PM, cs user
Hi All
Sorry for spamming your inbox.
I am really keen to work on a big data project full time (preferably
remote from India); if not, I am open to volunteering as well.
Please do let me know if there is any such opportunity available.
--
Thanks
Deepak
Sai,
I am a bit confused here.
How are you using write with results?
I am using Spark 1.4.1, and when I use write, it complains about write not
being a member of DataFrame:
error: value write is not a member of org.apache.spark.sql.DataFrame
Thanks
Deepak
On Mon, Nov 16, 2015 at 4:10 PM, 张炜
Hi All
I am confused about RDD persistence in cache.
If I cache an RDD, is it going to stay in memory even after the Spark
program that created it completes execution?
If not, how can I guarantee that the RDD stays cached even after the
program finishes execution?
Thanks
Deepak
<engr...@gmail.com> wrote:
> The cache gets cleared out when the job finishes. I am not aware of a way
> to keep the cache around between jobs. You could save it as an object file
> to disk and load it as an object file on your next job for speed.
> On Thu, Nov 5, 2015 at 6:1
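Following that suggestion, a minimal sketch (paths and element type are
placeholders):

// Job 1: compute once and persist to disk so a later job can reuse it.
val computed = sc.textFile("hdfs:///data/in").map(_.length)
computed.saveAsObjectFile("hdfs:///tmp/precomputed-rdd")

// Job 2 (a separate application): load it back instead of recomputing.
val reloaded = sc.objectFile[Int]("hdfs:///tmp/precomputed-rdd")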
Hi
I am looking for any blog/doc on developer best practices for using Spark.
I have already looked at the tuning guide on spark.apache.org.
Please do let me know if anyone is aware of any such resource.
Thanks
Deepak