Catalyst dependency on Spark Core

2014-07-13 Thread Aniket Bhatnagar
As per the recent presentation given at Scala Days (http://people.apache.org/~marmbrus/talks/SparkSQLScalaDays2014.pdf), it was mentioned that Catalyst is independent of Spark. But on inspecting the pom.xml of the sql/catalyst module, it seems it has a dependency on Spark Core. Any particular reason for

NotSerializableException exception while using TypeTag in Scala 2.10

2014-07-28 Thread Aniket Bhatnagar
I am trying to serialize objects contained in RDDs using runtime reflection via TypeTag. However, the Spark job keeps failing with java.io.NotSerializableException on an instance of TypeCreator (auto generated by the compiler to enable TypeTags). Is there any workaround for this without switching to scala

RDD to DStream

2014-08-01 Thread Aniket Bhatnagar
Sometimes it is useful to convert an RDD into a DStream for testing purposes (generating DStreams from historical data, etc). Is there an easy way to do this? I could come up with the following inefficient way but not sure if there is a better way to achieve this. Thoughts? class
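
A minimal sketch of one way to do this for testing, using queueStream (illustrative, not the code from the original thread; sc and rdd are assumed to exist):

    import scala.collection.mutable
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Assumes sc is an existing SparkContext and rdd holds the historical data.
    val ssc = new StreamingContext(sc, Seconds(10))
    // queueStream emits one RDD from the queue per batch interval, which is often
    // good enough to simulate a DStream from an RDD in tests.
    val stream = ssc.queueStream(mutable.Queue(rdd), oneAtATime = true)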

Re: RDD to DStream

2014-08-01 Thread Aniket Bhatnagar
August 2014 13:55, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Sometimes it is useful to convert an RDD into a DStream for testing purposes (generating DStreams from historical data, etc). Is there an easy way to do this? I could come up with the following inefficient way but not sure

Re: Spark shell creating a local SparkContext instead of connecting to connecting to Spark Master

2014-08-06 Thread Aniket Bhatnagar
*:7077 $SPARK_HOME/bin/spark-shell Then it will connect to that *whatever* master. Thanks Best Regards On Tue, Aug 5, 2014 at 8:51 PM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi Apologies if this is a noob question. I have setup Spark 1.0.1 on EMR using a slightly modified

Block input-* already exists on this machine; not re-adding it warnings

2014-08-22 Thread Aniket Bhatnagar
Hi everyone I back ported kinesis-asl to spark 1.0.2 and ran a quick test on my local machine. It seems to be working fine but I keep getting the following warnings. I am not sure what it means and whether it is something to worry about or not. 2014-08-22 15:53:43,803 [pool-1-thread-7] WARN

Understanding how to create custom DStreams in Spark streaming

2014-08-22 Thread Aniket Bhatnagar
Hi everyone Sorry about the noob question, but I am struggling to understand ways to create DStreams in Spark. Here is my understanding based on what I could gather from documentation and studying Spark code (as well as some hunches). Please correct me if I am wrong. 1. In most cases, one would

Re: Block input-* already exists on this machine; not re-adding it warnings

2014-08-26 Thread Aniket Bhatnagar
. Thanks, Aniket On 22 August 2014 15:54, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi everyone I back ported kinesis-asl to spark 1.0.2 and ran a quick test on my local machine. It seems to be working fine but I keep getting the following warnings. I am not sure what it means and whether

Using unshaded akka in Spark driver

2014-08-28 Thread Aniket Bhatnagar
I am building (yet another) job server for Spark using Play! framework and it seems like Play's akka dependency conflicts with Spark's shaded akka dependency. Using SBT, I can force Play to use akka 2.2.3 (unshaded) but I haven't been able to figure out how to exclude com.typesafe.akka

[Stream] Checkpointing | chmod: cannot access `/cygdrive/d/tmp/spark/f8e594bf-d940-41cb-ab0e-0fd3710696cb/rdd-57/.part-00001-attempt-215': No such file or directory

2014-09-01 Thread Aniket Bhatnagar
On my local (windows) dev environment, I have been trying to get spark streaming running to test my real time(ish) jobs. I have set the checkpoint directory as /tmp/spark and have installed latest cygwin. I keep getting the following error: org.apache.hadoop.util.Shell$ExitCodeException: chmod:

Re: [Stream] Checkpointing | chmod: cannot access `/cygdrive/d/tmp/spark/f8e594bf-d940-41cb-ab0e-0fd3710696cb/rdd-57/.part-00001-attempt-215': No such file or directory

2014-09-01 Thread Aniket Bhatnagar
Hi everyone It turns out that I had chef installed and its chmod takes precedence over cygwin's chmod in the PATH. I fixed the environment variable and now it's working fine. On 1 September 2014 11:48, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: On my local (windows) dev

[Streaming] Triggering an action in absence of data

2014-09-01 Thread Aniket Bhatnagar
Hi all I am struggling to implement a use case wherein I need to trigger an action in case no data has been received for X amount of time. I haven't been able to figure out an easy way to do this. No state/foreach methods get called when no data has arrived. I thought of generating a 'tick'
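
One way to generate such a 'tick' stream is a ConstantInputDStream, sketched below under the assumption that ssc is the StreamingContext and the tick drives a staleness check (illustrative, not from the original thread):

    import org.apache.spark.streaming.dstream.ConstantInputDStream

    // Emits the same single-element RDD every batch interval, so downstream
    // foreachRDD logic fires even when no real data has arrived.
    val ticks = new ConstantInputDStream(ssc, ssc.sparkContext.parallelize(Seq(1)))
    ticks.foreachRDD { (rdd, time) =>
      // check the last-seen timestamp of real data here and trigger the action if stale
    }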

Using Spark's ActorSystem for performing analytics using Akka

2014-09-02 Thread Aniket Bhatnagar
Sorry about the noob question, but I was just wondering if we use Spark's ActorSystem (SparkEnv.actorSystem), would it distribute actors across worker nodes or would the actors only run in driver JVM?

Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-11 Thread Aniket Bhatnagar
Hi all I am trying to run a Kinesis spark streaming application on a standalone spark cluster. The job works fine in local mode but when I submit it (using spark-submit), it doesn't do anything. I enabled logs for the org.apache.spark.streaming.kinesis package and I regularly get the following in

Out of memory with Spark Streaming

2014-09-11 Thread Aniket Bhatnagar
I am running a simple Spark Streaming program that pulls in data from Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps data and persists to a store. The program is running in local mode right now and runs out of memory after a while. I am yet to investigate heap dumps

Re: Out of memory with Spark Streaming

2014-09-11 Thread Aniket Bhatnagar
: You could set spark.executor.memory to something bigger than the default (512mb) On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am running a simple Spark Streaming program that pulls in data from Kinesis at a batch interval of 10 seconds, windows
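
For reference, a sketch of setting that property programmatically (illustrative; in local mode it is the driver JVM's memory, set via spark.driver.memory or -Xmx, that actually matters):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kinesis-streaming")
      .set("spark.executor.memory", "2g")   // default is 512m in Spark 1.x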

Re: Spark on Raspberry Pi?

2014-09-11 Thread Aniket Bhatnagar
Just curious... What's the use case you are looking to implement? On Sep 11, 2014 10:50 PM, Daniil Osipov daniil.osi...@shazam.com wrote: Limited memory could also cause you some problems and limit usability. If you're looking for a local testing environment, vagrant boxes may serve you much

Re: Re[2]: HBase 0.96+ with Spark 1.0+

2014-09-11 Thread Aniket Bhatnagar
Dependency hell... My fav problem :). I had run into a similar issue with hbase and jetty. I can't remember the exact fix, but here are excerpts from my dependencies that may be relevant: val hadoop2Common = "org.apache.hadoop" % "hadoop-common" % hadoop2Version excludeAll(

Re: Re[2]: HBase 0.96+ with Spark 1.0+

2014-09-12 Thread Aniket Bhatnagar
appreciated reinis -- -Original Message- From: Aniket Bhatnagar aniket.bhatna...@gmail.com To: sp...@orbit-x.de Cc: user user@spark.apache.org Date: 11-09-2014 20:00 Subject: Re: Re[2]: HBase 0.96+ with Spark 1.0+ Dependency hell... My fav problem :). I had

Re: Out of memory with Spark Streaming

2014-09-12 Thread Aniket Bhatnagar
...@gmail.com wrote: Which version of spark are you running? If you are running the latest one, then could you try running not a window but a simple event count on every 2 second batch, and see if you are still running out of memory? TD On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar

Re: Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-19 Thread Aniket Bhatnagar
Kinesis integration. TD On Thu, Sep 11, 2014 at 4:51 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi all I am trying to run a Kinesis spark streaming application on a standalone spark cluster. The job works fine in local mode but when I submit it (using spark-submit), it doesn't do

Re: Bulk-load to HBase

2014-09-19 Thread Aniket Bhatnagar
foreachPartition() + Put(), but your example code can be used to clean up my code. BTW, since the data uploaded by Put() goes through normal HBase write path, it can be slow. So, it would be nice if bulk-load could be used, since it bypasses the write path. Thanks. *From:* Aniket Bhatnagar

Re: Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-24 Thread Aniket Bhatnagar
on the cluster? Thanks, Aniket On 22 September 2014 18:14, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi all I was finally able to figure out why this streaming job appeared stuck. The reason was that I was running out of workers in my standalone deployment of Spark. There was no feedback

[Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking call

2014-09-24 Thread Aniket Bhatnagar
Hi all Reading through Spark streaming's custom receiver documentation, it is recommended that onStart and onStop methods should not block indefinitely. However, looking at the source code of KinesisReceiver, the onStart method calls worker.run, which blocks until the worker is shut down (via a call to

Spark doesn't retry task while writing to HDFS

2014-10-24 Thread Aniket Bhatnagar
Hi all I have written a job that reads data from HBase and writes to HDFS (fairly simple). While running the job, I noticed that a few of the tasks failed with the following error. Quick googling on the error suggests that it's an unexplained error and is perhaps intermittent. What I am curious to

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Aniket Bhatnagar
Just curious... Why would you not store the processed results in a regular relational database? Not sure what you meant by persist the appropriate RDDs. Did you mean the output of your job will be RDDs? On 24 October 2014 13:35, ankits ankitso...@gmail.com wrote: I want to set up spark SQL to allow

Re: Out of memory with Spark Streaming

2014-10-31 Thread Aniket Bhatnagar
. thanks for posting this, aniket! -chris On Fri, Sep 12, 2014 at 5:34 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi all Sorry but this was totally my mistake. In my persistence logic, I was creating an async http client instance in RDD foreach but was never closing it, leading

Getting spark job progress programmatically

2014-11-18 Thread Aniket Bhatnagar
I am writing yet another Spark job server and have been able to submit jobs and return/save results. I let multiple jobs use the same spark context but I set a job group while firing each job so that I can in future cancel jobs. Further, what I would like to do is provide some kind of status

Is sorting persisted after pair rdd transformations?

2014-11-18 Thread Aniket Bhatnagar
I am trying to figure out if sorting is persisted after applying Pair RDD transformations and I am not able to decisively tell after reading the documentation. For example: val numbers = .. // RDD of numbers val pairedNumbers = numbers.map(number => (number % 100, number)) val sortedPairedNumbers

Re: Getting spark job progress programmatically

2014-11-18 Thread Aniket Bhatnagar
petrella andy.petre...@gmail.com wrote: I started some quick hack for that in the notebook, you can head to: https://github.com/andypetrella/spark-notebook/blob/master/common/src/main/scala/notebook/front/widgets/SparkInfo.scala On Tue Nov 18 2014 at 2:44:48 PM Aniket Bhatnagar aniket.bhatna

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Aniket Bhatnagar
ak...@sigmoidanalytics.com wrote: If something is persisted you can easily see them under the Storage tab in the web ui. Thanks Best Regards On Tue, Nov 18, 2014 at 7:26 PM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am trying to figure out if sorting is persisted after

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Aniket Bhatnagar
/ apache/spark/Aggregator.scala is that your function will always see the items in the order that they are in the input RDD. An RDD partition is always accessed as an iterator, so it will not be read out of order. On Wed, Nov 19, 2014 at 2:28 PM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote

Re: Getting spark job progress programmatically

2014-11-19 Thread Aniket Bhatnagar
to add this stuff in the public API. Any other ideas? On Tue Nov 18 2014 at 4:03:35 PM Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Thanks Andy. This is very useful. This gives me all active stages and their percentage completion but I am unable to tie stages to job group (or specific job

Re: Getting spark job progress programmatically

2014-11-19 Thread Aniket Bhatnagar
., https://github.com/apache/spark/pull/3009 If there is anything that needs to be added, please add it to those issues or PRs. On Wed, Nov 19, 2014 at 7:55 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I have for now submitted a JIRA ticket @ https://issues.apache.org/jira

Re: Having problem with Spark streaming with Kinesis

2014-11-26 Thread Aniket Bhatnagar
What's your cluster size? For streaming to work, it needs shards + 1 executors. On Wed, Nov 26, 2014, 5:53 PM A.K.M. Ashrafuzzaman ashrafuzzaman...@gmail.com wrote: Hi guys, When we are using Kinesis with 1 shard then it works fine. But when we use more than 1 then it falls into an infinite

Re: Having problem with Spark streaming with Kinesis

2014-11-26 Thread Aniket Bhatnagar

Programmatically running spark jobs using yarn-client

2014-12-08 Thread Aniket Bhatnagar
I am trying to create (yet another) spark as a service tool that lets you submit jobs via REST APIs. I think I have nearly gotten it to work barring a few issues. Some of which seem already fixed in 1.2.0 (like SPARK-2889) but I have hit a road block with the following issue. I have created a

Re: Programmatically running spark jobs using yarn-client

2014-12-09 Thread Aniket Bhatnagar
inside target/scala-*/projectname-*.jar) and then use it while submitting. If you are not using spark-submit then you can simply add this jar to spark by sc.addJar("/path/to/target/scala*/projectname*jar") Thanks Best Regards On Mon, Dec 8, 2014 at 7:23 PM, Aniket Bhatnagar aniket.bhatna

Spark 1.1.0 does not spawn more than 6 executors in yarn-client mode and ignores --num-executors

2014-12-10 Thread Aniket Bhatnagar
I am running spark 1.1.0 on AWS EMR and I am running a batch job that seems to be highly parallelizable in yarn-client mode. But spark stops spawning any more executors after spawning 6 executors even though the YARN cluster has 15 healthy m1.large nodes. I even tried providing '--num-executors

Re: Having problem with Spark streaming with Kinesis

2014-12-14 Thread Aniket Bhatnagar
The reason is because of the following code: val numStreams = numShards val kinesisStreams = (0 until numStreams).map { i => KinesisUtils.createStream(ssc, streamName, endpointUrl, kinesisCheckpointInterval, InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2) } In the above

Re: Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-15 Thread Aniket Bhatnagar
Try the workaround (addClassPathJars(sparkContext, this.getClass.getClassLoader)) discussed in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAJOb8buD1B6tUtOfG8_Ok7F95C3=r-zqgffoqsqbjdxd427...@mail.gmail.com%3E Thanks, Aniket On Mon Dec 15 2014 at 07:43:24 Tomoya Igarashi

Re: Serialization issue when using HBase with Spark

2014-12-15 Thread Aniket Bhatnagar
The reason not using sc.newAPIHadoopRDD is it only supports one scan each time. I am not sure if that's true. You can use multiple scans as following: val scanStrings = scans.map(scan => convertScanToString(scan)) conf.setStrings(MultiTableInputFormat.SCANS, scanStrings : _*) where
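
A fuller sketch of the multi-scan approach described above (assumptions: the HBase 0.96+ client is on the classpath, each Scan carries its table name via Scan.SCAN_ATTRIBUTES_TABLE_NAME, and convertScanToString delegates to TableMapReduceUtil):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.{MultiTableInputFormat, TableMapReduceUtil}

    val conf = HBaseConfiguration.create()
    // Serialize every Scan and hand the whole list to MultiTableInputFormat.
    val scanStrings = scans.map(scan => TableMapReduceUtil.convertScanToString(scan))
    conf.setStrings(MultiTableInputFormat.SCANS, scanStrings: _*)

    val hbaseRdd = sc.newAPIHadoopRDD(conf,
      classOf[MultiTableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])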

Re: Spark with HBase

2014-12-15 Thread Aniket Bhatnagar
In case you are still looking for help, there has been multiple discussions in this mailing list that you can try searching for. Or you can simply use https://github.com/unicredit/hbase-rdd :-) Thanks, Aniket On Wed Dec 03 2014 at 16:11:47 Ted Yu yuzhih...@gmail.com wrote: Which hbase release

Re: Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-15 Thread Aniket Bhatnagar
/TomoyaIgarashi/9688bdd5663af95ddd4d Is there any problem? 2014-12-15 18:48 GMT+09:00 Aniket Bhatnagar aniket.bhatna...@gmail.com: Try the workaround (addClassPathJars(sparkContext, this.getClass.getClassLoader)) discussed in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox

Re: Streaming | Partition count mismatch exception while saving data in RDD

2014-12-16 Thread Aniket Bhatnagar
It turns out that this happens when the checkpoint directory is set to a local path. I have opened a JIRA SPARK-4862 for Spark streaming to output a better error message. Thanks, Aniket On Tue Dec 16 2014 at 20:08:13 Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am using spark 1.1.0 running

Re: Spark 1.1.0 does not spawn more than 6 executors in yarn-client mode and ignores --num-executors

2014-12-16 Thread Aniket Bhatnagar
Hi guys I am hoping someone might have a clue on why this is happening. Otherwise I will have to delve into the YARN module's source code to better understand the issue. On Wed, Dec 10, 2014, 11:54 PM Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am running spark 1.1.0 on AWS EMR and I am

Re: Are lazy values created once per node or once per partition?

2014-12-17 Thread Aniket Bhatnagar
I would think that it has to be per worker. On Wed, Dec 17, 2014, 6:32 PM Ashic Mahtab as...@live.com wrote: Hello, Say, I have the following code: let something = Something() someRdd.foreachRdd(something.someMethod) And in something, I have a lazy member variable that gets created in

Spark 1.2.0 Yarn not published

2014-12-28 Thread Aniket Bhatnagar
Hi all I just realized that spark-yarn artifact hasn't been published for 1.2.0 release. Any particular reason for that? I was using it in my yet another spark-job-server project to submit jobs to a YARN cluster through convenient REST APIs (with some success). The job server was creating

Re: Spark 1.2.0 Yarn not published

2014-12-29 Thread Aniket Bhatnagar
/JW1q5vd61V1/Spark-yarn+1.2.0subj=Re+spark+yarn_2+10+1+2+0+artifacts Cheers On Dec 28, 2014, at 11:13 PM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi all I just realized that spark-yarn artifact hasn't been published for 1.2.0 release. Any particular reason for that? I was using

Re: action progress in ipython notebook?

2014-12-29 Thread Aniket Bhatnagar
Hi Josh Is there documentation available for status API? I would like to use it. Thanks, Aniket On Sun Dec 28 2014 at 02:37:32 Josh Rosen rosenvi...@gmail.com wrote: The console progress bars are implemented on top of a new stable status API that was added in Spark 1.2. It's possible to
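
A sketch of the Spark 1.2 status API being discussed, tying jobs to a group via setJobGroup and then reading progress through SparkStatusTracker (illustrative; the group id is an assumed name):

    sc.setJobGroup("my-job-group", "description", interruptOnCancel = true)
    // ... submit actions for this job ...
    val tracker = sc.statusTracker
    for {
      jobId     <- tracker.getJobIdsForGroup("my-job-group")
      jobInfo   <- tracker.getJobInfo(jobId)
      stageId   <- jobInfo.stageIds()
      stageInfo <- tracker.getStageInfo(stageId)
    } println(s"stage ${stageInfo.name()}: ${stageInfo.numCompletedTasks()}/${stageInfo.numTasks()} tasks done")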

Re: action progress in ipython notebook?

2014-12-29 Thread Aniket Bhatnagar
little to no risk to introduce new bugs elsewhere in Spark. On Mon, Dec 29, 2014 at 3:08 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi Josh Is there documentation available for status API? I would like to use it. Thanks, Aniket On Sun Dec 28 2014 at 02:37:32 Josh Rosen

Re: Host Error on EC2 while accessing hdfs from stadalone

2014-12-30 Thread Aniket Bhatnagar
Did you check firewall rules in security groups? On Tue, Dec 30, 2014, 9:34 PM Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote: Hi, I am using spark standalone on EC2. I can access ephemeral hdfs from spark-shell interface but I can't access hdfs in standalone application. I am using spark

sparkContext.textFile does not honour the minPartitions argument

2015-01-01 Thread Aniket Bhatnagar
I am trying to read a file into a single partition but it seems like sparkContext.textFile ignores the passed minPartitions value. I know I can repartition the RDD but I was curious to know if this is expected or if this is a bug that needs to be further investigated?

Re: sparkContext.textFile does not honour the minPartitions argument

2015-01-02 Thread Aniket Bhatnagar
number of partitions value is to increase number of partitions not reduce it from default value. On Thu, Jan 1, 2015 at 10:43 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am trying to read a file into a single partition but it seems like sparkContext.textFile ignores the passed
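
A sketch of the workaround implied above, i.e. collapsing to a single partition after the read (path is illustrative):

    // minPartitions is only a lower-bound hint to the underlying InputFormat;
    // to actually end up with one partition, coalesce after reading.
    val singlePartitionRdd = sc.textFile("hdfs:///path/to/file").coalesce(1)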

Re: Programmatic Spark 1.2.0 on EMR | S3 filesystem is not working when using

2015-02-02 Thread Aniket Bhatnagar
On Fri Jan 30 2015 at 23:29:25 Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Right. Which makes me believe that the directory is perhaps configured somewhere and I have missed configuring the same. The process that is submitting jobs (basically becomes the driver) is running in sudo mode

Re: How to output to S3 and keep the order

2015-01-19 Thread Aniket Bhatnagar
When you repartition, ordering can get lost. You would need to sort after repartitioning. Aniket On Tue, Jan 20, 2015, 7:08 AM anny9699 anny9...@gmail.com wrote: Hi, I am using Spark on AWS and want to write the output to S3. It is a relatively small file and I don't want them to output as
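
A minimal sketch of that suggestion, assuming rdd holds ordered records such as an RDD[String] (bucket and path are illustrative):

    // sortBy shuffles into the requested number of partitions and sorts within
    // the resulting range partitions, so the output file comes out ordered.
    rdd.sortBy(record => record, ascending = true, numPartitions = 1)
       .saveAsTextFile("s3n://my-bucket/output")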

Re: kinesis multiple records adding into stream

2015-01-16 Thread Aniket Bhatnagar
Sorry. I couldn't understand the issue. Are you trying to send data to kinesis from a spark batch/real time job? - Aniket On Fri, Jan 16, 2015, 9:40 PM Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi Experts! I am using kinesis dependency as follow groupId = org.apache.spark artifactId =

Re: kinesis creating stream scala code exception

2015-01-15 Thread Aniket Bhatnagar
Are you using spark in standalone mode or yarn or mesos? If it's yarn, please mention the hadoop distribution and version. What spark distribution are you using (it seems 1.2.0 but compiled with which hadoop version)? Thanks, Aniket On Thu, Jan 15, 2015, 4:59 PM Hafiz Mujadid

Re: ClosureCleaner should use ClassLoader created by SparkContext

2015-01-21 Thread Aniket Bhatnagar
(ThreadPoolExecutor.java:615) [na:1.7.0_71] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71] On Wed Jan 21 2015 at 17:34:34 Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: While implementing a spark server, I realized that Thread's context loader must be set to any dynamically loaded

ClosureCleaner should use ClassLoader created by SparkContext

2015-01-21 Thread Aniket Bhatnagar
While implementing a spark server, I realized that the Thread's context loader must be set to any dynamically loaded classloader so that ClosureCleaner can do its thing. Should the ClosureCleaner not use the classloader created by SparkContext (that has all dynamically added jars via SparkContext.addJar)

Re: Inserting an element in RDD[String]

2015-01-15 Thread Aniket Bhatnagar
Sure there is. Create a new RDD just containing the schema line (hint: use sc.parallelize) and then union both the RDDs (the header RDD and data RDD) to get a final desired RDD. On Thu Jan 15 2015 at 19:48:52 Hafiz Mujadid hafizmujadi...@gmail.com wrote: hi experts! I have an RDD[String] and i
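
A minimal sketch of the hint above (schemaLine and dataRdd are illustrative names):

    val headerRdd = sc.parallelize(Seq(schemaLine))   // single-line RDD holding the header
    val withHeader = headerRdd.union(dataRdd)          // header partition comes before the data partitions
    // Note: writing everything to one ordered file still needs a coalesce(1) or similar.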

Re: saving rdd to multiple files named by the key

2015-01-26 Thread Aniket Bhatnagar
This might be helpful: http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job On Tue Jan 27 2015 at 07:45:18 Sharon Rapoport sha...@plaid.com wrote: Hi, I have an rdd of [k,v] pairs. I want to save each [v] to a file named [k]. I got them by
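
The approach in the linked answer, roughly sketched using the old Hadoop API's MultipleTextOutputFormat (illustrative; pairRdd is an assumed RDD of [k,v] pairs):

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Drop the key from the file contents and use it as the output file name.
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.toString
    }

    pairRdd.saveAsHadoopFile("/output/path", classOf[Any], classOf[Any], classOf[RDDMultipleTextOutputFormat])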

Re: Programmatic Spark 1.2.0 on EMR | S3 filesystem is not working when using

2015-01-30 Thread Aniket Bhatnagar
for it. -Sven On Fri, Jan 30, 2015 at 6:44 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am programmatically submitting spark jobs in yarn-client mode on EMR. Whenever a job tries to save a file to s3, it gives the below mentioned exception. I think the issue might be what EMR

Re: Issue with Parquet on Spark 1.2 and Amazon EMR

2015-01-12 Thread Aniket Bhatnagar
the Spark assembly first in the classpath fixes the issue. I expect that the version of Parquet that's being included in the EMR libs just needs to be upgraded. ~ Jonathan Kelly From: Aniket Bhatnagar aniket.bhatna...@gmail.com Date: Sunday, January 4, 2015 at 10:51 PM To: Adam Gilmore

spark-network-yarn 2.11 depends on spark-network-shuffle 2.10

2015-01-07 Thread Aniket Bhatnagar
It seems that spark-network-yarn compiled for scala 2.11 depends on spark-network-shuffle compiled for scala 2.10. This causes cross-version dependency conflicts in sbt. Seems like a publishing error? http://www.uploady.com/#!/download/6Yn95UZA0DR/3taAJFjCJjrsSXOR

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Aniket Bhatnagar
This is tricky to debug. Check the logs of the node and resource manager of YARN to see if you can trace the error. In the past I have had to look closely at the arguments getting passed to the YARN container (they get logged before attempting to launch containers). If I still didn't get a clue, I had to check the

Re: A spark newbie question

2015-01-04 Thread Aniket Bhatnagar
Go through spark API documentation. Basically you have to do group by (date, message_type) and then do a count. On Sun, Jan 4, 2015, 9:58 PM Dinesh Vallabhdas dines...@yahoo.com.invalid wrote: A spark cassandra newbie question. Thanks in advance for the help. I have a cassandra table with 2
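
A minimal sketch of that group-by-and-count (field names are illustrative):

    // rows: RDD of records with a date and a message_type column
    val countsByDateAndType = rows
      .map(row => ((row.date, row.messageType), 1L))
      .reduceByKey(_ + _)   // count per (date, message_type) pair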

Re: spark-network-yarn 2.11 depends on spark-network-shuffle 2.10

2015-01-08 Thread Aniket Bhatnagar
libraries are java-only (the scala version appended there is just for helping the build scripts). But it does look weird, so it would be nice to fix it. On Wed, Jan 7, 2015 at 12:25 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: It seems that spark-network-yarn compiled for scala 2.11

Re: Issue with Parquet on Spark 1.2 and Amazon EMR

2015-01-04 Thread Aniket Bhatnagar
Can you confirm your emr version? Could it be because of the classpath entries for emrfs? You might face issues with using S3 without them. Thanks, Aniket On Mon, Jan 5, 2015, 11:16 AM Adam Gilmore dragoncu...@gmail.com wrote: Just an update on this - I found that the script by Amazon was the

Spark 1.2.0 | Spark job fails with MetadataFetchFailedException

2015-03-19 Thread Aniket Bhatnagar
I have a job that sorts data and runs a combineByKey operation and it sometimes fails with the following error. The job is running on spark 1.2.0 cluster with yarn-client deployment mode. Any clues on how to debug the error? org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output

OOM in SizeEstimator while using combineByKey

2015-04-15 Thread Aniket Bhatnagar
I am aggregating a dataset using the combineByKey method and for a certain input size, the job fails with the following error. I have enabled heap dumps to better analyze the issue and will report back if I have any findings. Meanwhile, if you guys have any idea of what could possibly result in this

Re: OOM in SizeEstimator while using combineByKey

2015-04-15 Thread Aniket Bhatnagar
the hashCode and equals method correctly. On Wednesday, April 15, 2015 at 3:10 PM, Aniket Bhatnagar wrote: I am aggregating a dataset using the combineByKey method and for a certain input size, the job fails with the following error. I have enabled heap dumps to better analyze the issue

Re: [Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking call

2015-05-23 Thread Aniket Bhatnagar
and test it out? On Thu, May 21, 2015 at 1:43 AM, Tathagata Das t...@databricks.com wrote: Thanks for the JIRA. I will look into this issue. TD On Thu, May 21, 2015 at 1:31 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I ran into one of the issues that are potentially caused because

Re: How to close connection in mapPartitions?

2015-10-23 Thread Aniket Bhatnagar
Are you sure RedisClientPool is being initialized properly in the constructor of RedisCache? Can you please copy paste the code that you use to initialize RedisClientPool inside the constructor of RedisCache? Thanks, Aniket On Fri, Oct 23, 2015 at 11:47 AM Bin Wang wrote: >

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Aniket Bhatnagar
Can you try storing the state (word count) in an external key value store? On Sat, Nov 7, 2015, 8:40 AM Thúy Hằng Lê wrote: > Hi all, > > Anyone could help me on this. It's a bit urgent for me on this. > I'm very confused and curious about Spark data checkpoint

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Aniket Bhatnagar
} > } > > Without using updateStateByKey, I only have the stats of the last > micro-batch. > > Any advice on this? > > > 2015-11-07 11:35 GMT+07:00 Aniket Bhatnagar <aniket.bhatna...@gmail.com>: > >> Can you try storing the state (word count) in an exter

Re: Example of updateStateByKey with initial RDD?

2015-10-08 Thread Aniket Bhatnagar
rs to be the same. I was hoping an example might > shed some light on the issue. > > Regards, > > Bryan Jeffrey > > > > > > > > On Thu, Oct 8, 2015 at 7:04 AM, Aniket Bhatnagar < > aniket.bhatna...@gmail.com> wrote: > >> Here is an example:

Re: Launching EC2 instances with Spark compiled for Scala 2.11

2015-10-08 Thread Aniket Bhatnagar
Is it possible for you to use EMR instead of EC2? If so, you may be able to tweak EMR bootstrap scripts to install your custom spark build. Thanks, Aniket On Thu, Oct 8, 2015 at 5:58 PM Theodore Vasiloudis < theodoros.vasilou...@gmail.com> wrote: > Hello, > > I was wondering if there is an easy

Re: HBase Spark Streaming giving error after restore

2015-10-17 Thread Aniket Bhatnagar
Can you try changing classOf[OutputFormat[String, BoxedUnit]] to classOf[OutputFormat[String, Put]] while configuring hconf? On Sat, Oct 17, 2015, 11:44 AM Amit Hora wrote: > Hi, > > Regrets for the delayed response > please find below the full stack trace > >

Re: java.lang.NoSuchMethodError and yarn-client mode

2015-09-09 Thread Aniket Bhatnagar
Hi Tom There has to be a difference in classpaths in yarn-client and yarn-cluster mode. Perhaps a good starting point would be to print classpath as a first thing in SimpleApp.main. It should give clues around why it works in yarn-cluster mode. Thanks, Aniket On Wed, Sep 9, 2015, 2:11 PM Tom

Re: Checkpointing with Kinesis

2015-09-17 Thread Aniket Bhatnagar
You can perhaps set up a WAL that logs to S3? The new cluster should pick up the records that weren't processed due to the previous cluster's termination. Thanks, Aniket On Thu, Sep 17, 2015, 9:19 PM Alan Dipert wrote: > Hello, > We are using Spark Streaming 1.4.1 in AWS EMR to process records
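
A sketch of enabling the receiver write-ahead log with checkpoints on S3 (configuration key as documented for Spark 1.2+; app name and paths are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("kinesis-app")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Received blocks are logged under the checkpoint directory, so a new cluster
    // can replay records that were received but not yet processed.
    ssc.checkpoint("s3n://my-bucket/streaming-checkpoints/")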

Re: spark-submit classloader issue...

2015-09-28 Thread Aniket Bhatnagar
Hi Rachna Can you just use the http client provided via spark's transitive dependencies instead of excluding them? The reason user classpath first is failing could be because you have spark artifacts in your assembly jar that don't match with what is deployed (version mismatch or you built the version

Re: word count (group by users) in spark

2015-09-21 Thread Aniket Bhatnagar
reduceByKey(_ + _) > > val output = wordCounts. >map({case ((user, word), count) => (user, (word, count))}). >groupByKey() > > But Aniket, if we group by user first, it could run out of memory when > spark tries to put all words in a single sequence, couldn't it? >

Re: Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submi

2015-09-18 Thread Aniket Bhatnagar
way to do so? > > best, > /Shahab > > On Fri, Sep 18, 2015 at 12:54 PM, Aniket Bhatnagar < > aniket.bhatna...@gmail.com> wrote: > >> Can you try yarn-client mode? >> >> On Fri, Sep 18, 2015, 3:38 PM shahab <shahab.mok...@gmail.com> wrote: >> >>

Re: Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submi

2015-09-18 Thread Aniket Bhatnagar
Can you try yarn-client mode? On Fri, Sep 18, 2015, 3:38 PM shahab wrote: > Hi, > > Probably I have wrong zeppelin configuration, because I get the following > error when I execute spark statements in Zeppelin: > > org.apache.spark.SparkException: Detected yarn-cluster

Re: question building spark in a virtual machine

2015-09-19 Thread Aniket Bhatnagar
Hi Eyal Can you check if your Ubuntu VM has enough RAM allocated to run a JVM of size 3gb? Thanks, Aniket On Sat, Sep 19, 2015, 9:09 PM Eyal Altshuler wrote: > Hi, > > I had configured the MAVEN_OPTS environment variable the same as you wrote. > My java version is

Re: word count (group by users) in spark

2015-09-19 Thread Aniket Bhatnagar
Using scala API, you can first group by user and then use combineByKey. Thanks, Aniket On Sat, Sep 19, 2015, 6:41 PM kali.tumm...@gmail.com wrote: > Hi All, > I would like to achieve this below output using spark , I managed to write > in Hive and call it in spark but
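
A sketch of that per-user word count (names are illustrative; lines is assumed to be an RDD of (user, text) pairs):

    val wordCountsPerUser = lines
      .flatMap { case (user, text) => text.split("\\s+").map(word => ((user, word), 1L)) }
      .reduceByKey(_ + _)                                  // count per (user, word)
      .map { case ((user, word), count) => (user, (word, count)) }
      .groupByKey()                                        // all (word, count) pairs per user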

Re: Help me! Spark WebUI is corrupted!

2015-12-31 Thread Aniket Bhatnagar
Are you running on YARN or standalone? On Thu, Dec 31, 2015, 3:35 PM LinChen wrote: > *Screenshot1(Normal WebUI)* > > > > *Screenshot2(Corrupted WebUI)* > > > > As screenshot2 shows, the format of my Spark WebUI looks strange and I > cannot click the description of active

Re: Execute function once on each node

2016-07-18 Thread Aniket Bhatnagar
You can't assume that the number of nodes will be constant as some may fail, hence you can't guarantee that a function will execute at most once or at least once on a node. Can you explain your use case in a bit more detail? On Mon, Jul 18, 2016, 10:57 PM joshuata wrote: >

Re: Execute function once on each node

2016-07-19 Thread Aniket Bhatnagar
ould allow us to read the data in situ without a copy. > > I understand that manually assigning tasks to nodes reduces fault > tolerance, but the simulation codes already explicitly assign tasks, so a > failure of any one node is already a full-job failure. > > On Mon, J

Re: Very long pause/hang at end of execution

2016-11-06 Thread Aniket Bhatnagar
In order to know what's going on, you can study the thread dumps either from spark UI or from any other thread dump analysis tool. Thanks, Aniket On Sun, Nov 6, 2016 at 1:31 PM Michael Johnson wrote: > I'm doing some processing and then clustering of a small

Improvement proposal | Dynamic disk allocation

2016-11-06 Thread Aniket Bhatnagar
Hello Dynamic allocation feature allows you to add executors and scale computation power. This is great, however, I feel like we also need a way to dynamically scale storage. Currently, if the disk is not able to hold the spilled/shuffle data, the job is aborted causing frustration and loss of

Re: Very long pause/hang at end of execution

2016-11-06 Thread Aniket Bhatnagar
On Sunday, November 6, 2016 8:36 AM, Aniket Bhatnagar < > aniket.bhatna...@gmail.com> wrote: > > > In order to know what's going on, you can study the thread dumps either > from spark UI or from any other thread dump analysis tool. > > Thanks, > Aniket

Re: Improvement proposal | Dynamic disk allocation

2016-11-06 Thread Aniket Bhatnagar
If people agree that is desired, I am willing to submit a SIP for this and find time to work on it. Thanks, Aniket On Sun, Nov 6, 2016 at 1:06 PM Aniket Bhatnagar <aniket.bhatna...@gmail.com> wrote: > Hello > > Dynamic allocation feature allows you to add executors and scale >

Re: Dataset API | Setting number of partitions during join/groupBy

2016-11-11 Thread Aniket Bhatnagar
> fails because the number of partitions is less? Or you want to do it for a > perf gain? > > > > Also, what were your initial Dataset partitions and how many did you have > for the result of join? > > > > *From:* Aniket Bhatnagar [mailto:aniket.bhatna...@gmail.com] > *Sent

Re: OS killing Executor due to high (possibly off heap) memory usage

2016-11-25 Thread Aniket Bhatnagar
f data to be on one executor node. Sent from my Windows 10 phone *From: *Rodrick Brown <rodr...@orchardplatform.com> *Sent: *Friday, November 25, 2016 12:25 AM *To: *Aniket Bhatnagar <aniket.bhatna...@gmail.com> *Cc: *user <user@spark.apache.org> *Subject: *Re: OS killing Executor due to

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread Aniket Bhatnagar
Try changing compression to bzip2 or lzo. For reference - http://comphadoop.weebly.com Thanks, Aniket On Mon, Nov 21, 2016, 10:18 PM yeshwanth kumar wrote: > Hi, > > we are running Hive on Spark, we have an external table over snappy > compressed csv file of size 917.4 M

OS killing Executor due to high (possibly off heap) memory usage

2016-11-24 Thread Aniket Bhatnagar
Hi Spark users I am running a job that does join of a huge dataset (7 TB+) and the executors keep crashing randomly, eventually causing the job to crash. There are no out of memory exceptions in the log and looking at the dmesg output, it seems like the OS killed the JVM because of high memory

Re: Very long pause/hang at end of execution

2016-11-16 Thread Aniket Bhatnagar
> happens again I will try grabbing thread dumps and I will see if I can > figure out what is going on. > > > On Sunday, November 6, 2016 10:02 AM, Aniket Bhatnagar < > aniket.bhatna...@gmail.com> wrote: > > > I doubt it's GC as you mentioned that the pause is several mi

Re: Very long pause/hang at end of execution

2016-11-16 Thread Aniket Bhatnagar
Also, how are you launching the application? Through spark submit or creating the spark context in your app? Thanks, Aniket On Wed, Nov 16, 2016 at 10:44 AM Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > Thanks for sharing the thread dump. I had a look at them and couldn't find &
