Re: Master options Cluster/Client discrepancies.

2016-03-30 Thread Akhil Das
Have a look at http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 Thanks Best Regards On Wed, Mar 30, 2016 at 12:09 AM, satyajit vegesna < satyajit.apas...@gmail.com> wrote: > > Hi All, > > I have written a spark program on my dev box , >IDE:Intellij >

Re: aggregateByKey on PairRDD

2016-03-30 Thread Akhil Das
Isn't that what tempRDD.groupByKey does? Thanks Best Regards On Wed, Mar 30, 2016 at 7:36 AM, Suniti Singh wrote: > Hi All, > > I have an RDD having the data in the following form : > > tempRDD: RDD[(String, (String, String))] > > (brand , (product, key)) > >

Re: Null pointer exception when using com.databricks.spark.csv

2016-03-30 Thread Akhil Das
Looks like winutils.exe is missing from the environment. See https://issues.apache.org/jira/browse/SPARK-2356 Thanks Best Regards On Wed, Mar 30, 2016 at 10:44 AM, Selvam Raman wrote: > Hi, > > i am using spark 1.6.0 prebuilt hadoop 2.6.0 version in my windows machine. >
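A common workaround for SPARK-2356 on Windows is to place winutils.exe under a Hadoop home directory and point HADOOP_HOME at it before launching Spark — the path below is hypothetical, substitute wherever you unpacked it:

```
:: Windows cmd session — assumes winutils.exe sits at C:\hadoop\bin\winutils.exe
set HADOOP_HOME=C:\hadoop
set PATH=%PATH%;%HADOOP_HOME%\bin
```

The same variables can be set permanently through the System Properties environment-variable dialog.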

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Akhil Das
You can achieve this with the normal RDD way. Have one extra stage in the pipeline where you will properly standardize all the values (like replacing doc with doctor) for all the columns before the join. Thanks Best Regards On Tue, Mar 15, 2016 at 9:16 AM, Suniti Singh

Re: How do we run that PR auto-close script again?

2016-02-22 Thread Akhil Das
This? http://apache-spark-developers-list.1001551.n3.nabble.com/Automated-close-of-PR-s-td15862.html Thanks Best Regards On Mon, Feb 22, 2016 at 2:47 PM, Sean Owen wrote: > I know Patrick told us at some point, but I can't find the email or > wiki that describes how to run

Re: Using distinct count in over clause

2016-01-27 Thread Akhil Das
Does it support over? I couldn't find it in the documentation http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features Thanks Best Regards On Fri, Jan 22, 2016 at 2:31 PM, 汪洋 wrote: > I think it cannot be right. > > 在 2016年1月22日,下午4:53,汪洋

Re: Generate Amplab queries set

2016-01-27 Thread Akhil Das
Have a look at the TPC-H queries; I found this repository with the queries: https://github.com/ssavvides/tpch-spark Thanks Best Regards On Fri, Jan 22, 2016 at 1:35 AM, sara mustafa wrote: > Hi, > I have downloaded the Amplab benchmark dataset from >

Re: security testing on spark ?

2015-12-18 Thread Akhil Das
If the port 7077 is open for public on your cluster, that's all you need to take over the cluster. You can read a bit about it here https://www.sigmoid.com/securing-apache-spark-cluster/ You can also look at this small exploit I wrote https://www.exploit-db.com/exploits/36562/ Thanks Best

Re: Spark basicOperators

2015-12-18 Thread Akhil Das
You can pretty much measure it from the event timeline listed in the driver UI. You can click on jobs/tasks and get the time each of them took from there. Thanks Best Regards On Thu, Dec 17, 2015 at 7:27 AM, sara mustafa wrote: > Hi, > > The class

Re: Multiplication on decimals in a dataframe query

2015-12-02 Thread Akhil Das
Not quite sure what's happening, but it's not an issue with multiplication, I guess, as the following query worked for me: trades.select(trades("price")*9.5).show +-+ |(price * 9.5)| +-+ |199.5| |228.0| |190.0| |199.5| |190.0| |

Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Akhil Das
Is that all you have in the executor logs? I suspect some of those jobs are having a hard time managing the memory. Thanks Best Regards On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman wrote: > [adding dev list since it's probably a bug, but i'm not sure how to > reproduce so I

Re: Guidance to get started

2015-11-09 Thread Akhil Das
You can read the installation details from here http://spark.apache.org/docs/latest/ You can read about contributing to spark from here https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Thanks Best Regards On Thu, Oct 29, 2015 at 3:53 PM, Aaska Shah

Re: sample or takeSample or ??

2015-11-09 Thread Akhil Das
You can't create a new RDD by selecting a few elements. rdd.take(n), takeSample, etc. are actions, and they will trigger your entire pipeline to be executed. You could, however, do something like this: val sample_data = rdd.take(10) val sample_rdd = sc.parallelize(sample_data) Thanks Best

Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Akhil Das
ntsman*, *Big Data Engineer* > http://www.totango.com > > On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> Is that all you have in the executor logs? I suspect some of those jobs >> are having a hard time managing the memory. >

Re: Unable to run applications on spark in standalone cluster mode

2015-11-01 Thread Akhil Das
Can you paste the contents of your spark-env.sh file? It would also be good to have a look at the /etc/hosts file. The "cannot bind to the given IP address" error can be resolved by putting the hostname instead of the IP address. Also make sure the configuration (conf directory) across your cluster has the same

Re: Guaranteed processing orders of each batch in Spark Streaming

2015-10-22 Thread Akhil Das
I guess the order is guaranteed unless you set the spark.streaming.concurrentJobs to a higher number than 1. Thanks Best Regards On Mon, Oct 19, 2015 at 12:28 PM, Renjie Liu wrote: > Hi, all: > I've read source code and it seems that there is no guarantee that the >

Re: Too many executors are created

2015-10-11 Thread Akhil Das
For some reason the executors are getting killed: 15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor updated: app-20150929120924-/24463 is now EXITED (Command exited with code 1) Can you paste your spark-submit command? You can also look in the executor logs and see what's going on.

Re: using JavaRDD in spark-redis connector

2015-09-30 Thread Akhil Das
You can create a JavaRDD as normal and then call .rdd() on it to get the underlying RDD. Thanks Best Regards On Mon, Sep 28, 2015 at 9:01 PM, Rohith P wrote: > Hi all, > I am trying to work with spark-redis connector (redislabs) which > requires all transactions between

Re: unsubscribe

2015-09-25 Thread Akhil Das
Send an email to dev-unsubscr...@spark.apache.org instead of dev@spark.apache.org Thanks Best Regards On Fri, Sep 25, 2015 at 4:00 PM, Nirmal R Kumar wrote: >

Re: Spark Streaming..Exception

2015-09-14 Thread Akhil Das
You should consider upgrading your spark from 1.3.0 to a higher version. Thanks Best Regards On Mon, Sep 14, 2015 at 2:28 PM, Priya Ch wrote: > Hi All, > > I came across the related old conversation on the above issue ( >

Re: Detecting configuration problems

2015-09-08 Thread Akhil Das
I found an old JIRA referring the same. https://issues.apache.org/jira/browse/SPARK-5421 Thanks Best Regards On Sun, Sep 6, 2015 at 8:53 PM, Madhu wrote: > I'm not sure if this has been discussed already, if so, please point me to > the thread and/or related JIRA. > > I have

Re: OOM in spark driver

2015-09-04 Thread Akhil Das
Or you can increase the driver heap space (export _JAVA_OPTIONS="-Xmx5g") Thanks Best Regards On Wed, Sep 2, 2015 at 11:57 PM, Mike Hynes <91m...@gmail.com> wrote: > Just a thought; this has worked for me before on standalone client > with a similar OOM error in a driver thread. Try setting: >
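As a quick illustration (the 5 GB figure is arbitrary — size it to your driver's actual needs), the environment-variable approach looks like this:

```shell
# _JAVA_OPTIONS is picked up by every JVM launched from this shell,
# including the Spark driver; -Xmx sets the maximum heap size.
export _JAVA_OPTIONS="-Xmx5g"
echo "$_JAVA_OPTIONS"
```

Note that spark.driver.memory (or --driver-memory on spark-submit) is the more targeted way to size the driver heap, since _JAVA_OPTIONS affects every JVM started from that shell.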

Re: IOError on createDataFrame

2015-08-31 Thread Akhil Das
Why not attach a bigger hard disk to the machines and point your SPARK_LOCAL_DIRS to it? Thanks Best Regards On Sat, Aug 29, 2015 at 1:13 AM, fsacerdoti wrote: > Hello, > > Similar to the thread below [1], when I tried to create an RDD from a 4GB > pandas dataframe

Re: Introduce a sbt plugin to deploy and submit jobs to a spark cluster on ec2

2015-08-25 Thread Akhil Das
You could add it to Spark Packages, I guess: http://spark-packages.org/ Thanks Best Regards On Fri, Aug 14, 2015 at 1:45 PM, pishen tsai pishe...@gmail.com wrote: Sorry for previous line-breaking format, try to resend the mail again. I have written a sbt plugin called spark-deployer, which

Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-14 Thread Akhil Das
PM, Imran Rashid iras...@cloudera.com wrote: oh I see, you are defining your own RDD Partition types, and you had a bug where partition.index did not line up with the partitions slot in rdd.getPartitions. Is that correct? On Thu, Aug 13, 2015 at 2:40 AM, Akhil Das ak...@sigmoidanalytics.com

Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-14 Thread Akhil Das
is to show overlapping partitions, duplicates. index to partition mismatch - that sort of thing. On Thu, Aug 13, 2015 at 11:42 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Yep, and it works fine for operations which does not involve any shuffle (like foreach,, count etc) and those which

Re: Switch from Sort based to Hash based shuffle

2015-08-13 Thread Akhil Das
Have a look at spark.shuffle.manager; you can switch between sort and hash with this configuration. From the docs: spark.shuffle.manager (default: sort) — implementation to use for shuffling data. There are two implementations available: sort and hash. Sort-based shuffle is more memory-efficient and is the default option
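Concretely, the switch is a one-line configuration change — shown here as a spark-defaults.conf fragment, though it can equally be passed with --conf on spark-submit:

```
# spark-defaults.conf — switch from the default sort-based shuffle to hash-based
spark.shuffle.manager   hash
```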

Re: Pushing Spark to 10Gb/s

2015-08-11 Thread Akhil Das
Hi Starch, It also depends on the application's behavior; some might not be able to utilize the network properly. If you are using, say, Kafka, then one thing you should keep in mind is the size of the individual messages and the number of partitions that you are having. The higher the message

Re: Inquery about contributing codes

2015-08-11 Thread Akhil Das
You can create a new JIRA issue and send a pull request for it, I think. + dev list Thanks Best Regards On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon gurwls...@gmail.com wrote: Dear Sir / Madam, I have a plan to contribute some codes about passing filters to a datasource as physical

Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-11 Thread Akhil Das
Hi My Spark job (running in local[*] with Spark 1.4.1) reads data from a Thrift server (I created an RDD; it computes the partitions in the getPartitions() call, and in compute(), hasNext returns records from these partitions). count() and foreach() are working fine and return the correct number of

Re: How to help for 1.5 release?

2015-08-04 Thread Akhil Das
I think you can start from here https://issues.apache.org/jira/browse/SPARK/fixforversion/12332078/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel Thanks Best Regards On Tue, Aug 4, 2015 at 12:02 PM, Meihua Wu rotationsymmetr...@gmail.com wrote: I think the team is

Re: ReceiverStream SPARK not able to cope up with 20,000 events /sec .

2015-07-28 Thread Akhil Das
You need to find the bottleneck here; it could be your network (if the data is huge) or your producer code not pushing at 20k/s. If you are able to produce at 20k/s, then make sure you are able to receive at that rate (try it without Spark). Thanks Best Regards On Sat, Jul 25, 2015 at 3:29 PM,

Re: RestSubmissionClient Basic Auth

2015-07-16 Thread Akhil Das
likely would it be that a change like that goes thru? Would it be rejected as an uncommon scenario? I really don't want to have this as a separate form of the branch. Thanks, Joel -- *From:* Akhil Das ak...@sigmoidanalytics.com *Sent:* Wednesday, July 15, 2015 2:07

Re: Contribution and choice of language

2015-07-14 Thread Akhil Das
This will get you started https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Thanks Best Regards On Mon, Jul 13, 2015 at 5:29 PM, srinivasraghavansr71 sreenivas.raghav...@gmail.com wrote: Hello everyone, I am interested to contribute to apache spark. I

Re: Contribution and choice of language

2015-07-14 Thread Akhil Das
You can try to resolve some JIRA issues; to start with, try out some newbie JIRAs. Thanks Best Regards On Tue, Jul 14, 2015 at 4:10 PM, srinivasraghavansr71 sreenivas.raghav...@gmail.com wrote: I saw the contribution sections. As a new contibutor, should I try to build patches or can I add

Re: Spark job hangs when History server events are written to hdfs

2015-07-08 Thread Akhil Das
Can you look in the datanode logs and see what's going on? Most likely you are hitting the ulimit on open file handles. Thanks Best Regards On Wed, Jul 8, 2015 at 10:55 AM, Pankaj Arora pankaj.ar...@guavus.com wrote: Hi, I am running long running application over yarn using spark and I am
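To check whether the open-file-handle ceiling is the likely culprit, compare the limit against how many descriptors the process actually has open — a quick Linux sketch (in practice substitute the datanode's PID for the shell's own `$$`; raising the limit is done in /etc/security/limits.conf or the service's init script):

```shell
# Soft limit on open file descriptors for the current session
ulimit -n

# Count descriptors currently open by this shell — a stand-in for the
# datanode PID; /proc is Linux-specific, hence the 2>/dev/null guard
ls /proc/$$/fd 2>/dev/null | wc -l
```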

Re: Data interaction between various RDDs in Spark Streaming

2015-07-07 Thread Akhil Das
updateStateByKey? Thanks Best Regards On Wed, Jul 8, 2015 at 1:05 AM, swetha swethakasire...@gmail.com wrote: Hi, Suppose I want the data to be grouped by and Id named 12345 and I have certain amount of data coming out from one batch for 12345 and I have data related to 12345 coming after

Re: Spark for distributed dbms cluster

2015-06-26 Thread Akhil Das
Which distributed database are you referring here? Spark can connect with almost all those databases out there (You just need to pass the Input/Output Format classes or there are a bunch of connectors also available). Thanks Best Regards On Fri, Jun 26, 2015 at 12:07 PM, louis.hust

Re: About HostName display in SparkUI

2015-06-15 Thread Akhil Das
In the conf/slaves file, are you having the ip addresses? or the hostnames? Thanks Best Regards On Sat, Jun 13, 2015 at 9:51 PM, Sea 261810...@qq.com wrote: In spark 1.4.0, I find that the Address is ip (it was hostname in v1.3.0), why? who did it?

Re: Contribution

2015-06-13 Thread Akhil Das
This is a good start, if you haven't seen this already https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Thanks Best Regards On Sat, Jun 13, 2015 at 8:46 AM, srinivasraghavansr71 sreenivas.raghav...@gmail.com wrote: Hi everyone, I am interest to

Re: About akka used in spark

2015-06-10 Thread Akhil Das
If you look at the Maven repo, you can see it's from Typesafe only: http://mvnrepository.com/artifact/org.spark-project.akka/akka-actor_2.10/2.3.4-spark For sbt, you can download the sources by adding withSources() like: libraryDependencies += "org.spark-project.akka" % "akka-actor_2.10" % "2.3.4-spark"

Re: Scheduler question: stages with non-arithmetic numbering

2015-06-07 Thread Akhil Das
Are you seeing the same behavior on the driver UI? (the one running on port 4040) If you click on the stage ID header you can sort the stages by ID. Thanks Best Regards On Fri, Jun 5, 2015 at 10:21 PM, Mike Hynes 91m...@gmail.com wrote: Hi folks, When I look at the output logs for an

Re: Resource usage of a spark application

2015-05-21 Thread Akhil Das
Yes Peter that's correct, you need to identify the processes and with that you can pull the actual usage metrics. Thanks Best Regards On Thu, May 21, 2015 at 2:52 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: Thanks Akhil, Ryan! @Akhil: YARN can only tell me how much vcores my

Re: Resource usage of a spark application

2015-05-17 Thread Akhil Das
You can either pull the high level information from your resource manager, or if you want more control/specific information you can write a script and pull the resource usage information from the OS. Something like this

Re: s3 vfs on Mesos Slaves

2015-05-13 Thread Akhil Das
Did you happen to have a look at this? https://github.com/abashev/vfs-s3 Thanks Best Regards On Tue, May 12, 2015 at 11:33 PM, Stephen Carman scar...@coldlight.com wrote: We have a small mesos cluster and these slaves need to have a vfs setup on them so that the slaves can pull down the data

Re: Getting Access is denied error while cloning Spark source using Eclipse

2015-05-12 Thread Akhil Das
Maybe you should check where exactly it's throwing the permission-denied error (possibly while trying to write to some directory). You can also try manually cloning the git repo to a directory and then opening that in Eclipse. Thanks Best Regards On Tue, May 12, 2015 at 3:46 PM, Chandrashekhar Kotekar

Re: NoClassDefFoundError with Spark 1.3

2015-05-08 Thread Akhil Das
Looks like the jar you provided has some missing classes. Try this: scalaVersion := "2.10.4" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.3.0", "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided", "org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided", "log4j" % "log4j" %

Re: Back-pressure for Spark Streaming

2015-05-08 Thread Akhil Das
We had a similar issue while working on one of our use cases, where we were processing at a moderate throughput (around 500MB/s). When the processing time exceeded the batch duration, it started to throw BlockNotFound exceptions; I made a workaround for that issue, and it is explained over here

SparkStreaming Workaround for BlockNotFound Exceptions

2015-05-07 Thread Akhil Das
Hi With Spark streaming (all versions), when my processing delay (around 2-4 seconds) exceeds the batch duration (being 1 second) and on a decent scale/throughput (consuming around 100MB/s on 1+2 node standalone 15GB, 4 cores each) the job will start to throw block not found exceptions when the

Re: java.lang.StackOverflowError when recovery from checkpoint in Streaming

2015-04-28 Thread Akhil Das
There's a similar issue reported over here https://issues.apache.org/jira/browse/SPARK-6847 Thanks Best Regards On Tue, Apr 28, 2015 at 7:35 AM, wyphao.2007 wyphao.2...@163.com wrote: Hi everyone, I am using val messages = KafkaUtils.createDirectStream[String, String, StringDecoder,

Re: Contributing Documentation Changes

2015-04-25 Thread Akhil Das
I also want to add mine :/ Everyone wants to add it seems. Thanks Best Regards On Fri, Apr 24, 2015 at 8:58 PM, madhu phatak phatak@gmail.com wrote: Hi, I understand that. The following page http://spark.apache.org/documentation.html has a external tutorials,blogs section which points

Re: Graphical display of metrics on application UI page

2015-04-22 Thread Akhil Das
There were some PRs about graphical representation with D3.js; you can possibly see them on GitHub. Here are a few of them: https://github.com/apache/spark/pulls?utf8=%E2%9C%93q=d3 Thanks Best Regards On Wed, Apr 22, 2015 at 8:08 AM, Punyashloka Biswal punya.bis...@gmail.com wrote: Dear

Re: How to use Spark Streaming .jar file that I've built using a different branch than master?

2015-04-20 Thread Akhil Das
I think you can override the SPARK_CLASSPATH with your newly built jar. Thanks Best Regards On Mon, Apr 20, 2015 at 2:28 PM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, I'm building a different version of Spark Streaming (based on a different branch than master) in my application for

Re: Using Spark with a SOCKS proxy

2015-03-18 Thread Akhil Das
Did you try ssh tunneling instead of SOCKS? Thanks Best Regards On Wed, Mar 18, 2015 at 5:45 AM, Kelly, Jonathan jonat...@amazon.com wrote: I'm trying to figure out how I might be able to use Spark with a SOCKS proxy. That is, my dream is to be able to write code in my IDE then run it

Re: Loading previously serialized object to Spark

2015-03-08 Thread Akhil Das
Can you paste the complete code? Thanks Best Regards On Sat, Mar 7, 2015 at 2:25 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, I've implemented class MyClass in MLlib that does some operation on LabeledPoint. MyClass extends serializable, so I can map this operation on data of

Re: Pull Requests on github

2015-02-08 Thread Akhil Das
You can open a Jira issue pointing this PR to get it processed faster. :) Thanks Best Regards On Sat, Feb 7, 2015 at 7:07 AM, fommil sam.halli...@gmail.com wrote: Hi all, I'm the author of netlib-java and I noticed that the documentation in MLlib was out of date and misleading, so I

Re: Broken record a bit here: building spark on intellij with sbt

2015-02-04 Thread Akhil Das
Here's the sbt version https://docs.sigmoidanalytics.com/index.php/Step_by_Step_instructions_on_how_to_build_Spark_App_with_IntelliJ_IDEA Thanks Best Regards On Thu, Feb 5, 2015 at 8:55 AM, Stephen Boesch java...@gmail.com wrote: For building in intellij with sbt my mileage has varied widely:

Re: Memory config issues

2015-01-18 Thread Akhil Das
It's the executor memory (spark.executor.memory), which you can set while creating the Spark context. By default it uses 60% of the executor memory for storage (spark.storage.memoryFraction = 0.6). Now, to show some memory usage, you need to cache (persist) the RDD. Regarding the OOM Exception, you can increase the level of
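For reference, the knobs mentioned above look like this under the old static memory management (pre-1.6 Spark) — the values are illustrative, not recommendations:

```
# spark-defaults.conf (illustrative values)
spark.executor.memory          4g     # heap size per executor
spark.storage.memoryFraction   0.6    # fraction of heap reserved for cached (persisted) RDDs
```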

Re: Bouncing Mails

2015-01-17 Thread Akhil Das
, Akhil Das ak...@sigmoidanalytics.com wrote: My mails to the mailing list are getting rejected, have opened a Jira issue, can someone take a look at it? https://issues.apache.org/jira/browse/INFRA-9032 Thanks Best Regards

Bouncing Mails

2015-01-17 Thread Akhil Das
My mails to the mailing list are getting rejected, have opened a Jira issue, can someone take a look at it? https://issues.apache.org/jira/browse/INFRA-9032 Thanks Best Regards

Re: Apache Spark client high availability

2015-01-12 Thread Akhil Das
We usually run Spark in HA with the following stack:
- Apache Mesos
- Marathon - init/control system for starting, stopping, and maintaining always-on applications (mainly Spark Streaming)
- Chronos - general-purpose scheduler for Mesos, supports job dependency graphs
- Spark Job Server -

Re: Reading Data Using TextFileStream

2015-01-06 Thread Akhil Das
I think you need to start your streaming job first, then put the files there to get them read; textFileStream doesn't read pre-existing files, I believe. Also, are you sure the path is not the following? (no missing / in the beginning?) JavaDStream<String> textStream = ssc.textFileStream("/user/

Re: 1gb file processing...task doesn't launch on all the node...Unseen exception

2014-11-14 Thread Akhil Das
It shows a NullPointerException; your data could be corrupted? Try putting a try/catch inside the operation that you are doing. Are you running the worker process on the master node also? If not, then only 1 node will be doing the processing. If yes, then try setting the level of parallelism and

Re: java.lang.OutOfMemoryError while running Shark on Mesos

2014-05-23 Thread Akhil Das
Hi Prabeesh, Do an export _JAVA_OPTIONS=-Xmx10g before starting Shark. Also, you can do a ps aux | grep shark and see how much memory it is being allocated; most likely it will be 512MB, in which case increase the limit. Thanks Best Regards On Fri, May 23, 2014 at 10:22 AM, prabeesh k