Re: 8080 not working

2015-12-05 Thread Akhil Das
Can you provide more information? Do you have a Spark standalone cluster running? If not, then you won't be able to access 8080. If you want to load file data, you can open up the spark-shell and type val rawData = sc.textFile("/path/to/your/file") You can also use the
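
For example, a minimal spark-shell sketch (the path is a placeholder):

    val rawData = sc.textFile("/path/to/your/file")   // one record per line
    println(rawData.count())                          // force the read
    rawData.take(5).foreach(println)                  // peek at a few lines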

Re: Improve saveAsTextFile performance

2015-12-05 Thread Akhil Das
Which version of Spark are you using? Can you look at the event timeline and the DAG of the job and see where it's spending more time? .save simply triggers your entire pipeline. If you are doing join/groupBy kinds of operations, then you need to make sure the keys are evenly distributed throughout
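
A quick, hedged way to check for key skew before the shuffle stage (pairRdd is a placeholder for the keyed RDD being joined/grouped):

    // count records per partition to spot unevenly distributed keys
    val sizes = pairRdd
      .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()
    sizes.sortBy(-_._2).take(10).foreach { case (idx, n) =>
      println(s"partition $idx -> $n records")
    }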

Re: spark sql cli query results written to file ?

2015-12-02 Thread Akhil Das
Something like this? val df = sqlContext.read.load("examples/src/main/resources/users.parquet") df.select("name", "favorite_color").write.save("namesAndFavColors.parquet") It will save the name, favorite_color columns to a parquet file. You can read more information over here

Re: send this email to unsubscribe

2015-12-02 Thread Akhil Das
If you haven't unsubscribed already, shoot an email to user-unsubscr...@spark.apache.org Also read more here http://spark.apache.org/community.html Thanks Best Regards On Thu, Nov 26, 2015 at 7:51 AM, ngocan211 . wrote: > >

Re: Error in block pushing thread puts the KinesisReceiver in a stuck state

2015-12-02 Thread Akhil Das
Did you go through the executor logs completely? A "Futures timed out" exception mostly occurs when one of the tasks/jobs spends way too much time and fails to respond; this happens when there's a GC pause or memory overhead. Thanks Best Regards On Tue, Dec 1, 2015 at 12:09 AM, Spark Newbie

Re: spark sql cli query results written to file ?

2015-12-02 Thread Akhil Das
Oops 3 mins late. :) Thanks Best Regards On Thu, Dec 3, 2015 at 11:49 AM, Sahil Sareen <sareen...@gmail.com> wrote: > Yeah, Thats the example from the link I just posted. > > -Sahil > > On Thu, Dec 3, 2015 at 11:41 AM, Akhil Das <ak...@sigmoidanalytics.com> &

Re: Multiplication on decimals in a dataframe query

2015-12-02 Thread Akhil Das
Not quite sure what's happening, but it's not an issue with multiplication, I guess, as the following query worked for me: trades.select(trades("price")*9.5).show +-+ |(price * 9.5)| +-+ |199.5| |228.0| |190.0| |199.5| |190.0| |

Re: Debug Spark

2015-12-02 Thread Akhil Das
This doc will get you started https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ Thanks Best Regards On Sun, Nov 29, 2015 at 9:48 PM, Masf wrote: > Hi > > Is it possible to debug spark locally with IntelliJ or another

Re: Spark on Mesos with Centos 6.6 NFS

2015-12-01 Thread Akhil Das
Can you try mounting the NFS directory at the same location on all machines (say /mnt/nfs) and trying it again? Thanks Best Regards On Thu, Nov 26, 2015 at 1:22 PM, leonidas wrote: > Hello, > I have a setup with spark 1.5.1 on top of Mesos with one master and 4 > slaves. I

Re: [SPARK STREAMING ] Sending data to ElasticSearch

2015-11-09 Thread Akhil Das
Have a look at https://github.com/elastic/elasticsearch-hadoop#apache-spark You can simply call the .saveToEs function to store your RDD data into ES. Thanks Best Regards On Thu, Oct 29, 2015 at 8:19 PM, Nipun Arora wrote: > Hi, > > I am sending data to an

Re: heap memory

2015-11-09 Thread Akhil Das
It's coming from Parquet; you can try increasing your driver memory and see if it still occurs. Thanks Best Regards On Fri, Oct 30, 2015 at 7:16 PM, Younes Naguib <

Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Akhil Das
Is that all you have in the executor logs? I suspect some of those jobs are having a hard time managing the memory. Thanks Best Regards On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman wrote: > [adding dev list since it's probably a bug, but i'm not sure how to > reproduce so I

Re: Issue on spark.driver.maxResultSize

2015-11-09 Thread Akhil Das
You can set it in your conf/spark-defaults.conf file, or you will have to set it before you create the SparkContext. Thanks Best Regards On Fri, Oct 30, 2015 at 4:31 AM, karthik kadiyam < karthik.kadiyam...@gmail.com> wrote: > Hi, > > In spark streaming job i had the following setting > >
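
For example, a minimal sketch of setting it programmatically before the context is created (the 2g value and app name are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-streaming-job")
      .set("spark.driver.maxResultSize", "2g")   // raise the cap on results returned to the driver
    val sc = new SparkContext(conf)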

Re: Spark 1.5.1 Dynamic Resource Allocation

2015-11-09 Thread Akhil Das
Did you go through http://spark.apache.org/docs/latest/job-scheduling.html#configuration-and-setup for YARN? I guess you will have to copy the spark-1.5.1-yarn-shuffle.jar to the classpath of all NodeManagers in your cluster. Thanks Best Regards On Fri, Oct 30, 2015 at 7:41 PM, Tom Stewart <

Re: How to properly read the first number lines of file into a RDD

2015-11-09 Thread Akhil Das
There are multiple ways to achieve this: 1. Read the N lines from the driver and then do a sc.parallelize(nlines) to create an RDD out of it. 2. Create an RDD with N+M, do a take on N and then broadcast or parallelize the returned list. 3. Something like this if the file is in hdfs: val n_f =
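
A minimal sketch of option 1 (the path and N are placeholders; assumes the file is readable from the driver):

    import scala.io.Source

    val n = 100
    val firstLines = Source.fromFile("/path/to/local/file").getLines().take(n).toList
    val nLinesRdd = sc.parallelize(firstLines)   // RDD holding only the first N lines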

Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Akhil Das
ntsman*, *Big Data Engineer* > http://www.totango.com > > On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> Is that all you have in the executor logs? I suspect some of those jobs >> are having a hard time managing the memory. >

Re: Prevent partitions from moving

2015-11-03 Thread Akhil Das
Most likely in your case, the partition keys are not evenly distributed, and hence you notice some of your tasks taking way too long to process. You will have to use a custom partitioner
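
A minimal custom-partitioner skeleton (hash-based here; the getPartition body is where you would put whatever logic spreads your particular keys evenly):

    import org.apache.spark.Partitioner

    class CustomPartitioner(override val numPartitions: Int) extends Partitioner {
      // replace with logic that balances your skewed keys
      override def getPartition(key: Any): Int =
        math.abs(key.hashCode % numPartitions)
    }

    // usage on a pair RDD (names are placeholders):
    // val balanced = pairRdd.partitionBy(new CustomPartitioner(200))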

Re: Apache Spark on Raspberry Pi Cluster with Docker

2015-11-03 Thread Akhil Das
Can you try it with just: spark-submit --master spark://master:6066 --class SimpleApp target/simple-project-1.0.jar And see if it works? An even better idea would be to spawn a spark-shell (*MASTER=spark://master:6066 bin/spark-shell*) and try out a simple *sc.parallelize(1 to 1000).collect*

Re: Submitting Spark Applications - Do I need to leave ports open?

2015-11-02 Thread Akhil Das
Yes, you need to open up a few ports for that to happen. Have a look at http://spark.apache.org/docs/latest/configuration.html#networking where you can see the *.port configurations, which bind to random ports by default; fix those ports to specific numbers and open those ports in your firewall and it should
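
For example, the port properties from that page can be pinned in conf/spark-defaults.conf (the numbers are placeholders; open the same ones in the firewall):

    spark.driver.port           7001
    spark.fileserver.port       7002
    spark.broadcast.port        7003
    spark.replClassServer.port  7004
    spark.blockManager.port     7005
    spark.executor.port         7006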

Re: --jars option using hdfs jars cannot effect when spark standlone deploymode with cluster

2015-11-02 Thread Akhil Das
Can you give it a try putting the jar locally, without HDFS? Thanks Best Regards On Wed, Oct 28, 2015 at 8:40 AM, our...@cnsuning.com wrote: > hi all, >when using command: > spark-submit *--deploy-mode cluster --jars > hdfs:///user/spark/cypher.jar* --class >

Re: How to catch error during Spark job?

2015-11-02 Thread Akhil Das
Usually you add exception handling within the transformations; in your case you have it added in the driver code. This approach won't be able to catch those exceptions happening inside the executor. eg: try { val rdd = sc.parallelize(1 to 100) rdd.foreach(x => throw new
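
A minimal sketch of handling the failure inside the task itself, where it is actually thrown:

    val rdd = sc.parallelize(1 to 100)
    rdd.foreach { x =>
      try {
        // placeholder for the real per-record work
        if (x % 10 == 0) throw new RuntimeException(s"bad record: $x")
      } catch {
        // caught on the executor, so the job keeps running
        case e: Exception => println(s"skipping $x: ${e.getMessage}")
      }
    }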

Re: streaming.twitter.TwitterUtils what is the best way to save twitter status to HDFS?

2015-11-01 Thread Akhil Das
You can use .saveAsObjectFiles("hdfs://sigmoid/twitter/status/") since you want to store the Status object, and for every batch it will create a directory under /status (the name will mostly be the timestamp); since the data is small (hardly a couple of MBs for a 1 sec interval) it will not overwhelm

Re: Unable to use saveAsSequenceFile

2015-11-01 Thread Akhil Das
Make sure your firewall isn't blocking the requests. Thanks Best Regards On Sat, Oct 24, 2015 at 5:04 PM, Amit Singh Hora wrote: > Hi All, > > I am trying to write an RDD as a Sequence file into my Hadoop cluster but > getting connection time out again and again, I can ping

Re: java how to configure streaming.dstream.DStream<> saveAsTextFiles() to work with hdfs?

2015-11-01 Thread Akhil Das
How are you submitting your job? You need to make sure HADOOP_CONF_DIR is pointing to your hadoop configuration directory (with core-site.xml, hdfs-site.xml files), If you have them set properly then make sure you are giving the full hdfs url like:

Re: Spark Streaming: how to StreamingContext.queueStream

2015-11-01 Thread Akhil Das
You can do something like this: val rddQueue = scala.collection.mutable.Queue(rdd1,rdd2,rdd3) val qDstream = ssc.queueStream(rddQueue) Thanks Best Regards On Sat, Oct 24, 2015 at 4:43 AM, Anfernee Xu wrote: > Hi, > > Here's my situation, I have some kind of offline

Re: Running 2 spark application in parallel

2015-11-01 Thread Akhil Das
Have a look at the dynamic resource allocation listed here https://spark.apache.org/docs/latest/job-scheduling.html Thanks Best Regards On Thu, Oct 22, 2015 at 11:50 PM, Suman Somasundar < suman.somasun...@oracle.com> wrote: > Hi all, > > > > Is there a way to run 2 spark applications in

Re: Saving RDDs in Tachyon

2015-10-30 Thread Akhil Das
I guess you can do a .saveAsObjectFiles and read it back as sc.objectFile Thanks Best Regards On Fri, Oct 23, 2015 at 7:57 AM, mark wrote: > I have Avro records stored in Parquet files in HDFS. I want to read these > out as an RDD and save that RDD in Tachyon for

Re: Whether Spark will use disk when the memory is not enough on MEMORY_ONLY Storage Level

2015-10-30 Thread Akhil Das
You can set it to MEMORY_AND_DISK; in this case, data will spill to disk when it exceeds the memory. Thanks Best Regards On Fri, Oct 23, 2015 at 9:52 AM, JoneZhang wrote: > 1.Whether Spark will use disk when the memory is not enough on MEMORY_ONLY > Storage Level? >
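
For example (rdd is a placeholder):

    import org.apache.spark.storage.StorageLevel

    // partitions that don't fit in memory are written to local disk
    // instead of being dropped and recomputed
    val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)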

Re: spark streaming failing to replicate blocks

2015-10-23 Thread Akhil Das
eems correct. > > Thanks, > Eugen > > 2015-10-23 8:30 GMT+02:00 Akhil Das <ak...@sigmoidanalytics.com>: > >> Mostly a network issue, you need to check your network configuration from >> the aws console and make sure the ports are accessible within the cluster. >>

Re: spark streaming failing to replicate blocks

2015-10-23 Thread Akhil Das
know why this happens, is that some > known issue? > > Thanks, > Eugen > > 2015-10-22 19:08 GMT+07:00 Akhil Das <ak...@sigmoidanalytics.com>: > >> Can you try fixing spark.blockManager.port to specific port and see if >> the issue exists? >> >>

Re: Get the previous state string

2015-10-22 Thread Akhil Das
That way, you will eventually end up bloating up that list. Instead, you could push the stream to a NoSQL database (like HBase or Cassandra, etc.) and then read it back and join it with your current stream, if that's what you are looking for. Thanks Best Regards On Thu, Oct 15, 2015 at 6:11 PM,

Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-10-22 Thread Akhil Das
Did you try passing the MySQL connector jar through --driver-class-path? Thanks Best Regards On Sat, Oct 17, 2015 at 6:33 AM, Hurshal Patel wrote: > Hi all, > > I've been struggling with a particularly puzzling issue after upgrading to > Spark 1.5.1 from Spark 1.4.1. > >

Re: Output println info in LogMessage Info ?

2015-10-22 Thread Akhil Das
Yes, using log4j you can log everything. Here's a thread with example http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs Thanks Best Regards On Sun, Oct 18, 2015 at 12:10 AM, kali.tumm...@gmail.com <

Re: Can I convert RDD[My_OWN_JAVA_CLASS] to DataFrame in Spark 1.3.x?

2015-10-22 Thread Akhil Das
Have a look at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection if you haven't seen that already. Thanks Best Regards On Thu, Oct 15, 2015 at 10:56 PM, java8964 wrote: > Hi, Sparkers: > > I wonder if I can convert a RDD
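
A minimal Scala sketch of the reflection approach that guide describes (assumes Spark 1.3+ with a SQLContext named sqlContext):

    case class Person(name: String, age: Int)

    import sqlContext.implicits._

    val people = sc.parallelize(Seq(Person("Ann", 32), Person("Bob", 41)))
    val df = people.toDF()
    df.printSchema()

For a Java bean class, the equivalent route is sqlContext.createDataFrame(javaRdd, MyOwnJavaClass.class).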

Re: spark streaming failing to replicate blocks

2015-10-22 Thread Akhil Das
Can you try fixing spark.blockManager.port to a specific port and see if the issue persists? Thanks Best Regards On Mon, Oct 19, 2015 at 6:21 PM, Eugen Cepoi wrote: > Hi, > > I am running spark streaming 1.4.1 on EMR (AMI 3.9) over YARN. > The job is reading data from

Re: multiple pyspark instances simultaneously (same time)

2015-10-22 Thread Akhil Das
Did you read https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application Thanks Best Regards On Thu, Oct 15, 2015 at 11:31 PM, jeff.sadow...@gmail.com < jeff.sadow...@gmail.com> wrote: > I am having issues trying to setup spark to run jobs simultaneously. > > I

Re: Storing Compressed data in HDFS into Spark

2015-10-22 Thread Akhil Das
Convert your data to Parquet; it saves space and time. Thanks Best Regards On Mon, Oct 19, 2015 at 11:43 PM, ahaider3 wrote: > Hi, > A lot of the data I have in HDFS is compressed. I noticed when I load this > data into spark and cache it, Spark unrolls the data like

Re: java TwitterUtils.createStream() how create "user stream" ???

2015-10-22 Thread Akhil Das
I don't think the one that comes with Spark would listen to specific user feeds, but yes, you can filter the public tweets by passing the filters argument. Here's an example for you to start

Re: Can we split partition

2015-10-22 Thread Akhil Das
Did you try coalesce? It doesn't shuffle the data around. Thanks Best Regards On Wed, Oct 21, 2015 at 10:27 AM, shahid wrote: > Hi > > I have a large partition(data skewed) i need to split it to no. of > partitions, repartitioning causes lot of shuffle. Can we do that..? > >

Re: Spark handling parallel requests

2015-10-19 Thread Akhil Das
the number of > concurrent requests? ... > > As Akhil said, Kafka might help in your case. Otherwise, you need to read > the designs or even source codes of Kafka and Spark Streaming. > > Best wishes, > > Xiao Li > > > 2015-10-11 23:19 GMT-07:00 Akhil Das <ak...@sigm

Re: spark streaming filestream API

2015-10-14 Thread Akhil Das
Key and Value are the ones that you are using with your InputFormat. Eg: JavaReceiverInputDStream lines = jssc.fileStream("/sigmoid", LongWritable.class, Text.class, TextInputFormat.class); TextInputFormat uses the LongWritable as Key and Text as Value classes. If your data is plain CSV or text

Re: Changing application log level in standalone cluster

2015-10-14 Thread Akhil Das
You should be able to do that from your application. In the beginning of the application, just add: import org.apache.log4j.Logger import org.apache.log4j.Level Logger.getLogger("org").setLevel(Level.OFF) Logger.getLogger("akka").setLevel(Level.OFF) That will switch off the logs. Thanks Best

Re: Cannot connect to standalone spark cluster

2015-10-14 Thread Akhil Das
Open a spark-shell by: MASTER=Ellens-MacBook-Pro.local:7077 bin/spark-shell And if it's able to connect, then check your Java project's build file and make sure you have the proper Spark version. Thanks Best Regards On Sat, Oct 10, 2015 at 3:07 AM, ekraffmiller

Re: how to use SharedSparkContext

2015-10-14 Thread Akhil Das
Did a quick search and found the following, I haven't tested it myself. Add the following to your build.sbt libraryDependencies += "com.holdenkarau" % "spark-testing-base_2.10" % "1.5.0_1.4.0_1.4.1_0.1.2" Create a class extending com.holdenkarau.spark.testing.SharedSparkContext And you

Re: spark streaming filestream API

2015-10-14 Thread Akhil Das
les/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3 > In this case, key would be Nullwritable and value would be BytesWritable > right? > > > > Unfortunately my files are binary and not text files. > > > > Regards, > > Anand.C > > > > *From:*

Re: unresolved dependency: org.apache.spark#spark-streaming_2.10;1.5.0: not found

2015-10-13 Thread Akhil Das
You need to add "org.apache.spark" % "spark-streaming_2.10" % "1.5.0" to the dependencies list. Thanks Best Regards On Tue, Oct 6, 2015 at 3:20 PM, shahab wrote: > Hi, > > I am trying to use Spark 1.5, Mlib, but I keep getting > "sbt.ResolveException: unresolved

Re: Spark handling parallel requests

2015-10-12 Thread Akhil Das
Instead of pushing your requests to the socket, why don't you push them to Kafka or any other message queue and use Spark Streaming to process them? Thanks Best Regards On Mon, Oct 5, 2015 at 6:46 PM, wrote: > Hi , > i am using Scala , doing a socket

Re: "java.io.IOException: Filesystem closed" on executors

2015-10-12 Thread Akhil Das
Can you look a bit deeper in the executor logs? It could be filling up the memory and getting killed. Thanks Best Regards On Mon, Oct 5, 2015 at 8:55 PM, Lan Jiang wrote: > I am still facing this issue. Executor dies due to > > org.apache.avro.AvroRuntimeException:

Re: How to change verbosity level and redirect verbosity to file?

2015-10-12 Thread Akhil Das
Have a look at http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs Thanks Best Regards On Mon, Oct 5, 2015 at 9:42 PM, wrote: > Hi, > > I would like to read the full spark-submit log once a job

Re: Spark cluster - use machine name in WorkerID, not IP address

2015-10-11 Thread Akhil Das
Did you try setting the SPARK_LOCAL_IP in the conf/spark-env.sh file on each node? Thanks Best Regards On Fri, Oct 2, 2015 at 4:18 AM, markluk wrote: > I'm running a standalone Spark cluster of 1 master and 2 slaves. > > My slaves file under /conf list the fully qualified

Re: Compute Real-time Visualizations using spark streaming

2015-10-11 Thread Akhil Das
The simplest approach would be to push the streaming data (after the computations) to an SQL-like DB and then let your visualization piece pull it from the DB. Another approach would be to make your visualization piece a web-socket (if you are using D3.js, etc.) and then from your streaming application

Re: UnknownHostException with Mesos and custom Jar

2015-09-30 Thread Akhil Das
Can you try replacing your code with the hdfs uri? like: sc.textFile("hdfs://...").collect().foreach(println) Thanks Best Regards On Tue, Sep 29, 2015 at 1:45 AM, Stephen Hankinson wrote: > Hi, > > Wondering if anyone can help me with the issue I am having. > > I am

Re: Reading kafka stream and writing to hdfs

2015-09-30 Thread Akhil Das
Like: counts.saveAsTextFiles("hdfs://host:port/some/location") Thanks Best Regards On Tue, Sep 29, 2015 at 2:15 AM, Chengi Liu wrote: > Hi, > I am going thru this example here: > >

Re: log4j Spark-worker performance problem

2015-09-30 Thread Akhil Das
It depends on how big the lines are; on a typical HDD you can write at max 10-15MB/s, and on SSDs it can be up to 30-40MB/s. Thanks Best Regards On Mon, Sep 28, 2015 at 3:57 PM, vaibhavrtk wrote: > Hello > > We need a lot of logging for our application about 1000 lines needed to

Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-09-30 Thread Akhil Das
Each JSON doc should be on a single line, I guess. http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets Note that the file that is offered as *a json file* is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a
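
For example, a minimal Scala sketch (the file path comes from the Spark examples; assumes Spark 1.4+ with a SQLContext named sqlContext):

    // every line of the file must be one complete, self-contained JSON object
    val df = sqlContext.read.json("examples/src/main/resources/people.json")
    df.printSchema()
    df.show()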

Re: UnknownHostException with Mesos and custom Jar

2015-09-30 Thread Akhil Das
> Stephen Hankinson, P. Eng. > CTO > Affinio Inc. > 301 - 211 Horseshoe Lake Dr. > Halifax, Nova Scotia, Canada > B3S 0B9 > > http://www.affinio.com > > On Wed, Sep 30, 2015 at 4:21 AM, Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> Can you try rep

Re: HDFS is undefined

2015-09-28 Thread Akhil Das
For some reason Spark isn't picking up your Hadoop confs. Did you download Spark compiled with the Hadoop version that you have in the cluster? Thanks Best Regards On Fri, Sep 25, 2015 at 7:43 PM, Angel Angel wrote: > hello, > I am running the spark application. >

Re: how to handle OOMError from groupByKey

2015-09-28 Thread Akhil Das
You can try to increase the number of partitions to get rid of the OOM errors. Also try to use reduceByKey instead of groupByKey. Thanks Best Regards On Sat, Sep 26, 2015 at 1:05 AM, Elango Cheran wrote: > Hi everyone, > I have an RDD of the format (user: String,
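
A minimal sketch of the difference (pairs is a placeholder RDD[(String, Int)]):

    // groupByKey ships and buffers every value for a key across the shuffle:
    // val counts = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines map-side first, so far less data is shuffled:
    val counts = pairs.reduceByKey(_ + _)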

Re: Master getting down with Memory issue.

2015-09-28 Thread Akhil Das
explaine to me how increasing number of partition (which >> is thing is worker nodes) will help. >> >> As issue is that my master is getting OOM. >> >> Thanks, >> Saurav Sinha >> >> On Mon, Sep 28, 2015 at 2:32 PM, Akhil Das <ak...@sigmoidanalytics.com&g

Re: Spark 1.5.0 Not able to submit jobs using cluster URL

2015-09-28 Thread Akhil Das
Update the dependency version in your jobs build file, Also make sure you have updated the spark version to 1.5.0 everywhere. (in the cluster, code) Thanks Best Regards On Mon, Sep 28, 2015 at 11:29 AM, lokeshkumar wrote: > Hi forum > > I have just upgraded spark from 1.4.0

Re: Master getting down with Memory issue.

2015-09-28 Thread Akhil Das
This behavior totally depends on the job that you are doing. Usually increasing the # of partitions will sort out this issue. It would be good if you can paste the code snippet or explain what type of operations that you are doing. Thanks Best Regards On Mon, Sep 28, 2015 at 11:37 AM, Saurav

Re: Spark 1.5.0 Not able to submit jobs using cluster URL

2015-09-28 Thread Akhil Das
d I did that. > > On Mon, Sep 28, 2015 at 2:30 PM Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> Update the dependency version in your jobs build file, Also make sure you >> have updated the spark version to 1.5.0 everywhere. (in the cluster, code) >> >

Re: Weird worker usage

2015-09-26 Thread Akhil Das
that will solve your problem. Thanks Best Regards On Sun, Sep 27, 2015 at 12:50 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote: > That means only > > Thanks > Best Regards > > On Sun, Sep 27, 2015 at 12:07 AM, N B <nb.nos...@gmail.com> wrote: > >> Hello, >

Re: Weird worker usage

2015-09-26 Thread Akhil Das
y >> disregarding N2 until its the final stage where data is being written out >> to Elasticsearch. I am not sure I understand the reason behind it not >> distributing more partitions to N2 to begin with and use it effectively. >> Since there are only 12 cores on N1 and 25 total partiti

Re: executor-cores setting does not work under Yarn

2015-09-25 Thread Akhil Das
Which version of Spark are you using? Can you also check what's set in your conf/spark-defaults.conf file? Thanks Best Regards On Fri, Sep 25, 2015 at 1:58 AM, Gavin Yue wrote: > Running Spark app over Yarn 2.7 > > Here is my sparksubmit setting: > --master yarn-cluster

Re: Error: Asked to remove non-existent executor

2015-09-25 Thread Akhil Das
What do you mean by being behind a NAT? Does it mean you are submitting your jobs to a remote Spark cluster from your local machine? If that's the case, then you need to take care of a few ports (in the NAT) http://spark.apache.org/docs/latest/configuration.html#networking which assume random as

Re: Weird worker usage

2015-09-25 Thread Akhil Das
Parallel tasks totally depend on the # of partitions that you have; if you are not getting sufficient partitions (partitions > total # of cores), then try to do a .repartition. Thanks Best Regards On Fri, Sep 25, 2015 at 1:44 PM, N B wrote: > Hello all, > > I have a

Re: Setting Spark TMP Directory in Cluster Mode

2015-09-25 Thread Akhil Das
Try with spark.local.dir in the spark-defaults.conf or SPARK_LOCAL_DIRS in the spark-env.sh file. Thanks Best Regards On Fri, Sep 25, 2015 at 2:14 PM, mufy wrote: > Faced with an issue where Spark temp files get filled under > /opt/spark-1.2.1/tmp on the local filesystem
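
For example (the paths are placeholders; multiple disks can be comma-separated):

    # conf/spark-defaults.conf
    spark.local.dir  /data1/spark-tmp,/data2/spark-tmp

    # or conf/spark-env.sh
    export SPARK_LOCAL_DIRS=/data1/spark-tmp,/data2/spark-tmp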

Re: unsubscribe

2015-09-23 Thread Akhil Das
To unsubscribe, you need to send an email to user-unsubscr...@spark.apache.org as described here http://spark.apache.org/community.html Thanks Best Regards On Wed, Sep 23, 2015 at 1:23 AM, Stuart Layton wrote: > > > -- > Stuart Layton >

Re: Has anyone used the Twitter API for location filtering?

2015-09-23 Thread Akhil Das
> > > On Tue, Sep 22, 2015 at 2:20 AM, Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> ​That's because sometime getPlace returns null and calling getLang over >> null throws up either null pointer exception or noSuchMethodError. You need >> to filter out

Re: SparkContext declared as object variable

2015-09-23 Thread Akhil Das
createStream and applying some > transformations. > > Is creating sparContext at object level instead of creating in main > doesn't work > > On Tue, Sep 22, 2015 at 2:59 PM, Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> Its a "value" not a variable,

Re: Spark Web UI + NGINX

2015-09-22 Thread Akhil Das
Can you not just tunnel it? Like on Machine A: ssh -L 8080:127.0.0.1:8080 machineB And on your local machine: ssh -L 80:127.0.0.1:8080 machineA And then simply open http://localhost/ which will show the Spark UI running on machineB. The people at DigitalOcean have made a wonderful article on how

Re: Spark Lost executor && shuffle.FetchFailedException

2015-09-22 Thread Akhil Das
If you can look a bit deeper in the executor logs, then you might find the root cause for this issue. Also make sure the ports (seems 34869 here) are accessible between all the machines. Thanks Best Regards On Mon, Sep 21, 2015 at 12:40 PM, wrote: > Hi All: > > > When I

Re: Cache after filter Vs Writing back to HDFS

2015-09-22 Thread Akhil Das
Instead of .map you can try doing a .mapPartitions and see the performance. Thanks Best Regards On Fri, Sep 18, 2015 at 2:47 AM, Gavin Yue wrote: > For a large dataset, I want to filter out something and then do the > computing intensive work. > > What I am doing now: >
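
A minimal sketch of the idea (rdd is a placeholder RDD[String]; the formatter stands in for any per-partition setup such as a connection or parser):

    val result = rdd.mapPartitions { iter =>
      // pay the setup cost once per partition instead of once per record
      val formatter = new java.text.DecimalFormat("#.##")
      iter.map(s => formatter.format(s.length.toDouble))
    }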

Re: Lost tasks in Spark SQL join jobs

2015-09-22 Thread Akhil Das
If you look a bit into the error logs, you can possibly see other issues like GC overhead etc., which cause the next set of tasks to fail. Thanks Best Regards On Thu, Sep 17, 2015 at 9:26 AM, Gang Bai wrote: > Hi all, > > I’m joining two tables on a specific

Re: Has anyone used the Twitter API for location filtering?

2015-09-22 Thread Akhil Das
That's because sometimes getPlace returns null, and calling getLang on null throws either a NullPointerException or a NoSuchMethodError. You need to filter out those statuses which don't include location data. Thanks Best Regards On Fri, Sep 18, 2015 at 12:46 AM, Jo Sunad

Re: SparkContext declared as object variable

2015-09-22 Thread Akhil Das
Its a "value" not a variable, and what are you parallelizing here? Thanks Best Regards On Fri, Sep 18, 2015 at 11:21 PM, Priya Ch wrote: > Hello All, > > Instead of declaring sparkContext in main, declared as object variable > as - > > object sparkDemo > { > >

Re: AWS_CREDENTIAL_FILE

2015-09-22 Thread Akhil Das
No, you can either set the configurations within your SparkContext's Hadoop configuration: val hadoopConf = sparkContext.hadoopConfiguration hadoopConf.set("fs.s3n.awsAccessKeyId", s3Key) hadoopConf.set("fs.s3n.awsSecretAccessKey", s3Secret) or you can set it in the environment as: export

Re: How to recovery DStream from checkpoint directory?

2015-09-17 Thread Akhil Das
ob is failed and > auto restart? If so, both the checkpoint data and database data are loaded, > won't this a problem? > > > > Bin Wang <wbi...@gmail.com>于2015年9月16日周三 下午8:40写道: > >> Will StreamingContex.getOrCreate do this work?What kind of code change >

Re: Getting parent RDD

2015-09-16 Thread Akhil Das
ot > sure how), so as to refer to that in catch block. > > > > Regards, > > Sam > > > > *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com] > *Sent:* Wednesday, September 16, 2015 12:24 PM > *To:* Samya MAITI <samya.ma...@amadeus.com> >

Re: Managing scheduling delay in Spark Streaming

2015-09-16 Thread Akhil Das
I had a workaround for exactly the same scenario http://apache-spark-developers-list.1001551.n3.nabble.com/SparkStreaming-Workaround-for-BlockNotFound-Exceptions-td12096.html Apart from that, if you are using this consumer https://github.com/dibbhatt/kafka-spark-consumer it also has a built-in

Re: Getting parent RDD

2015-09-16 Thread Akhil Das
How many RDDs do you have in that stream? If it's a single RDD then you could do a .foreach and log the message, something like: val ssc = val msgStream = . //SparkKafkaDirectAPI val wordCountPair = TransformStream.transform(msgStream) /wordCountPair.foreach( msg => try{

Re: How to recovery DStream from checkpoint directory?

2015-09-16 Thread Akhil Das
You can't really recover from a checkpoint if you alter the code. A better approach would be to use some sort of external storage (like a DB or ZooKeeper, etc.) to keep the state (the indexes etc.), and then when you deploy new code it can be easily recovered. Thanks Best Regards On Wed, Sep 16,

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-15 Thread Akhil Das
As of now, I think it's a no. Not sure if it's a naive approach, but yes, you can have a separate program keep an eye on the web UI (possibly parsing the content) and make it trigger the kill task/job once it detects a lag. (Again, you will have to figure out the correct numbers before killing any

Re: why spark and kafka always crash

2015-09-15 Thread Akhil Das
Can you be more precise? Thanks Best Regards On Tue, Sep 15, 2015 at 11:28 AM, Joanne Contact wrote: > How to prevent it? > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For

Re: Problems with Local Checkpoints

2015-09-14 Thread Akhil Das
You need to set your HADOOP_HOME and make sure the winutils.exe is available in the PATH. Here's a discussion around the same issue http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path Also this JIRA

Re: SparkR - Support for Other Models

2015-09-14 Thread Akhil Das
You can look into the Spark JIRA page for the same; if it isn't available there, then you could possibly create an issue for support, and hopefully in later releases it will be added. Thanks Best Regards On Thu, Sep 10, 2015 at 11:26 AM, Manish

Re: java.lang.NullPointerException with Twitter API

2015-09-14 Thread Akhil Das
Some statuses might not have the geoLocation, and hence you are doing a null.toString.contains, which ends up in that exception. Put a condition or try...catch around it to make it work. Thanks Best Regards On Fri, Sep 11, 2015 at 12:59 AM, Jo Sunad wrote: > Hello! > > I am
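
A minimal sketch of the filter-first approach (assumes a DStream of twitter4j Status objects, here named stream, e.g. from TwitterUtils.createStream):

    // drop statuses that carry no location before touching getGeoLocation
    val withLocation = stream.filter(status => status.getGeoLocation != null)
    val coords = withLocation.map { s =>
      (s.getGeoLocation.getLatitude, s.getGeoLocation.getLongitude)
    }
    coords.print()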

Re: connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-14 Thread Akhil Das
You can look into this doc regarding the connection (it's for GCE, but it should be similar). Thanks Best Regards On Thu, Sep 10, 2015 at 11:20 PM, roni wrote: > I have spark installed on a EC2 cluster. Can I

Re: Spark task hangs infinitely when accessing S3

2015-09-14 Thread Akhil Das
Are you sitting behind a proxy or something? Can you look more into the executor logs? I have a strange feeling that you are blowing the memory (and possibly hitting GC etc). Thanks Best Regards On Thu, Sep 10, 2015 at 10:05 PM, Mario Pastorelli < mario.pastore...@teralytics.ch> wrote: > Dear

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Akhil Das
This consumer pretty much covers all those scenarios you listed: github.com/dibbhatt/kafka-spark-consumer Give it a try. Thanks Best Regards On Thu, Sep 10, 2015 at 3:32 PM, Krzysztof Zarzycki wrote: > Hi there, > I have a problem with fulfilling all my needs when using

Re: Contribution in Apche Spark

2015-09-09 Thread Akhil Das
Have a look https://issues.apache.org/jira/browse/spark/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel Thanks Best Regards On Wed, Sep 9, 2015 at 9:50 AM, Chintan Bhatt < chintanbhatt...@charusat.ac.in> wrote: > I want to contribute in Apache dspark especially in MLlib in

Re: foreachRDD causing executor lost failure

2015-09-09 Thread Akhil Das
If you can look a bit into the executor logs, you would see the exact reason (mostly an OOM/GC etc.). Instead of using foreach, try to use mapPartitions or foreachPartition. Thanks Best Regards On Tue, Sep 8, 2015 at 10:45 PM, Priya Ch wrote: > Hello All, > > I am

Re: Partitions with zero records & variable task times

2015-09-09 Thread Akhil Das
> partitions to have zero records and others to have roughly equal sized > chunks (~50k in this case)? > > Before writing a custom partitioner, I would like to understand why has > the default partitioner failed in my case? > On 8 Sep 2015 3:00 pm, "Akhil Das" <ak.

Re: No auto decompress in Spark Java textFile function?

2015-09-09 Thread Akhil Das
textFile used to work with .gz files; I haven't tested it on bz2 files. If it isn't decompressing by default, then what you have to do is use sc.wholeTextFiles and then decompress each record (that being a file) with the corresponding codec. Thanks Best Regards On Tue, Sep 8, 2015 at 6:49

Re: I am very new to Spark. I have a very basic question. I have an array of values: listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A) I want to filter an RDD for all o

2015-09-09 Thread Akhil Das
Try this: val tocks = Array("EC-17A5206955089011B","EC-17A5206955089011A") val rddAll = sc.parallelize(List("This contains EC-17A5206955089011B","This doesnt")) rddAll.filter(line => { var found = false for(item <- tocks){ if(line.contains(item)) found = true } found
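
The same filter can be written more compactly with exists (equivalent to the loop above):

    val tocks = Array("EC-17A5206955089011B", "EC-17A5206955089011A")
    val rddAll = sc.parallelize(List("This contains EC-17A5206955089011B", "This doesnt"))

    val matched = rddAll.filter(line => tocks.exists(line.contains))
    matched.collect().foreach(println)   // prints only the line containing a token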

Re: Partitions with zero records & variable task times

2015-09-08 Thread Akhil Das
Try using a custom partitioner for the keys so that they will get evenly distributed across tasks Thanks Best Regards On Fri, Sep 4, 2015 at 7:19 PM, mark wrote: > I am trying to tune a Spark job and have noticed some strange behavior - > tasks in a stage vary in

Re: buildSupportsSnappy exception when reading the snappy file in Spark

2015-09-08 Thread Akhil Das
Looks like you are having different versions of snappy library. Here's a similar discussion if you haven't seen it already http://stackoverflow.com/questions/22150417/hadoop-mapreduce-java-lang-unsatisfiedlinkerror-org-apache-hadoop-util-nativec Thanks Best Regards On Mon, Sep 7, 2015 at 7:41

Re: Can not allocate executor when running spark on mesos

2015-09-08 Thread Akhil Das
In which mode are you submitting your application? (coarse-grained or fine-grained(default)). Have you gone through this documentation already? http://spark.apache.org/docs/latest/running-on-mesos.html#using-a-mesos-master-url Thanks Best Regards On Tue, Sep 8, 2015 at 12:54 PM, canan chen

Re: Exception when restoring spark streaming with batch RDD from checkpoint.

2015-09-08 Thread Akhil Das
Try to add a filter to remove/replace the null elements within/before the map operation. Thanks Best Regards On Mon, Sep 7, 2015 at 3:34 PM, ZhengHanbin wrote: > Hi, > > I am using spark streaming to join every RDD of a DStream to a stand alone > RDD to generate a new
