Re: No Twitter Input from Kafka to Spark Streaming

2015-08-06 Thread Akhil Das
You just pasted your Twitter credentials, consider changing them. :/ Thanks Best Regards On Wed, Aug 5, 2015 at 10:07 PM, narendra narencs...@gmail.com wrote: Thanks Akash for the answer. I added the endpoint to the listener and now it is working. -- View this message in context:

Re: How do I Process Streams that span multiple lines?

2015-08-04 Thread Akhil Das
If you are using Kafka, then you can basically push an entire file as a message to Kafka. In that case in your DStream, you will receive the single message which is the contents of the file and it can of course span multiple lines. Thanks Best Regards On Mon, Aug 3, 2015 at 8:27 PM, Spark
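
A minimal sketch of that idea, assuming the newer (0.8.2+) Kafka producer API; the topic name and file path are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    // Read the whole file and push it as a single message; the DStream then
    // receives it as one record, newlines included.
    val producer = new KafkaProducer[String, String](props)
    val contents = scala.io.Source.fromFile("/path/to/file.txt").mkString
    producer.send(new ProducerRecord[String, String]("file-topic", contents))
    producer.close()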

Re: Running multiple batch jobs in parallel using Spark on Mesos

2015-08-04 Thread Akhil Das
One approach would be to use a Jobserver in between and create SparkContexts in it. Let's say you create two: one configured to run coarse-grained and another set to fine-grained. Let the high-priority jobs hit the coarse-grained SparkContext and the other jobs use the fine-grained one.
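
A rough sketch of the two-context idea, assuming the spark.mesos.coarse flag; the master URL and app names are placeholders, and in practice each context would live in its own Jobserver-managed JVM, since one JVM cannot host two SparkContexts:

    import org.apache.spark.{SparkConf, SparkContext}

    // High-priority context: coarse-grained mode holds on to its cores
    val coarseCtx = new SparkContext(new SparkConf()
      .setMaster("mesos://mesos-master:5050")
      .setAppName("HighPriorityJobs")
      .set("spark.mesos.coarse", "true"))

    // Low-priority context: fine-grained mode shares cores per task
    val fineCtx = new SparkContext(new SparkConf()
      .setMaster("mesos://mesos-master:5050")
      .setAppName("LowPriorityJobs")
      .set("spark.mesos.coarse", "false"))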

Re: Writing to HDFS

2015-08-04 Thread Akhil Das
Just to add, rdd.take(1) won't trigger the entire computation; it will just pull out the first record. You need to do an rdd.count() or rdd.saveAs*Files to trigger the complete pipeline. How many partitions do you see in the last stage? Thanks Best Regards On Tue, Aug 4, 2015 at 7:10 AM, ayan guha
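
For illustration (a minimal sketch; expensiveTransform is a hypothetical function standing in for the real pipeline):

    val rdd = sc.textFile("/data/input").map(expensiveTransform)

    rdd.take(1)                         // only computes enough partitions to return 1 record
    rdd.count()                         // runs the whole pipeline over every partition
    rdd.saveAsTextFile("/data/output")  // also triggers the complete computation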

Re: Twitter Connector-Spark Streaming

2015-08-04 Thread Akhil Das
that I want to ask is that I have used Twitter's streaming API, and it seems that the above solution uses the REST API. How can I use both simultaneously? Any response will be much appreciated :) Regards On Tue, Aug 4, 2015 at 1:51 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Yes you can

Re: spark cluster setup

2015-08-03 Thread Akhil Das
Are you sitting behind a firewall and accessing a remote master machine? In that case, have a look at this http://spark.apache.org/docs/latest/configuration.html#networking; you might want to fix a few properties like spark.driver.host, spark.driver.port, etc. Thanks Best Regards On Mon, Aug 3,
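
A sketch of the kind of settings meant here (the address and port are placeholders for your environment):

    val conf = new SparkConf()
      .set("spark.driver.host", "203.0.113.10") // an address the cluster can reach the driver on
      .set("spark.driver.port", "7078")         // a fixed port you can open in the firewall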

Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Akhil Das
Currently RDDs are not encrypted. I think you can go ahead and open a JIRA to add this feature, and maybe in a future release it could be added. Thanks Best Regards On Fri, Jul 31, 2015 at 1:47 PM, Matthew O'Reilly moreill...@qub.ac.uk wrote: Hi, I am currently working on the latest version of

Re: Does Spark Streaming need to list all the files in a directory?

2015-08-02 Thread Akhil Das
I guess it goes through those 500k files https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L193 for the first time and then uses a filter from the next time. Thanks Best Regards On Fri, Jul 31, 2015 at 4:39 AM, Tathagata Das

Re: unsubscribe

2015-08-02 Thread Akhil Das
LOL Brandon! @ziqiu See http://spark.apache.org/community.html You need to send an email to user-unsubscr...@spark.apache.org Thanks Best Regards On Fri, Jul 31, 2015 at 2:06 AM, Brandon White bwwintheho...@gmail.com wrote: https://www.youtube.com/watch?v=JncgoPKklVE On Thu, Jul 30, 2015

Re: Twitter Connector-Spark Streaming

2015-07-30 Thread Akhil Das
specific to my account? Thanks in anticipation :) On Thu, Jul 30, 2015 at 6:17 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Owh, this one fetches the public tweets, not the one specific to your account. Thanks Best Regards On Thu, Jul 30, 2015 at 6:11 PM, Sadaf Khan sa

Re: Spark SQL Error

2015-07-30 Thread Akhil Das
It seems an issue with the ES connector https://github.com/elastic/elasticsearch-hadoop/issues/482 Thanks Best Regards On Tue, Jul 28, 2015 at 6:14 AM, An Tran tra...@gmail.com wrote: Hello all, I am currently having an error with Spark SQL accessing Elasticsearch using the Elasticsearch Spark

Re: streaming issue

2015-07-30 Thread Akhil Das
What operation are you doing with streaming? Also, can you look in the datanode logs and see what's going on? Thanks Best Regards On Tue, Jul 28, 2015 at 8:18 AM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote: Hi, I got an error when running spark streaming as below.

Re: Spark and Speech Recognition

2015-07-30 Thread Akhil Das
Like this? val data = sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls => speachRecognizer(urls)) Let 24 be the total number of cores that you have on all the workers. Thanks Best Regards On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf opus...@gmail.com wrote: Hello, I am writing a

Re: Spark build/sbt assembly

2015-07-30 Thread Akhil Das
Did you try removing this jar? build/sbt-launch-0.13.7.jar Thanks Best Regards On Tue, Jul 28, 2015 at 12:08 AM, Rahul Palamuttam rahulpala...@gmail.com wrote: Hi All, I hope this is the right place to post troubleshooting questions. I've been following the install instructions and I get

Re: Heatmap with Spark Streaming

2015-07-30 Thread Akhil Das
You can easily push data to an intermediate storage from spark streaming (like HBase or a SQL/NoSQL DB etc) and then power your dashboards with d3.js. Thanks Best Regards On Tue, Jul 28, 2015 at 12:18 PM, UMESH CHAUDHARY umesh9...@gmail.com wrote: I have just started using Spark Streaming and

Re: sc.parallelize(512k items) doesn't always use 64 executors

2015-07-30 Thread Akhil Das
sc.parallelize takes a second parameter which is the total number of partitions; are you using that? Thanks Best Regards On Wed, Jul 29, 2015 at 9:27 PM, Kostas Kougios kostas.koug...@googlemail.com wrote: Hi, I do an sc.parallelize with a list of 512k items. But sometimes not all executors
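
For example (a minimal sketch; the partition count is arbitrary):

    val items = (1 to 512000).toList
    // Without the second argument Spark picks a default partition count;
    // passing it explicitly spreads the collection across all executors.
    val rdd = sc.parallelize(items, 128)
    println(rdd.partitions.length) // 128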

Re: Heatmap with Spark Streaming

2015-07-30 Thread Akhil Das
You can integrate it with any language (like php) and use ajax calls to update the charts. Thanks Best Regards On Thu, Jul 30, 2015 at 2:11 PM, UMESH CHAUDHARY umesh9...@gmail.com wrote: Thanks For the suggestion Akhil! I looked at https://github.com/mbostock/d3/wiki/Gallery to know more

Re: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client

2015-07-28 Thread Akhil Das
) at java.lang.Thread.run(Thread.java:745) From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Tuesday, July 28, 2015 2:30 PM To: Manohar Reddy Cc: user@spark.apache.org Subject: Re: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client You need to trigger an action on your

Re: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client

2015-07-28 Thread Akhil Das
Put a try catch inside your code and inside the catch print out the length or the list itself which causes the ArrayIndexOutOfBounds. It might happen that some of your data is not proper. Thanks Best Regards On Mon, Jul 27, 2015 at 8:24 PM, Manohar753 manohar.re...@happiestminds.com wrote: Hi
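
Something along these lines (a sketch; the comma-splitting is a stand-in for whatever your map actually does):

    val parsed = rdd.map { line =>
      try {
        line.split(",")(1) // the access that currently throws
      } catch {
        case e: ArrayIndexOutOfBoundsException =>
          // Print the offending record so the malformed data can be spotted
          println(s"Bad record: $line")
          ""
      }
    }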

Re: ReceiverStream SPARK not able to cope up with 20,000 events /sec .

2015-07-28 Thread Akhil Das
You need to find the bottleneck here; it could be your network (if the data is huge) or your producer code isn't pushing at 20k/s. If you are able to produce at 20k/s then make sure you are able to receive at that rate (try it without spark). Thanks Best Regards On Sat, Jul 25, 2015 at 3:29 PM,

Re: use S3-Compatible Storage with spark

2015-07-28 Thread Akhil Das
With s3n try this out: s3service.s3-endpoint: The host name of the S3 service. You should only ever change this value from the default if you need to contact an alternative S3 endpoint for testing purposes. Default: s3.amazonaws.com Thanks Best Regards On Tue, Jul 28, 2015 at 1:54 PM, Schmirr
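
Assuming the standard jets3t setup that backs s3n, that property goes in a jets3t.properties file on the classpath (a sketch; the endpoint host is a placeholder):

    # jets3t.properties, placed on the driver/executor classpath
    s3service.s3-endpoint=storage.example.com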

Re: Question abt serialization

2015-07-28 Thread Akhil Das
Did you try it with just: (comment out line 27) println "Count of spark: " + file.filter({s -> s.contains('spark')}).count() Thanks Best Regards On Sun, Jul 26, 2015 at 12:43 AM, tog guillaume.all...@gmail.com wrote: Hi I have been using Spark for quite some time using either scala or python.

Re: Multiple operations on same DStream in Spark Streaming

2015-07-28 Thread Akhil Das
One approach would be to store the batch data in an intermediate storage (like HBase/MySQL or even in zookeeper), and inside your filter function you just go and read the previous value from this storage and do whatever operation that you are supposed to do. Thanks Best Regards On Sun, Jul 26,

Re: Getting java.net.BindException when attempting to start Spark master on EC2 node with public IP

2015-07-28 Thread Akhil Das
Did you try binding to 0.0.0.0? Thanks Best Regards On Mon, Jul 27, 2015 at 10:37 PM, Wayne Song wayne.e.s...@gmail.com wrote: Hello, I am trying to start a Spark master for a standalone cluster on an EC2 node. The CLI command I'm using looks like this: Note that I'm specifying the

Re: java.lang.ArrayIndexOutOfBoundsException: 0 on Yarn Client

2015-07-28 Thread Akhil Das
You need to trigger an action on your rowrdd for it to execute the map; you can do a rowrdd.count() for that. Thanks Best Regards On Tue, Jul 28, 2015 at 2:18 PM, Manohar Reddy manohar.re...@happiestminds.com wrote: Hi Akhil, Thanks for the reply. I found the root cause but don’t know how

Re: use S3-Compatible Storage with spark

2015-07-27 Thread Akhil Das
) 2015-07-27 11:17 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com: So you are able to access your AWS S3 with s3a now? What is the error that you are getting when you try to access the custom storage with fs.s3a.endpoint? Thanks Best Regards On Mon, Jul 27, 2015 at 2:44 PM, Schmirr Wurst

Re: suggest coding platform

2015-07-27 Thread Akhil Das
How about IntelliJ? It also has a Terminal tab. Thanks Best Regards On Fri, Jul 24, 2015 at 6:06 PM, saif.a.ell...@wellsfargo.com wrote: Hi all, I tried Notebook Incubator Zeppelin, but I am not completely happy with it. What do you people use for coding? Anything with auto-complete,

Re: Encryption on RDDs or in-memory on Apache Spark

2015-07-27 Thread Akhil Das
Have a look at the current security support https://spark.apache.org/docs/latest/security.html, Spark does not have any encryption support for objects in memory out of the box. But if your concern is to protect the data being cached in memory, then you can easily encrypt your objects in memory

Re: Spark - Eclipse IDE - Maven

2015-07-27 Thread Akhil Das
You can follow this doc https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup Thanks Best Regards On Fri, Jul 24, 2015 at 10:56 AM, Siva Reddy ksiv...@gmail.com wrote: Hi All, I am trying to setup the Eclipse (LUNA) with Maven so that I

Re: ERROR TaskResultGetter: Exception while getting task result when reading avro files that contain arrays

2015-07-27 Thread Akhil Das
It's a serialization error with nested schema, I guess. You can look at Twitter's chill-avro serializer library. Here are two discussions on the same: - https://issues.apache.org/jira/browse/SPARK-3447 -

Re: java.lang.NoSuchMethodError for list.toMap.

2015-07-27 Thread Akhil Das
What's in your build.sbt? It seems you could be messing up the Scala version. Thanks Best Regards On Fri, Jul 24, 2015 at 2:15 AM, Dan Dong dongda...@gmail.com wrote: Hi, When I ran with spark-submit the following simple Spark program of: import org.apache.spark.SparkContext._ import
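
A build.sbt that keeps the Scala version aligned with the Spark build might look like this (a sketch for the Spark 1.4.x era; adjust versions to match your cluster):

    name := "my-spark-app"

    scalaVersion := "2.10.4" // must match the Scala version your Spark binaries were built with

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" % "provided"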

Re: spark dataframe gc

2015-07-27 Thread Akhil Das
This spark.shuffle.sort.bypassMergeThreshold might help. You could also try switching the shuffle manager from sort to hash. You can see more configuration options here https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior. Thanks Best Regards On Fri, Jul 24, 2015 at 3:33
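
For instance (a sketch; the threshold value is arbitrary):

    val conf = new SparkConf()
      .set("spark.shuffle.manager", "hash")                  // switch from the default sort
      .set("spark.shuffle.sort.bypassMergeThreshold", "400") // or raise the bypass threshold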

Re: ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

2015-07-27 Thread Akhil Das
For each of your jobs, you can pass spark.ui.port to bind to a different port. Thanks Best Regards On Fri, Jul 24, 2015 at 7:49 PM, Joji John jj...@ebates.com wrote: Thanks Ajay. The way we wrote our spark application is that we have a generic python code, multiple instances of which can
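
For example, each instance can pick its own port before creating its context (a minimal sketch; 4050 is arbitrary):

    val conf = new SparkConf().set("spark.ui.port", "4050") // a free port per job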

Re: use S3-Compatible Storage with spark

2015-07-27 Thread Akhil Das
? 2015-07-20 18:11 GMT+02:00 Schmirr Wurst schmirrwu...@gmail.com: Thanks, that is what I was looking for... Any Idea where I have to store and reference the corresponding hadoop-aws-2.6.0.jar ?: java.io.IOException: No FileSystem for scheme: s3n 2015-07-20 8:33 GMT+02:00 Akhil Das

Re: Writing binary files in Spark

2015-07-25 Thread Akhil Das
alternative from Python? And also, I want to write the raw bytes of my object into files on disk, and not using some Serialization format to be read back into Spark. Is it possible? Any alternatives for that? Thanks, Oren On Thu, Jul 23, 2015 at 8:04 PM Akhil Das ak...@sigmoidanalytics.com

Re: How to restart Twitter spark stream

2015-07-24 Thread Akhil Das
Best Regards On Fri, Jul 24, 2015 at 11:25 AM, Zoran Jeremic zoran.jere...@gmail.com wrote: Hi Akhil, Thank you for sending this code. My apologize if I will ask something that is obvious here, since I'm newbie in Scala, but I still don't see how I can use this code. Maybe my original

Re: Using Dataframe write with newHdoopApi

2015-07-24 Thread Akhil Das
PM, ayan guha guha.a...@gmail.com wrote: Hi Akhil Thanks.Will definitely take a look. Couple of questions 1. Is it possible to use newHadoopAPI from dataframe.write or saveAs? 2. is esDF usable rom Python? On Fri, Jul 24, 2015 at 2:29 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Did

Re: What if request cores are not satisfied

2015-07-24 Thread Akhil Das
I guess it would wait for some time and throw up something like this: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory Thanks Best Regards On Thu, Jul 23, 2015 at 7:53 AM, bit1...@163.com bit1...@163.com wrote:

Re: writing/reading multiple Parquet files: Failed to merge incompatible data types StringType and StructType

2015-07-23 Thread Akhil Das
Currently, the only way for you would be to create a proper schema for the data. This is not a bug, but you could open a JIRA for the feature (since this would help others solve similar use-cases), and in a future version it could be implemented and included. Thanks Best Regards On Tue, Jul 21,

Re: spark thrift server supports timeout?

2015-07-23 Thread Akhil Das
Here are a few more configurations https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-ConfigurationPropertiesinthehive-site.xmlFile though I can't find anything on the timeouts. Thanks Best Regards On Wed, Jul 22, 2015 at 1:01 AM, Judy Nash

Re: 1.4.0 classpath issue with spark-submit

2015-07-23 Thread Akhil Das
You can try adding that jar to SPARK_CLASSPATH (it's deprecated though) in the spark-env.sh file. Thanks Best Regards On Tue, Jul 21, 2015 at 7:34 PM, Michal Haris michal.ha...@visualdna.com wrote: I have a spark program that uses dataframes to query hive and I run it both as a spark-shell for

Re: NullPointerException inside RDD when calling sc.textFile

2015-07-23 Thread Akhil Das
Did you try: val data = indexed_files.groupByKey val modified_data = data.map { a => val name = a._2.mkString(","); (a._1, name) } modified_data.foreach { a => val file = sc.textFile(a._2); println(file.count) } Thanks Best Regards On Wed, Jul 22, 2015 at 2:18 AM, MorEru

Re: problems running Spark on a firewalled remote YARN cluster via SOCKS proxy

2015-07-23 Thread Akhil Das
It looks like it's picking up the wrong namenode URI from the HADOOP_CONF_DIR; make sure it is proper. Also, for submitting a spark job to a remote cluster, you might want to look at the spark.driver.host and spark.driver.port properties. Thanks Best Regards On Wed, Jul 22, 2015 at 8:56 PM, rok

Re: Using Dataframe write with newHdoopApi

2015-07-23 Thread Akhil Das
Did you happened to look into esDF https://github.com/elastic/elasticsearch-hadoop/issues/441? You can open an issue over here if that doesn't solve your problem https://github.com/elastic/elasticsearch-hadoop/issues Thanks Best Regards On Tue, Jul 21, 2015 at 5:33 PM, ayan guha

Re: Writing binary files in Spark

2015-07-23 Thread Akhil Das
You can look into saveAsObjectFile. Thanks Best Regards On Thu, Jul 23, 2015 at 8:44 PM, Oren Shpigel o...@yowza3d.com wrote: Hi, I use Spark to read binary files using SparkContext.binaryFiles(), and then do some calculations, processing, and manipulations to get new objects (also

Re: How to restart Twitter spark stream

2015-07-22 Thread Akhil Das
,"#android".toLowerCase, "#iphone".toLowerCase)) val newRDD = samplehashtags.map { x => (x, 1) } val joined = newRDD.join(rdd) joined }) filteredStream.print() Thanks Best Regards On Wed, Jul 22, 2015 at 3:58 AM, Zoran Jeremic zoran.jere...@gmail.com wrote: Hi Akhil and Jorn, I

Re: use S3-Compatible Storage with spark

2015-07-21 Thread Akhil Das
where I have to store and reference the corresponding hadoop-aws-2.6.0.jar ?: java.io.IOException: No FileSystem for scheme: s3n 2015-07-20 8:33 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com: Not in the uri, but in the hadoop configuration you can specify it. <property> <name>fs.s3a.endpoint</name>

Re: spark streaming 1.3 issues

2015-07-21 Thread Akhil Das
I'd suggest upgrading to 1.4 as it has better metrics and UI. Thanks Best Regards On Mon, Jul 20, 2015 at 7:01 PM, Shushant Arora shushantaror...@gmail.com wrote: Is coalesce not applicable to kafkaStream? How do I coalesce on kafkadirectstream, since it's not there in the api? Shall calling

Re: Apache Spark : spark.eventLog.dir on Windows Environment

2015-07-21 Thread Akhil Das
Do you have HADOOP_HOME, HADOOP_CONF_DIR and hadoop's winutils.exe in the environment? Thanks Best Regards On Mon, Jul 20, 2015 at 5:45 PM, nitinkalra2000 nitinkalra2...@gmail.com wrote: Hi All, I am working on Spark 1.4 on windows environment. I have to set eventLog directory so that I can

Re: What is the correct syntax of using Spark streamingContext.fileStream()?

2015-07-21 Thread Akhil Das
Here are two ways of doing that: Without the filter function: JavaPairDStream<String, String> foo = ssc.<String, String, SequenceFileInputFormat>fileStream("/tmp/foo"); With the filter function: JavaPairInputDStream<LongWritable, Text> foo = ssc.fileStream("/tmp/foo", LongWritable.class,

Re: use S3-Compatible Storage with spark

2015-07-21 Thread Akhil Das
("fs.s3n.endpoint", "test.com") And I continue to get my data from amazon, how could it be? (I also use s3n in my text url) 2015-07-21 9:30 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com: You can add the jar in the classpath, and you can set the property like: sc.hadoopConfiguration.set("fs.s3a.endpoint

Re: Apache Spark : spark.eventLog.dir on Windows Environment

2015-07-21 Thread Akhil Das
...@gmail.com wrote: Hi Akhil, I don't have HADOOP_HOME or HADOOP_CONF_DIR and even winutils.exe ? What's the configuration required for this ? From where can I get winutils.exe ? Thanks and Regards, Nitin Kalra On Tue, Jul 21, 2015 at 1:30 PM, Akhil Das ak...@sigmoidanalytics.com wrote

Re: k-means iteration not terminate

2015-07-21 Thread Akhil Das
It could be a GC pause or something; you need to check in the stages tab and see what is taking time. If you upgrade to Spark 1.4, it has a better UI and DAG visualization which helps you debug better. Thanks Best Regards On Mon, Jul 20, 2015 at 8:21 PM, Pa Rö paul.roewer1...@googlemail.com wrote:

Re: use S3-Compatible Storage with spark

2015-07-20 Thread Akhil Das
) is assumed. </description> </property> Thanks Best Regards On Sun, Jul 19, 2015 at 9:13 PM, Schmirr Wurst schmirrwu...@gmail.com wrote: I want to use pithos, where do I specify that endpoint, is it possible in the url? 2015-07-19 17:22 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com

Re: Exception while triggering spark job from remote jvm

2015-07-20 Thread Akhil Das
Just make sure there is no firewall/network blocking the requests as its complaining about timeout. Thanks Best Regards On Mon, Jul 20, 2015 at 1:14 AM, ankit tyagi ankittyagi.mn...@gmail.com wrote: Just to add more information. I have checked the status of this file, not a single block is

Re: How to restart Twitter spark stream

2015-07-20 Thread Akhil Das
Jorn meant something like this: val filteredStream = twitterStream.transform(rdd => { val newRDD = scc.sc.textFile("/this/file/will/be/updated/frequently").map(x => (x, 1)) rdd.join(newRDD) }) newRDD will work like a filter when you do the join. Thanks Best Regards On Sun, Jul 19, 2015 at 9:32

Re: use S3-Compatible Storage with spark

2015-07-19 Thread Akhil Das
Could you name the storage service that you are using? Most of them provide an S3-like REST API endpoint for you to hit. Thanks Best Regards On Fri, Jul 17, 2015 at 2:06 PM, Schmirr Wurst schmirrwu...@gmail.com wrote: Hi, I wonder how to use S3-compatible storage in Spark? If I'm using

Re: Spark APIs memory usage?

2015-07-19 Thread Akhil Das
. (no matrices loaded), Same exception is coming. Can anyone tell what createDataFrame does internally? Are there any alternatives for it? On Fri, Jul 17, 2015 at 6:43 PM, Akhil Das ak...@sigmoidanalytics.com wrote: I suspect its the numpy filling up Memory. Thanks Best Regards On Fri

Re: streaming and piping to R, sending all data in window to pipe()

2015-07-19 Thread Akhil Das
Did you try inputs.repartition(1).foreachRDD(..)? Thanks Best Regards On Fri, Jul 17, 2015 at 9:51 PM, PAULI, KEVIN CHRISTIAN [AG-Contractor/1000] kevin.christian.pa...@monsanto.com wrote: Spark newbie here, using Spark 1.3.1. I’m consuming a stream and trying to pipe the data from the

Re: Spark APIs memory usage?

2015-07-17 Thread Akhil Das
Can you paste the code? How much memory does your system have and how big is your dataset? Did you try df.persist(StorageLevel.MEMORY_AND_DISK)? Thanks Best Regards On Fri, Jul 17, 2015 at 5:14 PM, Harit Vishwakarma harit.vishwaka...@gmail.com wrote: Thanks, Code is running on a single

Re: Spark APIs memory usage?

2015-07-17 Thread Akhil Das
= sqlCtx.createDataFrame(rdd2) 4. df.save() # in parquet format It throws exception in createDataFrame() call. I don't know what exactly it is creating ? everything in memory? or can I make it to persist simultaneously while getting created. Thanks On Fri, Jul 17, 2015 at 5:16 PM, Akhil Das ak

Re: DataFrame InsertIntoJdbc() Runtime Exception on cluster

2015-07-16 Thread Akhil Das
Which version of spark are you using? insertIntoJDBC is deprecated (from 1.4.0), you may use write.jdbc() instead. Thanks Best Regards On Wed, Jul 15, 2015 at 2:43 PM, Manohar753 manohar.re...@happiestminds.com wrote: Hi All, Am trying to add few new rows for existing table in mysql using
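
A sketch of the write.jdbc() call, with placeholder connection details:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "myuser")
    props.setProperty("password", "mypassword")

    // Appends the DataFrame's rows to an existing MySQL table
    df.write.mode("append").jdbc("jdbc:mysql://dbhost:3306/mydb", "mytable", props)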

Re: Job aborted due to stage failure: Task not serializable:

2015-07-16 Thread Akhil Das
Did you try this? val out = lines.filter(xx => { val y = xx; val x = broadcastVar.value; var flag: Boolean = false; for (a <- x) { if (y.contains(a)) flag = true }; flag }) Thanks Best Regards On Wed, Jul 15, 2015 at 8:10 PM, Naveen Dabas naveen.u...@ymail.com wrote: I

Re: Spark cluster read local files

2015-07-16 Thread Akhil Das
Yes you can do that, just make sure you rsync the same file to the same location on every machine. Thanks Best Regards On Thu, Jul 16, 2015 at 5:50 AM, Julien Beaudan jbeau...@stottlerhenke.com wrote: Hi all, Is it possible to use Spark to assign each machine in a cluster the same task, but

Re: Spark on EMR with S3 example (Python)

2015-07-15 Thread Akhil Das
I think any requests going to s3*:// require the credentials. If they have made it public (via http) then you won't require the keys. Thanks Best Regards On Wed, Jul 15, 2015 at 2:26 AM, Pagliari, Roberto rpagli...@appcomsci.com wrote: Hi Sujit, I just wanted to access public datasets on

Re: Research ideas using spark

2015-07-15 Thread Akhil Das
Try to repartition it to a higher number (at least 3-4 times the total # of cpu cores). What operation are you doing? It may happen that if you are doing a join/groupBy sort of operation, the task which is taking time has all the values; in that case you need to use a Partitioner which will

Re: Spark Intro

2015-07-14 Thread Akhil Das
This is where you can get started https://spark.apache.org/docs/latest/sql-programming-guide.html Thanks Best Regards On Mon, Jul 13, 2015 at 3:54 PM, vinod kumar vinodsachin...@gmail.com wrote: Hi Everyone, I am developing application which handles bulk of data around millions(This may

Re: Spark executor memory information

2015-07-14 Thread Akhil Das
1. Yes, open up the web UI running on 8080 to see the memory/cores allocated to your workers, and open up the UI running on 4040 and click on the Executor tab to see the memory allocated for the executor. 2. MLlib code can be found over here https://github.com/apache/spark/tree/master/mllib and

Re: Does Spark Streaming support streaming from a database table?

2015-07-14 Thread Akhil Das
Why not add a trigger to your database table, and whenever it's updated, push the changes to Kafka etc. and use normal sparkstreaming? You can also write a receiver-based architecture https://spark.apache.org/docs/latest/streaming-custom-receivers.html for this, but that will be a bit time consuming.
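
A skeleton of the receiver-based approach from that guide (a sketch; pollTable is a hypothetical function that queries your table for rows added since the last poll):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class TableReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
      def onStart() {
        new Thread("Table Poller") {
          override def run() {
            while (!isStopped) {
              pollTable().foreach(store) // push each new row into the stream
              Thread.sleep(5000)         // poll interval
            }
          }
        }.start()
      }
      def onStop() {}
      private def pollTable(): Seq[String] = Seq.empty // hypothetical: fetch new rows
    }

Hook it up with ssc.receiverStream(new TableReceiver).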

Re: java.lang.IllegalStateException: unread block data

2015-07-14 Thread Akhil Das
Look in the worker logs and see what's going on. Thanks Best Regards On Tue, Jul 14, 2015 at 4:02 PM, Arthur Chan arthur.hk.c...@gmail.com wrote: Hi, I use Spark 1.4. When saving the model to HDFS, I got an error. Please help! Regards my scala command:

Re: Spark Intro

2015-07-14 Thread Akhil Das
have more memory, also if you have enough cores 4 records are nothing. Thanks Best Regards On Tue, Jul 14, 2015 at 3:09 PM, vinod kumar vinodsachin...@gmail.com wrote: Hi Akhil Is my choice to switch to spark is good? because I don't have enough information regards limitation and working

Re: java.lang.IllegalStateException: unread block data

2015-07-14 Thread Akhil Das
Someone else also reported this error with spark 1.4.0 Thanks Best Regards On Tue, Jul 14, 2015 at 6:57 PM, Arthur Chan arthur.hk.c...@gmail.com wrote: Hi, Below is the log form the worker. 15/07/14 17:18:56 ERROR FileAppender: Error writing stream to file

Re: hive-site.xml spark1.3

2015-07-14 Thread Akhil Das
Try adding it in your SPARK_CLASSPATH inside conf/spark-env.sh file. Thanks Best Regards On Tue, Jul 14, 2015 at 7:05 AM, Jerrick Hoang jerrickho...@gmail.com wrote: Hi all, I'm having conf/hive-site.xml pointing to my Hive metastore but sparksql CLI doesn't pick it up. (copying the same

Re: Standalone mode connection failure from worker node to master

2015-07-14 Thread Akhil Das
Can you paste your conf/spark-env.sh file? Put SPARK_MASTER_IP as the master machine's host name in spark-env.sh file. Also add your slaves hostnames into conf/slaves file and do a sbin/start-all.sh Thanks Best Regards On Tue, Jul 14, 2015 at 1:26 PM, sivarani whitefeathers...@gmail.com wrote:
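
A sketch of the two files (the hostnames are placeholders):

    # conf/spark-env.sh
    SPARK_MASTER_IP=master-host

    # conf/slaves -- one worker hostname per line
    worker-host-1
    worker-host-2

Then run sbin/start-all.sh on the master.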

Re: Caching in spark

2015-07-13 Thread Akhil Das
wrote: Hi Akhil, It's interesting if RDDs are stored internally in a columnar format as well? Or it is only when an RDD is cached in SQL context, it is converted to columnar format. What about data frames? Thanks! -- Ruslan Dautkhanov On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das ak

Re: Master vs. Slave Nodes Clarification

2015-07-13 Thread Akhil Das
You are a bit confused about master node, slave node and the driver machine. 1. Master node can be kept as a smaller machine in your dev environment, mostly in production you will be using Mesos or Yarn cluster manager. 2. Now, if you are running your driver program (the streaming job) on the

Re: Spark Standalone Mode not working in a cluster

2015-07-13 Thread Akhil Das
Just make sure you are having the same installation of spark-1.4.0-bin-hadoop2.6 everywhere. (including the slaves, master, and from where you start the spark-shell). Thanks Best Regards On Mon, Jul 13, 2015 at 4:34 AM, Eduardo erocha@gmail.com wrote: My installation of spark is not

Re: Starting Spark-Application without explicit submission to cluster?

2015-07-12 Thread Akhil Das
Yes, that is correct. You can use this boilerplate to avoid spark-submit. //The configurations val sconf = new SparkConf() .setMaster("spark://spark-ak-master:7077") .setAppName("SigmoidApp") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

Re: Issues when combining Spark and a third party java library

2015-07-12 Thread Akhil Das
Did you try setting the HADOOP_CONF_DIR? Thanks Best Regards On Sat, Jul 11, 2015 at 3:17 AM, maxdml maxdemou...@gmail.com wrote: Also, it's worth noting that I'm using the prebuilt version for hadoop 2.4 and higher from the official website. -- View this message in context:

Re: Linear search between particular log4j log lines

2015-07-12 Thread Akhil Das
Can you not use sc.wholeTextFiles() and use a custom parser or a regex to extract the TransactionIDs? Thanks Best Regards On Sat, Jul 11, 2015 at 8:18 AM, ssbiox sergey.korytni...@gmail.com wrote: Hello, I have a very specific question on how to do a search between particular lines of

Re: Worker dies with java.io.IOException: Stream closed

2015-07-12 Thread Akhil Das
Can you dig a bit more in the worker logs? Also make sure that spark has permission to write to /opt/ on that machine, as it's the one machine always throwing up. Thanks Best Regards On Sat, Jul 11, 2015 at 11:18 PM, gaurav sharma sharmagaura...@gmail.com wrote: Hi All, I am facing this issue in

Re: query on Spark + Flume integration using push model

2015-07-10 Thread Akhil Das
Here's an example https://github.com/przemek1990/spark-streaming Thanks Best Regards On Thu, Jul 9, 2015 at 4:35 PM, diplomatic Guru diplomaticg...@gmail.com wrote: Hello all, I'm trying to configure the flume to push data into a sink so that my stream job could pick up the data. My events

Re: Accessing Spark Web UI from another place than where the job actually ran

2015-07-10 Thread Akhil Das
When you connect to the machines you can create an ssh tunnel to access the UI : ssh -L 8080:127.0.0.1:8080 MasterMachinesIP And then you can simply open localhost:8080 in your browser and it should show up the UI. Thanks Best Regards On Thu, Jul 9, 2015 at 7:44 PM, rroxanaioana

Re: DataFrame insertInto fails, saveAsTable works (Azure HDInsight)

2015-07-10 Thread Akhil Das
It seems to be an issue with Azure; there was a discussion over here https://azure.microsoft.com/en-in/documentation/articles/hdinsight-hadoop-spark-install/ Thanks Best Regards On Thu, Jul 9, 2015 at 9:42 PM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I'm running Spark 1.4 on

Re: Caching in spark

2015-07-10 Thread Akhil Das
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory Thanks Best Regards On Fri, Jul 10, 2015 at 10:05 AM, vinod kumar vinodsachin...@gmail.com wrote: Hi Guys, Can any one please share me how to use caching feature of spark via spark sql queries? -Vinod
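
In short (a minimal sketch; the table name is a placeholder):

    sqlContext.cacheTable("people")      // caches in the in-memory columnar format
    sqlContext.sql("CACHE TABLE people") // the same, from SQL
    sqlContext.uncacheTable("people")    // release it when done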

Re: SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use when running spark-shell

2015-07-10 Thread Akhil Das
That's because sc is already initialized. You can do sc.stop() before you initialize another one. Thanks Best Regards On Fri, Jul 10, 2015 at 3:54 PM, Prateek . prat...@aricent.com wrote: Hi, I am running single spark-shell but observing this error when I give val sc = new

Re: Job completed successfully without processing anything

2015-07-09 Thread Akhil Das
Looks like a configuration problem with your spark setup; are you running the driver on a different network? Can you try a simple program from spark-shell and make sure your setup is proper? (like sc.parallelize(1 to 1000).collect()) Thanks Best Regards On Thu, Jul 9, 2015 at 1:02 AM, ÐΞ€ρ@Ҝ

Re: Connecting to nodes on cluster

2015-07-09 Thread Akhil Das
On Wed, Jul 8, 2015 at 7:31 PM, Ashish Dutt ashish.du...@gmail.com wrote: Hi, We have a cluster with 4 nodes. The cluster uses CDH 5.4. For the past two days I have been trying to connect my laptop to the server using spark master ip:port but it's been unsuccessful. The server contains data

Re: Is there a way to shutdown the derby in hive context in spark shell?

2015-07-09 Thread Akhil Das
Did you try sc.stop() and creating a new one? Thanks Best Regards On Wed, Jul 8, 2015 at 8:12 PM, Terry Hole hujie.ea...@gmail.com wrote: I am using spark 1.4.1rc1 with default hive settings Thanks - Terry Hi All, I'd like to use the hive context in spark shell, i need to recreate the

Re: What does RDD lineage refer to ?

2015-07-09 Thread Akhil Das
Yes, just to add, see the following scenario of RDD lineage: RDD1 -> RDD2 -> RDD3 -> RDD4. Here RDD2 depends on RDD1's output, and the lineage goes till RDD4. Now, if for some reason RDD3 is lost, spark will recompute it from RDD2. Thanks Best Regards On Thu, Jul 9, 2015 at 5:51 AM, canan

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Akhil Das
Have a look http://alvinalexander.com/scala/how-to-create-java-thread-runnable-in-scala, create two threads and call thread1.start(), thread2.start() Thanks Best Regards On Wed, Jul 8, 2015 at 1:06 PM, Ashish Dutt ashish.du...@gmail.com wrote: Thanks for your reply Akhil. How do you
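
Applied to the poster's two-table case, that looks roughly like this (a sketch; loadTable1/loadTable2 are the poster's own functions):

    val thread1 = new Thread(new Runnable { def run() { loadTable1() } })
    val thread2 = new Thread(new Runnable { def run() { loadTable2() } })
    thread1.start(); thread2.start() // both loads proceed concurrently
    thread1.join(); thread2.join()   // wait for both to finish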

Re: unable to bring up cluster with ec2 script

2015-07-08 Thread Akhil Das
It's showing connection refused; for some reason it was not able to connect to the machine. Either it's the machine's start-up time or it's the security group. Thanks Best Regards On Wed, Jul 8, 2015 at 2:04 AM, Pagliari, Roberto rpagli...@appcomsci.com wrote: I'm following the tutorial

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Akhil Das
What's the point of creating them in parallel? You can multi-thread it and run them in parallel though. Thanks Best Regards On Wed, Jul 8, 2015 at 5:34 AM, Brandon White bwwintheho...@gmail.com wrote: Say I have a spark job that looks like the following: def loadTable1() { val table1 =

Re: Master doesn't start, no logs

2015-07-07 Thread Akhil Das
Strange. What are you having in $SPARK_MASTER_IP? It may happen that it is not able to bind to the given ip but again it should be in the logs. Thanks Best Regards On Tue, Jul 7, 2015 at 12:54 AM, maxdml maxdemou...@gmail.com wrote: Hi, I've been compiling spark 1.4.0 with SBT, from the

Re: How to debug java.io.OptionalDataException issues

2015-07-07 Thread Akhil Das
Did you try kryo? Wrap everything with kryo and see if you are still hitting the exception. (At least you could see a different exception stack). Thanks Best Regards On Tue, Jul 7, 2015 at 6:05 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, suffering from a pretty strange issue:

Re: How to solve ThreadException in Apache Spark standalone Java Application

2015-07-07 Thread Akhil Das
Can you try adding sc.stop at the end of your program? It looks like it's having a hard time closing off the SparkContext. Thanks Best Regards On Tue, Jul 7, 2015 at 4:08 PM, Hafsa Asif hafsa.a...@matchinguu.com wrote: Hi, I run the following simple Java spark standalone app with maven command

Re: How to implement top() and filter() on object List for JavaRDD

2015-07-07 Thread Akhil Das
Here's a simplified example: SparkConf conf = new SparkConf().setAppName("Sigmoid").setMaster("local"); JavaSparkContext sc = new JavaSparkContext(conf); List<String> user = new ArrayList<String>(); user.add("Jack"); user.add("Jill");

Re: Master doesn't start, no logs

2015-07-07 Thread Akhil Das
instances having successively run on the same machine? -- Henri Maxime Demoulin 2015-07-07 4:10 GMT-04:00 Akhil Das ak...@sigmoidanalytics.com: Strange. What are you having in $SPARK_MASTER_IP? It may happen that it is not able to bind to the given ip but again it should be in the logs. Thanks

Re: Spark got stuck with BlockManager after computing connected components using GraphX

2015-07-06 Thread Akhil Das
If you don't want those logs to flood your screen, you can disable them simply with: import org.apache.log4j.{Level, Logger} Logger.getLogger("org").setLevel(Level.OFF) Logger.getLogger("akka").setLevel(Level.OFF) Thanks Best Regards On Sun, Jul 5, 2015 at 7:27 PM, Hellen

Re: cores and resource management

2015-07-06 Thread Akhil Das
Try with spark.cores.max; executor cores are usually used when you run in yarn mode. Thanks Best Regards On Mon, Jul 6, 2015 at 1:22 AM, nizang ni...@windward.eu wrote: hi, We're running spark 1.4.0 on ec2, with 6 machines, 4 cores each. We're trying to run an application on a number of
