Use only latest values

2016-04-09 Thread Daniela S
Hi, I would like to cache values and to use only the latest "valid" values to build a sum. In more detail, I receive values from devices periodically. I would like to add up all the valid values each minute. But not every device sends a new value every minute. And as long as there is no new

Re: Use only latest values

2016-04-09 Thread Sebastian Piu
Have a look at mapWithState if you are using 1.6+ On Sat, 9 Apr 2016, 08:04 Daniela S, wrote: > Hi, > > I would like to cache values and to use only the latest "valid" values to > build a sum. > In more detail, I receive values from devices periodically. I would like > to
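A minimal sketch of the mapWithState approach Sebastian points to (Spark 1.6+), assuming `readings` is a `DStream[(String, Double)]` of (deviceId, value) pairs; the names are illustrative and checkpointing must be enabled on the StreamingContext:

```scala
import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Remember the last value seen for each device; silent devices keep their old value.
def keepLatest(deviceId: String, value: Option[Double], state: State[Double]): Double = {
  val latest = value.getOrElse(state.getOption.getOrElse(0.0))
  state.update(latest)
  latest
}

def sumOfLatest(readings: DStream[(String, Double)]): DStream[Double] =
  readings
    .mapWithState(StateSpec.function(keepLatest _))
    .stateSnapshots()          // (deviceId, latestValue) for every device seen so far
    .map(_._2)
    .reduce(_ + _)             // one sum per batch interval, e.g. per minute
```

With a one-minute batch interval this emits, each minute, the sum over the most recent value of every known device, whether or not that device reported in the current batch.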

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread programminggeek72
Someone please correct me if I am wrong as I am still rather green to spark, however it appears that through the S3 notification mechanism described below, you can publish events to SQS and use SQS as a streaming source into spark. The project at https://github.com/imapi/spark-sqs-receiver

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
This is awesome! I have someplace to start from. Thanks, Ben > On Apr 9, 2016, at 9:45 AM, programminggee...@gmail.com wrote: > > Someone please correct me if I am wrong as I am still rather green to spark, > however it appears that through the S3 notification mechanism described > below,

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Natu Lauchande
Can you elaborate a bit more in your approach using s3 notifications ? Just curious. dealing with a similar issue right now that might benefit from this. On 09 Apr 2016 9:25 AM, "Nezih Yigitbasi" wrote: > While it is doable in Spark, S3 also supports notifications: >

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Nezih Yigitbasi
While it is doable in Spark, S3 also supports notifications: http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande wrote: > Hi Benjamin, > > I have done it . The critical configuration items are the ones below

Re: How Spark handles dead machines during a job.

2016-04-09 Thread Reynold Xin
The driver has the data and wouldn't need to rerun. On Friday, April 8, 2016, Sung Hwan Chung wrote: > Hello, > > Say, that I'm doing a simple rdd.map followed by collect. Say, also, that > one of the executors finish all of its tasks, but there are still other > executors

Spark GUI, Workers and Executors

2016-04-09 Thread Ashok Kumar
On Spark GUI I can see the list of Workers. I always understood that workers are used by executors. What is the relationship between workers and executors please. Is it one to one? Thanks

Re: Use only latest values

2016-04-09 Thread Natu Lauchande
I don't see this happening without a store. You can try Parquet on top of HDFS. This will at least avoid the burden of third-party systems. On 09 Apr 2016 9:04 AM, "Daniela S" wrote: > Hi, > > I would like to cache values and to use only the latest "valid" values to > build a sum.
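A sketch of that Parquet-on-HDFS store idea, assuming each batch of readings is available as a DataFrame `readingsDF` with `deviceId`, `timestamp` and `value` columns; the path and column names are placeholders:

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{max, sum}

// Append each batch to a Parquet directory that acts as the store.
readingsDF.write.mode(SaveMode.Append).parquet("hdfs:///data/device_readings")

// Read the store back and keep only the most recent row per device.
val store    = sqlContext.read.parquet("hdfs:///data/device_readings")
val latestTs = store.groupBy("deviceId").agg(max("timestamp").as("ts"))
val latest   = store.join(latestTs,
  store("deviceId") === latestTs("deviceId") && store("timestamp") === latestTs("ts"))

val total = latest.agg(sum("value")).first()   // sum of the latest value per device
```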

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Natu, Do you know if textFileStream can see if new files are created underneath a whole bucket? For example, if the bucket name is incoming and new files underneath it are 2016/04/09/00/00/01/data.csv and 2016/04/09/00/00/02/data.csv, will these files be picked up? Also, will Spark Streaming

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Nezih, This looks like a good alternative to having the Spark Streaming job check for new files on its own. Do you know if there is a way to have the Spark Streaming job get notified with the new file information and act upon it? This can reduce the overhead and cost of polling S3. Plus, I can

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Nezih Yigitbasi
Natu, Benjamin, With this mechanism you can configure notifications for *buckets* (if you only care about some key prefixes you can take a look at object key name filtering, see the docs) for various event types, and then these events can be published to SNS, SQS or Lambdas. I think using SQS as
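The spark-sqs-receiver project linked earlier isn't shown in the thread, but a custom receiver along these lines is one way to turn such an SQS queue into a stream; a hedged sketch using the AWS Java SDK, with the queue URL and message handling as placeholders:

```scala
import com.amazonaws.services.sqs.AmazonSQSClient
import com.amazonaws.services.sqs.model.ReceiveMessageRequest
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import scala.collection.JavaConverters._

// Polls an SQS queue fed by S3 ObjectCreated notifications and hands each
// message body (JSON describing the new object) to Spark Streaming.
class SqsReceiver(queueUrl: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("sqs-receiver") {
      override def run(): Unit = {
        val sqs = new AmazonSQSClient()   // credentials from the default provider chain
        while (!isStopped()) {
          val request = new ReceiveMessageRequest(queueUrl).withWaitTimeSeconds(20)
          for (msg <- sqs.receiveMessage(request).getMessages.asScala) {
            store(msg.getBody)            // pass the notification JSON downstream
            sqs.deleteMessage(queueUrl, msg.getReceiptHandle)
          }
        }
      }
    }.start()
  }

  def onStop(): Unit = {}   // the polling thread exits once isStopped() returns true
}
```

It would be wired in with `ssc.receiverStream(new SqsReceiver(queueUrl))`, leaving the parsing of the notification JSON to the job.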

Re: Unable run Spark in YARN mode

2016-04-09 Thread Natu Lauchande
How are you trying to run spark ? locally ? spark submit ? On Sat, Apr 9, 2016 at 7:57 AM, maheshmath wrote: > I have set SPARK_LOCAL_IP=127.0.0.1 still getting below error > > 16/04/09 10:36:50 INFO spark.SecurityManager: Changing view acls to: mahesh > 16/04/09 10:36:50

Re: Unable run Spark in YARN mode

2016-04-09 Thread Ted Yu
mahesh: regarding the error ':16: error: not found: value sqlContext', please take a look at: https://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext for how the import should be used. Please include the version of Spark and the command line you used in the reply.
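For the 'not found: value sqlContext' part, a minimal sketch of creating the context by hand in the 1.x shell when startup did not provide it:

```scala
// spark-shell normally creates sqlContext for you; if initialization failed
// partway, it can be built manually from the existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._   // enables .toDF() on RDDs of case classes, etc.
```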

Re: Weird error while serialization

2016-04-09 Thread Gourav Sengupta
Hi, why are you not using data frames and SPARK CSV? Regards, Gourav On Sat, Apr 9, 2016 at 10:00 PM, SURAJ SHETH wrote: > Hi, > I am using Spark 1.5.2 > > The file contains 900K rows each with twelve fields (tab separated): > The first 11 fields are Strings with a maximum
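A sketch of the DataFrame plus spark-csv route Gourav suggests for a tab-separated file; the path and package version are placeholders, the com.databricks:spark-csv package must be on the classpath, and the same options work from PySpark:

```scala
// e.g. spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")       // tab-separated input
  .option("inferSchema", "true")
  .load("hdfs:///path/to/input.tsv")

df.groupBy("C0").count().show()    // spark-csv names columns C0, C1, ... when there is no header
```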

Re: Weird error while serialization

2016-04-09 Thread SURAJ SHETH
Hi, I am using Spark 1.5.2 The file contains 900K rows each with twelve fields (tab separated): The first 11 fields are Strings with a maximum of 20 chars each. The last field is a comma separated array of floats with 8,192 values. It works perfectly if I change the below code for groupBy from

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Gourav Sengupta
why not use AWS Lambda? Regards, Gourav On Fri, Apr 8, 2016 at 8:14 PM, Benjamin Kim wrote: > Has anyone monitored an S3 bucket or directory using Spark Streaming and > pulled any new files to process? If so, can you provide basic Scala coding > help on this? > > Thanks, >

java.lang.ClassCastException when I execute a Spark SQL command

2016-04-09 Thread P.S. Aravind
Need help in resolving java.lang.ClassCastException when I execute a Spark SQL   command %sql insert overwrite table  table2 partition(node) select  * from  table1  where  field1 '%google%'and node = 'DCP2' Job aborted due to stage failure: Task 0 in stage 66.0 failed 4 times, most recent

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Ah, I spoke too soon. I thought the SQS part was going to be a Spark package. It looks like it has to be compiled into a jar for use. Am I right? Can someone help with this? I tried to compile it using SBT, but I'm stuck with a SonatypeKeys not found error. If there's an easier alternative,

Datasets combineByKey

2016-04-09 Thread Amit Sela
Is there (planned?) a combineByKey support for Dataset? Is / will there be support for combiner lifting? Thanks, Amit
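The 1.6 Dataset API doesn't expose combineByKey directly; a sketch of the usual interim fallback through the RDD API, with the `Reading` type and the averaging logic purely illustrative:

```scala
case class Reading(key: String, value: Double)

// ds: Dataset[Reading]; drop to the RDD API for combineByKey until a typed
// equivalent (with combiner lifting) exists.
val perKeyMean = ds.rdd
  .map(r => (r.key, r.value))
  .combineByKey[(Double, Long)](
    v => (v, 1L),                            // createCombiner
    (acc, v) => (acc._1 + v, acc._2 + 1L),   // mergeValue
    (a, b) => (a._1 + b._1, a._2 + b._2)     // mergeCombiners
  )
  .mapValues { case (total, count) => total / count }
```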

Re: java.lang.ClassCastException when I execute a Spark SQL command

2016-04-09 Thread Mich Talebzadeh
where field1 '%google%' should read: where field1 like '%google%'. Dr Mich Talebzadeh, LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Graphframes pattern causing java heap space errors

2016-04-09 Thread Jacek Laskowski
Hi, (I haven't played with GraphFrames) What's your `sc.master`? How do you run your application -- spark-submit or java -jar or sbt run or...? The reason I'm asking is that a few options might not be in use whatsoever, e.g. spark.driver.memory and spark.executor.memory in local mode. Regards,

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
This was easy! I just created a notification on a source S3 bucket to kick off a Lambda function that would decompress the dropped file and save it to another S3 bucket. In return, this S3 bucket has a notification to send a SNS message to me via email. I can just as easily setup SQS to be the

Graphframes pattern causing java heap space errors

2016-04-09 Thread Buntu Dev
I'm running this motif pattern against 1.5M vertices (5.5mb) and 10M (60mb) edges: tgraph.find("(a)-[]->(b); (c)-[]->(b); (c)-[]->(d)") I keep running into Java heap space errors: ERROR actor.ActorSystemImpl: Uncaught fatal error from thread

Re: Graphframes pattern causing java heap space errors

2016-04-09 Thread Buntu Dev
I'm running it via pyspark against yarn in client deploy mode. I do notice in the spark web ui under Environment tab all the options I've set, so I'm guessing these are accepted. On Sat, Apr 9, 2016 at 5:52 PM, Jacek Laskowski wrote: > Hi, > > (I haven't played with

Re: java.lang.ClassCastException when I execute a Spark SQL command

2016-04-09 Thread P.S. Aravind
sorry, no syntax errors in the sql, I missed the 'like' when I pasted the sql in the email. I'm getting the exception for this sql: %sql insert overwrite table table2 partition(node) select * from table1 where field1 like '%google%' and node = 'DCP2' P. S. "Arvind" Aravind

Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread Buntu Dev
I've allocated about 4g for the driver. For the count stage, I notice the Shuffle Write to be 13.9 GB. On Sat, Apr 9, 2016 at 11:43 AM, Ndjido Ardo BAR wrote: > What's the size of your driver? > On Sat, 9 Apr 2016 at 20:33, Buntu Dev wrote: > >> Actually,

alter table add columns aternatives or hive refresh

2016-04-09 Thread Maurin Lenglart
Hi, I am trying to add columns to table that I created with the “saveAsTable” api. I update the columns using sqlContext.sql(‘alter table myTable add columns (mycol string)’). The next time I create a df and save it in the same table, with the new columns I get a : “ParquetRelation requires
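One workaround that sometimes sidesteps this kind of schema mismatch (hedged, since the full ParquetRelation message isn't shown) is to rebuild the table from the merged Parquet schema instead of ALTER-ing the metastore definition; a sketch with the warehouse path and table names as placeholders:

```scala
import org.apache.spark.sql.SaveMode

// Read the existing files with schema merging on, so old files (without mycol)
// and new data (with mycol) reconcile into one widened schema.
val merged = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/user/hive/warehouse/mytable")   // hypothetical saveAsTable location

// Register a new table with the widened schema and append to it going forward.
merged.write.mode(SaveMode.Overwrite).saveAsTable("myTable_v2")
```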

Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread bdev
Thanks Mandar, I couldn't see anything under the 'Storage Section' but under the Executors I noticed it to be 3.1 GB: Executors (1) Memory: 0.0 B Used (3.1 GB Total)

How to estimate the size of dataframe using pyspark?

2016-04-09 Thread bdev
I keep running out of memory on the driver when I attempt to do df.show(). Can anyone let me know how to estimate the size of the dataframe? Thanks!
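The thread is in PySpark, but the common way to get a concrete number is the same in either API: persist the DataFrame, force it to materialize, and read the size off the web UI; a sketch in Scala:

```scala
import org.apache.spark.storage.StorageLevel

// Cache the DataFrame and materialize it; the "Storage" tab of the web UI
// then reports its in-memory (deserialized) size.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()

// Rough alternative: estimate bytes per row from a small collected sample
// (SizeEstimator is public as of Spark 1.6; it measures JVM object size, not on-disk size).
val sample = df.sample(withReplacement = false, fraction = 0.001).collect()
val bytesPerRow = org.apache.spark.util.SizeEstimator.estimate(sample) /
  math.max(sample.length, 1)
```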

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Natu Lauchande
Do you know if textFileStream can see if new files are created underneath a whole bucket? Only at the level of the folder that you specify. They don't do subfolders. So your approach would be detecting everything under path s3://bucket/path/2016040902_data.csv Also, will Spark Streaming not
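A sketch of what that looks like in practice: textFileStream watches a single directory (no recursion), so new objects must appear directly under the monitored prefix; the bucket name, batch interval and S3 credentials in the Hadoop configuration are assumptions:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))

// Only files created directly under this prefix are picked up;
// nested date folders such as 2016/04/09/... would not be seen.
val lines = ssc.textFileStream("s3n://incoming/path/")
lines.count().print()   // e.g. log how many new lines arrived in each batch

ssc.start()
ssc.awaitTermination()
```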

Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread Buntu Dev
Actually, df.show() works displaying 20 rows but df.count() is the one which is causing the driver to run out of memory. There are just 3 INT columns. Any idea what could be the reason? On Sat, Apr 9, 2016 at 10:47 AM, wrote: > You seem to have a lot of column :-) ! >

Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread Ndjido Ardo BAR
What's the size of your driver? On Sat, 9 Apr 2016 at 20:33, Buntu Dev wrote: > Actually, df.show() works displaying 20 rows but df.count() is the one > which is causing the driver to run out of memory. There are just 3 INT > columns. > > Any idea what could be the reason? >

Re: Spark GUI, Workers and Executors

2016-04-09 Thread Mark Hamstra
https://spark.apache.org/docs/latest/cluster-overview.html On Sat, Apr 9, 2016 at 12:28 AM, Ashok Kumar wrote: > On Spark GUI I can see the list of Workers. > > I always understood that workers are used by executors. > > What is the relationship between workers and

Weird error while serialization

2016-04-09 Thread SURAJ SHETH
I am trying to perform some processing and cache and count the RDD. Any solutions? Seeing a weird error : File "/mnt/yarn/usercache/hadoop/appcache/application_1456909219314_0014/container_1456909219314_0014_01_04/pyspark.zip/pyspark/serializers.py", line 550, in write_int

Re: Weird error while serialization

2016-04-09 Thread Ted Yu
The value was out of the range of integer. Which Spark release are you using ? Can you post snippet of code which can reproduce the error ? Thanks On Sat, Apr 9, 2016 at 12:25 PM, SURAJ SHETH wrote: > I am trying to perform some processing and cache and count the RDD. >
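For context on Ted's point: PySpark's framed serializer writes each serialized chunk's length as a 32-bit integer, so a single serialized object larger than about 2 GB (for example one huge group built by groupBy over the 8,192-float rows) can overflow it. One way to avoid building such groups is to combine values per key as they arrive; a sketch under that assumption, with names illustrative (the same reduceByKey call exists in PySpark):

```scala
// rows: RDD[(String, Array[Float])] keyed by the grouping field.
// Element-wise sum per key instead of materializing every vector in one group.
val summedPerKey = rows.reduceByKey { (a, b) =>
  val out = new Array[Float](a.length)
  var i = 0
  while (i < a.length) { out(i) = a(i) + b(i); i += 1 }
  out
}
```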