alter table add columns alternatives or hive refresh

2016-04-09 Thread Maurin Lenglart
Hi, I am trying to add columns to a table that I created with the “saveAsTable” API. I add the column using sqlContext.sql(‘alter table myTable add columns (mycol string)’). The next time I create a df and save it into the same table with the new column, I get a “ParquetRelation requires
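A minimal pyspark sketch of the sequence being described, assuming sqlContext is a HiveContext and that the table, column, and DataFrame names are placeholders; refreshTable is added because Spark caches table metadata, though whether that alone resolves the ParquetRelation error depends on how saveAsTable recorded the Parquet schema:

    # Hypothetical names; assumes a Spark 1.x HiveContext and that df is the
    # new DataFrame that now includes the extra column.
    sqlContext.sql("ALTER TABLE myTable ADD COLUMNS (mycol STRING)")
    sqlContext.refreshTable("myTable")             # drop cached metadata for the table
    df.write.mode("append").saveAsTable("myTable")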

Re: Graphframes pattern causing java heap space errors

2016-04-09 Thread Buntu Dev
I'm running it via pyspark against yarn in client deploy mode. I do notice in the spark web ui under Environment tab all the options I've set, so I'm guessing these are accepted. On Sat, Apr 9, 2016 at 5:52 PM, Jacek Laskowski wrote: > Hi, > > (I haven't played with

Datasets combineByKey

2016-04-09 Thread Amit Sela
Is there (or is there planned) combineByKey support for Dataset? Is there, or will there be, support for combiner lifting? Thanks, Amit

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Ah, I spoke too soon. I thought the SQS part was going to be a Spark package. It looks like it has to be compiled into a jar for use. Am I right? Can someone help with this? I tried to compile it using SBT, but I’m stuck with a SonatypeKeys not found error. If there’s an easier alternative,

Re: Graphframes pattern causing java heap space errors

2016-04-09 Thread Jacek Laskowski
Hi, (I haven't played with GraphFrames) What's your `sc.master`? How do you run your application -- spark-submit or java -jar or sbt run or...? The reason I'm asking is that a few options might not be in use at all, e.g. spark.driver.memory and spark.executor.memory in local mode. Regards,

Graphframes pattern causing java heap space errors

2016-04-09 Thread Buntu Dev
I'm running this motif pattern against 1.5M vertices (5.5mb) and 10M (60mb) edges: tgraph.find("(a)-[]->(b); (c)-[]->(b); (c)-[]->(d)") I keep running into Java heap space errors: ~ ERROR actor.ActorSystemImpl: Uncaught fatal error from thread
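For reference, a minimal pyspark sketch of the motif query being run; vertices_df and edges_df are placeholder DataFrames with the id/src/dst columns GraphFrames expects, and each semicolon-separated motif term is executed as another join, which is why the query can be memory-hungry even on modest graphs:

    from graphframes import GraphFrame

    # vertices_df needs an "id" column; edges_df needs "src" and "dst" (placeholder names).
    g = GraphFrame(vertices_df, edges_df)

    # The motif from the post; every additional term adds another join.
    paths = g.find("(a)-[]->(b); (c)-[]->(b); (c)-[]->(d)")
    paths.count()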

Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread Buntu Dev
I've allocated about 4g for the driver. For the count stage, I notice the Shuffle Write to be 13.9 GB. On Sat, Apr 9, 2016 at 11:43 AM, Ndjido Ardo BAR wrote: > What's the size of your driver? > On Sat, 9 Apr 2016 at 20:33, Buntu Dev wrote: > >> Actually,

Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread bdev
Thanks Mandar, I couldn't see anything under the 'Storage Section' but under the Executors I noticed it to be 3.1 GB: Executors (1) Memory: 0.0 B Used (3.1 GB Total)
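The Storage tab only shows something once a DataFrame or RDD is actually cached and materialized; a small pyspark sketch (df is a placeholder) for making the in-memory size visible there:

    from pyspark import StorageLevel

    df.persist(StorageLevel.MEMORY_ONLY)   # or simply df.cache()
    df.count()                             # materialize the cache
    # Check the web UI's Storage tab now: it reports the in-memory size of the cached data.
    df.unpersist()                         # free executor memory afterwards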

Re: java.lang.ClassCastException when I execute a Spark SQL command

2016-04-09 Thread P.S. Aravind
Sorry, no syntax errors in the SQL; I missed the 'like' when I pasted the SQL into the email. I'm getting the exception for this SQL: %sql insert overwrite table table2 partition(node) select * from table1 where field1 like '%google%' and node = 'DCP2' P. S. "Arvind" Aravind

Re: java.lang.ClassCastException when I execute a Spark SQL command

2016-04-09 Thread Mich Talebzadeh
where field1 '%google%' should read where field1 like '%google%' Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

java.lang.ClassCastException when I execute a Spark SQL command

2016-04-09 Thread P.S. Aravind
Need help in resolving java.lang.ClassCastException when I execute a Spark SQL   command %sql insert overwrite table  table2 partition(node) select  * from  table1  where  field1 '%google%'and node = 'DCP2' Job aborted due to stage failure: Task 0 in stage 66.0 failed 4 times, most recent

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
This was easy! I just created a notification on a source S3 bucket to kick off a Lambda function that decompresses the dropped file and saves it to another S3 bucket. In turn, that S3 bucket has a notification to send an SNS message to me via email. I can just as easily set up SQS to be the
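A rough sketch of the kind of Lambda handler being described; the destination bucket name and the gzip assumption are placeholders, and the function needs IAM permissions on both buckets:

    import gzip
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # Invoked by an S3 ObjectCreated notification on the source bucket.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            data = gzip.decompress(body)                       # assumes gzip input
            out_key = key[:-3] if key.endswith(".gz") else key
            s3.put_object(Bucket="my-decompressed-bucket", Key=out_key, Body=data)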

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Gourav Sengupta
Why not use AWS Lambda? Regards, Gourav On Fri, Apr 8, 2016 at 8:14 PM, Benjamin Kim wrote: > Has anyone monitored an S3 bucket or directory using Spark Streaming and > pulled any new files to process? If so, can you provide basic Scala coding > help on this? > > Thanks, >

Re: Weird error while serialization

2016-04-09 Thread Gourav Sengupta
Hi, why are you not using DataFrames and spark-csv? Regards, Gourav On Sat, Apr 9, 2016 at 10:00 PM, SURAJ SHETH wrote: > Hi, > I am using Spark 1.5.2 > > The file contains 900K rows each with twelve fields (tab separated): > The first 11 fields are Strings with a maximum
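For context, a hedged pyspark sketch of what reading the file with spark-csv could look like on Spark 1.5; the path, delimiter, and lack of a header row are assumptions based on the tab-separated description in the original message:

    # Assumes the shell was started with the spark-csv package on the classpath, e.g.
    #   --packages com.databricks:spark-csv_2.10:1.4.0
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("delimiter", "\t")
          .option("header", "false")
          .load("hdfs:///path/to/data.tsv"))   # placeholder path
    df.printSchema()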

Re: Weird error while serialization

2016-04-09 Thread SURAJ SHETH
Hi, I am using Spark 1.5.2 The file contains 900K rows each with twelve fields (tab separated): The first 11 fields are Strings with a maximum of 20 chars each. The last field is a comma separated array of floats with 8,192 values. It works perfectly if I change the below code for groupBy from

Re: Spark GUI, Workers and Executors

2016-04-09 Thread Mark Hamstra
https://spark.apache.org/docs/latest/cluster-overview.html On Sat, Apr 9, 2016 at 12:28 AM, Ashok Kumar wrote: > On Spark GUI I can see the list of Workers. > > I always understood that workers are used by executors. > > What is the relationship between workers and

Re: Weird error while serialization

2016-04-09 Thread Ted Yu
The value was outside the range of an integer. Which Spark release are you using? Can you post a snippet of code which can reproduce the error? Thanks On Sat, Apr 9, 2016 at 12:25 PM, SURAJ SHETH wrote: > I am trying to perform some processing and cache and count the RDD. >

Weird error while serialization

2016-04-09 Thread SURAJ SHETH
I am trying to perform some processing and cache and count the RDD. Any solutions? Seeing a weird error : File "/mnt/yarn/usercache/hadoop/appcache/application_1456909219314_0014/container_1456909219314_0014_01_04/pyspark.zip/pyspark/serializers.py", line 550, in write_int

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Nezih Yigitbasi
Natu, Benjamin, With this mechanism you can configure notifications for *buckets* (if you only care about some key prefixes you can take a look at object key name filtering, see the docs) for various event types, and then these events can be published to SNS, SQS or Lambdas. I think using SQS as
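A hedged boto3 sketch of wiring such a bucket notification to an SQS queue; the bucket name, queue ARN, and key-prefix filter are placeholders, and the queue's access policy must also allow S3 to send messages to it:

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket="my-incoming-bucket",
        NotificationConfiguration={
            "QueueConfigurations": [{
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:incoming-files",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": [
                    {"Name": "prefix", "Value": "incoming/"},
                ]}},
            }]
        },
    )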

Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread Ndjido Ardo BAR
What's the size of your driver? On Sat, 9 Apr 2016 at 20:33, Buntu Dev wrote: > Actually, df.show() works displaying 20 rows but df.count() is the one > which is causing the driver to run out of memory. There are just 3 INT > columns. > > Any idea what could be the reason? >

Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread Buntu Dev
Actually, df.show() works displaying 20 rows but df.count() is the one which is causing the driver to run out of memory. There are just 3 INT columns. Any idea what could be the reason? On Sat, Apr 9, 2016 at 10:47 AM, wrote: > You seem to have a lot of column :-) ! >

Re: Unable run Spark in YARN mode

2016-04-09 Thread Ted Yu
mahesh: regarding ":16: error: not found: value sqlContext" -- please take a look at https://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext for how the import should be used. Please include the version of Spark and the command line you used in your reply.

Re: Unable run Spark in YARN mode

2016-04-09 Thread Natu Lauchande
How are you trying to run Spark? Locally? spark-submit? On Sat, Apr 9, 2016 at 7:57 AM, maheshmath wrote: > I have set SPARK_LOCAL_IP=127.0.0.1 still getting below error > > 16/04/09 10:36:50 INFO spark.SecurityManager: Changing view acls to: mahesh > 16/04/09 10:36:50

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Natu Lauchande
Do you know if textFileStream can see if new files are created underneath a whole bucket? Only at the level of the folder that you specify; it doesn't pick up subfolders. So your approach would be detecting everything under the path s3://bucket/path/2016040902_data.csv Also, will Spark Streaming not
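A minimal pyspark sketch of the textFileStream approach; the bucket/prefix and the one-minute batch interval are placeholders, and the stream only picks up files that appear after it starts:

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 60)                  # 60-second batches (placeholder)
    # Watches exactly this prefix; new objects that appear directly under it are
    # picked up, but objects in nested "subfolders" are not.
    lines = ssc.textFileStream("s3n://bucket/path/")
    lines.count().pprint()

    ssc.start()
    ssc.awaitTermination()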

How to estimate the size of dataframe using pyspark?

2016-04-09 Thread bdev
I keep running out of memory on the driver when I attempt to do df.show(). Can anyone let me know how to estimate the size of the dataframe? Thanks!

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
This is awesome! I have someplace to start from. Thanks, Ben > On Apr 9, 2016, at 9:45 AM, programminggee...@gmail.com wrote: > > Someone please correct me if I am wrong as I am still rather green to spark, > however it appears that through the S3 notification mechanism described > below,

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread programminggeek72
Someone please correct me if I am wrong as I am still rather green to spark, however it appears that through the S3 notification mechanism described below, you can publish events to SQS and use SQS as a streaming source into spark. The project at https://github.com/imapi/spark-sqs-receiver

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Nezih, This looks like a good alternative to having the Spark Streaming job check for new files on its own. Do you know if there is a way to have the Spark Streaming job get notified with the new file information and act upon it? This can reduce the overhead and cost of polling S3. Plus, I can

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Natu, Do you know if textFileStream can see if new files are created underneath a whole bucket? For example, if the bucket name is incoming and new files underneath it are 2016/04/09/00/00/01/data.csv and 2016/04/09/00/00/02/data.csv, will these files be picked up? Also, will Spark Streaming

Re: Use only latest values

2016-04-09 Thread Sebastian Piu
Have a look at mapWithState if you are using 1.6+ On Sat, 9 Apr 2016, 08:04 Daniela S, wrote: > Hi, > > I would like to cache values and to use only the latest "valid" values to > build a sum. > In more detail, I receive values from devices periodically. I would like > to
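mapWithState is only exposed in the Scala/Java API in 1.6; a rough pyspark analogue of the same idea using updateStateByKey is sketched below, keeping the latest reading per device and summing them each batch. The DStream name, checkpoint directory, and (device_id, value) record shape are assumptions:

    # Assumes ssc is a StreamingContext with roughly one-minute batches and that
    # `readings` is a DStream of (device_id, value) pairs -- both are placeholders.
    ssc.checkpoint("hdfs:///tmp/checkpoints")        # required by updateStateByKey

    def keep_latest(new_values, last):
        # Keep the most recent value seen for this device, else carry the prior state.
        return new_values[-1] if new_values else last

    latest = readings.updateStateByKey(keep_latest)
    # Sum the latest value across all devices once per batch.
    latest.map(lambda kv: kv[1]).reduce(lambda a, b: a + b).pprint()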

Re: Use only latest values

2016-04-09 Thread Natu Lauchande
I don't see this happening without a store. You can try Parquet on top of HDFS. That will at least avoid the burden of third-party systems. On 09 Apr 2016 9:04 AM, "Daniela S" wrote: > Hi, > > I would like to cache values and to use only the latest "valid" values to > build a sum.

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Natu Lauchande
Can you elaborate a bit more on your approach using S3 notifications? Just curious; I'm dealing with a similar issue right now that might benefit from it. On 09 Apr 2016 9:25 AM, "Nezih Yigitbasi" wrote: > While it is doable in Spark, S3 also supports notifications: >

Spark GUI, Workers and Executors

2016-04-09 Thread Ashok Kumar
On the Spark GUI I can see the list of Workers. I always understood that workers are used by executors. What is the relationship between workers and executors, please? Is it one-to-one? Thanks

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Nezih Yigitbasi
While it is doable in Spark, S3 also supports notifications: http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande wrote: > Hi Benjamin, > > I have done it . The critical configuration items are the ones below

Use only latest values

2016-04-09 Thread Daniela S
Hi, I would like to cache values and to use only the latest "valid" values to build a sum. In more detail, I receive values from devices periodically. I would like to add up all the valid values each minute. But not every device sends a new value every minute. And as long as there is no new

Re: How Spark handles dead machines during a job.

2016-04-09 Thread Reynold Xin
The driver has the data and wouldn't need to rerun. On Friday, April 8, 2016, Sung Hwan Chung wrote: > Hello, > > Say that I'm doing a simple rdd.map followed by collect. Say also that > one of the executors finishes all of its tasks, but there are still other > executors