Re: Dynamic allocator requests -1 executors

2015-06-13 Thread Andrew Or
Hi Patrick, The fix you need is SPARK-6954: https://github.com/apache/spark/pull/5704. If possible, you may cherry-pick the following commit into your Spark deployment and it should resolve the issue: https://github.com/apache/spark/commit/98ac39d2f5828fbdad8c9a4e563ad1169e3b9948 Note that this

What is most efficient to do a large union and remove duplicates?

2015-06-13 Thread Gavin Yue
I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so 5TB of data in total. The data is formatted as key \t value (tab-separated). After the union, I want to remove the duplicates among keys, so each key should be unique and have only one value. Here is what I am doing. folders =
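
A minimal sketch of one way to do the union and de-duplication in Scala, assuming the tab-separated key/value layout described above (paths and names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object UnionDedup {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("union-dedup"))

        // Hypothetical folder list; textFile accepts a comma-separated list of
        // paths, so all 10 folders are read as a single RDD (an implicit union).
        val folders = (1 to 10).map(i => s"hdfs:///data/folder$i")
        val lines = sc.textFile(folders.mkString(","))

        // Split each line on the tab into (key, value), then keep one value per key.
        val deduped = lines
          .map { line =>
            val Array(k, v) = line.split("\t", 2)
            (k, v)
          }
          .reduceByKey((v1, _) => v1) // arbitrary winner; substitute your own merge rule

        deduped.map { case (k, v) => s"$k\t$v" }.saveAsTextFile("hdfs:///data/deduped")
        sc.stop()
      }
    }

Reducing by key shuffles each key only once, which at this scale is usually cheaper than a union followed by distinct over whole lines.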

How to set up a Spark Client node?

2015-06-13 Thread MrAsanjar .
I have the following Hadoop/Spark cluster node configuration: Nodes 1 and 2 are the resourceManager and nameNode respectively; Nodes 3, 4, and 5 each include a nodeManager and dataNode; Node 7 is the Spark master, configured to run in yarn-client or yarn-cluster mode. I have tested it and it works fine. Is there any

Re: [Spark] What is the most efficient way to do such a join and column manipulation?

2015-06-13 Thread Michael Armbrust
Yes, it's all just RDDs under the covers. DataFrames/SQL is just a more concise way to express your parallel programs. On Sat, Jun 13, 2015 at 5:25 PM, Rex X dnsr...@gmail.com wrote: Thanks, Don! Does the SQL implementation of Spark do parallel processing on records by default? -Rex On Sat,

Re: How to read avro in SparkR

2015-06-13 Thread Burak Yavuz
Hi, Not sure if this is it, but could you please try com.databricks.spark.avro instead of just avro? Thanks, Burak On Jun 13, 2015 9:55 AM, Shing Hing Man mat...@yahoo.com.invalid wrote: Hi, I am trying to read an Avro file in SparkR (in Spark 1.4.0). I started R using the following.
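
For reference, the same point (use the fully-qualified data source name) in Scala, assuming a shell or job started with --packages com.databricks:spark-avro_2.10:1.0.0 so the package is on the classpath; the path is illustrative:

    // In spark-shell 1.4.0, sqlContext is already defined.
    val df = sqlContext.read
      .format("com.databricks.spark.avro") // fully-qualified name instead of just "avro"
      .load("file:///home/matmsh/myfile.avro")
    df.printSchema()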

Re: How to read avro in SparkR

2015-06-13 Thread Shivaram Venkataraman
Yep - Burak's answer should work. FWIW, the line in the stack trace that shows this is: Failed to load class for data source: avro. Thanks Shivaram On Sat, Jun 13, 2015 at 6:13 PM, Burak Yavuz brk...@gmail.com wrote: Hi, Not sure if this is it, but could you please try

Re: Dataframe Write : Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.

2015-06-13 Thread pth001
I got it. Thanks! Patcharee On 13/06/15 23:00, Will Briggs wrote: The context that is created by spark-shell is actually an instance of HiveContext. If you want to use it programmatically in your driver, you need to make sure that your context is a HiveContext, and not a SQLContext.

Re: Dataframe Write : Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.

2015-06-13 Thread Cheng Lian
As the error message says, were you using a |SQLContext| instead of a |HiveContext| to create the DataFrame? In Spark shell, although the variable name is |sqlContext|, the type of that variable is actually |org.apache.spark.sql.hive.HiveContext|, which has the ability to communicate with
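
A minimal sketch of doing this explicitly in a standalone driver (Spark 1.3/1.4 APIs; the input path and table name are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-context-example"))

    // A HiveContext (not a plain SQLContext) is what allows saveAsTable to create
    // persistent, non-temporary tables backed by the Hive metastore.
    val hiveContext = new HiveContext(sc)

    val df = hiveContext.read.json("hdfs:///path/to/input.json")
    df.write.saveAsTable("my_persistent_table")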

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-13 Thread Sanjay Subramanian
hey guys, I tried the following settings as well. No luck: --total-executor-cores 24 --executor-memory 4G. BTW on the same cluster, Impala absolutely kills it: the same query takes 9 seconds, no memory issues, no issues at all. In fact I am pretty disappointed with Spark-SQL. I have worked with Hive during the

spark stream twitter question ..

2015-06-13 Thread Mike Frampton
Hi, I have a question about Spark Twitter stream processing in Spark 1.3.1. The code sample below just opens up a Twitter stream, uses auth keys, splits out hashtags, and creates a temp table. However, when I try to compile it using sbt (CentOS 6.5) I get the error [error]
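
The compile error itself is cut off above; one common cause for this kind of failure is a missing spark-streaming-twitter dependency, so here is a hedged build.sbt sketch under that assumption (Scala 2.10 and Spark 1.3.1 versions assumed):

    // build.sbt -- sketch only
    name := "twitter-stream-example"
    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"              % "1.3.1" % "provided",
      "org.apache.spark" %% "spark-streaming"         % "1.3.1" % "provided",
      "org.apache.spark" %% "spark-streaming-twitter" % "1.3.1",             // not part of the core assembly
      "org.apache.spark" %% "spark-sql"               % "1.3.1" % "provided" // for the temp table
    )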

Re: [Spark] What is the most efficient way to do such a join and column manipulation?

2015-06-13 Thread Rex X
Thanks, Don! Does the SQL implementation of Spark do parallel processing on records by default? -Rex On Sat, Jun 13, 2015 at 10:13 AM, Don Drake dondr...@gmail.com wrote: Take a look at https://github.com/databricks/spark-csv to read in the tab-delimited file (change the default delimiter) and

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-13 Thread Josh Rosen
Try using Spark 1.4.0 with SQL code generation turned on; this should make a huge difference. On Sat, Jun 13, 2015 at 5:08 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com wrote: hey guys I tried the following settings as well. No luck --total-executor-cores 24 --executor-memory 4G
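
A hedged sketch of what turning code generation on looked like in the 1.x line (the spark.sql.codegen flag), from inside an application that already has a SQLContext/HiveContext:

    // Enable expression code generation for SQL/DataFrame execution (Spark 1.x flag).
    sqlContext.setConf("spark.sql.codegen", "true")

    // The same flag can be passed at launch time instead:
    //   spark-sql --conf spark.sql.codegen=true ...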

Re: Extracting k-means cluster values along with centers?

2015-06-13 Thread Robin East
trying again On 13 Jun 2015, at 10:15, Robin East robin.e...@xense.co.uk wrote: Here's a typical way to do it: import org.apache.spark.mllib.clustering.{KMeans, KMeansModel} import org.apache.spark.mllib.linalg.Vectors // Load and parse
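
The quoted example is cut off above; a hedged sketch of the overall idea for the thread's question (pair each point with its assigned cluster via predict, then read the centers off the model), assuming space-separated numeric input and a spark-shell sc:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Load and parse the data: one space-separated vector per line (illustrative path).
    val data = sc.textFile("hdfs:///path/to/kmeans_data.txt")
    val parsed = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

    // Cluster the data.
    val model = KMeans.train(parsed, 2, 20) // k = 2, 20 iterations

    // Pair every point with the index of the cluster it was assigned to,
    // then look up the corresponding center from the model.
    val withCenters = parsed.map { v =>
      val clusterId = model.predict(v)
      (clusterId, v, model.clusterCenters(clusterId))
    }
    withCenters.take(5).foreach(println)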

Re: Building scaladoc using build/sbt unidoc failure

2015-06-13 Thread Reynold Xin
Try build/sbt clean first. On Tue, May 26, 2015 at 4:45 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I am trying to build scala doc from the 1.4 branch. But it failed due to [error] (sql/compile:compile) java.lang.AssertionError: assertion failed: List(object package$DebugNode,

Re: --jars not working?

2015-06-13 Thread Alex Nakos
Are you using a build for Scala 2.11? I've encountered the same behaviour trying to run on YARN with Scala 2.11 and Spark 1.3.0, 1.3.1 and 1.4.0.RC3, and raised a JIRA issue here: https://issues.apache.org/jira/browse/SPARK-7944. Would be good to know if this is identical to what you're seeing on

Re: [Spark 1.4.0] java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation

2015-06-13 Thread Steve Loughran
That's the Tachyon FS there, which appears to be missing a method override. On 12 Jun 2015, at 19:58, Peter Haumer phau...@us.ibm.com wrote: Exception in thread "main" java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation at

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-13 Thread Juan Rodríguez Hortalá
Perfect! I'll start working on it 2015-06-13 2:23 GMT+02:00 Amit Ramesh a...@yelp.com: Hi Juan, I have created a ticket for this: https://issues.apache.org/jira/browse/SPARK-8337 Thanks! Amit On Fri, Jun 12, 2015 at 3:17 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com

Re: [Spark] What is the most efficient way to do such a join and column manipulation?

2015-06-13 Thread Don Drake
Take a look at https://github.com/databricks/spark-csv to read in the tab-delimited file (change the default delimiter) and once you have that as a DataFrame, SQL can do the rest. https://spark.apache.org/docs/latest/sql-programming-guide.html -Don On Fri, Jun 12, 2015 at 8:46 PM, Rex X
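
A hedged Scala sketch of that suggestion, assuming the application is launched with --packages com.databricks:spark-csv_2.10:1.0.3 (or similar) and using illustrative paths:

    // Read a tab-delimited file into a DataFrame by overriding the default comma delimiter.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "\t")
      .option("header", "true") // drop this if the file has no header row
      .load("hdfs:///path/to/input.tsv")

    // Register it and let SQL handle the join / column manipulation from there.
    df.registerTempTable("input")
    sqlContext.sql("SELECT COUNT(*) FROM input").show()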

Re: Dynamic allocator requests -1 executors

2015-06-13 Thread Sandy Ryza
Hi Patrick, I'm noticing that you're using Spark 1.3.1. We fixed a bug in dynamic allocation in 1.4 that permitted requesting negative numbers of executors. Any chance you'd be able to try with the newer version and see if the problem persists? -Sandy On Fri, Jun 12, 2015 at 7:42 PM, Patrick

Re: How to split log data into different files according to severity

2015-06-13 Thread Hao Wang
I am currently using filter inside a loop of all severity levels to do this, which I think is pretty inefficient. It has to read the entire data set once for each severity. I wonder if there is a more efficient way that takes just one pass of the data? Thanks. Best, Hao Wang On Jun 13, 2015,

Re: How to split log data into different files according to severity

2015-06-13 Thread Will Briggs
Check out this recent post by Cheng Lian regarding dynamic partitioning in Spark 1.4: https://www.mail-archive.com/user@spark.apache.org/msg30204.html On June 13, 2015, at 5:41 AM, Hao Wang bill...@gmail.com wrote: Hi, I have a bunch of large log files on Hadoop. Each line contains a log and
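
A hedged sketch of the single-pass approach the linked post describes, using the Spark 1.4 DataFrameWriter.partitionBy API (the parsing, column names, and paths are illustrative):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Parse each log line into (severity, message); e.g. "[ERROR] log1" -> ("ERROR", ...).
    val logs = sc.textFile("hdfs:///logs/input")
    val rows = logs.map { line =>
      val severity = line.takeWhile(_ != ']').stripPrefix("[")
      Row(severity, line)
    }

    val schema = StructType(Seq(
      StructField("severity", StringType, nullable = false),
      StructField("message", StringType, nullable = false)))
    val df = sqlContext.createDataFrame(rows, schema)

    // One pass over the data: each severity value becomes its own output directory.
    df.write.partitionBy("severity").format("parquet").save("hdfs:///logs/by_severity")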

How to read avro in SparkR

2015-06-13 Thread Shing Hing Man
Hi, I am trying to read an Avro file in SparkR (in Spark 1.4.0). I started R using the following: matmsh@gauss:~$ sparkR --packages com.databricks:spark-avro_2.10:1.0.0 Inside the R shell, when I issue the following, read.df(sqlContext, "file:///home/matmsh/myfile.avro", "avro") I get the following

How to silence Parquet logging?

2015-06-13 Thread Chris Freeman
Hey everyone, I’m trying to figure out how to silence all of the logging info that gets printed to the console when dealing with Parquet files. I’ve seen that there have been several PRs addressing this issue, but I can’t seem to figure out how to actually change the logging config. I’ve
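
No definitive answer appears in the thread; the commonly attempted route is to raise the level on the Parquet-related loggers, sketched here with assumed logger names (note that older Parquet releases log through java.util.logging rather than log4j, in which case this has no effect):

    import org.apache.log4j.{Level, Logger}

    // Suppress INFO chatter from Parquet and Spark's Parquet integration.
    Logger.getLogger("parquet").setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark.sql.parquet").setLevel(Level.WARN)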

--packages Failed to load class for data source v1.4

2015-06-13 Thread Don Drake
I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing Python Spark application against it and got the following error: py4j.protocol.Py4JJavaError: An error occurred while calling o90.save. : java.lang.RuntimeException: Failed to load class for data source:

Not able to run FP-growth Example

2015-06-13 Thread masoom alam
Hi everyone, I am trying to run the FP-growth example. I have tried to compile the following POM file: <project> <groupId>com.oreilly.learningsparkexamples.mini</groupId> <artifactId>learning-spark-mini-example</artifactId> <modelVersion>4.0.0</modelVersion> <name>example</name>

Reliable SQS Receiver for Spark Streaming

2015-06-13 Thread cizmazia
I would like to have a Spark Streaming *SQS Receiver* which deletes SQS messages only *after* they were successfully stored on S3. For this a *Custom Receiver* can be implemented with the semantics of the Reliable Receiver. The store(multiple-records) call blocks until the given records have been

Re: Reliable SQS Receiver for Spark Streaming

2015-06-13 Thread Akhil Das
Yes, if you have enabled the WAL and checkpointing, then after the store you can simply delete the SQS messages from your receiver. Thanks Best Regards On Sat, Jun 13, 2015 at 6:14 AM, Michal Čizmazia mici...@gmail.com wrote: I would like to have a Spark Streaming SQS Receiver which deletes SQS
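
A hedged sketch of what such a receiver could look like (AWS SDK for Java v1 plus the Spark Streaming Receiver API; the queue URL, polling loop, and error handling are simplified for illustration):

    import scala.collection.JavaConverters._
    import scala.collection.mutable.ArrayBuffer

    import com.amazonaws.services.sqs.AmazonSQSClient
    import com.amazonaws.services.sqs.model.ReceiveMessageRequest
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class SqsReceiver(queueUrl: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

      def onStart(): Unit = {
        new Thread("sqs-receiver") {
          override def run(): Unit = poll()
        }.start()
      }

      def onStop(): Unit = {} // the polling loop checks isStopped()

      private def poll(): Unit = {
        val sqs = new AmazonSQSClient()
        while (!isStopped()) {
          val messages = sqs.receiveMessage(
            new ReceiveMessageRequest(queueUrl).withMaxNumberOfMessages(10)).getMessages.asScala
          if (messages.nonEmpty) {
            // Reliable store: with the WAL enabled this call returns only after the
            // records are durably written, so the SQS messages can be deleted afterwards.
            store(ArrayBuffer(messages.map(_.getBody): _*))
            messages.foreach(m => sqs.deleteMessage(queueUrl, m.getReceiptHandle))
          }
        }
      }
    }

It would be plugged in with ssc.receiverStream(new SqsReceiver(queueUrl)), with the write-ahead log enabled via spark.streaming.receiver.writeAheadLog.enable and checkpointing configured on the StreamingContext, as the reply above describes.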

Re: Not able to run FP-growth Example

2015-06-13 Thread masoom alam
Thanks for the answer. Any example? On Jun 13, 2015 2:13 PM, Sonal Goyal sonalgoy...@gmail.com wrote: I think you need to add a dependency on spark-mllib too. On Jun 13, 2015 11:10 AM, masoom alam masoom.a...@wanclouds.net wrote: Hi everyone, I am trying to run the FP-growth example. I have
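
A hedged sketch of the missing dependency in sbt form (in Maven it is the org.apache.spark / spark-mllib_2.10 artifact at the same version as your spark-core dependency; the version below is illustrative):

    // build.sbt fragment -- match the version to the Spark core dependency already in the project
    libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.3.1" % "provided"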

How to split log data into different files according to severity

2015-06-13 Thread Hao Wang
Hi, I have a bunch of large log files on Hadoop. Each line contains a log and its severity. Is there a way that I can use Spark to split the entire data set into different files on Hadoop according to the severity field? Thanks. Below is an example of the input and output. Input: [ERROR] log1

Re: How to split log data into different files according to severity

2015-06-13 Thread Akhil Das
Are you looking for something like filter? See a similar example here https://spark.apache.org/examples.html Thanks Best Regards On Sat, Jun 13, 2015 at 3:11 PM, Hao Wang bill...@gmail.com wrote: Hi, I have a bunch of large log files on Hadoop. Each line contains a log and its severity. Is

Re: Are there ways to restrict what parameters users can set for a Spark job?

2015-06-13 Thread Akhil Das
I think the straight answer would be No, but yes you can actually hardcode these parameters if you want. Look in the SparkContext.scala https://github.com/apache/spark/blob/master/core%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2FSparkContext.scala#L364 where all these properties are being

Re: Spark 1.4 release date

2015-06-13 Thread Giovanni Paolo Gibilisco
Does the pre-built package come with Hive support? Namely, has it been built with -Phive and -Phive-thriftserver? On Fri, Jun 12, 2015, 9:32 AM ayan guha guha.a...@gmail.com wrote: Thanks guys, my question must look like a stupid one today :) Looking forward to test out 1.4.0, just downloaded it.

Re: Reliable SQS Receiver for Spark Streaming

2015-06-13 Thread Michal Čizmazia
Thanks Akhil! I just looked it up in the code as well: Receiver.store(ArrayBuffer[T], ...) -> ReceiverSupervisorImpl.pushArrayBuffer(ArrayBuffer[T], ...) -> ReceiverSupervisorImpl.pushAndReportBlock(...) -> WriteAheadLogBasedBlockHandler.storeBlock(...) This implementation

Dataframe Write : Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.

2015-06-13 Thread pth001
Hi, I am using spark 0.14. I am trying to insert data into a Hive table (in ORC format) from a DF. partitionedTestDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone", "z", "year", "month").saveAsTable("testorc") When this job is submitted by

Re: Dynamic allocator requests -1 executors

2015-06-13 Thread Patrick Woody
Hey Sandy, I'll test it out on 1.4. Do you have a bug number or PR that I could reference as well? Thanks! -Pat Sent from my iPhone On Jun 13, 2015, at 11:38 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Patrick, I'm noticing that you're using Spark 1.3.1. We fixed a bug in dynamic