Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-13 Thread Juan Rodríguez Hortalá
Perfect! I'll start working on it 2015-06-13 2:23 GMT+02:00 Amit Ramesh : > > Hi Juan, > > I have created a ticket for this: > https://issues.apache.org/jira/browse/SPARK-8337 > > Thanks! > Amit > > > On Fri, Jun 12, 2015 at 3:17 PM, Juan Rodríguez Hortalá < > juan.rodriguez.hort...@gmail.com> wr

Re: [Spark 1.4.0] java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation

2015-06-13 Thread Steve Loughran
That's the Tachyon FS there, which appears to be missing a method override. On 12 Jun 2015, at 19:58, Peter Haumer <phau...@us.ibm.com> wrote: Exception in thread "main" java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation at org.apache.hadoop.f

Not able to run FP-growth Example

2015-06-13 Thread masoom alam
Hi everyone, I am trying to run the FP-growth example. I have tried to compile the following POM file: com.oreilly.learningsparkexamples.mini learning-spark-mini-example 4.0.0 example jar 0.0.1 org.apache.spark spark-core_2.10

How to split log data into different files according to severity

2015-06-13 Thread Hao Wang
Hi, I have a bunch of large log files on Hadoop. Each line contains a log and its severity. Is there a way that I can use Spark to split the entire data set into different files on Hadoop according to the severity field? Thanks. Below is an example of the input and output. Input: [ERROR] log1 [INFO]

Reliable SQS Receiver for Spark Streaming

2015-06-13 Thread cizmazia
I would like to have a Spark Streaming *SQS Receiver* which deletes SQS messages only *after* they were successfully stored on S3. For this a *Custom Receiver* can be implemented with the semantics of the Reliable Receiver. The store(multiple-records) call blocks until the given records have been

Re: Reliable SQS Receiver for Spark Streaming

2015-06-13 Thread Akhil Das
Yes, if you have enabled WAL and checkpointing then after the store, you can simply delete the SQS Messages from your receiver. Thanks Best Regards On Sat, Jun 13, 2015 at 6:14 AM, Michal Čizmazia wrote: > I would like to have a Spark Streaming SQS Receiver which deletes SQS > messages only aft
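The store-then-delete ordering Akhil describes (delete a message only after the blocking store has made it durable) can be sketched outside Spark as follows. This is a minimal plain-Python illustration, not the Spark receiver API: FakeQueue and DurableStore are hypothetical stand-ins for SQS and the WAL-backed block store.

```python
class FakeQueue:
    """Hypothetical stand-in for an SQS queue."""
    def __init__(self, messages):
        self.messages = list(messages)

    def receive(self):
        return self.messages[0] if self.messages else None

    def delete(self, msg):
        self.messages.remove(msg)


class DurableStore:
    """Hypothetical stand-in for the WAL-backed block store."""
    def __init__(self):
        self.blocks = []

    def store(self, records):
        # In the real receiver, store() blocks until the records are
        # written to the write-ahead log; here we just record them.
        self.blocks.append(list(records))


def drain(queue, store):
    """Delete each message only after it has been durably stored."""
    while (msg := queue.receive()) is not None:
        store.store([msg])   # durable, blocking write comes first
        queue.delete(msg)    # only now is deletion safe (at-least-once)


q = FakeQueue(["m1", "m2"])
s = DurableStore()
drain(q, s)
# q.messages is now empty; s.blocks == [["m1"], ["m2"]]
```

Because the delete happens strictly after the store returns, a crash between the two steps leaves the message in the queue, giving at-least-once rather than at-most-once delivery.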

Re: How to split log data into different files according to severity

2015-06-13 Thread Akhil Das
Are you looking for something like filter? See a similar example here https://spark.apache.org/examples.html Thanks Best Regards On Sat, Jun 13, 2015 at 3:11 PM, Hao Wang wrote: > Hi, > > I have a bunch of large log files on Hadoop. Each line contains a log and > its severity. Is there a way th

Re: Are there ways to restrict what parameters users can set for a Spark job?

2015-06-13 Thread Akhil Das
I think the straight answer would be No, but yes you can actually hardcode these parameters if you want. Look in the SparkContext.scala where all these properties are being initi

Re: Spark 1.4 release date

2015-06-13 Thread Giovanni Paolo Gibilisco
Does the pre-built package come with Hive support? Namely, has it been built with -Phive and -Phive-thriftserver? On Fri, Jun 12, 2015, 9:32 AM ayan guha wrote: > Thanks guys, my question must look like a stupid one today :) Looking > forward to test out 1.4.0, just downloaded it. > > Congrats to the t

Re: Not able to run FP-growth Example

2015-06-13 Thread masoom alam
Thanks for the answer. Any example? On Jun 13, 2015 2:13 PM, "Sonal Goyal" wrote: > I think you need to add dependency to spark mllib too. > On Jun 13, 2015 11:10 AM, "masoom alam" wrote: > >> Hi every one, >> >> I am trying to run the FP growth example. I have tried to compile the >> following

Re: Reliable SQS Receiver for Spark Streaming

2015-06-13 Thread Michal Čizmazia
Thanks Akhil! I just looked it up in the code as well. Receiver.store(ArrayBuffer[T], ...) ReceiverSupervisorImpl.pushArrayBuffer(ArrayBuffer[T], ...) ReceiverSupervisorImpl.pushAndReportBlock(...) WriteAheadLogBasedBlockHandler.storeBlock(...) This implementation stor

Re: Extracting k-means cluster values along with centers?

2015-06-13 Thread Robin East
trying again > On 13 Jun 2015, at 10:15, Robin East wrote: > > Here’s a typical way to do it: > > import org.apache.spark.mllib.clustering.{KMeans, KMeansModel} > import org.apache.spark.mllib.linalg.Vectors > > // Load and parse th

Re: Dynamic allocator requests -1 executors

2015-06-13 Thread Sandy Ryza
Hi Patrick, I'm noticing that you're using Spark 1.3.1. We fixed a bug in dynamic allocation in 1.4 that permitted requesting negative numbers of executors. Any chance you'd be able to try with the newer version and see if the problem persists? -Sandy On Fri, Jun 12, 2015 at 7:42 PM, Patrick Wo

How to silence Parquet logging?

2015-06-13 Thread Chris Freeman
Hey everyone, I’m trying to figure out how to silence all of the logging info that gets printed to the console when dealing with Parquet files. I’ve seen that there have been several PRs addressing this issue, but I can’t seem to figure out how to actually change the logging config. I’ve alrea

Re: How to split log data into different files according to severity

2015-06-13 Thread Hao Wang
I am currently using filter inside a loop over all severity levels to do this, which I think is pretty inefficient: it has to read the entire data set once for each severity. I wonder if there is a more efficient way that takes just one pass over the data? Thanks. Best, Hao Wang > On Jun 13, 2015,
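The single-pass routing Hao is after, stripped of Spark, reduces to appending each line to a bucket keyed by its severity tag. A minimal plain-Python sketch (in-memory buckets standing in for per-severity output files; the log format "[LEVEL] message" is assumed from the example in the original question):

```python
import re
from collections import defaultdict

LEVEL = re.compile(r"^\[(\w+)\]")  # "[ERROR] log1" -> "ERROR"

def split_by_severity(lines):
    """One pass over the input: route each line to its severity bucket."""
    buckets = defaultdict(list)
    for line in lines:
        m = LEVEL.match(line)
        key = m.group(1) if m else "UNKNOWN"
        buckets[key].append(line)
    return buckets

logs = ["[ERROR] log1", "[INFO] log2", "[ERROR] log3"]
out = split_by_severity(logs)
# out["ERROR"] == ["[ERROR] log1", "[ERROR] log3"]
# out["INFO"]  == ["[INFO] log2"]
```

The per-severity filter-in-a-loop approach reads the data once per level; this routes every line in a single pass, which is the shape of the keyed-partitioning solutions discussed later in the thread.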

Re: How to split log data into different files according to severity

2015-06-13 Thread Will Briggs
Check out this recent post by Cheng Lian regarding dynamic partitioning in Spark 1.4: https://www.mail-archive.com/user@spark.apache.org/msg30204.html On June 13, 2015, at 5:41 AM, Hao Wang wrote: Hi, I have a bunch of large log files on Hadoop. Each line contains a log and its severity. Is

--packages & Failed to load class for data source v1.4

2015-06-13 Thread Don Drake
I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing Python Spark application against it and got the following error: py4j.protocol.Py4JJavaError: An error occurred while calling o90.save. : java.lang.RuntimeException: Failed to load class for data source: com.databricks.spar

How to read avro in SparkR

2015-06-13 Thread Shing Hing Man
Hi, I am trying to read an Avro file in SparkR (in Spark 1.4.0). I started R using the following. matmsh@gauss:~$ sparkR --packages com.databricks:spark-avro_2.10:1.0.0 Inside the R shell, when I issue the following, > read.df(sqlContext, "file:///home/matmsh/myfile.avro","avro") I get the follow

Re: [Spark] What is the most efficient way to do such a join and column manipulation?

2015-06-13 Thread Don Drake
Take a look at https://github.com/databricks/spark-csv to read in the tab-delimited file (change the default delimiter) and once you have that as a DataFrame, SQL can do the rest. https://spark.apache.org/docs/latest/sql-programming-guide.html -Don On Fri, Jun 12, 2015 at 8:46 PM, Rex X wrote
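Reading a tab-delimited file just means changing the delimiter, as Don notes for spark-csv. As a plain-Python illustration of the same parsing step (stdlib csv module, hypothetical two-column data; in Spark the equivalent is spark-csv with its delimiter option set to a tab):

```python
import csv
import io

# Hypothetical tab-separated input with a header row.
data = "id\tname\n1\talice\n2\tbob\n"

# DictReader with delimiter='\t' parses each row into a column->value dict,
# much like spark-csv infers a DataFrame schema from the header.
rows = list(csv.DictReader(io.StringIO(data), delimiter="\t"))
# rows[0] == {"id": "1", "name": "alice"}
```

Once the rows carry named columns, joins and column manipulation can be expressed declaratively, which is exactly what registering the DataFrame for SQL buys you in Spark.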

What is most efficient to do a large union and remove duplicates?

2015-06-13 Thread Gavin Yue
I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so 5TB of data in total. The data is formatted as key \t value. After the union, I want to remove the duplicates among keys, so each key should be unique and have only one value. Here is what I am doing. folders = Array("folde
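In Spark this is typically a union of the folder RDDs followed by a reduceByKey that picks one value per key. The per-key dedup itself is simple; a hedged plain-Python sketch of that logic (tab-separated key/value lines assumed, keeping the first value seen per key, which is one arbitrary but valid choice of reduce function):

```python
def dedup_union(*sources):
    """Union several iterables of 'key\\tvalue' lines, keeping the first
    value seen for each key so that every key ends up with one value."""
    seen = {}
    for source in sources:
        for line in source:
            key, value = line.split("\t", 1)
            seen.setdefault(key, value)  # later duplicates are dropped
    return seen

folder1 = ["a\t1", "b\t2"]
folder2 = ["a\t9", "c\t3"]  # duplicate key 'a' across folders
merged = dedup_union(folder1, folder2)
# merged == {"a": "1", "b": "2", "c": "3"}
```

At 5TB the dict obviously does not fit on one machine; the point is only that the reduce function must be deterministic about which value wins, since Spark gives no ordering guarantee across the unioned inputs.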

Re: Dynamic allocator requests -1 executors

2015-06-13 Thread Patrick Woody
Hey Sandy, I'll test it out on 1.4. Do you have a bug number or PR that I could reference as well? Thanks! -Pat Sent from my iPhone > On Jun 13, 2015, at 11:38 AM, Sandy Ryza wrote: > > Hi Patrick, > > I'm noticing that you're using Spark 1.3.1. We fixed a bug in dynamic > allocation in 1

Dataframe Write : Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.

2015-06-13 Thread pth001
Hi, I am using Spark 1.4. I am trying to insert data into a Hive table (in ORC format) from a DataFrame. partitionedTestDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource") .mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("testorc") When this job is s

Re: Dataframe Write : Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.

2015-06-13 Thread Will Briggs
The context that is created by spark-shell is actually an instance of HiveContext. If you want to use it programmatically in your driver, you need to make sure that your context is a HiveContext, and not a SQLContext. https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables H

Re: Dataframe Write : Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.

2015-06-13 Thread pth001
I got it. Thanks! Patcharee On 13/06/15 23:00, Will Briggs wrote: The context that is created by spark-shell is actually an instance of HiveContext. If you want to use it programmatically in your driver, you need to make sure that your context is a HiveContext, and not a SQLContext. https://s

Re: Dynamic allocator requests -1 executors

2015-06-13 Thread Andrew Or
Hi Patrick, The fix you need is SPARK-6954: https://github.com/apache/spark/pull/5704. If possible, you may cherry-pick the following commit into your Spark deployment and it should resolve the issue: https://github.com/apache/spark/commit/98ac39d2f5828fbdad8c9a4e563ad1169e3b9948 Note that this

Re: Dataframe Write : Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.

2015-06-13 Thread Cheng Lian
As the error message says, were you using a SQLContext instead of a HiveContext to create the DataFrame? In Spark shell, although the variable name is sqlContext, the type of that variable is actually org.apache.spark.sql.hive.HiveContext, which has the ability to communicate with Hive

Re: How to silence Parquet logging?

2015-06-13 Thread Cheng Lian
Hi Chris, Which Spark version were you using? And could you provide some sample log lines you saw? Parquet uses java.util.logging internally and can't be controlled by log4j.properties. The most recent master branch should have muted most Parquet logs. However, it's known that if you explicitl

Re: spark-sql from CLI --->EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-13 Thread Sanjay Subramanian
hey guys I tried the following settings as well. No luck. --total-executor-cores 24 --executor-memory 4G BTW on the same cluster, Impala absolutely kills it: the same query takes 9 seconds, with no memory issues, no issues at all. In fact I am pretty disappointed with Spark-SQL. I have worked with Hive during the 0.9.

spark stream twitter question ..

2015-06-13 Thread Mike Frampton
Hi, I have a question about Spark Twitter stream processing in Spark 1.3.1. The code sample below just opens up a Twitter stream, uses auth keys, splits out hash tags and creates a temp table. However, when I try to compile it using sbt (CentOS 6.5) I get the error [error] /home/hadoop/spark/

Re: [Spark] What is the most efficient way to do such a join and column manipulation?

2015-06-13 Thread Rex X
Thanks, Don! Does SQL implementation of spark do parallel processing on records by default? -Rex On Sat, Jun 13, 2015 at 10:13 AM, Don Drake wrote: > Take a look at https://github.com/databricks/spark-csv to read in the > tab-delimited file (change the default delimiter) > > and once you have

Re: spark-sql from CLI --->EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-13 Thread Josh Rosen
Try using Spark 1.4.0 with SQL code generation turned on; this should make a huge difference. On Sat, Jun 13, 2015 at 5:08 PM, Sanjay Subramanian < sanjaysubraman...@yahoo.com> wrote: > hey guys > > I tried the following settings as well. No luck > > --total-executor-cores 24 --executor-memory 4G

How to set up a Spark Client node?

2015-06-13 Thread MrAsanjar .
I have the following Hadoop & Spark cluster node configuration: Nodes 1 & 2 are the resourceManager and nameNode respectively. Nodes 3, 4, and 5 each include a nodeManager & dataNode. Node 7 is the Spark master, configured to run yarn-client or yarn-master modes. I have tested it and it works fine. Is there any ins

Re: [Spark] What is the most efficient way to do such a join and column manipulation?

2015-06-13 Thread Michael Armbrust
Yes, it's all just RDDs under the covers. DataFrames/SQL is just a more concise way to express your parallel programs. On Sat, Jun 13, 2015 at 5:25 PM, Rex X wrote: > Thanks, Don! Does SQL implementation of spark do parallel processing on > records by default? > > -Rex > > > > On Sat, Jun 13, 20

Re: How to read avro in SparkR

2015-06-13 Thread Burak Yavuz
Hi, Not sure if this is it, but could you please try "com.databricks.spark.avro" instead of just "avro". Thanks, Burak On Jun 13, 2015 9:55 AM, "Shing Hing Man" wrote: > Hi, > I am trying to read a avro file in SparkR (in Spark 1.4.0). > > I started R using the following. > matmsh@gauss:~$ spa

Re: How to read avro in SparkR

2015-06-13 Thread Shivaram Venkataraman
Yep - Burak's answer should work. FWIW the error message from the stack trace that shows this is the line "Failed to load class for data source: avro" Thanks Shivaram On Sat, Jun 13, 2015 at 6:13 PM, Burak Yavuz wrote: > Hi, > Not sure if this is it, but could you please try > "com.databricks.s

Job marked as killed in spark 1.4

2015-06-13 Thread nizang
hi, I have a running and working cluster with spark 1.3.1, and I tried to install a new cluster that is working with spark 1.4.0 I ran a job on the new 1.4.0 cluster, and the same job on the old 1.3.1 cluster After the job finished (in both clusters), I entered the job in the UI, and in the new

Re: Not able to run FP-growth Example

2015-06-13 Thread masoom alam
These two imports are missing and thus FP-growth is not compiling... import org.apache.spark.*mllib.fpm.FPGrowth*; import org.apache.spark.*mllib.fpm.FPGrowthModel*; How to include the dependency in the POM file? On Sat, Jun 13, 2015 at 4:26 AM, masoom alam wrote: > Thanks for the answer. Any
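To pull in those mllib.fpm classes, the POM needs a dependency on the spark-mllib artifact alongside spark-core. A sketch of the missing dependency block, assuming Scala 2.10 artifacts (matching the spark-core_2.10 already in the POM) and a 1.4.0 version, which should be kept in sync with the spark-core version actually used:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.4.0</version>
</dependency>
```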