code generation memory issue

2016-11-21 Thread geoHeil
I am facing a strange issue when trying to correct some errors in my raw data. The problem is reported here: https://issues.apache.org/jira/browse/SPARK-18532 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/code-generation-memory-issue-tp28114.html Sent

Re: newbie question about RDD

2016-11-21 Thread Raghav
Sorry, I forgot to ask: how can I use the Spark context here? I have the HDFS directory path of the files, as well as the name node of the HDFS cluster. Thanks for your help. On Mon, Nov 21, 2016 at 9:45 PM, Raghav wrote: > Hi > > I am extremely new to Spark. I have to read a file
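For reference, a minimal sketch of creating a SparkContext and pointing it at an HDFS directory; the namenode host, port, and path below are placeholders, not values from this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("read-hdfs-example")
val sc = new SparkContext(conf)

// textFile accepts a full HDFS URI; a directory path reads every file under it.
val lines = sc.textFile("hdfs://namenode.example.com:8020/data/people/")
println(lines.count())
```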

newbie question about RDD

2016-11-21 Thread Raghav
Hi I am extremely new to Spark. I have to read a file from HDFS and get it in memory in RDD format. I have a Java class as follows: class Person { private long UUID; private String FirstName; private String LastName; private String zip; // public methods } The file in
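The file layout is not shown in the thread, so the sketch below assumes one comma-separated line per record and a placeholder HDFS path; it maps each line to a Scala case-class stand-in for the Person class described above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Scala stand-in for the Java Person class from the question.
case class Person(uuid: Long, firstName: String, lastName: String, zip: String)

object ReadPeople {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-people"))

    // Assumed layout: uuid,firstName,lastName,zip
    val people: RDD[Person] = sc
      .textFile("hdfs://namenode.example.com:8020/data/people.csv")
      .map(_.split(","))
      .map(f => Person(f(0).toLong, f(1), f(2), f(3)))

    people.take(5).foreach(println)
    sc.stop()
  }
}
```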

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-21 Thread kant kodali
Hi Michael, I only see Spark 2.0.2, which is what I am using currently. Any idea on when 2.1 will be released? Thanks, kant On Mon, Nov 21, 2016 at 5:12 PM, Michael Armbrust wrote: > In Spark 2.1 we've added a from_json >

Re: RDD Partitions not distributed evenly to executors

2016-11-21 Thread Thunder Stumpges
Has anyone figured this out yet!? I have gone looking for this exact problem (Spark 1.6.1) and I cannot get my partitions to be distributed evenly across executors no matter what I've tried. It has been mentioned several other times in the user group as well as the dev group (as mentioned by Mike

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-21 Thread Michael Armbrust
In Spark 2.1 we've added a from_json function that I think will do what you want. On Fri, Nov 18, 2016 at 2:29 AM, kant kodali wrote: > This seem to work > >
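A minimal sketch of what from_json could look like in Spark 2.1, using a made-up one-row DataFrame and schema (the original poster's JSON layout is not shown in the thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("from-json-example").getOrCreate()
import spark.implicits._

// A string column holding a JSON blob (example data only).
val df = Seq("""{"name":"alice","age":30}""").toDF("json")

// Declare the schema of the blob and turn it into a nested struct column.
val schema = new StructType().add("name", StringType).add("age", IntegerType)
val parsed = df.select(from_json($"json", schema).as("data")).select("data.*")
parsed.show()
```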

Re: Stateful aggregations with Structured Streaming

2016-11-21 Thread Michael Armbrust
We are planning on adding mapWithState or something similar in a future release. In the meantime, standard DataFrame aggregations should work (count, sum, etc.). If you are looking to do something custom, I'd suggest looking at Aggregators
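As an illustration of the Aggregators mentioned above, here is a minimal typed Aggregator shown against a small batch Dataset; it is a generic sketch, not the streaming-specific state API discussed in the message:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// A trivial typed Aggregator that sums Long values; custom state goes in the buffer type.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder().appName("aggregator-example").getOrCreate()
import spark.implicits._

val ds = Seq(1L, 2L, 3L).toDS()
ds.select(LongSum.toColumn.name("total")).show()
```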

Re: Create a Column expression from a String

2016-11-21 Thread Stuart White
Yes, that's what I was looking for. Thanks! On Mon, Nov 21, 2016 at 6:56 PM, Michael Armbrust wrote: > You are looking for org.apache.spark.sql.functions.expr() > > On Sat, Nov 19, 2016 at 6:12 PM, Stuart White > wrote: >> >> I'd like to allow

Re: Spark 2.0.2, Structured Streaming with kafka source... Unable to parse the value to Object..

2016-11-21 Thread Michael Armbrust
You could also do this with Datasets, which will probably be a little more efficient (since you are telling us you only care about one column): ds1.select($"value".as[Array[Byte]]).map(Student.parseFrom) On Thu, Nov 17, 2016 at 1:05 PM, shyla deshpande wrote: > Hello
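Expanded into a self-contained sketch, that suggestion might look as follows; the broker, topic, and the Student stand-in (with its parseFrom) are placeholders for whatever the original poster generated (e.g. from protobuf), and the Kafka source needs the spark-sql-kafka-0-10 package on the classpath:

```scala
import org.apache.spark.sql.SparkSession

// Stand-in for the poster's generated Student class and its byte-array parser.
case class Student(name: String)
object Student {
  def parseFrom(bytes: Array[Byte]): Student = Student(new String(bytes, "UTF-8"))
}

val spark = SparkSession.builder().appName("kafka-dataset-example").getOrCreate()
import spark.implicits._

val ds1 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "students")
  .load()

// Keep only the value column as raw bytes and deserialize each record.
val students = ds1.select($"value".as[Array[Byte]]).map(Student.parseFrom)

students.writeStream.format("console").start().awaitTermination()
```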

Re: Create a Column expression from a String

2016-11-21 Thread Michael Armbrust
You are looking for org.apache.spark.sql.functions.expr() On Sat, Nov 19, 2016 at 6:12 PM, Stuart White wrote: > I'd like to allow for runtime-configured Column expressions in my > Spark SQL application. For example, if my application needs a 5-digit > zip code, but
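A minimal sketch of expr() against the 5-digit zip code example from the quoted message; the padding expression itself is an assumption about the desired transformation, and the string could just as well come from runtime configuration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("expr-example").getOrCreate()
import spark.implicits._

val df = Seq("123", "4567").toDF("zip")

// The expression is an ordinary SQL fragment, so it can be loaded from config at runtime.
val zipExpr = "lpad(zip, 5, '0')"
df.withColumn("zip5", expr(zipExpr)).show()
```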

Re: SparkILoop doesn't run

2016-11-21 Thread Jakob Odersky
The issue I was having had to do with missing classpath settings; in sbt it can be solved by setting `fork := true` to run tests in new JVMs with appropriate classpaths. Mohit, from the looks of the error message, it also appears to be some classpath issue. This typically happens when there are
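For anyone hitting the same problem, the sbt setting mentioned above would look roughly like this in build.sbt (a generic sketch, not the build file from this thread):

```scala
// build.sbt -- fork new JVMs so tests and `run` get a proper classpath.
fork := true

// Or scope it to tests only (sbt 0.13 syntax):
// fork in Test := true
```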

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread yeshwanth kumar
Thanks for your reply. I can definitely change the underlying compression format, but I am trying to understand the locality level: why did the executor run on a different node, where the blocks are not present, when the locality level is RACK_LOCAL? Can you shed some light on this? Thanks, Yesh

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread Jörn Franke
Use ORC, Parquet, or Avro as the format, because they support any compression type with parallel processing. Alternatively, split your file into several smaller ones. Another alternative would be bzip2 (but slower in general) or LZO (usually not included by default in many distributions). > On
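A sketch of the first suggestion, rewriting the single compressed CSV as Parquet so scans can parallelize; the paths and options are placeholders, not values from this thread:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("format-example").getOrCreate()

// Read the existing CSV, then rewrite it as Snappy-compressed Parquet,
// which is splittable and column-oriented.
val df = spark.read.option("header", "true").csv("hdfs:///data/input.csv")
df.write
  .option("compression", "snappy")
  .parquet("hdfs:///data/input_parquet")
```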

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread Aniket Bhatnagar
Try changing compression to bzip2 or lzo. For reference - http://comphadoop.weebly.com Thanks, Aniket On Mon, Nov 21, 2016, 10:18 PM yeshwanth kumar wrote: > Hi, > > we are running Hive on Spark, we have an external table over snappy > compressed csv file of size 917.4 M

Re: sort descending with multiple columns

2016-11-21 Thread Sreekanth Jella
Yes, thank you. Thanks, Sreekanth, +1 (571) 376-0714 On Nov 18, 2016 6:33 AM, "Stuart White" wrote: > Is this what you're looking for? > > val df = Seq( > (1, "A"), > (1, "B"), > (1, "C"), > (2, "D"), > (3, "E") > ).toDF("foo", "bar") > > val colList =
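For readers landing on this thread, a compact sketch of the approach from the quoted reply, sorting descending on a runtime-supplied list of columns (the sample data matches the quoted example):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val spark = SparkSession.builder().appName("sort-example").getOrCreate()
import spark.implicits._

val df = Seq((1, "A"), (1, "B"), (1, "C"), (2, "D"), (3, "E")).toDF("foo", "bar")

// The column list could be built at runtime; desc() turns each name into a descending sort key.
val colList = Seq("foo", "bar")
df.orderBy(colList.map(desc): _*).show()
```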

Re: SparkILoop doesn't run

2016-11-21 Thread Jakob Odersky
Trying it out locally gave me an NPE. I'll look into it in more detail; however, the SparkILoop.run() method is dead code. It's used nowhere in Spark and can be removed without any issues. On Thu, Nov 17, 2016 at 11:16 AM, Mohit Jaggi wrote: > Thanks Holden. I did post to

RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread yeshwanth kumar
Hi, we are running Hive on Spark. We have an external table over a Snappy-compressed CSV file of size 917.4 MB, and the HDFS block size is set to 256 MB. As per my understanding, if I run a query over that external table, it should launch 4 tasks, one for each block, but I am seeing one executor and one
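A quick way to confirm the behavior described above from spark-shell (the path is a placeholder): Snappy-compressed plain text/CSV is not splittable, so a single ~917 MB file typically comes back as one partition regardless of the 256 MB block size.

```scala
// Assumes a spark-shell `sc`; the path is hypothetical.
val rdd = sc.textFile("hdfs:///warehouse/external_table/data.csv.snappy")
println(rdd.getNumPartitions)  // likely 1 for a non-splittable file

// Repartitioning restores parallelism for downstream stages,
// but the initial decompression is still a single task.
val parallel = rdd.repartition(4)
```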

Potential memory leak in yarn ApplicationMaster

2016-11-21 Thread Spark User
Hi All, It seems like the heap usage for org.apache.spark.deploy.yarn.ApplicationMaster keeps growing continuously. The driver eventually crashes with an OOM. More details: I have a Spark Streaming app that runs on Spark 2.0. The spark.driver.memory is 10G and spark.yarn.driver.memoryOverhead is

Re: How to write a custom file system?

2016-11-21 Thread Samy Dindane
We don't use HDFS but GlusterFS which works like your typical local POSIX file system. On 11/21/2016 06:49 PM, Jörn Franke wrote: Once you configured a custom file system in Hadoop it can be used by Spark out of the box. Depending what you implement in the custom file system you may think

Re: Starting a new Spark codebase, Python or Scala / Java?

2016-11-21 Thread Anthony May
A sensible default strategy is to use the same language in which a system was developed or a highly compatible language. That would be Scala for Spark, however I assume you don't currently know Scala to the same degree as Python or at all. In which case to help you make the decision you should

Cluster deploy mode driver location

2016-11-21 Thread Saif.A.Ellafi
Hello there, I have a Spark program on 1.6.1; however, when I submit it to the cluster, it randomly picks the driver node. I know there is a driver specification option, but along with it, it is mandatory to define many other options I am not familiar with. The trouble is, the .jars I am launching need

Re: Starting a new Spark codebase, Python or Scala / Java?

2016-11-21 Thread Jon Gregg
Spark is written in Scala, so yes it's still the strongest option. You also get the Dataset type with Scala (compile time type-safety), and that's not an available feature with Python. That said, I think the Python API is a viable candidate if you use Pandas for Data Science. There are

Starting a new Spark codebase, Python or Scala / Java?

2016-11-21 Thread Brandon White
Hello all, I will be starting a new Spark codebase and I would like to get opinions on using Python over Scala. Historically, the Scala API has always been the strongest interface to Spark. Is this still true? Are there still many benefits and additional features in the Scala API that are not

Re: How to write a custom file system?

2016-11-21 Thread Jörn Franke
Once you have configured a custom file system in Hadoop, it can be used by Spark out of the box. Depending on what you implement in the custom file system, you may need to think about side effects on any application, including Spark (memory consumption, etc.). > On 21 Nov 2016, at 18:26, Samy Dindane

How to write a custom file system?

2016-11-21 Thread Samy Dindane
Hi, I'd like to extend the file:// file system and add some custom logic to the API that lists files. I think I need to extend FileSystem or LocalFileSystem from org.apache.hadoop.fs, but I am not sure how to go about it exactly. How to write a custom file system and make it usable by Spark?
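One possible shape of the answer, sketched under the assumption that decorating LocalFileSystem with custom listing logic is enough; the scheme name, filter rule, and class name are made up for illustration:

```scala
import org.apache.hadoop.fs.{FileStatus, LocalFileSystem, Path}

// A file system that behaves like file:// but customizes the listing API.
class FilteredLocalFileSystem extends LocalFileSystem {
  override def getScheme: String = "filteredfs"

  // Custom listing logic goes here; this example hides files starting with "_tmp".
  override def listStatus(path: Path): Array[FileStatus] =
    super.listStatus(path).filterNot(_.getPath.getName.startsWith("_tmp"))
}

// To let Spark's Hadoop layer resolve the new scheme, register the implementation, e.g.:
//   spark.hadoop.fs.filteredfs.impl = com.example.FilteredLocalFileSystem
// (spark.hadoop.* properties are forwarded to the Hadoop Configuration.)
```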

Pasting into spark-shell doesn't work for Databricks example

2016-11-21 Thread jggg777
I'm simply pasting in the UDAF example from this page and getting errors (basic EMR setup with Spark 2.0): https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/03%20UDF%20and%20UDAF%20-%20scala.html The imports appear to work, but then
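Not the Databricks example itself, but a minimal UDAF that compiles under Spark 2.0 and can be used to check whether the shell setup is at fault; pasting the whole definition with the shell's :paste mode, rather than line by line, often avoids REPL wrapping errors:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// A trivial sum-of-doubles UDAF, for smoke-testing the shell.
class SumUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = new StructType().add("value", DoubleType)
  def bufferSchema: StructType = new StructType().add("sum", DoubleType)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit =
    b1(0) = b1.getDouble(0) + b2.getDouble(0)
  def evaluate(buffer: Row): Double = buffer.getDouble(0)
}
```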

Re: Linear regression + Janino Exception

2016-11-21 Thread Kazuaki Ishizaki
Thank you for reporting the error. I think that this is associated to https://issues.apache.org/jira/browse/SPARK-18492 The reporter of this JIRA entry has not posted the program yet. Would it be possible to add your program that can reproduce this issue to this JIRA entry? Regards, Kazuaki

Re: using StreamingKMeans

2016-11-21 Thread Julian Keppel
I do research in anomaly detection with machine learning methods at the moment, and currently I do k-means clustering, too, in an offline learning setting. In further work we want to compare the two paradigms of offline and online learning. I would like to share some thoughts on this discussion.
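For anyone comparing the offline and online approaches mentioned here, a small sketch of the online side with MLlib's StreamingKMeans; the source, dimensionality, and parameters are placeholders:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes an existing SparkContext `sc` (e.g. from spark-shell).
val ssc = new StreamingContext(sc, Seconds(10))

// Training vectors arriving as text files in "[x1,x2,...]" format.
val trainingData = ssc.textFileStream("hdfs:///streams/train").map(Vectors.parse)

val model = new StreamingKMeans()
  .setK(5)             // number of clusters
  .setDecayFactor(0.9) // how quickly older data is forgotten
  .setRandomCenters(10, 0.0)

model.trainOn(trainingData)
ssc.start()
```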

Re: Flume integration

2016-11-21 Thread Ian Brooks
Hi Mich, Thanks. I would prefer not to add another system into the mix, as we currently don't use Kafka at all. We are still in the prototype phase at the moment and it seems to be working well, though it doesn't like you restarting the Flume sink part without restarting the Spark application.

Re: Flume integration

2016-11-21 Thread Mich Talebzadeh
Hi Ian, Flume is great for ingesting data into HDFS and HBase. However, that is part of the batch layer. For real-time processing, I would go through Kafka into Spark Streaming. Apart from your case, I have not established whether anyone else does Flume directly into Spark. If so, how mature is it? Thanks

Re: Flume integration

2016-11-21 Thread Ian Brooks
*-Ian* Hi, While I am following this discussion with interest, I am trying to comprehend any architectural benefit of a Spark sink. Is there any feature in Flume that makes it more suitable for ingesting stream data than Spark Streaming, so that we should chain them? For example, does it help

Will different receivers run on different worker?

2016-11-21 Thread Cyanny LIANG
Hi, I am new to Spark Streaming. In our project we want to implement a custom receiver to subscribe to our log data. I have two questions: 1. Do multiple DStream receivers run in different processes or different threads? 2. When we union multiple DStreams, such as 10 DStreams, we observed that Spark will create 10
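As context for both questions, a small sketch of creating several receiver-based streams and unioning them (hosts and ports are placeholders); each receiver runs as a long-running task on some executor, so ten receivers occupy ten task slots even after the union:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes an existing SparkContext `sc`.
val ssc = new StreamingContext(sc, Seconds(5))

// One receiver per stream; union combines the data but keeps all ten receivers running.
val streams = (1 to 10).map(i => ssc.socketTextStream("log-host", 9000 + i))
val combined = ssc.union(streams)

combined.count().print()
ssc.start()
```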