Spark-Kafka integration - build failing with sbt

2017-06-16 Thread karan alang
I'm trying to compile Kafka & Spark Streaming integration code, i.e. reading from Kafka using Spark Streaming, and the sbt build is failing with the error: [error] (*:update) sbt.ResolveException: unresolved dependency: org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not found. Scala version
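
A likely fix, assuming Kafka 0.10+ brokers: the unversioned spark-streaming-kafka artifact was last published for Spark 1.6.x, so a 2.1.0 build needs the Kafka-versioned artifact instead. A minimal build.sbt sketch (version numbers illustrative):

    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      // "provided" because the Spark runtime supplies this at spark-submit time
      "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided",
      // use spark-streaming-kafka-0-8 instead if the brokers are older
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0"
    )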

Error while doing mvn release for spark 2.0.2 using scala 2.10

2017-06-16 Thread Kanagha Kumar
Hey all, I'm trying to use Spark 2.0.2 with Scala 2.10 by following https://spark.apache.org/docs/2.0.2/building-spark.html#building-for-scala-210:

./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package

I could build the distribution

Re: Spark SQL within a DStream map function

2017-06-16 Thread Burak Yavuz
Do you really need to create a DStream from the original messaging queue? Can't you just read them in a while loop or something on the driver? On Fri, Jun 16, 2017 at 1:01 PM, Mike Hugo wrote: > Hello, > > I have a web application that publishes JSON messages on to a messaging
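
A rough sketch of that driver-side approach (Msg and pollMessage are hypothetical stand-ins for the real queue client and message shape):

    import org.apache.spark.sql.SparkSession

    case class Msg(csvPath: String, outputPath: String)

    val spark = SparkSession.builder.appName("csv-loader").getOrCreate()

    while (true) {
      val msg: Msg = pollMessage()          // blocking read from the queue
      val df = spark.read
        .option("header", "true")
        .csv(msg.csvPath)                   // e.g. an s3a:// link from the JSON
      df.createOrReplaceTempView("doc")
      spark.sql("SELECT * FROM doc")        // apply the metadata-driven transform
        .write.parquet(msg.outputPath)
    }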

Spark SQL within a DStream map function

2017-06-16 Thread Mike Hugo
Hello, I have a web application that publishes JSON messages onto a messaging queue that contain metadata and a link to a CSV document on S3. I'd like to iterate over these JSON messages, and for each one pull the CSV document into Spark SQL to transform it (based on the metadata in the JSON

Fwd: Repartition vs PartitionBy Help/Understanding needed

2017-06-16 Thread Aakash Basu
Hi all, Can somebody shed some light on this, please? Thanks, Aakash. -- Forwarded message -- From: "Aakash Basu" Date: 15-Jun-2017 2:57 PM Subject: Repartition vs PartitionBy Help/Understanding needed To: "user" Cc: Hi all,
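
For what it's worth, the two operate at different levels: repartition controls the in-memory shuffle partitioning of a DataFrame, while the writer's partitionBy controls the on-disk directory layout. A minimal sketch (paths and column names illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()
    val df = spark.read.parquet("/data/events")          // hypothetical input

    // shuffle into 200 partitions, clustered by event_date in memory
    val byDate = df.repartition(200, df("event_date"))

    // write one directory per value: /data/out/event_date=2017-06-16/...
    byDate.write.partitionBy("event_date").parquet("/data/out")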

Re: Max number of columns

2017-06-16 Thread Jan Holmberg
Oops, wrong user list. Sorry. :-) > On 16 Jun 2017, at 10.44, Jan Holmberg wrote: > > Hi, > I ran into Kudu limitation of max columns (300). Same limit seemed to apply > latest Kudu version as well but not ex. Impala/Hive (in the same extent at > least). > * is this

Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity

2017-06-16 Thread Georg Heiler
I assume you want to have this lifecycle in order to create big/heavy/complex objects only once (per partition). mapPartitions should fit this use case pretty well. RD wrote on Fri, 16 Jun 2017 at 17:37: > Thanks Georg. But I'm not sure how mapPartitions is relevant
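
A minimal sketch of that pattern (ExpensiveClient is a hypothetical stand-in for something costly to construct, e.g. a connection, parser, or model):

    import org.apache.spark.sql.SparkSession

    class ExpensiveClient {
      def process(s: String): String = s.toUpperCase
    }

    val spark = SparkSession.builder.getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"))

    // The client is built once per partition, not once per element.
    val out = rdd.mapPartitions { iter =>
      val client = new ExpensiveClient()
      iter.map(client.process)
    }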

Re: What is the charting library used by Databricks UI?

2017-06-16 Thread kant kodali
I have a Chrome plugin that detected all the JS libraries! So to answer my own question: it's D3.js. On Fri, Jun 16, 2017 at 6:12 AM, Mahesh Sawaiker < mahesh_sawai...@persistent.com> wrote: > Is there a live URL on the internet, where I can see the UI? I could help by > checking the JS code in Firebug. >

RE: spark-submit: file not found exception occurs

2017-06-16 Thread LisTree Team
You may need to use an HDFS file, not a local file, under YARN. Original Message Subject: spark-submit: file not found exception occurs From: Shupeng Geng Date: Thu, June 15, 2017 8:14 pm To: "user@spark.apache.org" ,
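
For example (hypothetical paths), the input is staged in HDFS first and referenced with an hdfs:// URL:

    hdfs dfs -put data.txt /user/me/data.txt
    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.Main myapp.jar hdfs:///user/me/data.txt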

Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity

2017-06-16 Thread RD
Thanks Georg. But I'm not sure how mapPartitions is relevant here. Can you elaborate? On Thu, Jun 15, 2017 at 4:18 AM, Georg Heiler wrote: > What about using map partitions instead? > > RD schrieb am Do. 15. Juni 2017 um 06:52: > >> Hi Spark

Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Pralabh Kumar
Hi Saatvik, You can write your own transformer to make sure that the column contains only the values you provided, and filter out rows that don't follow the same. Something like this: case class CategoryTransformer(override val uid : String) extends Transformer{ override def transform(inputData:
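
A minimal sketch of such a transformer (assuming Spark ML 2.x; the column name and allowed values are illustrative):

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.StructType

    // Drops rows whose EMOTION value is outside the allowed category set.
    class CategoryTransformer(override val uid: String) extends Transformer {
      def this() = this(Identifiable.randomUID("categoryTransformer"))

      private val allowed = Seq("HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA")

      override def transform(inputData: Dataset[_]): DataFrame =
        inputData.toDF().filter(col("EMOTION").isin(allowed: _*))

      override def transformSchema(schema: StructType): StructType = schema

      override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
    }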

Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Saatvik Shah
Hi Pralabh, I want the ability to create a column such that its values are restricted to a specific set of predefined values. For example, suppose I have a column called EMOTION: I want to ensure each row value is one of HAPPY, SAD, ANGRY, NEUTRAL, NA. Thanks and Regards, Saatvik Shah On Fri, Jun

Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Pralabh Kumar
Hi Saatvik, Can you please provide an example of what exactly you want? On 16-Jun-2017 7:40 PM, "Saatvik Shah" wrote: > Hi Yan, > > Basically the reason I was looking for the categorical datatype is as > given here

Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Saatvik Shah
Hi Yan, Basically the reason I was looking for the categorical datatype is as given here: the ability to fix column values to specific categories. Is it possible to create a user-defined data type which could do so? Thanks and Regards,

RE: What is the charting library used by Databricks UI?

2017-06-16 Thread Mahesh Sawaiker
Is there a live URL on the internet where I can see the UI? I could help by checking the JS code in Firebug. From: kant kodali [mailto:kanth...@gmail.com] Sent: Friday, June 16, 2017 1:26 PM To: user @spark Subject: What is the charting library used by Databricks UI? Hi All, I am wondering what

Re: access a broadcasted variable from within ForeachPartitionFunction Java API

2017-06-16 Thread Ryan
I don't think Broadcast itself can be serialized. You can get the value out on the driver side and refer to it in foreach; then the value would be serialized with the lambda expression and sent to the workers. On Fri, Jun 16, 2017 at 2:29 AM, Anton Kravchenko < kravchenko.anto...@gmail.com> wrote: > How
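
A sketch of that pattern (Scala for brevity; the lookup map is illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()
    val lookup = Map("k" -> 1)              // plain value, built on the driver

    spark.range(10).foreachPartition { rows =>
      // lookup is captured by the closure and shipped with each task
      rows.foreach(r => println(lookup.getOrElse("k", 0)))
    }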

[Error] Python version mismatch in CDH cluster when running pyspark job

2017-06-16 Thread Divya Gehlot
Hi, I have a CDH cluster and am running a pyspark script in client mode. There are different Python versions installed on the client and worker nodes, and I was getting a Python version mismatch error. To resolve this issue I followed the Cloudera document below
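
The usual resolution (a sketch; interpreter paths are illustrative) is to point the driver and the YARN containers at the same interpreter:

    export PYSPARK_PYTHON=/usr/bin/python2.7
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python2.7
    spark-submit --master yarn \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python2.7 \
      my_script.py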

What is the charting library used by Databricks UI?

2017-06-16 Thread kant kodali
Hi All, I am wondering what charting library the Databricks UI uses to display graphs in real time while streaming jobs run? Thanks!

Max number of columns

2017-06-16 Thread Jan Holmberg
Hi, I ran into the Kudu limitation of max columns (300). The same limit seemed to apply to the latest Kudu version as well, but not to e.g. Impala/Hive (to the same extent at least). * is this limitation going to be loosened in the near future? * any suggestions how to get over this limitation? Table splitting is the

RE: [SparkSQL] Escaping a query for a dataframe query

2017-06-16 Thread mark.jenki...@baesystems.com
Thanks both! FYI the suggestion to escape the quote does not seem to work. I should have mentioned I am using Spark 1.6.2 and have tried to escape the double quote with \\ and . My gut feeling is that escape chars are not considered for UDF parameters in this version of Spark – I would like
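
One workaround worth trying (a sketch, not verified on 1.6.2; myUdf and table t are illustrative): wrap the literal in SQL single quotes so the embedded double quotes need no escaping at all:

    // SQL single-quoted literals may contain double quotes verbatim
    val result = sqlContext.sql("""SELECT myUdf('a "quoted" value') FROM t""")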