Re: Spark 2 and existing code with sqlContext

2016-08-12 Thread Koert Kuipers
you can get it from the SparkSession for backwards compatibility: val sqlContext = spark.sqlContext On Mon, Aug 8, 2016 at 9:11 AM, Mich Talebzadeh wrote: > Hi, > > In Spark 1.6.1 this worked > > scala> sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(),
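For illustration, a minimal spark-shell sketch of the suggestion above; the query is an arbitrary example, not the one from the original thread:

```scala
// Spark 2.x shell: `spark` is the SparkSession; the legacy SQLContext is
// still reachable from it, so 1.6-style code keeps compiling.
val sqlContext = spark.sqlContext

// Example only -- any existing sqlContext-based code should work the same way.
sqlContext.sql("SELECT from_unixtime(unix_timestamp()) AS now")
  .collect()
  .foreach(println)
```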

Re: Spark 2 and existing code with sqlContext

2016-08-12 Thread Jacek Laskowski
Hi, Also, in the shell you have the sql function available without the object. Jacek On 8 Aug 2016 6:11 a.m., "Mich Talebzadeh" wrote: > Hi, > > In Spark 1.6.1 this worked > > scala> sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/ > HH:mm:ss.ss')

Re: Spark 2 and existing code with sqlContext

2016-08-12 Thread Jacek Laskowski
What about the following : val sqlContext = spark ? On 8 Aug 2016 6:11 a.m., "Mich Talebzadeh" wrote: > Hi, > > In Spark 1.6.1 this worked > > scala> sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/ > HH:mm:ss.ss') ").collect.foreach(println) >

Re: restart spark streaming app

2016-08-12 Thread Jacek Laskowski
Hi, I think it's cluster deploy mode. spark-submit --deploy-mode cluster --master yarn myStreamingApp.jar Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On

Re: Accessing HBase through Spark with Security enabled

2016-08-12 Thread Jacek Laskowski
Hi, How do you access HBase? What's the version of Spark? (I don't see spark packages in the stack trace) Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On

Re: Flattening XML in a DataFrame

2016-08-12 Thread Hyukjin Kwon
Hi Sreekanth, Assuming you are using Spark 1.x, I believe this code below: sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "emp").load("/tmp/sample.xml") .selectExpr("manager.id", "manager.name", "explode(manager.subordinates.clerk) as clerk") .selectExpr("id", "name",
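A sketch of how that flattening approach typically reads end to end (Spark 1.x with the spark-xml package); the clerk field names in the last step are hypothetical placeholders, since the original reply is truncated:

```scala
// Read the XML with schema inference, then peel the nested structure apart
// with selectExpr/explode instead of hard-coding a full schema.
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "emp")
  .load("/tmp/sample.xml")

df.printSchema()  // inspect the inferred nesting before flattening

val flattened = df
  .selectExpr("manager.id", "manager.name",
              "explode(manager.subordinates.clerk) as clerk")
  .selectExpr("id", "name", "clerk.cid", "clerk.cname")  // clerk field names are placeholders

flattened.show()
```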

Re: Spark 2 cannot create ORC table when CLUSTERED. This worked in Spark 1.6.1

2016-08-12 Thread Jacek Laskowski
Hi Mich, File a JIRA issue, as it seems they overlooked that part. Spark 2.0 has less and less HiveQL with more and more native support. (My take on this is that the days of Hive in Spark are numbered and Hive is going to disappear soon.) Pozdrawiam, Jacek Laskowski

Re: Single point of failure with Driver host crashing

2016-08-12 Thread Jacek Laskowski
Hi, I'm sure that cluster deploy mode would solve it very well. It'd then be up to the cluster to re-execute the driver, wouldn't it? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at

Re: Unable to run spark examples in eclipse

2016-08-12 Thread Jacek Laskowski
Hi, You need to add the spark-* libs to the project (using the Eclipse-specific way). The best bet would be to use Maven as the main project management tool and then import the project. Have you seen https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse ?

Re: Why I can't use broadcast var defined in a global object?

2016-08-12 Thread yaochunnan
Hi David, Thank you for the detailed reply. I understand what you said about the ideas behind broadcast variables. But I am still a little bit confused. In your reply, you said: *It has sent largeValue across the network to each worker already, and gave you a key to retrieve it.* So my question is,

Re: Using spark package XGBoost

2016-08-12 Thread janardhan shetty
I tried using the *sparkxgboost* package in the build.sbt file but it failed. Spark 2.0 Scala 2.11.8 Error: [warn] http://dl.bintray.com/spark-packages/maven/rotationsymmetry/sparkxgboost/0.2.1-s_2.10/sparkxgboost-0.2.1-s_2.10-javadoc.jar [warn]
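For reference, a guess at the build.sbt wiring involved (unverified); note that the artifact in the warning carries an `-s_2.10` suffix, i.e. it appears to be published only for Scala 2.10, which is a plausible reason it breaks under Scala 2.11.8 / Spark 2.0:

```scala
// Hypothetical build.sbt snippet -- coordinates taken from the warning URL above.
resolvers += "spark-packages" at "https://dl.bintray.com/spark-packages/maven/"

libraryDependencies += "rotationsymmetry" % "sparkxgboost" % "0.2.1-s_2.10"
```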

Accessing SparkConfig from mapWithState function

2016-08-12 Thread Govindasamy, Nagarajan
Hi, Is there a way to get access to SparkConfig from the mapWithState function? I am looking to implement logic that uses a config property inside the mapWithState function. Thanks, Raj
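One possible approach (a sketch, with hypothetical names such as `myapp.threshold` and `pairDStream`): read the property from the SparkConf on the driver and let the state function capture it in its closure, so the value travels to the executors along with the function:

```scala
import org.apache.spark.streaming.{State, StateSpec}

// Read the property once on the driver...
val threshold = ssc.sparkContext.getConf.getInt("myapp.threshold", 10)

// ...and capture it in the mapping function's closure.
val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (key, sum > threshold)   // the captured config value is available here
}

val stateDStream = pairDStream.mapWithState(StateSpec.function(mappingFunc))
```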

Why I can't use broadcast var defined in a global object?

2016-08-12 Thread yaochunnan
Hi all, Here is a simplified example to show my concern. This example contains 3 files with 3 objects, depending on spark 1.6.1. //file globalObject.scala import org.apache.spark.broadcast.Broadcast object globalObject { var br_value: Broadcast[Map[Int, Double]] = null } //file
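For comparison, a minimal sketch of the pattern that does work: the Broadcast handle is created on the driver and passed explicitly to the code that needs it, rather than read from a mutable var on a global object (each executor JVM initializes that object on its own, so the var stays null there):

```scala
import org.apache.spark.broadcast.Broadcast

object Lookup {
  // Takes the broadcast handle as an argument instead of reading a global var.
  def score(br: Broadcast[Map[Int, Double]], id: Int): Double =
    br.value.getOrElse(id, 0.0)
}

// Driver side (sc is the usual SparkContext):
val br = sc.broadcast(Map(1 -> 0.5, 2 -> 1.5))
val scored = sc.parallelize(Seq(1, 2, 3)).map(id => Lookup.score(br, id))
scored.collect().foreach(println)
```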

Flattening XML in a DataFrame

2016-08-12 Thread Sreekanth Jella
Hi Folks, I am trying flatten variety of XMLs using DataFrames. I'm using spark-xml package which is automatically inferring my schema and creating a DataFrame. I do not want to hard code any column names in DataFrame as I have lot of varieties of XML documents and each might be lot more

Re: dataframe row list question

2016-08-12 Thread ayan guha
You can use dot notation: select myList.vList from t where myList.nm='IP' On Fri, Aug 12, 2016 at 9:11 AM, vr spark wrote: > Hi Experts, > Please suggest > > On Thu, Aug 11, 2016 at 7:54 AM, vr spark wrote: > >> >> I have data which is json in this
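A DataFrame-API sketch of the same idea, assuming the schema shown in the quoted question (an array `myList` of structs with `nm` and `vList`); exploding the array first makes the per-element filter explicit:

```scala
import org.apache.spark.sql.functions.explode
import spark.implicits._   // sqlContext.implicits._ on 1.x

val exploded = df.select(explode($"myList").as("item"))
exploded
  .where($"item.nm" === "IP")
  .select($"item.vList")
  .show(false)
```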

restart spark streaming app

2016-08-12 Thread Shifeng Xiao
Hi folks, I am using Spark Streaming, and I am not clear whether there is a smart way to restart the app once it fails; currently we just have a cron job that checks every 2 or 5 minutes whether the job is running and restarts the app when necessary. According to the Spark Streaming guide: - *YARN* - Yarn
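Complementary to an external cron/monitor, a common in-app piece is to make the StreamingContext recoverable from a checkpoint, so a restarted driver resumes rather than starts from scratch. A minimal sketch (checkpoint path and batch interval are assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/myStreamingApp"   // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("myStreamingApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  // ... build the DStream graph here ...
  ssc.checkpoint(checkpointDir)
  ssc
}

// Recreates the context from the checkpoint if one exists, else builds it fresh.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```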

Using spark package XGBoost

2016-08-12 Thread janardhan shetty
Is there a DataFrame version of XGBoost in spark-ml? Has anyone used the sparkxgboost package?

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-12 Thread dhruve ashar
I see a similar issue being resolved recently: https://issues.apache.org/jira/browse/SPARK-15285 On Fri, Aug 12, 2016 at 3:33 PM, Aris wrote: > Hello folks, > > I'm on Spark 2.0.0 working with Datasets -- and despite the fact that > smaller data unit tests work on my

Re: Rebalancing when adding kafka partitions

2016-08-12 Thread Cody Koeninger
Hrrm, that's interesting. Did you try with the subscribe pattern, out of curiosity? I haven't tested repartitioning on the underlying new Kafka consumer, so it's possible I misunderstood something. On Aug 12, 2016 2:47 PM, "Srikanth" wrote: > I did try a test with spark 2.0 +
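For anyone following along, a sketch of the subscribe-pattern variant being referred to (spark-streaming-kafka-0-10; the topic pattern and kafkaParams are assumptions, and whether it actually picks up partitions added later was exactly the open question here):

```scala
import java.util.regex.Pattern
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte]](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.SubscribePattern[Array[Byte], Array[Byte]](
    Pattern.compile("my-topic.*"),   // subscribe by regex instead of a fixed topic list
    kafkaParams))
```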

Spark 2.0.0 JaninoRuntimeException

2016-08-12 Thread Aris
Hello folks, I'm on Spark 2.0.0 working with Datasets -- and despite the fact that smaller data unit tests work on my laptop, when I'm on a cluster, I get cryptic error messages: Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method >

Re: Rebalancing when adding kafka partitions

2016-08-12 Thread Srikanth
I did try a test with spark 2.0 + spark-streaming-kafka-0-10-assembly. Partition was increased using "bin/kafka-topics.sh --alter" after spark job was started. I don't see messages from new partitions in the DStream. KafkaUtils.createDirectStream[Array[Byte], Array[Byte]] ( > ssc,

Re: Grid Search using Spark MLLib Pipelines

2016-08-12 Thread Adamantios Corais
Great. I like your second solution. But how can I make sure that cvModel holds the best model overall (as opposed to the last one that was tried out by the grid search)? In addition, do you have an idea how to collect the average error of each grid-search combination (here 1x1x1)? On 12/08/2016

Re: countDistinct, partial aggregates and Spark 2.0

2016-08-12 Thread Lee Becker
On Fri, Aug 12, 2016 at 11:55 AM, Lee Becker wrote: > val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", > "a"))).toDF("x", "y") > val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y")) > This workaround executes with no exceptions: val
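Another workaround sketch (not necessarily the one elided above): derive the distinct count from the collected set itself, so only a single aggregate is involved:

```scala
import org.apache.spark.sql.functions.{collect_set, size}

val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", "a"))).toDF("x", "y")
val grouped = df.groupBy($"x")
  .agg(collect_set($"y").as("ys"))
  .withColumn("distinct_y", size($"ys"))   // |collect_set| equals countDistinct
grouped.show()
```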

Re: KafkaUtils.createStream not picking smallest offset

2016-08-12 Thread Cody Koeninger
Are you checkpointing? Beyond that, why are you using createStream instead of createDirectStream? On Fri, Aug 12, 2016 at 12:32 PM, Diwakar Dhanuskodi wrote: > Okay . > I could delete the consumer group in zookeeper and start again to re > use same consumer
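A sketch of the direct-stream alternative being suggested (Spark 1.6 / Kafka 0.8 API), where the starting position is driven by auto.offset.reset in the Kafka params rather than by ZooKeeper consumer-group state; broker and topic names are assumptions:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",
  "auto.offset.reset"    -> "smallest")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))
```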

Re: Grid Search using Spark MLLib Pipelines

2016-08-12 Thread Bryan Cutler
You will need to cast bestModel to include the MLWritable trait. The class Model does not mix it in by default. For instance: cvModel.bestModel.asInstanceOf[MLWritable].save("/my/path") Alternatively, you could save the CV model directly, which takes care of this cvModel.save("/my/path") On
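On the second part of the question (the per-combination error), a sketch using what CrossValidatorModel already exposes; whether the metric is an error or a score depends on the evaluator used:

```scala
// avgMetrics has one entry per ParamMap in the grid, in the same order.
cvModel.avgMetrics
  .zip(cvModel.getEstimatorParamMaps)
  .foreach { case (metric, params) => println(s"$metric <- $params") }
```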

countDistinct, partial aggregates and Spark 2.0

2016-08-12 Thread Lee Becker
Hi everyone, I've started experimenting with my codebase to see how much work I will need to port it from 1.6.1 to 2.0.0. In regressing some of my dataframe transforms, I've discovered I can no longer pair a countDistinct with a collect_set in the same aggregation. Consider: val df =

Unable to run spark examples in eclipse

2016-08-12 Thread subash basnet
Hello all, I am completely new to Spark. I downloaded the Spark project from GitHub (https://github.com/apache/spark) and wanted to run the examples. I successfully ran the Maven command: build/mvn -DskipTests clean package But I am not able to build the spark-examples_2.11 project. There are

Mailing list

2016-08-12 Thread Inam Ur Rehman
UNSUBSCRIBE

How to add custom steps to Pipeline models?

2016-08-12 Thread evanzamir
I'm building an LDA Pipeline, currently with 4 steps, Tokenizer, StopWordsRemover, CountVectorizer, and LDA. I would like to add more steps, for example, stemming and lemmatization, and also 1-gram and 2-grams (which I believe is not supported by the default NGram class). Is there a way to add
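One way this is commonly done (a sketch, not an endorsement of the toy logic inside): write the extra step as a custom Transformer, for example by extending UnaryTransformer, and drop it into the Pipeline like any built-in stage. The "stemming" below is a placeholder, not a real stemmer:

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

class NaiveStemmer(override val uid: String)
    extends UnaryTransformer[Seq[String], Seq[String], NaiveStemmer] {

  def this() = this(Identifiable.randomUID("naiveStemmer"))

  // Placeholder "stemming": strip a trailing "s" from every token.
  override protected def createTransformFunc: Seq[String] => Seq[String] =
    _.map(_.stripSuffix("s"))

  override protected def outputDataType: DataType = ArrayType(StringType)
}

// Hypothetical wiring into the existing pipeline:
// new Pipeline().setStages(Array(tokenizer, remover,
//   new NaiveStemmer().setInputCol("filtered").setOutputCol("stemmed"),
//   countVectorizer, lda))
```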

Grid Search using Spark MLLib Pipelines

2016-08-12 Thread Adamantios Corais
Hi, Assuming that I have run the following pipeline and have got the best logistic regression model. How can I then save that model for later use? The following command throws an error: cvModel.bestModel.save("/my/path") Also, is it possible to get the error (a collection of) for each

PySpark read from HBase

2016-08-12 Thread Bin Wang
Hi there, I have lots of raw data in several Hive tables, where we built a workflow to "join" those records together and restructure them into HBase. It was done using plain MapReduce to generate HFiles and then load them incrementally into HBase to guarantee the best performance. However, we

[Spark 2.0] spark.sql.hive.metastore.jars doesn't work

2016-08-12 Thread Yan Facai
Hi, everyone. According to the official guide, I copied hdfs-site.xml, core-site.xml and hive-site.xml to $SPARK_HOME/conf, and wrote code as below: ```Java SparkSession spark = SparkSession .builder() .appName("Test Hive for Spark")
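For what it's worth, a sketch of supplying the same settings programmatically (values here are illustrative; they must be set before the first SparkSession with Hive support is created):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Test Hive for Spark")
  .config("spark.sql.hive.metastore.version", "1.2.1")            // illustrative version
  .config("spark.sql.hive.metastore.jars", "/path/to/hive/lib/*") // or "builtin" / "maven"
  .enableHiveSupport()
  .getOrCreate()
```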

RE: Spark join and large temp files

2016-08-12 Thread Ashic Mahtab
Hi Gourav, Thanks for your input. As mentioned previously, we've tried the broadcast. We've failed to broadcast 1.5GB... perhaps some tuning can help. We see CPU go up to 100%, and then workers die during the broadcast. I'm not sure if it's a good idea to broadcast that much, as Spark's broadcast
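For readers landing here, the broadcast under discussion is the DataFrame join hint sketched below (the join key is an assumption). Whether a ~1.5 GB table survives being broadcast also depends on driver/executor memory and on spark.sql.autoBroadcastJoinThreshold, which is the crux of this thread:

```scala
import org.apache.spark.sql.functions.broadcast

// Explicitly hint that smallDf should be shipped to every executor.
val joined = largeDf.join(broadcast(smallDf), Seq("id"))
```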

Spark's Logistic Regression runs unstable on Yarn cluster

2016-08-12 Thread olivierjeunen
I'm using pyspark ML's logistic regression implementation to do some classification on an AWS EMR YARN cluster. The cluster consists of 10 m3.xlarge nodes and is set up as follows: spark.driver.memory 10g, spark.driver.cores 3, spark.executor.memory 10g, spark.executor.cores 4. I enabled

Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-12 Thread Cheng Lian
OK, I've merged this PR to master and branch-2.0. On 8/11/16 8:27 AM, Cheng Lian wrote: Haven't figured out the exactly way how it failed, but the leading underscore in the partition directory name looks suspicious. Could you please try this PR to see whether it fixes the issue:

Re: Spark join and large temp files

2016-08-12 Thread Gourav Sengupta
The point is that if you have skewed data then one single reducer will ultimately take a very long time, and you do not even need to try this: just search Google and you will find that skewed data is a known problem in joins, even in Spark. Therefore, instead of using a join, in case the use case permits, just write a

Re: Log messages for shuffle phase

2016-08-12 Thread Jacek Laskowski
Hi, Have you looked at web UI? You should find such task metrics. Jacek On 11 Aug 2016 6:28 p.m., "Suman Somasundar" wrote: > Hi, > > > > While going through the logs of an application, I noticed that I could not > find any logs to dig deeper into any of the

Re: Losing executors due to memory problems

2016-08-12 Thread Bedrytski Aliaksandr
Hi Vinay, just out of curiosity, why are you converting your DataFrames into RDDs before the join? Join works quite well with DataFrames. As for your problem, it looks like you gave your executors more memory than you physically have. As an example of executor configuration: > Cluster of 6

Re: Losing executors due to memory problems

2016-08-12 Thread Koert Kuipers
You could have a very large key, perhaps a token value? I love the RDD API but have found that for joins DataFrame/Dataset performs better. Maybe you can do the joins in that? On Thu, Aug 11, 2016 at 7:41 PM, Muttineni, Vinay wrote: > Hello, > > I have a spark job that

type inference csv dates

2016-08-12 Thread Koert Kuipers
I generally like the type inference feature of the spark-sql CSV datasource; however, I have been stung several times by date inference. The problem is that when a column is converted to a date type the original data is lost: this is not a lossless conversion. And I often have a requirement where I
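One sketch of keeping the raw text (path and column names are assumptions): supply an explicit schema, or switch inference off, so date-like columns stay strings:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("event_date", StringType)))   // keep the original string

val df = spark.read
  .option("header", "true")
  .schema(schema)          // explicit schema: no inference, nothing is lost
  .csv("/tmp/events.csv")
```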

Re: dataframe row list question

2016-08-12 Thread vr spark
Hi Experts, Please suggest On Thu, Aug 11, 2016 at 7:54 AM, vr spark wrote: > > I have data which is json in this format > > myList: array > |||-- elem: struct > ||||-- nm: string (nullable = true) > ||||-- vList: array (nullable =

KafkaUtils.createStream not picking smallest offset

2016-08-12 Thread Diwakar Dhanuskodi
Hi, We are using Spark 1.6.1 and Kafka 0.9. KafkaUtils.createStream is showing strange behaviour: though auto.offset.reset is set to smallest, whenever we need to restart the stream it is picking up the latest offset, which is not expected. Do we need to set any

Re: Spark join and large temp files

2016-08-12 Thread Gourav Sengupta
Hi Ashic, That is a pretty 2011 way of solving the problem; what is more painful about this way of working is that you need to load the data into REDIS, keep a REDIS cluster running, and in case you are working across several clusters then maybe install REDIS in all of them or hammer your