Re: Using spark package XGBoost

2016-09-08 Thread janardhan shetty
Tried to use the spark package https://spark-packages.org/package/rotationsymmetry/sparkxgboost in 2.0, but it is throwing the error: error: not found: type SparkXGBoostClassifier On Tue, Sep 6, 2016 at 11:26 AM, janardhan shetty wrote: > Is this merged into Spark ML?

Re: spark-xml to avro - SchemaParseException: Can't redefine

2016-09-08 Thread Yong Zhang
Did you take a look at this -> https://github.com/databricks/spark-avro/issues/54 Yong spark-avro fails to save DF with nested records having the

Spark 2 does not recognize CURRENT_TIMESTAMP of Hive 2.0

2016-09-08 Thread Mich Talebzadeh
The current time in Hive 2 is called CURRENT_TIMESTAMP. hive> SELECT FROM_unixtime(unix_timestamp(), 'yyyy/MM/dd HH:mm:ss.ss'), current_timestamp; unix_timestamp(void) is deprecated. Use current_timestamp instead. OK 2016/09/09 00:10:11.11 2016-09-09 00:10:11.808 The old
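For comparison, a minimal sketch of getting the same value from Spark 2 itself rather than through Hive; current_timestamp and date_format are standard Spark SQL functions, and the app name is just a placeholder.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{current_timestamp, date_format}

    val spark = SparkSession.builder().appName("timestamp-check").enableHiveSupport().getOrCreate()

    // Spark SQL's built-in current_timestamp(), independent of the Hive version in use
    spark.sql("SELECT current_timestamp()").show(false)

    // The DataFrame equivalent, formatted like the Hive query above
    spark.range(1)
      .select(date_format(current_timestamp(), "yyyy/MM/dd HH:mm:ss").as("now"))
      .show(false)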

Re: year out of range

2016-09-08 Thread ayan guha
Another way of debugging would be writing another UDF that returns a string. Also, in that function, put something useful in the catch block, so you can filter those records from the df. On 9 Sep 2016 03:41, "Daniel Lopes" wrote: > Thanks Mike, > > A good way to debug! Was that
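The thread's actual code is pyspark; the same debugging idea is sketched below in Scala, under assumed column names and date formats: the UDF returns either the parsed value or an error marker instead of throwing, so bad rows can be filtered out of the DataFrame afterwards.

    import java.text.SimpleDateFormat
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().appName("date-debug").getOrCreate()
    import spark.implicits._

    val df = Seq("2016-09-08 10:00:00", "not-a-date").toDF("created_at")

    // Return either the reformatted date or an error marker carrying the raw value,
    // so the bad records can be pulled out of the df with a simple filter.
    val parseDateDebug = udf { (raw: String) =>
      try {
        val in  = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
        val out = new SimpleDateFormat("yyyy/MM/dd")
        out.format(in.parse(raw))
      } catch {
        case e: Exception => s"PARSE_ERROR: $raw (${e.getMessage})"
      }
    }

    val checked = df.withColumn("parsed", parseDateDebug(df("created_at")))
    checked.filter(checked("parsed").startsWith("PARSE_ERROR")).show(false)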

Access application-jar name within main method.

2016-09-08 Thread sagarcasual .
Hello, I am running Spark 1.6.1 and would like to access the application jar name within my main() method. Currently I am using the following code to get the version name: String sparkJarName = new java.io.File(MySparkProcessor.class.getProtectionDomain() .getCodeSource() .getLocation()
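A small self-contained Scala sketch completing the same approach; whether the code source points at the application jar depends on how the class was launched (with spark-submit and an application jar it normally does).

    object JarNameExample {
      def main(args: Array[String]): Unit = {
        // URL of whatever this class was loaded from: the application jar under
        // spark-submit, or a classes directory when run from an IDE or test.
        val codeSource = JarNameExample.getClass.getProtectionDomain.getCodeSource
        val jarName = Option(codeSource)
          .map(cs => new java.io.File(cs.getLocation.toURI).getName)
          .getOrElse("<no code source>")
        println(s"Application jar (or classes dir): $jarName")
      }
    }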

Re: spark streaming kafka connector questions

2016-09-08 Thread Cody Koeninger
- If you're seeing repeated attempts to process the same message, you should be able to look in the UI or logs and see that a task has failed. Figure out why that task failed before chasing other things. - You're not using the latest version; the latest version is for Spark 2.0. There are two

Graphhopper/routing in Spark

2016-09-08 Thread kodonnell
Just wondering if anyone has experience running Graphhopper (or similar) in Spark? In short, I can get it running on the master, but not on the worker nodes. The key trouble seems to be that Graphhopper depends on a pre-processed graph, which it obtains from OSM data. In normal (desktop) use, it
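One possible workaround, purely as an assumption and not something confirmed in the thread: ship the pre-built graph to the executors with SparkContext.addFile and resolve the local copy inside each partition before initialising the router. Paths and coordinates below are placeholders, and the actual Graphhopper initialisation is omitted.

    import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

    object GraphOnWorkers {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("graph-on-workers"))

        // Distribute the pre-built graph archive to every executor (placeholder path).
        sc.addFile("hdfs:///data/graphhopper/graph-cache.zip")

        val requests = sc.parallelize(Seq((52.50, 13.40, 52.52, 13.41)))
        val results = requests.mapPartitions { iter =>
          // Local path of the shipped archive on this executor; initialise the
          // routing engine once per partition from it (engine setup omitted here).
          val localGraph = SparkFiles.get("graph-cache.zip")
          iter.map { case (fromLat, fromLon, toLat, toLon) =>
            (fromLat, fromLon, toLat, toLon, localGraph)
          }
        }
        results.collect().foreach(println)
        sc.stop()
      }
    }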

spark streaming kafka connector questions

2016-09-08 Thread Cheng Yi
I am using the latest streaming kafka connector (org.apache.spark : spark-streaming-kafka_2.11 : 1.6.2). I am facing the problem that a message is delivered two times to my consumers. These two deliveries are 10+ seconds apart; it looks like this is caused by my lengthy message processing (took about 60
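For reference, a minimal sketch of the direct-stream API that ships with spark-streaming-kafka_2.11 1.6.x; the broker address, topic, and batch interval are placeholders. With lengthy per-batch processing, retried or restarted tasks re-read the same offsets, which is one common source of duplicate deliveries.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaDirectExample {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("kafka-direct"), Seconds(10))

        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("my-topic"))

        // Keep per-batch work shorter than the batch interval, or scale it out,
        // so that batches do not queue up and tasks do not time out and get retried.
        stream.map(_._2).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }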

spark-xml to avro - SchemaParseException: Can't redefine

2016-09-08 Thread Arun Patel
I'm trying to convert XML to Avro, but I am getting a SchemaParser exception for 'Rules', which exists in two separate containers. Any thoughts? XML is attached. df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGLResponse', attributePrefix='').load('GGL.xml')

"Job duration" and "Processing time" don't match

2016-09-08 Thread Srikanth
Hello, I was looking at the Spark streaming UI and noticed a big difference between "Processing time" and "Job duration". [image: Inline image 1] Processing time/Output Op duration is shown as 50s, but the sum of all job durations is ~25s. What is causing this difference? Based on logs I know that the

Re: Spark Metrics: custom source/sink configurations not getting recognized

2016-09-08 Thread map reduced
Can this be listed as an issue on JIRA? On Wed, Sep 7, 2016 at 10:19 AM, map reduced wrote: > Thanks for the reply, I wish it did. We have an internal metrics system > where we need to submit to. I am sure that the ways I've tried work with > yarn deployment, but not with

Re: Returning DataFrame as Scala method return type

2016-09-08 Thread Jakob Odersky
(Maybe unrelated FYI): in case you're using only Scala or Java with Spark, I would recommend using Datasets instead of DataFrames. They provide exactly the same functionality, yet offer more type safety. On Thu, Sep 8, 2016 at 11:05 AM, Lee Becker wrote: > > On Thu, Sep
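A minimal sketch of the point being made, on Spark 2.0: a Dataset carries an element type, so field references are checked at compile time. The case class and data here are purely illustrative.

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Long)

    object DatasetExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("dataset-example").getOrCreate()
        import spark.implicits._

        // A DataFrame is just Dataset[Row]; a Dataset[Person] adds compile-time checking.
        val ds = Seq(Person("Ann", 32), Person("Bob", 19)).toDS()

        // Typed operations: a typo in `age` fails at compile time, not at run time.
        ds.filter(_.age > 21).map(_.name).show()
      }
    }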

Re: Returning DataFrame as Scala method return type

2016-09-08 Thread Lee Becker
On Thu, Sep 8, 2016 at 11:35 AM, Ashish Tadose wrote: > I wish to organize these dataframe operations by grouping them into Scala > Object methods. > Something like below > >> Object Driver { >> def main(args: Array[String]) { >> val df =

Re: year out of range

2016-09-08 Thread Daniel Lopes
Thanks Mike, A good way to debug! Was that already! Best, Daniel Lopes Chief Data and Analytics Officer | OneMatch c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes www.onematch.com.br On Thu, Sep 8, 2016

Returning DataFrame as Scala method return type

2016-09-08 Thread Ashish Tadose
Hi Team, I have a Spark job with a large number of dataframe operations. This job reads various lookup data from external tables such as MySQL and also runs a lot of dataframe operations on large data on HDFS in Parquet. The job works fine in the cluster; however, the driver code looks clumsy because of the large number
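One way to tidy this up, sketched under assumed table and column names: keep the driver thin and move related DataFrame transformations into an object whose methods take and return DataFrames.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Group related DataFrame transformations into plain methods; names are illustrative.
    object Transformations {
      def withCleanedColumns(df: DataFrame): DataFrame =
        df.na.fill("unknown")

      def joinWithLookup(df: DataFrame, lookup: DataFrame): DataFrame =
        df.join(lookup, Seq("id"), "left")
    }

    object Driver {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("organized-job").getOrCreate()

        // Placeholder paths standing in for the real lookup and event data.
        val data   = spark.read.parquet("hdfs:///data/events")
        val lookup = spark.read.parquet("hdfs:///data/lookup")

        val result = Transformations.joinWithLookup(
          Transformations.withCleanedColumns(data), lookup)

        result.write.mode("overwrite").parquet("hdfs:///data/output")
      }
    }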

Posting selected rows of Spark streaming data to Hive table

2016-09-08 Thread Mich Talebzadeh
Hi, Within Spark streaming I have identified the data that I want to persist to a Hive table. The table is already created. These are the values for the columns extracted: for(line <- pricesRDD.collect.toArray) { var index = line._2.split(',').view(0).toInt var
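A self-contained sketch of one common pattern for this; the source, columns, and table name are placeholders, not taken from the thread. Each micro-batch is converted to a DataFrame inside foreachRDD and appended to the pre-created Hive table.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Price(ticker: String, price: Double)

    object StreamToHive {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("StreamToHive"), Seconds(2))

        // Placeholder source; in the original thread the data comes from a price feed.
        val lines = ssc.socketTextStream("localhost", 9999)

        lines.foreachRDD { rdd =>
          if (!rdd.isEmpty()) {
            // Reuses the existing SparkContext; in a real job create the HiveContext
            // once (e.g. in a lazy singleton) rather than per batch.
            val hiveContext = new HiveContext(rdd.sparkContext)
            import hiveContext.implicits._

            rdd.map { line =>
                val cols = line.split(',')
                Price(cols(0), cols(1).toDouble)
              }
              .toDF()
              .write.mode("append")
              .insertInto("prices_hive_table")   // table assumed to already exist
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }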

Re: year out of range

2016-09-08 Thread Mike Metzger
My guess is there's some row that does not match up with the expected data. While slower, I've found RDDs to be easier to troubleshoot this kind of thing until you sort out exactly what's happening. Something like: raw_data = sc.textFile("") rowcounts = raw_data.map(lambda x:

Re: Creating HiveContext within Spark streaming

2016-09-08 Thread Mich Talebzadeh
Hi Todd, Thanks for the hint. As it happened this works //Create the sparkconf for streaming as usual val sparkConf = new SparkConf(). setAppName(sparkAppName). set("spark.driver.allowMultipleContexts", "true").

Re: Forecasting algorithms in spark ML

2016-09-08 Thread Dirceu Semighini Filho
Hi Madabhattula Rajesh Kumar, There is an open source project called sparkts (Time Series for Spark) that implements the ARIMA and Holt-Winters algorithms on top of Spark, which can be used for forecasting. In some cases, Linear Regression, which is available

Re: Creating HiveContext within Spark streaming

2016-09-08 Thread Todd Nist
Hi Mich, Perhaps the issue is having multiple SparkContexts in the same JVM ( https://issues.apache.org/jira/browse/SPARK-2243). While it is possible, I don't think it is encouraged. As you know, the call you're currently invoking to create the StreamingContext also creates a SparkContext. /** *
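A minimal sketch of the alternative being suggested: build one SparkContext and hand it to both the StreamingContext and the HiveContext, instead of letting the StreamingContext constructor create a second context from a SparkConf. The app name and batch interval are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("streaming-with-hive")
    val sc   = new SparkContext(conf)

    // StreamingContext(sc, ...) reuses the existing SparkContext rather than creating one.
    val ssc = new StreamingContext(sc, Seconds(2))

    // The HiveContext shares the same underlying SparkContext.
    val hiveContext = new HiveContext(sc)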

Re: year out of range

2016-09-08 Thread Daniel Lopes
Thanks, I tested the function offline and it works. I also tested with a select * from after converting the data and the new data looks good, but if I register it as a temp table to join another table it still shows the same error: ValueError: year out of range. Best, Daniel Lopes Chief Data and Analytics

Re: Creating HiveContext within Spark streaming

2016-09-08 Thread Mich Talebzadeh
Ok I managed to sort that one out. This is what I am facing val sparkConf = new SparkConf(). setAppName(sparkAppName). set("spark.driver.allowMultipleContexts", "true"). set("spark.hadoop.validateOutputSpecs", "false") // change the values

Re: year out of range

2016-09-08 Thread Marco Mistroni
Daniel, test the date parsing offline to make sure it returns what you expect. If it does, create a df with only 1 row in the spark shell and run your UDF; you should be able to see the issue. If not, send me a reduced CSV file at my email and I'll give it a try this evening. Hopefully someone else will be able to

Re: year out of range

2016-09-08 Thread Daniel Lopes
Thanks Marco for your response. The field came encoded by SQL Server in locale pt_BR. The code that I am formatting with is: -- def parse_date(argument, format_date='%Y-%m%d %H:%M:%S'): try: locale.setlocale(locale.LC_TIME, 'pt_BR.utf8') return

Will be there any ml.linalg.distributed?

2016-09-08 Thread Boris Schminke
Hi, how can I create a distributed sparse matrix using the DataFrame API? AFAIK there is a package mllib.linalg.distributed which has no counterpart in ml.linalg. Am I right? What are the best practices/workarounds/advice for this case? Are distributed matrices on DataFrames on the roadmap? Regards,
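One common workaround, sketched below with made-up entries: keep the sparse entries in a DataFrame of (row, column, value) and drop down to the RDD-based mllib.linalg.distributed types when a distributed matrix is actually needed.

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
    import org.apache.spark.sql.SparkSession

    object DistributedMatrixExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("distributed-matrix").getOrCreate()
        import spark.implicits._

        // Sparse entries kept as a DataFrame of (row index, column index, value).
        val entriesDF = Seq((0L, 1L, 2.0), (3L, 4L, 5.0)).toDF("i", "j", "value")

        // Convert to the RDD-based distributed matrix when matrix operations are needed.
        val entries = entriesDF.rdd.map(r => MatrixEntry(r.getLong(0), r.getLong(1), r.getDouble(2)))
        val matrix  = new CoordinateMatrix(entries)

        println(s"Matrix dimensions: ${matrix.numRows()} x ${matrix.numCols()}")
      }
    }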

Re: LabeledPoint creation

2016-09-08 Thread 市场部
Hi, Below is what I typed in my Scala spark-shell command line based on your first email; the result is different from yours. Just for your reference. My Spark version is 1.6.1. import org.apache.spark.ml.feature._ import org.apache.spark.ml.classification.LogisticRegression import

Creating HiveContext within Spark streaming

2016-09-08 Thread Mich Talebzadeh
Hi, This may not be feasible in Spark streaming. I am trying to create a HiveContext in Spark streaming within the streaming context // Create a local StreamingContext with two working thread and batch interval of 2 seconds. val sparkConf = new SparkConf().

Re: Error while calling udf Spark submit

2016-09-08 Thread Marco Mistroni
Not enough info, but you can try the same code in the spark shell and get hold of the exception. HTH. On 8 Sep 2016 11:16 am, "Divya Gehlot" wrote: > Hi, > I am on Spark 1.6.1 > I am getting the below error when I am trying to call a UDF on my spark > Dataframe column > UDF > /* get the

Spark yarn use IP instead of hostname

2016-09-08 Thread 李剑
Hi, I submit a Spark task in yarn-cluster mode, and the node info shows the hostname: [image: Inline images 4] How can I make it show IPs instead? [image: Inline images 2] --

Error while calling udf Spark submit

2016-09-08 Thread Divya Gehlot
Hi, I am on Spark 1.6.1. I am getting the below error when I am trying to call a UDF on my Spark Dataframe column. UDF: /* get the train line */ val deriveLineFunc: (String => String) = (str: String) => { val build_key = str.split(",").toList val getValue = if(build_key.length > 1)

Re: Calling udf in Spark

2016-09-08 Thread Deepak Sharma
No, it's not required for a UDF. It's required when you convert from an RDD to a DF. Thanks Deepak On 8 Sep 2016 2:25 pm, "Divya Gehlot" wrote: > Hi, > > Is it necessary to import sqlContext.implicits._ whenever you define and > call a UDF in Spark? > > > Thanks, > Divya > > >
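A small sketch illustrating the distinction on Spark 1.6 (names and data are placeholders): the implicits import is what enables toDF on a local Seq or RDD, while defining and applying a UDF compiles without it.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.udf

    val sc = new SparkContext(new SparkConf().setAppName("implicits-example"))
    val sqlContext = new SQLContext(sc)

    // Needed for the Seq/RDD -> DataFrame conversion (toDF), not for defining a UDF.
    import sqlContext.implicits._
    val df = sc.parallelize(Seq(("a", 1), ("b", 2))).toDF("key", "value")

    // Defining and using a UDF works without the implicits import.
    val doubled = udf((x: Int) => x * 2)
    df.withColumn("double", doubled(df("value"))).show()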

Calling udf in Spark

2016-09-08 Thread Divya Gehlot
Hi, Is it necessary to import sqlContext.implicits._ whenever you define and call a UDF in Spark? Thanks, Divya

Re: MLib : Non Linear Optimization

2016-09-08 Thread Robin East
Do you have any particular algorithms in mind? If you state the most common algorithms you use then it might stimulate the appropriate comments. > On 8 Sep 2016, at 05:04, nsareen wrote: > > Any answer to this question group ? > > > > -- > View this message in context:

Re: year out of range

2016-09-08 Thread Marco Mistroni
Please paste the code and a sample CSV; I'm guessing it has to do with time formatting? Kr On 8 Sep 2016 12:38 am, "Daniel Lopes" wrote: > Hi, > > I'm importing a few CSVs with the spark-csv package. > Whenever I do a select on each one it looks ok, > but when I join them with

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-08 Thread Yan Facai
Many thanks, Peter. On Wed, Sep 7, 2016 at 10:14 PM, Peter Figliozzi wrote: > Here's a decent GitHub book: Mastering Apache Spark. > > I'm new at Scala too. I found it very helpful to

RE: pyspark 1.5.0 broadcast join

2016-09-08 Thread ming.he
Got one from stackoverflow: http://stackoverflow.com/questions/34053302/pyspark-and-broadcast-join-example From: pseudo oduesp [mailto:pseudo20...@gmail.com] Sent: Thursday, September 08, 2016 4:00 PM To: user@spark.apache.org Subject: pyspark 1.5.0 broadcast join Hi, can someone show me an

pyspark 1.5.0 broadcast join

2016-09-08 Thread pseudo oduesp
Hi, can someone show me an example of a broadcast join with DataFrames in this version, pyspark 1.5.0? Thanks

Re: Forecasting algorithms in spark ML

2016-09-08 Thread Robin East
Spark's algorithms are summarised on this page (https://spark.apache.org/mllib/) and details are available from the MLlib user guide, which is linked from the above URL. Sent from my iPhone > On 8 Sep 2016, at 05:30, Madabhattula Rajesh Kumar > wrote: > > Hi, > > Please

Re: How to convert an ArrayType to DenseVector within DataFrame?

2016-09-08 Thread Nick Pentreath
You can use a udf like this: [pyspark shell banner, Spark version 2.0.0] Using Python version 2.7.12 (default, Jul 2 2016 17:43:17) SparkSession available as 'spark'. In [1]: from
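The reply shows the pyspark version; the Scala 2.0 counterpart of the same idea is a one-line udf that wraps the array column into an ml Vector. The column names below are assumptions.

    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().appName("array-to-vector").getOrCreate()
    import spark.implicits._

    val df = Seq((1, Seq(0.1, 0.2, 0.3)), (2, Seq(0.4, 0.5, 0.6))).toDF("id", "values")

    // Wrap the double-array column into a DenseVector so ML transformers can consume it.
    val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
    df.withColumn("features", toVector(df("values"))).show(false)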

Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-08 Thread Sonal Goyal
Are you looking at the worker logs or the driver? On Thursday, September 8, 2016, Nisha Menon wrote: > I have an RDD created as follows: > > JavaPairRDD<String, String> inputDataFiles = > sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/"); > >

How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-08 Thread Nisha Menon
I have an RDD created as follows: JavaPairRDD<String, String> inputDataFiles = sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/"); On this RDD I perform a map to process individual files and invoke a foreach to trigger the same map. JavaRDD output =

Re: LabeledPoint creation

2016-09-08 Thread Madabhattula Rajesh Kumar
Hi, I have done this in a different way. Please correct me: is this approach right? val df = spark.createDataFrame(Seq( (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"), (6, "d"))).toDF("id", "category") val categories: List[String] =
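For comparison, a sketch of how that DataFrame is usually turned into a numeric feature column with StringIndexer plus OneHotEncoder on the Spark 2.0 ml API; whether this fits depends on what the label/feature split should be, and the column names are just the ones from the snippet above.

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
    import org.apache.spark.sql.SparkSession

    object LabeledFeaturesExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("labeled-features").getOrCreate()

        val df = spark.createDataFrame(Seq(
          (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"), (6, "d")
        )).toDF("id", "category")

        // Index the string column, then one-hot encode the index into a feature vector.
        val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
        val indexed = indexer.fit(df).transform(df)

        val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
        encoder.transform(indexed).show(false)
      }
    }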

Re: No SparkR on Mesos?

2016-09-08 Thread ray
Hi Rodrick, Interesting. SparkR is expected not to work with Mesos due to a lack of support for Mesos in some places, and it has not been tested yet. Have you modified the Spark source code yourself? Have you deployed the Spark binary distribution on all slave nodes, and set