Spark SQL 1.6.1 issue

2016-08-17 Thread thbeh
Running the query below I have been hitting a "local class incompatible" exception; does anyone know the cause? val rdd = csc.cassandraSql("""select *, concat('Q', d_qoy) as qoy from store_sales join date_dim on ss_sold_date_sk = d_date_sk join item on ss_item_sk =

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-17 Thread Divya Gehlot
Can you please check the column order of all the data sets in the unionAll operations? Are they in the same order? On 9 August 2016 at 02:47, max square wrote: > Hey guys, > > I'm trying to save a Dataframe in CSV format after performing unionAll > operations on it. > But I get this
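
A minimal sketch of the point being raised (my own illustration, not code from the thread; the names left and right are assumptions): DataFrame.unionAll matches columns by position rather than by name, so reordering one side's columns to match the other before the union avoids mixed-up data.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // unionAll pairs columns positionally, so align the right side's column
    // order with the left side's before unioning.
    def unionByColumnName(left: DataFrame, right: DataFrame): DataFrame = {
      val ordered = left.columns.map(col)
      left.unionAll(right.select(ordered: _*))
    }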

RE: pyspark.sql.functions.last not working as expected

2016-08-17 Thread Alexander Peletz
So here is the test case from the commit adding the first/last methods here: https://github.com/apache/spark/pull/10957/commits/defcc02a8885e884d5140b11705b764a51753162 + test("last/first with ignoreNulls") { +val nullStr: String = null +val df = Seq( + ("a", 0, nullStr),

Re: How to combine two DStreams(pyspark)?

2016-08-17 Thread ayan guha
Wondering why you are creating separate DStreams? You should apply the logic directly on the input DStream. On 18 Aug 2016 06:40, "vidhan" wrote: > I have a *kafka* stream coming in with some input topic. > This is the code I wrote for accepting the *kafka* stream. > > *>>> conf =

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-17 Thread max square
Thanks Harsh for the reply. When I change the code to something like this - def saveAsLatest(df: DataFrame, fileSystem: FileSystem, bakDir: String) = { fileSystem.rename(new Path(bakDir + latest), new Path(bakDir + "/" + ScalaUtil.currentDateTimeString)) fileSystem.create(new

How to combine two DStreams(pyspark)?

2016-08-17 Thread vidhan
I have a *kafka* stream coming in with some input topic. This is the code I wrote for accepting the *kafka* stream. *>>> conf = SparkConf().setAppName(appname) >>> sc = SparkContext(conf=conf) >>> ssc = StreamingContext(sc) >>> kvs = KafkaUtils.createDirectStream(ssc, topics,\

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-17 Thread Michał Zieliński
I'm using Spark 1.6.2 for Vector-based UDAF and this works: def inputSchema: StructType = new StructType().add("input", new VectorUDT()) Maybe it was made private in 2.0 On 17 August 2016 at 05:31, Alexey Svyatkovskiy wrote: > Hi Yanbo, > > Thanks for your reply. I will

Re: Spark ML : One hot Encoding for multiple columns

2016-08-17 Thread Nisha Muktewar
The OneHotEncoder does *not* accept multiple columns. You can use Michal's suggestion, where he uses a Pipeline to set the stages and then executes them. The other option is to write a function that performs one-hot encoding on a column and returns a dataframe with the encoded column, and then call
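
A hedged sketch of the second option (my illustration; the column names are assumptions, not from the thread): a helper that string-indexes and then one-hot encodes a single column and returns the transformed DataFrame, which can then be called once per column.

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
    import org.apache.spark.sql.DataFrame

    // Encode one column at a time: index the string column, then one-hot
    // encode the resulting index column.
    def oneHotEncodeColumn(df: DataFrame, inputCol: String): DataFrame = {
      val indexed = new StringIndexer()
        .setInputCol(inputCol)
        .setOutputCol(inputCol + "Index")
        .fit(df)
        .transform(df)
      new OneHotEncoder()
        .setInputCol(inputCol + "Index")
        .setOutputCol(inputCol + "Vec")
        .transform(indexed)
    }

    // Apply it across several columns by folding:
    // val encoded = Seq("category", "newone").foldLeft(df)(oneHotEncodeColumn)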

Re: Undefined function json_array_to_map

2016-08-17 Thread vr spark
Hi Ted/All, I did the below to get the full stack trace (see below), but I am not able to understand the root cause. except Exception as error: traceback.print_exc() and this is what I get... File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 580, in sql return

Re: Spark ML : One hot Encoding for multiple columns

2016-08-17 Thread janardhan shetty
I had already tried this way : scala> val featureCols = Array("category","newone") featureCols: Array[String] = Array(category, newone) scala> val indexer = new StringIndexer().setInputCol(featureCols).setOutputCol("categoryIndex").fit(df1) :29: error: type mismatch; found : Array[String]

Re: Spark ML : One hot Encoding for multiple columns

2016-08-17 Thread Nisha Muktewar
I don't think it does. From the documentation: https://spark.apache.org/docs/2.0.0-preview/ml-features.html#onehotencoder, I see that it still accepts one column at a time. On Wed, Aug 17, 2016 at 10:18 AM, janardhan shetty wrote: > 2.0: > > One hot encoding currently

Re: Spark ML : One hot Encoding for multiple columns

2016-08-17 Thread Michał Zieliński
You can just map over your columns and create a pipeline: val columns = Array("colA", "colB", "colC") val transformers: Array[PipelineStage] = columns.map { x => new OneHotEncoder().setInputCol(x).setOutputCol(x + "Encoded") } val pipeline = new Pipeline() .setStages(transformers) On 17

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-17 Thread HARSH TAKKAR
Hi, I can see that the exception is caused by the following; can you check where in your code you are using this path: Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://testcluster:8020/experiments/vol/spark_chomp_data/bak/restaurants-bak/latest On Wed, 17 Aug

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-17 Thread max square
/bump It'd be great if someone could point me in the right direction. On Mon, Aug 8, 2016 at 5:07 PM, max square wrote: > Here's the complete stacktrace - https://gist.github.com/rohann/ > 649b0fcc9d5062ef792eddebf5a315c1 > > For reference, here's the complete function

error when running spark from oozie launcher

2016-08-17 Thread tkg_cangkul
Hi, I am trying to submit a Spark job with Oozie, but I've got one problem here. When I submit the same job, sometimes it succeeds and sometimes it fails. I get this error message when the job fails:

Extract year from string format of date

2016-08-17 Thread Selvam Raman
Spark Version: 1.5.0 Record: 01-Jan-16 Expected Result: 2016 I used the below code, which was shared in the user group: from_unixtime(unix_timestamp($"Creation Date","dd-MMM-yy"),"yyyy") Is this the right approach or is there another approach? NOTE: I tried the *year()* function but it gives only
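
A minimal sketch of the approach in the quoted snippet (the Creation Date column name comes from the question; everything else is my assumption): parse the string with the matching pattern, then either format the year back out or apply year() to the parsed timestamp.

    import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp, year}
    import sqlContext.implicits._   // for the $"..." column syntax in 1.5

    val parsed = unix_timestamp($"Creation Date", "dd-MMM-yy")

    df.select(
      from_unixtime(parsed, "yyyy").as("year_str"),   // "2016" as a string
      year(from_unixtime(parsed)).as("year_int")      // 2016 as an integer
    )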

Spark ML : One hot Encoding for multiple columns

2016-08-17 Thread janardhan shetty
2.0: One hot encoding currently accepts a single input column; is there a way to include multiple columns?

[Community] Python support added to Spark Job Server

2016-08-17 Thread Evan Chan
Hi folks, Just a friendly message that we have added Python support to the REST Spark Job Server project. If you are a Python user looking for a RESTful way to manage your Spark jobs, please come have a look at our project! https://github.com/spark-jobserver/spark-jobserver -Evan

Re: UDF in SparkR

2016-08-17 Thread Yann-Aël Le Borgne
I experienced very slow execution times http://stackoverflow.com/questions/38803546/spark-r-2-0-dapply-very-slow and am wondering why... On Wed, Aug 17, 2016 at 1:12 PM, Felix Cheung wrote: > This is supported in Spark 2.0.0 as dapply and gapply. Please see the API >

Re: Attempting to accept an unknown offer

2016-08-17 Thread vr spark
My code is very simple; if I use other Hive tables, my code works fine. This particular table (virtual view) is huge and might have more metadata. It has only two columns. The virtual view name is: cluster_table # col_name  data_type  ln string  parti

Re: Attempting to accept an unknown offer

2016-08-17 Thread Ted Yu
Please include user@ in your reply. Can you reveal the snippet of Hive SQL? On Wed, Aug 17, 2016 at 9:04 AM, vr spark wrote: > spark 1.6.1 > mesos > job is running for like 10-15 minutes and giving this message and i killed > it. > > In this job, i am creating data frame

pyspark.sql.functions.last not working as expected

2016-08-17 Thread Alexander Peletz
Hi, I am using Spark 2.0 and I am getting unexpected results using the last() method. Has anyone else experienced this? I get the sense that last() is working correctly within a given data partition but not across the entire RDD. first() seems to work as expected, so I can work around this by
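
A hedged workaround sketch (my illustration, not the poster's; key, ts, and value are hypothetical columns): as a plain aggregate, last() reflects whatever physical row order the partitions happen to have, so defining an ordered window with an unbounded frame pins down which row counts as last.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.last

    // Whole-partition frame ordered by ts, so "last" is well-defined.
    val w = Window
      .partitionBy("key")
      .orderBy("ts")
      .rowsBetween(Long.MinValue, Long.MaxValue)

    val withLast = df.withColumn("last_value", last("value", ignoreNulls = true).over(w))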

Re: Spark DF CacheTable method. Will it save data to disk?

2016-08-17 Thread neil90
From the Spark documentation (http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence), yes, you can use persist on a dataframe instead of cache. All cache is, is a shorthand for the default persist storage level "MEMORY_ONLY". If you want to persist the dataframe to disk you
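
A short sketch of the point being made (df is a hypothetical DataFrame): persist accepts an explicit storage level, so a disk-backed level can be chosen instead of relying on the cache() default.

    import org.apache.spark.storage.StorageLevel

    df.persist(StorageLevel.MEMORY_AND_DISK)   // keep in memory, spill what doesn't fit to disk
    // df.persist(StorageLevel.DISK_ONLY)      // or keep it on disk only
    df.count()                                 // an action materializes the persisted data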

Attempting to accept an unknown offer

2016-08-17 Thread vr spark
W0816 23:17:01.984846 16360 sched.cpp:1195] Attempting to accept an unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910492 W0816 23:17:01.984987 16360 sched.cpp:1195] Attempting to accept an unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910493 W0816 23:17:01.985124 16360

Re: Attempting to accept an unknown offer

2016-08-17 Thread Ted Yu
Can you provide more information? Were you running on YARN? Which version of Spark are you using? Was your job failing? Thanks On Wed, Aug 17, 2016 at 8:46 AM, vr spark wrote: > > W0816 23:17:01.984846 16360 sched.cpp:1195] Attempting to accept an > unknown offer

Re: Undefined function json_array_to_map

2016-08-17 Thread Ted Yu
Can you show the complete stack trace? Which version of Spark are you using? Thanks On Wed, Aug 17, 2016 at 8:46 AM, vr spark wrote: > Hi, > I am getting error on below scenario. Please suggest. > > i have a virtual view in hive > > view name log_data > it has 2

Undefined function json_array_to_map

2016-08-17 Thread vr spark
Hi, I am getting an error in the below scenario. Please suggest. I have a virtual view in Hive, view name log_data; it has 2 columns: query_map map, parti_date int. Here is my snippet for the Spark data frame: my dataframe res=sqlcont.sql("select parti_date FROM

Aggregations with scala pairs

2016-08-17 Thread Andrés Ivaldi
Hello, I'd like to report wrong behavior of the Dataset API, but I don't know how to do that. My Jira account doesn't allow me to add an issue. I'm using Apache Spark 2.0.0, but the problem has existed since at least version 1.4 (per the docs, since 1.3). The problem is simple to reproduce, as is the workaround,

Re: UDF in SparkR

2016-08-17 Thread Felix Cheung
This is supported in Spark 2.0.0 as dapply and gapply. Please see the API doc: https://spark.apache.org/docs/2.0.0/api/R/ Feedback welcome and appreciated! From: Yogesh Vyas > Sent: Tuesday, August 16, 2016 11:39 PM

How to implement a InputDStream like the twitter stream in Spark?

2016-08-17 Thread Xi Shen
Hi, First, I am not sure if I should inherit from InputDStream or ReceiverInputDStream. For ReceiverInputDStream, why would I want to run a receiver on each worker node? If I want to inherit InputDStream, what should I do in the compute() method? -- Thanks, David S.

Spark standalone or Yarn for resourcing

2016-08-17 Thread Ashok Kumar
Hi, for small to medium size clusters I think Spark Standalone mode is a good choice. We are contemplating moving to Yarn as our cluster grows. What are the pros and cons of using either, please? Which one offers the best Thanking you

Spark MLlib question: load model failed with exception:org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String

2016-08-17 Thread luohui20001
Hello guys: I have a problem loading a recommendation model. I have 2 models; one is good (able to get a recommendation result) and the other is not working. I checked these 2 models; both are MatrixFactorizationModel objects. But in the metadata, one is a PipelineModel and another is a

Re: Change nullable property in Dataset schema

2016-08-17 Thread Kazuaki Ishizaki
Thank you for your comments. > You should just Seq(...).toDS I tried this; however, the result did not change. >> val ds2 = ds1.map(e => e) > Why are you e => e (since it's identity) and does nothing? Yes, e => e does nothing. For the sake of simplicity of the example, I used the simplest
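
A commonly suggested workaround for flipping the nullable flag (my sketch, not necessarily what this thread settled on; ds1 and spark are assumed to be in scope): copy the schema with nullable set to false and rebuild the DataFrame from the same rows.

    import org.apache.spark.sql.types.StructType

    val df1 = ds1.toDF()
    val nonNullSchema = StructType(df1.schema.map(_.copy(nullable = false)))
    val df2 = spark.createDataFrame(df1.rdd, nonNullSchema)
    df2.printSchema()   // fields now report nullable = false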

UDF in SparkR

2016-08-17 Thread Yogesh Vyas
Hi, Is there any way of using UDF in SparkR? Regards, Yogesh

Re: Change nullable property in Dataset schema

2016-08-17 Thread Kazuaki Ishizaki
My motivation is to simplify the Java code generated by a compiler of Tungsten. Here is a dump of generated code from the program: https://gist.github.com/kiszk/402bd8bc45a14be29acb3674ebc4df24 If we can succeed in letting Catalyst know that the result of map is never null, we can eliminate conditional

Re: [SQL] Why does spark.read.csv.cache give me a WARN about cache but not text?!

2016-08-17 Thread Jacek Laskowski
Hi Michael, Thanks a lot for your help. See below the explains for csv and text. Do you see anything worth investigating? scala> spark.read.csv("people.csv").cache.explain(extended = true) == Parsed Logical Plan == Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv == Analyzed Logical Plan == _c0: string,