Re: [Spark 2.0] Why MutableInt cannot be cast to MutableLong?

2016-07-31 Thread Chanh Le
If I have a column stored in a parquet file as INT and I create a table with the same column but change the type from int to bigint, Spark 2.0 shows the error: Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 259.0 failed 4 times, most recent

Re: Writing custom Transformers and Estimators like Tokenizer in spark ML

2016-07-31 Thread janardhan shetty
Developing in Scala, but any help with the difference between UnaryTransformer (is this still experimental?) and the Transformer class is appreciated. Right now I am encountering an error for code which extends UnaryTransformer: override protected def outputDataType: DataType = new StringType Error:(26, 53)
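For reference, a minimal sketch of a UnaryTransformer subclass against the Spark 2.0 ML API (the class name and the upper-casing behaviour are made up for illustration). One likely culprit in the snippet above is "new StringType": StringType is a case object, so it should be referenced directly.

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{DataType, StringType}

    // Hypothetical transformer that upper-cases a string column.
    class UpperCaser(override val uid: String)
        extends UnaryTransformer[String, String, UpperCaser] {

      def this() = this(Identifiable.randomUID("upperCaser"))

      // The per-row function applied to the input column.
      override protected def createTransformFunc: String => String = _.toUpperCase

      // StringType is a case object; "new StringType" will not compile.
      override protected def outputDataType: DataType = StringType
    }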

is Hadoop need to be installed?

2016-07-31 Thread ayan guha
Hi, I am trying to run Spark 2.0 prebuilt with Hadoop 2.7 on Windows. I do not have Hadoop installed as I wanted to test Spark alone. When I run pyspark it does start up, but reading any file using the DataFrame APIs fails. I recall this was doable in earlier versions of Spark, but is it something not

Re: [Spark 2.0] Why MutableInt cannot be cast to MutableLong?

2016-07-31 Thread Chanh Le
Sorry, my bad, I ran it in Spark 1.6.1, but what about this error? Why can't Int be cast to Long? Thanks. > On Aug 1, 2016, at 2:44 AM, Michael Armbrust wrote: > > Are you sure you are running Spark 2.0? > > In your stack trace I see SqlNewHadoopRDD, which was removed in

Re: spark java - convert string to date

2016-07-31 Thread Hyukjin Kwon
I haven't used this myself but I guess those functions should work. unix_timestamp() See https://github.com/apache/spark/blob/480c870644595a71102be6597146d80b1c0816e4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2513-L2530 2016-07-31 22:57 GMT+09:00 Tony Lane
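A small sketch of how those functions can be used from Scala (df, the column names and the date pattern are assumptions, not from the thread): unix_timestamp parses a string with a given pattern, and to_date turns the resulting timestamp into a date.

    import org.apache.spark.sql.functions._

    // Parse strings like "31-07-2016 22:57:00" into a timestamp, then a date.
    val withDate = df
      .withColumn("ts", unix_timestamp(col("event_time"), "dd-MM-yyyy HH:mm:ss").cast("timestamp"))
      .withColumn("event_date", to_date(col("ts")))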

Re: Tuning level of Parallelism: Increase or decrease?

2016-07-31 Thread Jestin Ma
It seems that the number of tasks being this large does not matter. Each task defaults to the HDFS block size of 128 MB, which I've heard to be OK. I've tried tuning the block (task) size to be larger and smaller, to no avail. I tried coalescing to 50 but that introduced large data skew and

Windows - Spark 2 - Standalone - Worker not able to connect to Master

2016-07-31 Thread ayan guha
Hi, I just downloaded Spark 2.0 on my Windows 7 machine to check it out. However, I am not able to set up a standalone cluster: Step 1: master set up (successful) bin/spark-class org.apache.spark.deploy.master.Master It did throw an error about not being able to find winutils, but started successfully. Step II:

Re: Spark recovery takes long

2016-07-31 Thread NB
Well, we haven't really enabled recovery after running into this issue in Spark 1.2. I do intend to try this again soon with Spark 1.6.1 and see if it works out this time. NB -- View this message in context:

Re: Tuning level of Parallelism: Increase or decrease?

2016-07-31 Thread Andrew Ehrlich
15000 seems like a lot of tasks for that size. Test it out with a .coalesce(50) placed right after loading the data. It will probably either run faster or crash with out of memory errors. > On Jul 29, 2016, at 9:02 AM, Jestin Ma wrote: > > I am processing ~2 TB of
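In code, the suggestion amounts to something like the following sketch (the input path and format are placeholders):

    // Load, then immediately reduce the number of partitions to 50.
    val df = spark.read.parquet("hdfs:///path/to/input").coalesce(50)

    // ... rest of the processing runs on the coalesced data ...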

Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-31 Thread Andrew Ehrlich
You could write each image to a different directory instead of a different file. That can be done by filtering the RDD into one RDD for each image and then saving each. That might not be what you’re after though, in terms of space and speed efficiency. Another way would be to save them multiple
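A rough sketch of the "one directory per key" approach described above, assuming the RDD holds (imageId, bytes) pairs; the names are illustrative, collecting the distinct keys to the driver assumes there are not too many of them, and as noted above this launches one job per image.

    // images: RDD[(String, Array[Byte])] keyed by image id (assumed shape).
    val imageIds = images.keys.distinct().collect()

    imageIds.foreach { id =>
      images.filter { case (k, _) => k == id }   // one small RDD per image
            .values
            .saveAsObjectFile(s"hdfs:///output/images/$id")
    }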

Re: error while running filter on dataframe

2016-07-31 Thread ayan guha
It would help to share your Spark version, environment details and a code snippet. There are many very knowledgeable guys here who will then be able to help. On 1 Aug 2016 02:15, "Tony Lane" wrote: > Can someone help me understand this error which occurs while running a > filter on a

Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
Hi, interesting problem :) And this is where my knowledge is limited. But what I understand is that this is a clustering problem of names. You may want to find a bunch of names that belong to the same group by computing, say, a distance between them. Spark supports a few clustering algorithms under MLlib. Love to know

Re: Fail a batch in Spark Streaming forcefully based on business rules

2016-07-31 Thread Lars Albertsson
I don't know your context, so I don't have a solution for you. If you provide more information, the list might be able to suggest a solution. IIUYC, however, it sounds like you could benefit from decoupling operational failure from business level failure. E.g. if there is a failure according to

Re: Clean up app folders in worker nodes

2016-07-31 Thread pbirsinger
This works. I needed to restart the master and slaves for the changes to take effect. Plus 1 million to you sir. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Clean-up-app-folders-in-worker-nodes-tp20889p27440.html Sent from the Apache Spark User List
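For later readers, standalone worker cleanup is typically controlled by the following properties (presumably what the earlier reply pointed at), set in spark-defaults.conf or via SPARK_WORKER_OPTS; the interval and TTL values below are only examples, and as noted above the workers need a restart for the change to take effect.

    spark.worker.cleanup.enabled    true
    # how often to sweep the work directories (seconds)
    spark.worker.cleanup.interval   1800
    # how long to keep per-application work dirs (seconds)
    spark.worker.cleanup.appDataTtl 604800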

Re: build error - failing test- Error while building spark 2.0 trunk from github

2016-07-31 Thread Jacek Laskowski
Hi, Can you share what's the command to run the build? What's the OS? Java? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sun, Jul 31, 2016 at 6:54 PM,

Re: error while running filter on dataframe

2016-07-31 Thread Jacek Laskowski
Hi, My eyes are not trained to read this low-level output...yet...so could you narrow it down to a reproducible piece of code and post it here and/or file a JIRA issue? Thanks. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0

Re: spark 2.0 readStream from a REST API

2016-07-31 Thread Jacek Laskowski
Hi, See https://github.com/jaceklaskowski/spark-workshop/tree/master/solutions/spark-mf-format. There's a custom format that you can use to get started. Basically, you need to develop the code behind "mysource" format and register it using --packages or --jars or similar when you spark-submit

Re: sql to spark scala rdd

2016-07-31 Thread Jacek Laskowski
Hi, Impossible - see http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@sliding(size:Int,step:Int):Iterator[Repr]. I tried to show you why you ended up with "non-empty iterator" after println. You should really start with http://www.scala-lang.org/documentation/ Pozdrawiam,
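A tiny illustration of the point being made here: sliding on a Scala collection returns an Iterator, and printing an Iterator only shows whether it is empty, not its contents.

    val xs = Seq(1, 2, 3, 4)
    val windows = xs.sliding(2, 1)   // Iterator[Seq[Int]]

    println(windows)                 // prints something like "non-empty iterator"
    println(windows.toList)          // List(List(1, 2), List(2, 3), List(3, 4))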

Re: Java Recipes for Spark

2016-07-31 Thread Jean Georges Perrin
Thanks guys - I really appreciate it :)... If you have any idea of something missing, I'll gladly add it. (and yeah, come on! Is that some kind of primitive racism or what: Java rocks! What are those languages where you can turn a list into a string and back into an object. #StrongTypingRules) > On

Re: error while running filter on dataframe

2016-07-31 Thread Michael Armbrust
You are hitting a bug in code generation. If you can come up with a small reproduction for the problem, it would be very helpful if you could open a JIRA. On Sun, Jul 31, 2016 at 9:14 AM, Tony Lane wrote: > Can someone help me understand this error which occurs while

Re: calling dataset.show on a custom object - displays toString() value as first column and blank for rest

2016-07-31 Thread Michael Armbrust
Can you share your code? This does not happen for me. On Sun, Jul 31, 2016 at 7:16 AM, Rohit Chaddha

Re: spark 2.0 readStream from a REST API

2016-07-31 Thread Michael Armbrust
You have to add a file under resources too (example). Either that, or give the full class name. On Sun, Jul 31, 2016 at 9:45 AM, Ayoub Benali

Re: [Spark 2.0] Why MutableInt cannot be cast to MutableLong?

2016-07-31 Thread Michael Armbrust
Are you sure you are running Spark 2.0? In your stack trace I see SqlNewHadoopRDD, which was removed in #12354 . On Sun, Jul 31, 2016 at 2:12 AM, Chanh Le wrote: > Hi everyone, > Why *MutableInt* cannot be cast to *MutableLong?*

Re: sql to spark scala rdd

2016-07-31 Thread sri hari kali charan Tummala
val test=sc.textFile(file).keyBy(x => x.split("\\~") (0)) .map(x => x._2.split("\\~")) .map(x => ((x(0),x(1),x(2 .map{case (account,datevalue,amount) => ((account,datevalue),(amount.toDouble))}.mapValues(x => x).toArray.sliding(2,1).map(x => (x(0)._1,x(1)._2,(x.foldLeft(0.0)(_ +

Re: sql to spark scala rdd

2016-07-31 Thread sri hari kali charan Tummala
Hi All, I already solved it using DataFrames and Spark SQL; I was wondering how to solve it with a Scala RDD. I just got the answer and need to check my results against Spark SQL. Thanks all for your time. I am trying to compute a moving average using a Scala RDD grouped by key. input:- -987~20150728~100
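One possible RDD-only sketch for a per-account moving average over a window of two rows, assuming each line looks like account~date~amount as in the input above; grouping per key assumes each account's history fits in memory on an executor.

    val data = sc.textFile(file)
      .map(_.split("~"))
      .map(a => (a(0), (a(1), a(2).toDouble)))            // (account, (date, amount))

    val movingAvg = data.groupByKey().flatMapValues { rows =>
      rows.toSeq.sortBy(_._1)                             // order each account by date
          .sliding(2, 1)                                  // windows of 2 consecutive rows
          .map(w => (w.last._1, w.map(_._2).sum / w.size))  // (date, average over the window)
    }

    movingAvg.collect().foreach(println)                  // (account, (date, movingAvg))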

Re: multiple spark streaming contexts

2016-07-31 Thread Sumit Khanna
Any ideas on this one, guys? I can do a sample run but can't be sure of imminent problems, if any. How can I ensure a different batchDuration etc. per StreamingContext? Thanks, On Sun, Jul 31, 2016 at 10:50 AM, Sumit Khanna wrote: > Hey, > > Was wondering if I

Re: sql to spark scala rdd

2016-07-31 Thread Mich Talebzadeh
Check also this. HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: sql to spark scala rdd

2016-07-31 Thread sri hari kali charan Tummala
Tuple [Lscala.Tuple2;@65e4cb84 On Sun, Jul 31, 2016 at 1:00 AM, Jacek Laskowski wrote: > Hi, > > What's the result type of sliding(2,1)? > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark 2.0

Re: spark-submit hangs forever after all tasks finish(spark 2.0.0 stable version on yarn)

2016-07-31 Thread Zhuo Tao
Yarn client On Sunday, July 31, 2016, Pradeep wrote: > Hi, > > Are you running on yarn-client or cluster mode? > > Pradeep > > > On Jul 30, 2016, at 7:34 PM, taozhuo > > wrote: > > > > below is the error messages that seem run infinitely:

build error - failing test- Error while building spark 2.0 trunk from github

2016-07-31 Thread Rohit Chaddha
--- T E S T S --- Running org.apache.spark.api.java.OptionalSuite Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.052 sec - in org.apache.spark.api.java.OptionalSuite Running

Re: spark 2.0 readStream from a REST API

2016-07-31 Thread Ayoub Benali
Looks like the way to go in spark 2.0 is to implement StreamSourceProvider with DataSourceRegister
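For anyone following along, a bare-bones skeleton of that approach. The signatures below are from the Spark 2.0 internal streaming API (not yet stable), the method bodies are deliberately left unimplemented, and "rest" is just an arbitrary short name for illustration.

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.execution.streaming.{Offset, Source}
    import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
    import org.apache.spark.sql.types.StructType

    class RestSourceProvider extends StreamSourceProvider with DataSourceRegister {

      // Usable as .format("rest") once the class is on the classpath, or registered via
      // META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
      override def shortName(): String = "rest"

      override def sourceSchema(
          sqlContext: SQLContext,
          schema: Option[StructType],
          providerName: String,
          parameters: Map[String, String]): (String, StructType) =
        (shortName(), schema.getOrElse(???))   // schema of the REST payload

      override def createSource(
          sqlContext: SQLContext,
          metadataPath: String,
          schema: Option[StructType],
          providerName: String,
          parameters: Map[String, String]): Source = new Source {
        override def schema: StructType = ???
        override def getOffset: Option[Offset] = ???   // latest offset available from the endpoint
        override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???  // poll the endpoint
        override def stop(): Unit = ()
      }
    }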

error while running filter on dataframe

2016-07-31 Thread Tony Lane
Can someone help me understand this error which occurs while running a filter on a dataframe 2016-07-31 21:01:57 ERROR CodeGenerator:91 - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 117, Column 58: Expression "mapelements_isNull" is not an rvalue

Spark Sql - Losing connection with Hive Metastore

2016-07-31 Thread KhajaAsmath Mohammed
Hi, I am working on an application to move data from multiple Hive tables into a single external Hive table using joins. Spark SQL is able to insert data into the table, but it loses the connection to the metastore after inserting the data. I still have 3 more queries to be executed and insert data into

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
Many thanks. This is a challenge of some sort. I did this for my own work. I downloaded my bank account statements for the past few years in CSV format and loaded them into a Hive ORC table with the Databricks stuff. All tables are transactional and bucketed. A typical row looks like this (they vary from bank to

calling dataset.show on a custom object - displays toString() value as first column and blank for rest

2016-07-31 Thread Rohit Chaddha
I have a custom object called A and a corresponding Dataset. When I call the datasetA.show() method I get the following: +++-+-+---+ |id|da|like|values|uid| +++-+-+---+ |A.toString()...|

spark java - convert string to date

2016-07-31 Thread Tony Lane
Is there any built-in function in Java with Spark to convert a string to a date more efficiently, or do we just use the standard Java techniques? -Tony

Re: spark-submit hangs forever after all tasks finish(spark 2.0.0 stable version on yarn)

2016-07-31 Thread Pradeep
Hi, Are you running on yarn-client or cluster mode? Pradeep > On Jul 30, 2016, at 7:34 PM, taozhuo wrote: > > below is the error messages that seem run infinitely: > > > 16/07/30 23:25:38 DEBUG ProtobufRpcEngine: Call: getApplicationReport took > 1ms > 16/07/30 23:25:39

Spark R 2.0 dapply very slow

2016-07-31 Thread Yann-Aël Le Borgne
Hello all, I just started testing SparkR 2.0 and find the execution of dapply very slow. For example, using R, the following code set.seed(2) random_DF<-data.frame(matrix(rnorm(100),10,10)) system.time(dummy_res<-random_DF[random_DF[,1]>1,]) user system elapsed 0.005 0.000

Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
Hi, here is a quick setup (based on the airlines.txt dataset): -- from datetime import datetime, timedelta from pyspark.sql.types import *

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
It is true that whatever an analytic function does can be done with standard SQL, with joins and sub-queries. But the same routine done with an analytic function is always faster, or at least as fast, when compared to standard SQL. I will try to see if I can do analytic functions with Spark FP on Data

spark 2.0 readStream from a REST API

2016-07-31 Thread Ayoub Benali
Hello, I started playing with the Structured Streaming API in Spark 2.0 and I am looking for a way to create a streaming Dataset/DataFrame from a REST HTTP endpoint, but I am a bit stuck. "readStream" in SparkSession has a json method but this one is expecting a path (s3, hdfs, etc) and I want to

Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
The point is, window functions are designed to be faster by doing the calculation in one pass, instead of the two passes needed with max. DataFrames support window functions (using sql.Window), so instead of writing SQL you can use them as well. Best, Ayan On Sun, Jul 31, 2016 at 7:48 PM, Mich
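A sketch of the same rank-and-filter done through the DataFrame window API rather than SQL, reusing the DataFrame and column names from this thread (so no separate max() pass is needed):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val w = Window.orderBy(col("transactiondate").desc)

    val latest = ll_18740868
      .filter(col("transactiondescription").contains("HARRODS"))
      .withColumn("r", rank().over(w))     // rank rows by most recent transaction date
      .filter(col("r") === 1)
      .drop("r")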

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
Yes, reserved word issue, thanks. hive> select * > from (select transactiondate, transactiondescription, debitamount > , rank() over (order by transactiondate desc) r > from accounts.ll_18740868 where transactiondescription like '%HARRODS%' > ) RS > where r=1 > ; Query ID =

Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
I think the word "INNER" is reserved in Hive. Please change the alias to something else. Not sure about scala, but essentially it is string replacement. On Sun, Jul 31, 2016 at 7:27 PM, Mich Talebzadeh wrote: > thanks how about scala? > > BTW the same analytic code

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
Thanks, how about Scala? BTW the same analytic code fails in Hive itself :( hive> select * > from (select transactiondate, transactiondescription, debitamount > from (select transactiondate, transactiondescription, debitamount > , rank() over (order by transactiondate desc) r >

Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
Hi, this is because Spark does not provide a way to "bind" variables like Oracle does. So you can build the SQL string, like below (in Python): val = 'XYZ' sqlbase = "select ... where col = '<val>'".replace('<val>', val) On Sun, Jul 31, 2016 at 6:25 PM, Mich Talebzadeh
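In Scala the same string-building can be done with interpolation; a sketch reusing the query shape from this thread (the table name and the constant come from the surrounding messages and are only illustrative):

    val HASHTAG = "XYZ"

    // Splice the constant into the SQL text before running it.
    val sqltext = s"""
      select *
      from (select transactiondate, transactiondescription, debitamount,
                   rank() over (order by transactiondate desc) r
            from ll_18740868
            where transactiondescription like '%$HASHTAG%') rs
      where r = 1
    """
    val result = spark.sql(sqltext)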

Re: Structured Streaming Parquet Sink

2016-07-31 Thread Tathagata Das
Yes, files do not support complete mode output yet. We are working on that, and it should be available in Spark 2.1. In the meantime, you can use aggregation with the memory sink (i.e. format("memory")) to store the results in an in-memory table, which can then be periodically written to a parquet table explicitly.
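A short sketch of that workaround, with placeholder names for the aggregated query, the in-memory table and the output path:

    // Keep the running aggregate in an in-memory table named "agg_table".
    val query = aggregatedDF.writeStream
      .format("memory")
      .queryName("agg_table")
      .outputMode("complete")
      .start()

    // Periodically (e.g. from a scheduler), snapshot the table to parquet.
    spark.table("agg_table")
      .write
      .mode("overwrite")
      .parquet("hdfs:///path/to/agg_parquet")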

Re: Visualization of data analysed using spark

2016-07-31 Thread Sivakumaran S
Hi Tony, If your requirement is browser based plotting (real time or otherwise), you can load the data and display it in a browser using D3. Since D3 has very low level plotting routines, you can look at C3 (provided by www.pubnub.com) or Rickshaw (https://github.com/shutterstock/rickshaw

[Spark 2.0] Why MutableInt cannot be cast to MutableLong?

2016-07-31 Thread Chanh Le
Hi everyone, Why can't MutableInt be cast to MutableLong? It's really weird, and it seems Spark 2.0 has a lot of errors with the parquet format. org.apache.spark.sql.catalyst.expressions.MutableInt cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong Caused by:
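If the underlying parquet files really hold INT while the table declares BIGINT, one workaround sketch (the column name and path are illustrative) is to read the parquet data with its own schema and widen the column explicitly, rather than relying on the table definition:

    import org.apache.spark.sql.functions.col

    // Read the parquet data as written (INT), then cast the column to bigint.
    val df = spark.read.parquet("/path/to/table")
      .withColumn("my_col", col("my_col").cast("bigint"))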

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
Thanks Ayan. This is the one I used scala> sqltext = """ | select * | from (select transactiondate, transactiondescription, debitamount | , rank() over (order by transactiondate desc) r | from ll_18740868 where transactiondescription like '%XYZ%' | ) inner |

Re: sql to spark scala rdd

2016-07-31 Thread Jacek Laskowski
Hi, What's the result type of sliding(2,1)? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sun, Jul 31, 2016 at 9:23 AM, sri hari kali charan Tummala

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
Thanks all scala> var maxdate = ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0).getDate(0) maxdate: java.sql.Date = 2015-12-15 scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG) && col("transactiondate") ===

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
Thanks Nicholas, got it. Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com Disclaimer: Use it at your own

Re: sql to spark scala rdd

2016-07-31 Thread sri hari kali charan Tummala
Tried this, no luck. What is the non-empty iterator here? OP:- (-987,non-empty iterator) (-987,non-empty iterator) (-987,non-empty iterator) (-987,non-empty iterator) (-987,non-empty iterator) sc.textFile(file).keyBy(x => x.split("\\~") (0)) .map(x => x._2.split("\\~")) .map(x => (x(0),x(2)))

Re: sql to spark scala rdd

2016-07-31 Thread sri hari kali charan Tummala
Hi All, I managed to write it using the sliding function, but can I get the key as well in my output? sc.textFile(file).keyBy(x => x.split("\\~") (0)) .map(x => x._2.split("\\~")) .map(x => (x(2).toDouble)).toArray().sliding(2,1).map(x => (x,x.size)).foreach(println) at the moment my

Re: Structured Streaming Parquet Sink

2016-07-31 Thread Jacek Laskowski
Hi Arun, Regarding parquet and complete output mode: A relevant piece of the code to think about: if (outputMode != OutputMode.Append) { throw new IllegalArgumentException( s"Data source $className does not support $outputMode output mode") }

Re: How to filter based on a constant value

2016-07-31 Thread Nicholas Hakobian
From the online docs: https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/Row.html#apply(int) java.lang.Object apply(int i) Returns the value at position i. If the value is null, null is returned. The following is a mapping between Spark SQL types and return types: So it's

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
Thanks gents. I am trying to understand this better. As I understand it, a DataFrame is basically the equivalent of a table in relational terms. So scala> var maxdate = ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")) maxdate: org.apache.spark.sql.DataFrame

Re: Visualization of data analysed using spark

2016-07-31 Thread Gourav Sengupta
If you are using Python, please try using Bokeh and its related stack. Most of the people in this forum, including the guys at Databricks, have not tried that stack from Anaconda; it's worth a try when you are visualizing data in a big data stack. Regards, Gourav On Sat, Jul 30, 2016 at 10:25 PM,