Re: How to start spark streaming application with recent past timestamp for replay of old batches?

2016-02-21 Thread Akhil Das
On Mon, Feb 22, 2016 at 12:18 PM, ashokkumar rajendran < ashokkumar.rajend...@gmail.com> wrote: > Hi Folks, > > > > I am exploring spark for streaming from two sources (a) Kinesis and (b) > HDFS for some of our use-cases. Since we maintain state gathered over last > x hours in spark streaming, we

[Example] : read custom schema from file

2016-02-21 Thread Divya Gehlot
Hi, Can anybody help me by providing me example how can we read schema of the data set from the file. Thanks, Divya

Re: [Please Help] Log redirection on EMR

2016-02-21 Thread Sabarish Sasidharan
Your logs are getting archived in your logs bucket in S3. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html Regards Sab On Mon, Feb 22, 2016 at 12:14 PM, HARSH TAKKAR wrote: > Hi > > In am using an EMR cluster for running my

How to start spark streaming application with recent past timestamp for replay of old batches?

2016-02-21 Thread ashokkumar rajendran
Hi Folks, I am exploring spark for streaming from two sources (a) Kinesis and (b) HDFS for some of our use-cases. Since we maintain state gathered over last x hours in spark streaming, we would like to replay the data from last x hours as batches during deployment. I have gone through the Spark

[Please Help] Log redirection on EMR

2016-02-21 Thread HARSH TAKKAR
Hi, I am using an EMR cluster for running my spark jobs, but after the job finishes the logs disappear. I have added a log4j.properties in my jar, but all the logs still redirect to the EMR resource manager, which vanishes after the job completes. Is there a way I could redirect the logs to a location in

Re: Accessing Web UI

2016-02-21 Thread Vasanth Bhat
Thanks Gourav, Eduardo. I tried http://localhost:8080 and http://OAhtvJ5MCA:8080/ . In both cases Firefox just hangs. I also tried with the lynx text-based browser; I get the message "HTTP request sent; waiting for response." and it hangs as well. Is there a way to enable debug logs in

Re: Error :Type mismatch error when passing hdfs file path to spark-csv load method

2016-02-21 Thread Jonathan Kelly
On the line preceding the one that the compiler is complaining about (which doesn't actually have a problem in itself), you declare df as "df"+fileName, making it a string. Then you try to assign a DataFrame to df, but it's already a string. I don't quite understand your intent with that previous
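
To make the fix concrete, here is a minimal sketch of the pattern Jonathan is pointing at, with illustrative names (not the original poster's code): keep the DataFrames in a Map keyed by name instead of trying to build a variable name from a string.

    // Hypothetical sketch, assuming a spark-shell session (sc, sqlContext available) and spark-csv on the classpath.
    // Instead of val df = "df" + fileName (which makes df a String), collect the DataFrames in a Map:
    val fileNames = Seq("dir1", "dir2")                        // illustrative subdirectory names
    val dfByName: Map[String, org.apache.spark.sql.DataFrame] =
      fileNames.map { fileName =>
        val df = sqlContext.read
          .format("com.databricks.spark.csv")                  // spark-csv, as in this thread
          .option("header", "true")
          .load("hdfs:///parent/" + fileName)                  // illustrative HDFS path
        ("df" + fileName) -> df
      }.toMap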

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-21 Thread @Sanjiv Singh
Compaction should have been triggered automatically, as the following properties are already set in *hive-site.xml*, and the *NO_AUTO_COMPACTION* property has not been set for these tables: hive.compactor.initiator.on = true, hive.compactor.worker.threads = 1

Re: Specify number of executors in standalone cluster mode

2016-02-21 Thread Hemant Bhanawat
The max number of cores per executor can be controlled using spark.executor.cores, and the maximum number of executors on a single worker can be determined by the environment variable SPARK_WORKER_INSTANCES. However, to ensure that all available cores are used, you will have to take care of how the stream
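
As a rough illustration of those two knobs (values assume the 6-worker, 8-cores-per-machine setup from the question below and are not a recommendation; exact behaviour varies by Spark version):

    // Sketch only: application-side settings for Spark standalone mode.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("StreamingApp")                 // illustrative name
      .set("spark.executor.cores", "2")           // at most 2 cores per executor
      .set("spark.cores.max", "12")               // cap for the whole app: 6 machines x 2 cores
    // SPARK_WORKER_INSTANCES is an environment variable read by the workers,
    // typically exported in conf/spark-env.sh on each worker machine.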

Error :Type mismatch error when passing hdfs file path to spark-csv load method

2016-02-21 Thread Divya Gehlot
Hi, I am trying to dynamically create Dataframe by reading subdirectories under parent directory My code looks like > import org.apache.spark._ > import org.apache.spark.sql._ > val hadoopConf = new org.apache.hadoop.conf.Configuration() > val hdfsConn = org.apache.hadoop.fs.FileSystem.get(new >

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-21 Thread Varadharajan Mukundan
Yes, I was burned by this issue a couple of weeks back. This also means that after every insert job, a compaction should be run to access the new rows from Spark. Sad that this issue is not documented / mentioned anywhere. On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-21 Thread @Sanjiv Singh
Hi Varadharajan, Thanks for your response. Yes, it is a transactional table; see the *show create table* output below. The table has barely 3 records, and after triggering compaction on it, it started showing results in Spark SQL. > *ALTER TABLE hivespark COMPACT 'major';* > *show create table

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-21 Thread Varadharajan Mukundan
Hi, Is the transactional attribute set on your table? I observed that the hive transactional storage structure does not work with Spark yet. You can confirm this by looking at the transactional attribute in the output of "desc extended " in the hive console. If you need to access a transactional table,
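
A sketch of the check and follow-up compaction being described (the table name hivespark is taken from Sanjiv's reply above; the compaction statement itself is issued from Hive, not Spark):

    // Sketch; assumes spark-shell with Hive support and the table from this thread.
    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)
    // Look for transactional=true among the table parameters:
    hiveContext.sql("DESC EXTENDED hivespark").collect().foreach(println)
    // If the table is transactional, run a compaction from the Hive CLI so Spark can see the rows, e.g.:
    //   ALTER TABLE hivespark COMPACT 'major';
    // and then re-query from Spark:
    hiveContext.sql("SELECT COUNT(*) FROM hivespark").show()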

Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-21 Thread @Sanjiv Singh
Hi, I have observed that Spark SQL is not returning records for hive bucketed ORC tables on HDP. In Spark SQL, I am able to list all tables, but queries on hive bucketed tables are not returning records. I have also tried the same for non-bucketed hive tables and it is working fine. Same

RE: Submitting Jobs Programmatically

2016-02-21 Thread Patrick Mi
Hi there, I had a similar problem in Java with a standalone cluster on Linux but got it working by passing the following option: -Dspark.jars=file:/path/to/sparkapp.jar (sparkapp.jar has the launch application). Hope that helps. Regards, Patrick -Original Message- From: Arko Provo
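
The same idea expressed through SparkConf instead of a -D system property might look like this (a sketch with an illustrative master URL and jar path, not Patrick's exact setup):

    // Sketch only; assumes a standalone cluster.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("SparkApp")                          // illustrative
      .setMaster("spark://master-host:7077")           // illustrative master URL
      .setJars(Seq("file:/path/to/sparkapp.jar"))      // same jar that -Dspark.jars points at
    val sc = new org.apache.spark.SparkContext(conf)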

Re:RE: how to set database in DataFrame.saveAsTable?

2016-02-21 Thread Mich Talebzadeh
Well my version of Spark is 1.5.2 On 21/02/2016 23:54, Jacky Wang wrote: > df.write.saveAsTable("db_name.tbl_name") // works, spark-shell, latest spark > version 1.6.0 > df.write.saveAsTable("db_name.tbl_name") // NOT work, spark-shell, old spark > version 1.4 > > -- > > Jacky Wang >

Re: Stream group by

2016-02-21 Thread Vinti Maheshwari
> > Yeah, i tried with reduceByKey, was able to do it. Thanks Ayan and Jatin. > For reference, final solution: > > def main(args: Array[String]): Unit = { > val conf = new SparkConf().setAppName("HBaseStream") > val sc = new SparkContext(conf) > // create a StreamingContext, the main

Re: Stream group by

2016-02-21 Thread Vinti Maheshwari
Yeah, I tried with reduceByKey and was able to do it. Thanks Ayan and Jatin. For reference, final solution: def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName("HBaseStream") val sc = new SparkContext(conf) // create a StreamingContext, the main entry point for

Re:RE: how to set database in DataFrame.saveAsTable?

2016-02-21 Thread Jacky Wang
df.write.saveAsTable("db_name.tbl_name") // works, spark-shell, latest spark version 1.6.0 df.write.saveAsTable("db_name.tbl_name") // NOT work, spark-shell, old spark version 1.4 -- Jacky Wang At 2016-02-21 17:35:53, "Mich Talebzadeh" wrote: I looked at doc on
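
For the older versions where the qualified name is rejected, a commonly suggested workaround (a sketch, not verified against 1.4) is to switch the current database first:

    // Sketch; db_name and tbl_name are placeholders, and the database must already exist
    // (see the original question at the end of this digest).
    import org.apache.spark.sql.hive.HiveContext
    val hc = new HiveContext(sc)
    val df = hc.sql("SELECT 1 AS id")       // illustrative DataFrame
    hc.sql("USE db_name")                   // make db_name the current database
    df.write.saveAsTable("tbl_name")        // table is created in the current database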

Re: Stream group by

2016-02-21 Thread ayan guha
I believe the best way would be to use reduceByKey operation. On Mon, Feb 22, 2016 at 5:10 AM, Jatin Kumar < jku...@rocketfuelinc.com.invalid> wrote: > You will need to do a collect and update a global map if you want to. > > myDStream.map(record => (record._2, (record._3, record_4, record._5))

Re: Spark Job Hanging on Join

2016-02-21 Thread Gourav Sengupta
Sorry, please include the following questions to the list above: the SPARK version? whether you are using RDD or DataFrames? is the code run locally or in SPARK Cluster mode or in AWS EMR? Regards, Gourav Sengupta On Sun, Feb 21, 2016 at 7:37 PM, Gourav Sengupta

Re: Spark Job Hanging on Join

2016-02-21 Thread Gourav Sengupta
Hi Tamara, a few basic questions first. How many executors are you using? Is the data all getting cached into the same executor? How many partitions do you have of the data? How many fields are you trying to use in the join? If you need any help in finding answers to these questions please let me

Re: Evaluating spark streaming use case

2016-02-21 Thread Ted Yu
w.r.t. the new mapWithState(), there have been some bug fixes since the release of 1.6.0, e.g. SPARK-13121 (java mapWithState mishandles scala Option). Looks like the 1.6.1 RC should come out next week. FYI On Sun, Feb 21, 2016 at 10:47 AM, Chris Fregly wrote: > good catch on the

Re: Evaluating spark streaming use case

2016-02-21 Thread Chris Fregly
good catch on the cleaner.ttl @jatin- when you say "memory-persisted RDD", what do you mean exactly? and how much data are you talking about? remember that spark can evict these memory-persisted RDDs at any time. they can be recovered from Kafka, but this is not a good situation to be in.

Re: RDD[org.apache.spark.sql.Row] filter ERROR

2016-02-21 Thread Tenghuan He
Hi Ted, Thanks a lot for your reply. I tried your code in spark-shell on my laptop and it works well, but when I tried it on another computer with spark installed I got an error: scala> val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num") :11: error: value toDF is not a
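
One common cause of "value toDF is not a member" is that the SQLContext implicits are not in scope in that session; a minimal sketch of what needs to be there (assuming a plain shell or application with sc defined):

    // Sketch: in an application (or an older shell) create the SQLContext and import its implicits explicitly.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._            // brings toDF into scope for Seqs/RDDs of tuples and case classes
    val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")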

Re: Stream group by

2016-02-21 Thread Jatin Kumar
You will need to do a collect and update a global map if you want to. myDStream.map(record => (record._2, (record._3, record_4, record._5)) .groupByKey((r1, r2) => (r1._1 + r2._1, r1._2 + r2._2, r1._3 + r2._3)) .foreachRDD(rdd => { rdd.collect().foreach((fileName,

Re: Stream group by

2016-02-21 Thread Vinti Maheshwari
Nevermind, seems like an executor level mutable map is not recommended as stated in http://blog.guillaume-pitel.fr/2015/06/spark-trick-shot-node-centered-aggregator/ On Sun, Feb 21, 2016 at 9:54 AM, Vinti Maheshwari wrote: > Thanks for your reply Jatin. I changed my

Re: Stream group by

2016-02-21 Thread Vinti Maheshwari
Thanks for your reply Jatin. I changed my parsing logic to what you suggested: def parseCoverageLine(str: String) = { val arr = str.split(",") ... ... (arr(0), arr(1) :: count.toList) // (test, [file, 1, 1, 2]) } Then in the grouping, can i use a global hash map

Re: Stream group by

2016-02-21 Thread Jatin Kumar
Hello Vinti, One way to get this done is you split your input line into key and value tuple and then you can simply use groupByKey and handle the values the way you want. For example: Assuming you have already split the values into a 5 tuple: myDStream.map(record => (record._2, (record._3,

Specify number of executors in standalone cluster mode

2016-02-21 Thread Saiph Kappa
Hi, I'm running a spark streaming application onto a spark cluster that spans 6 machines/workers. I'm using spark cluster standalone mode. Each machine has 8 cores. Is there any way to specify that I want to run my application on all 6 machines and just use 2 cores on each machine? Thanks

Re: Behind the scene of RDD to DataFrame

2016-02-21 Thread Weiwei Zhang
Thanks a lot! Best Regards, Weiwei On Sat, Feb 20, 2016 at 11:53 PM, Hemant Bhanawat wrote: > toDF internally calls sqlcontext.createDataFrame which transforms the RDD > to RDD[InternalRow]. This RDD[InternalRow] is then mapped to a dataframe. > > Type conversions (from

Re: Evaluating spark streaming use case

2016-02-21 Thread Ted Yu
w.r.t. cleaner TTL, please see: [SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0 FYI On Sun, Feb 21, 2016 at 4:16 AM, Gerard Maas wrote: > > It sounds like another window operation on top of the 30-min window will > achieve the desired objective. > Just

Stream group by

2016-02-21 Thread Vinti Maheshwari
Hello, I have input lines like below. *Input* t1, file1, 1, 1, 1 t1, file1, 1, 2, 3 t1, file2, 2, 2, 2, 2 t2, file1, 5, 5, 5 t2, file2, 1, 1, 2, 2 And I want to achieve output like the rows below, which is a vertical addition of the corresponding numbers. *Output* “file1” : [ 1+1+5, 1+2+5, 1+3+5
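
The replies above converge on reduceByKey; a minimal sketch of that vertical addition on the sample input (plain RDD for brevity; on the DStream the same map/reduceByKey pair applies per batch):

    // Sketch, assuming a spark-shell session (sc available).
    val lines = sc.parallelize(Seq(
      "t1, file1, 1, 1, 1",
      "t1, file1, 1, 2, 3",
      "t2, file1, 5, 5, 5"))
    val sums = lines
      .map { line =>
        val parts = line.split(",").map(_.trim)
        (parts(1), parts.drop(2).map(_.toLong))                      // (fileName, counts)
      }
      .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })  // element-wise addition
    sums.collect().foreach { case (f, c) => println(f + " : " + c.mkString("[", ", ", "]")) }
    // prints: file1 : [7, 8, 9]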

Re: spark-xml can't recognize schema

2016-02-21 Thread Dave Moyers
Make sure the xml input file is well formed (check your end tags). Sent from my iPhone > On Feb 21, 2016, at 8:14 AM, Prathamesh Dharangutte > wrote: > > This is the code I am using for parsing xml file: > > > > import org.apache.spark.{SparkConf,SparkContext} >

Re: RDD[org.apache.spark.sql.Row] filter ERROR

2016-02-21 Thread Ted Yu
I tried the following in spark-shell: scala> val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num") df0: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields] scala> val idList = List("1", "2", "3") idList: List[String] = List(1, 2, 3) scala> val
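
The preview is cut off; as a hedged guess at what such a filter can look like (not necessarily Ted's exact lines), both the DataFrame and the RDD[Row] routes are sketched below:

    // Sketch, assuming spark-shell with sqlContext.implicits._ in scope (for $ and toDF) and Spark 1.5+.
    val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
    val idList = List("1", "2", "3")
    // DataFrame route: keep rows whose num column, as a string, is in idList
    val kept = df0.filter($"num".cast("string").isin(idList: _*))
    // RDD[Row] route (num is the 4th column, index 3):
    val keptRows = df0.rdd.filter(row => idList.contains(row.getInt(3).toString))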

Re: spark-xml can't recognize schema

2016-02-21 Thread Sebastian Piu
No because you didn't say that explicitly. Can you share a sample file too? On Sun, 21 Feb 2016, 14:34 Prathamesh Dharangutte wrote: > I am using spark 1.4.0 with scala 2.10.4 and 0.3.2 of spark-xml > Orderid is empty for some books and multiple entries of it for other

Re: spark-xml can't recognize schema

2016-02-21 Thread Prathamesh Dharangutte
I am using spark 1.4.0 with scala 2.10.4 and 0.3.2 of spark-xml. Orderid is empty for some books and has multiple entries for other books; did you include that in your xml file?

Re: spark-xml can't recognize schema

2016-02-21 Thread Sebastian Piu
Just ran that code and it works fine, here is the output: What version are you using? val ctx = SQLContext.getOrCreate(sc) val df = ctx.read.format("com.databricks.spark.xml").option("rowTag", "book").load("file:///tmp/sample.xml") df.printSchema() root |-- name: long (nullable = true) |--

Re: spark-xml can't recognize schema

2016-02-21 Thread Prathamesh Dharangutte
This is the code I am using for parsing xml file: import org.apache.spark.{SparkConf,SparkContext} import org.apache.spark.sql.{DataFrame,SQLContext} import com.databricks.spark.xml object XmlProcessing { def main(args : Array[String]) = { val conf = new SparkConf()

Re: spark-xml can't recognize schema

2016-02-21 Thread Sebastian Piu
Can you paste the code you are using? On Sun, 21 Feb 2016, 13:19 Prathamesh Dharangutte wrote: > I am trying to parse xml file using spark-xml. But for some reason when i > print schema it only shows root instead of the hierarchy. I am using > sqlcontext to read the

spark-xml can't recognize schema

2016-02-21 Thread Prathamesh Dharangutte
I am trying to parse an xml file using spark-xml, but for some reason when I print the schema it only shows root instead of the hierarchy. I am using sqlcontext to read the data. I am proceeding according to this video: https://www.youtube.com/watch?v=NemEp53yGbI The structure of the xml file is somewhat
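
For reference, a minimal sketch of the kind of read being discussed (the file path is illustrative; the key point in Sebastian's example above is that the rowTag option has to match the repeating element in the XML, otherwise the inferred schema can come out as just root):

    // Sketch; assumes spark-shell started with the spark-xml package on the classpath,
    // e.g. --packages com.databricks:spark-xml_2.10:0.3.2 (the version mentioned in this thread).
    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")               // must match the repeating element in the file
      .load("file:///tmp/sample.xml")         // illustrative path, as in Sebastian's example
    df.printSchema()                          // should show the book fields, not just "root"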

Re: Evaluating spark streaming use case

2016-02-21 Thread Gerard Maas
It sounds like another window operation on top of the 30-min window will achieve the desired objective. Just keep in mind that you'll need to set the cleaner TTL (spark.cleaner.ttl) to a long enough value, and you will require enough resources (mem & disk) to keep the required data. -kr, Gerard.
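
A sketch of those two pieces (interval and TTL values are placeholders, not recommendations; note Ted's message above that spark.cleaner.ttl is removed in Spark 2.0):

    // Sketch only; assumes a Spark 1.x streaming app with 5-second batches, as in the original mail.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
    val conf = new SparkConf()
      .setAppName("EvaluateStreaming")
      .set("spark.cleaner.ttl", "93600")      // seconds; long enough to cover the largest window
    val ssc = new StreamingContext(conf, Seconds(5))
    // pairs: DStream[(String, Long)] built from the Kafka input (omitted here)
    // val halfHourly = pairs.reduceByKeyAndWindow(_ + _, Minutes(30), Minutes(30))
    // val daily = halfHourly.reduceByKeyAndWindow(_ + _, Minutes(24 * 60), Minutes(30))  // window on a window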

Fwd: Evaluating spark streaming use case

2016-02-21 Thread Jatin Kumar
Hello Spark users, I have to aggregate messages from kafka and at some fixed interval (say every half hour) update a memory persisted RDD and run some computation. This computation uses last one day data. Steps are: - Read from realtime Kafka topic X in spark streaming batches of 5 seconds -

filter by dict() key in pySpark

2016-02-21 Thread Franc Carter
I have a DataFrame that has a Python dict() as one of the columns. I'd like to filter the DataFrame for those Rows where the dict() contains a specific key, e.g. something like this: DF2 = DF1.filter('name' in DF1.params) but that gives me this error ValueError: Cannot convert column

Re: Element appear in both 2 splits of RDD after using randomSplit

2016-02-21 Thread nguyen duc tuan
That's very useful information. The reason for the weird problem is the non-determinism of the RDD before applying randomSplit. By caching the RDD, we can make it deterministic, and so the problem is solved. Thank you for your help. 2016-02-21 11:12 GMT+07:00 Ted Yu : >
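
A minimal sketch of the fix being described, with an illustrative RDD whose lineage is non-deterministic:

    // Sketch, assuming a spark-shell session (sc available).
    // Caching (and materializing) the RDD before randomSplit means both splits are computed
    // over the same data, so no element can end up in both of them.
    val data = sc.parallelize(1 to 1000).map(_ => scala.util.Random.nextInt(100))  // non-deterministic lineage
    data.cache()
    data.count()                                                   // materialize the cached data once
    val Array(a, b) = data.randomSplit(Array(0.7, 0.3), seed = 11L)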

RE: how to set database in DataFrame.saveAsTable?

2016-02-21 Thread Mich Talebzadeh
I looked at the doc on this. It is not clear what goes on behind the scenes, and there is very little documentation on it. First, in Hive a database has to exist before it can be used, so sql(“use mytable”) will not create a database for you. Also, you cannot call your table mytable in database mytable!