spark jdbc postgres query results don't match those of postgres query
I am running into a weird issue in Spark 1.6, which I was wondering if anyone has encountered before. I am running a simple select query from Spark using a JDBC connection to Postgres:

val POSTGRES_DRIVER: String = "org.postgresql.Driver"
val srcSql = """select total_action_value, last_updated from fb_fact_no_seg_20180123 where ad_id = '23842688418150437'"""
val r = sqlContext.read.format("jdbc").options(Map(
  "url" -> jdbcUrl,
  "dbtable" -> s"($srcSql) as src",
  "driver" -> POSTGRES_DRIVER
)).load().coalesce(1).cache()

r.show
+------------------+--------------------+
|total_action_value|        last_updated|
+------------------+--------------------+
|         2743.3301|2018-02-06 00:18:...|
+------------------+--------------------+

From the above you see that the result is 2743.3301, but when I run the same query directly in Postgres I get a slightly different answer:

select total_action_value, last_updated from fb_fact_no_seg_20180123 where ad_id = '23842688418150437';

 total_action_value |    last_updated
--------------------+---------------------
            2743.33 | 2018-02-06 00:18:08

As you can see from above, the value is 2743.33. So why is the result coming from Spark off by .0001? Basically, where is the .0001 coming from, since in Postgres the decimal value is .33?

Thanks,

KP
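One plausible cause, assuming (this is an assumption, not confirmed in the thread) that total_action_value is a Postgres REAL/float4 column: the nearest single-precision float to 2743.33 is exactly 2743.330078125. Spark widens the value to a double after the JDBC read and shows more digits, while Postgres's default float4 text output shortens it back to 2743.33. A pure-Scala sketch of the rounding:

```scala
// Sketch of float4 rounding (assumption: the column is REAL/float4 in Postgres).
val stored: Float = 2743.33f            // nearest representable single-precision value
val widened: Double = stored.toDouble   // exact widening, as after a JDBC read into Spark
println(widened)                        // 2743.330078125
println(f"$widened%.4f")                // 2743.3301, the value Spark displayed
```

If the column is instead NUMERIC/DECIMAL, this explanation does not apply and the discrepancy would need another look.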
Re: Setting Optimal Number of Spark Executor Instances
Mohini,

We set that parameter before we went and played with the number of executors, and it didn't seem to help at all.

Thanks,

KP

On Tue, Mar 14, 2017 at 3:37 PM, mohini kalamkar wrote:
> Hi,
>
> try using this parameter --conf spark.sql.shuffle.partitions=1000
>
> Thanks,
> Mohini
>
> On Tue, Mar 14, 2017 at 3:30 PM, kpeng1 wrote:
>> Hi All,
>>
>> I am currently on Spark 1.6, and I was doing a SQL join on two tables that
>> are over 100 million rows each. I noticed that it was spawning 3+ tasks
>> (this is the progress meter that we are seeing show up). We tried
>> coalesce, repartition, and shuffle partitions to drop the number of tasks
>> down, because we were getting timeouts due to the number of tasks being
>> spawned, but those operations did not seem to reduce the number of tasks.
>> The solution we came up with was actually to set the number of executors to 50
>> (--num-executors=50), and it looks like it spawned 200 active tasks, but the
>> total number of tasks remained the same. Was wondering if anyone knows what
>> is going on? Is there an optimal number of executors? I was under the
>> impression that the default dynamic allocation would pick the optimal number
>> of executors for us and that this situation wouldn't happen. Is there
>> something I am missing?
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Setting-Optimal-Number-of-Spark-Executor-Instances-tp28493.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Thanks & Regards,
Mohini Kalamkar
M: +1 310 567 9329
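One likely reason the total task count did not move (an assumption; the thread never confirms it): in Spark 1.6, the number of reduce-side tasks of a SQL join comes from spark.sql.shuffle.partitions, while map-side task counts come from the input splits, so adding executors changes how many tasks run concurrently, not how many exist in total. The knobs discussed in this thread, as a sketch (values are the ones from the thread, not tuned recommendations):

```scala
// spark-submit flags from this thread (not recommendations):
//   --num-executors 50 --conf spark.sql.shuffle.partitions=1000
// The same shuffle-partition setting from inside the Spark 1.6 shell:
sqlContext.setConf("spark.sql.shuffle.partitions", "1000")
```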
Re: Weird results with Spark SQL Outer joins
Mike,

It looks like you are right. The results seem to be fine. It looks like I messed up on the filtering clause.

sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE (s.date >= '2016-01-03' OR s.date IS NULL) AND (d.date >= '2016-01-03' OR d.date IS NULL)").count()
res2: Long = 53042

Davies, Cesar, Gourav,

Thanks for the support.

KP

On Tue, May 3, 2016 at 11:26 AM, Michael Segel wrote:
> Silly question?
>
> If you change the predicate to
> (s.date >= '2016-01-03' OR s.date IS NULL)
> AND
> (d.date >= '2016-01-03' OR d.date IS NULL)
>
> What do you get?
>
> Sorry if the syntax isn't 100% correct. The idea is to not drop null
> values from the query.
> I would imagine that this shouldn't kill performance, since it's most likely
> a post-join filter on the result set?
> (Or is that just a Hive thing?)
>
> -Mike
>
>> On May 3, 2016, at 12:42 PM, Davies Liu wrote:
>>
>> Bingo, the two predicates s.date >= '2016-01-03' AND d.date >= '2016-01-03'
>> are the root cause: they filter out all the nulls from the outer join,
>> giving the same result as an inner join.
>>
>> In Spark 2.0, we turn these joins into inner joins.
>>
>> On Tue, May 3, 2016 at 9:50 AM, Cesar Flores wrote:
>>> Hi
>>>
>>> Have you tried the joins without the where clause? When you use it you are
>>> filtering out all the rows with null columns in those fields. In other words,
>>> you are doing an inner join in all your queries.
>>>
>>> On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>
>>>> Hi Kevin,
>>>>
>>>> Having given it a first look, I do think that you have hit something here,
>>>> and this does not look quite fine. I have to work on the multiple AND
>>>> conditions in ON and see whether that is causing any issues.
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>> On Tue, May 3, 2016 at 8:28 AM, Kevin Peng wrote:
>>>>>
>>>>> Davies,
>>>>>
>>>>> Here is the code that I am typing into the spark-shell along with the
>>>>> results (my question is at the bottom):
>>>>>
>>>>> val dps = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/dps_csv/")
>>>>> val swig = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/swig_csv/")
>>>>>
>>>>> dps.count
>>>>> res0: Long = 42694
>>>>>
>>>>> swig.count
>>>>> res1: Long = 42034
>>>>>
>>>>> dps.registerTempTable("dps_pin_promo_lt")
>>>>> swig.registerTempTable("swig_pin_promo_lt")
>>>>>
>>>>> sqlContext.sql("select * from dps_pin_promo_lt where date > '2016-01-03'").count
>>>>> res4: Long = 42666
>>>>>
>>>>> sqlContext.sql("select * from swig_pin_promo_lt where date > '2016-01-03'").count
>>>>> res5: Long = 34131
>>>>>
>>>>> sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt where date > '2016-01-03'").count
>>>>> res6: Long = 42533
>>>>>
>>>>> sqlContext.sql("select distinct date, account, ad from swig_pin_promo_lt where date > '2016-01-03'").count
>>>>> res7: Long = 34131
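Michael's predicate can be checked without Spark. Below is a pure-Scala sketch (toy, made-up rows; Option stands in for SQL NULL) showing that the plain two-sided date filter discards the null-padded rows of a full outer join, while the OR IS NULL form keeps them:

```scala
// Toy data: key -> date. Keys "a" and "c" exist on only one side each.
val s = Map("a" -> "2016-01-04", "b" -> "2016-01-05")            // s.date per key
val d = Map("b" -> "2016-01-05", "c" -> "2016-01-06")            // d.date per key
// FULL OUTER JOIN on the key; None models the NULL padding.
val full = (s.keySet ++ d.keySet).toList.map(k => (s.get(k), d.get(k)))
// Plain filter: a comparison against NULL is not true, so padded rows vanish.
val plain = full.filter { case (sd, dd) =>
  sd.exists(_ >= "2016-01-03") && dd.exists(_ >= "2016-01-03") }
// Corrected filter: "col >= X OR col IS NULL" keeps the padded rows.
val fixed = full.filter { case (sd, dd) =>
  (sd.isEmpty || sd.exists(_ >= "2016-01-03")) &&
  (dd.isEmpty || dd.exists(_ >= "2016-01-03")) }
println(s"full outer: ${full.size}, plain WHERE: ${plain.size}, fixed WHERE: ${fixed.size}")
// full outer: 3, plain WHERE: 1, fixed WHERE: 3
```

ISO dates compare correctly as strings here, which is why plain lexicographic `>=` stands in for the SQL date comparison.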
Re: Weird results with Spark SQL Outer joins
Davies,

What exactly do you mean in regards to Spark 2.0 turning these joins into inner joins? Does this mean that Spark SQL won't be supporting where clauses in outer joins?

Cesar & Gourav,

When running the queries without the where clause it works as expected. I am pasting my results below:

val dps = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/dps_csv/")
val swig = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/swig_csv/")

dps.count
res0: Long = 42694

swig.count
res1: Long = 42034

dps.registerTempTable("dps_pin_promo_lt")
swig.registerTempTable("swig_pin_promo_lt")

sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad)").count()
res5: Long = 60919

sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad)").count()
res6: Long = 42034

sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad)").count()
res7: Long = 42694

Thanks,

KP

On Tue, May 3, 2016 at 10:42 AM, Davies Liu wrote:
> Bingo, the two predicates s.date >= '2016-01-03' AND d.date >= '2016-01-03'
> are the root cause: they filter out all the nulls from the outer join,
> giving the same result as an inner join.
>
> In Spark 2.0, we turn these joins into inner joins.
>
> On Tue, May 3, 2016 at 9:50 AM, Cesar Flores wrote:
>> Hi
>>
>> Have you tried the joins without the where clause? When you use it you are
>> filtering out all the rows with null columns in those fields. In other words,
>> you are doing an inner join in all your queries.
>>
>> On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>> Hi Kevin,
>>>
>>> Having given it a first look, I do think that you have hit something here,
>>> and this does not look quite fine. I have to work on the multiple AND
>>> conditions in ON and see whether that is causing any issues.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Tue, May 3, 2016 at 8:28 AM, Kevin Peng wrote:
>>>>
>>>> Davies,
>>>>
>>>> Here is the code that I am typing into the spark-shell along with the
>>>> results (my question is at the bottom):
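A quick consistency check on the unfiltered counts above, assuming matches pair up one-to-one (which the left outer count equalling swig.count and the right outer count equalling dps.count suggests): a full outer join emits the matched pairs plus the unmatched rows from each side, so matched = left + right - fullOuter. The sketch below just does that arithmetic with the counts from this message:

```scala
// Counts reported above (assumption: one-to-one matches on the three join keys).
val swig = 42034L          // left table rows, equal to the LEFT OUTER count
val dps = 42694L           // right table rows, equal to the RIGHT OUTER count
val fullOuter = 60919L     // FULL OUTER JOIN count
val matched = swig + dps - fullOuter
println(s"matched key combinations: $matched")   // 23809
```

Notably, 23809 is also the count the filtered joins return elsewhere in this thread, consistent with the date filter collapsing every join flavor to the matched rows.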
Re: Weird results with Spark SQL Outer joins
Davies,

Here is the code that I am typing into the spark-shell along with the results (my question is at the bottom):

val dps = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/dps_csv/")
val swig = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/swig_csv/")

dps.count
res0: Long = 42694

swig.count
res1: Long = 42034

dps.registerTempTable("dps_pin_promo_lt")
swig.registerTempTable("swig_pin_promo_lt")

sqlContext.sql("select * from dps_pin_promo_lt where date > '2016-01-03'").count
res4: Long = 42666

sqlContext.sql("select * from swig_pin_promo_lt where date > '2016-01-03'").count
res5: Long = 34131

sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt where date > '2016-01-03'").count
res6: Long = 42533

sqlContext.sql("select distinct date, account, ad from swig_pin_promo_lt where date > '2016-01-03'").count
res7: Long = 34131

sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
res9: Long = 23809

sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
res10: Long = 23809

sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
res11: Long = 23809

sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
res12: Long = 23809

From my results above, we notice that the counts of distinct values, based on the join criteria and filter criteria for each individual table, are located at res6 and res7. My question is: why is the outer join producing fewer rows than the smallest table? If there are no matches, it should still bring in that row as part of the outer join. For the full and right outer joins I am expecting to see a minimum of res6 rows, but I get fewer. Is there something specific that I am missing here? I am expecting that the full outer join would give me the union of the two table sets, so I am expecting at least 42533 rows, not 23809.

Gourav,

I just ran this result set in a new session with slightly newer data... still seeing those results.

Thanks,

KP

On Mon, May 2, 2016 at 11:16 PM, Davies Liu wrote:
> As @Gourav said, all the joins with different join types show the same results,
> which would mean that all the rows from the left matched at least one row from
> the right, and all the rows from the right matched at least one row from the
> left, even though the number of rows on the left does not equal that of the right.
>
> This is a correct result.
>
> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng wrote:
>> Yong,
>>
>> Sorry, let me explain my deduction; it is going to be difficult to get sample
>> data out since the dataset I am using is proprietary.
>>
>> From the above set of queries (the ones mentioned in the comments above), both
>> inner and outer joins are producing the same counts. They are basically pulling
>> out selected columns from the query, but there is no rollup happening or
>> anything that would make it suspicious that there is any difference besides
>> the type of join. The tables are matched based on three keys that are present
>> in both tables (ad, account, and date); on top of this they are filtered by
>> date being above 2016-01-03. Since all the joins are producing the same
>> counts, the natural suspicion is that the tables are identical, but when I
>> run the following two queries:
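Davies' point, that a WHERE clause requiring columns from both sides to be non-null turns every flavor of outer join into the inner join, can be reproduced without Spark. A pure-Scala sketch with toy data, using Option for SQL NULL:

```scala
// Toy data: key -> date. Key 1 is left-only, key 3 is right-only, key 2 matches.
val s = Map(1 -> "2016-01-04", 2 -> "2016-01-05")
val d = Map(2 -> "2016-01-05", 3 -> "2016-01-06")
// The WHERE predicate: both sides must be non-null AND pass the date check.
def filt(p: (Option[String], Option[String])): Boolean =
  p._1.exists(_ >= "2016-01-03") && p._2.exists(_ >= "2016-01-03")
val inner = s.keySet.intersect(d.keySet).toList.map(k => (s.get(k), d.get(k)))
val full  = (s.keySet ++ d.keySet).toList.map(k => (s.get(k), d.get(k)))
val left  = s.keySet.toList.map(k => (s.get(k), d.get(k)))
val right = d.keySet.toList.map(k => (s.get(k), d.get(k)))
// All four join flavors collapse to the same count once the filter runs:
println(List(inner, full, left, right).map(_.count(filt)))   // List(1, 1, 1, 1)
```

This is why the outer joins above can legitimately return fewer rows than the smallest table: the filter removes every null-padded row the outer join added.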
Re: Weird results with Spark SQL Outer joins
Yong,

Sorry, let me explain my deduction; it is going to be difficult to get sample data out since the dataset I am using is proprietary.

From the above set of queries (the ones mentioned in the comments above), both inner and outer joins are producing the same counts. They are basically pulling out selected columns from the query, but there is no rollup happening or anything that would make it suspicious that there is any difference besides the type of join. The tables are matched based on three keys that are present in both tables (ad, account, and date); on top of this they are filtered by date being above 2016-01-03. Since all the joins are producing the same counts, the natural suspicion is that the tables are identical, but when I run the following two queries:

scala> sqlContext.sql("select * from swig_pin_promo_lt where date >='2016-01-03'").count
res14: Long = 34158

scala> sqlContext.sql("select * from dps_pin_promo_lt where date >='2016-01-03'").count
res15: Long = 42693

The above two queries filter the data based on the date used by the joins, 2016-01-03, and you can see the row counts between the two tables are different. This is why I am suspecting something is wrong with the outer joins in Spark SQL: in this situation the right and full outer joins may produce the same results, but they should not be equal to the left join, and definitely not the inner join, unless I am missing something.

Side note: In my hasty response above I posted the wrong count for dps.count; the real value is res16: Long = 42694.

Thanks,

KP

On Mon, May 2, 2016 at 12:50 PM, Yong Zhang wrote:
> We are still not sure what the problem is, if you cannot show us some
> example data.
>
> For dps with 42632 rows and swig with 42034 rows: dps full outer join with
> swig on 3 columns, with additional filters, gets the same result-set row
> count as dps left outer join with swig on 3 columns with additional
> filters, and again the same result-set row count as dps right outer join
> with swig on 3 columns with the same additional filters.
>
> Without knowing your data, I cannot see the reason that this has to be a
> bug in Spark.
>
> Am I misunderstanding your bug?
>
> Yong
>
> --
> From: kpe...@gmail.com
> Date: Mon, 2 May 2016 12:11:18 -0700
> Subject: Re: Weird results with Spark SQL Outer joins
> To: gourav.sengu...@gmail.com
> CC: user@spark.apache.org
>
> Gourav,
>
> I wish that was the case, but I have done a select count on each of the two
> tables individually, and they return different numbers of rows:
>
> dps.registerTempTable("dps_pin_promo_lt")
> swig.registerTempTable("swig_pin_promo_lt")
>
> dps.count()
> RESULT: 42632
>
> swig.count()
> RESULT: 42034
>
> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>> This shows that both the tables have matching records and no mismatches.
>> Therefore you obviously have the same results irrespective of whether you
>> use a right or left join.
>>
>> I think that there is no problem here, unless I am missing something.
>>
>> Regards,
>> Gourav
>>
>> On Mon, May 2, 2016 at 7:48 PM, kpeng1 wrote:
>>> Also, the results of the inner query produced the same results:
>>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>> RESULT: 23747
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
Re: Weird results with Spark SQL Outer joins
Gourav,

I wish that was the case, but I have done a select count on each of the two tables individually, and they return different numbers of rows:

dps.registerTempTable("dps_pin_promo_lt")
swig.registerTempTable("swig_pin_promo_lt")

dps.count()
RESULT: 42632

swig.count()
RESULT: 42034

On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta wrote:
> This shows that both the tables have matching records and no mismatches.
> Therefore you obviously have the same results irrespective of whether you
> use a right or left join.
>
> I think that there is no problem here, unless I am missing something.
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 7:48 PM, kpeng1 wrote:
>> Also, the results of the inner query produced the same results:
>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>> RESULT: 23747
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
Re: Weird results with Spark SQL Outer joins
Gourav,

Apologies. I edited my post with this information:

Spark version: 1.6
Result from spark shell
OS: Linux version 2.6.32-431.20.3.el6.x86_64 (mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)) #1 SMP Thu Jun 19 21:14:45 UTC 2014

Thanks,

KP

On Mon, May 2, 2016 at 11:05 AM, Gourav Sengupta wrote:
> Hi,
>
> As always, can you please write down details regarding your SPARK cluster:
> the version, OS, IDE used, etc.?
>
> Regards,
> Gourav Sengupta
>
> On Mon, May 2, 2016 at 5:58 PM, kpeng1 wrote:
>> Hi All,
>>
>> I am running into a weird result with Spark SQL outer joins. The results
>> for all of them seem to be the same, which does not make sense given the
>> data. Here are the queries that I am running, with the results:
>>
>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>> RESULT: 23747
>>
>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>> RESULT: 23747
>>
>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>> RESULT: 23747
>>
>> Was wondering if someone has encountered this issue before.
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
Re: println not appearing in libraries when running job using spark-submit --master local
Ted,

What triggerAndWait does is perform a REST call to a specified url and then wait until the status message that gets returned by that url, in a json field, says complete. The issue is that I put a println at the very top of the method and that doesn't get printed out. I know that the println isn't the problem, because there is an exception that I throw further down the line, and that exception is what I am currently getting; but none of the printlns along the way are showing:

def triggerAndWait(url: String, pollInterval: Int = 1000 * 30,
    timeOut: Int = 1000 * 60 * 60, connectTimeout: Int = 3,
    readTimeout: Int = 3, requestMethod: String = "GET"): Boolean = {
  println("Entering triggerAndWait function - url: " + url +
    " pollInterval: " + pollInterval.toString() + " timeOut: " +
    timeOut.toString() + " connectionTimeout: " + connectTimeout.toString() +
    " readTimeout: " + readTimeout.toString() +
    " requestMethod: " + requestMethod)
  ...

Thanks,

KP

On Mon, Mar 28, 2016 at 1:52 PM, Ted Yu wrote:
> Can you describe what gets triggered by triggerAndWait?
>
> Cheers
>
> On Mon, Mar 28, 2016 at 1:39 PM, kpeng1 wrote:
>> Hi All,
>>
>> I am currently trying to debug a spark application written in scala. I have
>> a main method:
>>
>> def main(args: Array[String]) {
>> ...
>> SocialUtil.triggerAndWait(triggerUrl)
>> ...
>>
>> The SocialUtil object is included in a separate jar. I launched the
>> spark-submit command using --jars, passing the SocialUtil jar. Inside the
>> triggerAndWait function I have a println statement that is the first thing
>> in the method, but it doesn't seem to be coming out. All printlns that
>> happen inside the main function directly are appearing, though. I was
>> wondering if anyone knows what is going on in this situation and how I can
>> go about making the println in the SocialUtil object appear.
>>
>> Thanks,
>>
>> KP
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/println-not-appearing-in-libraries-when-running-job-using-spark-submit-master-local-tp26617.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
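The thread does not resolve the issue, but a common next diagnostic step is to swap the bare println for a logger: the log line then carries the emitting class name and a timestamp, which helps rule out a stale copy of the library jar being picked up from the classpath. A hedged sketch with a stub body and a hypothetical URL (java.util.logging is used only because it is in the stdlib; Spark itself logs through log4j):

```scala
import java.util.logging.Logger

// Sketch of SocialUtil with a logger instead of println; the body is a stub,
// not the real implementation from the thread.
object SocialUtilSketch {
  private val log = Logger.getLogger(getClass.getName)
  def triggerAndWait(url: String): Boolean = {
    log.info(s"Entering triggerAndWait - url: $url")   // stamps class + timestamp
    true   // the real version would poll the url until the json status says complete
  }
}

println(SocialUtilSketch.triggerAndWait("http://example.com/status"))   // true
```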
Re: Loading in json with spark sql
Yin,

Yup, thanks. I fixed that shortly after I posted and it worked.

Thanks,

Kevin

On Fri, Mar 13, 2015 at 8:28 PM, Yin Huai wrote:
> Seems you want to use an array for the field "providers", like
> "providers":[{"id": ...}, {"id":...}] instead of "providers":{{"id": ...}, {"id":...}}
>
> On Fri, Mar 13, 2015 at 7:45 PM, kpeng1 wrote:
>> Hi All,
>>
>> I was noodling around with loading a json file into Spark SQL's hive
>> context, and I noticed that I get the following message after loading in
>> the json file:
>> PhysicalRDD [_corrupt_record#0], MappedRDD[5] at map at JsonRDD.scala:47
>>
>> I am using the HiveContext to load in the json file using the jsonFile
>> command. I also have 1 json object per line in the file. Here is a sample
>> of the contents of the json file:
>>
>> {"user_id":"7070","providers":{{"id":"8753","name":"pjfig","behaviors":{"b1":"erwxt","b2":"yjooj"}},{"id":"8329","name":"dfvhh","behaviors":{"b1":"pjjdn","b2":"ooqsh"
>> {"user_id":"1615","providers":{{"id":"6105","name":"rsfon","behaviors":{"b1":"whlje","b2":"lpjnq"}},{"id":"6828","name":"pnmrb","behaviors":{"b1":"fjpmz","b2":"dxqxk"
>> {"user_id":"5210","providers":{{"id":"9360","name":"xdylm","behaviors":{"b1":"gcdze","b2":"cndcs"}},{"id":"4812","name":"gxboh","behaviors":{"b1":"qsxao","b2":"ixdzq"
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Loading-in-json-with-spark-sql-tp22044.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
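Following Yin's suggestion, here is the first sample line rewritten with "providers" as a JSON array (a sketch: the closing brackets are reconstructed, since the original sample lines arrive truncated):

```json
{"user_id":"7070","providers":[{"id":"8753","name":"pjfig","behaviors":{"b1":"erwxt","b2":"yjooj"}},{"id":"8329","name":"dfvhh","behaviors":{"b1":"pjjdn","b2":"ooqsh"}}]}
```

With `[...]` instead of `{{...}}` each line is valid JSON, so jsonFile no longer falls back to the `_corrupt_record` column.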
Re: spark sql writing in avro
Markus,

Thanks. That makes sense. I was able to get this to work with spark-shell, passing in the jar built from git. I did notice that I couldn't get AvroSaver.save to work with SQLContext, but it works with HiveContext. Not sure if that is an issue, but for me it is fine. Once again, thanks for the help.

Kevin

On Fri, Mar 13, 2015 at 1:57 PM, M. Dale wrote:
> I probably did not do a good enough job explaining the problem. If you
> used Maven with the default Maven repository, you have an old version of
> spark-avro that does not contain AvroSaver and does not have the saveAsAvro
> method implemented.
>
> Assuming you use the default Maven repo location:
>
> cd ~/.m2/repository/com/databricks/spark-avro_2.10/0.1
> jar tvf spark-avro_2.10-0.1.jar | grep AvroSaver
>
> Comes up empty. The jar file does not contain this class, because
> AvroSaver.scala wasn't added until Jan 21. The jar file is from 14 November.
>
> So:
>
> git clone git@github.com:databricks/spark-avro.git
> cd spark-avro
> sbt publish-m2
>
> This publishes the latest master code (this includes AvroSaver etc.) to
> your local Maven repo, and Maven will pick up the latest version of
> spark-avro (for this machine).
>
> Now you should be able to compile and run.
>
> HTH,
> Markus
>
> On 03/12/2015 11:55 PM, Kevin Peng wrote:
>> Dale,
>>
>> I basically have the same Maven dependency above, but my code will not
>> compile due to not being able to reference AvroSaver, though the
>> saveAsAvro reference compiles fine, which is weird. Even though saveAsAvro
>> compiles for me, it errors out when running the spark job due to it not
>> being implemented (the job quits and says non-implemented method or
>> something along those lines).
>>
>> I will try using the spark-shell and passing in the jar built from
>> github, since I haven't tried that quite yet.
>>
>> On Thu, Mar 12, 2015 at 6:44 PM, M. Dale wrote:
>>> Short answer: if you downloaded spark-avro from the repo.maven.apache.org
>>> repo, you might be using an old version (pre-November 14, 2014);
>>> see timestamps at http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/
>>> Lots of changes at https://github.com/databricks/spark-avro since then.
>>>
>>> Databricks, thank you for sharing the Avro code!!!
>>>
>>> Could you please push out the latest version or update the version
>>> number and republish to repo.maven.apache.org (I have no idea how jars
>>> get there). Or is there a different repository that users should point
>>> to for this artifact?
>>>
>>> Workaround: Download from https://github.com/databricks/spark-avro and
>>> build with the latest functionality (still version 0.1) and add to your
>>> local Maven or Ivy repo.
>>>
>>> Long version:
>>> I used a default Maven build and declared my dependency on:
>>>
>>> <dependency>
>>>   <groupId>com.databricks</groupId>
>>>   <artifactId>spark-avro_2.10</artifactId>
>>>   <version>0.1</version>
>>> </dependency>
>>>
>>> Maven downloaded the 0.1 version from
>>> http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/
>>> and included it in my app code jar.
>>>
>>> From spark-shell:
>>>
>>> import com.databricks.spark.avro._
>>> import org.apache.spark.sql.SQLContext
>>> val sqlContext = new SQLContext(sc)
>>>
>>> # This schema includes LONG for time in millis (https://github.com/medale/spark-mail/blob/master/mailrecord/src/main/avro/com/uebercomputing/mailrecord/MailRecord.avdl)
>>> val recordsSchema = sqlContext.avroFile("/opt/rpm1/enron/enron-tiny.avro")
>>> java.lang.RuntimeException: Unsupported type LONG
>>>
>>> However, checking out the spark-avro code from its GitHub repo and adding
>>> a test case against the MailRecord avro, everything ran fine.
>>>
>>> So I built the databricks spark-avro locally on my box and then put it in
>>> my local Maven repo; everything worked from spark-shell when adding that
>>> jar as a dependency.
>>>
>>> Hope this helps for the "save" case as well. In the pre-14NOV version,
>>> avro.scala says:
>>>
>>> // TODO: Implement me.
>>> implicit class AvroSchemaRDD(schemaRDD: SchemaRDD) {
>>>   def saveAsAvroFile(path: String): Unit = ???
>>> }
>>>
>>> Markus
>>>
>>> On 03/12/2015 07:05 PM, kpeng1 wrote:
>>>> Hi All,
>>>>
>>>> I am currently trying to write out a SchemaRDD to avro. I noticed that
>>>> there is a databricks spark-avro library and I have included that in my
>>>> dependencies, but it looks like I am not able to access the AvroSaver
Re: spark sql writing in avro
Dale, I basically have the same maven dependency above, but my code will not compile due to not being able to reference to AvroSaver, though the saveAsAvro reference compiles fine, which is weird. Eventhough saveAsAvro compiles for me, it errors out when running the spark job due to it not being implemented (the job quits and says non implemented method or something along those lines). I will try going the spark shell and passing in the jar built from github since I haven't tried that quite yet. On Thu, Mar 12, 2015 at 6:44 PM, M. Dale wrote: > Short answer: if you downloaded spark-avro from the repo.maven.apache.org > repo you might be using an old version (pre-November 14, 2014) - > see timestamps at http://repo.maven.apache.org/ > maven2/com/databricks/spark-avro_2.10/0.1/ > Lots of changes at https://github.com/databricks/spark-avro since then. > > Databricks, thank you for sharing the Avro code!!! > > Could you please push out the latest version or update the version > number and republish to repo.maven.apache.org (I have no idea how jars get > there). Or is there a different repository that users should point to for > this artifact? > > Workaround: Download from https://github.com/databricks/spark-avro and > build > with latest functionality (still version 0.1) and add to your local Maven > or Ivy repo. > > Long version: > I used a default Maven build and declared my dependency on: > > > com.databricks > spark-avro_2.10 > 0.1 > > > Maven downloaded the 0.1 version from http://repo.maven.apache.org/ > maven2/com/databricks/spark-avro_2.10/0.1/ and included it in my app code > jar. 
> > From spark-shell: > > import com.databricks.spark.avro._ > import org.apache.spark.sql.SQLContext > val sqlContext = new SQLContext(sc) > > // This schema includes LONG for time in millis (https://github.com/medale/ > spark-mail/blob/master/mailrecord/src/main/avro/com/ > uebercomputing/mailrecord/MailRecord.avdl) > val recordsSchema = sqlContext.avroFile("/opt/rpm1/enron/enron-tiny.avro") > java.lang.RuntimeException: Unsupported type LONG > > However, checking out the spark-avro code from its GitHub repo and adding > a test case against the MailRecord avro, everything ran fine. > > So I built the databricks spark-avro locally on my box and then put it in > my > local Maven repo - everything worked from spark-shell when adding that jar > as a dependency. > > Hope this helps for the "save" case as well. On the pre-14NOV version, > avro.scala > says: > // TODO: Implement me. > implicit class AvroSchemaRDD(schemaRDD: SchemaRDD) { > def saveAsAvroFile(path: String): Unit = ??? > } > > Markus > > On 03/12/2015 07:05 PM, kpeng1 wrote: > >> Hi All, >> >> I am currently trying to write out a SchemaRDD to avro. I noticed that >> there >> is a databricks spark-avro library and I have included that in my >> dependencies, but it looks like I am not able to access the AvroSaver >> object. On compilation of the job I get this: >> error: not found: value AvroSaver >> [ERROR] AvroSaver.save(resultRDD, args(4)) >> >> I also tried calling saveAsAvro on the resultRDD (the actual rdd with the >> results) and that passes compilation, but when I run the code I get an >> error >> that says saveAsAvro is not implemented. I am using version 0.1 of >> spark-avro_2.10 >> >> >> >> >> -- >> View this message in context: http://apache-spark-user-list. >> 1001560.n3.nabble.com/spark-sql-writing-in-avro-tp22021.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. 
>> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> >
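For readers hitting the same wall, here is a minimal spark-shell sketch of the round trip Markus describes. It assumes a locally built post-14NOV spark-avro 0.1 jar on the classpath; the `avroFile` and `saveAsAvroFile` names are taken from the thread, not verified against a published release.

```scala
// Sketch only: assumes a locally built spark-avro 0.1 (post-14NOV sources),
// started via: spark-shell --jars spark-avro_2.10-0.1.jar
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._

val sqlContext = new SQLContext(sc)

// Read an Avro file into a SchemaRDD...
val records = sqlContext.avroFile("/opt/rpm1/enron/enron-tiny.avro")

// ...and write it back out via the implicit AvroSchemaRDD wrapper.
// On the pre-14NOV jar this line compiles but fails at runtime,
// since saveAsAvroFile is still the "TODO: Implement me" stub.
records.saveAsAvroFile("/tmp/enron-copy.avro")
```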
Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0
Marcelo, Thanks. The one in the CDH repo fixed it :) On Wed, Mar 4, 2015 at 4:37 PM, Marcelo Vanzin wrote: > Hi Kevin, > > If you're using CDH, I'd recommend using the CDH repo [1], and also > the CDH version when building your app. > > [1] > http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_vd_cdh5_maven_repo.html > > On Wed, Mar 4, 2015 at 4:34 PM, Kevin Peng wrote: > > Ted, > > > > I am currently using the CDH 5.3 distro, which has Spark 1.2.0, so I am not > too > > sure about the compatibility issues between 1.2.0 and 1.2.1; that is why > I > > would want to stick to 1.2.0. > > > > > > > > On Wed, Mar 4, 2015 at 4:25 PM, Ted Yu wrote: > >> > >> Kevin: > >> You can try with 1.2.1 > >> > >> See this thread: http://search-hadoop.com/m/JW1q5Vfe6X1 > >> > >> Cheers > >> > >> On Wed, Mar 4, 2015 at 4:18 PM, Kevin Peng wrote: > >>> > >>> Marcelo, > >>> > >>> Yes that is correct, I am going through a mirror, but 1.1.0 works > >>> properly, while 1.2.0 does not. I suspect there is a CRC mismatch in the 1.2.0 > pom > >>> file. > >>> > >>> On Wed, Mar 4, 2015 at 4:10 PM, Marcelo Vanzin > >>> wrote: > >>>> > >>>> Seems like someone set up "m2.mines.com" as a mirror in your pom file > >>>> or ~/.m2/settings.xml, and it doesn't mirror Spark 1.2 (or does but is > >>>> in a messed up state). > >>>> > >>>> On Wed, Mar 4, 2015 at 3:49 PM, kpeng1 wrote: > >>>> > Hi All, > >>>> > > >>>> > I am currently having a problem with the maven dependencies for > version > >>>> > 1.2.0 > >>>> > of spark-core and spark-hive. Here are my dependencies: > >>>> > > >>>> > <dependency> > >>>> > <groupId>org.apache.spark</groupId> > >>>> > <artifactId>spark-core_2.10</artifactId> > >>>> > <version>1.2.0</version> > >>>> > </dependency> > >>>> > <dependency> > >>>> > <groupId>org.apache.spark</groupId> > >>>> > <artifactId>spark-hive_2.10</artifactId> > >>>> > <version>1.2.0</version> > >>>> > </dependency> > >>>> > > >>>> > When the dependencies are set to version 1.1.0, I do not get any > >>>> > errors. 
> >>>> > Here are the errors I am getting from artifactory for version 1.2.0 > of > >>>> > spark-core: > >>>> > error=Could not transfer artifact > >>>> > org.apache.spark\:spark-core_2.10\:pom\:1.2.0 from/to repo > >>>> > (https\://m2.mines.com\:8443/artifactory/repo)\: Failed to transfer > >>>> > file\: > >>>> > > >>>> > https\://m2.mines.com > \:8443/artifactory/repo/org/apache/spark/spark-core_2.10/1.2.0/spark-core_2.10-1.2.0.pom. > >>>> > Return code is\: 409 , ReasonPhrase\:Conflict. > >>>> > > >>>> > The error is the same for spark-hive. > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > -- > >>>> > View this message in context: > >>>> > > http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-maven-dependencies-for-version-1-2-0-but-not-version-1-1-0-tp21919.html > >>>> > Sent from the Apache Spark User List mailing list archive at > >>>> > Nabble.com. > >>>> > > >>>> > > - > >>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > >>>> > For additional commands, e-mail: user-h...@spark.apache.org > >>>> > > >>>> > >>>> > >>>> > >>>> -- > >>>> Marcelo > >>> > >>> > >> > > > > > > -- > Marcelo >
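One way to act on Marcelo's diagnosis is sketched below: if `~/.m2/settings.xml` routes everything through the Artifactory mirror, the `mirrorOf` pattern can exclude central so Maven falls back to repo.maven.apache.org for artifacts the mirror cannot serve. The `id` and `url` here are assumptions inferred from the 409 error in the thread; adjust to the real settings file.

```xml
<!-- Sketch for ~/.m2/settings.xml; id and url are assumptions based on
     the 409 error above. The pattern "*,!central" mirrors everything
     EXCEPT central, so the Spark 1.2.0 poms the mirror refuses to serve
     can still be fetched directly from Maven Central. -->
<mirrors>
  <mirror>
    <id>repo</id>
    <url>https://m2.mines.com:8443/artifactory/repo</url>
    <mirrorOf>*,!central</mirrorOf>
  </mirror>
</mirrors>
```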
Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0
Ted, I am currently using the CDH 5.3 distro, which has Spark 1.2.0, so I am not too sure about the compatibility issues between 1.2.0 and 1.2.1; that is why I would want to stick to 1.2.0. On Wed, Mar 4, 2015 at 4:25 PM, Ted Yu wrote: > Kevin: > You can try with 1.2.1 > > See this thread: http://search-hadoop.com/m/JW1q5Vfe6X1 > > Cheers > > On Wed, Mar 4, 2015 at 4:18 PM, Kevin Peng wrote: > >> Marcelo, >> >> Yes that is correct, I am going through a mirror, but 1.1.0 works >> properly, while 1.2.0 does not. I suspect there is a CRC mismatch in the 1.2.0 pom >> file. >> >> On Wed, Mar 4, 2015 at 4:10 PM, Marcelo Vanzin >> wrote: >> >>> Seems like someone set up "m2.mines.com" as a mirror in your pom file >>> or ~/.m2/settings.xml, and it doesn't mirror Spark 1.2 (or does but is >>> in a messed up state). >>> >>> On Wed, Mar 4, 2015 at 3:49 PM, kpeng1 wrote: >>> > Hi All, >>> > >>> > I am currently having a problem with the maven dependencies for version >>> 1.2.0 >>> > of spark-core and spark-hive. Here are my dependencies: >>> > >>> > <dependency> >>> > <groupId>org.apache.spark</groupId> >>> > <artifactId>spark-core_2.10</artifactId> >>> > <version>1.2.0</version> >>> > </dependency> >>> > <dependency> >>> > <groupId>org.apache.spark</groupId> >>> > <artifactId>spark-hive_2.10</artifactId> >>> > <version>1.2.0</version> >>> > </dependency> >>> > >>> > When the dependencies are set to version 1.1.0, I do not get any >>> errors. >>> > Here are the errors I am getting from artifactory for version 1.2.0 of >>> > spark-core: >>> > error=Could not transfer artifact >>> > org.apache.spark\:spark-core_2.10\:pom\:1.2.0 from/to repo >>> > (https\://m2.mines.com\:8443/artifactory/repo)\: Failed to transfer >>> file\: >>> > https\://m2.mines.com >>> \:8443/artifactory/repo/org/apache/spark/spark-core_2.10/1.2.0/spark-core_2.10-1.2.0.pom. >>> > Return code is\: 409 , ReasonPhrase\:Conflict. >>> > >>> > The error is the same for spark-hive. 
>>> > >>> > >>> > >>> > >>> > -- >>> > View this message in context: >>> http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-maven-dependencies-for-version-1-2-0-but-not-version-1-1-0-tp21919.html >>> > Sent from the Apache Spark User List mailing list archive at >>> Nabble.com. >>> > >>> > - >>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>> > For additional commands, e-mail: user-h...@spark.apache.org >>> > >>> >>> >>> >>> -- >>> Marcelo >>> >> >> >
Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0
Ted, I have tried wiping out the ~/.m2/org.../spark directory multiple times. It doesn't seem to work. On Wed, Mar 4, 2015 at 4:20 PM, Ted Yu wrote: > kpeng1: > Try wiping out ~/.m2 and build again. > > Cheers > > On Wed, Mar 4, 2015 at 4:10 PM, Marcelo Vanzin > wrote: > >> Seems like someone set up "m2.mines.com" as a mirror in your pom file >> or ~/.m2/settings.xml, and it doesn't mirror Spark 1.2 (or does but is >> in a messed up state). >> >> On Wed, Mar 4, 2015 at 3:49 PM, kpeng1 wrote: >> > Hi All, >> > >> > I am currently having a problem with the maven dependencies for version >> 1.2.0 >> > of spark-core and spark-hive. Here are my dependencies: >> > >> > <dependency> >> > <groupId>org.apache.spark</groupId> >> > <artifactId>spark-core_2.10</artifactId> >> > <version>1.2.0</version> >> > </dependency> >> > <dependency> >> > <groupId>org.apache.spark</groupId> >> > <artifactId>spark-hive_2.10</artifactId> >> > <version>1.2.0</version> >> > </dependency> >> > >> > When the dependencies are set to version 1.1.0, I do not get any errors. >> > Here are the errors I am getting from artifactory for version 1.2.0 of >> > spark-core: >> > error=Could not transfer artifact >> > org.apache.spark\:spark-core_2.10\:pom\:1.2.0 from/to repo >> > (https\://m2.mines.com\:8443/artifactory/repo)\: Failed to transfer >> file\: >> > https\://m2.mines.com >> \:8443/artifactory/repo/org/apache/spark/spark-core_2.10/1.2.0/spark-core_2.10-1.2.0.pom. >> > Return code is\: 409 , ReasonPhrase\:Conflict. >> > >> > The error is the same for spark-hive. >> > >> > >> > >> > -- >> > View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-maven-dependencies-for-version-1-2-0-but-not-version-1-1-0-tp21919.html >> > Sent from the Apache Spark User List mailing list archive at Nabble.com. >> > >> > - >> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> > For additional commands, e-mail: user-h...@spark.apache.org >> > >> >> >> >> -- >> Marcelo >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> >
Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0
Marcelo, Yes that is correct, I am going through a mirror, but 1.1.0 works properly, while 1.2.0 does not. I suspect there is a CRC mismatch in the 1.2.0 pom file. On Wed, Mar 4, 2015 at 4:10 PM, Marcelo Vanzin wrote: > Seems like someone set up "m2.mines.com" as a mirror in your pom file > or ~/.m2/settings.xml, and it doesn't mirror Spark 1.2 (or does but is > in a messed up state). > > On Wed, Mar 4, 2015 at 3:49 PM, kpeng1 wrote: > > Hi All, > > > > I am currently having a problem with the maven dependencies for version > 1.2.0 > > of spark-core and spark-hive. Here are my dependencies: > > > > <dependency> > > <groupId>org.apache.spark</groupId> > > <artifactId>spark-core_2.10</artifactId> > > <version>1.2.0</version> > > </dependency> > > <dependency> > > <groupId>org.apache.spark</groupId> > > <artifactId>spark-hive_2.10</artifactId> > > <version>1.2.0</version> > > </dependency> > > > > When the dependencies are set to version 1.1.0, I do not get any errors. > > Here are the errors I am getting from artifactory for version 1.2.0 of > > spark-core: > > error=Could not transfer artifact > > org.apache.spark\:spark-core_2.10\:pom\:1.2.0 from/to repo > > (https\://m2.mines.com\:8443/artifactory/repo)\: Failed to transfer > file\: > > https\://m2.mines.com > \:8443/artifactory/repo/org/apache/spark/spark-core_2.10/1.2.0/spark-core_2.10-1.2.0.pom. > > Return code is\: 409 , ReasonPhrase\:Conflict. > > > > The error is the same for spark-hive. > > > > > > > > > > -- > > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-maven-dependencies-for-version-1-2-0-but-not-version-1-1-0-tp21919.html > > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > > > - > > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > > For additional commands, e-mail: user-h...@spark.apache.org > > > > > > -- > Marcelo >
Re: Invalid signature file digest for Manifest main attributes with spark job built using maven
Sean, Thanks. That worked. Kevin On Mon, Sep 15, 2014 at 3:37 PM, Sean Owen wrote: > This is more of a Java / Maven issue than Spark per se. I would use > the shade plugin to remove signature files in your final META-INF/ > dir. As Spark does, in its shade plugin configuration: > > <filter> > <artifact>*:*</artifact> > <excludes> > <exclude>org/datanucleus/**</exclude> > <exclude>META-INF/*.SF</exclude> > <exclude>META-INF/*.DSA</exclude> > <exclude>META-INF/*.RSA</exclude> > </excludes> > </filter> > > On Mon, Sep 15, 2014 at 11:33 PM, kpeng1 wrote: > > Hi All, > > > > I am trying to submit a spark job that I have built in maven using the > > following command: > > /usr/bin/spark-submit --deploy-mode client --class com.spark.TheMain > > --master local[1] /home/cloudera/myjar.jar 100 > > > > But I seem to be getting the following error: > > Exception in thread "main" java.lang.SecurityException: Invalid signature > > file digest for Manifest main attributes > > at > > sun.security.util.SignatureFileVerifier.processImpl(SignatureFileVerifier.java:286) > > at > > sun.security.util.SignatureFileVerifier.process(SignatureFileVerifier.java:239) > > at java.util.jar.JarVerifier.processEntry(JarVerifier.java:307) > > at java.util.jar.JarVerifier.update(JarVerifier.java:218) > > at java.util.jar.JarFile.initializeVerifier(JarFile.java:345) > > at java.util.jar.JarFile.getInputStream(JarFile.java:412) > > at sun.misc.URLClassPath$JarLoader$2.getInputStream(URLClassPath.java:775) > > at sun.misc.Resource.cachedInputStream(Resource.java:77) > > at sun.misc.Resource.getByteBuffer(Resource.java:160) > > at java.net.URLClassLoader.defineClass(URLClassLoader.java:436) > > at java.net.URLClassLoader.access$100(URLClassLoader.java:71) > > at java.net.URLClassLoader$1.run(URLClassLoader.java:361) > > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > > at java.security.AccessController.doPrivileged(Native Method) > > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > > at 
java.lang.Class.forName0(Native Method) > > at java.lang.Class.forName(Class.java:270) > > at > org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:289) > > at > org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) > > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > > > > > > Here is the pom file I am using to build the jar: > > http://maven.apache.org/POM/4.0.0"; > > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"; > > xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 > > http://maven.apache.org/maven-v4_0_0.xsd";> > > 4.0.0 > > com.spark > > myjar > > 0.0.1-SNAPSHOT > > ${project.artifactId} > > My wonderfull scala app > > 2010 > > > > > > My License > > http:// > > repo > > > > > > > > > > cdh5.1.0 > > 1.6 > > 1.6 > > UTF-8 > > 2.10 > > 2.10.4 > > > > > > > > > > scala-tools.org > > Scala-tools Maven2 Repository > > https://oss.sonatype.org/content/repositories/snapshots/ > > > > > > > maven-hadoop > > Hadoop Releases > > > > https://repository.cloudera.com/content/repositories/releases/ > > > > > > > cloudera-repos > > Cloudera Repos > > https://repository.cloudera.com/artifactory/cloudera-repos/ > > > > > > > > > > > scala-tools.org > > Scala-tools Maven2 Repository > > https://oss.sonatype.org/content/repositories/snapshots/ > > > > > > > > > > > > > org.scala-lang > > scala-library > > ${scala.version} > > > > > > org.apache.spark > > spark-core_2.10 > > 1.0.0-${cdh.version} > > > > > > org.apache.spark > > spark-tools_2.10 > > 1.0.0-${cdh.version} > > > > > > org.apache.spark > > spark-streaming-flume_2.10 > > 1.0.0-${cdh.version} > > > > > > org.apache.spark > > spark-streaming_2.10 > > 1.0.0-${cdh.version} > > > > > > org.apache.flume > > flume-ng-sdk > > 1.5.0-${cdh.version} > > > > > > > > io.netty > > netty > > > > > > > > > > org.apache.flume > > flume-ng-core > > 1.5.0-${cdh.version} > > > > > > > > io.netty > > netty > > > > > > > > > > org.apache.hbase > > hbase-client > > 0.98.1-${cdh.version} > > > 
> > > > > io.netty > > netty > > > > > > > > > > org.apache.hadoop > > hadoop-client > > 2.3.0-${cdh.version} > > > > > > > > > > > > junit > > junit > > 4
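Sean's filter, spelled out below as a complete shade-plugin entry for readers reconstructing it from the garbled quote above. The plugin version and phase are illustrative assumptions, not taken from the thread.

```xml
<!-- Sketch of Sean's suggestion as a full maven-shade-plugin entry.
     Stripping *.SF/*.DSA/*.RSA from META-INF/ removes stale signature
     files that otherwise trigger "Invalid signature file digest". -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
      </configuration>
    </execution>
  </executions>
</plugin>
```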
Re: Spark Streaming into HBase
Ted, Here is the full stack trace coming from spark-shell: 14/09/03 16:21:03 ERROR scheduler.JobScheduler: Error running job streaming job 1409786463000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770) at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Basically, what I am doing 
on the terminal where I run nc -lk is typing in comma-separated words and hitting enter, e.g. "bill,ted". On Wed, Sep 3, 2014 at 2:36 PM, Ted Yu wrote: > Adding back user@ > > I am not familiar with the NotSerializableException. Can you show the > full stack trace ? > > See SPARK-1297 for changes you need to make so that Spark works with > hbase 0.98 > > Cheers > > > On Wed, Sep 3, 2014 at 2:33 PM, Kevin Peng wrote: > >> Ted, >> >> The hbase-site.xml is in the classpath (had worse issues before... until >> I figured out that it wasn't in the path). >> >> I get the following error in the spark-shell: >> org.apache.spark.SparkException: Job aborted due to stage failure: Task >> not serializable: java.io.NotSerializableException: >> org.apache.spark.streaming.StreamingContext >> at org.apache.spark.scheduler.DAGScheduler.org >> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.sc >> ... >> >> I also double checked the hbase table, just in case, and nothing new is >> written in there. >> >> I am using hbase version 0.98.1-cdh5.1.0, the default one with the >> CDH 5.1.0 distro. >> >> Thank you for the help. >> >> >> On Wed, Sep 3, 2014 at 2:09 PM, Ted Yu wrote: >> >>> Is hbase-site.xml in the classpath ? >>> Do you observe any exception from the code below or in region server log >>> ? >>> >>> Which hbase release are you using ? >>> >>> >>> On Wed, Sep 3, 2014 at 2:05 PM, kpeng1 wrote: >>> >>>> I have been trying to understand how spark streaming and hbase connect, >>>> but >>>> have not been successful. What I am trying to do is, given a spark >>>> stream, >>>> process that stream and store the results in an hbase table. 
So far >>>> this is >>>> what I have: >>>> >>>> import org.apache.spark.SparkConf >>>> import org.apache.spark.streaming.{Seconds, StreamingContext} >>>> import org.apache.spark.streaming.StreamingContext._ >>>> import org.apache.spark.storage.StorageLevel >>>> import org.apache.hadoop.hbase.HBaseConfiguration >>>> import org.apache.hadoop.hbase.client.{HBaseAdmin,HTable,Put,Get} >>>> import org.apache.hadoop.hbase.util.Bytes >>>> >>>> def blah(row: Array[String]) { >>>> val hConf = new HBaseConfiguration() >>>> val hTable = new HTable(hConf, "table") >>>> val thePut = new Put(Bytes.toBytes(row(0))) >>>> thePut.add(Bytes.toBytes("cf"), Bytes.toBytes(row(0)), >>>> Bytes.toBytes(row(0))) >>>> hTable
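The NotSerializableException earlier in the thread comes from the task closure capturing non-serializable objects such as the StreamingContext. A hedged sketch of the usual fix, assuming a DStream named `lines` obtained from `ssc.socketTextStream` (consistent with the nc -lk test above): create the HTable inside `foreachPartition`, so that neither the StreamingContext nor the HBase connection ever needs to be serialized and shipped to executors.

```scala
// Sketch, not the poster's exact code; uses the HBase 0.98-era client API
// already imported in the snippet above.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

lines.map(_.split(",")).foreachRDD { rdd =>
  rdd.foreachPartition { rows =>
    // Created on the executor, once per partition: never serialized.
    val hConf = HBaseConfiguration.create()
    val hTable = new HTable(hConf, "table")
    rows.foreach { row =>
      val thePut = new Put(Bytes.toBytes(row(0)))
      thePut.add(Bytes.toBytes("cf"), Bytes.toBytes(row(0)), Bytes.toBytes(row(0)))
      hTable.put(thePut)
    }
    hTable.close()
  }
}
```

The key design point is that only the `rows => ...` function body runs on executors; everything it references must be serializable or constructed inside it.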