spark jdbc postgres query results don't match those of postgres query

2018-03-29 Thread Kevin Peng
I am running into a weird issue in Spark 1.6, which I was wondering if
anyone has encountered before. I am running a simple select query from
Spark using a JDBC connection to postgres:

val POSTGRES_DRIVER: String = "org.postgresql.Driver"
val srcSql = """select total_action_value, last_updated
  from fb_fact_no_seg_20180123
  where ad_id = '23842688418150437'"""
val r = sqlContext.read.format("jdbc").options(Map(
  "url" -> jdbcUrl,
  "dbtable" -> s"($srcSql) as src",
  "driver" -> POSTGRES_DRIVER)).load().coalesce(1).cache()
r.show

+------------------+--------------------+
|total_action_value|        last_updated|
+------------------+--------------------+
|         2743.3301|2018-02-06 00:18:...|
+------------------+--------------------+
From the above you can see that the result is 2743.3301, but when I run the
same query directly in postgres I get a slightly different answer:

select total_action_value, last_updated from fb_fact_no_seg_20180123
where ad_id = '23842688418150437';

 total_action_value |    last_updated
--------------------+---------------------
            2743.33 | 2018-02-06 00:18:08

As you can see from above, the value is 2743.33. So why is the result coming
from Spark off by .0001? Basically, where is the .0001 coming from, since in
postgres the decimal value is .33?

Thanks,

KP
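
One likely explanation (an assumption, since the column type is not shown in
the thread): if total_action_value is a single-precision real/float4 column,
then 2743.33 is not exactly representable in single precision; psql rounds the
display of float4 values to six significant digits by default, while Spark
prints the Java Float with more digits, so both clients are showing the same
stored value. Running SET extra_float_digits = 3 in psql should reveal extra
digits on the postgres side as well. A minimal sketch of a pushdown cast that
makes the two displays agree (numeric(18,2) assumes two-decimal values):

// Cast in the pushdown query so the JDBC reader sees a fixed-scale DECIMAL
// instead of a single-precision FLOAT.
val srcSql = """select cast(total_action_value as numeric(18,2)) as total_action_value,
                       last_updated
                from fb_fact_no_seg_20180123
                where ad_id = '23842688418150437'"""
val r = sqlContext.read.format("jdbc").options(Map(
  "url" -> jdbcUrl,
  "dbtable" -> s"($srcSql) as src",
  "driver" -> "org.postgresql.Driver")).load()
r.show()   // total_action_value now prints as 2743.33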


Re: Setting Optimal Number of Spark Executor Instances

2017-03-15 Thread Kevin Peng
Mohini,

We set that parameter before we went and played with the number of
executors and that didn't seem to help at all.

Thanks,

KP
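
For context, a minimal sketch of how the two knobs discussed in this thread
interact on Spark 1.6 (the values below are illustrative only, not a
recommendation from this thread):

// The number of tasks in each post-shuffle stage of a SQL join comes from
// spark.sql.shuffle.partitions; it does not change with the executor count.
sqlContext.setConf("spark.sql.shuffle.partitions", "1000")

// Concurrency, by contrast, is set at submit time, e.g.:
//   spark-submit --num-executors 50 --executor-cores 4 ...
// 50 executors x 4 cores would allow roughly 200 tasks to run at a time
// (which would line up with the "200 active tasks" observation below),
// while the total number of tasks stays the same.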

On Tue, Mar 14, 2017 at 3:37 PM, mohini kalamkar 
wrote:

> Hi,
>
> try using this parameter --conf spark.sql.shuffle.partitions=1000
>
> Thanks,
> Mohini
>
> On Tue, Mar 14, 2017 at 3:30 PM, kpeng1  wrote:
>
>> Hi All,
>>
>> I am currently on Spark 1.6 and I was doing a SQL join on two tables that
>> are over 100 million rows each, and I noticed that it was spawning 3+
>> tasks (this is the progress meter that we are seeing show up).  We tried
>> coalesce, repartition and shuffle partitions to drop the number of tasks
>> down because we were getting timeouts due to the number of tasks being
>> spawned, but those operations did not seem to reduce the number of tasks.
>> The solution we came up with was actually to set the number of executors
>> to 50 (--num-executors=50) and it looks like it spawned 200 active tasks,
>> but the total number of tasks remained the same.  Was wondering if anyone
>> knows what is going on?  Is there an optimal number of executors?  I was
>> under the impression that the default dynamic allocation would pick the
>> optimal number of executors for us and that this situation wouldn't
>> happen.  Is there something I am missing?
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Setting-Optimal-Number-of-Spark-Execut
>> or-Instances-tp28493.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Thanks & Regards,
> Mohini Kalamkar
> M: +1 310 567 9329 <(310)%20567-9329>
>


Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
Mike,

It looks like you are right.  The results seem to be fine.  It looks like I
messed up on the filtering clause.

sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad) WHERE (s.date >= '2016-01-03' OR s.date IS NULL) AND (d.date >=
'2016-01-03' OR d.date IS NULL)").count()
res2: Long = 53042

Davies, Cesar, Gourav,

Thanks for the support.

KP
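
A small sketch of another common way to apply a per-table filter without it
silently dropping the NULL-padded rows of an outer join (an alternative to
the corrected query above, not taken from this thread): filter each input
before the join rather than after it. The count is not guaranteed to match
the WHERE ... OR IS NULL form exactly, since a row whose only match is
filtered out re-appears here as an unmatched row, but the outer-join
semantics are preserved.

// Sketch only: filter first, then FULL OUTER JOIN, so no WHERE clause runs
// on top of the null-extended result.
val s = sqlContext.table("swig_pin_promo_lt").filter("date >= '2016-01-03'")
val d = sqlContext.table("dps_pin_promo_lt").filter("date >= '2016-01-03'")

val joined = s.join(d,
  s("date") === d("date") && s("account") === d("account") && s("ad") === d("ad"),
  "fullouter")
joined.count()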

On Tue, May 3, 2016 at 11:26 AM, Michael Segel 
wrote:

> Silly question?
>
> If you change the predicate to
> ( s.date >= ‘2016-01-03’ OR s.date IS NULL )
> AND
> (d.date >= ‘2016-01-03’ OR d.date IS NULL)
>
> What do you get?
>
> Sorry if the syntax isn’t 100% correct. The idea is to not drop null
> values from the query.
> I would imagine that this shouldn’t kill performance since its most likely
> a post join filter on the result set?
> (Or is that just a Hive thing?)
>
> -Mike
>
> > On May 3, 2016, at 12:42 PM, Davies Liu  wrote:
> >
> > Bingo, the two predicates s.date >= '2016-01-03' AND d.date >=
> > '2016-01-03' are the root cause:
> > they filter out all the nulls from the outer join, which gives the same
> > result as an inner join.
> >
> > In Spark 2.0, we actually turn these joins into an inner join.
> >
> > On Tue, May 3, 2016 at 9:50 AM, Cesar Flores  wrote:
> >> Hi
> >>
> >> Have you tried the joins without the where clause? When you use them
> you are
> >> filtering all the rows with null columns in those fields. In other
> words you
> >> are doing a inner join in all your queries.
> >>
> >> On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com>
> >> wrote:
> >>>
> >>> Hi Kevin,
> >>>
> >>> Having given it a first look I do think that you have hit something
> here
> >>> and this does not look quite fine. I have to work on the multiple AND
> >>> conditions in ON and see whether that is causing any issues.
> >>>
> >>> Regards,
> >>> Gourav Sengupta
> >>>
> >>> On Tue, May 3, 2016 at 8:28 AM, Kevin Peng  wrote:
> >>>>
> >>>> Davies,
> >>>>
> >>>> Here is the code that I am typing into the spark-shell along with the
> >>>> results (my question is at the bottom):
> >>>>
> >>>> val dps =
> >>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
> >>>> "true").load("file:///home/ltu/dps_csv/")
> >>>> val swig =
> >>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
> >>>> "true").load("file:///home/ltu/swig_csv/")
> >>>>
> >>>> dps.count
> >>>> res0: Long = 42694
> >>>>
> >>>> swig.count
> >>>> res1: Long = 42034
> >>>>
> >>>>
> >>>> dps.registerTempTable("dps_pin_promo_lt")
> >>>> swig.registerTempTable("swig_pin_promo_lt")
> >>>>
> >>>> sqlContext.sql("select * from dps_pin_promo_lt where date >
> >>>> '2016-01-03'").count
> >>>> res4: Long = 42666
> >>>>
> >>>> sqlContext.sql("select * from swig_pin_promo_lt where date >
> >>>> '2016-01-03'").count
> >>>> res5: Long = 34131
> >>>>
> >>>> sqlContext.sql("select distinct date, account, ad from
> dps_pin_promo_lt
> >>>> where date > '2016-01-03'").count
> >>>> res6: Long = 42533
> >>>>
> >>>> sqlContext.sql("select distinct date, account, ad from
> swig_pin_promo_lt
> >>>> where date > '2016-01-03'").count
> >>>> res7: Long = 34131
> >>>>
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> d.account
> >>>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> >>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
> s.ad =
> >>&

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
Davies,

What exactly do you mean in regard to Spark 2.0 turning these joins into an
inner join?  Does this mean that Spark SQL won't be supporting where
clauses in outer joins?


Cesar & Gourav,

When running the queries without the where clause it works as expected.  I
am pasting my results below:
val dps =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").load("file:///home/ltu/dps_csv/")
val swig =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").load("file:///home/ltu/swig_csv/")

dps.count
res0: Long = 42694

swig.count
res1: Long = 42034


dps.registerTempTable("dps_pin_promo_lt")
swig.registerTempTable("swig_pin_promo_lt")


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad)").count()
res5: Long = 60919


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad)").count()
res6: Long = 42034


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad)").count()
res7: Long = 42694

Thanks,

KP
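
On the Spark 2.0 remark (an interpretation, not a statement from the thread):
WHERE clauses on outer joins are still supported; the optimizer can simply
prove that a predicate such as d.date >= '2016-01-03' is never true when
d.date IS NULL, so a FULL OUTER JOIN under such a WHERE clause can be safely
rewritten as an INNER JOIN. A sketch of how the rewrite can be observed on
Spark 2.x (using the temp tables registered in this thread):

sqlContext.sql("""
  SELECT s.date, d.date
  FROM swig_pin_promo_lt s
  FULL OUTER JOIN dps_pin_promo_lt d
    ON s.date = d.date AND s.account = d.account AND s.ad = d.ad
  WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'
""").explain(true)   // the optimized plan shows an Inner join despite the FULL OUTER in the SQL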


On Tue, May 3, 2016 at 10:42 AM, Davies Liu  wrote:

> Bingo, the two predicates s.date >= '2016-01-03' AND d.date >=
> '2016-01-03' are the root cause:
> they filter out all the nulls from the outer join, which gives the same
> result as an inner join.
>
> In Spark 2.0, we actually turn these joins into an inner join.
>
> On Tue, May 3, 2016 at 9:50 AM, Cesar Flores  wrote:
> > Hi
> >
> > Have you tried the joins without the where clause? When you use them you
> are
> > filtering all the rows with null columns in those fields. In other words
> you
> > are doing a inner join in all your queries.
> >
> > On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com>
> > wrote:
> >>
> >> Hi Kevin,
> >>
> >> Having given it a first look I do think that you have hit something here
> >> and this does not look quite fine. I have to work on the multiple AND
> >> conditions in ON and see whether that is causing any issues.
> >>
> >> Regards,
> >> Gourav Sengupta
> >>
> >> On Tue, May 3, 2016 at 8:28 AM, Kevin Peng  wrote:
> >>>
> >>> Davies,
> >>>
> >>> Here is the code that I am typing into the spark-shell along with the
> >>> results (my question is at the bottom):
> >>>
> >>> val dps =
> >>> sqlContext.read.format("com.databricks.spark.csv").option("header",
> >>> "true").load("file:///home/ltu/dps_csv/")
> >>> val swig =
> >>> sqlContext.read.format("com.databricks.spark.csv").option("header",
> >>> "true").load("file:///home/ltu/swig_csv/")
> >>>
> >>> dps.count
> >>> res0: Long = 42694
> >>>
> >>> swig.count
> >>> res1: Long = 42034
> >>>
> >>>
> >>> dps.registerTempTable("dps_pin_promo_lt")
> >>> swig.registerTempTable("swig_pin_promo_lt")
> >>>
> >>> sqlContext.sql("select * from dps_pin_promo_lt where date >
> >>> '2016-01-03'").count
> >>> res4: Long = 42666
> >>>
> >>> sqlContext.sql("select * from swig_pin_promo_lt where date >
> >>> '2016-01-03'").count
> >>> res5: Long = 34131
> >>>
> >>> sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt
> >>> where date > '2016-01-03'").count
> >>> res6: Long = 42533
> >>>
> >>> sqlContext.sql("select distinct date, account, ad from
> swig_pin_promo_lt
> >>> where date > '2016-01-03'").count
> >>> res7: Long = 34131
> >>>
> >>>
> >>> sqlContext.sql("SELECT s.date AS edat

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
Davies,

Here is the code that I am typing into the spark-shell along with the
results (my question is at the bottom):

val dps =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").load("file:///home/ltu/dps_csv/")
val swig =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").load("file:///home/ltu/swig_csv/")

dps.count
res0: Long = 42694

swig.count
res1: Long = 42034


dps.registerTempTable("dps_pin_promo_lt")
swig.registerTempTable("swig_pin_promo_lt")

sqlContext.sql("select * from dps_pin_promo_lt where date >
'2016-01-03'").count
res4: Long = 42666

sqlContext.sql("select * from swig_pin_promo_lt where date >
'2016-01-03'").count
res5: Long = 34131

sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt
where date > '2016-01-03'").count
res6: Long = 42533

sqlContext.sql("select distinct date, account, ad from swig_pin_promo_lt
where date > '2016-01-03'").count
res7: Long = 34131


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
res9: Long = 23809


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
res10: Long = 23809


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
res11: Long = 23809


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
res12: Long = 23809



From my results above, we notice that the counts of distinct values based
on the join criteria and filter criteria for each individual table are
located at res6 and res7.  My question is: why is the outer join producing
fewer rows than the smallest table?  If there are no matches it should still
bring in that row as part of the outer join.  For the full and right outer
join I am expecting to see a minimum of res6 rows, but I get fewer; is there
something specific that I am missing here?  I am expecting that the full
outer join would give me the union of the two table sets, so I am expecting
at least 42533 rows, not 23809.


Gourav,

I just ran this result set on a new session with slightly newer data...
still seeing those results.



Thanks,

KP
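
A tiny reproducer (not from the thread) of why a WHERE clause that references
both sides collapses an outer join: any row without a match has NULL in the
other table's columns, the predicate on those columns evaluates to NULL, and
the row is filtered out, so full, left, right and inner joins all end up with
the same matched rows.

import sqlContext.implicits._

// "b" exists only on the left, "c" only on the right.
val left  = Seq(("a", "2016-01-04"), ("b", "2016-01-05")).toDF("ad", "date")
val right = Seq(("a", "2016-01-04"), ("c", "2016-01-06")).toDF("ad", "date")
left.registerTempTable("l")
right.registerTempTable("r")

sqlContext.sql(
  "SELECT * FROM l FULL OUTER JOIN r ON l.ad = r.ad").count()
// 3 rows: (a, a), (b, NULL), (NULL, c)

sqlContext.sql(
  """SELECT * FROM l FULL OUTER JOIN r ON l.ad = r.ad
     WHERE l.date >= '2016-01-03' AND r.date >= '2016-01-03'""").count()
// 1 row: only (a, a) survives, because the unmatched rows have a NULL date
// on one side and the predicate filters them out.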


On Mon, May 2, 2016 at 11:16 PM, Davies Liu  wrote:

> as @Gourav said, all the joins with different join types show the same
> results,
> which means that all the rows from the left could match at least one row
> from the right, and all the rows from the right could match at least one
> row from the left, even though the number of rows on the left does not
> equal that of the right.
>
> This is the correct result.
>
> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng  wrote:
> > Yong,
> >
> > Sorry, let explain my deduction; it is going be difficult to get a sample
> > data out since the dataset I am using is proprietary.
> >
> > From the above set queries (ones mentioned in above comments) both inner
> and
> > outer join are producing the same counts.  They are basically pulling out
> > selected columns from the query, but there is no roll up happening or
> > anything that would possible make it suspicious that there is any
> difference
> > besides the type of joins.  The tables are matched based on three keys
> that
> > are present in both tables (ad, account, and date), on top of this they
> are
> > filtered by date being above 2016-01-03.  Since all the joins are
> producing
> > the same counts, the natural suspicions is that the tables are identical,
> > but I when I run the following two queries:
> >
&g

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Yong,

Sorry, let me explain my deduction; it is going to be difficult to get sample
data out since the dataset I am using is proprietary.

From the above set of queries (the ones mentioned in the comments above), both
inner and outer joins are producing the same counts.  They are basically
pulling out selected columns from the query, but there is no roll-up happening
or anything that would possibly make it suspicious that there is any
difference besides the type of joins.  The tables are matched based on
three keys that are present in both tables (ad, account, and date); on top
of this they are filtered by date being above 2016-01-03.  Since all the
joins are producing the same counts, the natural suspicion is that the
tables are identical, but when I run the following two queries:

scala> sqlContext.sql("select * from swig_pin_promo_lt where date
>='2016-01-03'").count

res14: Long = 34158

scala> sqlContext.sql("select * from dps_pin_promo_lt where date
>='2016-01-03'").count

res15: Long = 42693


The above two queries filter the data based on the date used by the joins
(2016-01-03), and you can see that the row counts between the two tables are
different, which is why I am suspecting something is wrong with the outer
joins in Spark SQL: in this situation the right and full outer joins may
produce the same results, but they should not be equal to the left join and
definitely not to the inner join; unless I am missing something.


Side note: In my haste response above I posted the wrong counts for
dps.count, the real value is res16: Long = 42694


Thanks,


KP



On Mon, May 2, 2016 at 12:50 PM, Yong Zhang  wrote:

> We are still not sure what is the problem, if you cannot show us with some
> example data.
>
> For dps with 42632 rows, and swig with 42034 rows, if dps full outer join
> with swig on 3 columns; with additional filters, get the same resultSet row
> count as dps lefter outer join with swig on 3 columns, with additional
> filters, again get the the same resultSet row count as dps right outer join
> with swig on 3 columns, with same additional filters.
>
> Without knowing your data, I cannot see the reason that has to be a bug in
> the spark.
>
> Am I misunderstanding your bug?
>
> Yong
>
> --
> From: kpe...@gmail.com
> Date: Mon, 2 May 2016 12:11:18 -0700
> Subject: Re: Weird results with Spark SQL Outer joins
> To: gourav.sengu...@gmail.com
> CC: user@spark.apache.org
>
>
> Gourav,
>
> I wish that was case, but I have done a select count on each of the two
> tables individually and they return back different number of rows:
>
>
> dps.registerTempTable("dps_pin_promo_lt")
> swig.registerTempTable("swig_pin_promo_lt")
>
>
> dps.count()
> RESULT: 42632
>
>
> swig.count()
> RESULT: 42034
>
> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
> This shows that both the tables have matching records and no mismatches.
> Therefore obviously you have the same results irrespective of whether you
> use right or left join.
>
> I think that there is no problem here, unless I am missing something.
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 7:48 PM, kpeng1  wrote:
>
> Also, the results of the inner query produced the same results:
> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
> AS
> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
> =
> d.ad) WHERE s.date >= '2016-01-03'AND d.date >= '2016-01-03'").count()
> RESULT:23747
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>


Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav,

I wish that was case, but I have done a select count on each of the two
tables individually and they return back different number of rows:


dps.registerTempTable("dps_pin_promo_lt")

swig.registerTempTable("swig_pin_promo_lt")


dps.count()

RESULT: 42632


swig.count()

RESULT: 42034

On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta 
wrote:

> This shows that both the tables have matching records and no mismatches.
> Therefore obviously you have the same results irrespective of whether you
> use right or left join.
>
> I think that there is no problem here, unless I am missing something.
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 7:48 PM, kpeng1  wrote:
>
>> Also, the results of the inner query produced the same results:
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad =
>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>> '2016-01-03'").count()
>> RESULT:23747
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav,

Apologies.  I edited my post with this information:
Spark version: 1.6
Result from spark shell
OS: Linux version 2.6.32-431.20.3.el6.x86_64 (
mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat
4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014

Thanks,

KP

On Mon, May 2, 2016 at 11:05 AM, Gourav Sengupta 
wrote:

> Hi,
>
> As always, can you please write down details regarding your SPARK cluster
> - the version, OS, IDE used, etc?
>
> Regards,
> Gourav Sengupta
>
> On Mon, May 2, 2016 at 5:58 PM, kpeng1  wrote:
>
>> Hi All,
>>
>> I am running into a weird result with Spark SQL Outer joins.  The results
>> for all of them seem to be the same, which does not make sense due to the
>> data.  Here are the queries that I am running with the results:
>>
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad =
>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>> '2016-01-03'").count()
>> RESULT:23747
>>
>>
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad =
>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>> '2016-01-03'").count()
>> RESULT:23747
>>
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad =
>> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>> '2016-01-03'").count()
>> RESULT: 23747
>>
>> Was wondering if someone had encountered this issues before.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: println not appearing in libraries when running job using spark-submit --master local

2016-03-28 Thread Kevin Peng
Ted,

What triggerAndWait does is perform a REST call to a specified url and then
wait until a field in the JSON status message returned by that url says
complete.  The issue is that I put a println at the very top of the method
and it doesn't get printed out.  I know that the println isn't causing an
issue, because there is an exception that I throw further down the line and
that exception is what I am currently getting, but none of the printlns
along the way are showing:


  def triggerAndWait(url: String, pollInterval: Int = 1000 * 30,
      timeOut: Int = 1000 * 60 * 60, connectTimeout: Int = 3,
      readTimeout: Int = 3, requestMethod: String = "GET"): Boolean = {

    println("Entering triggerAndWait function - url: " + url +
      " pollInterval: " + pollInterval.toString() + " timeOut: " +
      timeOut.toString() + " connectionTimeout: " +
      connectTimeout.toString() + " readTimeout: " + readTimeout.toString() +
      " requestMethod: " + requestMethod)

    ...


Thanks,


KP

On Mon, Mar 28, 2016 at 1:52 PM, Ted Yu  wrote:

> Can you describe what gets triggered by triggerAndWait ?
>
> Cheers
>
> On Mon, Mar 28, 2016 at 1:39 PM, kpeng1  wrote:
>
>> Hi All,
>>
>> I am currently trying to debug a spark application written in scala.  I
>> have
>> a main method:
>>   def main(args: Array[String]) {
>> ...
>>  SocialUtil.triggerAndWait(triggerUrl)
>> ...
>>
>> The SocialUtil object is included in a seperate jar.  I launched the
>> spark-submit command using --jars passing the SocialUtil jar.  Inside the
>> triggerAndWait function I have a println statement that is the first thing
>> in the method, but it doesn't seem to be coming out.  All println that
>> happen inside the main function directly are appearing though.  I was
>> wondering if anyone knows what is going on in this situation and how I can
>> go about making the println in the SocialUtil object appear.
>>
>> Thanks,
>>
>> KP
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/println-not-appearing-in-libraries-when-running-job-using-spark-submit-master-local-tp26617.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Loading in json with spark sql

2015-03-13 Thread Kevin Peng
Yin,

Yup thanks.  I fixed that shortly after I posted and it worked.

Thanks,

Kevin
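
A minimal sketch of the fix (an assumption based on the suggestion quoted
below; the path is illustrative): with "providers" rewritten as a JSON array,
one object per line, jsonFile infers a proper schema instead of producing
_corrupt_record.

// e.g. {"user_id":"7070","providers":[{"id":"8753","name":"pjfig",
//        "behaviors":{"b1":"erwxt","b2":"yjooj"}}, ...]}
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val records = hiveCtx.jsonFile("/path/to/fixed.json")
records.printSchema()   // providers now shows up as an array of structs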

On Fri, Mar 13, 2015 at 8:28 PM, Yin Huai  wrote:

> Seems you want to use array for the field of "providers", like 
> "providers":[{"id":
> ...}, {"id":...}] instead of "providers":{{"id": ...}, {"id":...}}
>
> On Fri, Mar 13, 2015 at 7:45 PM, kpeng1  wrote:
>
>> Hi All,
>>
>> I was noodling around with loading in a json file into spark sql's hive
>> context and I noticed that I get the following message after loading in
>> the
>> json file:
>> PhysicalRDD [_corrupt_record#0], MappedRDD[5] at map at JsonRDD.scala:47
>>
>> I am using the HiveContext to load in the json file using the jsonFile
>> command.  I also have 1 json object per line on the file.  Here is a
>> sample
>> of the contents in the json file:
>>
>> {"user_id":"7070","providers":{{"id":"8753","name":"pjfig","behaviors":{"b1":"erwxt","b2":"yjooj"}},{"id":"8329","name":"dfvhh","behaviors":{"b1":"pjjdn","b2":"ooqsh"
>>
>> {"user_id":"1615","providers":{{"id":"6105","name":"rsfon","behaviors":{"b1":"whlje","b2":"lpjnq"}},{"id":"6828","name":"pnmrb","behaviors":{"b1":"fjpmz","b2":"dxqxk"
>>
>> {"user_id":"5210","providers":{{"id":"9360","name":"xdylm","behaviors":{"b1":"gcdze","b2":"cndcs"}},{"id":"4812","name":"gxboh","behaviors":{"b1":"qsxao","b2":"ixdzq"
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Loading-in-json-with-spark-sql-tp22044.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: spark sql writing in avro

2015-03-13 Thread Kevin Peng
Markus,

Thanks.  That makes sense.  I was able to get this to work with spark-shell
by passing in the jar built from git.  I did notice that I couldn't get
AvroSaver.save to work with SQLContext, but it works with HiveContext.  Not
sure if that is an issue, but for me, it is fine.

Once again, thanks for the help.

Kevin
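
For completeness, a minimal sketch of the working pattern described above
(AvroSaver with a HiveContext, using a spark-avro jar built from the GitHub
master of early 2015); the table name and output path are illustrative:

import com.databricks.spark.avro.AvroSaver
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
// AvroSaver.save(rdd, path) is the same call used later in this thread.
val resultRDD = hiveCtx.sql("SELECT * FROM some_table")
AvroSaver.save(resultRDD, "/tmp/result_avro")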

On Fri, Mar 13, 2015 at 1:57 PM, M. Dale  wrote:

> I probably did not do a good enough job explaining the problem. If you
> used Maven with the
> default Maven repository you have an old version of spark-avro that does
> not contain AvroSaver and does not have the saveAsAvro method implemented:
>
> Assuming you use the default Maven repo location:
> cd ~/.m2/repository/com/databricks/spark-avro_2.10/0.1
> jar tvf spark-avro_2.10-0.1.jar | grep AvroSaver
>
> Comes up empty. The jar file does not contain this class because
> AvroSaver.scala wasn't added until Jan 21. The jar file is from 14 November.
>
> So:
> git clone g...@github.com:databricks/spark-avro.git
> cd spark-avro
> sbt publish-m2
>
> This publishes the latest master code (this includes AvroSaver etc.) to
> your local Maven repo and Maven will pick up the latest version of
> spark-avro (for this machine).
>
> Now you should be able to compile and run.
>
> HTH,
> Markus
>
>
> On 03/12/2015 11:55 PM, Kevin Peng wrote:
>
> Dale,
>
>  I basically have the same maven dependency above, but my code will not
> compile due to not being able to reference AvroSaver, though the
> saveAsAvro reference compiles fine, which is weird.  Even though saveAsAvro
> compiles for me, it errors out when running the spark job due to it not
> being implemented (the job quits and says non-implemented method or
> something along those lines).
>
>  I will try going the spark shell and passing in the jar built from
> github since I haven't tried that quite yet.
>
> On Thu, Mar 12, 2015 at 6:44 PM, M. Dale  wrote:
>
>> Short answer: if you downloaded spark-avro from the repo.maven.apache.org
>> repo you might be using an old version (pre-November 14, 2014) -
>> see timestamps at
>> http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/
>> Lots of changes at https://github.com/databricks/spark-avro since then.
>>
>> Databricks, thank you for sharing the Avro code!!!
>>
>> Could you please push out the latest version or update the version
>> number and republish to repo.maven.apache.org (I have no idea how jars
>> get
>> there). Or is there a different repository that users should point to for
>> this artifact?
>>
>> Workaround: Download from https://github.com/databricks/spark-avro and
>> build
>> with latest functionality (still version 0.1) and add to your local Maven
>> or Ivy repo.
>>
>> Long version:
>> I used a default Maven build and declared my dependency on:
>>
>> <dependency>
>>   <groupId>com.databricks</groupId>
>>   <artifactId>spark-avro_2.10</artifactId>
>>   <version>0.1</version>
>> </dependency>
>>
>> Maven downloaded the 0.1 version from
>> http://repo.maven.apache.org/maven2/com/databricks/spark-avro_2.10/0.1/
>> and included it in my app code jar.
>>
>> From spark-shell:
>>
>> import com.databricks.spark.avro._
>> import org.apache.spark.sql.SQLContext
>> val sqlContext = new SQLContext(sc)
>>
>> # This schema includes LONG for time in millis (
>> https://github.com/medale/spark-mail/blob/master/mailrecord/src/main/avro/com/uebercomputing/mailrecord/MailRecord.avdl
>> )
>> val recordsSchema = sqlContext.avroFile("/opt/rpm1/enron/enron-tiny.avro")
>> java.lang.RuntimeException: Unsupported type LONG
>>
>> However, checking out the spark-avro code from its GitHub repo and adding
>> a test case against the MailRecord avro everything ran fine.
>>
>> So I built the databricks spark-avro locally on my box and then put it in
>> my
>> local Maven repo - everything worked from spark-shell when adding that jar
>> as dependency.
>>
>> Hope this helps for the "save" case as well. On the pre-14NOV version,
>> avro.scala
>> says:
>>  // TODO: Implement me.
>>   implicit class AvroSchemaRDD(schemaRDD: SchemaRDD) {
>> def saveAsAvroFile(path: String): Unit = ???
>>   }
>>
>> Markus
>>
>> On 03/12/2015 07:05 PM, kpeng1 wrote:
>>
>>> Hi All,
>>>
>>> I am current trying to write out a scheme RDD to avro.  I noticed that
>>> there
>>> is a databricks spark-avro library and I have included that in my
>>> dependencies, but it looks like I am not able to access the AvroSaver
>>> 

Re: spark sql writing in avro

2015-03-12 Thread Kevin Peng
Dale,

I basically have the same maven dependency above, but my code will not
compile due to not being able to reference AvroSaver, though the
saveAsAvro reference compiles fine, which is weird.  Even though saveAsAvro
compiles for me, it errors out when running the spark job due to it not
being implemented (the job quits and says non-implemented method or
something along those lines).

I will try going the spark shell and passing in the jar built from github
since I haven't tried that quite yet.

On Thu, Mar 12, 2015 at 6:44 PM, M. Dale  wrote:

> Short answer: if you downloaded spark-avro from the repo.maven.apache.org
> repo you might be using an old version (pre-November 14, 2014) -
> see timestamps at http://repo.maven.apache.org/
> maven2/com/databricks/spark-avro_2.10/0.1/
> Lots of changes at https://github.com/databricks/spark-avro since then.
>
> Databricks, thank you for sharing the Avro code!!!
>
> Could you please push out the latest version or update the version
> number and republish to repo.maven.apache.org (I have no idea how jars get
> there). Or is there a different repository that users should point to for
> this artifact?
>
> Workaround: Download from https://github.com/databricks/spark-avro and
> build
> with latest functionality (still version 0.1) and add to your local Maven
> or Ivy repo.
>
> Long version:
> I used a default Maven build and declared my dependency on:
>
> <dependency>
>   <groupId>com.databricks</groupId>
>   <artifactId>spark-avro_2.10</artifactId>
>   <version>0.1</version>
> </dependency>
>
> Maven downloaded the 0.1 version from http://repo.maven.apache.org/
> maven2/com/databricks/spark-avro_2.10/0.1/ and included it in my app code
> jar.
>
> From spark-shell:
>
> import com.databricks.spark.avro._
> import org.apache.spark.sql.SQLContext
> val sqlContext = new SQLContext(sc)
>
> # This schema includes LONG for time in millis (https://github.com/medale/
> spark-mail/blob/master/mailrecord/src/main/avro/com/
> uebercomputing/mailrecord/MailRecord.avdl)
> val recordsSchema = sqlContext.avroFile("/opt/rpm1/enron/enron-tiny.avro")
> java.lang.RuntimeException: Unsupported type LONG
>
> However, checking out the spark-avro code from its GitHub repo and adding
> a test case against the MailRecord avro everything ran fine.
>
> So I built the databricks spark-avro locally on my box and then put it in
> my
> local Maven repo - everything worked from spark-shell when adding that jar
> as dependency.
>
> Hope this helps for the "save" case as well. On the pre-14NOV version,
> avro.scala
> says:
>  // TODO: Implement me.
>   implicit class AvroSchemaRDD(schemaRDD: SchemaRDD) {
> def saveAsAvroFile(path: String): Unit = ???
>   }
>
> Markus
>
> On 03/12/2015 07:05 PM, kpeng1 wrote:
>
>> Hi All,
>>
>> I am currently trying to write out a schema RDD to avro.  I noticed that
>> there is a databricks spark-avro library and I have included that in my
>> dependencies, but it looks like I am not able to access the AvroSaver
>> object.  On compilation of the job I get this:
>> error: not found: value AvroSaver
>> [ERROR] AvroSaver.save(resultRDD, args(4))
>>
>> I also tried calling saveAsAvro on the resultRDD(the actual rdd with the
>> results) and that passes compilation, but when I run the code I get an
>> error
>> that says the saveAsAvro is not implemented.  I am using version 0.1 of
>> spark-avro_2.10
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/spark-sql-writing-in-avro-tp22021.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
Marcelo,

Thanks.  The one in the CDH repo fixed it :)

On Wed, Mar 4, 2015 at 4:37 PM, Marcelo Vanzin  wrote:

> Hi Kevin,
>
> If you're using CDH, I'd recommend using the CDH repo [1], and also
> the CDH version when building your app.
>
> [1]
> http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_vd_cdh5_maven_repo.html
>
> On Wed, Mar 4, 2015 at 4:34 PM, Kevin Peng  wrote:
> > Ted,
> >
> > I am currently using CDH 5.3 distro, which has Spark 1.2.0, so I am not
> too
> > sure about the compatibility issues between 1.2.0 and 1.2.1, that is why
> I
> > would want to stick to 1.2.0.
> >
> >
> >
> > On Wed, Mar 4, 2015 at 4:25 PM, Ted Yu  wrote:
> >>
> >> Kevin:
> >> You can try with 1.2.1
> >>
> >> See this thread: http://search-hadoop.com/m/JW1q5Vfe6X1
> >>
> >> Cheers
> >>
> >> On Wed, Mar 4, 2015 at 4:18 PM, Kevin Peng  wrote:
> >>>
> >>> Marcelo,
> >>>
> >>> Yes that is correct, I am going through a mirror, but 1.1.0 works
> >>> properly, while 1.2.0 does not.  I suspect there is crc in the 1.2.0
> pom
> >>> file.
> >>>
> >>> On Wed, Mar 4, 2015 at 4:10 PM, Marcelo Vanzin 
> >>> wrote:
> >>>>
> >>>> Seems like someone set up "m2.mines.com" as a mirror in your pom file
> >>>> or ~/.m2/settings.xml, and it doesn't mirror Spark 1.2 (or does but is
> >>>> in a messed up state).
> >>>>
> >>>> On Wed, Mar 4, 2015 at 3:49 PM, kpeng1  wrote:
> >>>> > Hi All,
> >>>> >
> >>>> > I am currently having problem with the maven dependencies for
> version
> >>>> > 1.2.0
> >>>> > of spark-core and spark-hive.  Here are my dependencies:
> >>>> > <dependency>
> >>>> >   <groupId>org.apache.spark</groupId>
> >>>> >   <artifactId>spark-core_2.10</artifactId>
> >>>> >   <version>1.2.0</version>
> >>>> > </dependency>
> >>>> > <dependency>
> >>>> >   <groupId>org.apache.spark</groupId>
> >>>> >   <artifactId>spark-hive_2.10</artifactId>
> >>>> >   <version>1.2.0</version>
> >>>> > </dependency>
> >>>> >
> >>>> > When the dependencies are set to version 1.1.0, I do not get any
> >>>> > errors.
> >>>> > Here are the errors I am getting from artifactory for version 1.2.0
> of
> >>>> > spark-core:
> >>>> > error=Could not transfer artifact
> >>>> > org.apache.spark\:spark-core_2.10\:pom\:1.2.0 from/to repo
> >>>> > (https\://m2.mines.com\:8443/artifactory/repo)\: Failed to transfer
> >>>> > file\:
> >>>> >
> >>>> > https\://m2.mines.com
> \:8443/artifactory/repo/org/apache/spark/spark-core_2.10/1.2.0/spark-core_2.10-1.2.0.pom.
> >>>> > Return code is\: 409 , ReasonPhrase\:Conflict.
> >>>> >
> >>>> > The error is the same for spark-hive.
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> > --
> >>>> > View this message in context:
> >>>> >
> http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-maven-dependencies-for-version-1-2-0-but-not-version-1-1-0-tp21919.html
> >>>> > Sent from the Apache Spark User List mailing list archive at
> >>>> > Nabble.com.
> >>>> >
> >>>> >
> -
> >>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >>>> > For additional commands, e-mail: user-h...@spark.apache.org
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Marcelo
> >>>
> >>>
> >>
> >
>
>
>
> --
> Marcelo
>


Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
Ted,

I am currently using CDH 5.3 distro, which has Spark 1.2.0, so I am not too
sure about the compatibility issues between 1.2.0 and 1.2.1, that is why I
would want to stick to 1.2.0.



On Wed, Mar 4, 2015 at 4:25 PM, Ted Yu  wrote:

> Kevin:
> You can try with 1.2.1
>
> See this thread: http://search-hadoop.com/m/JW1q5Vfe6X1
>
> Cheers
>
> On Wed, Mar 4, 2015 at 4:18 PM, Kevin Peng  wrote:
>
>> Marcelo,
>>
>> Yes that is correct, I am going through a mirror, but 1.1.0 works
>> properly, while 1.2.0 does not.  I suspect there is crc in the 1.2.0 pom
>> file.
>>
>> On Wed, Mar 4, 2015 at 4:10 PM, Marcelo Vanzin 
>> wrote:
>>
>>> Seems like someone set up "m2.mines.com" as a mirror in your pom file
>>> or ~/.m2/settings.xml, and it doesn't mirror Spark 1.2 (or does but is
>>> in a messed up state).
>>>
>>> On Wed, Mar 4, 2015 at 3:49 PM, kpeng1  wrote:
>>> > Hi All,
>>> >
>>> > I am currently having problem with the maven dependencies for version
>>> 1.2.0
>>> > of spark-core and spark-hive.  Here are my dependencies:
>>> > <dependency>
>>> >   <groupId>org.apache.spark</groupId>
>>> >   <artifactId>spark-core_2.10</artifactId>
>>> >   <version>1.2.0</version>
>>> > </dependency>
>>> > <dependency>
>>> >   <groupId>org.apache.spark</groupId>
>>> >   <artifactId>spark-hive_2.10</artifactId>
>>> >   <version>1.2.0</version>
>>> > </dependency>
>>> >
>>> > When the dependencies are set to version 1.1.0, I do not get any
>>> errors.
>>> > Here are the errors I am getting from artifactory for version 1.2.0 of
>>> > spark-core:
>>> > error=Could not transfer artifact
>>> > org.apache.spark\:spark-core_2.10\:pom\:1.2.0 from/to repo
>>> > (https\://m2.mines.com\:8443/artifactory/repo)\: Failed to transfer
>>> file\:
>>> > https\://m2.mines.com
>>> \:8443/artifactory/repo/org/apache/spark/spark-core_2.10/1.2.0/spark-core_2.10-1.2.0.pom.
>>> > Return code is\: 409 , ReasonPhrase\:Conflict.
>>> >
>>> > The error is the same for spark-hive.
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-maven-dependencies-for-version-1-2-0-but-not-version-1-1-0-tp21919.html
>>> > Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>> >
>>> > -
>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>> >
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>
>>
>


Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
Ted,

I have tried wiping out ~/.m2/org.../spark directory multiple times.  It
doesn't seem to work.



On Wed, Mar 4, 2015 at 4:20 PM, Ted Yu  wrote:

> kpeng1:
> Try wiping out ~/.m2 and build again.
>
> Cheers
>
> On Wed, Mar 4, 2015 at 4:10 PM, Marcelo Vanzin 
> wrote:
>
>> Seems like someone set up "m2.mines.com" as a mirror in your pom file
>> or ~/.m2/settings.xml, and it doesn't mirror Spark 1.2 (or does but is
>> in a messed up state).
>>
>> On Wed, Mar 4, 2015 at 3:49 PM, kpeng1  wrote:
>> > Hi All,
>> >
>> > I am currently having problem with the maven dependencies for version
>> 1.2.0
>> > of spark-core and spark-hive.  Here are my dependencies:
>> > <dependency>
>> >   <groupId>org.apache.spark</groupId>
>> >   <artifactId>spark-core_2.10</artifactId>
>> >   <version>1.2.0</version>
>> > </dependency>
>> > <dependency>
>> >   <groupId>org.apache.spark</groupId>
>> >   <artifactId>spark-hive_2.10</artifactId>
>> >   <version>1.2.0</version>
>> > </dependency>
>> >
>> > When the dependencies are set to version 1.1.0, I do not get any errors.
>> > Here are the errors I am getting from artifactory for version 1.2.0 of
>> > spark-core:
>> > error=Could not transfer artifact
>> > org.apache.spark\:spark-core_2.10\:pom\:1.2.0 from/to repo
>> > (https\://m2.mines.com\:8443/artifactory/repo)\: Failed to transfer
>> file\:
>> > https\://m2.mines.com
>> \:8443/artifactory/repo/org/apache/spark/spark-core_2.10/1.2.0/spark-core_2.10-1.2.0.pom.
>> > Return code is\: 409 , ReasonPhrase\:Conflict.
>> >
>> > The error is the same for spark-hive.
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-maven-dependencies-for-version-1-2-0-but-not-version-1-1-0-tp21919.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
Marcelo,

Yes, that is correct, I am going through a mirror, but 1.1.0 works properly,
while 1.2.0 does not.  I suspect there is a crc issue in the 1.2.0 pom file.

On Wed, Mar 4, 2015 at 4:10 PM, Marcelo Vanzin  wrote:

> Seems like someone set up "m2.mines.com" as a mirror in your pom file
> or ~/.m2/settings.xml, and it doesn't mirror Spark 1.2 (or does but is
> in a messed up state).
>
> On Wed, Mar 4, 2015 at 3:49 PM, kpeng1  wrote:
> > Hi All,
> >
> > I am currently having problem with the maven dependencies for version
> 1.2.0
> > of spark-core and spark-hive.  Here are my dependencies:
> > <dependency>
> >   <groupId>org.apache.spark</groupId>
> >   <artifactId>spark-core_2.10</artifactId>
> >   <version>1.2.0</version>
> > </dependency>
> > <dependency>
> >   <groupId>org.apache.spark</groupId>
> >   <artifactId>spark-hive_2.10</artifactId>
> >   <version>1.2.0</version>
> > </dependency>
> >
> > When the dependencies are set to version 1.1.0, I do not get any errors.
> > Here are the errors I am getting from artifactory for version 1.2.0 of
> > spark-core:
> > error=Could not transfer artifact
> > org.apache.spark\:spark-core_2.10\:pom\:1.2.0 from/to repo
> > (https\://m2.mines.com\:8443/artifactory/repo)\: Failed to transfer
> file\:
> > https\://m2.mines.com
> \:8443/artifactory/repo/org/apache/spark/spark-core_2.10/1.2.0/spark-core_2.10-1.2.0.pom.
> > Return code is\: 409 , ReasonPhrase\:Conflict.
> >
> > The error is the same for spark-hive.
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-maven-dependencies-for-version-1-2-0-but-not-version-1-1-0-tp21919.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>
>
> --
> Marcelo
>


Re: Invalid signature file digest for Manifest main attributes with spark job built using maven

2014-09-15 Thread Kevin Peng
Sean,

Thanks.  That worked.

Kevin

On Mon, Sep 15, 2014 at 3:37 PM, Sean Owen  wrote:

> This is more of a Java / Maven issue than Spark per se. I would use
> the shade plugin to remove signature files in your final META-INF/
> dir. As Spark does in its own pom:
>
> <filters>
>   <filter>
>     <artifact>*:*</artifact>
>     <excludes>
>       <exclude>org/datanucleus/**</exclude>
>       <exclude>META-INF/*.SF</exclude>
>       <exclude>META-INF/*.DSA</exclude>
>       <exclude>META-INF/*.RSA</exclude>
>     </excludes>
>   </filter>
> </filters>
>
> On Mon, Sep 15, 2014 at 11:33 PM, kpeng1  wrote:
> > Hi All,
> >
> > I am trying to submit a spark job that I have built in maven using the
> > following command:
> > /usr/bin/spark-submit --deploy-mode client --class com.spark.TheMain
> > --master local[1] /home/cloudera/myjar.jar 100
> >
> > But I seem to be getting the following error:
> > Exception in thread "main" java.lang.SecurityException: Invalid signature
> > file digest for Manifest main attributes
> > at
> >
> sun.security.util.SignatureFileVerifier.processImpl(SignatureFileVerifier.java:286)
> > at
> >
> sun.security.util.SignatureFileVerifier.process(SignatureFileVerifier.java:239)
> > at java.util.jar.JarVerifier.processEntry(JarVerifier.java:307)
> > at java.util.jar.JarVerifier.update(JarVerifier.java:218)
> > at java.util.jar.JarFile.initializeVerifier(JarFile.java:345)
> > at java.util.jar.JarFile.getInputStream(JarFile.java:412)
> > at
> sun.misc.URLClassPath$JarLoader$2.getInputStream(URLClassPath.java:775)
> > at sun.misc.Resource.cachedInputStream(Resource.java:77)
> > at sun.misc.Resource.getByteBuffer(Resource.java:160)
> > at java.net.URLClassLoader.defineClass(URLClassLoader.java:436)
> > at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
> > at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> > at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> > at java.lang.Class.forName0(Native Method)
> > at java.lang.Class.forName(Class.java:270)
> > at
> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:289)
> > at
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
> > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> >
> >
> > Here is the pom file I am using to build the jar:
> > http://maven.apache.org/POM/4.0.0";
> > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
> > xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> > http://maven.apache.org/maven-v4_0_0.xsd";>
> >   4.0.0
> >   com.spark
> >   myjar
> >   0.0.1-SNAPSHOT
> >   ${project.artifactId}
> >   My wonderfull scala app
> >   2010
> >   
> > 
> >   My License
> >   http://
> >   repo
> > 
> >   
> >
> >   
> > cdh5.1.0
> > 1.6
> > 1.6
> > UTF-8
> > 2.10
> > 2.10.4
> >   
> >
> >   
> > 
> >   scala-tools.org
> >   Scala-tools Maven2 Repository
> >   https://oss.sonatype.org/content/repositories/snapshots/
> 
> > 
> > 
> >   maven-hadoop
> >   Hadoop Releases
> >
> > https://repository.cloudera.com/content/repositories/releases/
> 
> > 
> > 
> >   cloudera-repos
> >   Cloudera Repos
> >   https://repository.cloudera.com/artifactory/cloudera-repos/
> 
> > 
> >   
> >   
> > 
> >   scala-tools.org
> >   Scala-tools Maven2 Repository
> >   https://oss.sonatype.org/content/repositories/snapshots/
> 
> > 
> >   
> >
> >   
> > 
> >   org.scala-lang
> >   scala-library
> >   ${scala.version}
> > 
> > 
> >   org.apache.spark
> >   spark-core_2.10
> >   1.0.0-${cdh.version}
> > 
> > 
> >   org.apache.spark
> >   spark-tools_2.10
> >   1.0.0-${cdh.version}
> > 
> > 
> >   org.apache.spark
> >   spark-streaming-flume_2.10
> >   1.0.0-${cdh.version}
> > 
> > 
> >   org.apache.spark
> >   spark-streaming_2.10
> >   1.0.0-${cdh.version}
> > 
> > 
> >   org.apache.flume
> >   flume-ng-sdk
> >   1.5.0-${cdh.version}
> >
> >   
> > 
> >   io.netty
> >   netty
> > 
> >   
> > 
> > 
> >   org.apache.flume
> >   flume-ng-core
> >   1.5.0-${cdh.version}
> >
> >   
> > 
> >   io.netty
> >   netty
> > 
> >   
> > 
> > 
> >   org.apache.hbase
> >   hbase-client
> >   0.98.1-${cdh.version}
> >
> >   
> > 
> >   io.netty
> >   netty
> > 
> >   
> > 
> > 
> >   org.apache.hadoop
> >   hadoop-client
> >   2.3.0-${cdh.version}
> >
> > 
> >
> >
> > 
> >   junit
> >   junit
> >   4

Re: Spark Streaming into HBase

2014-09-03 Thread Kevin Peng
Ted,

Here is the full stack trace coming from spark-shell:

14/09/03 16:21:03 ERROR scheduler.JobScheduler: Error running job streaming
job 1409786463000 ms.0

org.apache.spark.SparkException: Job aborted due to stage failure: Task not
serializable: java.io.NotSerializableException:
org.apache.spark.streaming.StreamingContext

at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)

at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)

at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)

at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)

at
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)

at akka.actor.ActorCell.invoke(ActorCell.scala:456)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)

at akka.dispatch.Mailbox.run(Mailbox.scala:219)

at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)

at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Basically, what I am doing on the terminal where I run nc -lk is typing in
words separated by commas and hitting enter, e.g. "bill,ted".
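
A minimal sketch (not the resolution from this thread; the host, port, table
and column family are taken from the snippets below or are illustrative) of
the commonly suggested pattern for writing a DStream to HBase: do the writes
inside foreachRDD/foreachPartition so the HBase connection is created on the
executors and nothing non-serializable, such as the StreamingContext, is
captured in a closure.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

val lines = ssc.socketTextStream("localhost", 9999)   // ssc is the job's StreamingContext

lines.map(_.split(",")).foreachRDD { rdd =>
  rdd.foreachPartition { rows =>
    // Created per partition, on the executor, so nothing here has to be
    // serialized from the driver.
    val hConf = HBaseConfiguration.create()
    val hTable = new HTable(hConf, "table")
    rows.foreach { row =>
      val thePut = new Put(Bytes.toBytes(row(0)))
      thePut.add(Bytes.toBytes("cf"), Bytes.toBytes(row(0)), Bytes.toBytes(row(0)))
      hTable.put(thePut)
    }
    hTable.close()
  }
}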


On Wed, Sep 3, 2014 at 2:36 PM, Ted Yu  wrote:

> Adding back user@
>
> I am not familiar with the NotSerializableException. Can you show the
> full stack trace ?
>
> See SPARK-1297 for changes you need to make so that Spark works with
> hbase 0.98
>
> Cheers
>
>
> On Wed, Sep 3, 2014 at 2:33 PM, Kevin Peng  wrote:
>
>> Ted,
>>
>> The hbase-site.xml is in the classpath (had worse issues before... until
>> I figured that it wasn't in the path).
>>
>> I get the following error in the spark-shell:
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> not serializable: java.io.NotSerializableException:
>> org.apache.spark.streaming.StreamingContext
>> at org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.sc
>> ...
>>
>> I also double checked the hbase table, just in case, and nothing new is
>> written in there.
>>
>> I am using hbase version: 0.98.1-cdh5.1.0 the default one with the
>> CDH5.1.0 distro.
>>
>> Thank you for the help.
>>
>>
>> On Wed, Sep 3, 2014 at 2:09 PM, Ted Yu  wrote:
>>
>>> Is hbase-site.xml in the classpath ?
>>> Do you observe any exception from the code below or in region server log
>>> ?
>>>
>>> Which hbase release are you using ?
>>>
>>>
>>> On Wed, Sep 3, 2014 at 2:05 PM, kpeng1  wrote:
>>>
>>>> I have been trying to understand how spark streaming and hbase connect,
>>>> but
>>>> have not been successful. What I am trying to do is given a spark
>>>> stream,
>>>> process that stream and store the results in an hbase table. So far
>>>> this is
>>>> what I have:
>>>>
>>>> import org.apache.spark.SparkConf
>>>> import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>> import org.apache.spark.streaming.StreamingContext._
>>>> import org.apache.spark.storage.StorageLevel
>>>> import org.apache.hadoop.hbase.HBaseConfiguration
>>>> import org.apache.hadoop.hbase.client.{HBaseAdmin,HTable,Put,Get}
>>>> import org.apache.hadoop.hbase.util.Bytes
>>>>
>>>> def blah(row: Array[String]) {
>>>>   val hConf = new HBaseConfiguration()
>>>>   val hTable = new HTable(hConf, "table")
>>>>   val thePut = new Put(Bytes.toBytes(row(0)))
>>>>   thePut.add(Bytes.toBytes("cf"), Bytes.toBytes(row(0)),
>>>> Bytes.toBytes(row(0)))
>>>>   hTable