A signature in Logging.class refers to type Logger in package org.slf4j which is not available.

2016-05-02 Thread Kapil Raaj
Hi folks,

I am suddenly seeing:

Error:scalac: bad symbolic reference. A signature in Logging.class refers
to type Logger
in package org.slf4j which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
Logging.class.

How can I investigate and fix this? I am using IntelliJ IDEA.
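
My guess is that the slf4j API jar is missing from the module's compile
classpath. Would something like the following be the right fix, assuming an
sbt-based build (the version number below is only a placeholder and should
match what the Spark build was compiled against)?

// build.sbt -- placeholder version, align it with your Spark distribution
libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.10"

After changing the build file I would re-import the project in IntelliJ IDEA
so the module classpath picks up the dependency.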

-- 
-Kapil Rajak 


Re: Spark groupby and agg inconsistent and missing data

2015-12-10 Thread Kapil Raaj
Hi Folks,

I am also getting a similar issue:

(df.groupBy("email").agg(last("user_id") as "user_id").select("user_id").count,
 df.groupBy("email").agg(last("user_id") as "user_id").select("user_id").distinct.count)

When run on one computer it gives: (15123144,15123144)

When run on a cluster it gives: (15123144,24)

The first result is expected and looks correct, but the second is horribly
wrong. One more observation: even if I change the data so that the total count
is more or less than 15123144, I still get distinct = 24 on the cluster. Any
clue, or an existing JIRA ticket? What could be a fix for now?
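
For what it is worth, last() gives no ordering guarantee, so its result can
legitimately vary across a shuffled, distributed run; that alone may not
explain a distinct count of 24, but it is one source of variation I can rule
out. A minimal sketch of a deterministic variant I am comparing against
(column names as above):

import org.apache.spark.sql.functions.max

// max over the group does not depend on partitioning or row order
val ids = df.groupBy("email").agg(max("user_id") as "user_id").select("user_id")
println((ids.count, ids.distinct.count))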

On Thu, Oct 22, 2015 at 9:59 PM,  wrote:

> Never mind my last email. res2 is filtered, so my test does not make sense.
> The issue is not reproduced there. I have the problem somewhere else.
>
>
>
> *From:* Ellafi, Saif A.
> *Sent:* Thursday, October 22, 2015 12:57 PM
> *To:* 'Xiao Li'
> *Cc:* user
> *Subject:* RE: Spark groupby and agg inconsistent and missing data
>
>
>
> Thanks. Sorry, I cannot share the data, and I am not sure how significant it
> would be for you anyway.
>
> I am reproducing the issue on a smaller piece of the content to see whether
> I can find a reason for the inconsistency.
>
>
>
> val res2 = data
>   .filter($"closed" === $"ever_closed")
>   .groupBy("product", "band", "aget", "vine", "time", "mm")
>   .agg(count($"account_id").as("N"), sum($"balance").as("balance"),
>        sum($"spend").as("spend"), sum($"payment").as("payment"))
>   .persist()
>
>
>
> Then I collect the distinct values of “vine” (which is StringType) from both
> data and res2, and res2 is missing a lot of values:
>
>
>
> val t1 = res2.select("vine").distinct.collect
>
> scala> t1.size
>
> res10: Int = 617
>
>
>
> val t_real = data.select("vine").distinct.collect
>
> scala> t_real.size
>
> res9: Int = 639
>
>
>
>
>
> *From:* Xiao Li [mailto:gatorsm...@gmail.com ]
> *Sent:* Thursday, October 22, 2015 12:45 PM
> *To:* Ellafi, Saif A.
> *Cc:* user
> *Subject:* Re: Spark groupby and agg inconsistent and missing data
>
>
>
> Hi, Saif,
>
>
>
> Could you post your code here? It might help others reproduce the errors
> and give you a correct answer.
>
>
>
> Thanks,
>
>
>
> Xiao Li
>
>
>
> 2015-10-22 8:27 GMT-07:00 :
>
> Hello everyone,
>
>
>
> I am doing some analytics experiments on a 4-server standalone cluster in
> the Spark shell, mostly involving a huge database with groupBy and
> aggregations.
>
>
>
> I am picking 6 groupBy columns and returning various aggregated results in a
> DataFrame. The groupBy fields are of two types: most of them are StringType
> and the rest are LongType.
>
>
>
> The data source is a DataFrame read from split JSON files. Once the data is
> persisted, the result is consistent, but if I drop it from memory and reload
> the data, the groupBy action returns different results with missing data.
>
>
>
> Could I be missing something? This is rather serious for my analytics, and I
> am not sure how to properly diagnose the situation.
>
>
>
> Thanks,
>
> Saif
>
>
>
>
>



-- 
-Kapil Rajak 


Getting ParquetDecodingException when I am running my spark application from spark-submit

2015-11-24 Thread Kapil Raaj
The relevant error lines are:

Caused by: parquet.io.ParquetDecodingException: Can't read value in
column [roll_key] BINARY at value 19600 out of 4814, 19600 out of
19600 in currentPage. repetition level: 0, definition level: 1
Caused by: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 131 in stage 0.0 failed 4 times, most recent failure:
Lost task 131.3 in stage 0.0 (TID 198, dap.changed.com):
parquet.io.ParquetDecodingException: Can not read value at 19600 in
block 0 in file
hdfs://dap.changed.com:8020/data/part-r-00177-51654832-053d-4074-b906-b97ac173807a.gz.parquet

But when I read it using the Spark client, I am not getting any error.

sqlContext.read.load("/data/").select("roll_key")

Kindly let me know how to debug it.
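
To narrow it down, I am planning to read just the failing file and column from
the stack trace, so the decoding error is reproduced in isolation (path as
reported above):

// isolate the file named in the stack trace and force a full scan of the column
val bad = sqlContext.read.parquet(
  "hdfs://dap.changed.com:8020/data/part-r-00177-51654832-053d-4074-b906-b97ac173807a.gz.parquet")
bad.select("roll_key").count()

If that also fails from the client, the file itself is probably corrupt or was
written with an incompatible Parquet version; if it only fails under
spark-submit, I suspect the Parquet jars on the two classpaths differ.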

-- 
-Kapil Rajak 


Enriching df.write.jdbc

2015-10-04 Thread Kapil Raaj
Hello folks,

I would like to contribute code to enrich the DataFrame writer API for JDBC
with an "update table" feature, driven by field names/keys passed as a List of
Strings.

Use Case:
1. df.write.mode("Update").jdbc(connectionString, "table_name",
   connectionProperties, keys)
Or
2. df.write.mode(SaveMode.Append).jdbc(connectionString, "table_name",
   connectionProperties, keys)

For the second option, if "keys" is an empty list it will behave exactly as it
does today; if "keys" is non-empty, it will update the matching entries.
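
For concreteness, here is a rough sketch of the update-by-keys semantics I
have in mind, written with plain JDBC (the helper, table and column names are
placeholders, not existing Spark API; the real implementation would of course
live inside the JDBC writer):

import java.sql.DriverManager
import java.util.Properties
import org.apache.spark.sql.Row

// Hypothetical helper: update rows whose key columns match, insert the rest.
def upsertPartition(rows: Iterator[Row], url: String, props: Properties,
                    table: String, columns: Seq[String], keys: Seq[String]): Unit = {
  val conn = DriverManager.getConnection(url, props)
  try {
    val setCols     = columns.filterNot(c => keys.contains(c))
    val setClause   = setCols.map(c => c + " = ?").mkString(", ")
    val whereClause = keys.map(k => k + " = ?").mkString(" AND ")
    val update = conn.prepareStatement(s"UPDATE $table SET $setClause WHERE $whereClause")
    val insert = conn.prepareStatement(
      s"INSERT INTO $table (${columns.mkString(", ")}) VALUES (${columns.map(_ => "?").mkString(", ")})")
    for (row <- rows) {
      (setCols ++ keys).zipWithIndex.foreach { case (c, i) =>
        update.setObject(i + 1, row.getAs[AnyRef](c))
      }
      if (update.executeUpdate() == 0) {            // no existing row matched the keys
        columns.zipWithIndex.foreach { case (c, i) =>
          insert.setObject(i + 1, row.getAs[AnyRef](c))
        }
        insert.executeUpdate()
      }
    }
  } finally conn.close()
}

// intended call site, per partition of the DataFrame:
// df.rdd.foreachPartition(rows =>
//   upsertPartition(rows, connectionString, connectionProperties, "table_name", df.columns, keys))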

Let me know which option (1 or 2) is better. I think 2 looks better, as I don't
want to introduce a new SaveMode enum value; moreover, "update" feels out of
place in the context of big data transformations.

If this use case sounds useful, let me know and I'll go ahead and send a PR.
Any other tips would be highly appreciated.

thanks,

-- 
kapil