SparkContext in a UDF

2017-10-18 Thread sk skk
I have registered a UDF with sqlContext. Inside that same UDF I am trying to
read another parquet file using sqlContext, and it throws a
NullPointerException.

Any help on how to access sqlContext inside a UDF?

Regards,
Sk
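A minimal PySpark sketch of the pattern being described (paths and the join
key are hypothetical). The SparkSession/sqlContext object lives only on the
driver, while a UDF body runs on the executors, where that reference is not
usable, which is why calling it there typically surfaces as a
NullPointerException; the usual workaround is to do the second read on the
driver and express the per-row lookup as a join:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("udf-sqlcontext-sketch").getOrCreate()

main_df = spark.read.parquet("/data/main")      # hypothetical input paths
lookup_df = spark.read.parquet("/data/lookup")

# Anti-pattern (do not do this): a UDF that closes over the driver-side
# session and tries to read a parquet file per row. On an executor the
# captured reference is not available, so it fails at runtime.
# bad = udf(lambda k: spark.read.parquet("/data/lookup").count())

# Workaround: read both datasets on the driver and join them instead of
# doing a lookup inside a UDF.
result = main_df.join(lookup_df, on="key", how="left")   # "key" is hypothetical
result.show()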


Re: Spark streaming for CEP

2017-10-18 Thread Mich Talebzadeh
As you may be aware, the granularity Spark Streaming offers is
micro-batching, and the batch interval is limited to around 0.5 seconds at
the low end. So if you have continuous ingestion of data, Spark Streaming
may not be granular enough for CEP, and you may want to consider other
products.
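For reference, a minimal PySpark sketch of that lower bound using the
DStream API (host and port are placeholders): the batch interval is fixed
when the StreamingContext is created, so nothing is evaluated more often
than that.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="cep-granularity-sketch")
# 0.5 s batch interval -- the practical floor for micro-batching, meaning
# an event can wait up to half a second before it is even looked at.
ssc = StreamingContext(sc, 0.5)

lines = ssc.socketTextStream("localhost", 9999)   # placeholder source
lines.count().pprint()                            # one result per micro-batch

ssc.start()
ssc.awaitTermination()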

Worth looking at this old thread of mine, "Spark support for Complex Event
Processing (CEP)":

https://mail-archives.apache.org/mod_mbox/spark-user/201604.mbox/%3CCAJ3fcbB8eaf0JV84bA7XGUK5GajC1yGT3ZgTNCi8arJg56=l...@mail.gmail.com%3E

HTH


Dr Mich Talebzadeh






On 18 October 2017 at 20:52, anna stax  wrote:

> Hello all,
>
> Has anyone used Spark Streaming for CEP (Complex Event Processing)? Are
> there any CEP libraries that work well with Spark? I have a use case for
> CEP and am trying to see if Spark Streaming is a good fit.
>
> Currently we have a data pipeline using Kafka, Spark Streaming and
> Cassandra for data ingestion and a near-real-time dashboard.
>
> Please share your experience.
> Thanks much.
> -Anna
>
>
>


Spark streaming for CEP

2017-10-18 Thread anna stax
Hello all,

Has anyone used Spark Streaming for CEP (Complex Event Processing)? Are
there any CEP libraries that work well with Spark? I have a use case for CEP
and am trying to see if Spark Streaming is a good fit.

Currently we have a data pipeline using Kafka, Spark Streaming and
Cassandra for data ingestion and a near-real-time dashboard.

Please share your experience.
Thanks much.
-Anna


possible cause: same TeraGen job sometimes slow and sometimes fast

2017-10-18 Thread Gil Vernik
I ran a series of TeraGen jobs via spark-submit (each job generated an
equal-sized dataset into a different S3 bucket).
I noticed that some jobs were fast and some were slow.

Slow jobs always had many log lines like
DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_1.0, runningTasks: 1
(or 2, etc.)

Fast jobs always had only a few of those lines.

Can someone explain why the number of those debug prints varies across
different executions of the same job? The more of those prints I see, the
slower the job runs.
Has anyone else experienced the same behavior?

Thanks
Gil.






Re: Need help with String Concat Operation

2017-10-18 Thread 高佳翔
Hi Debu,

First, instead of using ‘+’, you can use ‘concat’ to concatenate string
columns, and you should wrap “0” in "lit()" to make it a column.
Second, 1440 becomes null because you didn’t tell Spark what to do when the
when clause does not match, so it simply sets the value to null. To fix
this, add “.otherwise()” right after “when()”.

The code looks like this:

ctoff_df.withColumn("CTOFF_NEW",
  when(
    length(col("CTOFF")) == 3,
    concat(lit("0"), col("CTOFF"))
  ).otherwise(
    col("CTOFF")
  ))
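As a side note, since the goal is a fixed-width zero-padded value, the
built-in lpad would also do this in one call; a minimal sketch, assuming
CTOFF is a string column in the same ctoff_df:

from pyspark.sql.functions import col, lpad

# Left-pad CTOFF with "0" to width 4: 730 -> 0730, 1440 stays 1440.
ctoff_df.withColumn("CTOFF_NEW", lpad(col("CTOFF"), 4, "0")).show()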

Best,

JiaXiang

On Wed, Oct 18, 2017 at 2:17 PM, Debabrata Ghosh wrote:

> Hi,
>          I have a dataframe column (the column name is CTOFF) and I
> intend to prefix it with '0' when the length of the column value is 3.
> Unfortunately, I am unable to achieve my goal and wonder whether you can
> help me here.
>
> Command which I am executing:
>
> ctoff_dedup_prep_temp = ctoff_df.withColumn('CTOFF_NEW',
>     when(length(col('CTOFF')) == 3, '0' + col('CTOFF')))
> ctoff_dedup_prep_temp.show()
> ctoff_dedup_prep_temp.show()
>
> +------------+----------+----------+--------------------+---------+-----+---------+
> |EVNT_SRVC_AR|EVNT_FCLTY|EVNT_TP_CD|        NTWRK_PRD_CD| DY_OF_WK|CTOFF|CTOFF_NEW|
> +------------+----------+----------+--------------------+---------+-----+---------+
> |         HKG|       HKC|        AR|2,3,7,8,C,D,J,P,Q...|1,2,3,4,5| 1440|     null|
> |         HKG|       HKC|        AR|             C,Q,T,Y|1,2,3,4,5|  730|    730.0|
> |         HKG|       HKC|        AR|         E,K,C,Q,T,Y|1,2,3,4,5|  600|    600.0|
> |         HKG|       HKC|        AR|         E,K,C,Q,T,Y|1,2,3,4,5|  900|    900.0|
> +------------+----------+----------+--------------------+---------+-----+---------+
>
> The result which I want is:
> +------------+----------+----------+--------------------+---------+-----+---------+
> |EVNT_SRVC_AR|EVNT_FCLTY|EVNT_TP_CD|        NTWRK_PRD_CD| DY_OF_WK|CTOFF|CTOFF_NEW|
> +------------+----------+----------+--------------------+---------+-----+---------+
> |         HKG|       HKC|        AR|2,3,7,8,C,D,J,P,Q...|1,2,3,4,5| 1440|     1440|
> |         HKG|       HKC|        AR|             C,Q,T,Y|1,2,3,4,5|  730|     0730|
> |         HKG|       HKC|        AR|         E,K,C,Q,T,Y|1,2,3,4,5|  600|     0600|
> |         HKG|       HKC|        AR|         E,K,C,Q,T,Y|1,2,3,4,5|  900|     0900|
> +------------+----------+----------+--------------------+---------+-----+---------+
>
> So I want the '0' to be prefixed, but instead it is getting suffixed as
> '.0'. Any clue why this is happening?
>
> Thanks,
>
> Debu
>



-- 
Gao JiaXiang
Data Analyst, GCBI 


Need help with String Concat Operation

2017-10-18 Thread Debabrata Ghosh
Hi,
         I have a dataframe column (the column name is CTOFF) and I intend
to prefix it with '0' when the length of the column value is 3.
Unfortunately, I am unable to achieve my goal and wonder whether you can
help me here.

Command which I am executing:

ctoff_dedup_prep_temp = ctoff_df.withColumn('CTOFF_NEW',
    when(length(col('CTOFF')) == 3, '0' + col('CTOFF')))
ctoff_dedup_prep_temp.show()

+------------+----------+----------+--------------------+---------+-----+---------+
|EVNT_SRVC_AR|EVNT_FCLTY|EVNT_TP_CD|        NTWRK_PRD_CD| DY_OF_WK|CTOFF|CTOFF_NEW|
+------------+----------+----------+--------------------+---------+-----+---------+
|         HKG|       HKC|        AR|2,3,7,8,C,D,J,P,Q...|1,2,3,4,5| 1440|     null|
|         HKG|       HKC|        AR|             C,Q,T,Y|1,2,3,4,5|  730|    730.0|
|         HKG|       HKC|        AR|         E,K,C,Q,T,Y|1,2,3,4,5|  600|    600.0|
|         HKG|       HKC|        AR|         E,K,C,Q,T,Y|1,2,3,4,5|  900|    900.0|
+------------+----------+----------+--------------------+---------+-----+---------+

The result which I want is:
+------------+----------+----------+--------------------+---------+-----+---------+
|EVNT_SRVC_AR|EVNT_FCLTY|EVNT_TP_CD|        NTWRK_PRD_CD| DY_OF_WK|CTOFF|CTOFF_NEW|
+------------+----------+----------+--------------------+---------+-----+---------+
|         HKG|       HKC|        AR|2,3,7,8,C,D,J,P,Q...|1,2,3,4,5| 1440|     1440|
|         HKG|       HKC|        AR|             C,Q,T,Y|1,2,3,4,5|  730|     0730|
|         HKG|       HKC|        AR|         E,K,C,Q,T,Y|1,2,3,4,5|  600|     0600|
|         HKG|       HKC|        AR|         E,K,C,Q,T,Y|1,2,3,4,5|  900|     0900|
+------------+----------+----------+--------------------+---------+-----+---------+

So I want the '0' to be prefixed, but instead it is getting suffixed as
'.0'. Any clue why this is happening?

Thanks,

Debu
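For what it's worth, the '.0' suffix comes from the fact that '+' on Spark
columns is arithmetic: '0' + col('CTOFF') casts both sides to double and
adds them (0 + 730 = 730.0), whereas concat keeps everything as strings. A
small self-contained PySpark sketch (toy data, hypothetical app name)
illustrating the difference:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, length, lit, when

spark = SparkSession.builder.appName("ctoff-padding-sketch").getOrCreate()
df = spark.createDataFrame([("1440",), ("730",)], ["CTOFF"])

df.select(
    col("CTOFF"),
    # '+' builds an arithmetic expression: both sides are cast to double,
    # so "730" becomes 730.0 instead of "0730".
    (lit("0") + col("CTOFF")).alias("plus_gives_double"),
    # concat stays in string land and prefixes the "0" as intended.
    when(length(col("CTOFF")) == 3, concat(lit("0"), col("CTOFF")))
        .otherwise(col("CTOFF")).alias("CTOFF_NEW"),
).show()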


Re: partition by multiple columns/keys

2017-10-18 Thread Imran Rajjad
Yes, I think I figured out something like the below.

Serialized Java Class
---------------------
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Row;

public class MyMapPartition implements Serializable, MapPartitionsFunction<Row, Row> {
  @Override
  public Iterator<Row> call(Iterator<Row> iter) throws Exception {
    ArrayList<Row> list = new ArrayList<>();
    // ArrayNode array = mapper.createArrayNode();
    Row row = null;
    System.out.println("");        // blank line marking the start of this partition's output
    while (iter.hasNext()) {
      row = iter.next();
      System.out.println(row);     // rows with equal (a, b) always land in the same partition
      list.add(row);
    }
    System.out.println("");        // blank line marking the end of this partition's output
    return list.iterator();
  }
}

Unit Test
---------
JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(
    RowFactory.create(11L, 21L, 1L),
    RowFactory.create(11L, 22L, 2L),
    RowFactory.create(11L, 22L, 1L),
    RowFactory.create(12L, 23L, 3L),
    RowFactory.create(12L, 24L, 3L),
    RowFactory.create(12L, 22L, 4L),
    RowFactory.create(13L, 22L, 4L),
    RowFactory.create(14L, 22L, 4L)));

StructType structType = new StructType();
structType = structType.add("a", DataTypes.LongType, false)
    .add("b", DataTypes.LongType, false)
    .add("c", DataTypes.LongType, false);
ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);

Dataset<Row> ds = spark.createDataFrame(rdd, encoder.schema());
ds.show();

MyMapPartition mp = new MyMapPartition();

// group by all three keys, then repartition on (a, b) so that rows sharing
// the same (a, b) combination are guaranteed to land in the same partition
Dataset<Row> grouped = ds.groupBy("a", "b", "c")
    .count()
    .repartition(new Column("a"), new Column("b"))
    .mapPartitions(mp, encoder);

grouped.count();

---

output


[12,23,3,1]


[14,22,4,1]


[12,24,3,1]


[12,22,4,1]


[11,22,1,1]
[11,22,2,1]


[11,21,1,1]


[13,22,4,1]



On Wed, Oct 18, 2017 at 10:29 AM, ayan guha  wrote:

> How or what do you want to achieve? I.e., are you planning to do some
> aggregation on the group by c1, c2?
>
> On Wed, 18 Oct 2017 at 4:13 pm, Imran Rajjad  wrote:
>
>> Hi,
>>
>> I have a set of rows that are the result of a
>> groupBy(col1, col2, col3).count().
>>
>> Is it possible to map rows belonging to a unique combination inside an
>> iterator?
>>
>> e.g
>>
>> col1 col2 col3
>> a  1  a1
>> a  1  a2
>> b  2  b1
>> b  2  b2
>>
>> How can I separate the rows where (col1, col2) = (a, 1) from those where
>> it is (b, 2)?
>>
>> regards,
>> Imran
>>
>> --
>> I.R
>>
> --
> Best Regards,
> Ayan Guha
>



-- 
I.R