Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-07 Thread Ewan Leith
Looking at the encoders api documentation at

http://spark.apache.org/docs/latest/api/java/

== Java == Encoders are specified by calling static methods on 
Encoders.

List<String> data = Arrays.asList("abc", "abc", "xyz");
Dataset<String> ds = context.createDataset(data, Encoders.STRING());

I think you should be calling

.as((Encoders.STRING(), Encoders.STRING()))

or similar
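For reference, here is a minimal Scala sketch of the same idea (the file path and
column names are made up; in the Java API the combined encoder would presumably be
built with Encoders.tuple(Encoders.STRING(), Encoders.STRING())):

import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("select-then-encode").master("local[*]").getOrCreate()
import spark.implicits._

// Read the whole CSV, keep only the columns of interest, then attach typed encoders.
val df = spark.read.option("header", "true").csv("/path/to/upstream.csv")
val ds: Dataset[(String, String)] = df.select($"colA".as[String], $"colB".as[String])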

Ewan

On 8 Aug 2016 06:10, Aseem Bansal  wrote:
Hi All

Has anyone done this with Java API?

On Fri, Aug 5, 2016 at 5:36 PM, Aseem Bansal 
> wrote:
I need to use only a few columns out of a CSV, but as there is no option to read
just a few columns from a CSV:
 1. I am reading the whole CSV using SparkSession.csv()
 2. selecting a few of the columns using DataFrame.select()
 3. applying a schema using the .as() function of Dataset. I tried to
extend org.apache.spark.sql.Encoder as the input for the as function

But I am getting the following exception

Exception in thread "main" java.lang.RuntimeException: Only expression encoders 
are supported today

So my questions are:
1. Is it possible to read only a few columns instead of the whole CSV? I cannot change the
CSV, as that is upstream data.
2. How do I apply a schema to a few columns if I cannot write my own encoder?




Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-07 Thread Aseem Bansal
Hi All

Has anyone done this with Java API?

On Fri, Aug 5, 2016 at 5:36 PM, Aseem Bansal  wrote:

> I need to use only a few columns out of a CSV, but as there is no option to read
> just a few columns from a CSV:
>  1. I am reading the whole CSV using SparkSession.csv()
>  2. selecting a few of the columns using DataFrame.select()
>  3. applying a schema using the .as() function of Dataset. I tried to
> extend org.apache.spark.sql.Encoder as the input for the as function
>
> But I am getting the following exception
>
> Exception in thread "main" java.lang.RuntimeException: Only expression
> encoders are supported today
>
> So my questions are:
> 1. Is it possible to read only a few columns instead of the whole CSV? I cannot
> change the CSV, as that is upstream data.
> 2. How do I apply a schema to a few columns if I cannot write my own encoder?
>


Re: Any exceptions during an action doesn't fail the Spark streaming batch in yarn-client mode

2016-08-07 Thread ayan guha
Is it a python app?

On Mon, Aug 8, 2016 at 2:44 PM, Hemalatha A <
hemalatha.amru...@googlemail.com> wrote:

> Hello,
>
> I am seeing multiple exceptions in the logs during an action, but none
> of them fails the Spark streaming batch in yarn-client mode, whereas the
> same exception is thrown in yarn-cluster mode and the application ends.
>
> I am trying to save a DataFrame to Cassandra, which results in an error due
> to, say, a wrong password. The job goes to the failed state, throwing the below
> exception in the Jobs tab of the Spark UI, but in the Streaming tab the
> corresponding batch remains in the active state forever. It doesn't fail the
> streaming batch in yarn-client mode. In yarn-cluster mode, by contrast, it
> throws the same error and ends the application, as expected.
>
> Why is this difference in behaviour between the two modes? Why does
> yarn-client mode behave this way?
>
> *Exception seen in both modes:*
>
> 16/08/04 08:04:43 ERROR org.apache.spark.streaming.scheduler.JobScheduler: 
> Error running job streaming job 147029788 ms.0
> java.io.IOException: Failed to open native connection to Cassandra at 
> {172.x.x.x}:9042
> at 
> com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:162)
> at 
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
> at 
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
> at 
> com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:31)
> at 
> com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:56)
> at 
> com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:81)
> at 
> com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
> at 
> com.datastax.spark.connector.rdd.partitioner.CassandraRDDPartitioner$.getTokenFactory(CassandraRDDPartitioner.scala:184)
> at 
> org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:267)
> at 
> org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:84)
> at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:170)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
> at 
> boccstreamingall$$anonfun$process_kv_text_stream$1.apply(bocc_spark_all.scala:249)
> at 
> boccstreamingall$$anonfun$process_kv_text_stream$1.apply(bocc_spark_all.scala:233)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:207)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:207)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:207)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:206)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: com.datastax.driver.core.exceptions.AuthenticationException: 
> Authentication error on host /172.x.x.x:9042: Username and/or password are 
> incorrect
> at com.datastax.driver.core.Connection$8.apply(Connection.java:376)
> at com.datastax.driver.core.Connection$8.apply(Connection.java:346)
> at 
> 

Any exceptions during an action doesn't fail the Spark streaming batch in yarn-client mode

2016-08-07 Thread Hemalatha A
Hello,

I am seeing multiple exceptions in the logs during an action, but none of
them fails the Spark streaming batch in yarn-client mode, whereas the same
exception is thrown in yarn-cluster mode and the application ends.

I am trying to save a DataFrame to Cassandra, which results in an error due to,
say, a wrong password. The job goes to the failed state, throwing the below
exception in the Jobs tab of the Spark UI, but in the Streaming tab the
corresponding batch remains in the active state forever. It doesn't fail the
streaming batch in yarn-client mode. In yarn-cluster mode, by contrast, it
throws the same error and ends the application, as expected.

Why is this difference in behaviour between the two modes? Why does yarn-client
mode behave this way?
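For context, the write in question presumably looks something like the following sketch,
using the spark-cassandra-connector DataFrame API (the keyspace and table names are made up):

// Inside foreachRDD: write the DataFrame to Cassandra.
// A failure here (e.g. bad credentials) surfaces as the exception below.
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .save()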

*Exception seen in both modes:*

16/08/04 08:04:43 ERROR
org.apache.spark.streaming.scheduler.JobScheduler: Error running job
streaming job 147029788 ms.0
java.io.IOException: Failed to open native connection to Cassandra at
{172.x.x.x}:9042
at 
com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:162)
at 
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
at 
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
at 
com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:31)
at 
com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:56)
at 
com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:81)
at 
com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at 
com.datastax.spark.connector.rdd.partitioner.CassandraRDDPartitioner$.getTokenFactory(CassandraRDDPartitioner.scala:184)
at 
org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:267)
at 
org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:84)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:170)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at 
boccstreamingall$$anonfun$process_kv_text_stream$1.apply(bocc_spark_all.scala:249)
at 
boccstreamingall$$anonfun$process_kv_text_stream$1.apply(bocc_spark_all.scala:233)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:207)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:207)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:207)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)

at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:206)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.datastax.driver.core.exceptions.AuthenticationException:
Authentication error on host /172.x.x.x:9042: Username and/or password
are incorrect
at com.datastax.driver.core.Connection$8.apply(Connection.java:376)
at com.datastax.driver.core.Connection$8.apply(Connection.java:346)
at 
shadeio.common.util.concurrent.Futures$ChainingListenableFuture.run(Futures.java:861)
at 
shadeio.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
at 
shadeio.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
at 

Re: submitting spark job with kerberized Hadoop issue

2016-08-07 Thread Ted Yu
The link in Jerry's response was quite old.

Please see:
http://hbase.apache.org/book.html#security

Thanks

On Sun, Aug 7, 2016 at 6:55 PM, Saisai Shao  wrote:

> 1. Standalone mode doesn't support accessing kerberized Hadoop, simply
> because it lacks the mechanism to distribute delegation tokens via the cluster
> manager.
> 2. For the HBase token fetching failure, I think you have to do kinit to
> generate a TGT before starting the Spark application
> (http://hbase.apache.org/0.94/book/security.html).
>
> On Mon, Aug 8, 2016 at 12:05 AM, Aneela Saleem 
> wrote:
>
>> Thanks Wojciech and Jacek!
>>
>> I tried Spark on YARN with the kerberized cluster and it works fine now. But
>> now, when I try to access HBase through Spark, I get the following error:
>>
>> 2016-08-07 20:43:57,617 WARN  
>> [hconnection-0x24b5fa45-metaLookup-shared--pool2-t1] ipc.RpcClientImpl: 
>> Exception encountered while connecting to the server : 
>> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
>> GSSException: No valid credentials provided (Mechanism level: Failed to find 
>> any Kerberos tgt)]
>> 2016-08-07 20:43:57,619 ERROR 
>> [hconnection-0x24b5fa45-metaLookup-shared--pool2-t1] ipc.RpcClientImpl: SASL 
>> authentication failed. The most likely cause is missing or invalid 
>> credentials. Consider 'kinit'.
>> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
>> GSSException: No valid credentials provided (Mechanism level: Failed to find 
>> any Kerberos tgt)]
>>  at 
>> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>>  at 
>> org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:617)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$700(RpcClientImpl.java:162)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:743)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:740)
>>  at java.security.AccessController.doPrivileged(Native Method)
>>  at javax.security.auth.Subject.doAs(Subject.java:415)
>>  at 
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:740)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:906)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:873)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1241)
>>  at 
>> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
>>  at 
>> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336)
>>  at 
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:34094)
>>  at 
>> org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:201)
>>  at 
>> org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:180)
>>  at 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210)
>>  at 
>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:360)
>>  at 
>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:334)
>>  at 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
>>  at 
>> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>  at java.lang.Thread.run(Thread.java:745)
>> Caused by: GSSException: No valid credentials provided (Mechanism level: 
>> Failed to find any Kerberos tgt)
>>  at 
>> sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
>>  at 
>> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
>>  at 
>> sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
>>  at 
>> sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
>>  at 
>> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
>>  at 
>> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
>>  at 
>> 

Re: silence the spark debug logs

2016-08-07 Thread Sachin Janani
Hi,
You can switch off the logs by setting the log level to OFF as follows:

import org.apache.log4j.Logger
import org.apache.log4j.Level

Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
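If only some output is wanted (e.g. the SQL execution plans), one option is to turn the
noisy packages off and then raise individual loggers back up. A sketch; the specific
logger name below is an assumption and may differ by Spark version:

import org.apache.log4j.{Level, Logger}

// Silence the bulk of Spark's and Akka's logging...
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)

// ...then selectively re-enable only the loggers you care about.
Logger.getLogger("org.apache.spark.sql.execution").setLevel(Level.INFO)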


Regards,
Sachin J


On Mon, Aug 8, 2016 at 9:39 AM, Sumit Khanna  wrote:

> Hello,
>
> I don't want to print all the Spark logs, only a few, e.g. just the
> execution plans, etc. How do I silence the Spark debug logs?
>
> Thanks,
> Sumit
>


silence the spark debug logs

2016-08-07 Thread Sumit Khanna
Hello,

I don't want to print all the Spark logs, only a few, e.g. just the
execution plans, etc. How do I silence the Spark debug logs?

Thanks,
Sumit


Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Divya Gehlot
I tried with the condition expression also, but it didn't work :(

On Aug 8, 2016 11:13 AM, "Chanh Le"  wrote:

> You should use *df.where(conditionExpr)*, which is more convenient for
> expressing simple terms in SQL.
>
>
> /**
>  * Filters rows using the given SQL expression.
>  * {{{
>  *   peopleDf.where("age > 15")
>  * }}}
>  * @group dfops
>  * @since 1.5.0
>  */
> def where(conditionExpr: String): DataFrame = {
>   filter(Column(SqlParser.parseExpression(conditionExpr)))
> }
>
>
>
>
>
> On Aug 7, 2016, at 10:58 PM, Mich Talebzadeh 
> wrote:
>
> although the logic should be col1 <> a && col(1) <> b
>
> to exclude both
>
> Like
>
> df.filter('transactiontype > " ").filter(not('transactiontype ==="DEB") &&
> not('transactiontype ==="BGC")).select('transactiontype).distinct.
> collect.foreach(println)
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 7 August 2016 at 16:53, Mich Talebzadeh 
> wrote:
>
>> try similar to this
>>
>> df.filter(not('transactiontype ==="DEB") || not('transactiontype
>> ==="CRE"))
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 7 August 2016 at 15:43, Divya Gehlot  wrote:
>>
>>> Hi,
>>> I have a use case where I need to use the or [||] operator in a filter condition.
>>> It seems it's not working: it takes the condition before the operator
>>> and ignores the other filter condition after the or operator.
>>> Has anybody faced a similar issue?
>>>
>>> Pseudo code:
>>> df.filter(col("colName").notEqual("no_value") ||
>>> col("colName").notEqual(""))
>>>
>>> Am I missing something.
>>> Would really appreciate the help.
>>>
>>>
>>> Thanks,
>>> Divya
>>>
>>
>>
>
>


Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Chanh Le
You should use df.where(conditionExpr), which is more convenient for expressing
simple terms in SQL.
 

/**
 * Filters rows using the given SQL expression.
 * {{{
 *   peopleDf.where("age > 15")
 * }}}
 * @group dfops
 * @since 1.5.0
 */
def where(conditionExpr: String): DataFrame = {
  filter(Column(SqlParser.parseExpression(conditionExpr)))
}
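Applied to the pseudocode from the original question, that might look like the sketch
below (note that to exclude both values the logic presumably needs AND rather than OR,
as Mich points out; df and the column name are taken from the question):

import org.apache.spark.sql.functions.col

// SQL-expression form: keep rows whose colName is neither "no_value" nor empty.
val filtered1 = df.where("colName <> 'no_value' AND colName <> ''")

// Equivalent Column-based form.
val filtered2 = df.filter(col("colName").notEqual("no_value") && col("colName").notEqual(""))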




> On Aug 7, 2016, at 10:58 PM, Mich Talebzadeh  
> wrote:
> 
> although the logic should be col1 <> a && col(1) <> b
> 
> to exclude both
> 
> Like
> 
> df.filter('transactiontype > " ").filter(not('transactiontype ==="DEB") && 
> not('transactiontype 
> ==="BGC")).select('transactiontype).distinct.collect.foreach(println)
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 7 August 2016 at 16:53, Mich Talebzadeh  > wrote:
> try similar to this
> 
> df.filter(not('transactiontype ==="DEB") || not('transactiontype ==="CRE"))
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 7 August 2016 at 15:43, Divya Gehlot  > wrote:
> Hi,
> I have a use case where I need to use the or [||] operator in a filter condition.
> It seems it's not working: it takes the condition before the operator and
> ignores the other filter condition after the or operator.
> Has anybody faced a similar issue?
>
> Pseudo code:
> df.filter(col("colName").notEqual("no_value") || col("colName").notEqual(""))
> 
> Am I missing something.
> Would really appreciate the help.
> 
> 
> Thanks,
> Divya 
> 
> 



Re: [SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-07 Thread Muthu Jayakumar
Hello Hao Ren,

Doesn't the code...

val add = udf {
  (a: Int) => a + notSer.value
}
mean a UDF function of type Int => Int?

Thanks,
Muthu

On Sun, Aug 7, 2016 at 2:31 PM, Hao Ren  wrote:

> I am playing with Spark 2.0.
> What I tried to test is:
>
> Create a UDF in which there is a non-serializable object.
> What I expected is that when this UDF is called while materializing the
> DataFrame where the UDF is used in "select", a task-not-serializable
> exception should be thrown.
> It also depends on which "action" is called on that DataFrame.
>
> Here is the code for reproducing the problem:
>
> 
> object DataFrameSerDeTest extends App {
>
>   class A(val value: Int) // It is not serializable
>
>   def run() = {
> val spark = SparkSession
>   .builder()
>   .appName("DataFrameSerDeTest")
>   .master("local[*]")
>   .getOrCreate()
>
> import org.apache.spark.sql.functions.udf
> import spark.sqlContext.implicits._
>
> val notSer = new A(2)
> val add = udf {
>   (a: Int) => a + notSer.value
> }
> val df = spark.createDataFrame(Seq(
>   (1, 2),
>   (2, 2),
>   (3, 2),
>   (4, 2)
> )).toDF("key", "value")
>   .select($"key", add($"value").as("added"))
>
> df.show() // *It should not work because the udf contains a
> non-serializable object, but it works*
>
> df.filter($"key" === 2).show() // *It does not work as expected
> (org.apache.spark.SparkException: Task not serializable)*
>   }
>
>   run()
> }
> 
>
> Also, I tried collect(), count(), first(), limit(). All of them worked
> without non-serializable exceptions.
> It seems only filter() throws the exception. (feature or bug ?)
>
> Any ideas? Or did I just mess things up?
> Any help is highly appreciated.
>
> --
> Hao Ren
>
> Data Engineer @ leboncoin
>
> Paris, France
>


Random forest binary classification H20 difference Spark

2016-08-07 Thread Javier Rey
Hi everybody.

I have executed RF on H2O and didn't have trouble with null values, but in
contrast, in Spark, using DataFrames and the ML library, I obtain the error below.
I know my DataFrame contains nulls, but I understood that Random Forest
supports null values:

"Values to assemble cannot be null"

Any advice on how to handle this issue in Spark?

Regards,

Samir
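"Values to assemble cannot be null" is the message VectorAssembler throws when a feature
column contains nulls, so one common approach is to drop or impute the nulls before
assembling. A sketch; the column names are made up:

// Drop rows that have nulls in the feature columns, or impute a default instead.
val noNulls = df.na.drop(Seq("feature1", "feature2"))
// val imputed = df.na.fill(0.0, Seq("feature1", "feature2"))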


Re: submitting spark job with kerberized Hadoop issue

2016-08-07 Thread Saisai Shao
1. Standalone mode doesn't support accessing kerberized Hadoop, simply
because it lacks the mechanism to distribute delegation tokens via the cluster
manager.
2. For the HBase token fetching failure, I think you have to do kinit to
generate a TGT before starting the Spark application
(http://hbase.apache.org/0.94/book/security.html).
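As a related illustration (not necessarily the fix here), a driver-side HBase client can
also log in from a keytab programmatically before any connection is opened. A sketch; the
principal and keytab path below are placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.security.UserGroupInformation

// Programmatic equivalent of kinit: authenticate from a keytab up front.
val hbaseConf = HBaseConfiguration.create()
UserGroupInformation.setConfiguration(hbaseConf)
UserGroupInformation.loginUserFromKeytab(
  "spark/hadoop-master@platalyticsrealm",
  "/etc/hadoop/conf/spark.keytab")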

On Mon, Aug 8, 2016 at 12:05 AM, Aneela Saleem 
wrote:

> Thanks Wojciech and Jacek!
>
> I tried Spark on YARN with the kerberized cluster and it works fine now. But
> now, when I try to access HBase through Spark, I get the following error:
>
> 2016-08-07 20:43:57,617 WARN  
> [hconnection-0x24b5fa45-metaLookup-shared--pool2-t1] ipc.RpcClientImpl: 
> Exception encountered while connecting to the server : 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> 2016-08-07 20:43:57,619 ERROR 
> [hconnection-0x24b5fa45-metaLookup-shared--pool2-t1] ipc.RpcClientImpl: SASL 
> authentication failed. The most likely cause is missing or invalid 
> credentials. Consider 'kinit'.
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>   at 
> org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
>   at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:617)
>   at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$700(RpcClientImpl.java:162)
>   at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:743)
>   at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:740)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:740)
>   at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:906)
>   at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:873)
>   at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1241)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336)
>   at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:34094)
>   at 
> org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:201)
>   at 
> org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:180)
>   at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210)
>   at 
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:360)
>   at 
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:334)
>   at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
>   at 
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos tgt)
>   at 
> sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
>   at 
> sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
>   at 
> sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
>   at 
> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
>   at 
> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
>   ... 25 more
>
>


Using Kryo for DataFrames (Dataset)?

2016-08-07 Thread Jestin Ma
When using DataFrames (Dataset), there's no option for an Encoder.
Does that mean DataFrames (since they build on top of an RDD) use Java
serialization? Does using Kryo make sense as an optimization here?

If not, what's the difference between Java/Kryo serialization, Tungsten,
and Encoders?

Thank you!


[SPARK-2.0][SQL] UDF containing non-serializable object does not work as expected

2016-08-07 Thread Hao Ren
I am playing with Spark 2.0.
What I tried to test is:

Create a UDF in which there is a non-serializable object.
What I expected is that when this UDF is called while materializing the
DataFrame where the UDF is used in "select", a task-not-serializable
exception should be thrown.
It also depends on which "action" is called on that DataFrame.

Here is the code for reproducing the problem:


object DataFrameSerDeTest extends App {

  class A(val value: Int) // It is not serializable

  def run() = {
val spark = SparkSession
  .builder()
  .appName("DataFrameSerDeTest")
  .master("local[*]")
  .getOrCreate()

import org.apache.spark.sql.functions.udf
import spark.sqlContext.implicits._

val notSer = new A(2)
val add = udf {
  (a: Int) => a + notSer.value
}
val df = spark.createDataFrame(Seq(
  (1, 2),
  (2, 2),
  (3, 2),
  (4, 2)
)).toDF("key", "value")
  .select($"key", add($"value").as("added"))

df.show() // *It should not work because the udf contains a
non-serializable object, but it works*

df.filter($"key" === 2).show() // *It does not work as expected
(org.apache.spark.SparkException: Task not serializable)*
  }

  run()
}


Also, I tried collect(), count(), first(), limit(). All of them worked
without non-serializable exceptions.
It seems only filter() throws the exception. (feature or bug ?)

Any ideas? Or did I just mess things up?
Any help is highly appreciated.

-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France
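A common workaround (a sketch, not something proposed in the thread): capture only the
primitive value in the UDF closure, so that no non-serializable object is shipped to the
executors:

import org.apache.spark.sql.functions.udf

val notSer = new A(2)
val serializableValue = notSer.value // capture the Int, not the non-serializable A instance
val add = udf { (a: Int) => a + serializableValue }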


Re: submitting spark job with kerberized Hadoop issue

2016-08-07 Thread Aneela Saleem
Thanks Wojciech and Jacek!

I tried Spark on YARN with the kerberized cluster and it works fine now. But
now, when I try to access HBase through Spark, I get the following error:

2016-08-07 20:43:57,617 WARN
[hconnection-0x24b5fa45-metaLookup-shared--pool2-t1]
ipc.RpcClientImpl: Exception encountered while connecting to the
server : javax.security.sasl.SaslException: GSS initiate failed
[Caused by GSSException: No valid credentials provided (Mechanism
level: Failed to find any Kerberos tgt)]
2016-08-07 20:43:57,619 ERROR
[hconnection-0x24b5fa45-metaLookup-shared--pool2-t1]
ipc.RpcClientImpl: SASL authentication failed. The most likely cause
is missing or invalid credentials. Consider 'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed
to find any Kerberos tgt)]
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
at 
org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:617)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$700(RpcClientImpl.java:162)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:743)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:740)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:740)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:906)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:873)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1241)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336)
at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:34094)
at 
org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:201)
at 
org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:180)
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:360)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:334)
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
at 
org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: GSSException: No valid credentials provided (Mechanism
level: Failed to find any Kerberos tgt)
at 
sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at 
sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
at 
sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at 
sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
at 
sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at 
sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
... 25 more


Accessing HBase through Spark with Security enabled

2016-08-07 Thread Aneela Saleem
Hi all,

I'm trying to run a Spark job that accesses HBase with security enabled.
When I run the following command:

*/usr/local/spark-2/bin/spark-submit --keytab /etc/hadoop/conf/spark.keytab
--principal spark/hadoop-master@platalyticsrealm --class
com.platalytics.example.spark.App --master yarn  --driver-class-path
/root/hbase-1.2.2/conf /home/vm6/project-1-jar-with-dependencies.jar*


I get the following error:


2016-08-07 20:43:57,617 WARN
[hconnection-0x24b5fa45-metaLookup-shared--pool2-t1]
ipc.RpcClientImpl: Exception encountered while connecting to the
server : javax.security.sasl.SaslException: GSS initiate failed
[Caused by GSSException: No valid credentials provided (Mechanism
level: Failed to find any Kerberos tgt)]
2016-08-07 20:43:57,619 ERROR
[hconnection-0x24b5fa45-metaLookup-shared--pool2-t1]
ipc.RpcClientImpl: SASL authentication failed. The most likely cause
is missing or invalid credentials. Consider 'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed
to find any Kerberos tgt)]
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
at 
org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:617)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$700(RpcClientImpl.java:162)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:743)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:740)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:740)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:906)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:873)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1241)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336)
at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:34094)
at 
org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:201)
at 
org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:180)
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:360)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:334)
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
at 
org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: GSSException: No valid credentials provided (Mechanism
level: Failed to find any Kerberos tgt)
at 
sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at 
sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
at 
sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at 
sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
at 
sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at 
sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
... 25 more


I have Spark running on YARN with security enabled. I have kinit'd
from the console and have provided the necessary principals and keytabs. Can
you please help me find out the issue?


Thanks
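For reference, the HBase access inside such a job typically looks something like the sketch
below (the table name is a placeholder); the GSSException above means that the JVM running
this code has no valid Kerberos TGT.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.SparkContext

// Sketch: read an HBase table as an RDD of (row key, result) pairs.
def readTable(sc: SparkContext) = {
  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")
  sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
}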


Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Mich Talebzadeh
although the logic should be col1 <> a && col(1) <> b

to exclude both

Like

df.filter('transactiontype > " ").filter(not('transactiontype ==="DEB") &&
not('transactiontype
==="BGC")).select('transactiontype).distinct.collect.foreach(println)

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 7 August 2016 at 16:53, Mich Talebzadeh 
wrote:

> try similar to this
>
> df.filter(not('transactiontype ==="DEB") || not('transactiontype ==="CRE"))
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 7 August 2016 at 15:43, Divya Gehlot  wrote:
>
>> Hi,
>> I have a use case where I need to use the or [||] operator in a filter condition.
>> It seems it's not working: it takes the condition before the operator and
>> ignores the other filter condition after the or operator.
>> Has anybody faced a similar issue?
>>
>> Pseudo code:
>> df.filter(col("colName").notEqual("no_value") ||
>> col("colName").notEqual(""))
>>
>> Am I missing something.
>> Would really appreciate the help.
>>
>>
>> Thanks,
>> Divya
>>
>
>


Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Mich Talebzadeh
try similar to this

df.filter(not('transactiontype ==="DEB") || not('transactiontype ==="CRE"))

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 7 August 2016 at 15:43, Divya Gehlot  wrote:

> Hi,
> I have a use case where I need to use the or [||] operator in a filter condition.
> It seems it's not working: it takes the condition before the operator and
> ignores the other filter condition after the or operator.
> Has anybody faced a similar issue?
>
> Pseudo code:
> df.filter(col("colName").notEqual("no_value") ||
> col("colName").notEqual(""))
>
> Am I missing something.
> Would really appreciate the help.
>
>
> Thanks,
> Divya
>


Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread janardhan shetty
Can you try 'or' keyword instead?
On Aug 7, 2016 7:43 AM, "Divya Gehlot"  wrote:

> Hi,
> I have a use case where I need to use the or [||] operator in a filter condition.
> It seems it's not working: it takes the condition before the operator and
> ignores the other filter condition after the or operator.
> Has anybody faced a similar issue?
>
> Pseudo code:
> df.filter(col("colName").notEqual("no_value") ||
> col("colName").notEqual(""))
>
> Am I missing something.
> Would really appreciate the help.
>
>
> Thanks,
> Divya
>


Sorting a DStream and taking topN

2016-08-07 Thread Ahmed El-Gamal
I have a DStream in Spark (Scala) and I want to sort it and then take the top
N. The problem is that whenever I try to run it I get a
NotSerializableException, and the exception message says:

This is because the DStream object is being referred to from within the
closure.

The problem is that I don't know how to solve it.

My attempt is attached to the e-mail.

I don't mind other ways of sorting a DStream and getting its top N instead of
my approach.
package com.badrit.realtime

import java.util.Date

import com.badrit.drivers.UnlimitedSpaceTimeDriver
import com.badrit.model.{CellBuilder, DataReader, Trip}
import com.badrit.utility.Printer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.{Duration, Milliseconds, StreamingContext}

import scala.collection.mutable

object StreamingDriver {
  val appName: String = "HotSpotRealTime"
  val hostName = "localhost"
  val port = 5050
  val constrains = UnlimitedSpaceTimeDriver.constrains;
  var streamingRate = 1;
  var windowSize = 8;
  var slidingInterval = 2;
  val cellBuilder = new CellBuilder(constrains)
  val inputFilePath = "/home/ahmedelgamal/Downloads/green_tripdata_2015-02.csv"

  def prepareTestData(sparkStreamCtx: StreamingContext): InputDStream[Trip] = {
    val sparkCtx = sparkStreamCtx.sparkContext
    val textFile: RDD[String] = sparkCtx.textFile(inputFilePath)
    val data: RDD[Trip] = new DataReader().getTrips(textFile)
    val groupedData = data.filter(_.pickup.date.before(new Date(2015, 1, 2, 0, 0, 0)))
      .groupBy(trip => trip.pickup.date.getMinutes).sortBy(_._1).map(_._2).collect()

    printf("Grouped Data Count is " + groupedData.length)
    var dataQueue: mutable.Queue[RDD[Trip]] = mutable.Queue.empty;

    groupedData.foreach(trips => dataQueue += sparkCtx.makeRDD(trips.toArray))
    printf("\n\nTest Queue size is " + dataQueue.size)

    groupedData.zipWithIndex.foreach { case (trips: Iterable[Trip], index: Int) =>
      println("Items List " + index)

      val passengers: Array[Int] = trips.map(_.passengers).toArray
      val cnt = passengers.length
      println("Sum is " + passengers.sum)
      println("Cnt is " + cnt)

      val passengersRdd = sparkCtx.parallelize(passengers)
      println("Mean " + passengersRdd.mean())
      println("Stdv" + passengersRdd.stdev())
    }
    sparkStreamCtx.queueStream(dataQueue, true)
  }

  def cellCreator(trip: Trip) = cellBuilder.cellForCarStop(trip.pickup)

  def main(args: Array[String]) {
    if (args.length < 1) {
      streamingRate = 1;
      windowSize = 3 // 2 hours 60 * 60 * 1000L
      slidingInterval = 2 // 0.5 hour 60 * 60 * 1000L
    } else {
      streamingRate = args(0).toInt;
      windowSize = args(1).toInt
      slidingInterval = args(2).toInt
    }

    val sparkConf = new SparkConf().setAppName(appName).setMaster("local[*]")
    val sparkStreamCtx = new StreamingContext(sparkConf, Milliseconds(streamingRate))
    sparkStreamCtx.sparkContext.setLogLevel("ERROR")
    sparkStreamCtx.checkpoint("/tmp")

    val data: InputDStream[Trip] = prepareTestData(sparkStreamCtx)
    val dataWindow = data.window(new Duration(windowSize), new Duration(slidingInterval))

    // my main problem lies in the following line
    val newDataWindow = dataWindow.transform(rdd => sparkStreamCtx.sparkContext.parallelize(rdd.take(10)))
    newDataWindow.print

    sparkStreamCtx.start()
    sparkStreamCtx.awaitTerminationOrTimeout(1000)
  }
}
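A possible way around the serialization issue (a sketch, not something proposed in the
thread): inside transform, use the RDD's own SparkContext instead of the StreamingContext
captured from the enclosing scope, and do the sort/take there as well. The ordering by
passenger count is an assumption:

// Sort each windowed RDD and keep the top N, without capturing sparkStreamCtx in the closure.
val topN = 10
val newDataWindow: DStream[Trip] = dataWindow.transform { rdd =>
  val top = rdd.sortBy(trip => trip.passengers, ascending = false).take(topN)
  rdd.sparkContext.parallelize(top)
}
newDataWindow.print()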



[Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Divya Gehlot
Hi,
I have a use case where I need to use the or [||] operator in a filter condition.
It seems it's not working: it takes the condition before the operator and
ignores the other filter condition after the or operator.
Has anybody faced a similar issue?

Pseudo code:
df.filter(col("colName").notEqual("no_value") ||
col("colName").notEqual(""))

Am I missing something.
Would really appreciate the help.


Thanks,
Divya


Re: Help testing the Spark Extensions for the Apache Bahir 2.0.0 release

2016-08-07 Thread Luciano Resende
Simple, just help us test the available extensions using Spark 2.0.0...
preferably in real workloads that you might be using in your day-to-day
usage of Spark.

I wrote a quick getting started for using the new MQTT Structured Streaming
on my blog, which can serve as an example:

http://lresende.blogspot.gr/2016/08/getting-started-with-apache-bahir-mqtt_4.html


On Sun, Aug 7, 2016 at 11:24 AM, Sivakumaran S  wrote:

> Hi,
>
> How can I help?
>
> regards,
>
> Sivakumaran S
>
> On 06-Aug-2016, at 6:18 PM, Luciano Resende  wrote:
>
> Apache Bahir is voting its 2.0.0 release based on Apache Spark 2.0.0.
>
> https://www.mail-archive.com/dev@bahir.apache.org/msg00312.html
>
> We appreciate any help reviewing/testing the release, which contains the
> following Apache Spark extensions:
>
> Akka DStream connector
> MQTT DStream connector
> Twitter DStream connector
> ZeroMQ DStream connector
>
> MQTT Structured Streaming
>
> Thanks in advance
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Help testing the Spark Extensions for the Apache Bahir 2.0.0 release

2016-08-07 Thread Sivakumaran S
Hi,

How can I help? 

regards,

Sivakumaran S
> On 06-Aug-2016, at 6:18 PM, Luciano Resende  wrote:
> 
> Apache Bahir is voting its 2.0.0 release based on Apache Spark 2.0.0.
> 
> https://www.mail-archive.com/dev@bahir.apache.org/msg00312.html 
> 
> 
> We appreciate any help reviewing/testing the release, which contains the 
> following Apache Spark extensions:
> 
> Akka DStream connector
> MQTT DStream connector
> Twitter DStream connector
> ZeroMQ DStream connector
> 
> MQTT Structured Streaming
> 
> Thanks in advance
> 
> -- 
> Luciano Resende
> http://twitter.com/lresende1975 
> http://lresende.blogspot.com/