How to PushDown ParquetFilter Spark 2.0.1 dataframe

2017-03-30 Thread Rahul Nandi
Hi,
I have around 2 million records stored as a Parquet file in S3. The file
structure is somewhat like:

id data
1  abc
2  cdf
3  fas

Now I want to filter and take the records where the id matches one of my
required ids.

val requiredDataId = Array(1, 2) // Might go up to 100s of records.

df.filter(requiredDataId.contains("id"))

This is my use case.

What would be the best way to do this in Spark 2.0.1 so that the filter is
also pushed down to Parquet?
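
For reference, one way to express this so the Parquet reader at least gets a
chance to skip data is to build the membership test as a Column predicate
rather than a Scala Array.contains (a sketch; whether the resulting In filter
actually shows up under PushedFilters depends on the Spark/Parquet version,
so verify with explain):

import org.apache.spark.sql.functions.col

val requiredDataId = Array(1, 2)               // might grow to 100s of ids

// isin builds an In predicate that Catalyst can offer to the data source.
val filtered = df.filter(col("id").isin(requiredDataId: _*))

// Check what was actually pushed: look for "PushedFilters: [...]" in the plan.
filtered.explain(true)

// If In is not pushed in your version, a disjunction of equalities
// (EqualTo/Or filters) is another form worth trying:
val byEquality = df.filter(requiredDataId.map(v => col("id") === v).reduce(_ || _))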



Thanks and Regards,
Rahul


Predicate not getting pushed down to PrunedFilteredScan

2017-03-30 Thread Hanumath Rao Maduri
Hello All,

I am working on creating a new PrunedFilteredScan operator which has the
ability to execute the predicates pushed to this operator.

However, what I observed is that if a column deep in the hierarchy is used,
then the predicate is not getting pushed down.

SELECT tom._id, tom.address.city from tom where tom.address.city = "Peter"

Here predicate tom.address.city = "Peter" is not getting pushed down.

However, if a first-level column name is used, then it is getting pushed
down.

SELECT tom._id, tom.address.city from tom where tom.first_name = "Peter"


Please let me know what the issue is in this case.
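
For reference, this matches how the 2.x data source API behaves: only
predicates on top-level columns are translated into
org.apache.spark.sql.sources.Filter objects, so a comparison on a nested
field such as address.city never reaches buildScan and is evaluated by Spark
on top of the scan instead. A minimal sketch (schema and names are
illustrative):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types._

class TomRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("_id", StringType),
    StructField("first_name", StringType),
    StructField("address", StructType(Seq(StructField("city", StringType))))))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // WHERE first_name = 'Peter'   -> filters contains EqualTo("first_name", "Peter")
    // WHERE address.city = 'Peter' -> filters is empty; the nested predicate is
    //                                 applied by Spark after this scan returns rows.
    println(filters.mkString("filters: [", ", ", "]"))
    sqlContext.sparkContext.emptyRDD[Row]
  }
}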

Thanks,


Re: dataframe filter, unable to bind variable

2017-03-30 Thread hosur narahari
Try lit(fromDate) and lit(toDate). You have to import
org.apache.spark.sql.functions.lit to use it.
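
For reference, a small sketch (assumes a DataFrame df with a createdate
column and import spark.implicits._ for the $ syntax; variable names are
illustrative):

import org.apache.spark.sql.functions.lit

val fromDate = "2017-03-20"
val toDate   = "2017-03-22"

// Column.between accepts plain values and wraps them in literals itself,
// so either form should work:
df.filter($"createdate".between(fromDate, toDate))
df.filter($"createdate".between(lit(fromDate), lit(toDate)))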

On 31 Mar 2017 7:45 a.m., "shyla deshpande" 
wrote:

The following works

df.filter($"createdate".between("2017-03-20", "2017-03-22"))


I would like to pass variables fromdate and todate to the filter

 instead of constants. Unable to get the syntax right. Please help.


Thanks


Re: Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread Sathish Kumaran Vairavelu
Also, is it possible to cache the logical plan and the parsed query so that
they can be reused in subsequent executions? It would improve overall query
performance, particularly in streaming jobs.
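
For reference, there is no documented setting for pinning plans that I am
aware of, but a Dataset's QueryExecution keeps its analyzed, optimized and
physical plans in lazy vals, so a sketch of a workaround (complexSqlText is a
placeholder for the query string) is to build the Dataset once and reuse the
same reference across executions:

val complexDf = spark.sql(complexSqlText)      // parsed and analyzed once

// Force optimization and physical planning up front.
complexDf.queryExecution.executedPlan

// Later actions on this same reference reuse the already-planned query.
val total = complexDf.count()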
On Thu, Mar 30, 2017 at 10:06 PM Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> Hi Ayan,
>
> I have searched Spark configuration options but couldn't find one to pin
> execution plans in memory. Can you please help?
>
>
> Thanks
>
> Sathish
>
> On Thu, Mar 30, 2017 at 9:30 PM ayan guha  wrote:
>
> I think there is an option of pinning execution plans in memory to avoid
> such scenarios
>
> On Fri, Mar 31, 2017 at 1:25 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
> Hi Everyone,
>
> I have complex SQL with approx 2000 lines of code and works with 50+
> tables with 50+ left joins and transformations. All the tables are fully
> cached in Memory with sufficient storage memory and working memory. The
> issue is after the launch of the query for the execution; the query takes
> approximately 40 seconds to appear in the Jobs/SQL in the application UI.
>
> While the execution takes only 25 seconds; the execution is delayed by 40
> seconds by the scheduler so the total runtime of the query becomes 65
> seconds(40s + 25s). Also, there are enough cores available during this wait
> time. I couldn't figure out why DAG scheduler is delaying the execution by
> 40 seconds. Is this due to time taken for Query Parsing and Query Planning
> for the Complex SQL? If thats the case; how do we optimize this Query
> Parsing and Query Planning time in Spark? Any help would be helpful.
>
>
> Thanks
>
> Sathish
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>


Re: Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread Sathish Kumaran Vairavelu
Hi Ayan,

I have searched Spark configuration options but couldn't find one to pin
execution plans in memory. Can you please help?


Thanks

Sathish

On Thu, Mar 30, 2017 at 9:30 PM ayan guha  wrote:

> I think there is an option of pinning execution plans in memory to avoid
> such scenarios
>
> On Fri, Mar 31, 2017 at 1:25 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
> Hi Everyone,
>
> I have complex SQL with approx 2000 lines of code and works with 50+
> tables with 50+ left joins and transformations. All the tables are fully
> cached in Memory with sufficient storage memory and working memory. The
> issue is after the launch of the query for the execution; the query takes
> approximately 40 seconds to appear in the Jobs/SQL in the application UI.
>
> While the execution takes only 25 seconds; the execution is delayed by 40
> seconds by the scheduler so the total runtime of the query becomes 65
> seconds(40s + 25s). Also, there are enough cores available during this wait
> time. I couldn't figure out why DAG scheduler is delaying the execution by
> 40 seconds. Is this due to time taken for Query Parsing and Query Planning
> for the Complex SQL? If thats the case; how do we optimize this Query
> Parsing and Query Planning time in Spark? Any help would be helpful.
>
>
> Thanks
>
> Sathish
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread ayan guha
I think there is an option of pinning execution plans in memory to avoid
such scenarios

On Fri, Mar 31, 2017 at 1:25 PM, Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> Hi Everyone,
>
> I have complex SQL with approx 2000 lines of code and works with 50+
> tables with 50+ left joins and transformations. All the tables are fully
> cached in Memory with sufficient storage memory and working memory. The
> issue is after the launch of the query for the execution; the query takes
> approximately 40 seconds to appear in the Jobs/SQL in the application UI.
>
> While the execution takes only 25 seconds; the execution is delayed by 40
> seconds by the scheduler so the total runtime of the query becomes 65
> seconds(40s + 25s). Also, there are enough cores available during this wait
> time. I couldn't figure out why DAG scheduler is delaying the execution by
> 40 seconds. Is this due to time taken for Query Parsing and Query Planning
> for the Complex SQL? If thats the case; how do we optimize this Query
> Parsing and Query Planning time in Spark? Any help would be helpful.
>
>
> Thanks
>
> Sathish
>



-- 
Best Regards,
Ayan Guha


Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread Sathish Kumaran Vairavelu
Hi Everyone,

I have a complex SQL query of approximately 2000 lines that works with 50+ tables,
50+ left joins, and a number of transformations. All the tables are fully cached in
memory with sufficient storage and working memory. The issue is that after the query
is launched for execution, it takes approximately 40 seconds to appear under Jobs/SQL
in the application UI.

While the execution itself takes only 25 seconds, it is delayed by 40 seconds before
the scheduler starts it, so the total runtime of the query becomes 65 seconds
(40s + 25s). Also, there are enough cores available during this wait time. I couldn't
figure out why the DAG scheduler is delaying the execution by 40 seconds. Is this due
to the time taken for query parsing and query planning for the complex SQL? If that's
the case, how do we optimize query parsing and planning time in Spark? Any help would
be appreciated.
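
For reference, one way to confirm whether the 40 seconds really go into
parsing/planning rather than scheduling is to time the phases separately
(a sketch; complexSqlText is a placeholder for the query string):

def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
  result
}

val df = time("parse + analyze")(spark.sql(complexSqlText))
time("optimize + plan")(df.queryExecution.executedPlan)
time("execute")(df.count())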


Thanks

Sathish


dataframe filter, unable to bind variable

2017-03-30 Thread shyla deshpande
The following works

df.filter($"createdate".between("2017-03-20", "2017-03-22"))


I would like to pass variables fromdate and todate to the filter instead of
constants. I am unable to get the syntax right. Please help.


Thanks


Re: Will the setting for spark.default.parallelism be used for spark.sql.shuffle.output.partitions?

2017-03-30 Thread shyla deshpande
The spark version I am using is spark 2.1.

On Thu, Mar 30, 2017 at 9:58 AM, shyla deshpande 
wrote:

> Thanks
>


Looking at EMR Logs

2017-03-30 Thread Paul Tremblay
I am looking for tips on evaluating my Spark job after it has run.

I know that right now I can look at the history of jobs through the web ui.
I also know how to look at the current resources being used by a similar
web ui.

However, I would like to look at the logs after the job is finished to
evaluate such things as how many tasks were completed, how many executors
were used, etc. I currently save my logs to S3.
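
For reference, the post-mortem view comes from Spark event logs plus the
Spark History Server rather than from the stdout/stderr files; a sketch of
enabling it (the S3 path is illustrative, and on EMR a history server is
normally already running against the cluster's default event log directory):

import org.apache.spark.sql.SparkSession

// Write event logs somewhere durable so the finished job can be replayed
// later (tasks completed, executors used, per-stage timings, etc.).
val spark = SparkSession.builder()
  .appName("my-emr-job")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "s3://my-bucket/spark-event-logs/")
  .getOrCreate()

// A history server configured with
//   spark.history.fs.logDirectory=s3://my-bucket/spark-event-logs/
// will list this application once it finishes.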

Thanks!

Henry

-- 
Paul Henry Tremblay
Robert Half Technology


spark 2 and kafka consumer with ssl/kerberos

2017-03-30 Thread bilsch
Ok, forgive me if this ends up being a duplicate posting; I've emailed it
twice and it never shows up!

---

I'm working on a PoC Spark job to pull data from a Kafka topic with
Kerberos-enabled (required) brokers.

The code seems to connect to kafka and enter a polling mode. When I toss
something onto the topic I get an exception which I just can't seem to
figure out. Any ideas?

I have a full gist up at
https://gist.github.com/bilsch/17f4a4c4303ed3e004e2234a5904f0de with a lot
of details. If I use the hdfs/spark client code for just normal operations
everything works fine but for some reason the streaming code is having
issues. I have verified the KafkaClient object is in the jaas config. The
keytab is good etc.

Guessing I'm doing something wrong I just have not figured out what yet! Any
thoughts?

The exception:

17/03/30 12:54:00 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
host5.some.org.net): org.apache.kafka.common.KafkaException: Failed to
construct kafka consumer
        at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:702)
        at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:557)
        at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:540)
        at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.<init>(CachedKafkaConsumer.scala:47)
        at org.apache.spark.streaming.kafka010.CachedKafkaConsumer$.get(CachedKafkaConsumer.scala:157)
        at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.<init>(KafkaRDD.scala:210)
        at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:185)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Jaas configuration not found
        at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:86)
        at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:70)
        at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:83)
        at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:623)
        ... 14 more
Caused by: org.apache.kafka.common.KafkaException: Jaas configuration not found
        at org.apache.kafka.common.security.kerberos.KerberosLogin.getServiceName(KerberosLogin.java:299)
        at org.apache.kafka.common.security.kerberos.KerberosLogin.configure(KerberosLogin.java:103)
        at org.apache.kafka.common.security.authenticator.LoginManager.<init>(LoginManager.java:45)
        at org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:68)
        at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:78)
        ... 17 more
Caused by: java.io.IOException: Could not find a 'KafkaClient' entry in this configuration.
        at org.apache.kafka.common.security.JaasUtils.jaasConfig(JaasUtils.java:50)
        at org.apache.kafka.common.security.kerberos.KerberosLogin.getServiceName(KerberosLogin.java:297)
        ... 21 more
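
For reference, the "Could not find a 'KafkaClient' entry" is raised inside a
lost task, i.e. on an executor, so the JAAS file (and the keytab it
references) has to be visible to the executor JVMs, not only to the driver.
A rough sketch of one common way to arrange that on YARN (all paths and names
below are illustrative, and the same settings are more often passed as
spark-submit --files/--conf flags):

import org.apache.spark.sql.SparkSession

// Ship jaas.conf and keytab to every container and point the executor JVMs at
// the shipped copy; files distributed via spark.files land in each container's
// working directory, hence the relative path on the executor side.
val spark = SparkSession.builder()
  .appName("kafka-kerberos-poc")
  .config("spark.files", "/local/path/kafka_client_jaas.conf,/local/path/user.keytab")
  .config("spark.executor.extraJavaOptions",
    "-Djava.security.auth.login.config=./kafka_client_jaas.conf")
  .getOrCreate()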






Re: Spark streaming + kafka error with json library

2017-03-30 Thread Srikanth
Thanks for the tip. That worked. When would one use the assembly?
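
For reference, a build.sbt sketch of what that looks like (Scala 2.11 /
Spark 2.1.0, with the json4s override from the original message); the
-assembly artifact bundles copies of its transitive dependencies, which is
what produced the deduplicate error quoted below:

// build.sbt sketch: depend on the plain integration artifact, not the assembly.
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0"
dependencyOverrides += "org.json4s" %% "json4s-native" % "3.2.10"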

On Wed, Mar 29, 2017 at 7:13 PM, Tathagata Das 
wrote:

> Try depending on "spark-streaming-kafka-0-10_2.11" (not the assembly)
>
> On Wed, Mar 29, 2017 at 9:59 AM, Srikanth  wrote:
>
>> Hello,
>>
>> I'm trying to use "org.json4s" % "json4s-native" library in a spark
>> streaming + kafka direct app.
>> When I use the latest version of the lib I get an error similar to this
>> 
>> The work around suggest there is to use version 3.2.10. As spark has a
>> hard dependency on this version.
>>
>> I forced this version in SBT with
>> dependencyOverrides += "org.json4s" %% "json4s-native" % "3.2.10"
>>
>> But now it seems to have some conflict with spark-streaming-kafka-0-10-ass
>> embly
>>
>> [error] (*:assembly) deduplicate: different file contents found in the
>> following:
>>
>> [error] C:\Users\stati\.ivy2\cache\org.apache.spark\spark-streaming-
>> kafka-0-10-assembly_2.11\jars\spark-streaming-kafka-0-10-
>> assembly_2.11-2.1.0.jar:scala/util/parsing/combinator/Implic
>> itConversions$$anonfun$flatten2$1.class
>>
>> [error] C:\Users\stati\.ivy2\cache\org.scala-lang.modules\scala-pars
>> er-combinators_2.11\bundles\scala-parser-combinators_2.11-
>> 1.0.4.jar:scala/util/parsing/combinator/ImplicitConversions
>> $$anonfun$flatten2$1.class
>>
>> DependencyTree didn't show spark-streaming-kafka-0-10-assembly pulling
>> json4s-native.
>> Any idea how to resolve this? I'm using spark version 2.1.0
>>
>> Thanks,
>> Srikanth
>>
>
>


Re: httpclient conflict in spark

2017-03-30 Thread Arvind Kandaswamy
Hi Steve,

I was indeed using Spark 2.1. I was getting this error while calling Spark
via Zeppelin; Zeppelin apparently comes with an older version of httpclient.
I copied httpclient 4.5.2 and httpclient 4.2.2 into
zeppelin/interpreter/spark and the problem went away.


Thank you for your help.



On Thu, Mar 30, 2017 at 9:55 AM, Steve Loughran 
wrote:

>
> On 29 Mar 2017, at 14:42, Arvind Kandaswamy 
> wrote:
>
> Hello,
>
> I am getting the following error. I get this error when trying to use AWS
> S3. This appears to be a conflict with httpclient. AWS S3 comes with
> httplient-4.5.2.jar. I am not sure how to force spark to use this version.
> I have tried spark.driver.userClassPathFirst = true, spark.executor.
> userClassPathFirst=true. Did not help. I am using Zeppelin to call the
> spark engine in case if that is an issue.
>
>
> Spark 2.x ships with httpclient 4.5.2; there is no conflict there. if you
> are on 1.6, you could actually try bumping the httplient and httpcomponents
>  to be consistent
>
> 
> httpclient: 4.5.2
> httpcomponents (httpcore): 4.4.4
>
> its important to have org.apache.httpcomponents / httpcore in sync with
> httpclient; it's probably there where your problem is arising
>
> Is there anything else that I can try?
>
> java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V
> at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.<init>(SdkTLSSocketFactory.java:56)
> at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.getPreferredSocketFactory(ApacheConnectionManagerFactory.java:92)
> at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:65)
> at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:58)
> at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:51)
> at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:39)
> at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:314)
> at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:298)
> at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:165)
> at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:583)
> at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:563)
> at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:541)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)
>
>
>


Re: Why VectorUDT private?

2017-03-30 Thread Koert Kuipers
sorry meant to say:
we know when we upgrade that we might run into minor inconveniences that
are completely our own doing/fault.

also, with yarn it has become really easy to run against an exact spark
version of our choosing, since there is no longer such a thing as a
centrally managed spark distro on the cluster.

On Thu, Mar 30, 2017 at 1:34 PM, Koert Kuipers  wrote:

> i agree with that.
>
> we work within that assumption. we compile and run against a single exact
> spark version. we know when we upgrade that we might run into minor
> inconveniences that our completely our own doing/fault. the trade off has
> been totally worth it to me.
>
>
> On Thu, Mar 30, 2017 at 1:20 PM, Michael Armbrust 
> wrote:
>
>> I think really the right way to think about things that are marked
>> private is, "this may disappear or change in a future minor release".  If
>> you are okay with that, working about the visibility restrictions is
>> reasonable.
>>
>> On Thu, Mar 30, 2017 at 5:52 AM, Koert Kuipers  wrote:
>>
>>> I stopped asking long time ago why things are private in spark... I
>>> mean... The conversion between ml and mllib vectors is private... the
>>> conversion between spark vector and breeze used to be (or still is?)
>>> private. it just goes on. Lots of useful stuff is private[SQL].
>>>
>>> Luckily there are simple ways to get around these visibility restrictions
>>>
>>> On Mar 29, 2017 22:57, "Ryan"  wrote:
>>>
 I'm writing a transformer and the input column is vector type(which is
 the output column from other transformer). But as the VectorUDT is private,
 how could I check/transform schema for the vector column?

>>>
>>
>


Re: Why VectorUDT private?

2017-03-30 Thread Koert Kuipers
i agree with that.

we work within that assumption. we compile and run against a single exact
spark version. we know when we upgrade that we might run into minor
inconveniences that our completely our own doing/fault. the trade off has
been totally worth it to me.


On Thu, Mar 30, 2017 at 1:20 PM, Michael Armbrust 
wrote:

> I think really the right way to think about things that are marked private
> is, "this may disappear or change in a future minor release".  If you are
> okay with that, working about the visibility restrictions is reasonable.
>
> On Thu, Mar 30, 2017 at 5:52 AM, Koert Kuipers  wrote:
>
>> I stopped asking long time ago why things are private in spark... I
>> mean... The conversion between ml and mllib vectors is private... the
>> conversion between spark vector and breeze used to be (or still is?)
>> private. it just goes on. Lots of useful stuff is private[SQL].
>>
>> Luckily there are simple ways to get around these visibility restrictions
>>
>> On Mar 29, 2017 22:57, "Ryan"  wrote:
>>
>>> I'm writing a transformer and the input column is vector type(which is
>>> the output column from other transformer). But as the VectorUDT is private,
>>> how could I check/transform schema for the vector column?
>>>
>>
>


Re: Why VectorUDT private?

2017-03-30 Thread Michael Armbrust
I think really the right way to think about things that are marked private
is, "this may disappear or change in a future minor release".  If you are
okay with that, working around the visibility restrictions is reasonable.

On Thu, Mar 30, 2017 at 5:52 AM, Koert Kuipers  wrote:

> I stopped asking long time ago why things are private in spark... I
> mean... The conversion between ml and mllib vectors is private... the
> conversion between spark vector and breeze used to be (or still is?)
> private. it just goes on. Lots of useful stuff is private[SQL].
>
> Luckily there are simple ways to get around these visibility restrictions
>
> On Mar 29, 2017 22:57, "Ryan"  wrote:
>
>> I'm writing a transformer and the input column is vector type(which is
>> the output column from other transformer). But as the VectorUDT is private,
>> how could I check/transform schema for the vector column?
>>
>


Will the setting for spark.default.parallelism be used for spark.sql.shuffle.output.partitions?

2017-03-30 Thread shyla deshpande
Thanks


Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Pierce Lamb
SnappyData should work well for what you want, it deeply integrates an
in-memory database with Spark which supports ingesting streaming data and
concurrently querying it from a dashboard. SnappyData currently has an
integration with Apache Zeppelin (notebook visualization) and soon it will
have one with Tableau. Finally, the database and Spark executors share the
same JVM in SnappyData, so you will get better performance than any
database that works with Spark over a connector.

Repo: https://github.com/SnappyDataInc/snappydata

Using it with Zeppelin:
https://github.com/SnappyDataInc/snappydata/blob/master/docs/aqp_aws.md#using-apache-zeppelin

Hope this helps

On Thu, Mar 30, 2017 at 7:56 AM, Alonso Isidoro Roman 
wrote:

> you can check if you want this link
> 
>
>
> elastic, kibana and spark working together.
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> 
>
> 2017-03-30 16:25 GMT+02:00 Miel Hostens :
>
>> We're doing exactly same thing over here!
>>
>> Spark + ELK
>>
>> Met vriendelijk groeten,
>>
>> *Miel Hostens*,
>>
>> *B**uitenpraktijk+*
>> Department of Reproduction, Obstetrics and Herd Health
>> 
>>
>> Ambulatory Clinic
>> Faculty of Veterinary Medicine 
>> Ghent University  
>> Salisburylaan 133
>> 9820 Merelbeke
>> Belgium
>>
>>
>>
>> Tel:   +32 9 264 75 28 <+32%209%20264%2075%2028>
>>
>> Fax:  +32 9 264 77 97 <+32%209%20264%2077%2097>
>>
>> Mob: +32 478 59 37 03 <+32%20478%2059%2037%2003>
>>
>>  e-mail: miel.host...@ugent.be
>>
>> On 30 March 2017 at 15:01, Szuromi Tamás  wrote:
>>
>>> For us, after some Spark Streaming transformation, Elasticsearch +
>>> Kibana is a great combination to store and visualize data.
>>> An alternative solution that we use is Spark Streaming put some data
>>> back to Kafka and we consume it with nodejs.
>>>
>>> Cheers,
>>> Tamas
>>>
>>> 2017-03-30 9:25 GMT+02:00 Alonso Isidoro Roman :
>>>
 Read this first:

 http://www.oreilly.com/data/free/big-data-analytics-emerging
 -architecture.csp

 https://www.ijircce.com/upload/2015/august/97_A%20Study.pdf

 http://www.pentaho.com/assets/pdf/CqPxTROXtCpfoLrUi4Bj.pdf

 http://www.gartner.com/smarterwithgartner/six-best-practices
 -for-real-time-analytics/

 https://speakerdeck.com/elasticsearch/using-elasticsearch-lo
 gstash-and-kibana-to-create-realtime-dashboards

 https://www.youtube.com/watch?v=PuvHINcU9DI

 then take a look to

 https://kudu.apache.org/

 Tell us later what you think.




 Alonso Isidoro Roman
 [image: https://]about.me/alonso.isidoro.roman

 

 2017-03-30 7:14 GMT+02:00 Gaurav Pandya :

> Hi Noorul,
>
> Thanks for the reply.
> But then how to build the dashboard report? Don't we need to store
> data anywhere?
> Please suggest.
>
> Thanks.
> Gaurav
>
> On Thu, Mar 30, 2017 at 10:32 AM, Noorul Islam Kamal Malmiyoda <
> noo...@noorul.com> wrote:
>
>> I think better place would be a in memory cache for real time.
>>
>> Regards,
>> Noorul
>>
>> On Thu, Mar 30, 2017 at 10:31 AM, Gaurav1809 
>> wrote:
>> > I am getting streaming data and want to show them onto dashboards
>> in real
>> > time?
>> > May I know how best we can handle these streaming data? where to
>> store? (DB
>> > or HDFS or ???)
>> > I want to give users a real time analytics experience.
>> >
>> > Please suggest possible ways. Thanks.
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/How-best-we-can-store-streaming-data-o
>> n-dashboards-for-real-time-user-experience-tp28548.html
>> > Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>> >
>> > 
>> -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>>
>
>

>>>
>>
>


Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Alonso Isidoro Roman
you can check if you want this link



elastic, kibana and spark working together.

Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman


2017-03-30 16:25 GMT+02:00 Miel Hostens :

> We're doing exactly same thing over here!
>
> Spark + ELK
>
> Met vriendelijk groeten,
>
> *Miel Hostens*,
>
> *B**uitenpraktijk+*
> Department of Reproduction, Obstetrics and Herd Health
> 
>
> Ambulatory Clinic
> Faculty of Veterinary Medicine 
> Ghent University  
> Salisburylaan 133
> 9820 Merelbeke
> Belgium
>
>
>
> Tel:   +32 9 264 75 28 <+32%209%20264%2075%2028>
>
> Fax:  +32 9 264 77 97 <+32%209%20264%2077%2097>
>
> Mob: +32 478 59 37 03 <+32%20478%2059%2037%2003>
>
>  e-mail: miel.host...@ugent.be
>
> On 30 March 2017 at 15:01, Szuromi Tamás  wrote:
>
>> For us, after some Spark Streaming transformation, Elasticsearch + Kibana
>> is a great combination to store and visualize data.
>> An alternative solution that we use is Spark Streaming put some data back
>> to Kafka and we consume it with nodejs.
>>
>> Cheers,
>> Tamas
>>
>> 2017-03-30 9:25 GMT+02:00 Alonso Isidoro Roman :
>>
>>> Read this first:
>>>
>>> http://www.oreilly.com/data/free/big-data-analytics-emerging
>>> -architecture.csp
>>>
>>> https://www.ijircce.com/upload/2015/august/97_A%20Study.pdf
>>>
>>> http://www.pentaho.com/assets/pdf/CqPxTROXtCpfoLrUi4Bj.pdf
>>>
>>> http://www.gartner.com/smarterwithgartner/six-best-practices
>>> -for-real-time-analytics/
>>>
>>> https://speakerdeck.com/elasticsearch/using-elasticsearch-lo
>>> gstash-and-kibana-to-create-realtime-dashboards
>>>
>>> https://www.youtube.com/watch?v=PuvHINcU9DI
>>>
>>> then take a look to
>>>
>>> https://kudu.apache.org/
>>>
>>> Tell us later what you think.
>>>
>>>
>>>
>>>
>>> Alonso Isidoro Roman
>>> [image: https://]about.me/alonso.isidoro.roman
>>>
>>> 
>>>
>>> 2017-03-30 7:14 GMT+02:00 Gaurav Pandya :
>>>
 Hi Noorul,

 Thanks for the reply.
 But then how to build the dashboard report? Don't we need to store data
 anywhere?
 Please suggest.

 Thanks.
 Gaurav

 On Thu, Mar 30, 2017 at 10:32 AM, Noorul Islam Kamal Malmiyoda <
 noo...@noorul.com> wrote:

> I think better place would be a in memory cache for real time.
>
> Regards,
> Noorul
>
> On Thu, Mar 30, 2017 at 10:31 AM, Gaurav1809 
> wrote:
> > I am getting streaming data and want to show them onto dashboards in
> real
> > time?
> > May I know how best we can handle these streaming data? where to
> store? (DB
> > or HDFS or ???)
> > I want to give users a real time analytics experience.
> >
> > Please suggest possible ways. Thanks.
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/How-best-we-can-store-streaming-data-o
> n-dashboards-for-real-time-user-experience-tp28548.html
> > Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >
> > 
> -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>


>>>
>>
>


Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Miel Hostens
We're doing exactly the same thing over here!

Spark + ELK

Met vriendelijk groeten,

*Miel Hostens*,

*B**uitenpraktijk+*
Department of Reproduction, Obstetrics and Herd Health


Ambulatory Clinic
Faculty of Veterinary Medicine 
Ghent University  
Salisburylaan 133
9820 Merelbeke
Belgium



Tel:   +32 9 264 75 28

Fax:  +32 9 264 77 97

Mob: +32 478 59 37 03

 e-mail: miel.host...@ugent.be

On 30 March 2017 at 15:01, Szuromi Tamás  wrote:

> For us, after some Spark Streaming transformation, Elasticsearch + Kibana
> is a great combination to store and visualize data.
> An alternative solution that we use is Spark Streaming put some data back
> to Kafka and we consume it with nodejs.
>
> Cheers,
> Tamas
>
> 2017-03-30 9:25 GMT+02:00 Alonso Isidoro Roman :
>
>> Read this first:
>>
>> http://www.oreilly.com/data/free/big-data-analytics-emerging
>> -architecture.csp
>>
>> https://www.ijircce.com/upload/2015/august/97_A%20Study.pdf
>>
>> http://www.pentaho.com/assets/pdf/CqPxTROXtCpfoLrUi4Bj.pdf
>>
>> http://www.gartner.com/smarterwithgartner/six-best-practices
>> -for-real-time-analytics/
>>
>> https://speakerdeck.com/elasticsearch/using-elasticsearch-
>> logstash-and-kibana-to-create-realtime-dashboards
>>
>> https://www.youtube.com/watch?v=PuvHINcU9DI
>>
>> then take a look to
>>
>> https://kudu.apache.org/
>>
>> Tell us later what you think.
>>
>>
>>
>>
>> Alonso Isidoro Roman
>> [image: https://]about.me/alonso.isidoro.roman
>>
>> 
>>
>> 2017-03-30 7:14 GMT+02:00 Gaurav Pandya :
>>
>>> Hi Noorul,
>>>
>>> Thanks for the reply.
>>> But then how to build the dashboard report? Don't we need to store data
>>> anywhere?
>>> Please suggest.
>>>
>>> Thanks.
>>> Gaurav
>>>
>>> On Thu, Mar 30, 2017 at 10:32 AM, Noorul Islam Kamal Malmiyoda <
>>> noo...@noorul.com> wrote:
>>>
 I think better place would be a in memory cache for real time.

 Regards,
 Noorul

 On Thu, Mar 30, 2017 at 10:31 AM, Gaurav1809 
 wrote:
 > I am getting streaming data and want to show them onto dashboards in
 real
 > time?
 > May I know how best we can handle these streaming data? where to
 store? (DB
 > or HDFS or ???)
 > I want to give users a real time analytics experience.
 >
 > Please suggest possible ways. Thanks.
 >
 >
 >
 >
 > --
 > View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/How-best-we-can-store-streaming-data-o
 n-dashboards-for-real-time-user-experience-tp28548.html
 > Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 >
 > -
 > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
 >

>>>
>>>
>>
>


Re: httpclient conflict in spark

2017-03-30 Thread Steve Loughran

On 29 Mar 2017, at 14:42, Arvind Kandaswamy 
> wrote:

Hello,

I am getting the following error. I get this error when trying to use AWS S3. 
This appears to be a conflict with httpclient. AWS S3 comes with 
httplient-4.5.2.jar. I am not sure how to force spark to use this version. I 
have tried spark.driver.userClassPathFirst = true, 
spark.executor.userClassPathFirst=true. Did not help. I am using Zeppelin to 
call the spark engine in case if that is an issue.


Spark 2.x ships with httpclient 4.5.2; there is no conflict there. If you are 
on 1.6, you could actually try bumping the httpclient and httpcomponents versions 
to be consistent:


httpclient: 4.5.2
httpcomponents (httpcore): 4.4.4

It's important to have org.apache.httpcomponents / httpcore in sync with 
httpclient; that's probably where your problem is arising.
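
If the build is under your control, a small sbt sketch of keeping the two in
sync (versions as above; adjust to whatever your Spark/Hadoop combination
actually ships):

dependencyOverrides ++= Set(
  "org.apache.httpcomponents" % "httpclient" % "4.5.2",
  "org.apache.httpcomponents" % "httpcore"   % "4.4.4"
)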

Is there anything else that I can try?

java.lang.NoSuchMethodError: 
org.apache.http.conn.ssl.SSLConnectionSocketFactory.(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V
at 
com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.(SdkTLSSocketFactory.java:56)
at 
com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.getPreferredSocketFactory(ApacheConnectionManagerFactory.java:92)
at 
com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:65)
at 
com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:58)
at 
com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:51)
at 
com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:39)
at com.amazonaws.http.AmazonHttpClient.(AmazonHttpClient.java:314)
at com.amazonaws.http.AmazonHttpClient.(AmazonHttpClient.java:298)
at com.amazonaws.AmazonWebServiceClient.(AmazonWebServiceClient.java:165)
at com.amazonaws.services.s3.AmazonS3Client.(AmazonS3Client.java:583)
at com.amazonaws.services.s3.AmazonS3Client.(AmazonS3Client.java:563)
at com.amazonaws.services.s3.AmazonS3Client.(AmazonS3Client.java:541)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)



Re: Need help for RDD/DF transformation.

2017-03-30 Thread Yong Zhang
Unfortunately, I don't think there is any optimized way to do this. Maybe someone 
else can correct me, but in theory there is no way other than a cartesian product of 
your two sides if you cannot change the data.


Think about it: if you want to join between two different types (Array and Int in 
your case), Spark cannot do a HashJoin or a SortMergeJoin. In the relational DB 
world, you have to do a nested loop, which is a cartesian join in the big data world. 
After you create a cartesian product of both, you then check whether the Array column 
contains the other column.


I saw a similar question answered on Stack Overflow, but the answer is NOT correct 
for any serious case.

Assume the key never contains any duplicate strings:


scala> sc.version
res1: String = 1.6.3

scala> val df1 = Seq((1, "a"), (3, "a"), (5, "b")).toDF("key", "value")
df1: org.apache.spark.sql.DataFrame = [key: int, value: string]

scala> val df2 = Seq((Array(1,2,3), "other1"), (Array(4,5), "other2")).toDF("keys", "other")
df2: org.apache.spark.sql.DataFrame = [keys: array<int>, other: string]

scala> df1.show
+---+-----+
|key|value|
+---+-----+
|  1|    a|
|  3|    a|
|  5|    b|
+---+-----+

scala> df2.show
+---------+------+
|     keys| other|
+---------+------+
|[1, 2, 3]|other1|
|   [4, 5]|other2|
+---------+------+

scala> df1.join(df2, df2("keys").cast("string").contains(df1("key").cast("string"))).show
+---+-----+---------+------+
|key|value|     keys| other|
+---+-----+---------+------+
|  1|    a|[1, 2, 3]|other1|
|  3|    a|[1, 2, 3]|other1|
|  5|    b|   [4, 5]|other2|
+---+-----+---------+------+

This code looks like it works in Spark 1.6.x, but in fact it has a serious issue: it 
assumes that the keys will never produce conflicting substrings, which won't be true 
for any serious data. For example, [11, 2, 3] contains "1" will return true, which 
breaks your logic.

It also won't work in Spark 2.x, because the cast("string") behavior for Array 
changes in Spark 2.x. The idea behind the whole thing is to convert your Array field 
to a String and use the contains method to check whether it contains the other field 
(also as a String). But that can never match the Array.contains(element) semantics in 
most cases.

You need to know your data and see whether you can find an optimized way to avoid 
the cartesian product. For example, maybe ensure that "key" in DF1 is always the 
first element of the corresponding Array, so you can just pick the first element out 
of the Array "keys" of DF2 to join on. Otherwise, I don't see any way to avoid a 
cartesian join.

Yong


From: Mungeol Heo 
Sent: Thursday, March 30, 2017 3:05 AM
To: ayan guha
Cc: Yong Zhang; user@spark.apache.org
Subject: Re: Need help for RDD/DF transformation.

Hello ayan,

Same key will not exists in different lists.
Which means, If "1" exists in a list, then it will not be presented in
another list.

Thank you.

On Thu, Mar 30, 2017 at 3:56 PM, ayan guha  wrote:
> Is it possible for one key in 2 groups in rdd2?
>
> [1,2,3]
> [1,4,5]
>
> ?
>
> On Thu, 30 Mar 2017 at 12:23 pm, Mungeol Heo  wrote:
>>
>> Hello Yong,
>>
>> First of all, thank your attention.
>> Note that the values of elements, which have values at RDD/DF1, in the
>> same list will be always same.
>> Therefore, the "1" and "3", which from RDD/DF 1, will always have the
>> same value which is "a".
>>
>> The goal here is assigning same value to elements of the list which
>> does not exist in RDD/DF 1.
>> So, all the elements in the same list can have same value.
>>
>> Or, the final RDD/DF also can be like this,
>>
>> [1, 2, 3], a
>> [4, 5], b
>>
>> Thank you again.
>>
>> - Mungeol
>>
>>
>> On Wed, Mar 29, 2017 at 9:03 PM, Yong Zhang  wrote:
>> > What is the desired result for
>> >
>> >
>> > RDD/DF 1
>> >
>> > 1, a
>> > 3, c
>> > 5, b
>> >
>> > RDD/DF 2
>> >
>> > [1, 2, 3]
>> > [4, 5]
>> >
>> >
>> > Yong
>> >
>> > 
>> > From: Mungeol Heo 
>> > Sent: Wednesday, March 29, 2017 5:37 AM
>> > To: user@spark.apache.org
>> > Subject: Need help for RDD/DF transformation.
>> >
>> > Hello,
>> >
>> > Suppose, I have two RDD or data frame like addressed below.
>> >
>> > RDD/DF 1
>> >
>> > 1, a
>> > 3, a
>> > 5, b
>> >
>> > RDD/DF 2
>> >
>> > [1, 2, 3]
>> > [4, 5]
>> >
>> > I need to create a new RDD/DF like below from RDD/DF 1 and 2.
>> >
>> > 1, a
>> > 2, a
>> > 3, a
>> > 4, b
>> > 5, b
>> >
>> > Is there an efficient way to do this?
>> > Any help will be great.
>> >
>> > Thank you.
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
> --
> Best 

Re: apache-spark: Converting List of Rows into Dataset Java

2017-03-30 Thread Karin Valisova
Looks like the parallelization into an RDD was the step I was omitting:

JavaRDD<Row> jsonRDD = new JavaSparkContext(sparkSession.sparkContext()).parallelize(results);

then I created a schema as

List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("column_name1", DataTypes.StringType, true));
// ... one createStructField call per remaining column ...
StructType schema = DataTypes.createStructType(fields);

and then just, voilà! I have my dataset without any null pointer exceptions :)

Dataset<Row> resultDataset = spark.createDataFrame(jsonRDD, schema);

Thanks a lot!!
Have a nice day,
Karin

On Wed, Mar 29, 2017 at 4:17 AM, Richard Xin 
wrote:

> Maybe you could try something like that:
> SparkSession sparkSession = SparkSession
> .builder()
> .appName("Rows2DataSet")
> .master("local")
> .getOrCreate();
> List<Row> results = new LinkedList<>();
> JavaRDD<Row> jsonRDD =
>     new JavaSparkContext(sparkSession.sparkContext()).parallelize(results);
>
> Dataset<Row> peopleDF = sparkSession.createDataFrame(jsonRDD, Row.class);
>
> Richard Xin
>
>
> On Tuesday, March 28, 2017 7:51 AM, Karin Valisova 
> wrote:
>
>
> Hello!
>
> I am running Spark on Java and bumped into a problem I can't solve or find
> anything helpful among answered questions, so I would really appreciate
> your help.
>
> I am running some calculations, creating rows for each result:
>
> List<Row> results = new LinkedList<>();
>
> for(something){
> results.add(RowFactory.create( someStringVariable, someIntegerVariable ));
>  }
>
> Now I ended up with a list of rows I need to turn into dataframe to
> perform some spark sql operations on them, like groupings and sorting.
> Would like to keep the dataTypes.
>
> I tried:
>
> Dataset<Row> toShow = spark.createDataFrame(results, Row.class);
>
> but it throws nullpointer. (spark being SparkSession) Is my logic wrong
> there somewhere, should this operation be possible, resulting in what I
> want?
> Or do I have to create a custom class which extends serializable and
> create a list of those objects rather than Rows? Will I be able to perform
> SQL queries on dataset consisting of custom class objects rather than rows?
>
> I'm sorry if this is a duplicate question.
> Thank you for your help!
> Karin
>
>
>


-- 

datapine GmbH
Skalitzer Straße 33
10999 Berlin

email: ka...@datapine.com


Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Szuromi Tamás
For us, after some Spark Streaming transformation, Elasticsearch + Kibana
is a great combination to store and visualize data.
An alternative solution that we use is to have Spark Streaming put some data
back into Kafka and consume it with nodejs.

Cheers,
Tamas

2017-03-30 9:25 GMT+02:00 Alonso Isidoro Roman :

> Read this first:
>
> http://www.oreilly.com/data/free/big-data-analytics-
> emerging-architecture.csp
>
> https://www.ijircce.com/upload/2015/august/97_A%20Study.pdf
>
> http://www.pentaho.com/assets/pdf/CqPxTROXtCpfoLrUi4Bj.pdf
>
> http://www.gartner.com/smarterwithgartner/six-best-
> practices-for-real-time-analytics/
>
> https://speakerdeck.com/elasticsearch/using-elasticsearch-logstash-and-
> kibana-to-create-realtime-dashboards
>
> https://www.youtube.com/watch?v=PuvHINcU9DI
>
> then take a look to
>
> https://kudu.apache.org/
>
> Tell us later what you think.
>
>
>
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> 
>
> 2017-03-30 7:14 GMT+02:00 Gaurav Pandya :
>
>> Hi Noorul,
>>
>> Thanks for the reply.
>> But then how to build the dashboard report? Don't we need to store data
>> anywhere?
>> Please suggest.
>>
>> Thanks.
>> Gaurav
>>
>> On Thu, Mar 30, 2017 at 10:32 AM, Noorul Islam Kamal Malmiyoda <
>> noo...@noorul.com> wrote:
>>
>>> I think better place would be a in memory cache for real time.
>>>
>>> Regards,
>>> Noorul
>>>
>>> On Thu, Mar 30, 2017 at 10:31 AM, Gaurav1809 
>>> wrote:
>>> > I am getting streaming data and want to show them onto dashboards in
>>> real
>>> > time?
>>> > May I know how best we can handle these streaming data? where to
>>> store? (DB
>>> > or HDFS or ???)
>>> > I want to give users a real time analytics experience.
>>> >
>>> > Please suggest possible ways. Thanks.
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/How-best-we-can-store-streaming-data-o
>>> n-dashboards-for-real-time-user-experience-tp28548.html
>>> > Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>> >
>>> > -
>>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> >
>>>
>>
>>
>


Re: Why VectorUDT private?

2017-03-30 Thread Koert Kuipers
I stopped asking a long time ago why things are private in Spark... I mean...
the conversion between ml and mllib vectors is private... the conversion
between a Spark vector and a Breeze vector used to be (or still is?) private.
It just goes on. Lots of useful stuff is private[sql].

Luckily there are simple ways to get around these visibility restrictions
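
For reference, one route that stays on public API (a sketch; worth
double-checking against your Spark version): org.apache.spark.ml.linalg.SQLDataTypes.VectorType
exposes the vector DataType without referencing VectorUDT, so a custom
transformer can validate its input column against it, e.g. in transformSchema:

import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.StructType

// Check that a column carries ML vectors without touching the private UDT.
def requireVectorColumn(schema: StructType, colName: String): Unit = {
  val dt = schema(colName).dataType
  require(dt == SQLDataTypes.VectorType,
    s"Column $colName must be of vector type but is $dt")
}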

On Mar 29, 2017 22:57, "Ryan"  wrote:

> I'm writing a transformer and the input column is vector type(which is the
> output column from other transformer). But as the VectorUDT is private, how
> could I check/transform schema for the vector column?
>


Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Alonso Isidoro Roman
Read this first:

http://www.oreilly.com/data/free/big-data-analytics-emerging-architecture.csp

https://www.ijircce.com/upload/2015/august/97_A%20Study.pdf

http://www.pentaho.com/assets/pdf/CqPxTROXtCpfoLrUi4Bj.pdf

http://www.gartner.com/smarterwithgartner/six-best-practices-for-real-time-analytics/

https://speakerdeck.com/elasticsearch/using-elasticsearch-logstash-and-kibana-to-create-realtime-dashboards

https://www.youtube.com/watch?v=PuvHINcU9DI

then take a look to

https://kudu.apache.org/

Tell us later what you think.




Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman


2017-03-30 7:14 GMT+02:00 Gaurav Pandya :

> Hi Noorul,
>
> Thanks for the reply.
> But then how to build the dashboard report? Don't we need to store data
> anywhere?
> Please suggest.
>
> Thanks.
> Gaurav
>
> On Thu, Mar 30, 2017 at 10:32 AM, Noorul Islam Kamal Malmiyoda <
> noo...@noorul.com> wrote:
>
>> I think better place would be a in memory cache for real time.
>>
>> Regards,
>> Noorul
>>
>> On Thu, Mar 30, 2017 at 10:31 AM, Gaurav1809 
>> wrote:
>> > I am getting streaming data and want to show them onto dashboards in
>> real
>> > time?
>> > May I know how best we can handle these streaming data? where to store?
>> (DB
>> > or HDFS or ???)
>> > I want to give users a real time analytics experience.
>> >
>> > Please suggest possible ways. Thanks.
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/How-best-we-can-store-streaming-data-
>> on-dashboards-for-real-time-user-experience-tp28548.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>>
>
>


Re: Need help for RDD/DF transformation.

2017-03-30 Thread Mungeol Heo
Hello ayan,

The same key will not exist in different lists.
Which means, if "1" exists in one list, it will not be present in
another list.

Thank you.

On Thu, Mar 30, 2017 at 3:56 PM, ayan guha  wrote:
> Is it possible for one key in 2 groups in rdd2?
>
> [1,2,3]
> [1,4,5]
>
> ?
>
> On Thu, 30 Mar 2017 at 12:23 pm, Mungeol Heo  wrote:
>>
>> Hello Yong,
>>
>> First of all, thank your attention.
>> Note that the values of elements, which have values at RDD/DF1, in the
>> same list will be always same.
>> Therefore, the "1" and "3", which from RDD/DF 1, will always have the
>> same value which is "a".
>>
>> The goal here is assigning same value to elements of the list which
>> does not exist in RDD/DF 1.
>> So, all the elements in the same list can have same value.
>>
>> Or, the final RDD/DF also can be like this,
>>
>> [1, 2, 3], a
>> [4, 5], b
>>
>> Thank you again.
>>
>> - Mungeol
>>
>>
>> On Wed, Mar 29, 2017 at 9:03 PM, Yong Zhang  wrote:
>> > What is the desired result for
>> >
>> >
>> > RDD/DF 1
>> >
>> > 1, a
>> > 3, c
>> > 5, b
>> >
>> > RDD/DF 2
>> >
>> > [1, 2, 3]
>> > [4, 5]
>> >
>> >
>> > Yong
>> >
>> > 
>> > From: Mungeol Heo 
>> > Sent: Wednesday, March 29, 2017 5:37 AM
>> > To: user@spark.apache.org
>> > Subject: Need help for RDD/DF transformation.
>> >
>> > Hello,
>> >
>> > Suppose, I have two RDD or data frame like addressed below.
>> >
>> > RDD/DF 1
>> >
>> > 1, a
>> > 3, a
>> > 5, b
>> >
>> > RDD/DF 2
>> >
>> > [1, 2, 3]
>> > [4, 5]
>> >
>> > I need to create a new RDD/DF like below from RDD/DF 1 and 2.
>> >
>> > 1, a
>> > 2, a
>> > 3, a
>> > 4, b
>> > 5, b
>> >
>> > Is there an efficient way to do this?
>> > Any help will be great.
>> >
>> > Thank you.
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
> --
> Best Regards,
> Ayan Guha




Re: Need help for RDD/DF transformation.

2017-03-30 Thread ayan guha
Is it possible for one key to be in 2 groups in rdd2?

[1,2,3]
[1,4,5]

?

On Thu, 30 Mar 2017 at 12:23 pm, Mungeol Heo  wrote:

> Hello Yong,
>
> First of all, thank your attention.
> Note that the values of elements, which have values at RDD/DF1, in the
> same list will be always same.
> Therefore, the "1" and "3", which from RDD/DF 1, will always have the
> same value which is "a".
>
> The goal here is assigning same value to elements of the list which
> does not exist in RDD/DF 1.
> So, all the elements in the same list can have same value.
>
> Or, the final RDD/DF also can be like this,
>
> [1, 2, 3], a
> [4, 5], b
>
> Thank you again.
>
> - Mungeol
>
>
> On Wed, Mar 29, 2017 at 9:03 PM, Yong Zhang  wrote:
> > What is the desired result for
> >
> >
> > RDD/DF 1
> >
> > 1, a
> > 3, c
> > 5, b
> >
> > RDD/DF 2
> >
> > [1, 2, 3]
> > [4, 5]
> >
> >
> > Yong
> >
> > 
> > From: Mungeol Heo 
> > Sent: Wednesday, March 29, 2017 5:37 AM
> > To: user@spark.apache.org
> > Subject: Need help for RDD/DF transformation.
> >
> > Hello,
> >
> > Suppose, I have two RDD or data frame like addressed below.
> >
> > RDD/DF 1
> >
> > 1, a
> > 3, a
> > 5, b
> >
> > RDD/DF 2
> >
> > [1, 2, 3]
> > [4, 5]
> >
> > I need to create a new RDD/DF like below from RDD/DF 1 and 2.
> >
> > 1, a
> > 2, a
> > 3, a
> > 4, b
> > 5, b
> >
> > Is there an efficient way to do this?
> > Any help will be great.
> >
> > Thank you.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Best Regards,
Ayan Guha


Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Gaurav Pandya
Hi Noorul,

Thanks for the reply.
But then how to build the dashboard report? Don't we need to store data
anywhere?
Please suggest.

Thanks.
Gaurav

On Thu, Mar 30, 2017 at 10:32 AM, Noorul Islam Kamal Malmiyoda <
noo...@noorul.com> wrote:

> I think better place would be a in memory cache for real time.
>
> Regards,
> Noorul
>
> On Thu, Mar 30, 2017 at 10:31 AM, Gaurav1809 
> wrote:
> > I am getting streaming data and want to show them onto dashboards in real
> > time?
> > May I know how best we can handle these streaming data? where to store?
> (DB
> > or HDFS or ???)
> > I want to give users a real time analytics experience.
> >
> > Please suggest possible ways. Thanks.
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/How-best-we-can-store-streaming-
> data-on-dashboards-for-real-time-user-experience-tp28548.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>