Difference between Typed and untyped transformation in dataset API

2019-02-21 Thread Akhilanand
What is the key difference between typed and untyped transformations in the
Dataset API?
How do I determine whether a transformation is typed or untyped?
Are there any gotchas about when to use which, apart from picking whichever one
gets the job done?


Re: Spark-hive integration on HDInsight

2019-02-21 Thread amit kumar singh
Hey Jay,

How are you creating your cluster? Are you using a Spark cluster?

All of this should be set up automatically.



Sent from my iPhone

> On Feb 21, 2019, at 12:12 PM, Felix Cheung  wrote:
> 
> You should check with HDInsight support
> 


Re: Spark-hive integration on HDInsight

2019-02-21 Thread Felix Cheung
You should check with HDInsight support


From: Jay Singh 
Sent: Wednesday, February 20, 2019 11:43:23 PM
To: User
Subject: Spark-hive integration on HDInsight

I am trying to integrate Spark with Hive on an HDInsight Spark cluster.
I copied hive-site.xml into the spark/conf directory. In addition, I added the Hive
metastore properties, such as the JDBC connection info, in Ambari as well. But the
databases and tables created using spark-sql are still not visible in Hive. I also
changed the ‘spark.sql.warehouse.dir’ value to point to the Hive warehouse directory.
Spark does work with Hive when LLAP is not turned on. What am I missing in the
configuration to integrate Spark with Hive? Any pointer will be appreciated.

thx
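
For comparison, a minimal sketch of the usual programmatic setup (warehouse path,
database and table names below are placeholders; the real metastore connection
settings still come from the cluster's hive-site.xml / Ambari):

import org.apache.spark.sql.SparkSession

object HiveIntegrationSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() makes Spark use the Hive metastore configured in the
    // hive-site.xml on its classpath (spark/conf). The warehouse path, database
    // and table names below are placeholders.
    val spark = SparkSession.builder()
      .appName("spark-hive-sketch")
      .config("spark.sql.warehouse.dir", "/hive/warehouse") // placeholder path
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
    spark.sql("CREATE TABLE IF NOT EXISTS demo_db.demo_table (id INT) STORED AS PARQUET")
    spark.sql("SHOW TABLES IN demo_db").show()

    spark.stop()
  }
}

If Spark does not pick up the cluster's metastore URI (hive.metastore.uris), it
silently falls back to a local Derby metastore, which is a common reason why
tables created from Spark never show up in Hive.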


Re: Spark Streaming - Problem managing Kafka offsets: job starts from the beginning.

2019-02-21 Thread Gabor Somogyi
From the info you've provided there is not much to say.
Maybe you could collect a sample app, logs etc., open a JIRA and we can take a
deeper look at it...

BR,
G


On Thu, Feb 21, 2019 at 4:14 PM Guillermo Ortiz 
wrote:

> I'm working with Spark Streaming 2.0.2 and Kafka 1.0.0, using the Direct Stream
> connector. I consume data from Kafka and save the offsets automatically.
> I can see Spark committing the last processed offsets in the logs.
> Sometimes when I restart Spark it starts from the beginning, even though I'm
> using the same groupId.
>
> Why could this happen? It only happens rarely.
>


Spark Streaming - Problem managing Kafka offsets: job starts from the beginning.

2019-02-21 Thread Guillermo Ortiz
I'm working with Spark Streaming 2.0.2 and Kafka 1.0.0, using the Direct Stream
connector. I consume data from Kafka and save the offsets automatically.
I can see Spark committing the last processed offsets in the logs.
Sometimes when I restart Spark it starts from the beginning, even though I'm
using the same groupId.

Why could this happen? It only happens rarely.
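
For reference, a minimal sketch of the offset-commit pattern from the
spark-streaming-kafka-0-10 direct stream integration, with placeholder broker,
topic and group id: Kafka's auto-commit is disabled and the processed offset
ranges are committed back to Kafka only after each batch, so a restart with the
same groupId should resume from the last committed offsets (as long as they have
not expired from Kafka's offset retention).

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object OffsetCommitSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("offset-commit-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder broker, topic and group id.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest",
      // Let the application, not the Kafka consumer, decide when to commit.
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("example-topic"), kafkaParams)
    )

    stream.foreachRDD { rdd =>
      // Capture this batch's offset ranges before any transformation.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreach(record => println(record.value())) // placeholder processing
      // Commit only after the batch has been processed.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}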


Structured streaming performance issues

2019-02-21 Thread gvdongen
Hi everyone, 
I have the following pipeline:
Ingest 2 streams from Kafka -> parse JSON -> join both streams -> aggregate
on a key over the last second -> output to Kafka
with:
Join: inner join over an interval of one second, with a watermark of 50 ms
Aggregation: tumbling window of one second, with a watermark of 50 ms
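
(For context, a rough sketch of how such a pipeline is typically expressed in
Structured Streaming; the broker, topic names, schema, key column, checkpoint
path and trigger below are placeholders rather than the actual job.)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

object JoinAggregateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-aggregate-sketch").getOrCreate()
    import spark.implicits._

    // Placeholder schema shared by both input streams.
    val schema = new StructType()
      .add("key", StringType)
      .add("value", DoubleType)
      .add("eventTime", TimestampType)

    // Read one Kafka topic and parse the JSON payload (broker/topic are placeholders).
    def readTopic(topic: String) =
      spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", topic)
        .load()
        .select(from_json($"value".cast("string"), schema).as("j"))
        .select("j.*")

    val left = readTopic("topic-a")
      .withWatermark("eventTime", "50 milliseconds")
    val right = readTopic("topic-b")
      .withColumnRenamed("key", "rKey")
      .withColumnRenamed("value", "rValue")
      .withColumnRenamed("eventTime", "rEventTime")
      .withWatermark("rEventTime", "50 milliseconds")

    // Inner stream-stream join constrained to a one-second event-time interval.
    val joined = left.join(
      right,
      expr("""key = rKey AND
              rEventTime BETWEEN eventTime - interval 1 second
                             AND eventTime + interval 1 second"""))

    // Tumbling one-second window per key on top of the join.
    val aggregated = joined
      .groupBy(window($"eventTime", "1 second"), $"key")
      .agg(avg($"rValue").as("avgValue"))

    val query = aggregated
      .select(to_json(struct($"window", $"key", $"avgValue")).as("value"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "output-topic")
      .option("checkpointLocation", "/tmp/checkpoints/join-agg") // placeholder
      .outputMode("append")
      .trigger(Trigger.ProcessingTime(0)) // process micro-batches as fast as possible
      .start()

    query.awaitTermination()
  }
}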

Data is published at a constant rate of 400 events/second, and all joins and
aggregations are done at a one-second granularity.

The median end-to-end latency of an event is 5 seconds and the p99 latency is
7500 ms. I was wondering why this is so high and investigated where it comes
from:

- My output trigger is set to 0, so it should process as fast as possible, yet it
still creates batches at an average interval of 1-2 seconds.
- If processing a batch takes 1-2 seconds, does this mean that all events are
published in one burst at the end of those two seconds?
- Garbage collection is under control since I switched to G1GC and since I
changed spark.sql.streaming.minBatchesToRetain to 2. Practically all of the
time goes to executor computing time.
- The watermark: I put the watermark at 50 ms for the join and the aggregation
inputs. If I look at the micro-batch execution progress (attachment
progress.txt), I see that the event-time watermark is less than the minimum
event time of the batch and 2 seconds less than the maximum event time of the
batch. Therefore, I was wondering whether this means that events of batch t1
will only be sent out after processing batch t2, i.e. two seconds later? Or
when would the watermark update?
- If I look at the number of input records in the query progress, the number is
exactly two times the expected amount. I read somewhere this could be the case
if you use two sinks, but this is not what I am doing. Are there other reasons
this behavior might occur?

Could it mean that the median 5 sec latency comes from the following factors:

  <~1 sec wait until the event is picked up in a micro-batch>
+ <2 sec processing time>
+ <2 sec delay before watermark advances far enough>
= 5 seconds

If this is a plausible cause, what can I do to increase performance?
As a reference frame, the exact same pipeline in Spark Streaming has a median
latency of 760 ms and a p99 of 2300 ms.

Thank you in advance!






