Re: MongoDB plugin to Spark - too many open cursors

2020-10-25 Thread lec ssmi
Is the connection pool configured for MongoDB full?

Daniel Stojanov wrote on Monday, 26 October 2020 at 10:28 AM:

> Hi,
>
>
> I receive an error message from the MongoDB server if there are too many
> Spark applications trying to access the database at the same time (about
> 3 or 4): "Cannot open a new cursor since too many cursors are already
> opened." I am not sure how to remedy this, and I am not sure how the
> plugin behaves when it is pulling data.
>
> It appears that a given running application will open many connections
> to the database. The number of open cursors reported by the database is
> far greater than the number of read operations occurring in Spark.
>
>
> Does the plugin keep a connection/cursor open to the database even after
> it has pulled the data into a DataFrame?
>
> Why are there so many open cursors for a single read operation?
>
> Does catching the exception, sleeping for a while, then trying again
> make sense? If cursors are kept open throughout the life of the
> application, this would not make sense.
>
>
> Plugin version: org.mongodb.spark:mongo-spark-connector_2.12:2.4.1
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


MongoDB plugin to Spark - too many open cursors

2020-10-25 Thread Daniel Stojanov

Hi,


I receive an error message from the MongoDB server if there are too many
Spark applications trying to access the database at the same time (about
3 or 4): "Cannot open a new cursor since too many cursors are already
opened." I am not sure how to remedy this, and I am not sure how the
plugin behaves when it is pulling data.


It appears that a given running application will open many connections
to the database. The number of open cursors reported by the database is
far greater than the number of read operations occurring in Spark.
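
For reference, each application reads roughly like this (a simplified
sketch; the host, database, and collection names are placeholders, and the
option names are the ones I understand the 2.x connector documents):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mongo-read")
  // Placeholder URI; host/db/collection are not the real ones.
  .config("spark.mongodb.input.uri",
    "mongodb://mongo-host:27017/mydb.mycollection")
  .getOrCreate()

// The connector plans the read as a set of partitions, and as far as I can
// tell each partition opens its own cursor. Larger partitions should mean
// fewer partitions and therefore fewer simultaneous cursors.
val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("spark.mongodb.input.partitioner", "MongoSamplePartitioner")
  .option("spark.mongodb.input.partitionerOptions.partitionSizeMB", "256")
  .load()

df.count()

If that assumption about one cursor per partition is right, raising
partitionSizeMB (the documented default is 64) should reduce the number of
cursors each read opens, but I have not confirmed this.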



Does the plugin keep a connection/cursor open to the database even after
it has pulled the data into a DataFrame?


Why are there so many open cursors for a single read operation?

Does catching the exception, sleeping for a while, then trying again
make sense? If cursors are kept open throughout the life of the
application, this would not make sense.
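
Concretely, something like the following is what I had in mind (a rough
sketch: I am guessing at matching on the error message text rather than a
specific exception class, and `spark` here is the session configured as
above):

import org.apache.spark.sql.DataFrame
import scala.util.{Failure, Success, Try}

def readWithRetry(maxAttempts: Int = 5): DataFrame = {
  def attempt(n: Int): DataFrame =
    Try {
      val df = spark.read
        .format("com.mongodb.spark.sql.DefaultSource")
        .load()
      df.cache()
      df.count()   // force the read so a cursor error surfaces here, not later
      df
    } match {
      case Success(df) => df
      case Failure(e) if n < maxAttempts && e.getMessage != null &&
          e.getMessage.contains("too many cursors") =>
        Thread.sleep(30000L * n)   // back off before retrying
        attempt(n + 1)
      case Failure(e) => throw e
    }
  attempt(1)
}

But as noted above, this only helps if the connector actually releases its
cursors once the DataFrame has been materialised.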



Plugin version: org.mongodb.spark:mongo-spark-connector_2.12:2.4.1


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to migrate DataSourceV2 into Spark 3.0.0

2020-10-25 Thread rafaelkyrdan
Sorry for the late response.
I was able to migrate my project to Spark 3.0.0.
Here are some hints about what I did:
https://gist.github.com/rafaelkyrdan/2bea8385aadd71be5bf67cddeec59581
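
Roughly, the shape of the read path in the new API looks like this (a
stripped-down sketch rather than the code in the gist; all package, class,
and field names below are made up):

package example.datasource

import java.util

import scala.collection.JavaConverters._

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReader, PartitionReaderFactory, Scan, ScanBuilder}
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Spark 3.0 entry point: TableProvider replaces the old DataSourceV2 +
// ReadSupport mix-ins.
class DefaultSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    StructType(Seq(StructField("id", LongType)))

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new SimpleTable(schema)
}

class SimpleTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "simple_table"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    Set(TableCapability.BATCH_READ).asJava

  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    // ScanBuilder -> Scan -> Batch collapsed into one object for brevity.
    new ScanBuilder with Scan with Batch {
      override def build(): Scan = this
      override def readSchema(): StructType = tableSchema
      override def toBatch(): Batch = this
      override def planInputPartitions(): Array[InputPartition] =
        Array[InputPartition](SimplePartition(0L, 5L))
      override def createReaderFactory(): PartitionReaderFactory =
        new SimpleReaderFactory
    }
}

// Partitions and the reader factory are serialised and shipped to executors.
case class SimplePartition(start: Long, end: Long) extends InputPartition

class SimpleReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    val p = partition.asInstanceOf[SimplePartition]
    new PartitionReader[InternalRow] {
      private var current = p.start - 1
      override def next(): Boolean = { current += 1; current < p.end }
      override def get(): InternalRow = InternalRow(current)
      override def close(): Unit = ()
    }
  }
}

With something like that in place, spark.read.format("example.datasource").load()
should resolve the provider, assuming the class is named DefaultSource in
that package.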



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: mission statement : unified

2020-10-25 Thread Stephen Boesch
While the core of Spark is and has been quite solid and a go-to
infrastructure, the *streaming* part of the story was still quite weak at
least through the middle of last year.  I went into depth on both Structured
Streaming and the older DStream API.  Structured Streaming in particular was
difficult to use, both in terms of limitations on what it supports and in
terms of documentation/examples.  Have there been meaningful advancements in
the past twelve+ months?

On Sun, 25 Oct 2020 at 13:58, Khalid Mammadov wrote:

> Correct. Also as explained in the book Learning Spark 2.0 by Databricks:
>
> Unified Analytics
> While the notion of unification is not unique to Spark, it is a core
> component of its design philosophy and evolution. In November 2016, the
> Association for Computing Machinery (ACM) recognized Apache Spark and
> conferred upon its original creators the prestigious ACM Award for their
> paper describing Apache Spark as a “Unified Engine for Big Data
> Processing.” The award-winning paper notes that Spark replaces all the
> separate batch processing, graph, stream, and query engines like Storm,
> Impala, Dremel, Pregel, etc. with a unified stack of components that
> addresses diverse workloads under a single distributed fast engine.
>
> Khalid
>
> On 19 Oct 2020, at 07:03, Sonal Goyal  wrote:
>
> 
> My thought is that Spark supports analytics for structured and
> unstructured data, batch as well as real time. This was pretty
> revolutionary when Spark first came out. That's where the unified term came
> from I think. Even after all these years, Spark remains the trusted
> framework for enterprise analytics.
>
> On Mon, 19 Oct 2020, 11:24 Gourav Sengupta  wrote:
>
>> Hi,
>>
>> I think that it is just a marketing statement. But with SPARK 3.x, now
>> that you are seeing that SPARK is no more than just another distributed
>> data processing engine, they are trying to join data pre-processing into ML
>> pipelines directly. I may call that unified.
>>
>> But you get the same with several other frameworks as well now so not
>> quite sure how unified creates a unique brand value.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Sun, Oct 18, 2020 at 6:40 PM Hulio andres  wrote:
>>
>>>
>>> Apache Spark's mission statement is "*Apache Spark™* is a unified
>>> analytics engine for large-scale data processing."
>>>
>>> To what is the word "unified" referring?
>>>
>>>
>>>
>>>
>>>
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: mission statement : unified

2020-10-25 Thread Khalid Mammadov
Correct. Also as explained in the book Learning Spark 2.0 by Databricks:

Unified Analytics
While the notion of unification is not unique to Spark, it is a core component 
of its design philosophy and evolution. In November 2016, the Association for 
Computing Machinery (ACM) recognized Apache Spark and conferred upon its 
original creators the prestigious ACM Award for their paper describing Apache 
Spark as a “Unified Engine for Big Data Processing.” The award-winning paper 
notes that Spark replaces all the separate batch processing, graph, stream, and 
query engines like Storm, Impala, Dremel, Pregel, etc. with a unified stack of 
components that addresses diverse workloads under a single distributed fast 
engine.

Khalid

> On 19 Oct 2020, at 07:03, Sonal Goyal  wrote:
> 
> 
> My thought is that Spark supports analytics for structured and unstructured 
> data, batch as well as real time. This was pretty revolutionary when Spark 
> first came out. That's where the unified term came from I think. Even after 
> all these years, Spark remains the trusted framework for enterprise 
> analytics. 
> 
>> On Mon, 19 Oct 2020, 11:24 Gourav Sengupta wrote:
>> Hi,
>> 
>> I think that it is just a marketing statement. But with SPARK 3.x, now that 
>> you are seeing that SPARK is no more than just another distributed data 
>> processing engine, they are trying to join data pre-processing into ML 
>> pipelines directly. I may call that unified. 
>> 
>> But you get the same with several other frameworks as well now so not quite 
>> sure how unified creates a unique brand value.
>> 
>> 
>> Regards,
>> Gourav Sengupta 
>> 
>>> On Sun, Oct 18, 2020 at 6:40 PM Hulio andres  wrote:
>>>  
>>> Apache Spark's mission statement is "Apache Spark™ is a unified analytics
>>> engine for large-scale data processing."
>>>
>>> To what is the word "unified" referring?
>>>  
>>>  
>>>  
>>>  
>>>  
>>>  
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Spark hive build and connectivity

2020-10-25 Thread hopefulnick
For compatibility, it's recommended to:
- Use a compatible version of Hive.
- Build Spark without Hive and configure Hive to use Spark.

Here is the way to build Spark with a custom Hive; it worked for me and I
hope it is helpful to you: Hive on Spark
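
For example, the builds look roughly like this (not copied from the guide;
double-check the profile names and the Hadoop version against the Spark
release you are building):

# Spark with its own built-in Hive support:
./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package

# Spark without Hive, for use with Hive on Spark (profiles as described in
# the "Hive on Spark" guide; adjust the Hadoop profile to match your cluster):
./dev/make-distribution.sh --name hadoop2-without-hive --tgz \
  -Pyarn -Phadoop-2.7 -Phadoop-provided -Pparquet-provided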



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/