Re: spark sql StackOverflow

2018-05-14 Thread Alessandro Solimando
Hi,
I am not familiar with ATNConfigSet, but here are some thoughts that might help.

How many distinct key1 (resp. key2) values do you have? Are these values
reasonably stable over time?

Are these records ingested in real-time or are they loaded from a datastore?

In the latter case the DB might be able to perform the filtering
efficiently, especially if equipped with a proper index over key1/key2 (or a
composite one).

In that case the filter push-down could be very effective (I didn't get whether
you just need to count or do something more with the matching records).

Alternatively, you could try to group by (key1, key2) and then filter (again,
it depends on the kind of output you have in mind).

If the datastore/stream is distributed and supports partitioning, you could
partition your records by either key1 or key2 (or key1+key2), so they are
already "separated" and can be consumed more efficiently (e.g., the group-by
could then be local to a single partition).
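
To make the group-by idea concrete, here is a minimal sketch in Java (a sketch
only: the temp table name "mytemptable" and the columns key1/key2/value are
taken from the original post, everything else is assumed):

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PairCounts {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("pair-counts").getOrCreate();

    // Assumption: the current batch of records is registered as the temp table
    // "mytemptable" with columns key1, key2, value (as in the original post).
    Dataset<Row> records = spark.table("mytemptable");

    Dataset<Row> counts = records
        .repartition(col("key1"), col("key2"))  // co-locate identical (key1, key2) pairs
        .groupBy(col("key1"), col("key2"))
        .count();                               // occurrences per (key1, key2) pair

    // 'counts' can then be filtered or joined down to the ~3000 pairs of interest.
    counts.show();
  }
}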

Best regards,
Alessandro

On 15 May 2018 at 08:32, onmstester onmstester  wrote:

> Hi,
>
> I need to run some queries on a huge amount of input records. The input
> rate is 100K records/second.
> A record is like (key1, key2, value) and the application should report
> occurrences of key1 = something && key2 == somethingElse.
> The problem is that there are too many filters in my query: more than 3
> thousand pairs of key1 and key2 should be filtered.
> I was simply putting 1 million records in a temp table each time and
> running a SQL query with spark-sql on the temp table:
> select * from mytemptable where (key1 = something && key2 ==
> somethingElse) or (key1 = someOtherthing && key2 == someAnotherThing) or
> ...(3 thousand ORs!!!)
> And I encounter a StackOverflowError at ATNConfigSet.java line 178.
>
> So I have two options, IMHO:
> 1. Either put all the key1/key2 filter pairs in another temp table and do
> a join between the two temp tables
> 2. Or use Spark Streaming, which I'm not familiar with, and I don't know
> whether it could handle 3K filters.
>
> Which way do you suggest? What is the best solution for my problem,
> performance-wise?
>
> Thanks in advance
>
>


Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-14 Thread Nick Pentreath
Multi-column support for StringIndexer didn't make it into Spark 2.3.0.

The PR is still in progress, I think - it should be available in 2.4.0.
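
Until then, a common workaround is one StringIndexer per column, chained in a
Pipeline. A minimal Java sketch (the "_idx" suffix and the class/method names
are made up for illustration):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class MultiColumnIndexer {

  // Index each of the given string columns with its own StringIndexer,
  // chaining all indexers into a single Pipeline. Each output column gets
  // an "_idx" suffix.
  public static Dataset<Row> indexColumns(Dataset<Row> df, String... inputCols) {
    List<PipelineStage> stages = new ArrayList<>();
    for (String c : inputCols) {
      stages.add(new StringIndexer().setInputCol(c).setOutputCol(c + "_idx"));
    }
    Pipeline pipeline = new Pipeline().setStages(stages.toArray(new PipelineStage[0]));
    PipelineModel model = pipeline.fit(df);
    return model.transform(df);
  }
}

Usage would be along the lines of indexColumns(df, "colA", "colB") for a
hypothetical Dataset<Row> df with string columns colA and colB.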

On Mon, 14 May 2018 at 22:32, Mina Aslani  wrote:

> Please take a look at the api doc:
> https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/ml/feature/StringIndexer.html
>
> On Mon, May 14, 2018 at 4:30 PM, Mina Aslani  wrote:
>
>> Hi,
>>
>> There is no setInputCols/setOutputCols for StringIndexer in Spark Java.
>> How can multiple input/output columns be specified then?
>>
>> Regards,
>> Mina
>>
>
>


spark sql StackOverflow

2018-05-14 Thread onmstester onmstester
Hi, 



I need to run some queries on a huge amount of input records. The input rate is 
100K records/second.

A record is like (key1, key2, value) and the application should report occurrences 
of key1 = something && key2 == somethingElse.

The problem is that there are too many filters in my query: more than 3 thousand 
pairs of key1 and key2 should be filtered.

I was simply putting 1 million records in a temp table each time and running 
a SQL query with spark-sql on the temp table:

select * from mytemptable where (key1 = something && key2 == 
somethingElse) or (key1 = someOtherthing && key2 == someAnotherThing) 
or ...(3 thousand ORs!!!)

And I encounter a StackOverflowError at ATNConfigSet.java line 178.



So I have two options, IMHO:

1. Either put all the key1/key2 filter pairs in another temp table and do a 
join between the two temp tables (a sketch of this is shown below)

2. Or use Spark Streaming, which I'm not familiar with, and I don't know whether 
it could handle 3K filters.
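
A minimal sketch of option 1 in Java (the temp table name comes from the post;
the Parquet path for the pair list and the app name are placeholders):

import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PairFilterJoin {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("pair-filter").getOrCreate();

    // Assumption: the current batch sits in the temp table "mytemptable"
    // (key1, key2, value), and the ~3000 wanted pairs are stored somewhere
    // small with columns key1, key2.
    Dataset<Row> records = spark.table("mytemptable");
    Dataset<Row> pairs = spark.read().parquet("/path/to/filter_pairs"); // placeholder path

    // Broadcasting the small pair table turns the filter into a map-side hash
    // join, so the query no longer needs thousands of OR'ed predicates.
    Dataset<Row> matched = records.join(
        broadcast(pairs),
        records.col("key1").equalTo(pairs.col("key1"))
            .and(records.col("key2").equalTo(pairs.col("key2"))));

    System.out.println(matched.count());
  }
}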



Which way do you suggest? What is the best solution for my problem, 
performance-wise?



Thanks in advance




Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-14 Thread Mina Aslani
Please take a look at the api doc:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/ml/feature/StringIndexer.html

On Mon, May 14, 2018 at 4:30 PM, Mina Aslani  wrote:

> Hi,
>
> There is no setInputCols/setOutputCols for StringIndexer in Spark Java.
> How can multiple input/output columns be specified then?
>
> Regards,
> Mina
>


How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-14 Thread Mina Aslani
Hi,

There is no setInputCols/setOutputCols for StringIndexer in Spark Java.
How can multiple input/output columns be specified then?

Regards,
Mina


Re: UDTF registration fails for hiveEnabled SQLContext

2018-05-14 Thread Gourav Sengupta
Hi Mick,

the error message practically gives you the suggestion for fixing it: "use
sparkSession.udf.register(...) instead".

Regards,
Gourav Sengupta

On Mon, May 14, 2018 at 8:59 AM, Mick Davies 
wrote:

> The examples were lost by formatting:
>
> Exception is:
>
> No handler for UDAF 'com.iqvia.rwas.omop.udtf.ParallelExplode'. Use
> sparkSession.udf.register(...) instead.; line 1 pos 7
> org.apache.spark.sql.AnalysisException: No handler for UDAF
> 'com.iqvia.rwas.omop.udtf.ParallelExplode'. Use
> sparkSession.udf.register(...) instead.; line 1 pos 7
> at
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionExpression(SessionCatalog.scala:1134)
> at
> org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1114)
> at
> org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1114)
> at
> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:115)
> at
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1245)


Re: [Arrow][Dremio]

2018-05-14 Thread Pierce Lamb
Hi Xavier,

Along the lines of connecting to multiple sources of data and replacing ETL
tools, you may want to check out Confluent's blog on building a real-time
streaming ETL pipeline on Kafka, as well as SnappyData's blog on Real-Time
Streaming ETL with SnappyData, where Spark is central to connecting to
multiple data sources, executing SQL on streams, etc. These should provide
nice comparisons to your ideas about Dremio + Spark as ETL tools.

Disclaimer: I am a SnappyData employee

Hope this helps,

Pierce

On Mon, May 14, 2018 at 2:24 AM, xmehaut  wrote:

> Hi Michaël,
>
> I'm not an expert on Dremio, I'm just trying to evaluate the potential of this
> technology, what impact it could have on Spark, how the two can work
> together, and how Spark could use Arrow even further internally alongside the
> existing algorithms.
>
> Dremio already has a quite rich API set enabling access, for instance, to
> metadata and SQL queries, or even to create virtual datasets programmatically.
> They also have a lot of predefined functions, and I imagine there will be
> more and more functions in the future, e.g. machine learning functions like
> the ones we may find in Azure SQL Server, which enables mixing SQL and ML
> functions. Access to Dremio is made through JDBC, and we may imagine
> accessing virtual datasets through Spark and dynamically creating new datasets
> from the API, connected to Parquet files stored dynamically by Spark on
> HDFS, Azure Data Lake or S3... Of course a tighter integration between
> the two would be better, with a Spark read/write connector to Dremio :)
>
> regards
> xavier


assertion failed: Beginning offset 34242088 is after the ending offset 34242084 for topic partition 2. You either provided an invalid fromOffset, or the Kafka topic has been damaged

2018-05-14 Thread ravidspark
Hi Community,

We are seeing the below message in the logs and the Spark application is getting
terminated. There is an issue with our Kafka service: it auto-restarts, during
which leader re-election happens.

*Exception:*
assertion failed: Beginning offset 34242088 is after the ending offset
34242084 for topic  partition 2. You either provided an invalid
fromOffset, or the Kafka topic has been damaged.

After analyzing it, my understanding is that the Spark consumer is starting from
the last offset it processed, while Kafka now reports an older ending offset
(correct me if I am wrong). Is this behavior intended? Is there a way we can
handle these situations without letting the app die?
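
For reference, a minimal sketch of where the relevant consumer settings live in
the Kafka 0-10 direct stream (assuming that integration is in use; broker
address, group id and topic name are placeholders). Note that auto.offset.reset
only governs where the consumer starts when it has no valid stored offset, so
it may not by itself prevent this assertion after an unclean leader election:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class DirectStreamSketch {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("direct-stream-sketch");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "broker1:9092");         // placeholder
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "my-group");                      // placeholder
    // Where to start when no valid committed offset exists; it does not
    // rewrite offsets that Spark already holds for a planned batch.
    kafkaParams.put("auto.offset.reset", "latest");
    kafkaParams.put("enable.auto.commit", false);

    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Arrays.asList("mytopic"), kafkaParams));          // placeholder topic

    stream.foreachRDD(rdd -> System.out.println("records in batch: " + rdd.count()));

    jssc.start();
    jssc.awaitTermination();
  }
}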

Any inputs are greatly appreciated.

Thanks,
Ravi






Re: [Arrow][Dremio]

2018-05-14 Thread xmehaut
Hi Michaël,

I'm not an expert on Dremio, I'm just trying to evaluate the potential of this
technology, what impact it could have on Spark, how the two can work
together, and how Spark could use Arrow even further internally alongside the
existing algorithms.

Dremio already has a quite rich API set enabling access, for instance, to
metadata and SQL queries, or even to create virtual datasets programmatically.
They also have a lot of predefined functions, and I imagine there will be
more and more functions in the future, e.g. machine learning functions like the
ones we may find in Azure SQL Server, which enables mixing SQL and ML
functions. Access to Dremio is made through JDBC, and we may imagine
accessing virtual datasets through Spark and dynamically creating new datasets
from the API, connected to Parquet files stored dynamically by Spark on
HDFS, Azure Data Lake or S3... Of course a tighter integration between
the two would be better, with a Spark read/write connector to Dremio :)

regards
xavier






Re: [Arrow][Dremio]

2018-05-14 Thread Michael Shtelma
Hi Xavier,

Dremio looks really interesting and has a nice UI. I think the idea of
replacing SSIS or similar tools with Dremio is not so bad, but what about
complex scenarios with a lot of code and transformations?
Is it possible to use Dremio via an API and define your own transformations and
transformation workflows with Java or Scala in Dremio?
I am not sure if that is supported at all.
I think the Dremio guys intend to give users access to the Sabot API
so that Dremio can be used in the same way you can use Spark, but I am not sure
whether that is possible now.
Have you also tried comparing performance with Spark? Are there any
benchmarks?

Best,
Michael

On Mon, May 14, 2018 at 6:53 AM, xmehaut  wrote:

> Hello,
> I have some questions about Spark and Apache Arrow. Up to now, Arrow is only
> used for sharing data between Python and Spark executors instead of
> transmitting it through sockets. I'm currently studying Dremio as an
> interesting way to access multiple sources of data, and as a potential
> replacement for ETL tools, including Spark SQL.
> It seems, if the promises actually hold, that Arrow and Dremio may be
> game-changing for these two purposes (data source abstraction, ETL tasks),
> leaving Spark with the two following goals, i.e. ML/DL and graph
> processing, which could be a danger for Spark in the middle term with the rise
> of multiple frameworks in these areas.
> My questions are then:
> - Is there a means to use Arrow more broadly in Spark itself and not only
> for sharing data?
> - What are the strengths and weaknesses of Spark w.r.t. Arrow and consequently
> Dremio?
> - What is the difference, finally, between Databricks DBIO and Dremio/Arrow?
> - How do you see the future of Spark regarding these assumptions?
> regards
>


Re: UDTF registration fails for hiveEnabled SQLContext

2018-05-14 Thread Mick Davies
The examples were lost by formatting:

Exception is:

No handler for UDAF 'com.iqvia.rwas.omop.udtf.ParallelExplode'. Use
sparkSession.udf.register(...) instead.; line 1 pos 7
org.apache.spark.sql.AnalysisException: No handler for UDAF
'com.iqvia.rwas.omop.udtf.ParallelExplode'. Use
sparkSession.udf.register(...) instead.; line 1 pos 7
at
org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionExpression(SessionCatalog.scala:1134)
at
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1114)
at
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1114)
at
org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:115)
at
org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1245)





