Question regarding Projection PushDown

2021-08-27 Thread satyajit vegesna
Hi All,

Please help with the question below.

I am trying to build my own data source to connect to CustomAerospike. I am
almost done with everything, but I am still not sure how to implement
projection pushdown when selecting nested columns.

Spark handles top-level column projection pushdown implicitly, but it looks
like nested projection pushdown needs a custom implementation. I would like to
know whether there is a way I can do this myself; any code pointers would be
helpful.

Currently, even when I select("col1.nested2"), projection pushdown only prunes
down to col1; it does not pick out col1.nested2. My plan is to implement
custom projection pushdown with a method in compute that pulls the specific
column.nestedcol and converts it to a Row. My problem in doing so is that I
cannot access the nested column passed in the select from my data source: in
my relation class I only receive col1, and I need a way to access the nested2
column provided in the select query.

Regards.
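[Editor's note: one possible direction, sketched below, assumes the source is
written against the DataSource V2 API rather than a V1 relation. A ScanBuilder
that implements SupportsPushDownRequiredColumns is handed the pruned schema
Spark needs, and with spark.sql.optimizer.nestedSchemaPruning.enabled set to
true (recent Spark 3.x releases) that schema keeps its nested shape, so the
source can see col1.nested2 rather than all of col1. The AerospikeScanBuilder
and AerospikeScan names are hypothetical.]

    import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownRequiredColumns}
    import org.apache.spark.sql.types.StructType

    // Hypothetical ScanBuilder for the custom Aerospike source. Spark calls
    // pruneColumns() with the schema it actually needs; for
    // select("col1.nested2") with nested schema pruning enabled,
    // requiredSchema is roughly struct<col1:struct<nested2:...>>.
    class AerospikeScanBuilder(fullSchema: StructType) extends ScanBuilder
        with SupportsPushDownRequiredColumns {

      private var prunedSchema: StructType = fullSchema

      override def pruneColumns(requiredSchema: StructType): Unit = {
        prunedSchema = requiredSchema
      }

      override def build(): Scan = new AerospikeScan(prunedSchema)
    }

    // Stub Scan that only reports the pruned schema; a real implementation
    // would read just these fields from Aerospike and build rows from them.
    class AerospikeScan(schema: StructType) extends Scan {
      override def readSchema(): StructType = schema
    }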


Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-27 Thread Sean Owen
Maybe, I'm just confused why it's needed at all. Other profiles that add a
dependency seem OK, but something's different here.

One thing we can/should change is to simply remove the
 block in the profile. It should always be a direct
dep in Scala 2.13 (which lets us take out the profiles in submodules, which
just repeat that)
We can also update the version, by the by.

I tried this and the resulting POM still doesn't look like what I expect
though.

(The binary release is OK, FWIW - it gets pulled in as a JAR as expected)

On Thu, Aug 26, 2021 at 11:34 PM Stephen Coy  wrote:

> Hi Sean,
>
> I think that maybe the flatten-maven-plugin
> (https://www.mojohaus.org/flatten-maven-plugin/) will help you out here.
>
> Cheers,
>
> Steve C
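[Editor's note: for reference, a minimal sketch of how the suggested
flatten-maven-plugin is typically wired into a parent POM, based on the
plugin's documented usage; the version and configuration shown are
assumptions, not taken from Spark's build.]

    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>flatten-maven-plugin</artifactId>
      <version>1.2.7</version>
      <executions>
        <execution>
          <id>flatten</id>
          <phase>process-resources</phase>
          <goals>
            <goal>flatten</goal>
          </goals>
        </execution>
      </executions>
    </plugin>

The flatten goal rewrites the POM that gets installed/deployed into a
resolved, "flattened" form, which should make dependencies contributed by
active profiles visible to consumers.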
>
> On 27 Aug 2021, at 12:29 pm, Sean Owen  wrote:
>
> OK right, you would have seen a different error otherwise.
>
> Yes profiles are only a compile-time thing, but they should affect the
> effective POM for the artifact. mvn -Pscala-2.13 help:effective-pom shows
> scala-parallel-collections as a dependency in the POM as expected (not in a
> profile). However I see what you see in the .pom in the release repo, and
> in my local repo after building - it's just sitting there as a profile as
> if it weren't activated or something.
>
> I'm confused then, that shouldn't be what happens. I'd say maybe there is
> a problem with the release script, but seems to affect a simple local
> build. Anyone else more expert in this see the problem, while I try to
> debug more?
> The binary distro may actually be fine, I'll check; it may not matter much
> for users who generally just treat Spark as a compile-time-only dependency
> anyway. But I can see it would break exactly your case, something like a
> self-contained test job.
>
> On Thu, Aug 26, 2021 at 8:41 PM Stephen Coy  wrote:
>
>> I did indeed.
>>
>> The generated spark-core_2.13-3.2.0.pom that is created alongside the jar
>> file in the local repo contains:
>>
>> <profile>
>>   <id>scala-2.13</id>
>>   <dependencies>
>>     <dependency>
>>       <groupId>org.scala-lang.modules</groupId>
>>       <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
>>     </dependency>
>>   </dependencies>
>> </profile>
>>
>> which means this dependency will be missing for unit tests that create
>> SparkSessions from library code only, a technique inspired by Spark’s own
>> unit tests.
>>
>> Cheers,
>>
>> Steve C
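[Editor's note: a possible workaround until the published POM is fixed, not
something proposed in the thread, is to declare the dependency explicitly in
the application's own POM, e.g. with test scope. The 1.0.3 version matches
the upgrade suggested further down in the quoted message.]

    <dependency>
      <groupId>org.scala-lang.modules</groupId>
      <artifactId>scala-parallel-collections_2.13</artifactId>
      <version>1.0.3</version>
      <scope>test</scope>
    </dependency>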
>>
>> On 27 Aug 2021, at 11:33 am, Sean Owen  wrote:
>>
>> Did you run ./dev/change-scala-version.sh 2.13 ? that's required first to
>> update POMs. It works fine for me.
>>
>> On Thu, Aug 26, 2021 at 8:33 PM Stephen Coy <
>> s...@infomedia.com.au.invalid> wrote:
>>
>>> Hi all,
>>>
>>> Being adventurous I have built the RC1 code with:
>>>
>>> -Pyarn -Phadoop-3.2  -Pyarn -Phadoop-cloud -Phive-thriftserver
>>> -Phive-2.3 -Pscala-2.13 -Dhadoop.version=3.2.2
>>>
>>>
>>> And then attempted to build my Java based spark application.
>>>
>>> However, I found a number of our unit tests were failing with:
>>>
>>> java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
>>>
>>> at
>>> org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1412)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>> at org.apache.spark.SparkContext.withScope(SparkContext.scala:789)
>>> at org.apache.spark.SparkContext.union(SparkContext.scala:1406)
>>> at
>>> org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:698)
>>> at
>>> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
>>> …
>>>
>>>
>>> I tracked this down to a missing dependency:
>>>
>>> <dependency>
>>>   <groupId>org.scala-lang.modules</groupId>
>>>   <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
>>> </dependency>
>>>
>>>
>>> which unfortunately appears only in a profile in the pom files
>>> associated with the various spark dependencies.
>>>
>>> As far as I know it is not possible to activate profiles in dependencies
>>> in maven builds.
>>>
>>> Therefore I suspect that right now a Scala 2.13 migration is not quite
>>> as seamless as we would like.
>>>
>>> I stress that this is only an issue for developers that write unit tests
>>> for their applications, as the Spark runtime environment will always have
>>> the necessary dependencies available to it.
>>>
>>> (You might consider upgrading the
>>> org.scala-lang.modules:scala-parallel-collections_2.13 version from 0.2 to
>>> 1.0.3 though!)
>>>
>>> Cheers and thanks for the great work!
>>>
>>> Steve Coy
>>>
>>>
>>> On 21 Aug 2021, at 3:05 am, Gengliang Wang  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark
>>>  version 3.2.0.
>>>
>>> The vote is open until 11:59pm Pacific time Aug 25 and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>> 

Re: How to improve the concurrent query performance of spark SQL query

2021-08-27 Thread Mich Talebzadeh
There are many ways of interacting with a Hive data warehouse from Spark.

You can either use the Spark API to talk to Hive natively, or use a JDBC
connection (local or remote Spark).
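[Editor's note: for illustration, a minimal sketch of the two access paths;
the JDBC URL, database, and table names below are placeholders.]

    import org.apache.spark.sql.SparkSession

    // Native path: Spark resolves tables through the Hive metastore directly.
    val spark = SparkSession.builder()
      .appName("hive-access")
      .enableHiveSupport()
      .getOrCreate()

    val nativeDf = spark.sql("SELECT * FROM mydb.mytable")

    // JDBC path: go through HiveServer2 (or another JDBC endpoint) instead.
    val jdbcDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:hive2://hiveserver2-host:10000/mydb")
      .option("dbtable", "mytable")
      .option("driver", "org.apache.hive.jdbc.HiveDriver")
      .load()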

What is the reference to the driver in this context? Bottom line: with
concurrent queries you will have to go through Hive, and that is where, as you
pointed out, you may run into concurrency issues. Spark IMO does not play such
a significant role here. Your concurrency depends on how Hive is configured to
handle multiple threads. If the Hive metastore is on Oracle you can expect
very good performance; if it is on something like MySQL, the bottleneck will
be on the Hive side.

HTH



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 27 Aug 2021 at 02:32, Tao Li  wrote:

> In high-concurrency scenarios, the query performance of Spark SQL is limited
> by the NameNode and the Hive metastore. There are some caches in the code,
> but their effect is limited. Do we have a practical and effective way to
> reduce the time spent in the driver for concurrent queries?
>