RE: RDD object Out of scope.

2019-05-21 Thread Nasrulla Khan Haris
Thanks, Sean, that makes sense.

Regards,
Nasrulla

-Original Message-
From: Sean Owen  
Sent: Tuesday, May 21, 2019 6:24 PM
To: Nasrulla Khan Haris 
Cc: dev@spark.apache.org
Subject: Re: RDD object Out of scope.

I'm not clear what you're asking. An RDD itself is just an object in the JVM. 
It will be garbage collected if there are no references. What else would there 
be to clean up in your case? ContextCleaner handles cleanup of persisted 
RDDs, etc.

On Tue, May 21, 2019 at 7:39 PM Nasrulla Khan Haris 
 wrote:
>
> I am trying to find the code that cleans up uncached RDDs.
>
>
>
> Thanks,
>
> Nasrulla
>
>
>
> From: Charoes 
> Sent: Tuesday, May 21, 2019 5:10 PM
> To: Nasrulla Khan Haris 
> Cc: Wenchen Fan ; dev@spark.apache.org
> Subject: Re: RDD object Out of scope.
>
>
>
> If you cache an RDD and hold a reference to that RDD in your code, then your 
> RDD will NOT be cleaned up.
>
> There is a ReferenceQueue in ContextCleaner, which is used to keep track of 
> references to RDDs, Broadcasts, Accumulators, etc.
>
>
>
> On Wed, May 22, 2019 at 1:07 AM Nasrulla Khan Haris 
>  wrote:
>
> Thanks for the reply, Wenchen. I am curious as to what happens when an RDD goes out of 
> scope when it is not cached.
>
>
>
> Nasrulla
>
>
>
> From: Wenchen Fan 
> Sent: Tuesday, May 21, 2019 6:28 AM
> To: Nasrulla Khan Haris 
> Cc: dev@spark.apache.org
> Subject: Re: RDD object Out of scope.
>
>
>
> RDD is kind of a pointer to the actual data. Unless it's cached, we don't 
> need to clean up the RDD.
>
>
>
> On Tue, May 21, 2019 at 1:48 PM Nasrulla Khan Haris 
>  wrote:
>
> Hi Spark developers,
>
>
>
> Can someone point out the code where RDD objects go out of scope? I found 
> the ContextCleaner code, in which only persisted RDDs are cleaned up at 
> regular intervals if the RDD is registered for cleanup. I have not found where 
> the destructor for an RDD object is invoked. I am trying to understand when RDD 
> cleanup happens when the RDD is not persisted.
>
>
>
> Thanks in advance, appreciate your help.
>
> Nasrulla
>
>


Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2019-05-21 Thread 张万新
Thanks, I'll check it out.

Arun Mahadevan wrote on Tuesday, May 21, 2019 at 01:31:

> Here's the proposal for supporting it in "append" mode -
> https://github.com/apache/spark/pull/23576. You could see if it addresses
> your requirement and post your feedback in the PR.
> For "update" mode it's going to be much harder to support this without
> first adding support for "retractions"; otherwise we would end up with
> wrong results.
>
> - Arun
>
>
> On Mon, 20 May 2019 at 01:34, Gabor Somogyi 
> wrote:
>
>> There is a PR for this, but it is not yet merged.
>>
>> On Mon, May 20, 2019 at 10:13 AM 张万新  wrote:
>>
>>> Hi there,
>>>
>>> I'd like to know the root reason why multiple aggregations on a
>>> streaming dataframe are not allowed, since this is a very useful feature and
>>> Flink has supported it for a long time.
>>>
>>> Thanks.
>>>
>>
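
For readers skimming the archive, here is a minimal Scala sketch of the pattern under
discussion (the socket source, local session, and column names are assumptions, not
taken from the thread); starting a query over a chained aggregation like this currently
fails structured streaming's unsupported-operation check:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("multi-agg").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999)
  .load()

val perValue = events.groupBy($"value").count()   // first streaming aggregation
val total = perValue.agg(sum($"count"))           // second aggregation on the same stream

// Starting a query over `total` is rejected today ("multiple streaming aggregations
// are not supported"); the PR above proposes allowing this in append mode.
// total.writeStream.outputMode("complete").format("console").start()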


Re: RDD object Out of scope.

2019-05-21 Thread Sean Owen
I'm not clear what you're asking. An RDD itself is just an object in
the JVM. It will be garbage collected if there are no references. What
else would there be to clean up in your case? ContextCleaner handles
cleanup of persisted RDDs, etc.
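
To make the two cases concrete, here is a minimal Scala sketch (local mode and the
example RDDs are assumptions, not from the thread):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("rdd-scope"))

// Case 1: an uncached RDD is just a JVM object describing a computation.
// Once no references remain, it is garbage collected like anything else.
def countDoubled(): Long = {
  val rdd = sc.parallelize(1 to 1000).map(_ * 2)
  rdd.count()   // after this returns, `rdd` has no persisted state to release
}
countDoubled()

// Case 2: a persisted RDD holds blocks in the block manager, so it needs an explicit
// unpersist() -- or ContextCleaner frees the blocks asynchronously once the RDD
// reference itself has been garbage collected.
val cached = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY)
cached.count()
cached.unpersist()

sc.stop()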

On Tue, May 21, 2019 at 7:39 PM Nasrulla Khan Haris
 wrote:
>
> I am trying to find the code that cleans up uncached RDDs.
>
>
>
> Thanks,
>
> Nasrulla
>
>
>
> From: Charoes 
> Sent: Tuesday, May 21, 2019 5:10 PM
> To: Nasrulla Khan Haris 
> Cc: Wenchen Fan ; dev@spark.apache.org
> Subject: Re: RDD object Out of scope.
>
>
>
> If you cache an RDD and hold a reference to that RDD in your code, then your 
> RDD will NOT be cleaned up.
>
> There is a ReferenceQueue in ContextCleaner, which is used to keep track of 
> references to RDDs, Broadcasts, Accumulators, etc.
>
>
>
> On Wed, May 22, 2019 at 1:07 AM Nasrulla Khan Haris 
>  wrote:
>
> Thanks for the reply, Wenchen. I am curious as to what happens when an RDD goes out of 
> scope when it is not cached.
>
>
>
> Nasrulla
>
>
>
> From: Wenchen Fan 
> Sent: Tuesday, May 21, 2019 6:28 AM
> To: Nasrulla Khan Haris 
> Cc: dev@spark.apache.org
> Subject: Re: RDD object Out of scope.
>
>
>
> RDD is kind of a pointer to the actual data. Unless it's cached, we don't 
> need to clean up the RDD.
>
>
>
> On Tue, May 21, 2019 at 1:48 PM Nasrulla Khan Haris 
>  wrote:
>
> Hi Spark developers,
>
>
>
> Can someone point out the code where RDD objects go out of scope? I found 
> the ContextCleaner code, in which only persisted RDDs are cleaned up at 
> regular intervals if the RDD is registered for cleanup. I have not found where 
> the destructor for an RDD object is invoked. I am trying to understand when RDD 
> cleanup happens when the RDD is not persisted.
>
>
>
> Thanks in advance, appreciate your help.
>
> Nasrulla
>
>




https://github.com/google/zetasql

2019-05-21 Thread kant kodali
https://github.com/google/zetasql


RE: RDD object Out of scope.

2019-05-21 Thread Nasrulla Khan Haris
I am trying to find the code that cleans up uncached RDDs.

Thanks,
Nasrulla

From: Charoes 
Sent: Tuesday, May 21, 2019 5:10 PM
To: Nasrulla Khan Haris 
Cc: Wenchen Fan ; dev@spark.apache.org
Subject: Re: RDD object Out of scope.

If you cache an RDD and hold a reference to that RDD in your code, then your 
RDD will NOT be cleaned up.
There is a ReferenceQueue in ContextCleaner, which is used to keep track of 
references to RDDs, Broadcasts, Accumulators, etc.

On Wed, May 22, 2019 at 1:07 AM Nasrulla Khan Haris 
<nasrulla.k...@microsoft.com.invalid> wrote:
Thanks for the reply, Wenchen. I am curious as to what happens when an RDD goes out of 
scope when it is not cached.

Nasrulla

From: Wenchen Fan <cloud0...@gmail.com>
Sent: Tuesday, May 21, 2019 6:28 AM
To: Nasrulla Khan Haris <nasrulla.k...@microsoft.com.invalid>
Cc: dev@spark.apache.org
Subject: Re: RDD object Out of scope.

RDD is kind of a pointer to the actual data. Unless it's cached, we don't need 
to clean up the RDD.

On Tue, May 21, 2019 at 1:48 PM Nasrulla Khan Haris 
<nasrulla.k...@microsoft.com.invalid> wrote:
Hi Spark developers,

Can someone point out the code where RDD objects go out of scope? I found the 
ContextCleaner code, in which only persisted RDDs are cleaned up at regular 
intervals if the RDD is registered for cleanup. I have not found where the 
destructor for an RDD object is invoked. I am trying to understand when RDD 
cleanup happens when the RDD is not persisted.

Thanks in advance, appreciate your help.
Nasrulla



Re: RDD object Out of scope.

2019-05-21 Thread Charoes
If you cache an RDD and hold a reference to that RDD in your code, then
your RDD will NOT be cleaned up.
There is a ReferenceQueue in ContextCleaner, which is used to keep track of
references to RDDs, Broadcasts, Accumulators, etc.
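
To illustrate the mechanism described above, here is a minimal, self-contained Scala
sketch of the WeakReference + ReferenceQueue pattern. It is only an analogue of what
ContextCleaner does internally, not Spark's actual code, and the names are made up:

import java.lang.ref.{Reference, ReferenceQueue, WeakReference}

final class TrackedResource(val id: Int)   // stand-in for an RDD, broadcast, etc.

val queue = new ReferenceQueue[TrackedResource]()

var resource: TrackedResource = new TrackedResource(42)
val weakRef = new WeakReference(resource, queue)   // track it without keeping it alive
val idToClean = resource.id                        // remember only what cleanup needs

resource = null   // drop the strong reference, like an RDD going out of scope
System.gc()       // hint only; Spark instead runs a daemon thread that polls the queue

val enqueued: Reference[_ <: TrackedResource] = queue.remove(1000)   // wait up to 1s
if (enqueued != null) println(s"resource $idToClean can now be cleaned up")
else println("not collected yet; a cleaner thread would simply keep polling")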

On Wed, May 22, 2019 at 1:07 AM Nasrulla Khan Haris
 wrote:

> Thanks for the reply, Wenchen. I am curious as to what happens when an RDD goes out
> of scope when it is not cached.
>
>
>
> Nasrulla
>
>
>
> *From:* Wenchen Fan 
> *Sent:* Tuesday, May 21, 2019 6:28 AM
> *To:* Nasrulla Khan Haris 
> *Cc:* dev@spark.apache.org
> *Subject:* Re: RDD object Out of scope.
>
>
>
> RDD is kind of a pointer to the actual data. Unless it's cached, we don't
> need to clean up the RDD.
>
>
>
> On Tue, May 21, 2019 at 1:48 PM Nasrulla Khan Haris <
> nasrulla.k...@microsoft.com.invalid> wrote:
>
> Hi Spark developers,
>
>
>
> Can someone point out the code where RDD objects go out of scope? I
> found the ContextCleaner code, in which only persisted RDDs are cleaned up at
> regular intervals if the RDD is registered for cleanup. I have not found where
> the destructor for an RDD object is invoked. I am trying to understand when RDD
> cleanup happens when the RDD is not persisted.
>
>
>
> Thanks in advance, appreciate your help.
>
> Nasrulla
>
>
>
>


Re: DataSourceV2Reader Q

2019-05-21 Thread Andrew Melo
Hi Ryan,

On Tue, May 21, 2019 at 2:48 PM Ryan Blue  wrote:
>
> Are you sure that your schema conversion is correct? If you're running with a 
> recent Spark version, then that line is probably `name.hashCode()`. That file 
> was last updated 6 months ago, so I think it is likely that `name` is null 
> in your version.

Thanks for taking a look -- in my traceback, "line 264" of
AttributeReference.hashCode() is:

h = h * 37 + metadata.hashCode()

If I look within the StructType at the top of the schema, each
StructField indeed has null for the metadata, which I was improperly
passing in instead of Metadata.empty().
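
For anyone hitting the same NPE, here is a short Scala sketch of the fix (the field
names are hypothetical): every StructField should carry a real Metadata instance rather
than null, and Metadata.empty is the safe default.

import org.apache.spark.sql.types._

// Returning a schema like this from readSchema() lets the analyzer hash the resulting
// AttributeReferences without calling hashCode() on a null metadata field.
val schema = StructType(Seq(
  StructField("event_id", LongType, nullable = false, metadata = Metadata.empty),
  StructField("payload", BinaryType, nullable = true, metadata = Metadata.empty)
))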

Thanks again,
Andrew

>
> On Tue, May 21, 2019 at 11:39 AM Andrew Melo  wrote:
>>
>> Hello,
>>
>> I'm developing a DataSourceV2 reader for the ROOT (https://root.cern/)
>> file format to replace a previous DSV1 source that was in use before.
>>
>> I have a bare skeleton of the reader, which can properly load the
>> files and pass their schema into Spark 2.4.3, but any operation on the
>> resulting DataFrame (except for printSchema()) causes an NPE deep in
>> the guts of Spark [1]. I'm baffled, though, since both logging
>> statements and coverage say that neither planBatchInputPartitions nor
>> any of the methods in my partition class are called -- the only thing
>> called is readSchema and the constructors.
>>
>> I followed the pattern from "JavaBatchDataSourceV2.java" -- is it
>> possible that test-case isn't up to date? Are there any other example
>> Java DSV2 readers out in the wild I could compare against?
>>
>> Thanks!
>> Andrew
>>
>> [1]
>>
>> java.lang.NullPointerException
>> at 
>> org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:264)
>> at 
>> scala.collection.mutable.FlatHashTable$class.findElemImpl(FlatHashTable.scala:129)
>> at 
>> scala.collection.mutable.FlatHashTable$class.containsElem(FlatHashTable.scala:124)
>> at scala.collection.mutable.HashSet.containsElem(HashSet.scala:40)
>> at scala.collection.mutable.HashSet.contains(HashSet.scala:57)
>> at scala.collection.GenSetLike$class.apply(GenSetLike.scala:44)
>> at scala.collection.mutable.AbstractSet.apply(Set.scala:46)
>> at scala.collection.SeqLike$$anonfun$distinct$1.apply(SeqLike.scala:506)
>> at scala.collection.immutable.List.foreach(List.scala:392)
>> at scala.collection.SeqLike$class.distinct(SeqLike.scala:505)
>> at scala.collection.AbstractSeq.distinct(Seq.scala:41)
>> at 
>> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq$$anonfun$unique$1.apply(package.scala:147)
>> at 
>> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq$$anonfun$unique$1.apply(package.scala:147)
>> at 
>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>> at 
>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>> at 
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>> at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
>> at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
>> at 
>> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>> at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>> at 
>> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.unique(package.scala:147)
>> at 
>> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.direct$lzycompute(package.scala:152)
>> at 
>> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.direct(package.scala:151)
>> at 
>> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:229)
>> at 
>> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
>> at 
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$40.apply(Analyzer.scala:890)
>> at 
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$40.apply(Analyzer.scala:892)
>> at 
>> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
>> at 
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve(Analyzer.scala:889)
>> at 
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve$2.apply(Analyzer.scala:898)
>> at 
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve$2.apply(Analyzer.scala:898)
>> at 
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
>> at 
>> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>> at 
>> o

Re: DataSourceV2Reader Q

2019-05-21 Thread Ryan Blue
Are you sure that your schema conversion is correct? If you're running with
a recent Spark version, then that line is probably `name.hashCode()`. That
file was last updated 6 months ago, so I think it is likely that `name` is
null in your version.

On Tue, May 21, 2019 at 11:39 AM Andrew Melo  wrote:

> Hello,
>
> I'm developing a DataSourceV2 reader for the ROOT (https://root.cern/)
> file format to replace a previous DSV1 source that was in use before.
>
> I have a bare skeleton of the reader, which can properly load the
> files and pass their schema into Spark 2.4.3, but any operation on the
> resulting DataFrame (except for printSchema()) causes an NPE deep in
> the guts of Spark [1]. I'm baffled, though, since both logging
> statements and coverage say that neither planBatchInputPartitions nor
> any of the methods in my partition class are called -- the only thing
> called is readSchema and the constructors.
>
> I followed the pattern from "JavaBatchDataSourceV2.java" -- is it
> possible that test-case isn't up to date? Are there any other example
> Java DSV2 readers out in the wild I could compare against?
>
> Thanks!
> Andrew
>
> [1]
>
> java.lang.NullPointerException
> at
> org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:264)
> at
> scala.collection.mutable.FlatHashTable$class.findElemImpl(FlatHashTable.scala:129)
> at
> scala.collection.mutable.FlatHashTable$class.containsElem(FlatHashTable.scala:124)
> at scala.collection.mutable.HashSet.containsElem(HashSet.scala:40)
> at scala.collection.mutable.HashSet.contains(HashSet.scala:57)
> at scala.collection.GenSetLike$class.apply(GenSetLike.scala:44)
> at scala.collection.mutable.AbstractSet.apply(Set.scala:46)
> at scala.collection.SeqLike$$anonfun$distinct$1.apply(SeqLike.scala:506)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.SeqLike$class.distinct(SeqLike.scala:505)
> at scala.collection.AbstractSeq.distinct(Seq.scala:41)
> at
> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq$$anonfun$unique$1.apply(package.scala:147)
> at
> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq$$anonfun$unique$1.apply(package.scala:147)
> at
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
> at
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
> at
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
> at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at
> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.unique(package.scala:147)
> at
> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.direct$lzycompute(package.scala:152)
> at
> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.direct(package.scala:151)
> at
> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:229)
> at
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$40.apply(Analyzer.scala:890)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$40.apply(Analyzer.scala:892)
> at
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve(Analyzer.scala:889)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve$2.apply(Analyzer.scala:898)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve$2.apply(Analyzer.scala:898)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve(Analyzer.scala:898)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$35.apply(Analyzer.scala:958)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$35.a

Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

2019-05-21 Thread Sean Owen
Tough one. Yes it's because Hive is still 'included' with the
no-Hadoop build. I think the avro scope is on purpose in that it's
meant to use the version in the larger Hadoop installation it will run
on. But, I suspect you'll find 1.7 doesn't work. Yes, there's a rat's
nest of compatibility problems here which makes tinkering with the
versions problematic.

My instinct is to say that if Spark uses Avro directly and needs 1.8
(and probably doesn't want 1.9) then this needs to be included with
the package, not left to hadoop.deps.scope.

Does anyone who knows more about this piece know better?


On Tue, May 21, 2019 at 1:32 PM Michael Heuer  wrote:
>
> The scopes for avro-1.8.2.jar and avro-mapred-1.8.2-hadoop2.jar are different
>
> <dependency>
>   <groupId>org.apache.avro</groupId>
>   <artifactId>avro</artifactId>
>   <version>${avro.version}</version>
>   <scope>${hadoop.deps.scope}</scope>
> </dependency>
> ...
> <dependency>
>   <groupId>org.apache.avro</groupId>
>   <artifactId>avro-mapred</artifactId>
>   <version>${avro.version}</version>
>   <classifier>${avro.mapred.classifier}</classifier>
>   <scope>${hive.deps.scope}</scope>
> </dependency>
>
> What needs to be done then?  At a minimum, something should be added to the 
> release notes for 2.4.3 to say that the 
> spark-2.4.3-bin-without-hadoop-scala-2.12 binary distribution is incompatible 
> with Hadoop 2.7.7 (and perhaps earlier and later versions, I haven't 
> confirmed).
>
> Note that Avro 1.9.0 was just released, with many binary and source 
> incompatibilities compared to 1.8.2, so this problem may soon be getting 
> worse, unless all of Parquet, Hadoop, Hive, and Spark can all make the move 
> simultaneously.
>
>michael
>
>
> On May 20, 2019, at 5:03 PM, Koert Kuipers  wrote:
>
> it's somewhat weird because avro-mapred-1.8.2-hadoop2.jar is included in the 
> hadoop-provided distro, but avro-1.8.2.jar is not. I tried to fix it but I am 
> not too familiar with the pom file.
>
> regarding jline, you only run into this if you use spark-shell (and it isn't 
> always reproducible, it seems). see SPARK-25783
> best,
> koert
>
>
>
>
> On Mon, May 20, 2019 at 5:43 PM Sean Owen  wrote:
>>
>> Re: 1), I think we tried to fix that on the build side and it requires
>> flags that not all tar versions (i.e. OS X) have. But that's
>> tangential.
>>
>> I think the Avro + Parquet dependency situation is generally
>> problematic -- see JIRA for some details. But yes I'm not surprised if
>> Spark has a different version from Hadoop 2.7.x and that would cause
>> problems -- if using Avro. I'm not sure the mistake is that the JARs
>> are missing, as I think this is supposed to be a 'provided'
>> dependency, but I haven't looked into it. If there's any easy obvious
>> correction to be made there, by all means.
>>
>> Not sure what the deal is with jline... I'd expect that's in the
>> "hadoop-provided" distro? That one may be a real issue if it's
>> considered provided but isn't used that way.
>>
>>
>> On Mon, May 20, 2019 at 4:15 PM Koert Kuipers  wrote:
>> >
>> > we run it without issues on hadoop 2.6 - 2.8, off the top of my head.
>> >
>> > we however do some post-processing on the tarball:
>> > 1) we fix the ownership of the files inside the tar.gz file (should be 
>> > uid/gid 0/0, otherwise untarring by root can lead to ownership by an unknown 
>> > user).
>> > 2) we add avro-1.8.2.jar and jline-2.14.6.jar to the jars folder. I believe these 
>> > jars missing from the provided profile is simply a mistake.
>> >
>> > best,
>> > koert
>> >
>> > On Mon, May 20, 2019 at 3:37 PM Michael Heuer  wrote:
>> >>
>> >> Hello,
>> >>
>> >> Which Hadoop version or versions are compatible with Spark 2.4.3 and 
>> >> Scala 2.12?
>> >>
>> >> The binary distribution spark-2.4.3-bin-without-hadoop-scala-2.12.tgz is 
>> >> missing avro-1.8.2.jar, so when attempting to run with Hadoop 2.7.7 there 
>> >> are classpath conflicts at runtime, as Hadoop 2.7.7 includes 
>> >> avro-1.7.4.jar.
>> >>
>> >> https://issues.apache.org/jira/browse/SPARK-27781
>> >>
>> >>michael
>
>




DataSourceV2Reader Q

2019-05-21 Thread Andrew Melo
Hello,

I'm developing a DataSourceV2 reader for the ROOT (https://root.cern/)
file format to replace a previous DSV1 source that was in use before.

I have a bare skeleton of the reader, which can properly load the
files and pass their schema into Spark 2.4.3, but any operation on the
resulting DataFrame (except for printSchema()) causes an NPE deep in
the guts of Spark [1]. I'm baffled, though, since both logging
statements and coverage say that neither planBatchInputPartitions nor
any of the methods in my partition class are called -- the only thing
called is readSchema and the constructors.

I followed the pattern from "JavaBatchDataSourceV2.java" -- is it
possible that test-case isn't up to date? Are there any other example
Java DSV2 readers out in the wild I could compare against?

Thanks!
Andrew

[1]

java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:264)
at 
scala.collection.mutable.FlatHashTable$class.findElemImpl(FlatHashTable.scala:129)
at 
scala.collection.mutable.FlatHashTable$class.containsElem(FlatHashTable.scala:124)
at scala.collection.mutable.HashSet.containsElem(HashSet.scala:40)
at scala.collection.mutable.HashSet.contains(HashSet.scala:57)
at scala.collection.GenSetLike$class.apply(GenSetLike.scala:44)
at scala.collection.mutable.AbstractSet.apply(Set.scala:46)
at scala.collection.SeqLike$$anonfun$distinct$1.apply(SeqLike.scala:506)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.SeqLike$class.distinct(SeqLike.scala:505)
at scala.collection.AbstractSeq.distinct(Seq.scala:41)
at 
org.apache.spark.sql.catalyst.expressions.package$AttributeSeq$$anonfun$unique$1.apply(package.scala:147)
at 
org.apache.spark.sql.catalyst.expressions.package$AttributeSeq$$anonfun$unique$1.apply(package.scala:147)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.unique(package.scala:147)
at 
org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.direct$lzycompute(package.scala:152)
at 
org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.direct(package.scala:151)
at 
org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:229)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$40.apply(Analyzer.scala:890)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$40.apply(Analyzer.scala:892)
at 
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve(Analyzer.scala:889)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve$2.apply(Analyzer.scala:898)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve$2.apply(Analyzer.scala:898)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve(Analyzer.scala:898)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$35.apply(Analyzer.scala:958)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$35.apply(Analyzer.scala:958)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$cat

Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

2019-05-21 Thread Michael Heuer
The scopes for avro-1.8.2.jar and avro-mapred-1.8.2-hadoop2.jar are different

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>${avro.version}</version>
  <scope>${hadoop.deps.scope}</scope>
</dependency>
...
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>${avro.version}</version>
  <classifier>${avro.mapred.classifier}</classifier>
  <scope>${hive.deps.scope}</scope>
</dependency>

What needs to be done then?  At a minimum, something should be added to the 
release notes for 2.4.3 to say that the 
spark-2.4.3-bin-without-hadoop-scala-2.12 binary distribution is incompatible 
with Hadoop 2.7.7 (and perhaps earlier and later versions, I haven't confirmed).

Note that Avro 1.9.0 was just released, with many binary and source 
incompatibilities compared to 1.8.2, so this problem may soon be getting worse, 
unless all of Parquet, Hadoop, Hive, and Spark can all make the move 
simultaneously.

   michael


> On May 20, 2019, at 5:03 PM, Koert Kuipers  wrote:
> 
> it's somewhat weird because avro-mapred-1.8.2-hadoop2.jar is included in the 
> hadoop-provided distro, but avro-1.8.2.jar is not. I tried to fix it but I am 
> not too familiar with the pom file.
> 
> regarding jline, you only run into this if you use spark-shell (and it isn't 
> always reproducible, it seems). see SPARK-25783
> 
> best,
> koert
> 
> 
> 
> 
> On Mon, May 20, 2019 at 5:43 PM Sean Owen wrote:
> Re: 1), I think we tried to fix that on the build side and it requires
> flags that not all tar versions (i.e. OS X) have. But that's
> tangential.
> 
> I think the Avro + Parquet dependency situation is generally
> problematic -- see JIRA for some details. But yes I'm not surprised if
> Spark has a different version from Hadoop 2.7.x and that would cause
> problems -- if using Avro. I'm not sure the mistake is that the JARs
> are missing, as I think this is supposed to be a 'provided'
> dependency, but I haven't looked into it. If there's any easy obvious
> correction to be made there, by all means.
> 
> Not sure what the deal is with jline... I'd expect that's in the
> "hadoop-provided" distro? That one may be a real issue if it's
> considered provided but isn't used that way.
> 
> 
> On Mon, May 20, 2019 at 4:15 PM Koert Kuipers wrote:
> >
> > we run it without issues on hadoop 2.6 - 2.8, off the top of my head.
> >
> > we however do some post-processing on the tarball:
> > 1) we fix the ownership of the files inside the tar.gz file (should be 
> > uid/gid 0/0, otherwise untarring by root can lead to ownership by an unknown 
> > user).
> > 2) we add avro-1.8.2.jar and jline-2.14.6.jar to the jars folder. I believe these 
> > jars missing from the provided profile is simply a mistake.
> >
> > best,
> > koert
> >
> > On Mon, May 20, 2019 at 3:37 PM Michael Heuer wrote:
> >>
> >> Hello,
> >>
> >> Which Hadoop version or versions are compatible with Spark 2.4.3 and Scala 
> >> 2.12?
> >>
> >> The binary distribution spark-2.4.3-bin-without-hadoop-scala-2.12.tgz is 
> >> missing avro-1.8.2.jar, so when attempting to run with Hadoop 2.7.7 there 
> >> are classpath conflicts at runtime, as Hadoop 2.7.7 includes 
> >> avro-1.7.4.jar.
> >>
> >> https://issues.apache.org/jira/browse/SPARK-27781 
> >> 
> >>
> >>michael



RE: RDD object Out of scope.

2019-05-21 Thread Nasrulla Khan Haris
Thanks for the reply, Wenchen. I am curious as to what happens when an RDD goes out of 
scope when it is not cached.

Nasrulla

From: Wenchen Fan 
Sent: Tuesday, May 21, 2019 6:28 AM
To: Nasrulla Khan Haris 
Cc: dev@spark.apache.org
Subject: Re: RDD object Out of scope.

RDD is kind of a pointer to the actual data. Unless it's cached, we don't need 
to clean up the RDD.

On Tue, May 21, 2019 at 1:48 PM Nasrulla Khan Haris 
<nasrulla.k...@microsoft.com.invalid> wrote:
Hi Spark developers,

Can someone point out the code where RDD objects go out of scope? I found the 
ContextCleaner code, in which only persisted RDDs are cleaned up at regular 
intervals if the RDD is registered for cleanup. I have not found where the 
destructor for an RDD object is invoked. I am trying to understand when RDD 
cleanup happens when the RDD is not persisted.

Thanks in advance, appreciate your help.
Nasrulla



Re: Access to live data of cached dataFrame

2019-05-21 Thread Wenchen Fan
When you cache a dataframe, you actually cache a logical plan. That's why
re-creating the dataframe doesn't work: Spark finds out the logical plan is
cached and picks the cached data.

You need to uncache the dataframe, or go back to the SQL way:
df.createTempView("abc")
spark.table("abc").cache()
df.show // returns latest data.
spark.table("abc").show // returns cached data.
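
A slightly fuller sketch of that approach, reusing the Delta path and column name from
earlier in this thread (it assumes the Delta Lake package is on the classpath and that
/data exists):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("cached-view").getOrCreate()

val counts = spark.read.format("delta").load("/data")
  .groupBy(col("event_hour")).count()

counts.createOrReplaceTempView("event_counts")
spark.table("event_counts").cache()          // cache the view's plan, not `counts` itself

counts.show()                                // re-reads the source: live data
spark.table("event_counts").show()           // served from the cache

spark.catalog.uncacheTable("event_counts")   // release the cache when done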


On Mon, May 20, 2019 at 3:33 AM Tomas Bartalos 
wrote:

> I'm trying to re-read; however, I'm getting cached data (which is a bit
> confusing). To re-read I'm issuing:
> spark.read.format("delta").load("/data").groupBy(col("event_hour")).count
>
> The cache seems to be global, influencing new dataframes as well.
>
> So the question is: how should I re-read without losing the cached data
> (without using unpersist)?
>
> As I mentioned, with SQL it's possible: I can create a cached view, so when
> I access the original table I get live data, and when I access the view I get
> cached data.
>
> BR,
> Tomas
>
> On Fri, 17 May 2019, 8:57 pm Sean Owen,  wrote:
>
>> A cached DataFrame isn't supposed to change, by definition.
>> You can re-read each time or consider setting up a streaming source on
>> the table which provides a result that updates as new data comes in.
>>
>> On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos 
>> wrote:
>> >
>> > Hello,
>> >
>> > I have a cached dataframe:
>> >
>> >
>> spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
>> >
>> > I would like to access the "live" data for this data frame without
>> deleting the cache (using unpersist()). Whatever I do I always get the
>> cached data on subsequent queries. Even adding a new column to the query
>> doesn't help:
>> >
>> >
>> spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy",
>> lit("dummy"))
>> >
>> >
>> > I'm able to work around this using a cached SQL view, but I couldn't find
>> a pure DataFrame solution.
>> >
>> > Thank you,
>> > Tomas
>>
>


Re: RDD object Out of scope.

2019-05-21 Thread Wenchen Fan
RDD is kind of a pointer to the actual data. Unless it's cached, we don't
need to clean up the RDD.
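
A tiny Scala sketch of that point (local mode assumed): an RDD is only a lazy
description of a computation, so until it is persisted there is nothing beyond an
ordinary JVM object to clean up.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("rdd-pointer"))

val numbers = sc.parallelize(1 to 1000000)     // nothing is computed or stored yet
val squares = numbers.map(n => n.toLong * n)   // still just lineage metadata
println(squares.take(3).mkString(", "))        // action: computed on demand, not kept

// `squares` holds no cached blocks, so once it becomes unreachable the JVM's
// garbage collector reclaims it like any other object.
sc.stop()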

On Tue, May 21, 2019 at 1:48 PM Nasrulla Khan Haris
 wrote:

> Hi Spark developers,
>
>
>
> Can someone point out the code where RDD objects go out of scope? I
> found the ContextCleaner code, in which only persisted RDDs are cleaned up at
> regular intervals if the RDD is registered for cleanup. I have not found where
> the destructor for an RDD object is invoked. I am trying to understand when RDD
> cleanup happens when the RDD is not persisted.
>
>
>
> Thanks in advance, appreciate your help.
>
> Nasrulla
>
>
>