Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/22646#discussion_r229099482
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1115,9 +1126,38 @@ object SQLContext
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/22646#discussion_r229093388
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1115,9 +1126,38 @@ object SQLContext
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/22646#discussion_r224962460
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1115,9 +1126,38 @@ object SQLContext
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/22646#discussion_r223212772
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1098,12 +1099,19 @@ object SQLContext {
data: Iterator
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/22646#discussion_r223212724
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1115,8 +1123,31 @@ object SQLContext
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/22527
Thanks! I created a new PR with array, list and map support.
GitHub user michalsenkyr opened a pull request:
https://github.com/apache/spark/pull/22646
Support for nested JavaBean arrays, lists and maps in createDataFrame
## What changes were proposed in this pull request?
Continuing from #22527, this PR seeks to add support
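A hypothetical sketch of what nested collection support enables, assuming a spark-shell session (bean classes, fields and data are made up; Scala's @BeanProperty stands in for hand-written Java bean accessors):
```
import scala.beans.BeanProperty
import java.util.{Arrays => JArrays, List => JList}

// Made-up beans: Route holds a java.util.List of nested Point beans.
class Point(@BeanProperty var x: Int, @BeanProperty var y: Int) {
  def this() = this(0, 0)
}
class Route(@BeanProperty var name: String,
            @BeanProperty var points: JList[Point]) {
  def this() = this(null, null)
}

val routes: JList[Route] = JArrays.asList(
  new Route("r1", JArrays.asList(new Point(0, 0), new Point(1, 2))))

// With nested collection support, the list of nested beans should map to
// array<struct<x:int,y:int>> instead of being rejected.
val df = spark.createDataFrame(routes, classOf[Route])
df.printSchema()
```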
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/22527#discussion_r222817293
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1098,16 +1098,26 @@ object SQLContext {
data: Iterator
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/22527
@ueshin Yes. I am already working on array/list support. Will add maps as
well. It shouldn't require a rewrite now that the code is restructured, just
new cases in the pattern match. So I think
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/22527#discussion_r222415998
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1098,16 +1098,24 @@ object SQLContext {
data: Iterator
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/22527#discussion_r222415649
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1098,16 +1098,24 @@ object SQLContext {
data: Iterator
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/22527
I restructured the code in this commit to allow easier addition of
array/list support in the future.
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/22527#discussion_r222079624
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1100,13 +1101,23 @@ object SQLContext {
attrs: Seq
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/22527#discussion_r219691829
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -1100,13 +1101,24 @@ object SQLContext {
attrs: Seq
GitHub user michalsenkyr opened a pull request:
https://github.com/apache/spark/pull/22527
[SPARK-17952][SQL] Nested Java beans support in createDataFrame
## What changes were proposed in this pull request?
When constructing a DataFrame from a Java bean, using nested beans
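A hypothetical sketch of the nested-bean case, assuming a spark-shell session (Person and Address are made up for illustration):
```
import scala.beans.BeanProperty
import java.util.{Arrays => JArrays}

// Made-up beans: Address is nested inside Person.
class Address(@BeanProperty var city: String) { def this() = this(null) }
class Person(@BeanProperty var name: String,
             @BeanProperty var address: Address) { def this() = this(null, null) }

val people = JArrays.asList(new Person("Ann", new Address("Prague")))

// With nested bean support, `address` should become a struct<city:string>
// column rather than an unsupported type.
spark.createDataFrame(people, classOf[Person]).printSchema()
```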
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/20505
Yes, that is the idea. Frankly, I am not familiar enough with how the
compiler resolves all the implicit parameters to say confidently what is going
on. But here's my take:
I did
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/20505#discussion_r167437023
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
---
@@ -165,11 +165,15 @@ abstract class SQLImplicits extends
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/20505#discussion_r165903346
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
---
@@ -165,11 +165,15 @@ abstract class SQLImplicits extends
GitHub user michalsenkyr opened a pull request:
https://github.com/apache/spark/pull/20505
[SPARK-23251][SQL] Add checks for collection element Encoders
Implicit methods of `SQLImplicits` providing Encoders for collections did
not check for
Encoders for their elements
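A usage-level sketch of the behaviour described, assuming a spark-shell session with spark.implicits._ in scope:
```
import spark.implicits._

// An Encoder[Seq[Int]] can be derived because an implicit Encoder[Int] exists.
val ok = Seq(Seq(1, 2, 3)).toDS()

// With the element check in place, asking for a collection of a type that has
// no Encoder fails at compile time with a missing-implicit error instead of
// failing later at runtime:
// val bad = Seq(Seq(new Object)).toDS()   // does not compile
```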
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16986#discussion_r121266969
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/DatasetPrimitiveSuite.scala ---
@@ -258,6 +265,80 @@ class DatasetPrimitiveSuite extends
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16986#discussion_r121265890
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/DatasetPrimitiveSuite.scala ---
@@ -258,6 +265,80 @@ class DatasetPrimitiveSuite extends
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/18009#discussion_r121263319
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/DatasetPrimitiveSuite.scala ---
@@ -28,6 +28,8 @@ case class SeqClass(s: Seq[Int
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16986
So I tried to simplify the code as much as possible, removing unneeded
parameters. I must admit I am not entirely sure whether I am handling all
the data types correctly, but everything
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16986#discussion_r120010531
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
---
@@ -652,6 +653,299 @@ case class MapObjects
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16986#discussion_r120010461
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
---
@@ -652,6 +653,299 @@ case class MapObjects
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16986#discussion_r120010289
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
---
@@ -652,6 +653,299 @@ case class MapObjects
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16986#discussion_r120009967
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
---
@@ -652,6 +653,299 @@ case class MapObjects
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16986
That was because of my other PR that just got accepted. Just a matter of
appending unit tests. I resolved the conflict from the browser for now. Can rebase
later if merge commits
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16986#discussion_r117363594
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
---
@@ -329,35 +329,19 @@ object ScalaReflection extends
GitHub user michalsenkyr opened a pull request:
https://github.com/apache/spark/pull/18011
[SPARK-19089][SQL] Add support for nested sequences
## What changes were proposed in this pull request?
Replaced specific sequence encoders with generic sequence encoder to enable
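A small sketch of what nested sequence support allows, assuming a spark-shell session (the Matrix case class is made up):
```
import spark.implicits._

// With a generic sequence encoder, sequences can nest inside each other.
case class Matrix(rows: Seq[Seq[Double]])

val ds = Seq(Matrix(Seq(Seq(1.0, 2.0), Seq(3.0, 4.0)))).toDS()
ds.map(m => m.rows.flatten.sum).show()
```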
GitHub user michalsenkyr opened a pull request:
https://github.com/apache/spark/pull/18009
[SPARK-18891][SQL] Support for specific Java List subtypes
## What changes were proposed in this pull request?
Add support for specific Java `List` subtypes in deserialization as well
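A hedged sketch of the kind of usage this targets, written from Scala via Encoders.bean (the Tags bean is made up; handing back the concrete List subtype on deserialization is the behaviour this PR is said to add):
```
import org.apache.spark.sql.Encoders
import scala.beans.BeanProperty
import java.util.{Arrays => JArrays, LinkedList}

// Made-up bean whose field uses a concrete List subtype.
class Tags(@BeanProperty var tags: LinkedList[String]) { def this() = this(null) }

val enc  = Encoders.bean(classOf[Tags])
val data = JArrays.asList(new Tags(new LinkedList(JArrays.asList("a", "b"))))

// With subtype support, collect() should hand back LinkedList instances
// rather than a generic default List implementation.
spark.createDataset(data)(enc).collect()
```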
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16986
Rebased onto the current master and integrated a few minor changes from the
code review of #16541 in case anyone is still interested in this feature
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16541
@ueshin Thanks for the fix
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16541
Thanks. Made the suggested changes in my latest commit.
I also encountered a minor problem when doing final testing. When using a
collection type that is a type alias (e.g., scala.List
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16541#discussion_r106810993
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
---
@@ -589,6 +590,170 @@ case class MapObjects
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16541#discussion_r106810940
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/SequenceBenchmark.scala
---
@@ -0,0 +1,74 @@
+/*
+ * Licensed
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16541#discussion_r106810789
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
---
@@ -589,6 +590,170 @@ case class MapObjects
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16541
That seems to be the case here, yes.
What about the other benefits I mentioned (adding support for Java `List`s
and future Scala 2.13 compatibility)? I think the codegen is also more
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16541
Well, technically yes. But I would say it's a little more than that.
The current approach to deserialization of `Seq`s is to copy the data into
an array, construct a `WrappedArray
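A rough plain-Scala sketch of the shape described above (this is not the generated code, only an equivalent of what it does):
```
import scala.collection.mutable

// Current approach: values are first copied into an Array, wrapped, and then
// converted once more into the requested Seq type.
val values  = Array(1, 2, 3)                          // copied out of the row
val wrapped = mutable.WrappedArray.make[Int](values)  // wrap the array
val asList  = wrapped.toList                          // extra per-type conversion
```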
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16541
Also please note the [UnsafeArrayData-producing
branch](https://github.com/michalsenkyr/spark/compare/dataset-seq-builder...michalsenkyr:dataset-seq-builder-unsafe)
that is not yet merged
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16541
Would it be possible for somebody to review this PR for me? I have a few
ideas that depend on this and I'd like to get to work on them, most notably
support for Java Lists.
Maybe
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16986
Added support for Java Maps, with pre-allocation (capacity argument on the
constructor) and sensible defaults for interfaces/abstract classes.
Also includes implicit encoders
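A hypothetical helper illustrating the capacity-constructor idea (the helper and its name are made up; the real change lives in the generated deserializer code):
```
import java.util.{HashMap => JHashMap}

// Prefer the capacity constructor when the concrete Map class has one,
// otherwise fall back to the no-arg constructor.
def newJavaMap(cls: Class[_], capacity: Int): AnyRef =
  try {
    cls.getConstructor(classOf[Int])
      .newInstance(Integer.valueOf(capacity)).asInstanceOf[AnyRef]
  } catch {
    case _: NoSuchMethodException =>
      cls.getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
  }

// e.g. pre-allocate a HashMap for 16 entries:
val m = newJavaMap(classOf[JHashMap[_, _]], 16)
```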
GitHub user michalsenkyr opened a pull request:
https://github.com/apache/spark/pull/16986
[SPARK-18891][SQL] Support for Map collection types
## What changes were proposed in this pull request?
Add support for arbitrary Scala `Map` types in deserialization as well
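A sketch of what arbitrary Map support enables, assuming this PR's encoders are in place (the Counts case class is made up):
```
import spark.implicits._
import scala.collection.mutable

// A mutable Map field inside a case class round-trips through a Dataset
// as a MapType column.
case class Counts(byKey: mutable.HashMap[String, Int])

val ds = Seq(Counts(mutable.HashMap("a" -> 1, "b" -> 2))).toDS()
ds.map(c => c.byKey.values.sum).collect()
```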
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16541
Apologies for taking so long.
I tried modifying the serialization logic as best as I could to serialize
into `UnsafeArrayData` ([branch
diff](https://github.com/michalsenkyr/spark
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16541
I added the benchmarks based on the code you provided but I am getting
almost the same results before and after the optimization (see description). So
either the added benefit is really small
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16541
Added benchmarks.
I didn't find any standardized way of benchmarking codegen so I wrote a
simple script for Spark Shell. Benchmarks were run on a laptop so the
collections couldn't
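Not the author's script; a minimal sketch of the kind of spark-shell timing loop described, with made-up sizes:
```
import spark.implicits._

def time[A](label: String)(body: => A): A = {
  val start  = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
  result
}

// Build a cached Dataset of sequences, then time only the deserialization.
val ds = (1 to 100000).map(i => Seq.fill(10)(i)).toDS().cache()
ds.count()                               // materialize the cache first
time("collect Seq[Int]")(ds.collect())
```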
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16546#discussion_r95927063
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
---
@@ -120,6 +120,32 @@ object ScalaReflection extends
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16541#discussion_r95923927
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
---
@@ -589,6 +590,171 @@ case class MapObjects
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16541#discussion_r95921320
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
---
@@ -589,6 +590,171 @@ case class MapObjects
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16541#discussion_r95909808
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
---
@@ -589,6 +590,171 @@ case class MapObjects
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16541
Added codegen comparison for a simple `List` dataset.
I will also prepare a benchmark and add some results later. Those will be
for `List`, `mutable.Queue` and `Seq`. Where `List
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16541#discussion_r95667424
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
---
@@ -589,6 +590,171 @@ case class MapObjects
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16541
Also, the new `CollectObjects` copies quite a bit of code from
`MapObjects`. Should I move the code into a common trait in order to reduce
duplicity or should I leave it as is?
GitHub user michalsenkyr opened a pull request:
https://github.com/apache/spark/pull/16541
[SPARK-19088][SQL] Optimize sequence type deserialization codegen
## What changes were proposed in this pull request?
Optimization of arbitrary Scala sequence deserialization
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16240
Not sure how to run MiMa tests locally so I tried my best to figure out
what was necessary. Hope this fixes it.
The downside of the fix is that I had to restore the original methods
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16240#discussion_r94504343
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/DatasetPrimitiveSuite.scala ---
@@ -130,6 +130,30 @@ class DatasetPrimitiveSuite extends
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16240#discussion_r93816269
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
---
@@ -100,31 +97,36 @@ abstract class SQLImplicits {
// Seqs
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16240#discussion_r93807181
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
---
@@ -312,12 +312,46 @@ object ScalaReflection extends
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16240#discussion_r93805236
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
---
@@ -100,31 +97,36 @@ abstract class SQLImplicits {
// Seqs
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16240
I actually read that but IDEA complained when I tried to place the
`Product` encoder into a separate trait. So I opted for specificity.
However, I tried it again right now and even though
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16329#discussion_r93682192
--- Diff:
examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedTypedAggregation.scala
---
@@ -0,0 +1,87 @@
+/*
+ * Licensed
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16329
If you are having trouble building Javadoc, try switching to Java 7
temporarily. Java 8 introduced stricter Javadoc rules that may fail the docs
build. Unfortunately Jenkins doesn't, so new
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16240
None of them. The compilation will fail. That is why I had to provide those
additional implicits.
```
scala> class Test[T]
defined class Test
scala> implicit def
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16157
Sorry for the delay. You are probably right that the partitioning is
primarily determined by data locality and that it is therefore appropriate in
some cases and shouldn't be worded
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16240
Possible optimization: Instead of conversions using `to`, we can use
`Builder`s. This way we could get rid of the conversion overhead. This would
require adding a new codegen method that would
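A plain-Scala sketch contrasting the two shapes (not the generated code):
```
import scala.collection.mutable

val source = Array(1, 2, 3, 4)

// Conversion via `to`: the data already sits in an intermediate array that
// then gets copied again into the target collection.
val viaTo: mutable.Queue[Int] = source.to[mutable.Queue]

// Builder-based: append elements straight into a builder for the target type.
val builder = mutable.Queue.newBuilder[Int]
builder.sizeHint(source.length)
source.foreach(builder += _)
val viaBuilder: mutable.Queue[Int] = builder.result()
```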
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16240
Added support for arbitrary sequences.
Now also Queues, ArrayBuffers and such can be used in datasets (all are
serialized into ArrayType).
I had to alter and add new implicit
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16240
I would like to add that the conversion is specific to `List[_]`. I can add
support for arbitrary sequence types through the use of `CanBuildFrom` if it is
desirable.
We can also
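A hypothetical helper showing how CanBuildFrom generalizes the List-specific conversion (pre-2.13 Scala; the helper and its name are made up):
```
import scala.collection.generic.CanBuildFrom

// The caller's expected type picks the right builder via CanBuildFrom.
def buildSeq[T, C <: Seq[T]](elems: Array[T])
                            (implicit cbf: CanBuildFrom[Nothing, T, C]): C = {
  val builder = cbf()
  builder.sizeHint(elems.length)
  elems.foreach(builder += _)
  builder.result()
}

val asList:   List[Int]   = buildSeq(Array(1, 2, 3))
val asVector: Vector[Int] = buildSeq(Array(1, 2, 3))
```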
GitHub user michalsenkyr opened a pull request:
https://github.com/apache/spark/pull/16240
[SPARK-16792][SQL] Dataset containing a Case Class with a List type causes
a CompileException (converting sequence to list)
## What changes were proposed in this pull request?
Added
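A minimal illustration of the reported scenario, assuming a spark-shell session (the Container case class is made up):
```
import spark.implicits._

case class Container(items: List[Int])

// Before the fix, deserializing back into List[Int] triggered a
// CompileException in the generated code; with it, this round-trips fine.
val ds = Seq(Container(List(1, 2, 3))).toDS()
ds.collect()
```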
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16157#discussion_r91716176
--- Diff: docs/programming-guide.md ---
@@ -347,7 +347,7 @@ Some notes on reading files with Spark:
Apart from text files, Spark's Scala API
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16201#discussion_r91429780
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
---
@@ -81,7 +81,7 @@ class DecisionTreeClassifier
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16201
This time I inspected both the generated Javadoc and Scaladoc. It should be
fine now.
GitHub user michalsenkyr opened a pull request:
https://github.com/apache/spark/pull/16201
[SPARK-3359][DOCS] Fix greater-than symbols in Javadoc to allow building
with Java 8
## What changes were proposed in this pull request?
The API documentation build was failing when
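An illustrative doc comment showing the kind of change involved (the example text is made up): Java 8's doclint rejects a raw ">" in the generated Javadoc, so it has to be written as an HTML entity.
```
/** Maximum depth of the tree (&gt;= 0). Written with an entity because a raw
  * ">" in this comment would fail the Java 8 Javadoc (doclint) build.
  */
val maxDepth: Int = 5
```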
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16157#discussion_r91405625
--- Diff: docs/programming-guide.md ---
@@ -347,7 +347,7 @@ Some notes on reading files with Spark:
Apart from text files, Spark's Scala API
Github user michalsenkyr commented on the issue:
https://github.com/apache/spark/pull/16157
I added a few more sentences describing the cases in which the user might
want to use the argument. However, I am afraid this might be a little too
descriptive.
Github user michalsenkyr commented on a diff in the pull request:
https://github.com/apache/spark/pull/16157#discussion_r90975179
--- Diff: docs/programming-guide.md ---
@@ -347,7 +347,7 @@ Some notes on reading files with Spark:
Apart from text files, Spark's Scala API
GitHub user michalsenkyr opened a pull request:
https://github.com/apache/spark/pull/16157
[SPARK-18723][DOC] Expanded programming guide information on wholeTex…
## What changes were proposed in this pull request?
Add additional information to wholeTextFiles