Github user kmader commented on the issue:
https://github.com/apache/spark/pull/15327
@rxin, regarding the P.S.: how would you foresee the SQL implementation for
binary support? Is there a standard method of going from byte streams to DataFrames?
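One plausible route, sketched below, is to pair the existing `binaryFiles` API with Spark SQL's `BinaryType`, which `Array[Byte]` columns map to automatically. The path, column names, and `SparkSession` setup here are illustrative assumptions, not anything proposed in the thread:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: load whole files as byte arrays, then lift them into a DataFrame.
val spark = SparkSession.builder().appName("binary-to-df").getOrCreate()
import spark.implicits._

val df = spark.sparkContext
  .binaryFiles("hdfs:///data/images")                      // RDD[(String, PortableDataStream)]
  .map { case (path, stream) => (path, stream.toArray()) } // materialize the bytes
  .toDF("path", "bytes")                                   // Array[Byte] becomes BinaryType

df.printSchema() // path: string, bytes: binary
```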
Github user kmader commented on the pull request:
https://github.com/apache/spark/pull/9417#issuecomment-153550148
@srowen @hvanhovell this is a nice improvement and more elegant than the
original approach.
As a side note, in our code base (which uses PortableDataStream
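For readers of the archive, a minimal sketch of how a `PortableDataStream` is typically consumed, assuming a live `SparkContext` named `sc`; the input path and the 4-byte header read are hypothetical:

```scala
import java.io.DataInputStream

// binaryFiles hands each file back as a (path, PortableDataStream) pair,
// so bytes are only pulled when a task actually asks for them.
val headers = sc.binaryFiles("hdfs:///data/scans")
  .map { case (path, stream) =>
    val in: DataInputStream = stream.open() // open lazily inside the task
    try {
      (path, in.readInt()) // read just a 4-byte header, not the whole file
    } finally {
      in.close()
    }
  }
```

Because `PortableDataStream` is a serializable handle rather than the data itself, the pair RDD can be shuffled or cached before any file bytes are read.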
GitHub user kmader opened a pull request:
https://github.com/apache/spark/pull/3123
Syncing up local copy
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/4Quant/spark master
Alternatively you can review and apply these changes
Github user kmader closed the pull request at:
https://github.com/apache/spark/pull/3123
Github user kmader commented on a diff in the pull request:
https://github.com/apache/spark/pull/1658#discussion_r19582168
--- Diff: core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala ---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF
Github user kmader commented on a diff in the pull request:
https://github.com/apache/spark/pull/1658#discussion_r19133684
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
@@ -220,6 +227,83 @@ class JavaSparkContext(val sc: SparkContext) extends
Github user kmader commented on the pull request:
https://github.com/apache/spark/pull/1658#issuecomment-59832070
So I made the requested changes and added a few more tests, but the tests
appear not to have run, for a reason that isn't clear:
https://amplab.cs.berkeley.edu/jenkins/job
Github user kmader commented on a diff in the pull request:
https://github.com/apache/spark/pull/1658#discussion_r18335807
--- Diff: core/src/main/scala/org/apache/spark/input/RawFileInput.scala ---
@@ -0,0 +1,221 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF
Github user kmader commented on a diff in the pull request:
https://github.com/apache/spark/pull/1658#discussion_r18267674
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -511,6 +511,67 @@ class SparkContext(config: SparkConf) extends Logging
Github user kmader commented on a diff in the pull request:
https://github.com/apache/spark/pull/1658#discussion_r18267344
--- Diff: core/src/main/scala/org/apache/spark/input/RawFileInput.scala ---
@@ -0,0 +1,221 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF
Github user kmader commented on the pull request:
https://github.com/apache/spark/pull/1658#issuecomment-55769371
Thanks @jrabary for this find; it had to do with the new method for
handling PortableDataStreams, which didn't calculate the name correctly. I
think I have it fixed
Github user kmader commented on the pull request:
https://github.com/apache/spark/pull/1658#issuecomment-54744540
Hey @mateiz, sorry, I had other projects to work on. I have made the
changes and called the new class ```PortableDataStream```
Github user kmader commented on the pull request:
https://github.com/apache/spark/pull/1658#issuecomment-52219293
Addressing the major issues brought up:
Do we need both a stream API and a byte array one? The byte array one might
be more prone to out-of-memory problems, but stream
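A sketch of the tradeoff under discussion, assuming a live `SparkContext` `sc`, the stream-based `binaryFiles` API, and deriving the byte-array flavour from it (the path is illustrative):

```scala
// Stream flavour: each record carries a PortableDataStream handle, so a
// multi-gigabyte file never has to fit in memory unless the user forces it.
val lazyFiles = sc.binaryFiles("hdfs:///data/blobs")

// Byte-array flavour: eagerly materialize every file. Simpler to consume,
// but a single oversized file can OOM an executor.
val eagerFiles = lazyFiles.map { case (path, stream) => (path, stream.toArray()) }
```

One resolution is to expose only the stream API and let callers opt into `toArray()` when they know their files are small.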
Github user kmader commented on a diff in the pull request:
https://github.com/apache/spark/pull/1658#discussion_r16177677
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -511,6 +511,67 @@ class SparkContext(config: SparkConf) extends Logging
Github user kmader commented on the pull request:
https://github.com/apache/spark/pull/1658#issuecomment-52049280
@freeman-lab looks good, I will add it to this pull request if that's OK
with you. I think my personal preference would be to keep byteFile for
standard operation
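The fixed-length-record reader under discussion corresponds to what shipped as `sc.binaryRecords`; a hedged sketch, with a made-up input file of back-to-back 8-byte little-endian doubles:

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Every record comes back as exactly recordLength bytes.
val recordLength = 8
val doubles = sc.binaryRecords("hdfs:///data/signal.bin", recordLength)
  .map(bytes => ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getDouble)
```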
Github user kmader commented on the pull request:
https://github.com/apache/spark/pull/1658#issuecomment-50700133
Thanks for the feedback; I have made the requested changes, created an
issue (https://issues.apache.org/jira/browse/SPARK-2759), and added a
dataStreamFiles to both
GitHub user kmader opened a pull request:
https://github.com/apache/spark/pull/1658
Generic Binary File Support in Spark
This adds abstract BinaryFileInputFormat and BinaryRecordReader classes
for reading data in as a byte stream and converting it to another
format
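As a rough illustration of the technique (not the PR's actual code), here is a minimal whole-file Hadoop input format in the same spirit; all class names and details below are a sketch:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, IOUtils, NullWritable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Treat each file as a single binary record.
class WholeFileBinaryInputFormat extends FileInputFormat[NullWritable, BytesWritable] {
  // A file is one record, so it must never be split across tasks.
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
    : RecordReader[NullWritable, BytesWritable] = new WholeFileRecordReader
}

class WholeFileRecordReader extends RecordReader[NullWritable, BytesWritable] {
  private var split: FileSplit = _
  private var conf: Configuration = _
  private var value: BytesWritable = _
  private var done = false

  override def initialize(inputSplit: InputSplit, context: TaskAttemptContext): Unit = {
    split = inputSplit.asInstanceOf[FileSplit]
    conf = context.getConfiguration
  }

  // Emit the file's bytes exactly once, then report end of input.
  override def nextKeyValue(): Boolean = {
    if (done) return false
    val path = split.getPath
    val in = path.getFileSystem(conf).open(path)
    try {
      val bytes = new Array[Byte](split.getLength.toInt)
      IOUtils.readFully(in, bytes, 0, bytes.length)
      value = new BytesWritable(bytes)
    } finally {
      in.close()
    }
    done = true
    true
  }

  override def getCurrentKey: NullWritable = NullWritable.get()
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (done) 1.0f else 0.0f
  override def close(): Unit = ()
}
```

Wired up through `sc.newAPIHadoopFile`, this yields (NullWritable, BytesWritable) pairs, which is essentially what a friendlier method like `binaryFiles` wraps.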