[GitHub] spark issue #17311: [SPARK-19970][SQL] Table owner should be USER instead of...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/17311 The difference is due to https://github.com/apache/spark/commit/3881f342b49efdb1e0d5ee27f616451ea1928c5d#diff-6fd847124f8eae45ba2de1cf7d6296feR855 . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/17295#discussion_r106962007 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala --- @@ -34,6 +34,8 @@ import org.apache.spark.util.{ShutdownHookManager, Utils} */ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolean) extends Logging { + private val METADATA_FILE_SUFFIX = ".meta" --- End diff -- Hmm, good point... there's currently no metadata kept in the `DiskStore` class, but then this shouldn't be a lot of data. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/17295#discussion_r106962403 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala --- @@ -17,48 +17,61 @@ package org.apache.spark.storage -import java.io.{FileOutputStream, IOException, RandomAccessFile} +import java.io._ import java.nio.ByteBuffer +import java.nio.channels.{Channels, ReadableByteChannel, WritableByteChannel} import java.nio.channels.FileChannel.MapMode +import java.nio.charset.StandardCharsets.UTF_8 -import com.google.common.io.Closeables +import scala.collection.mutable.ListBuffer -import org.apache.spark.SparkConf +import com.google.common.io.{ByteStreams, Closeables, Files} +import io.netty.channel.FileRegion +import io.netty.util.AbstractReferenceCounted + +import org.apache.spark.{SecurityManager, SparkConf} import org.apache.spark.internal.Logging -import org.apache.spark.util.Utils +import org.apache.spark.network.buffer.ManagedBuffer +import org.apache.spark.security.CryptoStreamUtils +import org.apache.spark.util.{ByteBufferInputStream, Utils} import org.apache.spark.util.io.ChunkedByteBuffer /** * Stores BlockManager blocks on disk. */ -private[spark] class DiskStore(conf: SparkConf, diskManager: DiskBlockManager) extends Logging { +private[spark] class DiskStore( +conf: SparkConf, +diskManager: DiskBlockManager, +securityManager: SecurityManager) extends Logging { private val minMemoryMapBytes = conf.getSizeAsBytes("spark.storage.memoryMapThreshold", "2m") def getSize(blockId: BlockId): Long = { -diskManager.getFile(blockId.name).length +val file = diskManager.getMetadataFile(blockId) +Files.toString(file, UTF_8).toLong } /** * Invokes the provided callback function to write the specific block. * * @throws IllegalStateException if the block already exists in the disk store. */ - def put(blockId: BlockId)(writeFunc: FileOutputStream => Unit): Unit = { + def put(blockId: BlockId)(writeFunc: WritableByteChannel => Unit): Unit = { if (contains(blockId)) { throw new IllegalStateException(s"Block $blockId is already present in the disk store") } logDebug(s"Attempting to put block $blockId") val startTime = System.currentTimeMillis val file = diskManager.getFile(blockId) -val fileOutputStream = new FileOutputStream(file) +val out = new CountingWritableChannel(openForWrite(file)) var threwException: Boolean = true try { - writeFunc(fileOutputStream) + writeFunc(out) + Files.write(out.getCount().toString(), diskManager.getMetadataFile(blockId), UTF_8) threwException = false } finally { try { -Closeables.close(fileOutputStream, threwException) +Closeables.close(out, threwException) --- End diff -- This was the previous behavior, but well, doesn't hurt to fix it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14617: [SPARK-17019][Core] Expose on-heap and off-heap memory u...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/14617 while I kind of like the hover because it doesn't clutter the page, it does bring up a couple concerns: - user can't sort by them - user might not know to hover (none of the other pages that I know of show useful values like this) The latter one is perhaps the one that is more concerning. Unless we can highlight it or somehow draw the users attention to it they will not know its there. But the first one is also pretty valid. If I have thousands of executors and looking for one that is out of off heap memory, its pretty difficult to hover over all of them. I don't see any changes to the Storage UI page, do we want to add something there as well? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17274: [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFi...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17274#discussion_r106963651 --- Diff: R/pkg/inst/tests/testthat/test_context.R --- @@ -177,6 +177,12 @@ test_that("add and get file to be downloaded with Spark job on every node", { spark.addFile(path) download_path <- spark.getSparkFiles(filename) expect_equal(readLines(download_path), words) + + # Test spark.getSparkFiles works well on executors. + seq <- seq(from = 1, to = 10, length.out = 5) + f <- function(seq) { spark.getSparkFiles(filename) } + spark.lapply(seq, f) --- End diff -- I think we should at least check the return value from spark.lapply --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/17295#discussion_r106964049 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala --- @@ -73,55 +86,219 @@ private[spark] class DiskStore(conf: SparkConf, diskManager: DiskBlockManager) e } def putBytes(blockId: BlockId, bytes: ChunkedByteBuffer): Unit = { -put(blockId) { fileOutputStream => - val channel = fileOutputStream.getChannel - Utils.tryWithSafeFinally { -bytes.writeFully(channel) - } { -channel.close() - } +put(blockId) { channel => + bytes.writeFully(channel) } } - def getBytes(blockId: BlockId): ChunkedByteBuffer = { + def getBytes(blockId: BlockId): BlockData = { val file = diskManager.getFile(blockId.name) -val channel = new RandomAccessFile(file, "r").getChannel -Utils.tryWithSafeFinally { - // For small files, directly read rather than memory map - if (file.length < minMemoryMapBytes) { -val buf = ByteBuffer.allocate(file.length.toInt) -channel.position(0) -while (buf.remaining() != 0) { - if (channel.read(buf) == -1) { -throw new IOException("Reached EOF before filling buffer\n" + - s"offset=0\nfile=${file.getAbsolutePath}\nbuf.remaining=${buf.remaining}") +val blockSize = getSize(blockId) + +securityManager.getIOEncryptionKey() match { + case Some(key) => +// Encrypted blocks cannot be memory mapped; return a special object that does decryption +// and provides InputStream / FileRegion implementations for reading the data. +new EncryptedBlockData(file, blockSize, conf, key) + + case _ => +val channel = new FileInputStream(file).getChannel() +if (blockSize < minMemoryMapBytes) { + // For small files, directly read rather than memory map. + Utils.tryWithSafeFinally { +val buf = ByteBuffer.allocate(blockSize.toInt) +while (buf.remaining() > 0) { + channel.read(buf) +} +buf.flip() +new ByteBufferBlockData(new ChunkedByteBuffer(buf)) + } { +channel.close() + } +} else { + Utils.tryWithSafeFinally { +new ByteBufferBlockData( + new ChunkedByteBuffer(channel.map(MapMode.READ_ONLY, 0, file.length))) + } { +channel.close() } } -buf.flip() -new ChunkedByteBuffer(buf) - } else { -new ChunkedByteBuffer(channel.map(MapMode.READ_ONLY, 0, file.length)) - } -} { - channel.close() } } def remove(blockId: BlockId): Boolean = { val file = diskManager.getFile(blockId.name) -if (file.exists()) { - val ret = file.delete() - if (!ret) { -logWarning(s"Error deleting ${file.getPath()}") +val meta = diskManager.getMetadataFile(blockId) + +def delete(f: File): Boolean = { + if (f.exists()) { +val ret = f.delete() +if (!ret) { + logWarning(s"Error deleting ${file.getPath()}") +} + +ret + } else { +false } - ret -} else { - false } + +delete(file) & delete(meta) } def contains(blockId: BlockId): Boolean = { val file = diskManager.getFile(blockId.name) file.exists() } + + private def openForWrite(file: File): WritableByteChannel = { +val out = new FileOutputStream(file).getChannel() +try { + securityManager.getIOEncryptionKey().map { key => +CryptoStreamUtils.createWritableChannel(out, conf, key) + }.getOrElse(out) +} catch { + case e: Exception => +out.close() --- End diff -- There might be exceptions specific for the commons-crypto library being thrown. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/17295#discussion_r106965268 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala --- @@ -73,55 +86,219 @@ private[spark] class DiskStore(conf: SparkConf, diskManager: DiskBlockManager) e } def putBytes(blockId: BlockId, bytes: ChunkedByteBuffer): Unit = { -put(blockId) { fileOutputStream => - val channel = fileOutputStream.getChannel - Utils.tryWithSafeFinally { -bytes.writeFully(channel) - } { -channel.close() - } +put(blockId) { channel => + bytes.writeFully(channel) } } - def getBytes(blockId: BlockId): ChunkedByteBuffer = { + def getBytes(blockId: BlockId): BlockData = { val file = diskManager.getFile(blockId.name) -val channel = new RandomAccessFile(file, "r").getChannel -Utils.tryWithSafeFinally { - // For small files, directly read rather than memory map - if (file.length < minMemoryMapBytes) { -val buf = ByteBuffer.allocate(file.length.toInt) -channel.position(0) -while (buf.remaining() != 0) { - if (channel.read(buf) == -1) { -throw new IOException("Reached EOF before filling buffer\n" + - s"offset=0\nfile=${file.getAbsolutePath}\nbuf.remaining=${buf.remaining}") +val blockSize = getSize(blockId) + +securityManager.getIOEncryptionKey() match { + case Some(key) => +// Encrypted blocks cannot be memory mapped; return a special object that does decryption +// and provides InputStream / FileRegion implementations for reading the data. +new EncryptedBlockData(file, blockSize, conf, key) + + case _ => +val channel = new FileInputStream(file).getChannel() +if (blockSize < minMemoryMapBytes) { + // For small files, directly read rather than memory map. + Utils.tryWithSafeFinally { +val buf = ByteBuffer.allocate(blockSize.toInt) +while (buf.remaining() > 0) { + channel.read(buf) +} +buf.flip() +new ByteBufferBlockData(new ChunkedByteBuffer(buf)) + } { +channel.close() + } +} else { + Utils.tryWithSafeFinally { +new ByteBufferBlockData( + new ChunkedByteBuffer(channel.map(MapMode.READ_ONLY, 0, file.length))) + } { +channel.close() } } -buf.flip() -new ChunkedByteBuffer(buf) - } else { -new ChunkedByteBuffer(channel.map(MapMode.READ_ONLY, 0, file.length)) - } -} { - channel.close() } } def remove(blockId: BlockId): Boolean = { val file = diskManager.getFile(blockId.name) -if (file.exists()) { - val ret = file.delete() - if (!ret) { -logWarning(s"Error deleting ${file.getPath()}") +val meta = diskManager.getMetadataFile(blockId) + +def delete(f: File): Boolean = { + if (f.exists()) { +val ret = f.delete() +if (!ret) { + logWarning(s"Error deleting ${file.getPath()}") +} + +ret + } else { +false } - ret -} else { - false } + +delete(file) & delete(meta) } def contains(blockId: BlockId): Boolean = { val file = diskManager.getFile(blockId.name) file.exists() } + + private def openForWrite(file: File): WritableByteChannel = { +val out = new FileOutputStream(file).getChannel() +try { + securityManager.getIOEncryptionKey().map { key => +CryptoStreamUtils.createWritableChannel(out, conf, key) + }.getOrElse(out) +} catch { + case e: Exception => +out.close() +throw e +} + } + +} + +private class EncryptedBlockData( +file: File, +blockSize: Long, +conf: SparkConf, +key: Array[Byte]) extends BlockData { + + override def toInputStream(): InputStream = Channels.newInputStream(open()) + + override def toManagedBuffer(): ManagedBuffer = new EncryptedManagedBuffer() + + override def toByteBuffer(allocator: Int => ByteBuffer): ChunkedByteBuffer = { +val source = open() +try { + var remaining = blockSize + val chunks = new ListBuffer[ByteBuffer]() + while (remaining > 0) { +val chunkSize = math.min(remaining, Int.MaxValue) +val chunk = allocator(chunkSi
[GitHub] spark pull request #17274: [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFi...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17274#discussion_r106963159 --- Diff: R/pkg/R/context.R --- @@ -345,7 +351,7 @@ spark.getSparkFilesRootDirectory <- function() { #'} #' @note spark.getSparkFiles since 2.1.0 spark.getSparkFiles <- function(fileName) { - callJStatic("org.apache.spark.SparkFiles", "get", as.character(fileName)) + file.path(spark.getSparkFilesRootDirectory(), as.character(fileName)) --- End diff -- perhaps to check, does getSparkFilesRootDirectory returns something that file.path can handle, for instance, `file://` wouldn't work properly --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17274: [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFi...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17274#discussion_r106963388 --- Diff: R/pkg/R/context.R --- @@ -345,7 +351,7 @@ spark.getSparkFilesRootDirectory <- function() { #'} #' @note spark.getSparkFiles since 2.1.0 spark.getSparkFiles <- function(fileName) { - callJStatic("org.apache.spark.SparkFiles", "get", as.character(fileName)) + file.path(spark.getSparkFilesRootDirectory(), as.character(fileName)) --- End diff -- or that fileName can be something file.path doesn't handle but SparkFiles.get() can? like an absolute path? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17362: [SPARK-20033][SQL] support hive permanent function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17362 **[Test build #74890 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74890/testReport)** for PR 17362 at commit [`9325a7c`](https://github.com/apache/spark/commit/9325a7c2a4ce9d0db6c39d7781e517082a2fe4be). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/17295#discussion_r106963546 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala --- @@ -73,55 +86,219 @@ private[spark] class DiskStore(conf: SparkConf, diskManager: DiskBlockManager) e } def putBytes(blockId: BlockId, bytes: ChunkedByteBuffer): Unit = { -put(blockId) { fileOutputStream => - val channel = fileOutputStream.getChannel - Utils.tryWithSafeFinally { -bytes.writeFully(channel) - } { -channel.close() - } +put(blockId) { channel => + bytes.writeFully(channel) --- End diff -- The channel is owned by the code in the `put` method, which does that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17363: [SPARK-19970][SQL][BRANCH-2.1] Table owner should...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/17363 [SPARK-19970][SQL][BRANCH-2.1] Table owner should be USER instead of PRINCIPAL in kerberized clusters ## What changes were proposed in this pull request? In the kerberized hadoop cluster, when Spark creates tables, the owner of tables are filled with PRINCIPAL strings instead of USER names. This is inconsistent with Hive and causes problems when using [ROLE](https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization) in Hive. We had better to fix this. **BEFORE** ```scala scala> sql("create table t(a int)").show scala> sql("desc formatted t").show(false) ... |Owner: |sp...@example.com | | ``` **AFTER** ```scala scala> sql("create table t(a int)").show scala> sql("desc formatted t").show(false) ... |Owner: |spark | | ``` ## How was this patch tested? Manually do `create table` and `desc formatted` because this happens in Kerberized clusters. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dongjoon-hyun/spark SPARK-19970-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17363.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17363 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17363: [SPARK-19970][SQL][BRANCH-2.1] Table owner should be USE...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17363 **[Test build #74891 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74891/testReport)** for PR 17363 at commit [`1328f1d`](https://github.com/apache/spark/commit/1328f1d204991f4b4a993a523bd01584bab3122c). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17326: [SPARK-19985][ML] Fixed copy method for some ML M...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/17326#discussion_r106969492 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala --- @@ -74,6 +74,7 @@ class MultilayerPerceptronClassifierSuite .setMaxIter(100) .setSolver("l-bfgs") val model = trainer.fit(dataset) +MLTestingUtils.checkCopy(model) --- End diff -- Thanks @hhbyyh , I agree that this is maybe not the best general way to test it, but this is how it is done in every other `Model` test. Maybe this is ok for now and we could follow up with a discussion on the best way to check "basic" ML functionality? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17357: [SPARK-20025][CORE] Fix spark's driver failover mechanis...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/17357 Please change the title of this PR. "Fixed foo" is nearly useless when scanning the commit log in the future since it doesn't tell us anything about either the nature of the problem or the actual code change. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17321: [SPARK-19899][ML] Replace featuresCol with itemsC...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17321#discussion_r106970923 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -37,7 +37,20 @@ import org.apache.spark.sql.types._ /** * Common params for FPGrowth and FPGrowthModel */ -private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with HasPredictionCol { +private[fpm] trait FPGrowthParams extends Params with HasPredictionCol { + + /** + * Items column name. + * Default: "items" + * @group param + */ + @Since("2.2.0") --- End diff -- I think it's OK to have it here. Only FPGrowth and FPGrowthModel will ever inherit from FPGrowthParams. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17287: [SPARK-19945][SQL]add test suite for SessionCatalog with...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17287 I addressed these comments in the PR https://github.com/apache/spark/pull/17354 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17354: [SPARK-20024] [SQL] [test-maven] SessionCatalog API setC...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17354 **[Test build #74892 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74892/testReport)** for PR 17354 at commit [`31f1099`](https://github.com/apache/spark/commit/31f1099e4c926cbfdf0eb74e999be051e7257d17). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17321: [SPARK-19899][ML] Replace featuresCol with itemsCol in m...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/17321 LGTM Thanks for the PR! Merging with master --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17339: [SPARK-20010][SQL] Sort information is lost after sort m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17339 **[Test build #74888 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74888/testReport)** for PR 17339 at commit [`9d2ab54`](https://github.com/apache/spark/commit/9d2ab549f19427ffc71ef9cef0517c080b71e31e). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class SortOrder(` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17339: [SPARK-20010][SQL] Sort information is lost after sort m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17339 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17321: [SPARK-19899][ML] Replace featuresCol with itemsC...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17321 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17339: [SPARK-20010][SQL] Sort information is lost after sort m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17339 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74888/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17293: [SPARK-19950][SQL] Fix to ignore nullable when df.load()...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/17293 @HyukjinKwon I added data validation using schema information for Parquet Reader. Could you please take a look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17363: [SPARK-19970][SQL][BRANCH-2.1] Table owner should...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/17363#discussion_r106973465 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -828,7 +828,7 @@ private[hive] class HiveClientImpl( hiveTable.setFields(schema.asJava) } hiveTable.setPartCols(partCols.asJava) -hiveTable.setOwner(conf.getUser) +hiveTable.setOwner(state.getAuthenticator().getUserName()) --- End diff -- @vanzin . I made a backport for branch-2.1 and tested in kerberized cluster. This one uses `state` as you advised. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14124: [SPARK-16472][SQL] Force user specified schema to the nu...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/14124 #17293 added data validation using schema information for Parquet Reader. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17343 **[Test build #74893 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74893/testReport)** for PR 17343 at commit [`5fe279e`](https://github.com/apache/spark/commit/5fe279ea3b5a70206be71ff9a5ebf2175cf0f8d1). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14617: [SPARK-17019][Core] Expose on-heap and off-heap memory u...
Github user squito commented on the issue: https://github.com/apache/spark/pull/14617 good points @tgravescs . What about making them additional metrics, turned on by a checkbox, like the extra task metrics? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17342: [SPARK-12868][SQL] Allow adding jars from hdfs
Github user weiqingy commented on a diff in the pull request: https://github.com/apache/spark/pull/17342#discussion_r106977152 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala --- @@ -148,6 +149,8 @@ private[sql] class SharedState(val sparkContext: SparkContext) extends Logging { object SharedState { + URL.setURLStreamHandlerFactory(new SparkUrlStreamHandlerFactory()) --- End diff -- `URL#setURLStreamHandlerFactory` can only be called once per JVM. If we use `FsUrlStreamHandlerFactory` directly, we won't be able to support other factories. I wrapped it for future extendability. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17342: [SPARK-12868][SQL] Allow adding jars from hdfs
Github user weiqingy commented on a diff in the pull request: https://github.com/apache/spark/pull/17342#discussion_r106977805 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2767,3 +2767,24 @@ private[spark] class CircularBuffer(sizeInBytes: Int = 10240) extends java.io.Ou new String(nonCircularBuffer, StandardCharsets.UTF_8) } } + + +/** + * Factory for URL stream handlers. It relies on 'protocol' to choose the appropriate + * UrlStreamHandlerFactory to create URLStreamHandler. Adding new 'if' branches in + * 'createURLStreamHandler' like 'hdfsHandler' to support more protocols. + */ +private[spark] class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory { + private var hdfsHandler : URLStreamHandler = _ + + def createURLStreamHandler(protocol: String): URLStreamHandler = { +if (protocol.compareToIgnoreCase("hdfs") == 0) { --- End diff -- Yeah, that's a good point. I'll check with Hadoop for all supported file systems, and ideally if we can get them via some API. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17342: [SPARK-12868][SQL] Allow adding jars from hdfs
Github user weiqingy commented on a diff in the pull request: https://github.com/apache/spark/pull/17342#discussion_r106978348 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2767,3 +2767,24 @@ private[spark] class CircularBuffer(sizeInBytes: Int = 10240) extends java.io.Ou new String(nonCircularBuffer, StandardCharsets.UTF_8) } } + + +/** + * Factory for URL stream handlers. It relies on 'protocol' to choose the appropriate + * UrlStreamHandlerFactory to create URLStreamHandler. Adding new 'if' branches in + * 'createURLStreamHandler' like 'hdfsHandler' to support more protocols. + */ +private[spark] class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory { + private var hdfsHandler : URLStreamHandler = _ + + def createURLStreamHandler(protocol: String): URLStreamHandler = { +if (protocol.compareToIgnoreCase("hdfs") == 0) { + if (hdfsHandler == null) { +hdfsHandler = new FsUrlStreamHandlerFactory().createURLStreamHandler(protocol) --- End diff -- Thanks, I'll follow your suggestion. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17362: [SPARK-20033][SQL] support hive permanent function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17362 **[Test build #74890 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74890/testReport)** for PR 17362 at commit [`9325a7c`](https://github.com/apache/spark/commit/9325a7c2a4ce9d0db6c39d7781e517082a2fe4be). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17362: [SPARK-20033][SQL] support hive permanent function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17362 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17362: [SPARK-20033][SQL] support hive permanent function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17362 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74890/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17170 Jenkins retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17218: [SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrow...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17218 Jenkins retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17170 **[Test build #74896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74896/testReport)** for PR 17170 at commit [`706514d`](https://github.com/apache/spark/commit/706514da26107ef25bef028e2143fa0a09e5cc19). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17361: [SPARK-20030][SS][WIP]Event-time-based timeout for MapGr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17361 **[Test build #74894 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74894/testReport)** for PR 17361 at commit [`6e9f408`](https://github.com/apache/spark/commit/6e9f408d73090429d7497840d6daa7a4e19439e6). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17218: [SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrow...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17218 **[Test build #74895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74895/testReport)** for PR 17218 at commit [`0a3798d`](https://github.com/apache/spark/commit/0a3798d906e1341303b6872d44d5ce68d853aae4). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17218: [SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrow...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17218 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17218: [SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrow...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17218 **[Test build #74895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74895/testReport)** for PR 17218 at commit [`0a3798d`](https://github.com/apache/spark/commit/0a3798d906e1341303b6872d44d5ce68d853aae4). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class HasItemsCol(Params):` * `class FPGrowthModel(JavaModel, JavaMLWritable, JavaMLReadable):` * `class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol,` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17218: [SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrow...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17218 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74895/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17343 @rxin - Updated documentation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17354: [SPARK-20024] [SQL] [test-maven] SessionCatalog API setC...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17354 **[Test build #74892 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74892/testReport)** for PR 17354 at commit [`31f1099`](https://github.com/apache/spark/commit/31f1099e4c926cbfdf0eb74e999be051e7257d17). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17354: [SPARK-20024] [SQL] [test-maven] SessionCatalog API setC...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17354 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17354: [SPARK-20024] [SQL] [test-maven] SessionCatalog API setC...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17354 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74892/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17343 **[Test build #74897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74897/testReport)** for PR 17343 at commit [`06c1909`](https://github.com/apache/spark/commit/06c1909fbf35170639d12f32480b369c702dbe24). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17218: [SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrow...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17218 **[Test build #74898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74898/testReport)** for PR 17218 at commit [`5f4673e`](https://github.com/apache/spark/commit/5f4673e74049d9f6918f4e029215ae6c8364043e). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17218: [SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrow...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17218 **[Test build #74898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74898/testReport)** for PR 17218 at commit [`5f4673e`](https://github.com/apache/spark/commit/5f4673e74049d9f6918f4e029215ae6c8364043e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17218: [SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrow...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17218 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17218: [SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrow...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17218 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74898/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17302: [SPARK-19959][SQL] Fix to throw NullPointerExcept...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/17302#discussion_r106992055 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala --- @@ -70,7 +70,20 @@ object RDDConversions { object ExternalRDD { def apply[T: Encoder](rdd: RDD[T], session: SparkSession): LogicalPlan = { -val externalRdd = ExternalRDD(CatalystSerde.generateObjAttr[T], rdd)(session) +val attr = { + val attr = CatalystSerde.generateObjAttr[T] --- End diff -- I see. `ScalaReflection.schemaFor[T]` requires `TypeTag`. IIUC, to pass `TypeTag` to `CatalystSerde.generateObjAttr` seems to require several code changes across files. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17363: [SPARK-19970][SQL][BRANCH-2.1] Table owner should be USE...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17363 **[Test build #74891 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74891/testReport)** for PR 17363 at commit [`1328f1d`](https://github.com/apache/spark/commit/1328f1d204991f4b4a993a523bd01584bab3122c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17363: [SPARK-19970][SQL][BRANCH-2.1] Table owner should be USE...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17363 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74891/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17363: [SPARK-19970][SQL][BRANCH-2.1] Table owner should be USE...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17363 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12004 **[Test build #74899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74899/testReport)** for PR 12004 at commit [`83d9368`](https://github.com/apache/spark/commit/83d936870ad0651fc2622593e53d3e31d7eb8d4b). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17328: [SPARK-19975][Python][SQL] Add map_keys and map_values f...
Github user maver1ck commented on the issue: https://github.com/apache/spark/pull/17328 Looks good :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17170 **[Test build #74896 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74896/testReport)** for PR 17170 at commit [`706514d`](https://github.com/apache/spark/commit/706514da26107ef25bef028e2143fa0a09e5cc19). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17170 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17170 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74896/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17250: [SPARK-19911][STREAMING] Add builder interface for Kines...
Github user budde commented on the issue: https://github.com/apache/spark/pull/17250 @brkyvz PR has been updated, apologies for the delay. I've added ```SerializableCredentialsProvider.Builder```, which I'm willing to hear suggestions for a better name on. I wanted to stay away from something like ```AWSCredentials.Builder``` so as to avoid confusion with similarly-named classes in the AWS Java SDK. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17250: [SPARK-19911][STREAMING] Add builder interface for Kines...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17250 **[Test build #74901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74901/testReport)** for PR 17250 at commit [`d6afaef`](https://github.com/apache/spark/commit/d6afaef4099d9b20d0d1257ec5942d1bb5b868af). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17354: [SPARK-20024] [SQL] [test-maven] SessionCatalog API setC...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17354 **[Test build #74900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74900/testReport)** for PR 17354 at commit [`e860fd7`](https://github.com/apache/spark/commit/e860fd77a62dbec4ce11c89864a925223aba6727). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17363: [SPARK-19970][SQL][BRANCH-2.1] Table owner should be USE...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/17363 This is also tested manually in kerberized cluster, @vanzin . BTW, Spark 1.6 has the same issue at [ClientWrapper](https://github.com/apache/spark/blob/branch-1.6/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala#L379). According to the email, there exists demands on more Apache Spark 1.6.X. May I create a backport for branch-1.6? How do you think about that? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17287: [SPARK-19945][SQL]add test suite for SessionCatal...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17287#discussion_r106997366 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -27,41 +27,67 @@ import org.apache.spark.sql.catalyst.parser.CatalystSqlParser import org.apache.spark.sql.catalyst.plans.PlanTest import org.apache.spark.sql.catalyst.plans.logical.{Range, SubqueryAlias, View} +class InMemorySessionCatalogSuite extends SessionCatalogSuite { + protected val utils = new CatalogTestUtils { +override val tableInputFormat: String = "com.fruit.eyephone.CameraInputFormat" +override val tableOutputFormat: String = "com.fruit.eyephone.CameraOutputFormat" +override val defaultProvider: String = "parquet" +override def newEmptyCatalog(): ExternalCatalog = new InMemoryCatalog + } +} + /** - * Tests for [[SessionCatalog]] that assume that [[InMemoryCatalog]] is correctly implemented. + * Tests for [[SessionCatalog]] * * Note: many of the methods here are very similar to the ones in [[ExternalCatalogSuite]]. * This is because [[SessionCatalog]] and [[ExternalCatalog]] share many similar method * signatures but do not extend a common parent. This is largely by design but * unfortunately leads to very similar test code in two places. */ -class SessionCatalogSuite extends PlanTest { - private val utils = new CatalogTestUtils { -override val tableInputFormat: String = "com.fruit.eyephone.CameraInputFormat" -override val tableOutputFormat: String = "com.fruit.eyephone.CameraOutputFormat" -override val defaultProvider: String = "parquet" -override def newEmptyCatalog(): ExternalCatalog = new InMemoryCatalog - } +abstract class SessionCatalogSuite extends PlanTest { + protected val utils: CatalogTestUtils + + protected val isHiveExternalCatalog = false import utils._ + private def withBasicCatalog(f: SessionCatalog => Unit): Unit = { +val catalog = new SessionCatalog(newBasicCatalog()) +catalog.createDatabase(newDb("default"), ignoreIfExists = true) +try { + f(catalog) +} finally { + catalog.reset() +} + } + + private def withEmptyCatalog(f: SessionCatalog => Unit): Unit = { +val catalog = new SessionCatalog(newEmptyCatalog()) +catalog.createDatabase(newDb("default"), ignoreIfExists = true) --- End diff -- Our `reset()` function assumes the existence of Default --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17354: [SPARK-20024] [SQL] [test-maven] SessionCatalog API setC...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17354 **[Test build #74902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74902/testReport)** for PR 17354 at commit [`f23984d`](https://github.com/apache/spark/commit/f23984d2ab68bf15cd593106d052206f38dbc362). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/17170 this is the error from appveyor build ``` [00:16:14] [ERROR] C:\projects\spark\mllib\src\main\scala\org\apache\spark\ml\r\FPGrowthWrapper.scala:50: value setItemsCol is not a member of org.apache.spark.ml.fpm.FPGrowth [00:16:14] possible cause: maybe a semicolon is missing before `value setItemsCol'? [00:16:14] [ERROR] .setItemsCol(itemsCol) [00:16:14] [ERROR]^ [00:16:29] [ERROR] one error found ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17342: [SPARK-12868][SQL] Allow adding jars from hdfs
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/17342#discussion_r107001274 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2767,3 +2767,24 @@ private[spark] class CircularBuffer(sizeInBytes: Int = 10240) extends java.io.Ou new String(nonCircularBuffer, StandardCharsets.UTF_8) } } + + +/** + * Factory for URL stream handlers. It relies on 'protocol' to choose the appropriate + * UrlStreamHandlerFactory to create URLStreamHandler. Adding new 'if' branches in + * 'createURLStreamHandler' like 'hdfsHandler' to support more protocols. + */ +private[spark] class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory { + private var hdfsHandler : URLStreamHandler = _ + + def createURLStreamHandler(protocol: String): URLStreamHandler = { +if (protocol.compareToIgnoreCase("hdfs") == 0) { --- End diff -- API? no, just fs.*.impl for the standard ones, discovery via META-INF/services and you don't want to go there. Probably better to have a core list of the hadoop redists (including the new 2.8+ adl & oss object stores), and the google cloud URL (gss ? ) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17364: [SPARK-20038] [core]: move the currentWriter=null...
GitHub user steveloughran opened a pull request: https://github.com/apache/spark/pull/17364 [SPARK-20038] [core]: move the currentWriter=null assignments into finally {} … ## What changes were proposed in this pull request? have the`FileFormatWriter.ExecuteWriteTask.releaseResources()` implementations set `currentWriter=null` in a finally clause. This guarantees that if the first call to `currentWriter()` throws an exception, the second releaseResources() call made during the task cancel process will not trigger a second attempt to close the stream. ## How was this patch tested? Tricky. I've been fixing the underlying cause when I saw the problem (HADOOP-14204), but SPARK-10109 shows I'm not the first to have seen this. I can't replicate it locally any more, my code no longer being broken. code review, however, should be straightforward You can merge this pull request into a Git repository by running: $ git pull https://github.com/steveloughran/spark stevel/SPARK-20038-close Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17364.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17364 commit 725740b49a2b37392699092b1b0e08c63a6152ff Author: Steve Loughran Date: 2017-03-20T19:54:51Z SPARK-20038: move the currentWriter=null assignments into finally {} clauses Change-Id: I1e07f5b90ba1a2b05978b1d65876d746d07d1f3c --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17250: [SPARK-19911][STREAMING] Add builder interface for Kines...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17250 **[Test build #74901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74901/testReport)** for PR 17250 at commit [`d6afaef`](https://github.com/apache/spark/commit/d6afaef4099d9b20d0d1257ec5942d1bb5b868af). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` class Builder ` * ` class Builder ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17250: [SPARK-19911][STREAMING] Add builder interface for Kines...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17250 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17250: [SPARK-19911][STREAMING] Add builder interface for Kines...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17250 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74901/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17364: [SPARK-20038] [core]: FileFormatWriter.ExecuteWriteTask....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17364 **[Test build #74903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74903/testReport)** for PR 17364 at commit [`725740b`](https://github.com/apache/spark/commit/725740b49a2b37392699092b1b0e08c63a6152ff). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17250: [SPARK-19911][STREAMING] Add builder interface fo...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/17250#discussion_r107003143 --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisInputDStream.scala --- @@ -71,7 +75,256 @@ private[kinesis] class KinesisInputDStream[T: ClassTag]( override def getReceiver(): Receiver[T] = { new KinesisReceiver(streamName, endpointUrl, regionName, initialPositionInStream, - checkpointAppName, checkpointInterval, storageLevel, messageHandler, - kinesisCredsProvider) + checkpointAppName, checkpointInterval, _storageLevel, messageHandler, + kinesisCredsProvider, dynamoDBCredsProvider, cloudWatchCredsProvider) } } + +@InterfaceStability.Stable +object KinesisInputDStream { + /** + * Builder for [[KinesisInputDStream]] instances. + * + * @since 2.2.0 + */ + @InterfaceStability.Stable + class Builder { +// Required params +private var streamingContext: Option[StreamingContext] = None +private var streamName: Option[String] = None +private var checkpointAppName: Option[String] = None + +// Params with defaults +private var endpointUrl: Option[String] = None +private var regionName: Option[String] = None +private var initialPositionInStream: Option[InitialPositionInStream] = None +private var checkpointInterval: Option[Duration] = None +private var storageLevel: Option[StorageLevel] = None +private var kinesisCredsProvider: Option[SerializableCredentialsProvider] = None +private var dynamoDBCredsProvider: Option[SerializableCredentialsProvider] = None +private var cloudWatchCredsProvider: Option[SerializableCredentialsProvider] = None + +/** + * Sets the StreamingContext that will be used to construct the Kinesis DStream. This is a + * required parameter. + * + * @param ssc [[StreamingContext]] used to construct Kinesis DStreams + * @return Reference to this [[KinesisInputDStream.Builder]] + */ +def streamingContext(ssc: StreamingContext): Builder = { + streamingContext = Option(ssc) + this +} + +/** + * Sets the StreamingContext that will be used to construct the Kinesis DStream. This is a + * required parameter. + * + * @param jssc [[JavaStreamingContext]] used to construct Kinesis DStreams + * @return Reference to this [[KinesisInputDStream.Builder]] + */ +def streamingContext(jssc: JavaStreamingContext): Builder = { + streamingContext = Option(jssc.ssc) + this +} + +/** + * Sets the name of the Kinesis stream that the DStream will read from. This is a required + * parameter. + * + * @param streamName Name of Kinesis stream that the DStream will read from + * @return Reference to this [[KinesisInputDStream.Builder]] + */ +def streamName(streamName: String): Builder = { + this.streamName = Option(streamName) + this +} + +/** + * Sets the KCL application name to use when checkpointing state to DynamoDB. This is a + * required parameter. + * + * @param appName Value to use for the KCL app name (used when creating the DynamoDB checkpoint + *table and when writing metrics to CloudWatch) + * @return Reference to this [[KinesisInputDStream.Builder]] + */ +def checkpointAppName(appName: String): Builder = { + checkpointAppName = Option(appName) + this +} + +/** + * Sets the AWS Kinesis endpoint URL. Defaults to "https://kinesis.us-east-1.amazonaws.com"; if + * no custom value is specified + * + * @param url Kinesis endpoint URL to use + * @return Reference to this [[KinesisInputDStream.Builder]] + */ +def endpointUrl(url: String): Builder = { + endpointUrl = Option(url) + this +} + +/** + * Sets the AWS region to construct clients for. Defaults to "us-east-1" if no custom value + * is specified. + * + * @param regionName Name of AWS region to use (e.g. "us-west-2") + * @return Reference to this [[KinesisInputDStream.Builder]] + */ +def regionName(regionName: String): Builder = { + this.regionName = Option(regionName) + this +} + +/** + * Sets the initial position data is read from in the Kinesis stream. Defaults to + * [[InitialPositionInStream.LATEST]] if no custom value is specified. + * + * @param initialPosition InitialPositionInStream value specifying where Spark Streaming
[GitHub] spark issue #17354: [SPARK-20024] [SQL] [test-maven] SessionCatalog API setC...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17354 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17354: [SPARK-20024] [SQL] [test-maven] SessionCatalog API setC...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17354 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74902/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17354: [SPARK-20024] [SQL] [test-maven] SessionCatalog API setC...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17354 **[Test build #74902 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74902/testReport)** for PR 17354 at commit [`f23984d`](https://github.com/apache/spark/commit/f23984d2ab68bf15cd593106d052206f38dbc362). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17354: [SPARK-20024] [SQL] [test-maven] SessionCatalog API setC...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17354 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17354: [SPARK-20024] [SQL] [test-maven] SessionCatalog API setC...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17354 **[Test build #74904 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74904/testReport)** for PR 17354 at commit [`f23984d`](https://github.com/apache/spark/commit/f23984d2ab68bf15cd593106d052206f38dbc362). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/17295#discussion_r107007132 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -56,6 +57,43 @@ private[spark] class BlockResult( val bytes: Long) /** + * Abstracts away how blocks are stored and provides different ways to read the underlying block + * data. The data for a BlockData instance can only be read once, since it may be backed by open + * file descriptors that change state as data is read. + */ +private[spark] trait BlockData { + + def toInputStream(): InputStream + + def toManagedBuffer(): ManagedBuffer + + def toByteBuffer(allocator: Int => ByteBuffer): ChunkedByteBuffer + + def size: Long + + def dispose(): Unit + +} + +private[spark] class ByteBufferBlockData( +val buffer: ChunkedByteBuffer, +autoDispose: Boolean = true) extends BlockData { + + override def toInputStream(): InputStream = buffer.toInputStream(dispose = autoDispose) + + override def toManagedBuffer(): ManagedBuffer = new NettyManagedBuffer(buffer.toNetty) + + override def toByteBuffer(allocator: Int => ByteBuffer): ChunkedByteBuffer = { +buffer.copy(allocator) + } --- End diff -- So I had traced through that stuff 2 or 3 times, and now I did it again and I think I finally understood all that's going on. Basically, the old code was really bad at explicitly disposing of the buffers, meaning a bunch of paths (like the ones that used managed buffers) didn't bother to do it and just left the work to the GC. I changed the code a bit to make the dispose more explicit and added comments in a few key places. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17295: [SPARK-19556][core] Do not encrypt block manager data in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17295 **[Test build #74905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74905/testReport)** for PR 17295 at commit [`1428fcd`](https://github.com/apache/spark/commit/1428fcd952ddcdb29a561d8c1c90d4f820955c15). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17361: [SPARK-20030][SS][WIP]Event-time-based timeout for MapGr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17361 **[Test build #74894 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74894/testReport)** for PR 17361 at commit [`6e9f408`](https://github.com/apache/spark/commit/6e9f408d73090429d7497840d6daa7a4e19439e6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17361: [SPARK-20030][SS][WIP]Event-time-based timeout for MapGr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17361 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74894/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17361: [SPARK-20030][SS][WIP]Event-time-based timeout for MapGr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17361 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17354: [SPARK-20024] [SQL] [test-maven] SessionCatalog API setC...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17354 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17354: [SPARK-20024] [SQL] [test-maven] SessionCatalog API setC...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17354 **[Test build #74900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74900/testReport)** for PR 17354 at commit [`e860fd7`](https://github.com/apache/spark/commit/e860fd77a62dbec4ce11c89864a925223aba6727). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17354: [SPARK-20024] [SQL] [test-maven] SessionCatalog API setC...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17354 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74900/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17315: [SPARK-19949][SQL] unify bad record handling in C...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17315#discussion_r107011371 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow +import org.apache.spark.sql.types.StructType +import org.apache.spark.unsafe.types.UTF8String + +class FailureSafeParser[IN]( +rawParser: IN => Seq[InternalRow], +mode: String, +schema: StructType, +columnNameOfCorruptRecord: String) { + + private val corruptFieldIndex = schema.getFieldIndex(columnNameOfCorruptRecord) + private val actualSchema = StructType(schema.filterNot(_.name == columnNameOfCorruptRecord)) + private val resultRow = new GenericInternalRow(schema.length) + private val nullResult = new GenericInternalRow(schema.length) + + // This function takes 2 parameters: an optional partial result, and the bad record. If the given + // schema doesn't contain a field for corrupted record, we just return the partial result or a + // row with all fields null. If the given schema contains a field for corrupted record, we will + // set the bad record to this field, and set other fields according to the partial result or null. + private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = { +if (corruptFieldIndex.isDefined) { + (row, badRecord) => { +var i = 0 +while (i < actualSchema.length) { + val f = actualSchema(i) + resultRow(schema.fieldIndex(f.name)) = row.map(_.get(i, f.dataType)).orNull + i += 1 +} +resultRow(corruptFieldIndex.get) = badRecord() +resultRow + } +} else { + (row, _) => row.getOrElse(nullResult) +} + } + + def parse(input: IN): Iterator[InternalRow] = { +try { + rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null)) +} catch { + case e: BadRecordException if ParseModes.isPermissiveMode(mode) => +Iterator(toResultRow(e.partialResult(), e.record)) + case _: BadRecordException if ParseModes.isDropMalformedMode(mode) => +Iterator.empty + case e: BadRecordException => throw e.cause --- End diff -- This is `FAIL_FAST_MODE`, if my understanding is not wrong. Should we issue the error message including `FAILFAST`, like what we did before? This is also an behavior change? If users did not correctly spell the mode string, we treated it as the `PERMISSIVE` mode. Now, we changed it to the `FAILFAST ` mode. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17315: [SPARK-19949][SQL] unify bad record handling in C...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17315#discussion_r107012038 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -382,11 +383,17 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { } verifyColumnNameOfCorruptRecord(schema, parsedOptions.columnNameOfCorruptRecord) +val dataSchema = StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord)) --- End diff -- Nit: `dataSchema ` -> `actualSchema`? Be consistent what we did in the other place? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r107011970 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/FPGrowthWrapper.scala --- @@ -0,0 +1,86 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.hadoop.fs.Path +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods._ + +import org.apache.spark.ml.fpm.{FPGrowth, FPGrowthModel} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} + +private[r] class FPGrowthWrapper private (val fpGrowthModel: FPGrowthModel) extends MLWritable { + def freqItemsets: DataFrame = fpGrowthModel.freqItemsets + def associationRules: DataFrame = fpGrowthModel.associationRules + + def transform(dataset: Dataset[_]): DataFrame = { +fpGrowthModel.transform(dataset) + } + + override def write: MLWriter = new FPGrowthWrapper.FPGrowthWrapperWriter(this) +} + +private[r] object FPGrowthWrapper extends MLReadable[FPGrowthWrapper] { + + def fit( + data: DataFrame, + minSupport: Double, + minConfidence: Double, + itemsCol: String, + numPartitions: Integer): FPGrowthWrapper = { +val fpGrowth = new FPGrowth() + .setMinSupport(minSupport) + .setMinConfidence(minConfidence) + .setItemsCol(itemsCol) + +if (numPartitions != null && numPartitions > 0) { + fpGrowth.setNumPartitions(numPartitions) +} + +val fpGrowthModel = fpGrowth.fit(data) + +new FPGrowthWrapper(fpGrowthModel) + } + + override def read: MLReader[FPGrowthWrapper] = new FPGrowthWrapperReader + + class FPGrowthWrapperReader extends MLReader[FPGrowthWrapper] { +override def load(path: String): FPGrowthWrapper = { + val modelPath = new Path(path, "model").toString + val fPGrowthModel = FPGrowthModel.load(modelPath) + + new FPGrowthWrapper(fPGrowthModel) +} + } + + class FPGrowthWrapperWriter(instance: FPGrowthWrapper) extends MLWriter { +override protected def saveImpl(path: String): Unit = { + val modelPath = new Path(path, "model").toString + val rMetadataPath = new Path(path, "rMetadata").toString --- End diff -- anything else we could add as metadata that is not in the model already? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r107010057 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,153 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth +#' +#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in +#' Li et al., PFP: Parallel FP-Growth for Query +#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>. +#' PFP distributes computation in such a way that each worker executes an +#' independent group of mining tasks. The FP-Growth algorithm is described in +#' Han et al., Mining frequent patterns without +#' candidate generation <\url{http://dx.doi.org/10.1145/335191.335372}>. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param itemsCol Items column name. +#' @param numPartitions Number of partitions used for fitting. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' raw_data <- read.df( +#' "data/mllib/sample_fpgrowth.txt", +#' source = "csv", +#' schema = structType(structField("raw_items", "string"))) +#' +#' data <- selectExpr(raw_data, "split(raw_items, ' ') as items") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- spark.freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- spark.associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(items = c("t", "t,s")) +#' new_data <- selectExpr(createDataFrame(new_itemsets), "split(items, ',') as items") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(items, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' itemsCol = "baskets", numPartitions = 10) +#' } +#' @references \url{http://en.wikipedia.org/wiki/Association_rule_learning} --- End diff -- we don't generally use this tag. Do you want to move to @seealso, or just link to in the description above --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r107011745 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/FPGrowthWrapper.scala --- @@ -0,0 +1,86 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.hadoop.fs.Path +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods._ + +import org.apache.spark.ml.fpm.{FPGrowth, FPGrowthModel} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} + +private[r] class FPGrowthWrapper private (val fpGrowthModel: FPGrowthModel) extends MLWritable { + def freqItemsets: DataFrame = fpGrowthModel.freqItemsets + def associationRules: DataFrame = fpGrowthModel.associationRules + + def transform(dataset: Dataset[_]): DataFrame = { +fpGrowthModel.transform(dataset) + } + + override def write: MLWriter = new FPGrowthWrapper.FPGrowthWrapperWriter(this) +} + +private[r] object FPGrowthWrapper extends MLReadable[FPGrowthWrapper] { + + def fit( + data: DataFrame, + minSupport: Double, + minConfidence: Double, + itemsCol: String, + numPartitions: Integer): FPGrowthWrapper = { +val fpGrowth = new FPGrowth() + .setMinSupport(minSupport) + .setMinConfidence(minConfidence) + .setItemsCol(itemsCol) + +if (numPartitions != null && numPartitions > 0) { --- End diff -- given the earlier suggestion, we should also check numPartition > 0 in R before passing to here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r107009471 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,153 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth +#' +#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in +#' Li et al., PFP: Parallel FP-Growth for Query +#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>. --- End diff -- can you check if this generate the doc properly `<\url{http://dx.doi.org/10.1145/1454008.1454027}>` generally it should be `\href{http://...}{Text}` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r107009205 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,153 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth --- End diff -- I think we discussed this - let's make it `FP-Growth` or `Frequent Pattern Mining` (https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html) as the title --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17343 **[Test build #74893 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74893/testReport)** for PR 17343 at commit [`5fe279e`](https://github.com/apache/spark/commit/5fe279ea3b5a70206be71ff9a5ebf2175cf0f8d1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17295: [SPARK-19556][core] Do not encrypt block manager data in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17295 **[Test build #74906 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74906/testReport)** for PR 17295 at commit [`1b2a3e4`](https://github.com/apache/spark/commit/1b2a3e48e07118b03e9607bf0dc99cf53ae4d678). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r107009625 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,153 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth +#' +#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in +#' Li et al., PFP: Parallel FP-Growth for Query +#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>. +#' PFP distributes computation in such a way that each worker executes an +#' independent group of mining tasks. The FP-Growth algorithm is described in +#' Han et al., Mining frequent patterns without +#' candidate generation <\url{http://dx.doi.org/10.1145/335191.335372}>. --- End diff -- ditto here for url. In fact, I'm not sure we need to include all the links here but instead link to https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r107010797 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,153 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth +#' +#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in +#' Li et al., PFP: Parallel FP-Growth for Query +#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>. +#' PFP distributes computation in such a way that each worker executes an +#' independent group of mining tasks. The FP-Growth algorithm is described in +#' Han et al., Mining frequent patterns without +#' candidate generation <\url{http://dx.doi.org/10.1145/335191.335372}>. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param itemsCol Items column name. +#' @param numPartitions Number of partitions used for fitting. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' raw_data <- read.df( +#' "data/mllib/sample_fpgrowth.txt", +#' source = "csv", +#' schema = structType(structField("raw_items", "string"))) +#' +#' data <- selectExpr(raw_data, "split(raw_items, ' ') as items") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- spark.freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- spark.associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(items = c("t", "t,s")) +#' new_data <- selectExpr(createDataFrame(new_itemsets), "split(items, ',') as items") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(items, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' itemsCol = "baskets", numPartitions = 10) +#' } +#' @references \url{http://en.wikipedia.org/wiki/Association_rule_learning} +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + itemsCol = "items", numPartitions = -1) { --- End diff -- `numPartitions` by default is not set in Scala - let's default this to NULL instead here (but do not as.integer if value is NULL - something like numPartitions <- if (is.null(numPartitions)) NULL else as.integer(numPartitions) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17343 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74893/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17343 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org