[GitHub] spark issue #11956: [SPARK-14098][SQL] Generate Java code that gets a float/...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/11956 **[Test build #61915 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61915/consoleFull)** for PR 11956 at commit [`d035c42`](https://github.com/apache/spark/commit/d035c42db42b6ecbc252b6972419451aabd6e06d).
[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14071 **[Test build #61914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61914/consoleFull)** for PR 14071 at commit [`4d65609`](https://github.com/apache/spark/commit/4d65609ae71b2e30cea7b39e1b5a1a9ecfdd2de4).
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14090 cc @felixcheung @mengxr
[GitHub] spark issue #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of Python-o...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13778 From another point of view, is it necessary to propagate the Python UDF from the Python side to the JVM side? IIUC the serialization of Python UDTs happens on the Python side, and the JVM side can only see binary data for Python values, so there is nothing we can do on the Java side. Correct me if I am wrong, thanks.
[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 @rdblue uh, I see. Thank you for your explanation! My suggestion above is to confirm what you said in @viirya's test cases. We expect to see the same results as what you mentioned. It sounds like dictionary filtering is available in Parquet 1.9. Really looking forward to it!
[GitHub] spark pull request #14028: [SPARK-16351][SQL] Avoid per-record type dispatch...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14028#discussion_r69936170 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala --- @@ -17,74 +17,180 @@ package org.apache.spark.sql.execution.datasources.json +import java.io.Writer + import com.fasterxml.jackson.core._ import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.SpecializedGetters import org.apache.spark.sql.catalyst.util.{ArrayData, DateTimeUtils, MapData} import org.apache.spark.sql.types._ -private[sql] object JacksonGenerator { - /** Transforms a single InternalRow to JSON using Jackson - * - * TODO: make the code shared with the other apply method. - * - * @param rowSchema the schema object used for conversion - * @param gen a JsonGenerator object - * @param row The row to convert - */ - def apply(rowSchema: StructType, gen: JsonGenerator)(row: InternalRow): Unit = { -def valWriter: (DataType, Any) => Unit = { - case (_, null) | (NullType, _) => gen.writeNull() - case (StringType, v) => gen.writeString(v.toString) - case (TimestampType, v: Long) => gen.writeString(DateTimeUtils.toJavaTimestamp(v).toString) - case (IntegerType, v: Int) => gen.writeNumber(v) - case (ShortType, v: Short) => gen.writeNumber(v) - case (FloatType, v: Float) => gen.writeNumber(v) - case (DoubleType, v: Double) => gen.writeNumber(v) - case (LongType, v: Long) => gen.writeNumber(v) - case (DecimalType(), v: Decimal) => gen.writeNumber(v.toJavaBigDecimal) - case (ByteType, v: Byte) => gen.writeNumber(v.toInt) - case (BinaryType, v: Array[Byte]) => gen.writeBinary(v) - case (BooleanType, v: Boolean) => gen.writeBoolean(v) - case (DateType, v: Int) => gen.writeString(DateTimeUtils.toJavaDate(v).toString) - // For UDT values, they should be in the SQL type's corresponding value type. - // We should not see values in the user-defined class at here. - // For example, VectorUDT's SQL type is an array of double. So, we should expect that v is - // an ArrayData at here, instead of a Vector. - case (udt: UserDefinedType[_], v) => valWriter(udt.sqlType, v) - - case (ArrayType(ty, _), v: ArrayData) => -gen.writeStartArray() -v.foreach(ty, (_, value) => valWriter(ty, value)) -gen.writeEndArray() - - case (MapType(kt, vt, _), v: MapData) => -gen.writeStartObject() -v.foreach(kt, vt, { (k, v) => - gen.writeFieldName(k.toString) - valWriter(vt, v) -}) -gen.writeEndObject() - - case (StructType(ty), v: InternalRow) => -gen.writeStartObject() -var i = 0 -while (i < ty.length) { - val field = ty(i) - val value = v.get(i, field.dataType) - if (value != null) { -gen.writeFieldName(field.name) -valWriter(field.dataType, value) - } - i += 1 +private[sql] class JacksonGenerator(schema: StructType, writer: Writer) { + // A `ValueWriter` is responsible for writing a field of an `InternalRow` to appropriate + // JSON data. Here we are using `SpecializedGetters` rather than `InternalRow` so that + // we can directly access data in `ArrayData` without the help of `SpecificMutableRow`. 
+ private type ValueWriter = (SpecializedGetters, Int) => Unit + + // `ValueWriter`s for all fields of the schema + private val rootFieldWriters: Seq[ValueWriter] = schema.map(_.dataType).map(makeWriter) + + private val gen = new JsonFactory().createGenerator(writer).setRootValueSeparator(null) + + private def makeWriter(dataType: DataType): ValueWriter = dataType match { +case NullType => + (row: SpecializedGetters, ordinal: Int) => +gen.writeNull() + +case BooleanType => + (row: SpecializedGetters, ordinal: Int) => +gen.writeBoolean(row.getBoolean(ordinal)) + +case ByteType => + (row: SpecializedGetters, ordinal: Int) => +gen.writeNumber(row.getByte(ordinal)) + +case ShortType => + (row: SpecializedGetters, ordinal: Int) => +gen.writeNumber(row.getShort(ordinal)) + +case IntegerType => + (row: SpecializedGetters, ordinal: Int) => +gen.writeNumber(row.getInt(ordinal)) + +case LongType => + (row: SpecializedGetters, ordinal: Int) => +gen.writeNumber(row.getLong(ordinal)) + +case FloatType => + (row:
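The refactoring shown (partially) in the diff above replaces per-record type dispatch with writer closures that are resolved once per schema field. A minimal, self-contained sketch of that pattern follows; the types and names are illustrative stand-ins, not Spark's internal API.

```scala
// Illustrative sketch of the per-field writer pattern: resolve one writer
// closure per schema field at construction time, then reuse it for every row.
// All types here are stand-ins, not Spark's internal classes.
sealed trait FieldType
case object IntField extends FieldType
case object StringField extends FieldType

final case class SimpleRow(values: Array[Any]) {
  def getInt(i: Int): Int = values(i).asInstanceOf[Int]
  def getString(i: Int): String = values(i).asInstanceOf[String]
}

class SketchGenerator(schema: Seq[FieldType], out: StringBuilder) {
  private type ValueWriter = (SimpleRow, Int) => Unit

  // Built once; the per-row loop below never pattern-matches on types.
  private val fieldWriters: Array[ValueWriter] = schema.map(makeWriter).toArray

  private def makeWriter(ft: FieldType): ValueWriter = ft match {
    case IntField    => (row, ordinal) => out.append(row.getInt(ordinal))
    case StringField => (row, ordinal) => out.append('"').append(row.getString(ordinal)).append('"')
  }

  def write(row: SimpleRow): Unit = {
    out.append('{')
    var i = 0
    while (i < fieldWriters.length) {
      if (i > 0) out.append(',')
      out.append("\"f").append(i).append("\":")
      fieldWriters(i)(row, i)   // no type dispatch here
      i += 1
    }
    out.append('}')
  }
}

// Usage: new SketchGenerator(Seq(IntField, StringField), sb).write(SimpleRow(Array(1, "a")))
// appends {"f0":1,"f1":"a"} to sb.
```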
[GitHub] spark pull request #14028: [SPARK-16351][SQL] Avoid per-record type dispatch...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14028#discussion_r69936226 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala --- @@ -17,74 +17,180 @@ package org.apache.spark.sql.execution.datasources.json +import java.io.Writer + import com.fasterxml.jackson.core._ import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.SpecializedGetters import org.apache.spark.sql.catalyst.util.{ArrayData, DateTimeUtils, MapData} import org.apache.spark.sql.types._ -private[sql] object JacksonGenerator { - /** Transforms a single InternalRow to JSON using Jackson - * - * TODO: make the code shared with the other apply method. - * - * @param rowSchema the schema object used for conversion - * @param gen a JsonGenerator object - * @param row The row to convert - */ - def apply(rowSchema: StructType, gen: JsonGenerator)(row: InternalRow): Unit = { -def valWriter: (DataType, Any) => Unit = { - case (_, null) | (NullType, _) => gen.writeNull() - case (StringType, v) => gen.writeString(v.toString) - case (TimestampType, v: Long) => gen.writeString(DateTimeUtils.toJavaTimestamp(v).toString) - case (IntegerType, v: Int) => gen.writeNumber(v) - case (ShortType, v: Short) => gen.writeNumber(v) - case (FloatType, v: Float) => gen.writeNumber(v) - case (DoubleType, v: Double) => gen.writeNumber(v) - case (LongType, v: Long) => gen.writeNumber(v) - case (DecimalType(), v: Decimal) => gen.writeNumber(v.toJavaBigDecimal) - case (ByteType, v: Byte) => gen.writeNumber(v.toInt) - case (BinaryType, v: Array[Byte]) => gen.writeBinary(v) - case (BooleanType, v: Boolean) => gen.writeBoolean(v) - case (DateType, v: Int) => gen.writeString(DateTimeUtils.toJavaDate(v).toString) - // For UDT values, they should be in the SQL type's corresponding value type. - // We should not see values in the user-defined class at here. - // For example, VectorUDT's SQL type is an array of double. So, we should expect that v is - // an ArrayData at here, instead of a Vector. - case (udt: UserDefinedType[_], v) => valWriter(udt.sqlType, v) - - case (ArrayType(ty, _), v: ArrayData) => -gen.writeStartArray() -v.foreach(ty, (_, value) => valWriter(ty, value)) -gen.writeEndArray() - - case (MapType(kt, vt, _), v: MapData) => -gen.writeStartObject() -v.foreach(kt, vt, { (k, v) => - gen.writeFieldName(k.toString) - valWriter(vt, v) -}) -gen.writeEndObject() - - case (StructType(ty), v: InternalRow) => -gen.writeStartObject() -var i = 0 -while (i < ty.length) { - val field = ty(i) - val value = v.get(i, field.dataType) - if (value != null) { -gen.writeFieldName(field.name) -valWriter(field.dataType, value) - } - i += 1 +private[sql] class JacksonGenerator(schema: StructType, writer: Writer) { + // A `ValueWriter` is responsible for writing a field of an `InternalRow` to appropriate + // JSON data. Here we are using `SpecializedGetters` rather than `InternalRow` so that + // we can directly access data in `ArrayData` without the help of `SpecificMutableRow`. + private type ValueWriter = (SpecializedGetters, Int) => Unit + + // `ValueWriter`s for all fields of the schema + private val rootFieldWriters: Seq[ValueWriter] = schema.map(_.dataType).map(makeWriter) --- End diff -- Let's use an array. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request #14028: [SPARK-16351][SQL] Avoid per-record type dispatch...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14028#discussion_r69936163 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala --- @@ -17,74 +17,180 @@ package org.apache.spark.sql.execution.datasources.json +import java.io.Writer + import com.fasterxml.jackson.core._ import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.SpecializedGetters import org.apache.spark.sql.catalyst.util.{ArrayData, DateTimeUtils, MapData} import org.apache.spark.sql.types._ -private[sql] object JacksonGenerator { - /** Transforms a single InternalRow to JSON using Jackson - * - * TODO: make the code shared with the other apply method. - * - * @param rowSchema the schema object used for conversion - * @param gen a JsonGenerator object - * @param row The row to convert - */ - def apply(rowSchema: StructType, gen: JsonGenerator)(row: InternalRow): Unit = { -def valWriter: (DataType, Any) => Unit = { - case (_, null) | (NullType, _) => gen.writeNull() - case (StringType, v) => gen.writeString(v.toString) - case (TimestampType, v: Long) => gen.writeString(DateTimeUtils.toJavaTimestamp(v).toString) - case (IntegerType, v: Int) => gen.writeNumber(v) - case (ShortType, v: Short) => gen.writeNumber(v) - case (FloatType, v: Float) => gen.writeNumber(v) - case (DoubleType, v: Double) => gen.writeNumber(v) - case (LongType, v: Long) => gen.writeNumber(v) - case (DecimalType(), v: Decimal) => gen.writeNumber(v.toJavaBigDecimal) - case (ByteType, v: Byte) => gen.writeNumber(v.toInt) - case (BinaryType, v: Array[Byte]) => gen.writeBinary(v) - case (BooleanType, v: Boolean) => gen.writeBoolean(v) - case (DateType, v: Int) => gen.writeString(DateTimeUtils.toJavaDate(v).toString) - // For UDT values, they should be in the SQL type's corresponding value type. - // We should not see values in the user-defined class at here. - // For example, VectorUDT's SQL type is an array of double. So, we should expect that v is - // an ArrayData at here, instead of a Vector. - case (udt: UserDefinedType[_], v) => valWriter(udt.sqlType, v) - - case (ArrayType(ty, _), v: ArrayData) => -gen.writeStartArray() -v.foreach(ty, (_, value) => valWriter(ty, value)) -gen.writeEndArray() - - case (MapType(kt, vt, _), v: MapData) => -gen.writeStartObject() -v.foreach(kt, vt, { (k, v) => - gen.writeFieldName(k.toString) - valWriter(vt, v) -}) -gen.writeEndObject() - - case (StructType(ty), v: InternalRow) => -gen.writeStartObject() -var i = 0 -while (i < ty.length) { - val field = ty(i) - val value = v.get(i, field.dataType) - if (value != null) { -gen.writeFieldName(field.name) -valWriter(field.dataType, value) - } - i += 1 +private[sql] class JacksonGenerator(schema: StructType, writer: Writer) { + // A `ValueWriter` is responsible for writing a field of an `InternalRow` to appropriate + // JSON data. Here we are using `SpecializedGetters` rather than `InternalRow` so that + // we can directly access data in `ArrayData` without the help of `SpecificMutableRow`. 
+ private type ValueWriter = (SpecializedGetters, Int) => Unit + + // `ValueWriter`s for all fields of the schema + private val rootFieldWriters: Seq[ValueWriter] = schema.map(_.dataType).map(makeWriter) + + private val gen = new JsonFactory().createGenerator(writer).setRootValueSeparator(null) + + private def makeWriter(dataType: DataType): ValueWriter = dataType match { +case NullType => + (row: SpecializedGetters, ordinal: Int) => +gen.writeNull() + +case BooleanType => + (row: SpecializedGetters, ordinal: Int) => +gen.writeBoolean(row.getBoolean(ordinal)) + +case ByteType => + (row: SpecializedGetters, ordinal: Int) => +gen.writeNumber(row.getByte(ordinal)) + +case ShortType => + (row: SpecializedGetters, ordinal: Int) => +gen.writeNumber(row.getShort(ordinal)) + +case IntegerType => + (row: SpecializedGetters, ordinal: Int) => +gen.writeNumber(row.getInt(ordinal)) + +case LongType => + (row: SpecializedGetters, ordinal: Int) => +gen.writeNumber(row.getLong(ordinal)) + +case FloatType => + (row:
[GitHub] spark issue #14071: [SPARK-16397][SQL] make CatalogTable more general and le...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14071 cc @yhuai @gatorsmile @liancheng @clockfly
[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/13701 @gatorsmile, we've not seen a penalty from running row-group-level tests when no row groups are filtered, and we've decided to turn on dictionary filtering by default. You may see a penalty from using Parquet's internal record-level filter rather than a codegened filter. My recommendation (which is being discussed on the Parquet list) is to add the ability to filter row groups without turning on record-level filters. That should be easy and would solve your problem.
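For context, row-group pruning in Spark is gated by the Parquet filter pushdown flag. The sketch below only shows how a pushed-down predicate is exercised from the DataFrame API; `spark.sql.parquet.filterPushdown` is a real Spark SQL conf, while the path and column name are made up for illustration.

```scala
// Hedged sketch: with filter pushdown enabled, a simple comparison predicate can
// prune whole row groups via their min/max statistics (and, with dictionary
// filtering, via dictionary pages) before any records are read and filtered.
import org.apache.spark.sql.SparkSession

object PushdownExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pushdown-example").getOrCreate()
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")

    // Hypothetical table and column; if a row group's statistics for `id`
    // exclude 42, the whole group is skipped without record-level filtering.
    val df = spark.read.parquet("/path/to/events")
    println(df.filter("id = 42").count())

    spark.stop()
  }
}
```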
[GitHub] spark issue #14077: [SPARK-16402] [SQL] JDBC Source: Implement save API of D...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14077 @JustinPihony How about first moving the `copy` function in your PR now? Then, we can review your PR before SPARK-16401 is resolved.
[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 @viirya Maybe you have not read my discussion with @rdblue. @rdblue already explained how Parquet works internally. As I said above, I think we still need a test to confirm that there is no noticeable extra penalty when no row is filtered out. Could you do a quick check based on your existing test? Thanks!
[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...
Github user janplus commented on a diff in the pull request: https://github.com/apache/spark/pull/14008#discussion_r69933267 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -652,6 +654,145 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression) override def prettyName: String = "rpad" } +object ParseUrl { + private val HOST = UTF8String.fromString("HOST") + private val PATH = UTF8String.fromString("PATH") + private val QUERY = UTF8String.fromString("QUERY") + private val REF = UTF8String.fromString("REF") + private val PROTOCOL = UTF8String.fromString("PROTOCOL") + private val FILE = UTF8String.fromString("FILE") + private val AUTHORITY = UTF8String.fromString("AUTHORITY") + private val USERINFO = UTF8String.fromString("USERINFO") + private val REGEXPREFIX = "(&|^)" + private val REGEXSUBFIX = "=([^&]*)" +} + +/** + * Extracts a part from a URL + */ +@ExpressionDescription( + usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL", + extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO. +Key specifies which query to extract. +Examples: + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST') + 'spark.apache.org' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY') + 'query=1' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query') + '1'""") +case class ParseUrl(children: Seq[Expression]) + extends Expression with ImplicitCastInputTypes with CodegenFallback { + + override def nullable: Boolean = true + override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType) + override def dataType: DataType = StringType + override def prettyName: String = "parse_url" + + // If the url is a constant, cache the URL object so that we don't need to convert url + // from UTF8String to String to URL for every row. + @transient private lazy val cachedUrl = children(0) match { +case Literal(url: UTF8String, _) => getUrl(url) +case _ => null + } + + // If the key is a constant, cache the Pattern object so that we don't need to convert key + // from UTF8String to String to StringBuilder to String to Pattern for every row. + @transient private lazy val cachedPattern = children(2) match { +case Literal(key: UTF8String, _) => getPattern(key) --- End diff -- Good point. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14052: [SPARK-15440] [Core] [Deploy] Add CSRF Filter for...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14052#discussion_r69933155 --- Diff: core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala --- @@ -93,6 +94,14 @@ private[spark] abstract class RestSubmissionServer( contextToServlet.foreach { case (prefix, servlet) => mainHandler.addServlet(new ServletHolder(servlet), prefix) } +if(masterConf.getBoolean("spark.rest.csrf.enable", false)) { --- End diff -- I'm not familiar with this, but the question is: should the config be documented? I assume this is something you want the end user to have the option of using. If this is only being used by the private REST APIs, should it be true by default, or what exactly are the ramifications of that?
[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...
Github user janplus commented on a diff in the pull request: https://github.com/apache/spark/pull/14008#discussion_r69932808 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -652,6 +654,145 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression) override def prettyName: String = "rpad" } +object ParseUrl { + private val HOST = UTF8String.fromString("HOST") + private val PATH = UTF8String.fromString("PATH") + private val QUERY = UTF8String.fromString("QUERY") + private val REF = UTF8String.fromString("REF") + private val PROTOCOL = UTF8String.fromString("PROTOCOL") + private val FILE = UTF8String.fromString("FILE") + private val AUTHORITY = UTF8String.fromString("AUTHORITY") + private val USERINFO = UTF8String.fromString("USERINFO") + private val REGEXPREFIX = "(&|^)" + private val REGEXSUBFIX = "=([^&]*)" +} + +/** + * Extracts a part from a URL + */ +@ExpressionDescription( + usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL", + extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO. +Key specifies which query to extract. +Examples: + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST') + 'spark.apache.org' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY') + 'query=1' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query') + '1'""") --- End diff -- OK, I'll fix this, thank you, --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of Python-o...
Github user vlad17 commented on the issue: https://github.com/apache/spark/pull/13778 LGTM +1
[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...
Github user janplus commented on a diff in the pull request: https://github.com/apache/spark/pull/14008#discussion_r69932567 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -652,6 +654,160 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression) override def prettyName: String = "rpad" } +object ParseUrl { + private val HOST = UTF8String.fromString("HOST") + private val PATH = UTF8String.fromString("PATH") + private val QUERY = UTF8String.fromString("QUERY") + private val REF = UTF8String.fromString("REF") + private val PROTOCOL = UTF8String.fromString("PROTOCOL") + private val FILE = UTF8String.fromString("FILE") + private val AUTHORITY = UTF8String.fromString("AUTHORITY") + private val USERINFO = UTF8String.fromString("USERINFO") + private val REGEXPREFIX = "(&|^)" + private val REGEXSUBFIX = "=([^&]*)" +} + +/** + * Extracts a part from a URL + */ +@ExpressionDescription( + usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL", + extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO. +Key specifies which query to extract. +Examples: + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST') + 'spark.apache.org' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY') + 'query=1' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query') + '1'""") +case class ParseUrl(children: Seq[Expression]) + extends Expression with ImplicitCastInputTypes with CodegenFallback { + + override def nullable: Boolean = true + override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType) + override def dataType: DataType = StringType + override def prettyName: String = "parse_url" + + // If the url is a constant, cache the URL object so that we don't need to convert url + // from UTF8String to String to URL for every row. + @transient private lazy val cachedUrl = stringExprs(0) match { +case Literal(url: UTF8String, _) => getUrl(url) +case _ => null + } + + // If the key is a constant, cache the Pattern object so that we don't need to convert key + // from UTF8String to String to StringBuilder to String to Pattern for every row. 
+ @transient private lazy val cachedPattern = stringExprs(2) match { +case Literal(key: UTF8String, _) => getPattern(key) +case _ => null + } + + private lazy val stringExprs = children.toArray + import ParseUrl._ + + override def checkInputDataTypes(): TypeCheckResult = { +if (children.size > 3 || children.size < 2) { + TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two or three arguments") +} else { + super[ImplicitCastInputTypes].checkInputDataTypes() +} + } + + private def getPattern(key: UTF8String): Pattern = { +if (key != null) { + Pattern.compile(REGEXPREFIX + key.toString + REGEXSUBFIX) +} else { + null +} + } + + private def getUrl(url: UTF8String): URL = { +try { + new URL(url.toString) +} catch { + case e: MalformedURLException => null +} + } + + private def extractValueFromQuery(query: UTF8String, pattern: Pattern): UTF8String = { +val m = pattern.matcher(query.toString) +if (m.find()) { + UTF8String.fromString(m.group(2)) +} else { + null +} + } + + private def extractFromUrl(url: URL, partToExtract: UTF8String): UTF8String = { +if (partToExtract.equals(HOST)) { + UTF8String.fromString(url.getHost) +} else if (partToExtract.equals(PATH)) { + UTF8String.fromString(url.getPath) +} else if (partToExtract.equals(QUERY)) { + UTF8String.fromString(url.getQuery) +} else if (partToExtract.equals(REF)) { + UTF8String.fromString(url.getRef) +} else if (partToExtract.equals(PROTOCOL)) { + UTF8String.fromString(url.getProtocol) +} else if (partToExtract.equals(FILE)) { + UTF8String.fromString(url.getFile) +} else if (partToExtract.equals(AUTHORITY)) { + UTF8String.fromString(url.getAuthority) +} else if (partToExtract.equals(USERINFO)) { + UTF8String.fromString(url.getUserInfo) +} else { + null --- End diff -- Since check it here is at Excutor side, it will be of no difference with current implementation. I think the point is whether we can assume that in almost all cases `part` is `Literal`. --- If your project is set up for it, you can reply to this email and have your reply
[GitHub] spark issue #14092: [SPARK-16419][SQL] EnsureRequirements adds extra Sort to...
Github user MasterDDT commented on the issue: https://github.com/apache/spark/pull/14092 cc @JoshRosen @rxin I wasn't sure if the right fix here is for `Expression` to override `equals` and use `semanticEquals`; that would be a bigger change, but I think it would work. I also noticed the `EquivalentExpressions` class, but that seemed to be only for codegen.
[GitHub] spark issue #14092: [SPARK-16419][SQL] EnsureRequirements adds extra Sort to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14092 Can one of the admins verify this patch?
[GitHub] spark issue #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` optimiz...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13765 Under what circumstances will a user use 2 or more adjacent re-partitioning operators?
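One plausible answer, sketched below: independently written transformations that each impose their own partitioning end up adjacent when composed. The table path, column, and partition counts are invented for illustration.

```scala
// Hedged sketch of how adjacent repartition operators can arise in user code.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AdjacentRepartitions {
  def loadEvents(spark: SparkSession): DataFrame =
    spark.read.parquet("/data/events").repartition(200)   // helper A's choice

  def prepareForJoin(events: DataFrame): DataFrame =
    events.repartition(50, col("userId"))                 // helper B's choice

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("adjacent-repartitions").getOrCreate()
    // The composed plan contains two adjacent re-partitioning operators;
    // CollapseRepartition can keep only the outer one.
    val prepared = prepareForJoin(loadEvents(spark))
    prepared.explain()
    spark.stop()
  }
}
```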
[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13765#discussion_r69930648 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -537,12 +537,19 @@ object CollapseProject extends Rule[LogicalPlan] { } /** - * Combines adjacent [[Repartition]] operators by keeping only the last one. + * Combines adjacent [[Repartition]] and [[RepartitionByExpression]] operator combinations + * by keeping only the one. --- End diff -- > ... by only keeping the top-level one.
[GitHub] spark pull request #14092: [SPARK-16419][SQL] EnsureRequirements adds extra ...
GitHub user MasterDDT opened a pull request: https://github.com/apache/spark/pull/14092 [SPARK-16419][SQL] EnsureRequirements adds extra Sort to already sorted cached table ## What changes were proposed in this pull request? EnsureRequirements compares the required and given sort orderings, but uses Scala equals instead of semantic equals, so column capitalization isn't considered and the comparison also fails for a cached table. As a result, a SortMergeJoin on a cached, already-sorted table adds an extra sort. This change uses semanticEquals instead of Scala equals to compare the two `Seq[SortOrder]`. ## How was this patch tested? Added 3 tests; the last 2 tests break without the fix. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ActionIQ/spark SPARK-16419 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14092.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14092 commit ab48ca6464f1c05cf58b4d0e0f1b7e617fdcb5fb Author: MasterDDT Date: 2016-07-05T20:51:24Z Add tests commit 372466bf4eda8aa7f8a7319ce682df9cdd61d666 Author: MasterDDT Date: 2016-07-06T17:36:04Z Add tests commit b4b02bf3879daf9a4532b61a019ea33b0f3ff835 Author: MasterDDT Date: 2016-07-07T15:30:58Z Add fix
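The comparison the description refers to can be sketched roughly as follows. This is an illustration of the approach, not the exact patch, and `orderingSatisfied` is an invented helper name.

```scala
// Hedged sketch: treat a required ordering as satisfied when it is a semantic
// prefix of the child's output ordering, so exprId or capitalization differences
// (e.g. from a cached plan) do not force an extra Sort.
import org.apache.spark.sql.catalyst.expressions.SortOrder

object OrderingCheck {
  def orderingSatisfied(required: Seq[SortOrder], given: Seq[SortOrder]): Boolean =
    required.length <= given.length &&
      required.zip(given).forall { case (r, g) => r.semanticEquals(g) }
}
```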
[GitHub] spark pull request #13765: [SPARK-16052][SQL] Improve `CollapseRepartition` ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13765#discussion_r69930213 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala --- @@ -370,8 +370,11 @@ package object dsl { case plan => SubqueryAlias(alias, plan) } - def distribute(exprs: Expression*): LogicalPlan = -RepartitionByExpression(exprs, logicalPlan) + def repartition(num: Integer): LogicalPlan = +Repartition(num, shuffle = true, logicalPlan) + + def distribute(exprs: Expression*)(n: Int = -1): LogicalPlan = +RepartitionByExpression(exprs, logicalPlan, numPartitions = if (n < 0) None else Some(n)) --- End diff -- Seems that adding a `distribute(n: Int, exprs: Expression*)` overloaded method is simpler?
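The two shapes under discussion, side by side. This is a fragment shaped like the DSL in the diff; the enclosing class and its `logicalPlan` field are stand-ins for the implicit DSL wrapper, and the constructor signature follows what the diff itself shows.

```scala
// Hedged sketch of the review suggestion, using the same constructor shown in the
// diff (RepartitionByExpression takes an Option[Int] for numPartitions there).
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, RepartitionByExpression}

class DslSketch(logicalPlan: LogicalPlan) {
  // Curried form from the patch; existing call sites would need a trailing ().
  def distributeCurried(exprs: Expression*)(n: Int = -1): LogicalPlan =
    RepartitionByExpression(exprs, logicalPlan, numPartitions = if (n < 0) None else Some(n))

  // Suggested overloads: existing distribute(exprs*) call sites stay unchanged.
  def distribute(exprs: Expression*): LogicalPlan =
    RepartitionByExpression(exprs, logicalPlan)

  def distribute(n: Int, exprs: Expression*): LogicalPlan =
    RepartitionByExpression(exprs, logicalPlan, Some(n))
}
```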
[GitHub] spark issue #14065: [SPARK-14743][YARN][WIP] Add a configurable token manage...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14065 Merged build finished. Test FAILed.
[GitHub] spark issue #14065: [SPARK-14743][YARN][WIP] Add a configurable token manage...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14065 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61912/ Test FAILed.
[GitHub] spark issue #14065: [SPARK-14743][YARN][WIP] Add a configurable token manage...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14065 **[Test build #61912 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61912/consoleFull)** for PR 14065 at commit [`c9d9ed0`](https://github.com/apache/spark/commit/c9d9ed0cd0aef6b8017c04e635f27ef123a48887). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14008#discussion_r69928758 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -652,6 +654,160 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression) override def prettyName: String = "rpad" } +object ParseUrl { + private val HOST = UTF8String.fromString("HOST") + private val PATH = UTF8String.fromString("PATH") + private val QUERY = UTF8String.fromString("QUERY") + private val REF = UTF8String.fromString("REF") + private val PROTOCOL = UTF8String.fromString("PROTOCOL") + private val FILE = UTF8String.fromString("FILE") + private val AUTHORITY = UTF8String.fromString("AUTHORITY") + private val USERINFO = UTF8String.fromString("USERINFO") + private val REGEXPREFIX = "(&|^)" + private val REGEXSUBFIX = "=([^&]*)" +} + +/** + * Extracts a part from a URL + */ +@ExpressionDescription( + usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL", + extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO. +Key specifies which query to extract. +Examples: + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST') + 'spark.apache.org' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY') + 'query=1' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query') + '1'""") +case class ParseUrl(children: Seq[Expression]) + extends Expression with ImplicitCastInputTypes with CodegenFallback { + + override def nullable: Boolean = true + override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType) + override def dataType: DataType = StringType + override def prettyName: String = "parse_url" + + // If the url is a constant, cache the URL object so that we don't need to convert url + // from UTF8String to String to URL for every row. + @transient private lazy val cachedUrl = stringExprs(0) match { +case Literal(url: UTF8String, _) => getUrl(url) +case _ => null + } + + // If the key is a constant, cache the Pattern object so that we don't need to convert key + // from UTF8String to String to StringBuilder to String to Pattern for every row. 
+ @transient private lazy val cachedPattern = stringExprs(2) match { +case Literal(key: UTF8String, _) => getPattern(key) +case _ => null + } + + private lazy val stringExprs = children.toArray + import ParseUrl._ + + override def checkInputDataTypes(): TypeCheckResult = { +if (children.size > 3 || children.size < 2) { + TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two or three arguments") +} else { + super[ImplicitCastInputTypes].checkInputDataTypes() +} + } + + private def getPattern(key: UTF8String): Pattern = { +if (key != null) { + Pattern.compile(REGEXPREFIX + key.toString + REGEXSUBFIX) +} else { + null +} + } + + private def getUrl(url: UTF8String): URL = { +try { + new URL(url.toString) +} catch { + case e: MalformedURLException => null +} + } + + private def extractValueFromQuery(query: UTF8String, pattern: Pattern): UTF8String = { +val m = pattern.matcher(query.toString) +if (m.find()) { + UTF8String.fromString(m.group(2)) +} else { + null +} + } + + private def extractFromUrl(url: URL, partToExtract: UTF8String): UTF8String = { +if (partToExtract.equals(HOST)) { + UTF8String.fromString(url.getHost) +} else if (partToExtract.equals(PATH)) { + UTF8String.fromString(url.getPath) +} else if (partToExtract.equals(QUERY)) { + UTF8String.fromString(url.getQuery) +} else if (partToExtract.equals(REF)) { + UTF8String.fromString(url.getRef) +} else if (partToExtract.equals(PROTOCOL)) { + UTF8String.fromString(url.getProtocol) +} else if (partToExtract.equals(FILE)) { + UTF8String.fromString(url.getFile) +} else if (partToExtract.equals(AUTHORITY)) { + UTF8String.fromString(url.getAuthority) +} else if (partToExtract.equals(USERINFO)) { + UTF8String.fromString(url.getUserInfo) +} else { + null --- End diff -- Can we add a boolean field to indicate whether `part` is foldable and check it here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if
[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14008#discussion_r69928094 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -652,6 +654,145 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression) override def prettyName: String = "rpad" } +object ParseUrl { + private val HOST = UTF8String.fromString("HOST") + private val PATH = UTF8String.fromString("PATH") + private val QUERY = UTF8String.fromString("QUERY") + private val REF = UTF8String.fromString("REF") + private val PROTOCOL = UTF8String.fromString("PROTOCOL") + private val FILE = UTF8String.fromString("FILE") + private val AUTHORITY = UTF8String.fromString("AUTHORITY") + private val USERINFO = UTF8String.fromString("USERINFO") + private val REGEXPREFIX = "(&|^)" + private val REGEXSUBFIX = "=([^&]*)" +} + +/** + * Extracts a part from a URL + */ +@ExpressionDescription( + usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL", + extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO. +Key specifies which query to extract. +Examples: + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST') + 'spark.apache.org' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY') + 'query=1' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query') + '1'""") +case class ParseUrl(children: Seq[Expression]) + extends Expression with ImplicitCastInputTypes with CodegenFallback { + + override def nullable: Boolean = true + override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType) + override def dataType: DataType = StringType + override def prettyName: String = "parse_url" + + // If the url is a constant, cache the URL object so that we don't need to convert url + // from UTF8String to String to URL for every row. + @transient private lazy val cachedUrl = children(0) match { +case Literal(url: UTF8String, _) => getUrl(url) +case _ => null + } + + // If the key is a constant, cache the Pattern object so that we don't need to convert key + // from UTF8String to String to StringBuilder to String to Pattern for every row. + @transient private lazy val cachedPattern = children(2) match { +case Literal(key: UTF8String, _) => getPattern(key) --- End diff -- The need for verifying the behavior in Scala REPL probably indicates that we should check for null explicitly to make it more readabie. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
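The subtlety behind that suggestion is that a typed pattern never matches null, so a null string literal silently falls through to the catch-all; an explicit case states the intent. A small standalone illustration follows; the `describe` helper is invented for the example.

```scala
// A null value does not match the typed pattern `key: UTF8String`, which is why
// the original behavior had to be verified in the REPL. Spelling out the null
// case makes the intent explicit.
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

object NullLiteralPattern {
  def describe(e: Any): String = e match {
    case Literal(key: UTF8String, _) => s"non-null string literal: $key"
    case Literal(null, _)            => "null literal (previously fell into the catch-all)"
    case _                           => "not a literal"
  }

  def main(args: Array[String]): Unit = {
    println(describe(Literal(UTF8String.fromString("query"), StringType)))
    println(describe(Literal(null, StringType)))
  }
}
```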
[GitHub] spark issue #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get user conf...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/14088 Can you please fix the description? "Fix bugs for "Can not get user config when calling SparkHadoopUtil.get.conf in other places"." doesn't make sense to me. Where exactly is SparkHadoopUtil being instantiated before the ApplicationMaster? What cases does this apply to (client mode, cluster mode, etc.)? How do I reproduce it? Also, you say existing unit tests cover this; was one failing because of this? If not, perhaps we should add one.
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r69927398 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala --- @@ -0,0 +1,251 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark + +import scala.util.Random + +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder +import org.apache.spark.sql.catalyst.expressions.{UnsafeArrayData, UnsafeRow} +import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeArrayWriter} +import org.apache.spark.util.Benchmark + +/** + * Benchmark [[UnsafeArrayDataBenchmark]] for UnsafeArrayData + * To run this: + * 1. replace ignore(...) with test(...) + * 2. build/sbt "sql/test-only *benchmark.UnsafeArrayDataBenchmark" + * + * Benchmarks in this file are skipped in normal builds. + */ +class UnsafeArrayDataBenchmark extends BenchmarkBase { + + def calculateHeaderPortionInBytes(count: Int) : Int = { +// Use this assignment for SPARK-15962 +// val size = 4 + 4 * count +val size = UnsafeArrayData.calculateHeaderPortionInBytes(count) +size + } + + def readUnsafeArray(iters: Int): Unit = { +val count = 1024 * 1024 * 16 +val rand = new Random(42) + +var intResult: Int = 0 +val intBuffer = Array.fill[Int](count) { rand.nextInt } +val intEncoder = ExpressionEncoder[Array[Int]].resolveAndBind() +val intInternalRow = intEncoder.toRow(intBuffer) +val intUnsafeArray = intInternalRow.getArray(0) +val readIntArray = { i: Int => + var n = 0 + while (n < iters) { +val len = intUnsafeArray.numElements +var sum = 0.toInt +var i = 0 +while (i < len) { + sum += intUnsafeArray.getInt(i) + i += 1 +} +intResult = sum +n += 1 + } +} + +var doubleResult: Double = 0 +val doubleBuffer = Array.fill[Double](count) { rand.nextDouble } +val doubleEncoder = ExpressionEncoder[Array[Double]].resolveAndBind() +val doubleInternalRow = doubleEncoder.toRow(doubleBuffer) +val doubleUnsafeArray = doubleInternalRow.getArray(0) +val readDoubleArray = { i: Int => + var n = 0 + while (n < iters) { +val len = doubleUnsafeArray.numElements +var sum = 0.toDouble +var i = 0 +while (i < len) { + sum += doubleUnsafeArray.getDouble(i) + i += 1 +} +doubleResult = sum +n += 1 + } +} + +val benchmark = new Benchmark("Read UnsafeArrayData", count * iters) +benchmark.addCase("Int")(readIntArray) +benchmark.addCase("Double")(readDoubleArray) +benchmark.run +/* +Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.4 +Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz + +Read UnsafeArrayData:Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative + +Int279 / 294600.4 1.7 1.0X +Double 296 / 303567.0 1.8 0.9X +*/ + } + + def writeUnsafeArray(iters: Int): Unit = { +val count = 1024 * 
1024 * 16 + +val intUnsafeRow = new UnsafeRow(1) +val intUnsafeArrayWriter = new UnsafeArrayWriter --- End diff -- Got it. My interpretation was to use an `UnsafeArray` generated by `encoder.toRow(array)` for the benchmark. I will update `writeUnsafeArray` to measure the elapsed time of `encoder.toRow(array)`.
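A rough sketch of what "measure the elapsed time of `encoder.toRow(array)`" could look like, reusing the helpers already present in the quoted benchmark; the array size and case name are arbitrary, and `Benchmark` is Spark's internal test utility as used in the diff.

```scala
// Hedged sketch: time the encoder conversion itself, which is where the
// UnsafeArrayData is actually written.
import scala.util.Random

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.util.Benchmark

object WriteUnsafeArraySketch {
  def main(args: Array[String]): Unit = {
    val count = 1024 * 1024
    val rand = new Random(42)
    val intArray = Array.fill[Int](count)(rand.nextInt())
    val intEncoder = ExpressionEncoder[Array[Int]].resolveAndBind()

    val benchmark = new Benchmark("Write UnsafeArrayData", count)
    benchmark.addCase("Int") { _ =>
      intEncoder.toRow(intArray)   // conversion to UnsafeArrayData happens here
    }
    benchmark.run()
  }
}
```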
[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14008#discussion_r69927073 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -652,6 +654,145 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression) override def prettyName: String = "rpad" } +object ParseUrl { + private val HOST = UTF8String.fromString("HOST") + private val PATH = UTF8String.fromString("PATH") + private val QUERY = UTF8String.fromString("QUERY") + private val REF = UTF8String.fromString("REF") + private val PROTOCOL = UTF8String.fromString("PROTOCOL") + private val FILE = UTF8String.fromString("FILE") + private val AUTHORITY = UTF8String.fromString("AUTHORITY") + private val USERINFO = UTF8String.fromString("USERINFO") + private val REGEXPREFIX = "(&|^)" + private val REGEXSUBFIX = "=([^&]*)" +} + +/** + * Extracts a part from a URL + */ +@ExpressionDescription( + usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL", + extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO. +Key specifies which query to extract. +Examples: + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST') + 'spark.apache.org' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY') + 'query=1' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query') + '1'""") --- End diff -- We should probably `.stripMargin` here: ```scala """... |... """.stripMargin ``` Otherwise all leading white spaces are included in the extended description string. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
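For reference, the effect of the suggestion in isolation: with `stripMargin`, each line's leading whitespace up to and including the `|` is removed, so the source indentation does not end up in the extended description. The string content below is abbreviated from the usage text quoted above.

```scala
// Without stripMargin the indentation of the Scala source leaks into the string;
// with | margins it does not.
val extended =
  """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO.
    |Key specifies which query to extract.
    |Examples:
    |  > SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST')
    |  'spark.apache.org'""".stripMargin
```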
[GitHub] spark issue #14077: [SPARK-16402] [SQL] JDBC Source: Implement save API of D...
Github user JustinPihony commented on the issue: https://github.com/apache/spark/pull/14077 Thanks. I will have to wait until SPARK-16401 is resolved, though, or else the code will not pass tests.
[GitHub] spark issue #14065: [SPARK-14743][YARN][WIP] Add a configurable token manage...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/14065 I took a quick look through. It might be nice to think about how we could handle other credentials. For instance, Apache Kafka currently doesn't have tokens, so you need a keytab or TGT and a JAAS conf file. Yes, they are adding tokens, but in the meantime how does that work? Are there other services similar to that? Can we handle things other than tokens? It does appear that I could implement my own ServiceTokenProvider that goes off to really any service, and I can put things into the Credentials object as a Token or as a Secret, so perhaps we are covered here. But perhaps that means we should rename things to obtainCredentials rather than obtainTokens. Are there specific services you were thinking about here? We could at least use those as examples to make sure the interface fits them.
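A hedged sketch of what the rename could look like. This is not the interface in the WIP patch; the trait name, method names, and signatures are invented here purely to illustrate providers populating a Hadoop `Credentials` object (tokens or secret keys) instead of returning delegation tokens directly.

```scala
// Illustrative only: names and signatures are invented for this sketch.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials

import org.apache.spark.SparkConf

trait ServiceCredentialProvider {
  /** Short name used to enable or disable the provider via configuration. */
  def serviceName: String

  /** Whether this service needs credentials in the current deployment. */
  def credentialsRequired(hadoopConf: Configuration): Boolean = true

  /**
   * Obtain credentials for this service (delegation tokens, secret keys, ...)
   * and add them to `creds`. Returns the next renewal time, if any.
   */
  def obtainCredentials(
      hadoopConf: Configuration,
      sparkConf: SparkConf,
      creds: Credentials): Option[Long]
}
```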
[GitHub] spark issue #13123: [SPARK-15422] [Core] Remove unnecessary calculation of s...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13123 I think this change has already been made by another PR, #13677. We can close this one now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14065: [SPARK-14743][YARN][WIP] Add a configurable token...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14065#discussion_r69918095 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala --- @@ -390,8 +390,9 @@ private[spark] class Client( // Upload Spark and the application JAR to the remote file system if necessary, // and add them as local resources to the application master. val fs = destDir.getFileSystem(hadoopConf) -val nns = YarnSparkHadoopUtil.get.getNameNodesToAccess(sparkConf) + destDir -YarnSparkHadoopUtil.get.obtainTokensForNamenodes(nns, hadoopConf, credentials) +hdfsTokenProvider(sparkConf).setNameNodesToAccess(sparkConf, Set(destDir)) --- End diff -- It seems a bit odd to me that we are doing these extra things for HDFS outside of the token provider. What happens if a user needs to do something similar for some other service and wants to implement their own class to handle it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
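As a sketch of the alternative being hinted at, a provider could derive its own inputs from configuration instead of relying on HDFS-specific setters at the call site, so a user-supplied provider for another service would not need special-cased wiring. The class and method names below are made up for illustration; `spark.yarn.access.namenodes` is used here only as a plausible source of extra filesystems, and the Hadoop calls (`Path.getFileSystem`, `FileSystem.addDelegationTokens`, `Master.getMasterPrincipal`) are real APIs.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.Master
import org.apache.hadoop.security.Credentials
import org.apache.spark.SparkConf

// Illustrative only: the provider figures out which filesystems to visit from
// configuration, so callers never need to invoke setNameNodesToAccess(...) first.
class SelfContainedHdfsTokenProvider {
  def obtainTokens(sparkConf: SparkConf, hadoopConf: Configuration, creds: Credentials): Unit = {
    val extraFileSystems = sparkConf.get("spark.yarn.access.namenodes", "")
      .split(",").map(_.trim).filter(_.nonEmpty).map(new Path(_)).toSet
    val renewer = Master.getMasterPrincipal(hadoopConf)
    extraFileSystems.foreach { path =>
      // Collects delegation tokens for each configured namenode into `creds`.
      path.getFileSystem(hadoopConf).addDelegationTokens(renewer, creds)
    }
  }
}
```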
[GitHub] spark pull request #14065: [SPARK-14743][YARN][WIP] Add a configurable token...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14065#discussion_r69917965 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/token/HDFSTokenProvider.scala --- @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.deploy.yarn.token + +import java.io.{ByteArrayInputStream, DataInputStream} + +import scala.collection.JavaConverters._ +import scala.collection.mutable.ArrayBuffer + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier +import org.apache.hadoop.mapred.Master +import org.apache.hadoop.security.Credentials +import org.apache.hadoop.security.token.Token + +import org.apache.spark.{SparkConf, SparkException} +import org.apache.spark.deploy.yarn.config._ +import org.apache.spark.internal.Logging +import org.apache.spark.internal.config._ + +private[yarn] class HDFSTokenProvider + extends ServiceTokenProvider with ServiceTokenRenewable with Logging { + + private var nnsToAccess: Set[Path] = Set.empty + private var tokenRenewer: Option[String] = None + + override val serviceName: String = "hdfs" + + override def obtainTokensFromService( + sparkConf: SparkConf, + serviceConf: Configuration, + creds: Credentials) +: Array[Token[_]] = { +val tokens = ArrayBuffer[Token[_]]() +val renewer = tokenRenewer.getOrElse(getTokenRenewer(serviceConf)) +nnsToAccess.foreach { dst => + val dstFs = dst.getFileSystem(serviceConf) + logInfo("getting token for namenode: " + dst) + tokens ++= dstFs.addDelegationTokens(renewer, creds) +} + +tokens.toArray + } + + override def getTokenRenewalInterval(sparkConf: SparkConf, serviceConf: Configuration): Long = { +// We cannot use the tokens generated above since those have renewer yarn. Trying to renew --- End diff -- update comment to be more specific then "above" since this has moved --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @yhuai BTW, when reading more row groups, the performance improvement is much larger.

Before this patch:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Parquet reader:           Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
reading Parquet file            1416 / 2168          1.4          691.3        1.0X

After this patch:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Parquet reader:           Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
reading Parquet file             246 /  334          8.3          120.3        1.0X

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
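For context, a minimal way to exercise this code path from the DataFrame API looks roughly like the following sketch; the input path and predicate are placeholders, and `spark.sql.parquet.filterPushdown` is the flag that controls whether eligible predicates are handed down to the Parquet reader.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ParquetFilterPushdown").getOrCreate()

// Make sure Parquet filter pushdown is on so eligible predicates can be used to
// skip whole row groups based on their statistics instead of scanning every row.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

// "/tmp/events.parquet" and the predicate are placeholders for illustration.
val events = spark.read.parquet("/tmp/events.parquet")
events.filter("value > 100").count()
```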
[GitHub] spark pull request #14065: [SPARK-14743][YARN][WIP] Add a configurable token...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14065#discussion_r69916415 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/token/AMDelegationTokenRenewer.scala --- @@ -171,10 +174,9 @@ private[yarn] class AMDelegationTokenRenewer( keytabLoggedInUGI.doAs(new PrivilegedExceptionAction[Void] { // Get a copy of the credentials override def run(): Void = { -val nns = YarnSparkHadoopUtil.get.getNameNodesToAccess(sparkConf) + dst -hadoopUtil.obtainTokensForNamenodes(nns, freshHadoopConf, tempCreds) -hadoopUtil.obtainTokenForHiveMetastore(sparkConf, freshHadoopConf, tempCreds) -hadoopUtil.obtainTokenForHBase(sparkConf, freshHadoopConf, tempCreds) +hdfsTokenProvider(sparkConf).setNameNodesToAccess(sparkConf, Set(dst)) +hdfsTokenProvider(sparkConf).setTokenRenewer(null) --- End diff -- Why are we setting this to null here? Is it supposed to indicate that it's not supposed to be renewed internally? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13620: [SPARK-15590] [WEBUI] Paginate Job Table in Jobs tab
Github user nblintao commented on the issue: https://github.com/apache/spark/pull/13620 I believe this commit has resolved the bugs reported by @ajbozarth. It now displays correctly on the history server pages, and it keeps the state of the other tables while one of them is being changed. Could you please help test or review it if you are available? Thanks! @andrewor14 @zsxwing @ajbozarth --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14091: [SPARK-16412][SQL] Generate Java code that gets an array...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14091 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14091: [SPARK-16412][SQL] Generate Java code that gets an array...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14091 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61913/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14091: [SPARK-16412][SQL] Generate Java code that gets an array...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14091 **[Test build #61913 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61913/consoleFull)** for PR 14091 at commit [`54df41c`](https://github.com/apache/spark/commit/54df41c8691f02dd9eac3eef3d816a130b87a5c9). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14091: [SPARK-16412][SQL] Generate Java code that gets an array...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14091 **[Test build #61913 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61913/consoleFull)** for PR 14091 at commit [`54df41c`](https://github.com/apache/spark/commit/54df41c8691f02dd9eac3eef3d816a130b87a5c9). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14091: [SPARK-16412][SQL] Generate Java code that gets a...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/14091 [SPARK-16412][SQL] Generate Java code that gets an array in each column of CachedBatch when DataFrame.cache() is called ## What changes were proposed in this pull request? Waiting for #11956 to be merged. This PR generates Java code to directly get an array from each column of CachedBatch when DataFrame.cache() is called. This is done in whole-stage code generation. When DataFrame.cache() is called, data is stored as column-oriented storage (the columnar cache) in CachedBatch. This PR avoids the conversion from column-oriented storage to row-oriented storage, and it handles an array type that is stored in a column. This PR generates code for both row-oriented storage and column-oriented storage only if (1) InMemoryColumnarTableScan exists in a plan sub-tree (the decision is made by checking whether a given iterator is a ColumnarIterator at runtime), and (2) no sort or join exists in the plan sub-tree. It generates Java code for the columnar cache only if the types of all columns accessed by the operations are primitive or array types. I will add benchmark suites [here](https://github.com/kiszk/spark/blob/SPARK-14098/sql/core/src/test/scala/org/apache/spark/sql/DataFrameCacheBenchmark.scala) ## How was this patch tested? Added new tests into `DataFrameCacheSuite.scala` You can merge this pull request into a Git repository by running: $ git pull https://github.com/kiszk/spark SPARK-16412 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14091.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14091 commit 09af5a5851786b918f45c6f997b1c357745fe883 Author: Kazuaki Ishizaki Date: 2016-07-07T10:36:14Z support codegen for an array in CachedBatch commit 8e218e38d5acb6c04db221fcd3cd6d2483926552 Author: Kazuaki Ishizaki Date: 2016-07-07T10:36:34Z update test suites commit 54df41c8691f02dd9eac3eef3d816a130b87a5c9 Author: Kazuaki Ishizaki Date: 2016-07-07T13:18:58Z remove debug print --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
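A small sketch of how the path described above would be exercised from user code, assuming a Spark 2.x `SparkSession`; nothing here is specific to this PR beyond the fact that `cache()` stores data in the in-memory columnar format.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("ColumnarCacheSketch").getOrCreate()
import spark.implicits._

// A DataFrame with a primitive column and an array column, cached in the
// in-memory columnar format (CachedBatch).
val df = spark.range(0, 1024 * 1024)
  .select($"id", array($"id", $"id" + 1).as("pair"))
  .cache()

df.count()                    // first action materializes the columnar cache
df.select(sum($"id")).show()  // subsequent scans read from the cached batches
```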
[GitHub] spark issue #13620: [SPARK-15590] [WEBUI] Paginate Job Table in Jobs tab
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13620 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13620: [SPARK-15590] [WEBUI] Paginate Job Table in Jobs tab
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13620 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61910/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13620: [SPARK-15590] [WEBUI] Paginate Job Table in Jobs tab
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13620 **[Test build #61910 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61910/consoleFull)** for PR 13620 at commit [`649bb19`](https://github.com/apache/spark/commit/649bb195ac0eaf6cee4a84dd8ff1198900e8789a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14089: [SPARK-16415][SQL] fix catalog string error
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14089 **[Test build #61909 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61909/consoleFull)** for PR 14089 at commit [`eb10181`](https://github.com/apache/spark/commit/eb1018108a879a06701c3dff539ef8d10ab2b118). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14089: [SPARK-16415][SQL] fix catalog string error
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14089 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r69910685 --- Diff: yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala --- @@ -125,8 +125,11 @@ private[spark] abstract class YarnSchedulerBackend( * This includes executors already pending or running. */ override def doRequestTotalExecutors(requestedTotal: Int): Boolean = { + +val nodeBlacklist: Set[String] = scheduler.blacklistTracker.nodeBlacklist() --- End diff -- this is safe to call without a lock on the task scheduler because the nodeBlacklist is stored in an AtomicReference, which is set whenever the node blacklist changes. We could just get a lock on the task scheduler here -- but I felt that it would just be harder to ensure correctness (not just in this change, but making sure the right lock was always held through future changes), and the only real cost is that we duplicate the set of nodes in the blacklist, which is hopefully very small. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
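The pattern described above, reduced to a standalone sketch (names are illustrative, not the PR's actual classes): the writer publishes an immutable snapshot through an `AtomicReference`, so readers can take a consistent copy without acquiring the scheduler's lock.

```scala
import java.util.concurrent.atomic.AtomicReference

class NodeBlacklistSnapshot {
  // Always holds an immutable Set; readers just call get(), no locking needed.
  private val snapshot = new AtomicReference[Set[String]](Set.empty)

  // Writer side: recompute under whatever lock guards the bookkeeping, then
  // publish the new set atomically by replacing (never mutating) the old one.
  def publish(blacklistedNodes: Set[String]): Unit = snapshot.set(blacklistedNodes)

  def current(): Set[String] = snapshot.get()
}

val bl = new NodeBlacklistSnapshot
bl.publish(Set("host-1", "host-7"))
assert(bl.current().contains("host-7"))
```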
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14090 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14090 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61911/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14090 **[Test build #61911 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61911/consoleFull)** for PR 14090 at commit [`7781d1c`](https://github.com/apache/spark/commit/7781d1c111f38e3608d5ebd468e6d344d52efa5c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11956: [SPARK-14098][SQL] Generate Java code that gets a float/...
Github user a-roberts commented on the issue: https://github.com/apache/spark/pull/11956 @robbinspg and I are evaluating this from a functional and performance perspective, full disclosure: we both work for IBM with @kiszk. All unit tests pass including the new ones Ishizaki has added, we've tested this on a variety of platforms, both big and little-endian. This is with IBM Java 8 and tested on three different architectures. We can run the benchmark with ``` bin/spark-submit --class org.apache.spark.sql.DataFrameCacheBenchmark sql/core/target/spark-sql_2.11-2.0.0-tests.jar ``` or can be run against branch-2.0 (Spark 2.0.1 snapshot) with ``` bin/spark-submit --class org.apache.spark.sql.DataFrameCacheBenchmark sql/core/target/spark-sql_2.11-2.0.1-SNAPSHOT-tests.jar ``` Performance results on a few low powered testing systems are promising. Linux on Intel: 5.3x increase ``` Stopped after 15 iterations, 2127 ms IBM J9 VM pxa6480sr3-20160428_01 (SR3) on Linux 3.13.0-65-generic Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz Float Sum with PassThrough cache:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative InternalRow codegen669 / 829 47.1 21.3 1.0X ColumnVector codegen 127 / 142248.2 4.0 5.3X ``` Linux on Z: 2.7x increase ``` Stopped after 5 iterations, 2068 ms IBM J9 VM pxz6480sr3-20160428_01 (SR3) on Linux 3.12.43-52.6-default 16/07/07 09:48:15 ERROR Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: Unknown processor Float Sum with PassThrough cache:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative InternalRow codegen997 / 1134 31.5 31.7 1.0X ColumnVector codegen 371 / 414 84.7 11.8 2.7X ``` Linux on Power: 6.4x increase ``` Stopped after 7 iterations, 2099 ms IBM J9 VM pxl6480sr3-20160428_01 (SR3) on Linux 3.13.0-61-generic 16/07/07 14:33:40 ERROR Utils: Process List(/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: Unknown processor Float Sum with PassThrough cache:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative InternalRow codegen 1199 / 1212 26.2 38.1 1.0X ColumnVector codegen 186 / 300168.8 5.9 6.4X ``` So the performance increase and functionality is solid across platforms, Ishizaki has tested this with OpenJDK 8 also. One improvement would be add a scale factor parameter so we can use more data than: ``` doubleSumBenchmark(1024 * 1024 * 15) floatSumBenchmark(1024 * 1024 * 30) ``` and with no parameter we'd use the above as a standard/baseline. Would also be useful to have the master url as a parameter so we can easily run this using many machines or with more cores to see the performance/functional impact when we scale (exercising various JIT levels for example) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
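One possible shape for the scale-factor suggestion above, sketched under the assumption that the benchmark's `main` can take optional arguments; the argument handling and object name are hypothetical, not the existing DataFrameCacheBenchmark code.

```scala
// Hypothetical wrapper: `spark-submit --class ... <tests-jar> 4` would run the
// benchmark on 4x the default data sizes; no argument keeps today's baseline.
object DataFrameCacheBenchmarkMain {
  def main(args: Array[String]): Unit = {
    val scale = args.headOption.map(_.toInt).getOrElse(1)
    val doubleRows = 1024 * 1024 * 15 * scale
    val floatRows  = 1024 * 1024 * 30 * scale
    println(s"Running with scale=$scale: $doubleRows double rows, $floatRows float rows")
    // doubleSumBenchmark(doubleRows)   // would call into the existing suite
    // floatSumBenchmark(floatRows)
  }
}
```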
[GitHub] spark issue #14065: [SPARK-14743][YARN][WIP] Add a configurable token manage...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14065 **[Test build #61912 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61912/consoleFull)** for PR 14065 at commit [`c9d9ed0`](https://github.com/apache/spark/commit/c9d9ed0cd0aef6b8017c04e635f27ef123a48887). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14089: [SPARK-16415][SQL] fix catalog string error
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14089 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61909/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14090: [SPARK-16112][SparkR] Programming guide for gapply/gappl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14090 **[Test build #61911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61911/consoleFull)** for PR 14090 at commit [`7781d1c`](https://github.com/apache/spark/commit/7781d1c111f38e3608d5ebd468e6d344d52efa5c). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14090: [SPARK-16112][SparkR] Programming guide for gappl...
GitHub user NarineK opened a pull request: https://github.com/apache/spark/pull/14090 [SPARK-16112][SparkR] Programming guide for gapply/gapplyCollect ## What changes were proposed in this pull request? Updates the programming guide for spark.gapply/spark.gapplyCollect. Similar to the other examples, I used the faithful dataset to demonstrate gapply's functionality. Please let me know if you prefer another example. ## How was this patch tested? Existing test cases in R You can merge this pull request into a Git repository by running: $ git pull https://github.com/NarineK/spark gapplyProgGuide Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14090.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14090 commit 29d8a5c6c22202cdf7d6cc44f1d6cbeca5946918 Author: Narine Kokhlikyan Date: 2016-06-20T22:12:11Z Fixed duplicated documentation problem + separated documentation for dapply and dapplyCollect commit 698c4331d2a8bfe7f4b372ebc8123b6c27a57e68 Author: Narine Kokhlikyan Date: 2016-06-23T18:51:48Z merge with master commit 85a4493a03b3601a93c25ebc1eafb2868efec8d8 Author: Narine Kokhlikyan Date: 2016-07-07T13:18:49Z Adding programming guide for gapply/gapplyCollect commit 7781d1c111f38e3608d5ebd468e6d344d52efa5c Author: Narine Kokhlikyan Date: 2016-07-07T13:27:35Z removing output format --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...
Github user janplus commented on a diff in the pull request: https://github.com/apache/spark/pull/14008#discussion_r69905666 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -652,6 +654,145 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression) override def prettyName: String = "rpad" } +object ParseUrl { + private val HOST = UTF8String.fromString("HOST") + private val PATH = UTF8String.fromString("PATH") + private val QUERY = UTF8String.fromString("QUERY") + private val REF = UTF8String.fromString("REF") + private val PROTOCOL = UTF8String.fromString("PROTOCOL") + private val FILE = UTF8String.fromString("FILE") + private val AUTHORITY = UTF8String.fromString("AUTHORITY") + private val USERINFO = UTF8String.fromString("USERINFO") + private val REGEXPREFIX = "(&|^)" + private val REGEXSUBFIX = "=([^&]*)" +} + +/** + * Extracts a part from a URL + */ +@ExpressionDescription( + usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL", + extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO. +Key specifies which query to extract. +Examples: + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST') + 'spark.apache.org' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY') + 'query=1' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query') + '1'""") +case class ParseUrl(children: Seq[Expression]) + extends Expression with ImplicitCastInputTypes with CodegenFallback { + + override def nullable: Boolean = true + override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType) + override def dataType: DataType = StringType + override def prettyName: String = "parse_url" + + // If the url is a constant, cache the URL object so that we don't need to convert url + // from UTF8String to String to URL for every row. + @transient private lazy val cachedUrl = children(0) match { +case Literal(url: UTF8String, _) => getUrl(url) +case _ => null + } + + // If the key is a constant, cache the Pattern object so that we don't need to convert key + // from UTF8String to String to StringBuilder to String to Pattern for every row. + @transient private lazy val cachedPattern = children(2) match { +case Literal(key: UTF8String, _) => getPattern(key) --- End diff -- Test in 2.10.5 too. The result is the same. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11157: [SPARK-11714][Mesos] Make Spark on Mesos honor port rest...
Github user skonto commented on the issue: https://github.com/apache/spark/pull/11157 WIP. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11157: [SPARK-11714][Mesos] Make Spark on Mesos honor po...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/11157#discussion_r69898133 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala --- @@ -356,4 +374,233 @@ private[mesos] trait MesosSchedulerUtils extends Logging { sc.conf.getTimeAsSeconds("spark.mesos.rejectOfferDurationForReachedMaxCores", "120s") } + /** + * Checks executor ports if they are within some range of the offered list of ports ranges, + * + * @param sc the Spark Context + * @param ports the list of ports to check + * @return true if ports are within range false otherwise + */ + protected def checkPorts(sc: SparkContext, ports: List[(Long, Long)]): Boolean = { + +def checkIfInRange(port: Long, ps: List[(Long, Long)]): Boolean = { + ps.exists(r => r._1 <= port & r._2 >= port) +} + +val portsToCheck = ManagedPorts.getPortValues(sc.conf) +val nonZeroPorts = portsToCheck.filter(_ != 0) +val withinRange = nonZeroPorts.forall(p => checkIfInRange(p, ports)) +// make sure we have enough ports to allocate per offer +ports.map(r => r._2 - r._1 + 1).sum >= portsToCheck.size && withinRange + } + + /** + * Partitions port resources. + * + * @param conf the spark config + * @param ports the ports offered + * @return resources left, port resources to be used and the list of assigned ports + */ + def partitionPorts( + conf: SparkConf, --- End diff -- Ok. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14008#discussion_r69895008 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -652,6 +654,160 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression) override def prettyName: String = "rpad" } +object ParseUrl { + private val HOST = UTF8String.fromString("HOST") + private val PATH = UTF8String.fromString("PATH") + private val QUERY = UTF8String.fromString("QUERY") + private val REF = UTF8String.fromString("REF") + private val PROTOCOL = UTF8String.fromString("PROTOCOL") + private val FILE = UTF8String.fromString("FILE") + private val AUTHORITY = UTF8String.fromString("AUTHORITY") + private val USERINFO = UTF8String.fromString("USERINFO") + private val REGEXPREFIX = "(&|^)" + private val REGEXSUBFIX = "=([^&]*)" +} + +/** + * Extracts a part from a URL + */ +@ExpressionDescription( + usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL", + extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO. +Key specifies which query to extract. +Examples: + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST') + 'spark.apache.org' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY') + 'query=1' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query') + '1'""") +case class ParseUrl(children: Seq[Expression]) + extends Expression with ImplicitCastInputTypes with CodegenFallback { + + override def nullable: Boolean = true + override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType) + override def dataType: DataType = StringType + override def prettyName: String = "parse_url" + + // If the url is a constant, cache the URL object so that we don't need to convert url + // from UTF8String to String to URL for every row. + @transient private lazy val cachedUrl = stringExprs(0) match { +case Literal(url: UTF8String, _) => getUrl(url) +case _ => null + } + + // If the key is a constant, cache the Pattern object so that we don't need to convert key + // from UTF8String to String to StringBuilder to String to Pattern for every row. 
+ @transient private lazy val cachedPattern = stringExprs(2) match { +case Literal(key: UTF8String, _) => getPattern(key) +case _ => null + } + + private lazy val stringExprs = children.toArray + import ParseUrl._ + + override def checkInputDataTypes(): TypeCheckResult = { +if (children.size > 3 || children.size < 2) { + TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two or three arguments") +} else { + super[ImplicitCastInputTypes].checkInputDataTypes() +} + } + + private def getPattern(key: UTF8String): Pattern = { +if (key != null) { + Pattern.compile(REGEXPREFIX + key.toString + REGEXSUBFIX) +} else { + null +} + } + + private def getUrl(url: UTF8String): URL = { +try { + new URL(url.toString) +} catch { + case e: MalformedURLException => null +} + } + + private def extractValueFromQuery(query: UTF8String, pattern: Pattern): UTF8String = { +val m = pattern.matcher(query.toString) +if (m.find()) { + UTF8String.fromString(m.group(2)) +} else { + null +} + } + + private def extractFromUrl(url: URL, partToExtract: UTF8String): UTF8String = { +if (partToExtract.equals(HOST)) { + UTF8String.fromString(url.getHost) +} else if (partToExtract.equals(PATH)) { + UTF8String.fromString(url.getPath) +} else if (partToExtract.equals(QUERY)) { + UTF8String.fromString(url.getQuery) +} else if (partToExtract.equals(REF)) { + UTF8String.fromString(url.getRef) +} else if (partToExtract.equals(PROTOCOL)) { + UTF8String.fromString(url.getProtocol) +} else if (partToExtract.equals(FILE)) { + UTF8String.fromString(url.getFile) +} else if (partToExtract.equals(AUTHORITY)) { + UTF8String.fromString(url.getAuthority) +} else if (partToExtract.equals(USERINFO)) { + UTF8String.fromString(url.getUserInfo) +} else { + null --- End diff -- It depends on how many users will call `parse_url` in this way. Personally I think using literal `part` is more natural. But we need more thoughts here, cc @liancheng @clockfly --- If your project is set up for it, you can reply to this email and have your reply
[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...
Github user janplus commented on a diff in the pull request: https://github.com/apache/spark/pull/14008#discussion_r69894466 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -652,6 +654,160 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression) override def prettyName: String = "rpad" } +object ParseUrl { + private val HOST = UTF8String.fromString("HOST") + private val PATH = UTF8String.fromString("PATH") + private val QUERY = UTF8String.fromString("QUERY") + private val REF = UTF8String.fromString("REF") + private val PROTOCOL = UTF8String.fromString("PROTOCOL") + private val FILE = UTF8String.fromString("FILE") + private val AUTHORITY = UTF8String.fromString("AUTHORITY") + private val USERINFO = UTF8String.fromString("USERINFO") + private val REGEXPREFIX = "(&|^)" + private val REGEXSUBFIX = "=([^&]*)" +} + +/** + * Extracts a part from a URL + */ +@ExpressionDescription( + usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL", + extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO. +Key specifies which query to extract. +Examples: + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST') + 'spark.apache.org' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY') + 'query=1' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query') + '1'""") +case class ParseUrl(children: Seq[Expression]) + extends Expression with ImplicitCastInputTypes with CodegenFallback { + + override def nullable: Boolean = true + override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType) + override def dataType: DataType = StringType + override def prettyName: String = "parse_url" + + // If the url is a constant, cache the URL object so that we don't need to convert url + // from UTF8String to String to URL for every row. + @transient private lazy val cachedUrl = stringExprs(0) match { +case Literal(url: UTF8String, _) => getUrl(url) +case _ => null + } + + // If the key is a constant, cache the Pattern object so that we don't need to convert key + // from UTF8String to String to StringBuilder to String to Pattern for every row. 
+ @transient private lazy val cachedPattern = stringExprs(2) match { +case Literal(key: UTF8String, _) => getPattern(key) +case _ => null + } + + private lazy val stringExprs = children.toArray + import ParseUrl._ + + override def checkInputDataTypes(): TypeCheckResult = { +if (children.size > 3 || children.size < 2) { + TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two or three arguments") +} else { + super[ImplicitCastInputTypes].checkInputDataTypes() +} + } + + private def getPattern(key: UTF8String): Pattern = { +if (key != null) { + Pattern.compile(REGEXPREFIX + key.toString + REGEXSUBFIX) +} else { + null +} + } + + private def getUrl(url: UTF8String): URL = { +try { + new URL(url.toString) +} catch { + case e: MalformedURLException => null +} + } + + private def extractValueFromQuery(query: UTF8String, pattern: Pattern): UTF8String = { +val m = pattern.matcher(query.toString) +if (m.find()) { + UTF8String.fromString(m.group(2)) +} else { + null +} + } + + private def extractFromUrl(url: URL, partToExtract: UTF8String): UTF8String = { +if (partToExtract.equals(HOST)) { + UTF8String.fromString(url.getHost) +} else if (partToExtract.equals(PATH)) { + UTF8String.fromString(url.getPath) +} else if (partToExtract.equals(QUERY)) { + UTF8String.fromString(url.getQuery) +} else if (partToExtract.equals(REF)) { + UTF8String.fromString(url.getRef) +} else if (partToExtract.equals(PROTOCOL)) { + UTF8String.fromString(url.getProtocol) +} else if (partToExtract.equals(FILE)) { + UTF8String.fromString(url.getFile) +} else if (partToExtract.equals(AUTHORITY)) { + UTF8String.fromString(url.getAuthority) +} else if (partToExtract.equals(USERINFO)) { + UTF8String.fromString(url.getUserInfo) +} else { + null --- End diff -- I've though this again, row level check seems inevitable. Since we can not limit `part` to be a `Literal`. eg. `select parse_url(url, part) from url_data`. Thus, throw `AnalysisException` here seems not suitable. How do you think? --- If your project is set up
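A simplified sketch of the behaviour being argued for: because `partToExtract` can come from a column rather than a literal, validity can only be decided per row, so an unrecognised part yields null instead of an analysis-time error. This illustrates the idea only and is not the PR's actual `extractFromUrl` implementation.

```scala
import java.net.URL
import org.apache.spark.unsafe.types.UTF8String

object ParseUrlSketch {
  // Per-row dispatch: a bad `part` value in some row simply produces null for
  // that row; we cannot reject it up front when `part` is a column reference.
  def extractPart(url: URL, part: UTF8String): UTF8String = part.toString match {
    case "HOST"  => UTF8String.fromString(url.getHost)
    case "PATH"  => UTF8String.fromString(url.getPath)
    case "QUERY" => UTF8String.fromString(url.getQuery)
    case _       => null
  }
}
```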
[GitHub] spark issue #13620: [SPARK-15590] [WEBUI] Paginate Job Table in Jobs tab
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13620 **[Test build #61910 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61910/consoleFull)** for PR 13620 at commit [`649bb19`](https://github.com/apache/spark/commit/649bb195ac0eaf6cee4a84dd8ff1198900e8789a). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14008#discussion_r69893880 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -652,6 +654,160 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression) override def prettyName: String = "rpad" } +object ParseUrl { + private val HOST = UTF8String.fromString("HOST") + private val PATH = UTF8String.fromString("PATH") + private val QUERY = UTF8String.fromString("QUERY") + private val REF = UTF8String.fromString("REF") + private val PROTOCOL = UTF8String.fromString("PROTOCOL") + private val FILE = UTF8String.fromString("FILE") + private val AUTHORITY = UTF8String.fromString("AUTHORITY") + private val USERINFO = UTF8String.fromString("USERINFO") + private val REGEXPREFIX = "(&|^)" + private val REGEXSUBFIX = "=([^&]*)" +} + +/** + * Extracts a part from a URL + */ +@ExpressionDescription( + usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL", + extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO. +Key specifies which query to extract. +Examples: + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST') + 'spark.apache.org' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY') + 'query=1' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query') + '1'""") +case class ParseUrl(children: Seq[Expression]) + extends Expression with ImplicitCastInputTypes with CodegenFallback { + + override def nullable: Boolean = true + override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType) + override def dataType: DataType = StringType + override def prettyName: String = "parse_url" + + // If the url is a constant, cache the URL object so that we don't need to convert url + // from UTF8String to String to URL for every row. + @transient private lazy val cachedUrl = stringExprs(0) match { +case Literal(url: UTF8String, _) => getUrl(url) +case _ => null + } + + // If the key is a constant, cache the Pattern object so that we don't need to convert key + // from UTF8String to String to StringBuilder to String to Pattern for every row. 
+ @transient private lazy val cachedPattern = stringExprs(2) match { +case Literal(key: UTF8String, _) => getPattern(key) +case _ => null + } + + private lazy val stringExprs = children.toArray + import ParseUrl._ + + override def checkInputDataTypes(): TypeCheckResult = { +if (children.size > 3 || children.size < 2) { + TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two or three arguments") +} else { + super[ImplicitCastInputTypes].checkInputDataTypes() +} + } + + private def getPattern(key: UTF8String): Pattern = { +if (key != null) { + Pattern.compile(REGEXPREFIX + key.toString + REGEXSUBFIX) +} else { + null +} + } + + private def getUrl(url: UTF8String): URL = { +try { + new URL(url.toString) +} catch { + case e: MalformedURLException => null +} + } + + private def extractValueFromQuery(query: UTF8String, pattern: Pattern): UTF8String = { +val m = pattern.matcher(query.toString) +if (m.find()) { + UTF8String.fromString(m.group(2)) +} else { + null +} + } + + private def extractFromUrl(url: URL, partToExtract: UTF8String): UTF8String = { +if (partToExtract.equals(HOST)) { + UTF8String.fromString(url.getHost) +} else if (partToExtract.equals(PATH)) { + UTF8String.fromString(url.getPath) +} else if (partToExtract.equals(QUERY)) { + UTF8String.fromString(url.getQuery) +} else if (partToExtract.equals(REF)) { + UTF8String.fromString(url.getRef) +} else if (partToExtract.equals(PROTOCOL)) { + UTF8String.fromString(url.getProtocol) +} else if (partToExtract.equals(FILE)) { + UTF8String.fromString(url.getFile) +} else if (partToExtract.equals(AUTHORITY)) { + UTF8String.fromString(url.getAuthority) +} else if (partToExtract.equals(USERINFO)) { + UTF8String.fromString(url.getUserInfo) +} else { + null --- End diff -- yup. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r69893800 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -310,12 +342,38 @@ private[spark] class TaskSchedulerImpl( } } -// Randomly shuffle offers to avoid always placing tasks on the same set of workers. -val shuffledOffers = Random.shuffle(offers) +// ensure that we periodically check if executors can be removed from the blacklist, without +// requiring a separate thread and added synchronization overhead +blacklistTracker.expireExecutorsInBlacklist() + +val sortedTaskSets = rootPool.getSortedTaskSetQueue +val filteredOffers: IndexedSeq[WorkerOffer] = offers.filter { offer => + !blacklistTracker.isNodeBlacklisted(offer.host) && +!blacklistTracker.isExecutorBlacklisted(offer.executorId) +} match { +// toIndexedSeq always makes an *immutable* IndexedSeq, though we don't care if its mutable +// or immutable. So we do this to avoid making a pointless copy + case is: IndexedSeq[WorkerOffer] => is + case other: Seq[WorkerOffer] => other.toIndexedSeq +} --- End diff -- this business about `IndexedSeq[WorkerOffer]` vs `Seq[WorkerOffer]` is also unrelated to blacklisting, but I ran into it accidentally while doing some performance tests. While `resourceOffer` accepts a `Seq`, it really ought to be an `IndexedSeq` given how its used internally (eg. given 500 offers, there is a 5x performance difference in the scheduler). I made this change just because its more locally contained ... alternatively we could change the method signature and the callsites appropriately. It *happens* to be an IndexedSeq in the [important callsite in CoarseGrainedSchedulerBackend](https://github.com/apache/spark/blob/a04cab8f17fcac05f86d2c472558ab98923f91e3/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L217), but that is more by chance than design (a call to `.toSeq` just happens to return an `IndexedSeq` for the particular types used) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
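A toy illustration of why the static type matters here: positional access is constant time on an `IndexedSeq` such as `Vector` but linear on a `List`, so index-based loops over the offers degrade badly when the scheduler happens to receive a `List`-backed `Seq`. Names and sizes below are illustrative only.

```scala
// Same loop, very different cost depending on the runtime collection:
// xs(i) is O(1) on Vector/ArrayBuffer but O(i) on List, which makes the whole
// loop quadratic for List-backed Seqs.
def sumByIndex(xs: Seq[Int]): Int = {
  var total = 0
  var i = 0
  while (i < xs.length) {
    total += xs(i)
    i += 1
  }
  total
}

val asList: Seq[Int] = List.tabulate(500)(identity)             // linear indexing
val asVector: IndexedSeq[Int] = Vector.tabulate(500)(identity)  // constant-time indexing

sumByIndex(asList)    // works, but traverses ~i elements per access
sumByIndex(asVector)  // effectively one step per access
```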
[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14008#discussion_r69893790 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -652,6 +654,145 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression) override def prettyName: String = "rpad" } +object ParseUrl { + private val HOST = UTF8String.fromString("HOST") + private val PATH = UTF8String.fromString("PATH") + private val QUERY = UTF8String.fromString("QUERY") + private val REF = UTF8String.fromString("REF") + private val PROTOCOL = UTF8String.fromString("PROTOCOL") + private val FILE = UTF8String.fromString("FILE") + private val AUTHORITY = UTF8String.fromString("AUTHORITY") + private val USERINFO = UTF8String.fromString("USERINFO") + private val REGEXPREFIX = "(&|^)" + private val REGEXSUBFIX = "=([^&]*)" +} + +/** + * Extracts a part from a URL + */ +@ExpressionDescription( + usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL", + extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO. +Key specifies which query to extract. +Examples: + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST') + 'spark.apache.org' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY') + 'query=1' + > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query') + '1'""") +case class ParseUrl(children: Seq[Expression]) + extends Expression with ImplicitCastInputTypes with CodegenFallback { + + override def nullable: Boolean = true + override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType) + override def dataType: DataType = StringType + override def prettyName: String = "parse_url" + + // If the url is a constant, cache the URL object so that we don't need to convert url + // from UTF8String to String to URL for every row. + @transient private lazy val cachedUrl = children(0) match { +case Literal(url: UTF8String, _) => getUrl(url) +case _ => null + } + + // If the key is a constant, cache the Pattern object so that we don't need to convert key + // from UTF8String to String to StringBuilder to String to Pattern for every row. + @transient private lazy val cachedPattern = children(2) match { +case Literal(key: UTF8String, _) => getPattern(key) --- End diff -- can you also try scala 2.10? It's also an officially supported scala version for spark. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14004 and one comment for the old thread: https://github.com/apache/spark/pull/14004#discussion_r69893328 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14004#discussion_r69893328 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala --- @@ -198,6 +203,66 @@ case class StringSplit(str: Expression, pattern: Expression) override def prettyName: String = "split" } +/** + * Splits a string into arrays of sentences, where each sentence is an array of words. + * The 'lang' and 'country' arguments are optional, and if omitted, the default locale is used. + */ +@ExpressionDescription( + usage = "_FUNC_(str, lang, country) - Splits str into an array of array of words.", + extended = "> SELECT _FUNC_('Hi there! Good morning.');\n [['Hi','there'], ['Good','morning']]") +case class Sentences( +str: Expression, +language: Expression = Literal(""), +country: Expression = Literal("")) + extends Expression with ImplicitCastInputTypes with CodegenFallback { + + def this(str: Expression) = this(str, Literal(""), Literal("")) + def this(str: Expression, language: Expression) = this(str, language, Literal("")) + + override def nullable: Boolean = true + override def dataType: DataType = +ArrayType(ArrayType(StringType, containsNull = false), containsNull = false) + override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, StringType) + override def children: Seq[Expression] = str :: language :: country :: Nil + + override def eval(input: InternalRow): Any = { +val string = str.eval(input) +if (string == null) { + null +} else { + val locale = try { +new Locale(language.eval(input).asInstanceOf[UTF8String].toString, + country.eval(input).asInstanceOf[UTF8String].toString) + } catch { +case _: NullPointerException | _: ClassCastException => Locale.getDefault --- End diff -- what do you mean by `ignored`? returning null or returning the default locale? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14004#discussion_r69893166 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala --- @@ -198,6 +203,67 @@ case class StringSplit(str: Expression, pattern: Expression) override def prettyName: String = "split" } +/** + * Splits a string into arrays of sentences, where each sentence is an array of words. + * The 'lang' and 'country' arguments are optional, and if omitted, the default locale is used. + */ +@ExpressionDescription( + usage = "_FUNC_(str, lang, country) - Splits str into an array of array of words.", + extended = "> SELECT _FUNC_('Hi there! Good morning.');\n [['Hi','there'], ['Good','morning']]") +case class Sentences( +str: Expression, +language: Expression = Literal(""), +country: Expression = Literal("")) + extends Expression with ImplicitCastInputTypes with CodegenFallback { + + def this(str: Expression) = this(str, Literal(""), Literal("")) + def this(str: Expression, language: Expression) = this(str, language, Literal("")) + + override def nullable: Boolean = true + override def dataType: DataType = +ArrayType(ArrayType(StringType, containsNull = false), containsNull = false) + override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, StringType) + override def children: Seq[Expression] = str :: language :: country :: Nil + + override def eval(input: InternalRow): Any = { +val string = str.eval(input) +if (string == null) { + null +} else { + var locale = Locale.getDefault + val lang = language.eval(input) + val coun = country.eval(input) + if (lang != null && coun != null) { --- End diff -- I'd like to write: ``` val languageStr = language.eval(input).asInstanceOf[UTF8String] val countryStr = country.eval(input).asInstanceOf[UTF8String] val locale = if (languageStr != null && countryStr != null) { new Locale(languageStr, countryStr) } else { Locale.getDefault } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
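For reference, here is a standalone sketch of the null-check style of locale resolution suggested in the comment above, outside the Expression framework. Note that `java.util.Locale` takes `String` arguments, so the `UTF8String` values coming out of `eval` would need a `.toString` before being passed in; the object and method names here are illustrative only.

```scala
import java.util.Locale
import org.apache.spark.unsafe.types.UTF8String

object LocaleResolutionSketch {
  // Fall back to the default locale when either argument evaluates to null,
  // otherwise build a Locale from the two strings.
  def resolveLocale(languageStr: UTF8String, countryStr: UTF8String): Locale =
    if (languageStr != null && countryStr != null) {
      new Locale(languageStr.toString, countryStr.toString)
    } else {
      Locale.getDefault
    }
}
```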
[GitHub] spark pull request #13876: [SPARK-16174][SQL] Improve `OptimizeIn` optimizer...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13876 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13876: [SPARK-16174][SQL] Improve `OptimizeIn` optimizer to rem...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13876 thanks, merging to master! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14089: [SPARK-16415][SQL] fix catalog string error
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14089 **[Test build #61909 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61909/consoleFull)** for PR 14089 at commit [`eb10181`](https://github.com/apache/spark/commit/eb1018108a879a06701c3dff539ef8d10ab2b118). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13894: [SPARK-15254][DOC] Improve ML pipeline Cross Vali...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/13894#discussion_r69892610 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -56,7 +56,10 @@ private[ml] trait CrossValidatorParams extends ValidatorParams { /** * :: Experimental :: - * K-fold cross validation. + * CrossValidator begins by splitting the dataset into a set of non-overlapping randomly + * partitioned folds as separate training and test datasets e.g., with k=3 folds, --- End diff -- I think we can bring back the "folds, which are used as ..." part --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
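For readers following the doc-wording discussion, a minimal usage sketch of the class being documented, as it might appear in a spark-shell session; the estimator, evaluator, and grid values are arbitrary placeholders, not part of the PR.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

// With k=3 folds, each fold is used once as the test dataset while the
// remaining 2/3 of the data is used for training.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

// val cvModel = cv.fit(trainingData)  // trainingData: DataFrame, not shown here
```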
[GitHub] spark pull request #14089: [SPARK-16415][SQL] fix catalog string error
GitHub user adrian-wang opened a pull request: https://github.com/apache/spark/pull/14089 [SPARK-16415][SQL] fix catalog string error ## What changes were proposed in this pull request? In #13537 we truncate `simpleString` if it is a long `StructType`. But sometimes we need `catalogString` to reconstruct `TypeInfo`, for example in the description of [SPARK-16415](https://issues.apache.org/jira/browse/SPARK-16415). So we need to keep the implementation of `catalogString` unaffected by the truncation. ## How was this patch tested? Added a test case. You can merge this pull request into a Git repository by running: $ git pull https://github.com/adrian-wang/spark catalogstring Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14089.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14089 commit dc81b385a68c842715b3377f15f7b3009e45f0ce Author: Daoyuan Wang Date: 2016-07-06T03:18:07Z fix catalog string commit eb1018108a879a06701c3dff539ef8d10ab2b118 Author: Daoyuan Wang Date: 2016-07-07T11:41:39Z add a unit test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
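To illustrate the distinction the PR description draws, a small spark-shell sketch under the assumption that `simpleString` may be truncated for long struct types (exact truncation behaviour depends on the Spark version and configuration) while `catalogString` must always spell out every field:

```scala
import org.apache.spark.sql.types._

// Build a wide struct so the rendered type string gets long.
val wide = StructType((1 to 100).map(i => StructField(s"col$i", IntegerType)))

// simpleString may elide fields for readability once the type is long enough,
// while catalogString is expected to list every field so that a Hive TypeInfo
// can be reconstructed from it.
println(wide.simpleString)
println(wide.catalogString)
```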
[GitHub] spark issue #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get user conf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14088 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14088: Fix bugs for "Can not get user config when callin...
GitHub user sharkdtu opened a pull request: https://github.com/apache/spark/pull/14088 Fix bugs for "Can not get user config when calling SparkHadoopUtil.get.conf in other places" ## What changes were proposed in this pull request? Fix bugs for "Can not get user config when calling SparkHadoopUtil.get.conf in other places". The `SparkHadoopUtil` singleton was instantiated before `ApplicationMaster`, so the `sparkConf` and `conf` in the `SparkHadoopUtil` singleton didn't include the user's configuration. But other places, such as `DataSourceStrategy`, use `hadoopConf` in `SparkHadoopUtil`: ```scala ... case PhysicalOperation(projects, filters, l @ LogicalRelation(t: HadoopFsRelation, _)) => // See buildPartitionedTableScan for the reason that we need to create a shard // broadcast HadoopConf. val sharedHadoopConf = SparkHadoopUtil.get.conf val confBroadcast = t.sqlContext.sparkContext.broadcast(new SerializableConfiguration(sharedHadoopConf)) ... ``` ## How was this patch tested? Existing test cases are used. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sharkdtu/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14088.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14088 commit 55e66b21cdcd68861db0f1045186048c54b13153 Author: sharkdtu Date: 2016-07-07T11:04:11Z Fix bugs for "Can not get user config when calling SparkHadoopUtil.get.conf in other places, such as DataSourceStrategy" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
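The propagation being discussed can be sketched in a local-mode snippet, assuming the usual behaviour that `spark.hadoop.*` entries in the user's SparkConf end up in the SparkContext's Hadoop configuration; the property name below is an arbitrary example, not something from the PR.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// User config carrying a spark.hadoop.* entry.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("hadoop-conf-propagation")
  .set("spark.hadoop.my.custom.key", "my-value")

val sc = new SparkContext(conf)

// Built from the user's SparkConf: the spark.hadoop. prefix is stripped and
// the value is visible in the Hadoop Configuration.
println(sc.hadoopConfiguration.get("my.custom.key"))  // expected: my-value

// The bug described above is that on YARN the SparkHadoopUtil singleton (and
// therefore SparkHadoopUtil.get.conf, the instance DataSourceStrategy reads)
// can be created before these user settings are applied, so the same lookup
// there may come back null.
```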
[GitHub] spark issue #14087: [SPARK-16411][SQL][STREAMING] Add textFile to Structured...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14087 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61908/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14087: [SPARK-16411][SQL][STREAMING] Add textFile to Structured...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14087 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14087: [SPARK-16411][SQL][STREAMING] Add textFile to Structured...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14087 **[Test build #61908 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61908/consoleFull)** for PR 14087 at commit [`ac82232`](https://github.com/apache/spark/commit/ac822323f35122b99c6aa4d9fce5874160266909). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14086: [SPARK-16410][SQL] Support `truncate` option in Overwrit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14086 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61907/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14086: [SPARK-16410][SQL] Support `truncate` option in Overwrit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14086 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14086: [SPARK-16410][SQL] Support `truncate` option in Overwrit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14086 **[Test build #61907 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61907/consoleFull)** for PR 14086 at commit [`c1e4c41`](https://github.com/apache/spark/commit/c1e4c411c04458622a09c010feb8a8a5204f89c1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14016: [SPARK-16399] [PYSPARK] Force PYSPARK_PYTHON to p...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14016 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14016: [SPARK-16399] [PYSPARK] Force PYSPARK_PYTHON to python
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14016 Merged to master --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14049: [SPARK-16369][MLlib] tallSkinnyQR of RowMatrix should aw...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14049 @yinxusen if you resolve the conflicts I'll merge. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14051: [SPARK-16372][MLlib] Retag RDD to tallSkinnyQR of...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14051 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14051: [SPARK-16372][MLlib] Retag RDD to tallSkinnyQR of RowMat...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14051 Merged to master/2.0/1.6. I think it's a reasonably important bug fix. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13494: [SPARK-15752] [SQL] Optimize metadata only query that ha...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13494 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61906/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13494: [SPARK-15752] [SQL] Optimize metadata only query that ha...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13494 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13494: [SPARK-15752] [SQL] Optimize metadata only query that ha...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13494 **[Test build #61906 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61906/consoleFull)** for PR 13494 at commit [`67211be`](https://github.com/apache/spark/commit/67211beb80c4d84fb70c6037cc53044f86f094d5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14076: [SPARK-16400][SQL] Remove InSet filter pushdown f...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14076 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14076: [SPARK-16400][SQL] Remove InSet filter pushdown from Par...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14076 LGTM. Merging to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of Python-o...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13778 ping @cloud-fan @vlad17 Anything else? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of Python-o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13778 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61903/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of Python-o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13778 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13778: [SPARK-16062][SPARK-15989][SQL] Fix two bugs of Python-o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13778 **[Test build #61903 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61903/consoleFull)** for PR 13778 at commit [`87a0953`](https://github.com/apache/spark/commit/87a0953ec36d6beacb4665a94da834d0a4615baa). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61904/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #61904 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61904/consoleFull)** for PR 13701 at commit [`687d75b`](https://github.com/apache/spark/commit/687d75b2e12d45107600037955e8afca63128094). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org