date:20160313

[GitHub] spark pull request: [SPARK-13658][SQL] BooleanSimplification rule ...

2016-03-13 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/11647#discussion_r55958696
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -598,3 +601,61 @@ abstract class TernaryExpression extends Expression {
 }
   }
 }
+
+/**
+ * Rewrites an expression using rules that are guaranteed preserve the result while attempting
+ * to remove cosmetic variations. Deterministic expressions that are `equal` after canonicalization
+ * will always return the same answer given the same input (i.e. false positives should not be
+ * possible). However, it is possible that two canonical expressions that are not equal will in fact
+ * return the same answer given any input (i.e. false negatives are possible).
+ *
+ * The following rules are applied:
+ *  - Names and nullability hints for [[org.apache.spark.sql.types.DataType]]s are stripped.
+ *  - Commutative and associative operations ([[Add]] and [[Multiply]]) have their children ordered
+ *    by `hashCode`.
+*   - [[EqualTo]] and [[EqualNullSafe]] are reordered by `hashCode`.
+ *  - Other comparisons ([[GreaterThan]], [[LessThan]]) are reversed by `hashCode`.
+ */
+object Canonicalize extends {
--- End diff --

fwiw, if you are keeping this class, i think we should just have it in its 
own file. The Expression.scala file is getting pretty long.
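
For readers following along, here is a minimal sketch of the `hashCode`-based ordering that the quoted doc comment describes. It uses toy case classes rather than Catalyst's expression classes, and every name in it (`Expr`, `Attr`, `Lit`, `CanonicalizeSketch`) is an illustrative assumption, not code from the pull request.

```scala
// Toy expression tree; these case classes are stand-ins, not Catalyst's.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class GreaterThan(left: Expr, right: Expr) extends Expr
case class LessThan(left: Expr, right: Expr) extends Expr

object CanonicalizeSketch {
  // Collect all operands of a nested chain of Adds (relies on associativity).
  private def flattenAdd(e: Expr): Seq[Expr] = e match {
    case Add(l, r) => flattenAdd(l) ++ flattenAdd(r)
    case other => Seq(other)
  }

  def canonicalize(e: Expr): Expr = e match {
    case a: Add =>
      // Order operands by hashCode (relies on commutativity), then rebuild a left-deep tree.
      flattenAdd(a)
        .map(canonicalize)
        .sortBy(_.hashCode())
        .reduceLeft[Expr]((l, r) => Add(l, r))
    case GreaterThan(l, r) =>
      val (cl, cr) = (canonicalize(l), canonicalize(r))
      // Flip `a > b` into `b < a` when the operands are out of hash order.
      if (cl.hashCode() > cr.hashCode()) LessThan(cr, cl) else GreaterThan(cl, cr)
    case LessThan(l, r) =>
      val (cl, cr) = (canonicalize(l), canonicalize(r))
      if (cl.hashCode() > cr.hashCode()) GreaterThan(cr, cl) else LessThan(cl, cr)
    case other => other
  }
}
```

With this sketch, `a + (1 + b)` and `(b + a) + 1` canonicalize to the same tree, as do `a > b` and `b < a`. If two distinct operands happen to share a hash code, their order is left as found, which can only produce a false negative.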


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-12893][YARN] Fix history URL redirect e...

2016-03-13 Thread jerryshao

Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/10821#issuecomment-196154044

Hi @vanzin, thanks a lot for your response. I just checked branch-1.6, and it
looks like the behavior (the attempt id) has actually changed; the change was
introduced in #9182.

Originally, `attemptId` was read from `spark.yarn.app.attemptId`, which is
set in `ApplicationMaster`.

In `ApplicationMaster`, `attemptId` is obtained via
`appAttemptId.getAttemptId().toString()`, so the attemptId is "1", "2", and so on.

But this behavior has changed in the master branch, which uses the full
`attemptId` rather than the attempt counter
[here](https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClusterSchedulerBackend.scala#L34).

So this affects not only the file name of the event log but also the URL of
each application in the history server.

If we accept the new form of `attemptId`, then this URL should be updated to
match. Conversely, if we treat the new form of `attemptId` as a regression,
then there is no issue here, and all we need to change is to revert to the
original `attemptId`.

What's your opinion, @vanzin?
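
To make the difference concrete, here is a small sketch contrasting the attempt counter with the full attempt id string. The string layout follows what YARN's `ApplicationAttemptId.toString` prints; `AttemptIdSketch`, `historyPath`, and the sample values are hypothetical and only illustrate how the two forms would end up in a history server path.

```scala
object AttemptIdSketch {
  // Layout printed by YARN's ApplicationAttemptId.toString:
  // appattempt_<clusterTimestamp>_<4-digit app sequence>_<6-digit attempt counter>
  def fullAttemptId(clusterTimestamp: Long, appSeq: Int, attempt: Int): String =
    f"appattempt_${clusterTimestamp}_$appSeq%04d_$attempt%06d"

  // The short form is just the counter, as in appAttemptId.getAttemptId().toString().
  def attemptCounter(attempt: Int): String = attempt.toString

  // Hypothetical helper: path of one application attempt in the history server.
  def historyPath(appId: String, attemptId: String): String =
    s"/history/$appId/$attemptId"
}
```

For the first attempt of a sample application, the counter form yields a path ending in `/1`, while the full form yields a path ending in `/appattempt_1457000000000_0001_000001`, so a link built with one form does not resolve against logs named with the other.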


[GitHub] spark pull request: [SPARK-13848][SPARK-5185] Update to Py4J 0.9.2...

2016-03-13 Thread JoshRosen

Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/11687#discussion_r55957415
  
--- Diff: python/pyspark/streaming/flume.py ---
@@ -111,13 +111,9 @@ def func(event):
 @staticmethod
 def _get_helper(sc):
 try:
-helperClass = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader() \
-.loadClass("org.apache.spark.streaming.flume.FlumeUtilsPythonHelper")
-return helperClass.newInstance()
-except Py4JJavaError as e:
-# TODO: use --jar once it also work on driver
-if 'ClassNotFoundException' in str(e.java_exception):
--- End diff --

I made this change because the call now fails with a different set of 
exceptions (such as "attempting to call a package") and wanted to err on the 
side of over-displaying the warning message. Let me try to figure out a 
narrower exception pattern match.



[GitHub] spark pull request: Branch 1.6

2016-03-13 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/11668



[GitHub] spark pull request: [SPARK-XX] [SQL] fast serialization for collec...

2016-03-13 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/11664#discussion_r55956646
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala ---
@@ -220,7 +222,61 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ
* Runs this query returning the result as an array.
*/
   def executeCollect(): Array[InternalRow] = {
-execute().map(_.copy()).collect()
+// Packing the UnsafeRows into byte array for faster serialization.
+// The byte arrays are in the following format:
+// [size] [bytes of UnsafeRow] [size] [bytes of UnsafeRow] ... [-1]
+val byteArrayRdd = execute().mapPartitionsInternal { iter =>
+  new Iterator[Array[Byte]] {
--- End diff --

i also find this more understandable if you just write it imperatively 
within the map partitions; something like

```scala
execute().mapPartitionsInternal { iter =>
  while (iter.hasNext) {
    // write each row to a buffer
  }
  Iterator(buffer)
}
```
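
A slightly fuller sketch of the imperative shape suggested above, runnable without Spark. It assumes rows are plain `Array[Byte]` values standing in for `UnsafeRow`, and it writes one buffer per partition instead of the 1 MB chunks in the diff; `PackRowsSketch` and `pack` are made-up names.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

object PackRowsSketch {
  // Packs rows as [size] [bytes] [size] [bytes] ... [-1] into a single array.
  def pack(rows: Iterator[Array[Byte]]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val data = new DataOutputStream(out)
    while (rows.hasNext) {
      val row = rows.next()
      data.writeInt(row.length) // 4-byte big-endian size prefix
      data.write(row)           // the row's bytes
    }
    data.writeInt(-1)           // ending mark
    data.flush()
    out.toByteArray
  }
}
```

Inside `mapPartitionsInternal`, the body would then reduce to something like `Iterator(pack(rowBytes))`, where `rowBytes` is whatever byte view of the rows is available.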



[GitHub] spark pull request: [SPARK-XX] [SQL] fast serialization for collec...

2016-03-13 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/11664#discussion_r55956582
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala ---
@@ -220,7 +222,61 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ
* Runs this query returning the result as an array.
*/
   def executeCollect(): Array[InternalRow] = {
-execute().map(_.copy()).collect()
+// Packing the UnsafeRows into byte array for faster serialization.
+// The byte arrays are in the following format:
+// [size] [bytes of UnsafeRow] [size] [bytes of UnsafeRow] ... [-1]
+val byteArrayRdd = execute().mapPartitionsInternal { iter =>
+  new Iterator[Array[Byte]] {
+private var row: UnsafeRow = _
+override def hasNext: Boolean = row != null || iter.hasNext
+override def next: Array[Byte] = {
--- End diff --

next() rather than next
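
For context, Scala's `Iterator` declares `next()` with parentheses, and the usual style convention is that parameterless methods with side effects keep the parentheses while pure accessors drop them. A small illustration, where `CountDown` is a made-up class and not part of the pull request:

```scala
// A counter-backed iterator: `next()` mutates state, `hasNext` does not.
class CountDown(start: Int) extends Iterator[Int] {
  private var remaining = start

  // Pure query, so it is declared without parentheses.
  override def hasNext: Boolean = remaining > 0

  // Advances the iterator, so it is declared with parentheses.
  override def next(): Int = {
    val current = remaining
    remaining -= 1
    current
  }
}
```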



[GitHub] spark pull request: [SPARK-XX] [SQL] fast serialization for collec...

2016-03-13 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/11664#discussion_r55956568
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala ---
@@ -220,7 +222,61 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ
* Runs this query returning the result as an array.
*/
   def executeCollect(): Array[InternalRow] = {
-execute().map(_.copy()).collect()
+// Packing the UnsafeRows into byte array for faster serialization.
+// The byte arrays are in the following format:
+// [size] [bytes of UnsafeRow] [size] [bytes of UnsafeRow] ... [-1]
+val byteArrayRdd = execute().mapPartitionsInternal { iter =>
+  new Iterator[Array[Byte]] {
+private var row: UnsafeRow = _
+override def hasNext: Boolean = row != null || iter.hasNext
+override def next: Array[Byte] = {
+  var cap = 1 << 20  // 1 MB
+  if (row != null) {
+// the buffered row could be larger than default buffer size
+cap = Math.max(cap, 4 + row.getSizeInBytes + 4) // reverse 4 bytes for ending mark (-1).
+  }
+  val buffer = ByteBuffer.allocate(cap)
+  if (row != null) {
+buffer.putInt(row.getSizeInBytes)
+row.writeTo(buffer)
+row = null
+  }
+  while (iter.hasNext) {
+row = iter.next().asInstanceOf[UnsafeRow]
--- End diff --

are we always taking UnsafeRow now?




[GitHub] spark pull request: [SPARK-XX] [SQL] fast serialization for collec...

2016-03-13 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/11664#discussion_r55956440
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala ---
@@ -220,7 +222,61 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ
* Runs this query returning the result as an array.
*/
   def executeCollect(): Array[InternalRow] = {
-execute().map(_.copy()).collect()
+// Packing the UnsafeRows into byte array for faster serialization.
+// The byte arrays are in the following format:
+// [size] [bytes of UnsafeRow] [size] [bytes of UnsafeRow] ... [-1]
+val byteArrayRdd = execute().mapPartitionsInternal { iter =>
+  new Iterator[Array[Byte]] {
+private var row: UnsafeRow = _
+override def hasNext: Boolean = row != null || iter.hasNext
+override def next: Array[Byte] = {
+  var cap = 1 << 20  // 1 MB
+  if (row != null) {
+// the buffered row could be larger than default buffer size
+cap = Math.max(cap, 4 + row.getSizeInBytes + 4) // reverse 4 bytes for ending mark (-1).
+  }
+  val buffer = ByteBuffer.allocate(cap)
+  if (row != null) {
+buffer.putInt(row.getSizeInBytes)
+row.writeTo(buffer)
+row = null
+  }
+  while (iter.hasNext) {
+row = iter.next().asInstanceOf[UnsafeRow]
+// Reserve last 4 bytes for ending mark
+if (4 + row.getSizeInBytes + 4 <= buffer.remaining()) {
+  buffer.putInt(row.getSizeInBytes)
+  row.writeTo(buffer)
+  row = null
+} else {
+  buffer.putInt(-1)
+  return buffer.array()
+}
+  }
+  buffer.putInt(-1)
+  // copy the used bytes to make it smaller
+  val bytes = new Array[Byte](buffer.limit())
+  System.arraycopy(buffer.array(), 0, bytes, 0, buffer.limit())
+  bytes
+}
+  }
+}
+// Collect the byte arrays back to driver, then decode them as UnsafeRows.
+val nFields = schema.length
+byteArrayRdd.collect().flatMap { bytes =>
--- End diff --

i think this block would be more readable if we just write it imperatively, 
e.g.

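```scala
val results = new ArrayBuffer[InternalRow]

byteArrayRdd.collect().foreach { bytes =>
  // wrap the packed array so sizes and positions can be read from it
  val buffer = ByteBuffer.wrap(bytes)
  var sizeOfNextRow = buffer.getInt()
  while (sizeOfNextRow >= 0) {
    val row = new UnsafeRow(nFields)
    row.pointTo(buffer.array(), Platform.BYTE_ARRAY_OFFSET + buffer.position(), sizeOfNextRow)
    buffer.position(buffer.position() + sizeOfNextRow)
    results += row
    // read the next size prefix, or the -1 ending mark
    sizeOfNextRow = buffer.getInt()
  }
}
results.toArray
```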
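
The same decoding loop can be tried outside Spark with the sketch below. It assumes the `[size] [bytes] ... [-1]` layout from the diff comment and copies each row into a plain `Array[Byte]` instead of pointing an `UnsafeRow` at the shared buffer; `UnpackRowsSketch` and `unpack` are made-up names.

```scala
import java.nio.ByteBuffer
import scala.collection.mutable.ArrayBuffer

object UnpackRowsSketch {
  // Decodes chunks laid out as [size] [bytes] [size] [bytes] ... [-1].
  def unpack(chunks: Seq[Array[Byte]]): Array[Array[Byte]] = {
    val results = new ArrayBuffer[Array[Byte]]
    chunks.foreach { bytes =>
      val buffer = ByteBuffer.wrap(bytes)
      var sizeOfNextRow = buffer.getInt()
      while (sizeOfNextRow >= 0) {
        val row = new Array[Byte](sizeOfNextRow)
        buffer.get(row)                 // copies the row and advances the position
        results += row
        sizeOfNextRow = buffer.getInt() // next size prefix, or the -1 ending mark
      }
    }
    results.toArray
  }
}
```

The important detail is that the size is re-read at the end of every iteration; without that read the loop never reaches the ending mark.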


