[jira] [Commented] (SPARK-3033) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal

2014-08-22 Thread pengyanhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106557#comment-14106557
 ] 

pengyanhong commented on SPARK-3033:


I changed the file {quote}
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala
{quote}
and changed the method eval in the class HiveGenericUdf as below:
{quote}
while (i < children.length) {
  val idx = i
  deferedObjects(i).asInstanceOf[DeferredObjectAdapter].set(() => {
    children(idx).eval(input)
  })
  if (deferedObjects(i).get().isInstanceOf[java.math.BigDecimal]) {
    val decimal = deferedObjects(i).get().asInstanceOf[java.math.BigDecimal]
    val data = new org.apache.hadoop.hive.common.`type`.HiveDecimal(decimal).asInstanceOf[EvaluatedType]
    deferedObjects(i).asInstanceOf[DeferredObjectAdapter].set(() => {
      data.asInstanceOf[EvaluatedType]
    })
  }
  i += 1
}
{quote}
I also changed the method wrap in the trait HiveInspectors, adding this line:
{quote}
case b: org.apache.hadoop.hive.common.`type`.HiveDecimal => b
{quote}

So this issue has been fixed.
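For context, a minimal sketch of where that added case would sit, assuming the match in HiveInspectors.wrap resembles the Spark 1.0 code (the surrounding cases below are illustrative placeholders, not the exact source):
{code}
// Hypothetical excerpt of HiveInspectors.wrap; only the HiveDecimal case
// is the change described above, the other cases are placeholders.
def wrap(a: Any): AnyRef = a match {
  case s: String => s
  case i: java.lang.Integer => i
  // Pass a HiveDecimal through unchanged instead of letting a
  // java.math.BigDecimal leak into Hive's object inspectors.
  case b: org.apache.hadoop.hive.common.`type`.HiveDecimal => b
  case other => other.asInstanceOf[AnyRef]
}
{code}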


 [Hive] java.math.BigDecimal cannot be cast to 
 org.apache.hadoop.hive.common.type.HiveDecimal
 

 Key: SPARK-3033
 URL: https://issues.apache.org/jira/browse/SPARK-3033
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.0.2
Reporter: pengyanhong
Priority: Blocker

 Running a complex HiveQL query via yarn-cluster produced the error below:
 {quote}
 14/08/14 15:05:24 WARN 
 org.apache.spark.Logging$class.logWarning(Logging.scala:70): Loss was due to 
 java.lang.ClassCastException
 java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
 org.apache.hadoop.hive.common.type.HiveDecimal
   at 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:51)
   at 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:1022)
   at 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:306)
   at 
 org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ReturnObjectInspectorResolver.convertIfNecessary(GenericUDFUtils.java:179)
   at 
 org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.evaluate(GenericUDFIf.java:82)
   at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:276)
   at 
 org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
   at 
 org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:62)
   at 
 org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:51)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:309)
   at 
 org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:303)
   at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
   at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
   at org.apache.spark.scheduler.Task.run(Task.scala:51)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {quote}






[jira] [Comment Edited] (SPARK-3033) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal

2014-08-22 Thread pengyanhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106557#comment-14106557
 ] 

pengyanhong edited comment on SPARK-3033 at 8/22/14 6:54 AM:
-

I changed the file {quote}
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala
{quote}
and changed the method eval in the class HiveGenericUdf as below:
{code}
while (i < children.length) {
  val idx = i
  deferedObjects(i).asInstanceOf[DeferredObjectAdapter].set(() => {
    children(idx).eval(input)
  })
  if (deferedObjects(i).get().isInstanceOf[java.math.BigDecimal]) {
    val decimal = deferedObjects(i).get().asInstanceOf[java.math.BigDecimal]
    val data = new org.apache.hadoop.hive.common.`type`.HiveDecimal(decimal).asInstanceOf[EvaluatedType]
    deferedObjects(i).asInstanceOf[DeferredObjectAdapter].set(() => {
      data.asInstanceOf[EvaluatedType]
    })
  }
  i += 1
}
{code}
I also changed the method wrap in the trait HiveInspectors, adding this line:
{code}
case b: org.apache.hadoop.hive.common.`type`.HiveDecimal => b
{code}

So this issue has been fixed.



was (Author: pengyanhong):
I changed the file {quote}
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala
{quote}
and changed the method eval in the class HiveGenericUdf as below:
{quote}
while (i < children.length) {
  val idx = i
  deferedObjects(i).asInstanceOf[DeferredObjectAdapter].set(() => {
    children(idx).eval(input)
  })
  if (deferedObjects(i).get().isInstanceOf[java.math.BigDecimal]) {
    val decimal = deferedObjects(i).get().asInstanceOf[java.math.BigDecimal]
    val data = new org.apache.hadoop.hive.common.`type`.HiveDecimal(decimal).asInstanceOf[EvaluatedType]
    deferedObjects(i).asInstanceOf[DeferredObjectAdapter].set(() => {
      data.asInstanceOf[EvaluatedType]
    })
  }
  i += 1
}
{quote}
I also changed the method wrap in the trait HiveInspectors, adding this line:
{quote}
case b: org.apache.hadoop.hive.common.`type`.HiveDecimal => b
{quote}

So this issue has been fixed.


 [Hive] java.math.BigDecimal cannot be cast to 
 org.apache.hadoop.hive.common.type.HiveDecimal
 

 Key: SPARK-3033
 URL: https://issues.apache.org/jira/browse/SPARK-3033
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.0.2
Reporter: pengyanhong
Priority: Blocker

 Running a complex HiveQL query via yarn-cluster produced the error below:
 {quote}
 14/08/14 15:05:24 WARN 
 org.apache.spark.Logging$class.logWarning(Logging.scala:70): Loss was due to 
 java.lang.ClassCastException
 java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
 org.apache.hadoop.hive.common.type.HiveDecimal
   at 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:51)
   at 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:1022)
   at 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:306)
   at 
 org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ReturnObjectInspectorResolver.convertIfNecessary(GenericUDFUtils.java:179)
   at 
 org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.evaluate(GenericUDFIf.java:82)
   at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:276)
   at 
 org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
   at 
 org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:62)
   at 
 org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:51)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:309)
   at 
 org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:303)
   at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
   at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at 

[jira] [Comment Edited] (SPARK-3033) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal

2014-08-22 Thread pengyanhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106557#comment-14106557
 ] 

pengyanhong edited comment on SPARK-3033 at 8/22/14 6:58 AM:
-

I changed the file {quote}
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala
{quote}
and changed the method eval in the class HiveGenericUdf as below:
{code}
while (i < children.length) {
  val idx = i
  deferedObjects(i).asInstanceOf[DeferredObjectAdapter].set(() => {
    children(idx).eval(input)
  })
  if (deferedObjects(i).get().isInstanceOf[java.math.BigDecimal]) {
    val decimal = deferedObjects(i).get().asInstanceOf[java.math.BigDecimal]
    deferedObjects(i).asInstanceOf[DeferredObjectAdapter].set(() => {
      new org.apache.hadoop.hive.common.`type`.HiveDecimal(decimal).asInstanceOf[EvaluatedType]
    })
  }
  i += 1
}
{code}
I also changed the method wrap in the trait HiveInspectors, adding this line:
{code}
case b: org.apache.hadoop.hive.common.`type`.HiveDecimal => b
{code}

So this issue has been fixed.



was (Author: pengyanhong):
I changed the file {quote}
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala
{quote}
and changed the method eval in the class HiveGenericUdf as below:
{code}
while (i < children.length) {
  val idx = i
  deferedObjects(i).asInstanceOf[DeferredObjectAdapter].set(() => {
    children(idx).eval(input)
  })
  if (deferedObjects(i).get().isInstanceOf[java.math.BigDecimal]) {
    val decimal = deferedObjects(i).get().asInstanceOf[java.math.BigDecimal]
    val data = new org.apache.hadoop.hive.common.`type`.HiveDecimal(decimal).asInstanceOf[EvaluatedType]
    deferedObjects(i).asInstanceOf[DeferredObjectAdapter].set(() => {
      data.asInstanceOf[EvaluatedType]
    })
  }
  i += 1
}
{code}
I also changed the method wrap in the trait HiveInspectors, adding this line:
{code}
case b: org.apache.hadoop.hive.common.`type`.HiveDecimal => b
{code}

So this issue has been fixed.


 [Hive] java.math.BigDecimal cannot be cast to 
 org.apache.hadoop.hive.common.type.HiveDecimal
 

 Key: SPARK-3033
 URL: https://issues.apache.org/jira/browse/SPARK-3033
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.0.2
Reporter: pengyanhong
Priority: Blocker

 Running a complex HiveQL query via yarn-cluster produced the error below:
 {quote}
 14/08/14 15:05:24 WARN 
 org.apache.spark.Logging$class.logWarning(Logging.scala:70): Loss was due to 
 java.lang.ClassCastException
 java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
 org.apache.hadoop.hive.common.type.HiveDecimal
   at 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:51)
   at 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:1022)
   at 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:306)
   at 
 org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ReturnObjectInspectorResolver.convertIfNecessary(GenericUDFUtils.java:179)
   at 
 org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.evaluate(GenericUDFIf.java:82)
   at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:276)
   at 
 org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
   at 
 org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:62)
   at 
 org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:51)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:309)
   at 
 org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:303)
   at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
   at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at 

[jira] [Created] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2014-08-22 Thread Fan Jiang (JIRA)
Fan Jiang created SPARK-3181:


 Summary: Add Robust Regression Algorithm with Huber Estimator
 Key: SPARK-3181
 URL: https://issues.apache.org/jira/browse/SPARK-3181
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
 Fix For: 1.1.1, 1.2.0


Linear least squares estimates assume the errors are normally distributed and can
behave badly when the errors are heavy-tailed. In practice we encounter many
types of data, so we need to include robust regression to employ a fitting
criterion that is not as vulnerable as least squares.

In 1973, Huber introduced M-estimation for regression (the "M" stands for
maximum-likelihood type). The method is resistant to outliers in the response
variable and has been widely used.

The new feature for MLlib will contain 3 new files:
/main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
/test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
/main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala

and one new class HuberRobustGradient in 
/main/scala/org/apache/spark/mllib/optimization/Gradient.scala
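As a rough illustration of the estimator itself (a sketch only, not the PR's code; k = 1.345 is the conventional 95%-efficiency tuning constant and is an assumption here):
{code}
// Huber loss: quadratic for small residuals, linear for large ones,
// which bounds the influence of heavy-tailed errors on the fit.
def huberLoss(residual: Double, k: Double = 1.345): Double =
  if (math.abs(residual) <= k) 0.5 * residual * residual
  else k * (math.abs(residual) - 0.5 * k)

// Its derivative in the residual is the least-squares gradient clipped
// to [-k, k], so a single outlier cannot dominate a gradient step.
def huberGradient(residual: Double, k: Double = 1.345): Double =
  math.max(-k, math.min(k, residual))
{code}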







[jira] [Commented] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2014-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106679#comment-14106679
 ] 

Apache Spark commented on SPARK-3181:
-

User 'fjiang6' has created a pull request for this issue:
https://github.com/apache/spark/pull/2096

 Add Robust Regression Algorithm with Huber Estimator
 

 Key: SPARK-3181
 URL: https://issues.apache.org/jira/browse/SPARK-3181
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
  Labels: features
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Linear least squares estimates assume the errors are normally distributed and
 can behave badly when the errors are heavy-tailed. In practice we encounter
 many types of data, so we need to include robust regression to employ a
 fitting criterion that is not as vulnerable as least squares.
 In 1973, Huber introduced M-estimation for regression (the "M" stands for
 maximum-likelihood type). The method is resistant to outliers in the
 response variable and has been widely used.
 The new feature for MLlib will contain 3 new files:
 /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
 /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
 /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
 and one new class HuberRobustGradient in 
 /main/scala/org/apache/spark/mllib/optimization/Gradient.scala






[jira] [Created] (SPARK-3182) Twitter Streaming Geolocation Filter

2014-08-22 Thread Daniel Kershaw (JIRA)
Daniel Kershaw created SPARK-3182:
-

 Summary: Twitter Streaming Geolocation Filter
 Key: SPARK-3182
 URL: https://issues.apache.org/jira/browse/SPARK-3182
 Project: Spark
  Issue Type: Wish
  Components: Streaming
Affects Versions: 1.0.2, 1.0.0
Reporter: Daniel Kershaw
 Fix For: 1.2.0


Add a geolocation filter to the Twitter Streaming component.

This should take a sequence of doubles to indicate the bounding box for the
stream.
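For illustration, a minimal sketch of the predicate such a filter might apply, assuming the bounding box is passed as Seq(minLon, minLat, maxLon, maxLat) (the coordinate order is an assumption, not a settled API):
{code}
// Hypothetical bounding-box test for a tweet's coordinates; the bbox layout
// is assumed to be Seq(minLon, minLat, maxLon, maxLat).
def inBoundingBox(lon: Double, lat: Double, bbox: Seq[Double]): Boolean = {
  require(bbox.length == 4, "bounding box needs exactly four coordinates")
  val Seq(minLon, minLat, maxLon, maxLat) = bbox
  lon >= minLon && lon <= maxLon && lat >= minLat && lat <= maxLat
}
{code}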






[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs

2014-08-22 Thread Hingorani, Vineet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106851#comment-14106851
 ] 

Hingorani, Vineet commented on SPARK-2360:
--

Hello Michael,

I saw your comment thread on a mail archive about being able to manipulate CSV
files using Spark. Could you please let me know whether this functionality is
available in the latest release of Spark? I have installed the latest version
and am running it on my local machine.

Thank you

Regards,

Vineet Hingorani
Developer Associate
Custom Development & Strategic Projects group (CDSP)
Products & Innovation (PI)
SAP SE
WDF 03, C3.03
E vineet.hingor...@sap.com



 CSV import to SchemaRDDs
 

 Key: SPARK-2360
 URL: https://issues.apache.org/jira/browse/SPARK-2360
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Assignee: Hossein Falaki

 I think the first step is to design the interface that we want to present to
 users. Mostly this is defining options when importing. Off the top of my
 head:
 - What is the separator?
 - Provide column names or infer them from the first row.
 - how to handle multiple files with possibly different schemas
 - do we have a method to let users specify the datatypes of the columns or 
 are they just strings?
 - what types of quoting / escaping do we want to support?






[jira] [Resolved] (SPARK-2742) The variables inputFormatInfo and inputFormatMap are never used

2014-08-22 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-2742.
--

   Resolution: Fixed
Fix Version/s: 1.2.0

 The variables inputFormatInfo and inputFormatMap are never used
 --

 Key: SPARK-2742
 URL: https://issues.apache.org/jira/browse/SPARK-2742
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: meiyoula
Priority: Minor
 Fix For: 1.2.0


 The ClientArguments class has two variables that are never used: one is
 inputFormatInfo, the other is inputFormatMap.






[jira] [Commented] (SPARK-2742) The variables inputFormatInfo and inputFormatMap are never used

2014-08-22 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106997#comment-14106997
 ] 

Thomas Graves commented on SPARK-2742:
--

https://github.com/apache/spark/pull/1614

 The variables inputFormatInfo and inputFormatMap are never used
 --

 Key: SPARK-2742
 URL: https://issues.apache.org/jira/browse/SPARK-2742
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: meiyoula
Priority: Minor
 Fix For: 1.2.0


 The ClientArguments class has two variables that are never used: one is
 inputFormatInfo, the other is inputFormatMap.






[jira] [Created] (SPARK-3183) Add option for requesting full YARN cluster

2014-08-22 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3183:
-

 Summary: Add option for requesting full YARN cluster
 Key: SPARK-3183
 URL: https://issues.apache.org/jira/browse/SPARK-3183
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Sandy Ryza


This could possibly be in the form of --executor-cores ALL --executor-memory 
ALL --num-executors ALL.






[jira] [Created] (SPARK-3184) Allow user to specify num tasks to use for a table

2014-08-22 Thread Andy Konwinski (JIRA)
Andy Konwinski created SPARK-3184:
-

 Summary: Allow user to specify num tasks to use for a table
 Key: SPARK-3184
 URL: https://issues.apache.org/jira/browse/SPARK-3184
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Andy Konwinski









[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-08-22 Thread Jonathan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107131#comment-14107131
 ] 

Jonathan Kelly commented on SPARK-1981:
---

The code here is a cleaned-up version of the code from that article, which Chris
took over from Parviz and integrated into Spark itself; it will be available
when Spark 1.1 is released.

 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly
Assignee: Chris Fregly
 Fix For: 1.1.0


 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occurred here: https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.






[jira] [Updated] (SPARK-2921) Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other things)

2014-08-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2921:
---

Priority: Blocker  (was: Critical)

 Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other 
 things)
 ---

 Key: SPARK-2921
 URL: https://issues.apache.org/jira/browse/SPARK-2921
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.2
Reporter: Andrew Or
Priority: Blocker
 Fix For: 1.1.0


 The code path to handle this exists only for coarse-grained mode, and even in
 this mode the Java options aren't passed to the executors properly. We
 currently pass the entire value of spark.executor.extraJavaOptions to the
 executors as a single string without splitting it. We need to use
 Utils.splitCommandString as in standalone mode.
 I have not confirmed this, but I would assume spark.executor.extraClassPath 
 and spark.executor.extraLibraryPath are also not propagated correctly in 
 either mode.
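To make the failure mode concrete, a small sketch of the difference (the option value below is made up; Utils.splitCommandString is the Spark-internal, private[spark] helper the description refers to, shown here only for illustration):
{code}
// Assumed example value for spark.executor.extraJavaOptions:
val opts = "-XX:+UseConcMarkSweepGC -Dfoo=bar"

// Broken: the executor command line receives one argument,
// "-XX:+UseConcMarkSweepGC -Dfoo=bar", which the JVM rejects.
val asOneArg: Seq[String] = Seq(opts)

// Intended: split into separate JVM arguments first, as standalone mode does.
val asArgs: Seq[String] = org.apache.spark.util.Utils.splitCommandString(opts)
// asArgs == Seq("-XX:+UseConcMarkSweepGC", "-Dfoo=bar")
{code}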






[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs

2014-08-22 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107155#comment-14107155
 ] 

Hossein Falaki commented on SPARK-2360:
---

There is a pull request for this issue: 
https://github.com/apache/spark/pull/1351

It did not make it into Spark 1.1 due to last-minute API changes, but it will
make it into the next release. The API will provide a very easy (default) way
of reading common CSV (e.g., comma-delimited) files into SchemaRDDs. Users will
be able to specify the delimiter and quotation characters.
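As a purely hypothetical illustration (the API was still changing, so the method name and parameters below are assumptions, not the merged interface):
{code}
// Hypothetical usage; csvFile, delimiter and quote are illustrative names,
// and sqlContext is assumed to be an existing SQLContext.
val people = sqlContext.csvFile("people.csv", delimiter = ",", quote = '"')
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 30")
{code}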

 CSV import to SchemaRDDs
 

 Key: SPARK-2360
 URL: https://issues.apache.org/jira/browse/SPARK-2360
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Assignee: Hossein Falaki

 I think the first step it to design the interface that we want to present to 
 users.  Mostly this is defining options when importing.  Off the top of my 
 head:
 - What is the separator?
 - Provide column names or infer them from the first row.
 - how to handle multiple files with possibly different schemas
 - do we have a method to let users specify the datatypes of the columns or 
 are they just strings?
 - what types of quoting / escaping do we want to support?






[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs

2014-08-22 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107233#comment-14107233
 ] 

Erik Erlandson commented on SPARK-2360:
---

It appears that this is not a pure lazy transform, as it invokes {{first()}}
when inferring schema from headers.
I wrote up some ideas on this, pertaining to SPARK-2315, here:
http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/


 CSV import to SchemaRDDs
 

 Key: SPARK-2360
 URL: https://issues.apache.org/jira/browse/SPARK-2360
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Assignee: Hossein Falaki

 I think the first step it to design the interface that we want to present to 
 users.  Mostly this is defining options when importing.  Off the top of my 
 head:
 - What is the separator?
 - Provide column names or infer them from the first row.
 - how to handle multiple files with possibly different schemas
 - do we have a method to let users specify the datatypes of the columns or 
 are they just strings?
 - what types of quoting / escaping do we want to support?






[jira] [Commented] (SPARK-3176) Implement 'POWER', 'ABS' and 'LAST' for SQL

2014-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107237#comment-14107237
 ] 

Apache Spark commented on SPARK-3176:
-

User 'xinyunh' has created a pull request for this issue:
https://github.com/apache/spark/pull/2099

 Implement 'POWER', 'ABS' and 'LAST' for SQL
 --

 Key: SPARK-3176
 URL: https://issues.apache.org/jira/browse/SPARK-3176
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
 Environment: All
Reporter: Xinyun Huang
Priority: Minor
 Fix For: 1.2.0

   Original Estimate: 3h
  Remaining Estimate: 3h

 Add support for the mathematical functions POWER and ABS, and the analytic
 function LAST, to return a subset of the rows satisfying a query within
 Spark SQL.






[jira] [Created] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER

2014-08-22 Thread Jeremy Chambers (JIRA)
Jeremy Chambers created SPARK-3185:
--

 Summary: SPARK launch on Hadoop 2 in EC2 throws Tachyon exception 
when Formatting JOURNAL_FOLDER
 Key: SPARK-3185
 URL: https://issues.apache.org/jira/browse/SPARK-3185
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
 Environment: Amazon Linux AMI

[ec2-user@ip-172-30-1-145 ~]$ uname -a
Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 
UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/

The build I used (and MD5 verified):
[ec2-user@ip-172-30-1-145 ~]$ wget 
http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz


Reporter: Jeremy Chambers


org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate 
with client version 4

When I launch Spark 1.0.2 on Hadoop 2 in a new EC2 cluster, the above Tachyon
exception is thrown when formatting JOURNAL_FOLDER.

No exception occurs when I launch on Hadoop 1.

Launch used:
./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk 
--zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch 
sparkProd

---log snippet---
Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
Exception in thread "main" java.lang.RuntimeException:
org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate 
with client version 4
at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73)
at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
at tachyon.Format.main(Format.java:54)
Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at 
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238)
at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69)
... 3 more
Killed 0 processes
Killed 0 processes
ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
---end snippet---

*** I don't have this problem when I launch without the 
--hadoop-major-version=2 (which defaults to Hadoop 1.x)







[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER

2014-08-22 Thread Jeremy Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107300#comment-14107300
 ] 

Jeremy Chambers commented on SPARK-3185:


Cross reference: 
http://apache-spark-user-list.1001560.n3.nabble.com/Server-IPC-version-7-cannot-communicate-with-client-version-4-with-Spark-Streaming-1-0-0-in-Java-ande-tp9908p9914.html
 

 SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting 
 JOURNAL_FOLDER
 ---

 Key: SPARK-3185
 URL: https://issues.apache.org/jira/browse/SPARK-3185
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
 Environment: Amazon Linux AMI
 [ec2-user@ip-172-30-1-145 ~]$ uname -a
 Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 
 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
 https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/
 The build I used (and MD5 verified):
 [ec2-user@ip-172-30-1-145 ~]$ wget 
 http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
Reporter: Jeremy Chambers

 org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 When I launch Spark 1.0.2 on Hadoop 2 in a new EC2 cluster, the above Tachyon
 exception is thrown when formatting JOURNAL_FOLDER.
 No exception occurs when I launch on Hadoop 1.
 Launch used:
 ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk 
 --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch 
 sparkProd
 ---log snippet---
 Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
 Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
 Exception in thread "main" java.lang.RuntimeException:
 org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
 at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73)
 at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
 at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
 at tachyon.Format.main(Format.java:54)
 Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 at org.apache.hadoop.ipc.Client.call(Client.java:1070)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
 at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
 at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
 at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69)
 ... 3 more
 Killed 0 processes
 Killed 0 processes
 ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
 ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
 ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
 ---end snippet---
 *** I don't have this problem when I launch without the 
 --hadoop-major-version=2 (which defaults to Hadoop 1.x)






[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER

2014-08-22 Thread Jeremy Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107388#comment-14107388
 ] 

Jeremy Chambers commented on SPARK-3185:


Working on rebuilding client with Hadoop 2.

 SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting 
 JOURNAL_FOLDER
 ---

 Key: SPARK-3185
 URL: https://issues.apache.org/jira/browse/SPARK-3185
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
 Environment: Amazon Linux AMI
 [ec2-user@ip-172-30-1-145 ~]$ uname -a
 Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 
 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
 https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/
 The build I used (and MD5 verified):
 [ec2-user@ip-172-30-1-145 ~]$ wget 
 http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
Reporter: Jeremy Chambers

 org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 When I launch Spark 1.0.2 on Hadoop 2 in a new EC2 cluster, the above Tachyon
 exception is thrown when formatting JOURNAL_FOLDER.
 No exception occurs when I launch on Hadoop 1.
 Launch used:
 ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk 
 --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch 
 sparkProd
 ---log snippet---
 Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
 Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
 Exception in thread "main" java.lang.RuntimeException:
 org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
 at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73)
 at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
 at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
 at tachyon.Format.main(Format.java:54)
 Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 at org.apache.hadoop.ipc.Client.call(Client.java:1070)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
 at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
 at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
 at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69)
 ... 3 more
 Killed 0 processes
 Killed 0 processes
 ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
 ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
 ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
 ---end snippet---
 *** I don't have this problem when I launch without the 
 --hadoop-major-version=2 (which defaults to Hadoop 1.x)






[jira] [Commented] (SPARK-3107) Don't pass null jar to executor in yarn-client mode

2014-08-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107476#comment-14107476
 ] 

Marcelo Vanzin commented on SPARK-3107:
---

The {{--jar}} issue is fixed in SPARK-2933.

Where do you see the sys properties issue? Setting a system property to an 
empty value is semantically different from not setting it, although I'm 
sceptical it would make a difference here.
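A quick sketch of the difference (plain Scala, using a made-up property key):
{code}
// Unset: lookup yields None, so callers can fall back to a default.
System.clearProperty("spark.test.option")
assert(sys.props.get("spark.test.option") == None)

// Set to empty: the key now exists with value "", which defeats any
// "is it set?" check even though no real value was provided.
System.setProperty("spark.test.option", "")
assert(sys.props.get("spark.test.option") == Some(""))
{code}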

 Don't pass null jar to executor in yarn-client mode
 ---

 Key: SPARK-3107
 URL: https://issues.apache.org/jira/browse/SPARK-3107
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Andrew Or

 In the following line, ExecutorLauncher's `--jar` takes in null.
 {code}
 14/08/18 20:52:43 INFO yarn.Client:   command: $JAVA_HOME/bin/java -server 
 -Xmx512m ... org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' 
 --jar null  --arg  'ip-172-31-0-12.us-west-2.compute.internal:56838' 
 --executor-memory 1024 --executor-cores 1 --num-executors  2
 {code}
 Also it appears that we set a bunch of system properties to empty strings 
 (not shown). We should avoid setting these if they don't actually contain 
 values.






[jira] [Commented] (SPARK-3101) Missing volatile annotation in ApplicationMaster

2014-08-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107478#comment-14107478
 ] 

Marcelo Vanzin commented on SPARK-3101:
---

I covered this in the PR for SPARK-2933 also.

 Missing volatile annotation in ApplicationMaster
 

 Key: SPARK-3101
 URL: https://issues.apache.org/jira/browse/SPARK-3101
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 In ApplicationMaster, the field variable 'isLastAMRetry' is used as a flag, but
 it's not declared volatile even though it's used from multiple threads.
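The fix is essentially a one-annotation change; a minimal sketch of the pattern (the enclosing object is illustrative, the field name is from the description):
{code}
object AMState {
  // @volatile guarantees that a write from one thread (e.g. the AM's main
  // loop) is visible to readers on other threads (e.g. a shutdown hook).
  @volatile var isLastAMRetry: Boolean = true
}

// Writer thread:
AMState.isLastAMRetry = false
// A reader thread is now guaranteed to observe the update:
if (AMState.isLastAMRetry) { /* clean up staging directory, etc. */ }
{code}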






[jira] [Commented] (SPARK-3102) Add tests for yarn-client mode

2014-08-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107480#comment-14107480
 ] 

Marcelo Vanzin commented on SPARK-3102:
---

SPARK-2778?

 Add tests for yarn-client mode
 --

 Key: SPARK-3102
 URL: https://issues.apache.org/jira/browse/SPARK-3102
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Josh Rosen

 It looks like some of the {{yarn-client}} code paths aren't exercised by any 
 of the existing tests because my pull request was able to introduce a bug 
 that wasn't caught by Jenkins: 
 https://github.com/apache/spark/pull/2002#discussion-diff-16331781
 We should eventually add tests for this.






[jira] [Commented] (SPARK-3099) Staging Directory is never deleted when we run job with YARN Client Mode

2014-08-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107481#comment-14107481
 ] 

Marcelo Vanzin commented on SPARK-3099:
---

Pretty sure I covered this in the PR for SPARK-2933.

 Staging Directory is never deleted when we run job with YARN Client Mode
 

 Key: SPARK-3099
 URL: https://issues.apache.org/jira/browse/SPARK-3099
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 When we run an application in YARN cluster mode, the class 'ApplicationMaster'
 is used as the ApplicationMaster, which has a shutdown hook to clean up the
 staging directory (~/.sparkStaging).
 But when we run an application in YARN client mode, the class
 'ExecutorLauncher' acting as the ApplicationMaster doesn't clean up the
 staging directory.






[jira] [Commented] (SPARK-3090) Avoid not stopping SparkContext with YARN Client mode

2014-08-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107488#comment-14107488
 ] 

Marcelo Vanzin commented on SPARK-3090:
---

I think that if we want to add this, it would be better to do so for all modes, 
not just yarn-client. Basically have SparkContext itself register a shutdown 
hook to shut itself down, and publish the priority of the hook so that apps / 
backends can register hooks that run before it (see Hadoop's 
ShutdownHookManager for the priority thing - http://goo.gl/BQ1bjk).

That way the code in the yarn-cluster backend can be removed too.
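A sketch of that shape using Hadoop's ShutdownHookManager (the priority constant is a made-up value for illustration):
{code}
import org.apache.hadoop.util.ShutdownHookManager
import org.apache.spark.SparkContext

// Made-up priority constant; the idea is that hooks registered with a
// higher priority run first, so apps/backends can clean up before the stop.
val SPARK_CONTEXT_SHUTDOWN_PRIORITY = 50

def registerStopHook(sc: SparkContext): Unit =
  ShutdownHookManager.get().addShutdownHook(new Runnable {
    override def run(): Unit = sc.stop()
  }, SPARK_CONTEXT_SHUTDOWN_PRIORITY)
{code}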

  Avoid not stopping SparkContext with YARN Client mode
 --

 Key: SPARK-3090
 URL: https://issues.apache.org/jira/browse/SPARK-3090
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 When we use YARN cluster mode, the ApplicationMaster registers a shutdown hook
 that stops the SparkContext.
 Thanks to this, the SparkContext can stop even if the application forgets to
 stop it itself.
 Unfortunately, YARN client mode doesn't have such a mechanism.






[jira] [Commented] (SPARK-2140) yarn stable client doesn't properly handle MEMORY_OVERHEAD for AM

2014-08-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107499#comment-14107499
 ] 

Marcelo Vanzin commented on SPARK-2140:
---

Same as SPARK-1287?

 yarn stable client doesn't properly handle MEMORY_OVERHEAD for AM
 -

 Key: SPARK-2140
 URL: https://issues.apache.org/jira/browse/SPARK-2140
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves
 Fix For: 1.0.1, 1.1.0


 The YARN stable client doesn't properly remove the MEMORY_OVERHEAD amount
 from the Java heap size; the code to handle that is commented out (see the
 function calculateAMMemory). We should fix this.
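The intended arithmetic, sketched with illustrative numbers (the names and the 384 MB default are assumptions):
{code}
// YARN should allocate heap plus off-heap headroom, while the JVM heap
// flag must cover only the heap part; folding the overhead into -Xmx is
// exactly the bug described above.
val amMemory = 1024        // MB requested for the AM heap (assumed)
val memoryOverhead = 384   // MB of off-heap headroom (assumed default)

val containerMemory = amMemory + memoryOverhead  // what YARN allocates
val javaOpts = s"-Xmx${amMemory}m"               // what the JVM heap gets
{code}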






[jira] [Created] (SPARK-3186) Enable parallelism for Reduce Side Join [Spark Branch]

2014-08-22 Thread Szehon Ho (JIRA)
Szehon Ho created SPARK-3186:


 Summary: Enable parallelism for Reduce Side Join [Spark Branch] 
 Key: SPARK-3186
 URL: https://issues.apache.org/jira/browse/SPARK-3186
 Project: Spark
  Issue Type: Bug
Reporter: Szehon Ho


Blocked by SPARK-2978.  See parent JIRA for design details.






[jira] [Updated] (SPARK-3186) Enable parallelism for Reduce Side Join [Spark Branch]

2014-08-22 Thread Szehon Ho (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho updated SPARK-3186:
-

Description: (was: Blocked by SPARK-2978.  See parent JIRA for design 
details.)

 Enable parallelism for Reduce Side Join [Spark Branch] 
 ---

 Key: SPARK-3186
 URL: https://issues.apache.org/jira/browse/SPARK-3186
 Project: Spark
  Issue Type: Bug
Reporter: Szehon Ho








[jira] [Resolved] (SPARK-3186) Enable parallelism for Reduce Side Join [Spark Branch]

2014-08-22 Thread Szehon Ho (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho resolved SPARK-3186.
--

Resolution: Invalid

Sorry, please ignore; I meant to file this in the Hive project.

 Enable parallelism for Reduce Side Join [Spark Branch] 
 ---

 Key: SPARK-3186
 URL: https://issues.apache.org/jira/browse/SPARK-3186
 Project: Spark
  Issue Type: Bug
Reporter: Szehon Ho

 Blocked by SPARK-2978.  See parent JIRA for design details.






[jira] [Closed] (SPARK-3186) Enable parallelism for Reduce Side Join [Spark Branch]

2014-08-22 Thread Szehon Ho (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho closed SPARK-3186.



 Enable parallelism for Reduce Side Join [Spark Branch] 
 ---

 Key: SPARK-3186
 URL: https://issues.apache.org/jira/browse/SPARK-3186
 Project: Spark
  Issue Type: Bug
Reporter: Szehon Ho








[jira] [Created] (SPARK-3187) Refactor and cleanup Yarn allocator code

2014-08-22 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-3187:
-

 Summary: Refactor and cleanup Yarn allocator code
 Key: SPARK-3187
 URL: https://issues.apache.org/jira/browse/SPARK-3187
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Marcelo Vanzin
Priority: Minor


This is a follow-up to SPARK-2933, which dealt with the ApplicationMaster code.

There's a lot of logic in the container allocation code in alpha/stable that 
could probably be merged.






[jira] [Updated] (SPARK-3102) Add tests for yarn-client mode

2014-08-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3102:
--

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-2778

 Add tests for yarn-client mode
 --

 Key: SPARK-3102
 URL: https://issues.apache.org/jira/browse/SPARK-3102
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Affects Versions: 1.1.0
Reporter: Josh Rosen

 It looks like some of the {{yarn-client}} code paths aren't exercised by any 
 of the existing tests because my pull request was able to introduce a bug 
 that wasn't caught by Jenkins: 
 https://github.com/apache/spark/pull/2002#discussion-diff-16331781
 We should eventually add tests for this.






[jira] [Commented] (SPARK-3102) Add tests for yarn-client mode

2014-08-22 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107719#comment-14107719
 ] 

Josh Rosen commented on SPARK-3102:
---

Good point; I'll convert this to a subtask of that JIRA.

 Add tests for yarn-client mode
 --

 Key: SPARK-3102
 URL: https://issues.apache.org/jira/browse/SPARK-3102
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Josh Rosen

 It looks like some of the {{yarn-client}} code paths aren't exercised by any 
 of the existing tests because my pull request was able to introduce a bug 
 that wasn't caught by Jenkins: 
 https://github.com/apache/spark/pull/2002#discussion-diff-16331781
 We should eventually add tests for this.






[jira] [Created] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)

2014-08-22 Thread Fan Jiang (JIRA)
Fan Jiang created SPARK-3188:


 Summary: Add Robust Regression Algorithm with Tukey bisquare
weight function (Biweight Estimates)
 Key: SPARK-3188
 URL: https://issues.apache.org/jira/browse/SPARK-3188
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
 Fix For: 1.1.1, 1.2.0


Linear least squares estimates assume the errors are normally distributed and
can behave badly when the errors are heavy-tailed. In practice we encounter
many types of data, so we need to include robust regression to employ a fitting
criterion that is not as vulnerable as least squares.

The Tukey bisquare weight function, also referred to as the biweight function,
produces an M-estimator that is more resistant to regression outliers than the
Huber M-estimator (Andersen 2008: 19).
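For reference, a sketch of the Tukey biweight weight function as usually defined (c = 4.685 is the conventional 95%-efficiency tuning constant; an assumption here, not taken from this issue):
{code}
// Tukey (bisquare/biweight) weight: smoothly downweights moderate
// residuals and assigns zero weight beyond c, fully rejecting gross outliers.
def tukeyBiweight(residual: Double, c: Double = 4.685): Double = {
  val u = residual / c
  if (math.abs(u) < 1.0) { val t = 1.0 - u * u; t * t } else 0.0
}
{code}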








[jira] [Closed] (SPARK-3189) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)

2014-08-22 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang closed SPARK-3189.


Resolution: Duplicate

 Add Robust Regression Algorithm with Tukey bisquare weight function
 (Biweight Estimates)
 ---

 Key: SPARK-3189
 URL: https://issues.apache.org/jira/browse/SPARK-3189
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
  Labels: features
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Linear least squares estimates assume the errors are normally distributed and
 can behave badly when the errors are heavy-tailed. In practice we encounter
 many types of data, so we need to include robust regression to employ a
 fitting criterion that is not as vulnerable as least squares.
 The Tukey bisquare weight function, also referred to as the biweight
 function, produces an M-estimator that is more resistant to regression
 outliers than the Huber M-estimator (Andersen 2008: 19).






[jira] [Created] (SPARK-3189) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)

2014-08-22 Thread Fan Jiang (JIRA)
Fan Jiang created SPARK-3189:


 Summary: Add Robust Regression Algorithm with Tukey bisquare
weight function (Biweight Estimates)
 Key: SPARK-3189
 URL: https://issues.apache.org/jira/browse/SPARK-3189
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
 Fix For: 1.1.1, 1.2.0


Linear least squares estimates assume the errors are normally distributed and
can behave badly when the errors are heavy-tailed. In practice we encounter
many types of data, so we need to include robust regression to employ a fitting
criterion that is not as vulnerable as least squares.

The Tukey bisquare weight function, also referred to as the biweight function,
produces an M-estimator that is more resistant to regression outliers than the
Huber M-estimator (Andersen 2008: 19).





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3184) Allow user to specify num tasks to use for a table

2014-08-22 Thread Andy Konwinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107748#comment-14107748
 ] 

Andy Konwinski commented on SPARK-3184:
---

[~marmbrus], did we figure out if this feature is in fact missing right now?

 Allow user to specify num tasks to use for a table
 --

 Key: SPARK-3184
 URL: https://issues.apache.org/jira/browse/SPARK-3184
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Andy Konwinski





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3190) Creation of large graph (over 2.5B nodes) seems to be broken: possible overflow somewhere

2014-08-22 Thread npanj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

npanj updated SPARK-3190:
-

Description: 
While creating a graph with 6B nodes and 12B edges, I noticed that the 
'numVertices' API returns an incorrect result; 'numEdges' reports the correct 
number. A few times (with different datasets > 2.5B nodes) I have also noticed 
that numVertices is returned as a negative number, so I suspect that there is 
some overflow (maybe we are using Int for some field?).

Here are some details of the experiments I have done so far: 
1. Input: numNodes=6101995593 ; noEdges=12163784626
   Graph returns: numVertices=1807028297 ; numEdges=12163784626

2. Input: numNodes=2157586441 ; noEdges=2747322705
   Graph returns: numVertices=-2137380855 ; numEdges=2747322705

3. Input: numNodes=1725060105 ; noEdges=204176821
   Graph returns: numVertices=1725060105 ; numEdges=2041768213

You can find the code to reproduce this bug here: 

https://gist.github.com/npanj/92e949d86d08715bf4bf
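
The reported values are consistent with a 64-bit count being truncated to a 
32-bit Int somewhere. A minimal sketch (plain Scala, not GraphX code) that 
reproduces every numVertices value above by truncation modulo 2^32:
{code}
object OverflowRepro {
  def main(args: Array[String]): Unit = {
    // The three numNodes inputs from the experiments above.
    Seq(6101995593L, 2157586441L, 1725060105L).foreach { n =>
      // Long -> Int keeps only the low 32 bits, wrapping modulo 2^32.
      println(s"numNodes=$n -> as Int: ${n.toInt}")
    }
    // Prints 1807028297, -2137380855 and 1725060105 -- exactly the
    // numVertices values reported in experiments 1-3.
  }
}
{code}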
  was:
While creating a graph with 6B nodes and 12B edges, I noticed that the 
'numVertices' API returns an incorrect result; 'numEdges' reports the correct 
number. A few times (with different datasets > 2.5B nodes) I have also noticed 
that numVertices is returned as a negative number, so I suspect that there is 
some overflow (maybe we are using Int for some field?).

Here are some details of the experiments I have done so far: 
1. Input: numNodes=6101995593 ; noEdges=12163784626
   Graph returns: numVertices=1807028297 ; numEdges=12163784626

2. Input: numNodes=2157586441 ; noEdges=2747322705
   Graph returns: numVertices=-2137380855 ; numEdges=2747322705

3. Input: numNodes=1725060105 ; noEdges=204176821
   Graph returns: numVertices=1725060105 ; numEdges=2041768213

You can find the code to reproduce this bug here: 

https://gist.github.com/npanj/92e949d86d08715bf4bf


 Creation of large graph (over 2.5B nodes) seems to be broken: possible 
 overflow somewhere
 ---

 Key: SPARK-3190
 URL: https://issues.apache.org/jira/browse/SPARK-3190
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.3
 Environment: Standalone mode running on EC2 
Reporter: npanj
Priority: Critical

 While creating a graph with 6B nodes and 12B edges, I noticed that the 
 'numVertices' API returns an incorrect result; 'numEdges' reports the 
 correct number. A few times (with different datasets > 2.5B nodes) I have 
 also noticed that numVertices is returned as a negative number, so I suspect 
 that there is some overflow (maybe we are using Int for some field?).
 Here are some details of the experiments I have done so far: 
 1. Input: numNodes=6101995593 ; noEdges=12163784626
    Graph returns: numVertices=1807028297 ; numEdges=12163784626
 2. Input: numNodes=2157586441 ; noEdges=2747322705
    Graph returns: numVertices=-2137380855 ; numEdges=2747322705
 3. Input: numNodes=1725060105 ; noEdges=204176821
    Graph returns: numVertices=1725060105 ; numEdges=2041768213
 You can find the code to reproduce this bug here: 
 https://gist.github.com/npanj/92e949d86d08715bf4bf



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3190) Creation of large graph (over 2.5B nodes) seems to be broken: possible overflow somewhere

2014-08-22 Thread npanj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

npanj updated SPARK-3190:
-

Environment: Standalone mode running on EC2. Using latest code from master 
branch up to commit db56f2df1b8027171da1b8d2571d1f2ef1e103b6.  (was: 
Standalone mode running on EC2)

 Creation of large graph (over 2.5B nodes) seems to be broken: possible 
 overflow somewhere
 ---

 Key: SPARK-3190
 URL: https://issues.apache.org/jira/browse/SPARK-3190
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.3
 Environment: Standalone mode running on EC2. Using latest code from 
 master branch up to commit db56f2df1b8027171da1b8d2571d1f2ef1e103b6.
Reporter: npanj
Priority: Critical

 While creating a graph with 6B nodes and 12B edges, I noticed that the 
 'numVertices' API returns an incorrect result; 'numEdges' reports the 
 correct number. A few times (with different datasets > 2.5B nodes) I have 
 also noticed that numVertices is returned as a negative number, so I suspect 
 that there is some overflow (maybe we are using Int for some field?).
 Here are some details of the experiments I have done so far: 
 1. Input: numNodes=6101995593 ; noEdges=12163784626
    Graph returns: numVertices=1807028297 ; numEdges=12163784626
 2. Input: numNodes=2157586441 ; noEdges=2747322705
    Graph returns: numVertices=-2137380855 ; numEdges=2747322705
 3. Input: numNodes=1725060105 ; noEdges=204176821
    Graph returns: numVertices=1725060105 ; numEdges=2041768213
 You can find the code to reproduce this bug here: 
 https://gist.github.com/npanj/92e949d86d08715bf4bf



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3190) Creation of large graph (over 2.5B nodes) seems to be broken: possible overflow somewhere

2014-08-22 Thread npanj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

npanj updated SPARK-3190:
-

Description: 
While creating a graph with 6B nodes and 12B edges, I noticed that the 
'numVertices' API returns an incorrect result; 'numEdges' reports the correct 
number. A few times (with different datasets > 2.5B nodes) I have also noticed 
that numVertices is returned as a negative number, so I suspect that there is 
some overflow (maybe we are using Int for some field?).

Here are some details of the experiments I have done so far: 
1. Input: numNodes=6101995593 ; noEdges=12163784626
   Graph returns: numVertices=1807028297 ; numEdges=12163784626

2. Input: numNodes=2157586441 ; noEdges=2747322705
   Graph returns: numVertices=-2137380855 ; numEdges=2747322705

3. Input: numNodes=1725060105 ; noEdges=204176821
   Graph returns: numVertices=1725060105 ; numEdges=2041768213

You can find the code to reproduce this bug here: 

https://gist.github.com/npanj/92e949d86d08715bf4bf

Note: Nodes are labeled 1...6B.

  was:
While creating a graph with 6B nodes and 12B edges, I noticed that the 
'numVertices' API returns an incorrect result; 'numEdges' reports the correct 
number. A few times (with different datasets > 2.5B nodes) I have also noticed 
that numVertices is returned as a negative number, so I suspect that there is 
some overflow (maybe we are using Int for some field?).

Here are some details of the experiments I have done so far: 
1. Input: numNodes=6101995593 ; noEdges=12163784626
   Graph returns: numVertices=1807028297 ; numEdges=12163784626

2. Input: numNodes=2157586441 ; noEdges=2747322705
   Graph returns: numVertices=-2137380855 ; numEdges=2747322705

3. Input: numNodes=1725060105 ; noEdges=204176821
   Graph returns: numVertices=1725060105 ; numEdges=2041768213

You can find the code to reproduce this bug here: 

https://gist.github.com/npanj/92e949d86d08715bf4bf


 Creation of large graph (over 2.5B nodes) seems to be broken: possible 
 overflow somewhere
 ---

 Key: SPARK-3190
 URL: https://issues.apache.org/jira/browse/SPARK-3190
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.3
 Environment: Standalone mode running on EC2. Using latest code from 
 master branch up to commit db56f2df1b8027171da1b8d2571d1f2ef1e103b6.
Reporter: npanj
Priority: Critical

 While creating a graph with 6B nodes and 12B edges, I noticed that the 
 'numVertices' API returns an incorrect result; 'numEdges' reports the 
 correct number. A few times (with different datasets > 2.5B nodes) I have 
 also noticed that numVertices is returned as a negative number, so I suspect 
 that there is some overflow (maybe we are using Int for some field?).
 Here are some details of the experiments I have done so far: 
 1. Input: numNodes=6101995593 ; noEdges=12163784626
    Graph returns: numVertices=1807028297 ; numEdges=12163784626
 2. Input: numNodes=2157586441 ; noEdges=2747322705
    Graph returns: numVertices=-2137380855 ; numEdges=2747322705
 3. Input: numNodes=1725060105 ; noEdges=204176821
    Graph returns: numVertices=1725060105 ; numEdges=2041768213
 You can find the code to reproduce this bug here: 
 https://gist.github.com/npanj/92e949d86d08715bf4bf
 Note: Nodes are labeled 1...6B.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3169) make-distribution.sh failed

2014-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107805#comment-14107805
 ] 

Apache Spark commented on SPARK-3169:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/2101

 make-distribution.sh failed
 ---

 Key: SPARK-3169
 URL: https://issues.apache.org/jira/browse/SPARK-3169
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Guoqiang Li
Priority: Blocker

 {code}./make-distribution.sh -Pyarn -Phadoop-2.3 -Phive-thriftserver -Phive 
 -Dhadoop.version=2.3.0 
 {code}
 =>
 {noformat}
 java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
 Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference. A 
 signature in TestSuiteBase.class refers to term dstream
 in package org.apache.spark.streaming which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling 
 TestSuiteBase.class.
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3

2014-08-22 Thread Richard W. Eggert II (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107827#comment-14107827
 ] 

Richard W. Eggert II commented on SPARK-2707:
-

Upgrading to Akka 2.3 will also allow SparkContexts to be created within other 
applications that use Akka 2.3, especially Play 2.3 web applications. Akka 2.2 
and 2.3 appear to be binary incompatible, which means that Spark cannot 
currently be used within a Play 2.3 application.
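
A hypothetical build.sbt sketch of the conflict (versions are illustrative 
assumptions, not pinned recommendations): both dependencies resolve, but sbt 
evicts the older Akka in favour of 2.3.x, and since the two are not binary 
compatible, Spark's actor-based internals then fail at runtime.
{code}
libraryDependencies ++= Seq(
  // Spark 1.0.x transitively depends on Akka 2.2.x.
  "org.apache.spark"  %% "spark-core" % "1.0.2",
  // Play 2.3.x transitively depends on Akka 2.3.x, which wins eviction.
  "com.typesafe.play" %% "play"       % "2.3.3"
)
{code}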

 Upgrade to Akka 2.3
 ---

 Key: SPARK-2707
 URL: https://issues.apache.org/jira/browse/SPARK-2707
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Yardena

 Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray 
 features directly in the same project.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3169) make-distribution.sh failed

2014-08-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3169.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 2101
[https://github.com/apache/spark/pull/2101]

 make-distribution.sh failed
 ---

 Key: SPARK-3169
 URL: https://issues.apache.org/jira/browse/SPARK-3169
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Guoqiang Li
Priority: Blocker
 Fix For: 1.1.0


 {code}./make-distribution.sh -Pyarn -Phadoop-2.3 -Phive-thriftserver -Phive 
 -Dhadoop.version=2.3.0 
 {code}
  =>
 {noformat}
 java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
 Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference. A 
 signature in TestSuiteBase.class refers to term dstream
 in package org.apache.spark.streaming which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling 
 TestSuiteBase.class.
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3169) make-distribution.sh failed

2014-08-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3169:
---

Assignee: Tathagata Das

 make-distribution.sh failed
 ---

 Key: SPARK-3169
 URL: https://issues.apache.org/jira/browse/SPARK-3169
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Guoqiang Li
Assignee: Tathagata Das
Priority: Blocker
 Fix For: 1.1.0


 {code}./make-distribution.sh -Pyarn -Phadoop-2.3 -Phive-thriftserver -Phive 
 -Dhadoop.version=2.3.0 
 {code}
  =>
 {noformat}
 java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
 Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference. A 
 signature in TestSuiteBase.class refers to term dstream
 in package org.apache.spark.streaming which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling 
 TestSuiteBase.class.
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3175) Branch-1.1 SBT build failed for Yarn-Alpha

2014-08-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3175.


Resolution: Won't Fix

We have to keep these versions slightly out of sync when we are making release 
candidates due to the way our Maven publishing plug-in works. If you check out 
the specific release snapshots, though (e.g. snapshot1, rc1, etc.), then it 
will work. This issue is only relevant for the older YARN build.

 Branch-1.1 SBT build failed for Yarn-Alpha
 --

 Key: SPARK-3175
 URL: https://issues.apache.org/jira/browse/SPARK-3175
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.1
Reporter: Chester
  Labels: build
 Fix For: 1.1.1

   Original Estimate: 1h
  Remaining Estimate: 1h

 When trying to build yarn-alpha on branch-1.1
 |branch-1.1|$ sbt/sbt -Pyarn-alpha -Dhadoop.version=2.0.5-alpha projects
 [info] Loading project definition from /Users/chester/projects/spark/project
 org.apache.maven.model.building.ModelBuildingException: 1 problem was 
 encountered while building the effective model for 
 org.apache.spark:spark-yarn-alpha_2.10:1.1.0
 [FATAL] Non-resolvable parent POM: Could not find artifact 
 org.apache.spark:yarn-parent_2.10:pom:1.1.0 in central 
 (http://repo.maven.apache.org/maven2) and 'parent.relativePath' points at 
 wrong local POM @ line 20, column 11



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3170) Bug Fix in Storage UI

2014-08-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3170:
---

Priority: Critical  (was: Minor)

 Bug Fix in Storage UI
 -

 Key: SPARK-3170
 URL: https://issues.apache.org/jira/browse/SPARK-3170
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.2
Reporter: uncleGen
Priority: Critical

 A completed stage only needs to remove its own partitions that are no 
 longer cached. Currently, the Storage tab in the Spark UI may lose some 
 RDDs that are actually still cached.
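
A hypothetical sketch of the intended behaviour (names and types are 
illustrative, not actual Spark UI internals): on stage completion, drop only 
entries for that stage's RDDs which are no longer cached, and leave every 
other RDD's entry alone.
{code}
case class RDDInfo(name: String, isCached: Boolean)

def onStageCompleted(
    stageRddIds: Set[Int],                  // RDDs referenced by this stage
    storage: Map[Int, RDDInfo]): Map[Int, RDDInfo] =
  storage.filter { case (rddId, info) =>
    // Keep the entry unless it belongs to this stage AND is uncached;
    // RDDs belonging to other stages must never be dropped here.
    !stageRddIds.contains(rddId) || info.isCached
  }
{code}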



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2963) The description about how to build for using CLI and Thrift JDBC server is absent in proper document

2014-08-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2963.


Resolution: Fixed
  Assignee: Kousuke Saruta

Thanks - I've merged your fix.

 The description about how to build for using CLI and Thrift JDBC server is 
 absent in proper document 
 -

 Key: SPARK-2963
 URL: https://issues.apache.org/jira/browse/SPARK-2963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
 Fix For: 1.1.0


 Currently, if we'd like to use the HiveServer or CLI for Spark SQL, we need 
 to use the -Phive-thriftserver option when building, but its description is 
 incomplete.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org