[jira] [Created] (SPARK-10840) SparkSQL doesn't work well with JSON

2015-09-25 Thread Ankit Sarraf (JIRA)
Ankit Sarraf created SPARK-10840:


 Summary: SparkSQL doesn't work well with JSON
 Key: SPARK-10840
 URL: https://issues.apache.org/jira/browse/SPARK-10840
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Ankit Sarraf
Priority: Trivial


Well-formed JSON doesn't work with version 1.5.1 when using 
sqlContext.read.json(""):

{
  "employees": {
"employee": [
  {
"name": "Mia",
"surname": "Radison",
"mobile": "7295913821",
"email": "miaradi...@sparky.com"
  },
  {
"name": "Thor",
"surname": "Kovaskz",
"mobile": "8829177193",
"email": "tkova...@sparky.com"
  },
  {
"name": "Bindy",
"surname": "Kvuls",
"mobile": "5033828845",
"email": "bind...@sparky.com"
  }
]
  }
}

Whereas this works because all the records are on a single line:

[
  {"name": "Mia","surname": "Radison","mobile": "7295913821","email": 
"miaradi...@sparky.com"},
  {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": 
"tkova...@sparky.com"},
  {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": 
"bind...@sparky.com"}
]
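
A minimal sketch of the behaviour described above, assuming Spark 1.5 with a SparkContext named sc and a SQLContext named sqlContext (the file names are placeholders): sqlContext.read.json expects one complete JSON record per line (JSON Lines), so a pretty-printed document spanning several lines is read as a corrupt record, while the one-record-per-line form loads fine. One possible workaround is to read each file as a single string and hand an RDD[String] of whole documents to the reader.

{code}
// Sketch only; file names are placeholders.
// Works: one JSON record per line (JSON Lines).
val flat = sqlContext.read.json("employees-one-record-per-line.json")
flat.printSchema()

// Workaround for a pretty-printed, multi-line document: read each file as a
// single string and pass an RDD[String] of whole documents to the reader.
val wholeDocs = sc.wholeTextFiles("employees-pretty.json").map(_._2)
val nested = sqlContext.read.json(wholeDocs)
nested.printSchema()
{code}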



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10840) SparkSQL doesn't work well with JSON

2015-09-25 Thread Ankit Sarraf (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Sarraf updated SPARK-10840:
-
Description: 
Well-formed JSON doesn't work with version 1.5.1 when using 
sqlContext.read.json(""):

{
  "employees": {
"employee": [
  {
"name": "Mia",

"surname": "Radison",

"mobile": "7295913821",

"email": "miaradi...@sparky.com"
  },
  {
"name": "Thor",

"surname": "Kovaskz",

"mobile": "8829177193",

"email": "tkova...@sparky.com"
  },
  {
"name": "Bindy",

"surname": "Kvuls",

"mobile": "5033828845",

"email": "bind...@sparky.com"
  }
]
  }
}

Whereas this works because all the records are on a single line:

[
  {"name": "Mia","surname": "Radison","mobile": "7295913821","email": 
"miaradi...@sparky.com"},
  {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": 
"tkova...@sparky.com"},
  {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": 
"bind...@sparky.com"}
]

  was:
Well-formed JSON doesn't work with version 1.5.1 when using 
sqlContext.read.json(""):

{
  "employees": {
"employee": [
  {
"name": "Mia",
"surname": "Radison",
"mobile": "7295913821",
"email": "miaradi...@sparky.com"
  },
  {
"name": "Thor",
"surname": "Kovaskz",
"mobile": "8829177193",
"email": "tkova...@sparky.com"
  },
  {
"name": "Bindy",
"surname": "Kvuls",
"mobile": "5033828845",
"email": "bind...@sparky.com"
  }
]
  }
}

Whereas this works because all the records are on a single line:

[
  {"name": "Mia","surname": "Radison","mobile": "7295913821","email": 
"miaradi...@sparky.com"},
  {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": 
"tkova...@sparky.com"},
  {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": 
"bind...@sparky.com"}
]


> SparkSQL doesn't work well with JSON
> 
>
> Key: SPARK-10840
> URL: https://issues.apache.org/jira/browse/SPARK-10840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Ankit Sarraf
>Priority: Trivial
>  Labels: JSON, Scala, SparkSQL
>
> Well-formed JSON doesn't work with version 1.5.1 when using 
> sqlContext.read.json(""):
> {
>   "employees": {
> "employee": [
>   {
> "name": "Mia",
> "surname": "Radison",
> "mobile": "7295913821",
> "email": "miaradi...@sparky.com"
>   },
>   {
> "name": "Thor",
> "surname": "Kovaskz",
> "mobile": "8829177193",
> "email": "tkova...@sparky.com"
>   },
>   {
> "name": "Bindy",
> "surname": "Kvuls",
> "mobile": "5033828845",
> "email": "bind...@sparky.com"
>   }
> ]
>   }
> }
> Whereas this works because all the records are on a single line:
> [
>   {"name": "Mia","surname": "Radison","mobile": "7295913821","email": 
> "miaradi...@sparky.com"},
>   {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": 
> "tkova...@sparky.com"},
>   {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": 
> "bind...@sparky.com"}
> ]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10840) SparkSQL doesn't work well with JSON

2015-09-25 Thread Ankit Sarraf (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Sarraf updated SPARK-10840:
-
Description: 
Well-formed JSON doesn't work with version 1.5.1 when using 
sqlContext.read.json(""):

{
  "employees": {
"employee": [
  {
"name": "Mia",

"surname": "Radison",

"mobile": "7295913821",

"email": "miaradi...@sparky.com"
  },
  {
"name": "Thor",

"surname": "Kovaskz",

"mobile": "8829177193",

"email": "tkova...@sparky.com"
  },
  {
"name": "Bindy",

"surname": "Kvuls",

"mobile": "5033828845",

"email": "bind...@sparky.com"
  }
]
  }
}

Whereas this works because all the records are on a single line:

[
  {"name": "Mia","surname": "Radison","mobile": "7295913821","email": 
"miaradi...@sparky.com"},
  {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": 
"tkova...@sparky.com"},
  {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": 
"bind...@sparky.com"}
]

  was:
Well-formed JSON doesn't work with version 1.5.1 when using 
sqlContext.read.json(""):

{
  "employees": {
"employee": [
  {
"name": "Mia",

"surname": "Radison",

"mobile": "7295913821",

"email": "miaradi...@sparky.com"
  },
  {
"name": "Thor",

"surname": "Kovaskz",

"mobile": "8829177193",

"email": "tkova...@sparky.com"
  },
  {
"name": "Bindy",

"surname": "Kvuls",

"mobile": "5033828845",

"email": "bind...@sparky.com"
  }
]
  }
}

Whereas this works because all the records are on a single line:

[
  {"name": "Mia","surname": "Radison","mobile": "7295913821","email": 
"miaradi...@sparky.com"},
  {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": 
"tkova...@sparky.com"},
  {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": 
"bind...@sparky.com"}
]


> SparkSQL doesn't work well with JSON
> 
>
> Key: SPARK-10840
> URL: https://issues.apache.org/jira/browse/SPARK-10840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Ankit Sarraf
>Priority: Trivial
>  Labels: JSON, Scala, SparkSQL
>
> Well-formed JSON doesn't work with version 1.5.1 when using 
> sqlContext.read.json(""):
> 
> {
>   "employees": {
> "employee": [
>   {
> "name": "Mia",
> "surname": "Radison",
> "mobile": "7295913821",
> "email": "miaradi...@sparky.com"
>   },
>   {
> "name": "Thor",
> "surname": "Kovaskz",
> "mobile": "8829177193",
> "email": "tkova...@sparky.com"
>   },
>   {
> "name": "Bindy",
> "surname": "Kvuls",
> "mobile": "5033828845",
> "email": "bind...@sparky.com"
>   }
> ]
>   }
> }
> 
> Whereas this works because all the records are on a single line:
> [
>   {"name": "Mia","surname": "Radison","mobile": "7295913821","email": 
> "miaradi...@sparky.com"},
>   {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": 
> "tkova...@sparky.com"},
>   {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": 
> "bind...@sparky.com"}
> ]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10840) SparkSQL doesn't work well with JSON

2015-09-25 Thread Ankit Sarraf (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Sarraf updated SPARK-10840:
-
Description: 
Well-formed JSON doesn't work with version 1.5.1 when using 
sqlContext.read.json(""):

{
  "employees": {
"employee": [
  {
"name": "Mia", 

"surname": "Radison",

"mobile": "7295913821",

"email": "miaradi...@sparky.com"
  },
  {
"name": "Thor",

"surname": "Kovaskz",

"mobile": "8829177193",

"email": "tkova...@sparky.com"
  },
  {
"name": "Bindy",

"surname": "Kvuls",

"mobile": "5033828845",

"email": "bind...@sparky.com"
  }
]
  }
}

Whereas this works because all the records are on a single line:

[
  {"name": "Mia","surname": "Radison","mobile": "7295913821","email": 
"miaradi...@sparky.com"},
  {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": 
"tkova...@sparky.com"},
  {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": 
"bind...@sparky.com"}
]

  was:
Well-formed JSON doesn't work with version 1.5.1 when using 
sqlContext.read.json(""):

{
  "employees": {
"employee": [
  {
"name": "Mia",

"surname": "Radison",

"mobile": "7295913821",

"email": "miaradi...@sparky.com"
  },
  {
"name": "Thor",

"surname": "Kovaskz",

"mobile": "8829177193",

"email": "tkova...@sparky.com"
  },
  {
"name": "Bindy",

"surname": "Kvuls",

"mobile": "5033828845",

"email": "bind...@sparky.com"
  }
]
  }
}

Whereas this works because all the records are on a single line:

[
  {"name": "Mia","surname": "Radison","mobile": "7295913821","email": 
"miaradi...@sparky.com"},
  {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": 
"tkova...@sparky.com"},
  {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": 
"bind...@sparky.com"}
]


> SparkSQL doesn't work well with JSON
> 
>
> Key: SPARK-10840
> URL: https://issues.apache.org/jira/browse/SPARK-10840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Ankit Sarraf
>Priority: Trivial
>  Labels: JSON, Scala, SparkSQL
>
> Well-formed JSON doesn't work with version 1.5.1 when using 
> sqlContext.read.json(""):
> 
> {
>   "employees": {
> "employee": [
>   {
> "name": "Mia", 
> "surname": "Radison",
> "mobile": "7295913821",
> "email": "miaradi...@sparky.com"
>   },
>   {
> "name": "Thor",
> "surname": "Kovaskz",
> "mobile": "8829177193",
> "email": "tkova...@sparky.com"
>   },
>   {
> "name": "Bindy",
> "surname": "Kvuls",
> "mobile": "5033828845",
> "email": "bind...@sparky.com"
>   }
> ]
>   }
> }
> 
> Whereas this works because all the records are on a single line:
> [
>   {"name": "Mia","surname": "Radison","mobile": "7295913821","email": 
> "miaradi...@sparky.com"},
>   {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": 
> "tkova...@sparky.com"},
>   {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": 
> "bind...@sparky.com"}
> ]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10808) LDA user guide: discuss running time of LDA

2015-09-25 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908198#comment-14908198
 ] 

Mohamed Baddar commented on SPARK-10808:


Thanks [~josephkb], working on it.

> LDA user guide: discuss running time of LDA
> ---
>
> Key: SPARK-10808
> URL: https://issues.apache.org/jira/browse/SPARK-10808
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Based on feedback like [SPARK-10791], we should discuss the computational and 
> communication complexity of LDA and its optimizers in the MLlib Programming 
> Guide.  E.g.:
> * Online LDA can be faster than EM.
> * To make online LDA run faster, you can use a smaller miniBatchFraction.
> * Communication
> ** For EM, communication on each iteration is on the order of # topics * 
> (vocabSize + # docs).
> ** For online LDA, communication on each iteration is on the order of # 
> topics * vocabSize.
> * Decreasing vocabSize and # topics can speed things up.  It's often fine to 
> eliminate uncommon words, unless you are trying to create a very large number 
> of topics.
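
A hedged illustration of the knobs mentioned in the quoted points, against the MLlib 1.5 API; the corpus value is assumed to exist as an RDD[(Long, Vector)] of (document id, term counts), and the numbers are arbitrary:

{code}
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

// Sketch only: "corpus" is assumed to be an RDD[(Long, Vector)] of term counts.
val lda = new LDA()
  .setK(100)                       // fewer topics => less communication per iteration
  .setOptimizer(new OnlineLDAOptimizer()
    .setMiniBatchFraction(0.05))   // smaller mini-batches => faster online iterations
val model = lda.run(corpus)
{code}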



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-09-25 Thread Zoltán Zvara (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907913#comment-14907913
 ] 

Zoltán Zvara commented on SPARK-10390:
--

The problem still exists. To reproduce, simply clone the latest snapshot from 
GitHub, build and set it up as I've written in the description, open IPython, and 
issue {{sc.textFile("random.text.file").collect()}}.

> Py4JJavaError java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
> 
>
> Key: SPARK-10390
> URL: https://issues.apache.org/jira/browse/SPARK-10390
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Zoltán Zvara
>
> While running PySpark through iPython.
> {code}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
>   at 
> org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {{spark-env.sh}}
> {code}
> export IPYTHON=1
> export PYSPARK_PYTHON=/usr/bin/python3
> export PYSPARK_DRIVER_PYTHON=ipython3
> export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
> {code}
> Spark built with:
> {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}
> Not a problem, when built against {{Hadoop 2.4}}!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-09-25 Thread Zoltán Zvara (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907913#comment-14907913
 ] 

Zoltán Zvara edited comment on SPARK-10390 at 9/25/15 10:59 AM:


The problem still exists. To reproduce, simply clone the latest snapshot from 
GitHub, build and set it up as I've written in the description, open IPython, and 
issue {{sc.textFile(...).collect()}}.


was (Author: ehnalis):
The problem still exists. To reproduce, simply clone the latest snapshot from 
GitHub, build and set it up as I've written in the description, open IPython, and 
issue {{sc.textFile("random.text.file").collect()}}.

> Py4JJavaError java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
> 
>
> Key: SPARK-10390
> URL: https://issues.apache.org/jira/browse/SPARK-10390
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Zoltán Zvara
>
> While running PySpark through iPython.
> {code}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
>   at 
> org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {{spark-env.sh}}
> {code}
> export IPYTHON=1
> export PYSPARK_PYTHON=/usr/bin/python3
> export PYSPARK_DRIVER_PYTHON=ipython3
> export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
> {code}
> Spark built with:
> {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}
> Not a problem, when built against {{Hadoop 2.4}}!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-09-25 Thread Zoltán Zvara (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907913#comment-14907913
 ] 

Zoltán Zvara edited comment on SPARK-10390 at 9/25/15 11:05 AM:


The problem still exists. To reproduce, simply clone the latest snapshot from 
GitHub, build and set it up as I've written in the description, open IPython, and 
issue {{sc.textFile(...).collect()}}.

(Start IPython with {{sudo bin/pyspark}})


was (Author: ehnalis):
The problem still exists. To reproduce, simply clone the latest snapshot from 
GitHub, build and set it up as I've written in the description, open IPython, and 
issue {{sc.textFile(...).collect()}}.

> Py4JJavaError java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
> 
>
> Key: SPARK-10390
> URL: https://issues.apache.org/jira/browse/SPARK-10390
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Zoltán Zvara
>
> While running PySpark through iPython.
> {code}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
>   at 
> org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {{spark-env.sh}}
> {code}
> export IPYTHON=1
> export PYSPARK_PYTHON=/usr/bin/python3
> export PYSPARK_DRIVER_PYTHON=ipython3
> export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
> {code}
> Spark built with:
> {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}
> Not a problem, when built against {{Hadoop 2.4}}!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-09-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907935#comment-14907935
 ] 

Sean Owen commented on SPARK-10390:
---

Yes, but as I say, it appears to work with the build of reference. Use Maven.

> Py4JJavaError java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
> 
>
> Key: SPARK-10390
> URL: https://issues.apache.org/jira/browse/SPARK-10390
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Zoltán Zvara
>
> While running PySpark through iPython.
> {code}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
>   at 
> org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {{spark-env.sh}}
> {code}
> export IPYTHON=1
> export PYSPARK_PYTHON=/usr/bin/python3
> export PYSPARK_DRIVER_PYTHON=ipython3
> export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
> {code}
> Spark built with:
> {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}
> Not a problem, when built against {{Hadoop 2.4}}!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-09-25 Thread Zoltán Zvara (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907948#comment-14907948
 ] 

Zoltán Zvara commented on SPARK-10390:
--

What could be the problem that causes SBT to package the wrong Guava version? Thanks!
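
Not an answer from this thread, but one generic way to narrow it down: Guava removed {{Stopwatch.elapsedMillis()}} in later releases, so the NoSuchMethodError usually means the Guava class that actually got loaded is newer than the one Hadoop 2.6's {{FileInputFormat}} expects. A hedged spark-shell check of which jar supplied Guava:

{code}
// Hedged diagnostic, runnable in spark-shell: print which jar supplied the
// Guava Stopwatch class the JVM actually loaded.
val src = classOf[com.google.common.base.Stopwatch].getProtectionDomain.getCodeSource
println(src) // null means it came from the bootstrap/system class loader

// List the methods that are present, to confirm whether elapsedMillis()
// exists in the Guava version that ended up in the assembly.
classOf[com.google.common.base.Stopwatch].getMethods.map(_.getName).distinct.sorted.foreach(println)
{code}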

> Py4JJavaError java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
> 
>
> Key: SPARK-10390
> URL: https://issues.apache.org/jira/browse/SPARK-10390
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Zoltán Zvara
>
> While running PySpark through iPython.
> {code}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
>   at 
> org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {{spark-env.sh}}
> {code}
> export IPYTHON=1
> export PYSPARK_PYTHON=/usr/bin/python3
> export PYSPARK_DRIVER_PYTHON=ipython3
> export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
> {code}
> Spark built with:
> {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}
> Not a problem, when built against {{Hadoop 2.4}}!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark

2015-09-25 Thread Ondřej Smola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907949#comment-14907949
 ] 

Ondřej Smola commented on SPARK-8734:
-

I think parameters can be supported as comma-separated key=value pairs under 
spark.mesos.executor.docker.parameters. From what I can see in the Mesos source 
code, only long parameter names are supported.

> Expose all Mesos DockerInfo options to Spark
> 
>
> Key: SPARK-8734
> URL: https://issues.apache.org/jira/browse/SPARK-8734
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Chris Heller
>Priority: Minor
> Attachments: network.diff
>
>
> SPARK-2691 only exposed a few options from the DockerInfo message. It would 
> be reasonable to expose them all, especially given one can now specify 
> arbitrary parameters to docker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark

2015-09-25 Thread Chris Heller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907952#comment-14907952
 ] 

Chris Heller commented on SPARK-8734:
-

Responsibilities have sort of pulled me away from focusing on this. I did 
manage to get the network code into my branch.

I was thinking about parameters, and considered a scheme such as:

spark.mesos.executor.docker.parameter.<name> = <value>

This follows from how you set environment variables on the executor. Would this 
scheme be reasonable?

> Expose all Mesos DockerInfo options to Spark
> 
>
> Key: SPARK-8734
> URL: https://issues.apache.org/jira/browse/SPARK-8734
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Chris Heller
>Priority: Minor
> Attachments: network.diff
>
>
> SPARK-2691 only exposed a few options from the DockerInfo message. It would 
> be reasonable to expose them all, especially given one can now specify 
> arbitrary parameters to docker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9103) Tracking spark's memory usage

2015-09-25 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908309#comment-14908309
 ] 

Imran Rashid commented on SPARK-9103:
-

Hi [~liyezhang556520], thanks for posting the design doc.  Looks good, just a 
couple of questions.

1) Will the proposed design cover SPARK-9111, getting the memory when the 
executor dies abnormally, (esp when killed by yarn)?  It seems to me the answer 
is "no", which is fine, that can be tackled separately, I just wanted to 
clarify.

2) I see the complexity of having overlapping stages, but I wonder if it could 
be simplified somewhat.  It seems to me you just need to maintain an 
{{executorToLatestMetrics: Map[executor, metrics]}} and then, on every stage 
complete, you just log them all?  Maybe this is what you are already describing 
in the doc, but it seems like there is more state & a bit more logging going 
on.  Eg., I don't fully understand why you need to log both "CHB1" and "HB3" in 
your example.

thanks

> Tracking spark's memory usage
> -
>
> Key: SPARK-9103
> URL: https://issues.apache.org/jira/browse/SPARK-9103
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
> Attachments: Tracking Spark Memory Usage - Phase 1.pdf
>
>
> Currently Spark only provides a little memory usage information (the RDD cache 
> on the web UI) for the executors. Users have no idea what the memory consumption 
> is when they are running Spark applications that use a lot of memory in the 
> executors. Especially when they hit an OOM, it's really hard to know 
> what the cause of the problem is. So it would be helpful to give out 
> detailed memory consumption information for each part of Spark, so that users 
> can clearly have a picture of where the memory is actually used. 
> The memory usage info to expose should include, but not be limited to, shuffle, 
> cache, network, serializer, etc.
> Users can optionally choose to enable this functionality, since it is mainly 
> intended for debugging and tuning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10835) Change Output of NGram to Array(String, True)

2015-09-25 Thread Sumit Chawla (JIRA)
Sumit Chawla created SPARK-10835:


 Summary: Change Output of NGram to Array(String, True)
 Key: SPARK-10835
 URL: https://issues.apache.org/jira/browse/SPARK-10835
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Sumit Chawla
Assignee: yuhao yang
Priority: Minor
 Fix For: 1.5.0


Currently the output type of Tokenizer is Array(String, false), which is not 
compatible with Word2Vec and other transformers, since their input type is 
Array(String, true). A Seq[String] in a UDF will be treated as Array(String, true) 
by default. 

I'm also thinking that for nullable columns, maybe the tokenizer should return 
Array(null) for null values in the input.
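
For reference, a hedged sketch of what the Array(String, true/false) notation corresponds to in the Spark SQL type system; the two schemas differ only in the containsNull flag, which is what makes them incompatible:

{code}
// Sketch only: the boolean in ArrayType is the "true/false" in Array(String, ...).
import org.apache.spark.sql.types.{ArrayType, StringType}

val tokenizerOut = ArrayType(StringType, containsNull = false) // Array(String, false)
val word2vecIn   = ArrayType(StringType, containsNull = true)  // Array(String, true)
println(tokenizerOut == word2vecIn) // false: they differ only in nullability
{code}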



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark

2015-09-25 Thread Ondřej Smola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908651#comment-14908651
 ] 

Ondřej Smola commented on SPARK-8734:
-

No, it won't, as the Spark config is internally stored in a hashmap - I realized 
this while walking home :).

What about this:

spark.mesos.executor.docker.parameter.abc  abc
spark.mesos.executor.docker.parameters.env  FOO=BAR, ENV1=VAL1

I have a simple working solution with tests - I need to write some docs and I 
will send a link to my fork for further discussion.
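
A hedged sketch of how such a convention could be read out of a SparkConf; the key names follow the proposal above and are not a released Spark API, and the comma-separated parameters.<name> form is one possible way to let a parameter repeat:

{code}
// Sketch only; key names follow the convention proposed above, not a released API.
import org.apache.spark.SparkConf

def dockerParameters(conf: SparkConf): Seq[(String, String)] = {
  val singlePrefix   = "spark.mesos.executor.docker.parameter."
  val repeatedPrefix = "spark.mesos.executor.docker.parameters."

  // One key per single-valued parameter: ...parameter.abc -> ("abc", value)
  val single = conf.getAll.toSeq.collect {
    case (k, v) if k.startsWith(singlePrefix) => k.stripPrefix(singlePrefix) -> v
  }

  // Comma-separated values for parameters that may repeat:
  // ...parameters.env "FOO=BAR, ENV1=VAL1" -> ("env", "FOO=BAR"), ("env", "ENV1=VAL1")
  val repeated = conf.getAll.toSeq.collect {
    case (k, v) if k.startsWith(repeatedPrefix) =>
      v.split(",").map(_.trim).filter(_.nonEmpty).map(k.stripPrefix(repeatedPrefix) -> _).toSeq
  }.flatten

  single ++ repeated
}
{code}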

> Expose all Mesos DockerInfo options to Spark
> 
>
> Key: SPARK-8734
> URL: https://issues.apache.org/jira/browse/SPARK-8734
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Chris Heller
>Priority: Minor
> Attachments: network.diff
>
>
> SPARK-2691 only exposed a few options from the DockerInfo message. It would 
> be reasonable to expose them all, especially given one can now specify 
> arbitrary parameters to docker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-09-25 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908705#comment-14908705
 ] 

Imran Rashid commented on SPARK-5928:
-

[~ariskk] The workaround is to increase the number of partitions.  All of the 
operations which trigger a shuffle take an optional second argument with the 
number of partitions, e.g., {{reduceByKey( reduceFunc, numPartitions)}}.  In 
general, it's best to err on the side of too many partitions rather than too 
few.  My rule of thumb is to try to size partitions to have roughly 100 MB 
of data (I have heard others throw around numbers in roughly the same 
ballpark).  Note that this means you use a lot of partitions if you have, say, 
1 TB of data you are shuffling.

It's worth noting that if you have very skewed data, just increasing the number 
of partitions in the function that triggers the shuffle might not help.  That 
controls the number of partitions on the shuffle-read (aka reduce) side, but 
not the shuffle-write (aka map) side.  If one map task writes out 2GB of data 
for one key, increasing the number of reduce partitions won't help you, since 
no matter how many reduce partitions, you will still write 2GB into one shuffle 
block.  (A shuffle block corresponds to one map task / reduce task pair.)  In 
that case, you may want to increase the number of partitions for your *map* 
stage, so that it is writing less data to one particular key.  You control the 
number of partitions for the map-stage either at the previous operation that 
triggered a shuffle (e.g., a preceding {{reduceByKey}}), or the operation that 
loaded the data (e.g., {{sc.textFile}}).  E.g.:

{noformat}
val rawData = sc.textFile(..., numPartitionsFirstStage) // control the "map" 
partitions here
val afterShuffle = rawData.map{...}.reduceByKey( ..., numPartitionsSecondStage) 
// control the "reduce" partitions here
{noformat}

My general recommendation, if you want to re-use your code and have it work on 
data sets of varying sizes, is to make the number of partitions at *every* 
stage an easily controllable parameter (e.g., via the command line), so you 
can tweak things without having to recompile your code.
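
A hedged sketch of that recommendation; {{inputPath}} and {{keyOf}} are placeholders, and the two partition counts are read from the program arguments so they can be tuned per run without recompiling:

{code}
// Sketch only: inputPath and keyOf are placeholders, not from this thread.
val numMapPartitions    = args(0).toInt // partitions for the load / "map" stage
val numReducePartitions = args(1).toInt // partitions for the shuffle-read stage

val rawData = sc.textFile(inputPath, numMapPartitions)        // control the "map" partitions here
val counts  = rawData.map(line => (keyOf(line), 1L))
                     .reduceByKey(_ + _, numReducePartitions) // control the "reduce" partitions here
{code}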

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode, it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at 

[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark

2015-09-25 Thread Alan Braithwaite (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908647#comment-14908647
 ] 

Alan Braithwaite commented on SPARK-8734:
-

Will this work with multiple instances of the same property?  My concern is 
that there are some arguments which can be repeated and this scheme doesn't 
allow for that.

> Expose all Mesos DockerInfo options to Spark
> 
>
> Key: SPARK-8734
> URL: https://issues.apache.org/jira/browse/SPARK-8734
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Chris Heller
>Priority: Minor
> Attachments: network.diff
>
>
> SPARK-2691 only exposed a few options from the DockerInfo message. It would 
> be reasonable to expose them all, especially given one can now specify 
> arbitrary parameters to docker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8734) Expose all Mesos DockerInfo options to Spark

2015-09-25 Thread Ondřej Smola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908651#comment-14908651
 ] 

Ondřej Smola edited comment on SPARK-8734 at 9/25/15 9:01 PM:
--

No, it won't, as the Spark config is internally stored in a hashmap - I realized 
this while walking home :).

What about this:

spark.mesos.executor.docker.parameter.abc  abc
spark.mesos.executor.docker.parameters.env  FOO=BAR, ENV1=VAL1

I have a simple working solution with tests - I need to write some docs and I 
will send a link to my fork for further discussion.

Edit: repo https://github.com/ondrej-smola/spark/tree/feature/SPARK-8734


was (Author: ondrej.smola):
No, it won't, as the Spark config is internally stored in a hashmap - I realized 
this while walking home :).

What about this:

spark.mesos.executor.docker.parameter.abc  abc
spark.mesos.executor.docker.parameters.env  FOO=BAR, ENV1=VAL1

I have a simple working solution with tests - I need to write some docs and I 
will send a link to my fork for further discussion.

> Expose all Mesos DockerInfo options to Spark
> 
>
> Key: SPARK-8734
> URL: https://issues.apache.org/jira/browse/SPARK-8734
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Chris Heller
>Priority: Minor
> Attachments: network.diff
>
>
> SPARK-2691 only exposed a few options from the DockerInfo message. It would 
> be reasonable to expose them all, especially given one can now specify 
> arbitrary parameters to docker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9103) Tracking spark's memory usage

2015-09-25 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908644#comment-14908644
 ] 

Imran Rashid commented on SPARK-9103:
-

ah, of course, sorry I made a big mistake.  I was thinking that you only need 
to keep the latest max value per executor.  But of course if that max occurred 
before the latest stage started, then you need to reset your counter.  And with 
concurrent stages, you can't simply reset one global counter, since you need 
the max within every window.

Thanks for explaining it to me again!
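
An illustrative sketch of the bookkeeping being discussed (this is not the design in the attached doc; names and types are assumptions): keep a peak per (executor, active stage) pair so that overlapping stages each track their own maximum, which is why a single global counter can't simply be reset when one stage finishes.

{code}
import scala.collection.mutable

// Illustrative only: (executorId, stageId) -> peak memory observed, in bytes.
val peakPerStage = mutable.Map.empty[(String, Int), Long]

def onHeartbeat(executorId: String, memoryUsed: Long, activeStages: Iterable[Int]): Unit =
  for (stageId <- activeStages) {
    val key = (executorId, stageId)
    peakPerStage(key) = math.max(peakPerStage.getOrElse(key, 0L), memoryUsed)
  }

def onStageCompleted(stageId: Int): Unit = {
  val finished = peakPerStage.filter { case ((_, s), _) => s == stageId }
  finished.foreach { case ((exec, _), peak) =>
    println(s"stage $stageId: executor $exec peaked at $peak bytes")
  }
  peakPerStage --= finished.keys // other, still-running stages keep their own windows
}
{code}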

> Tracking spark's memory usage
> -
>
> Key: SPARK-9103
> URL: https://issues.apache.org/jira/browse/SPARK-9103
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
> Attachments: Tracking Spark Memory Usage - Phase 1.pdf
>
>
> Currently Spark only provides a little memory usage information (the RDD cache 
> on the web UI) for the executors. Users have no idea what the memory consumption 
> is when they are running Spark applications that use a lot of memory in the 
> executors. Especially when they hit an OOM, it's really hard to know 
> what the cause of the problem is. So it would be helpful to give out 
> detailed memory consumption information for each part of Spark, so that users 
> can clearly have a picture of where the memory is actually used. 
> The memory usage info to expose should include, but not be limited to, shuffle, 
> cache, network, serializer, etc.
> Users can optionally choose to enable this functionality, since it is mainly 
> intended for debugging and tuning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10824) DataFrame show method - show(df) should show first N number of rows, similar to R

2015-09-25 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908380#comment-14908380
 ] 

Shivaram Venkataraman commented on SPARK-10824:
---

[~Narine] We discussed this in https://issues.apache.org/jira/browse/SPARK-9317 
and specifically in the github PR 
https://github.com/apache/spark/pull/8360#issuecomment-133516179

As described in the PR comment, this needs a more involved change on the SQL 
side to see whether the data frame is cheap to print, as we don't want to trigger 
expensive computation in this case.

cc [~rxin]

> DataFrame show method - show(df) should show first N number of rows, similar 
> to R
> -
>
> Key: SPARK-10824
> URL: https://issues.apache.org/jira/browse/SPARK-10824
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Hi everyone,
> currently, the show(dataframe) method shows some information about the columns 
> and their datatypes; however, R shows the first N rows of the dataframe. 
> Basically, the same as showDF. Right now I changed it so that show calls showDF.
> Also, the default number of rows was hard-coded in DataFrame.R; I set it as an 
> environment variable in sparkR.R. We can change it if you have other better 
> suggestions.
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10791) Optimize MLlib LDA topic distribution query performance

2015-09-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908448#comment-14908448
 ] 

Joseph K. Bradley commented on SPARK-10791:
---

Oh, OK, I'll comment there as needed.  Thanks

> Optimize MLlib LDA topic distribution query performance
> ---
>
> Key: SPARK-10791
> URL: https://issues.apache.org/jira/browse/SPARK-10791
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
> Environment: Ubuntu 13.10, Oracle Java 8
>Reporter: Marko Asplund
>
> I've been testing MLlib LDA training with 100 topics, 105 K vocabulary size 
> and ~3.4 M documents using EMLDAOptimizer.
> Training the model took ~2.5 hours with MLlib, whereas with Vowpal Wabbit, 
> training with the same data and on the same system took ~5 minutes. 
> Loading the persisted model from disk (~2 minutes), as well as querying LDA 
> model topic distributions (~4 seconds for one document) are also quite slow 
> operations.
> Our application is querying LDA model topic distribution (for one doc at a 
> time) as part of end-user operation execution flow, so a ~4 second execution 
> time is very problematic.
> The log includes the following message, which AFAIK, should mean that 
> netlib-java is using machine optimised native implementation: 
> "com.github.fommil.jni.JniLoader - successfully loaded 
> /tmp/jniloader4682745056459314976netlib-native_system-linux-x86_64.so"
> My test code can be found here:
> https://github.com/marko-asplund/tech-protos/blob/08e9819a2108bf6bd4d878253c4aa32510a0a9ce/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala#L56-L57
> I also tried using the OnlineLDAOptimizer, but there wasn't a noticeable 
> change in training performance. Model loading time was reduced to ~ 5 seconds 
> from ~ 2 minutes (now persisted as LocalLDAModel). However, query / 
> prediction time was unchanged.
> Unfortunately, this is the critical performance characteristic in our case.
> I did some profiling for my LDA prototype code that requests topic 
> distributions from a model. According to Java Mission Control more than 80 % 
> of execution time during sample interval is spent in the following methods:
> - org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%
> - org.apache.commons.math3.special.Gamma.digamma(double); count: 164; 22.91%
> - org.apache.commons.math3.util.FastMath.log(double, double[]); count: 50;
> 6.98%
> - java.lang.Double.valueOf(double); count: 31; 4.33%
> Is there any way of using the API more optimally?
> Are there any opportunities for optimising the "topicDistributions" code
> path in MLlib?
> My query test code looks like this essentially:
> // executed once
> val model = LocalLDAModel.load(ctx, ModelFileName)
> // executed four times
> val samples = Transformers.toSparseVectors(vocabularySize,
> ctx.parallelize(Seq(input))) // fast
> model.topicDistributions(samples.zipWithIndex.map(_.swap)) // <== this
> seems to take about 4 seconds to execute



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10760) SparkR glm: the documentation in examples - family argument is missing

2015-09-25 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10760.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8870
[https://github.com/apache/spark/pull/8870]

> SparkR glm: the documentation in examples - family argument is missing
> --
>
> Key: SPARK-10760
> URL: https://issues.apache.org/jira/browse/SPARK-10760
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
> Fix For: 1.6.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Hi everyone,
> Since the family argument is required for the glm function, the execution of:
> model <- glm(Sepal_Length ~ Sepal_Width, df) 
> is failing.
> I've fixed the documentation by adding the family argument and also added 
> summary(model), which will show the coefficients for the model. 
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10760) SparkR glm: the documentation in examples - family argument is missing

2015-09-25 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-10760:
--
Assignee: Narine Kokhlikyan

> SparkR glm: the documentation in examples - family argument is missing
> --
>
> Key: SPARK-10760
> URL: https://issues.apache.org/jira/browse/SPARK-10760
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Assignee: Narine Kokhlikyan
>Priority: Minor
> Fix For: 1.6.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Hi everyone,
> Since the family argument is required for the glm function, the execution of:
> model <- glm(Sepal_Length ~ Sepal_Width, df) 
> is failing.
> I've fixed the documentation by adding the family argument and also added 
> summary(model), which will show the coefficients of the model. 
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9883) Distance to each cluster given a point

2015-09-25 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908598#comment-14908598
 ] 

Bertrand Dechoux commented on SPARK-9883:
-

The patch is now ready for MLlib and is waiting for a technical review.
I will see about Pipelines API for the next step.

> Distance to each cluster given a point
> --
>
> Key: SPARK-9883
> URL: https://issues.apache.org/jira/browse/SPARK-9883
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Bertrand Dechoux
>Priority: Minor
>
> Right now KMeansModel provides only a 'predict' method, which returns the 
> index of the closest cluster.
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector)
> It would be nice to have a method giving the distance to all clusters.
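In the meantime, a minimal user-side sketch of the requested behaviour, assuming the spark.mllib KMeansModel and Vectors.sqdist (this is not the patch under review):

{code}
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Euclidean distance from `point` to every cluster center, in center order.
def distancesToAllCenters(model: KMeansModel, point: Vector): Array[Double] =
  model.clusterCenters.map(center => math.sqrt(Vectors.sqdist(center, point)))
{code}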



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6649) DataFrame created through SQLContext.jdbc() failed if columns table must be quoted

2015-09-25 Thread Rick Hillegas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908609#comment-14908609
 ] 

Rick Hillegas commented on SPARK-6649:
--

Hi Fred,

The backtick syntax seems to be a feature of HiveQL according to this 
discussion on the developer list: 
http://apache-spark-developers-list.1001551.n3.nabble.com/column-identifiers-in-Spark-SQL-td14280.html

Thanks,
-Rick

> DataFrame created through SQLContext.jdbc() failed if columns table must be 
> quoted
> --
>
> Key: SPARK-6649
> URL: https://issues.apache.org/jira/browse/SPARK-6649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Frédéric Blanc
>Priority: Minor
>
> If I want to import the content of a table from Oracle that contains a column 
> named COMMENT (a reserved keyword), I cannot use a DataFrame that maps all 
> the columns of this table.
> {code:title=ddl.sql|borderStyle=solid}
> CREATE TABLE TEST_TABLE (
> "COMMENT" VARCHAR2(10)
> );
> {code}
> {code:title=test.java|borderStyle=solid}
> SQLContext sqlContext = ...
> DataFrame df = sqlContext.jdbc(databaseURL, "TEST_TABLE");
> df.rdd();   // => failed if the table contains a column with a reserved 
> keyword
> {code}
> The same problem can be encountered if reserved keywords are used in the table name.
> The JDBCRDD Scala class could be improved if the columnList initializer 
> appended double quotes around each column name (line 225).
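A minimal sketch of the quoting idea, for illustration only; the helper below is hypothetical and not Spark's actual JDBCRDD code:

{code}
// Hypothetical helper illustrating the suggestion above.
def quoteIdentifier(name: String): String = "\"" + name + "\""

val columns = Seq("COMMENT", "OTHER_COL")            // example column names
val columnList = columns.map(quoteIdentifier).mkString(", ")
val query = s"SELECT $columnList FROM TEST_TABLE"
// => SELECT "COMMENT", "OTHER_COL" FROM TEST_TABLE  (reserved words are now safe)
{code}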



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10561) Provide tooling for auto-generating Spark SQL reference manual

2015-09-25 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-10561:
---
Description: 
Here is the discussion thread:
http://search-hadoop.com/m/q3RTtcD20F1o62xE

Richard Hillegas made the following suggestion:

A machine-generated BNF, however, is easy to imagine. But perhaps not so easy 
to implement. Spark's SQL grammar is implemented in Scala, extending the DSL 
support provided by the Scala language. I am new to programming in Scala, so I 
don't know whether the Scala ecosystem provides any good tools for 
reverse-engineering a BNF from a class which extends 
scala.util.parsing.combinator.syntactical.StandardTokenParsers.


  was:
Here is the discussion thread:
http://search-hadoop.com/m/q3RTtcD20F1o62xE

Richard Hillegas made the following suggestion:

A machine-generated BNF, however, is easy to imagine. But perhaps not so easy 
to implement. Spark's SQL grammar is implemented in Scala, extending the DSL 
support provided by the Scala language. I am new to programming in Scala, so I 
don't know whether the Scala ecosystem provides any good tools for 
reverse-engineering a BNF from a class which extends 
scala.util.parsing.combinator.syntactical.StandardTokenParsers.


> Provide tooling for auto-generating Spark SQL reference manual
> --
>
> Key: SPARK-10561
> URL: https://issues.apache.org/jira/browse/SPARK-10561
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Reporter: Ted Yu
>
> Here is the discussion thread:
> http://search-hadoop.com/m/q3RTtcD20F1o62xE
> Richard Hillegas made the following suggestion:
> A machine-generated BNF, however, is easy to imagine. But perhaps not so easy 
> to implement. Spark's SQL grammar is implemented in Scala, extending the DSL 
> support provided by the Scala language. I am new to programming in Scala, so 
> I don't know whether the Scala ecosystem provides any good tools for 
> reverse-engineering a BNF from a class which extends 
> scala.util.parsing.combinator.syntactical.StandardTokenParsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10834) SPARK SQL doesn't support INSERT INTO ... VALUES

2015-09-25 Thread Antonio Piccolboni (JIRA)
Antonio Piccolboni created SPARK-10834:
--

 Summary: SPARK SQL doesn't support INSERT INTO ... VALUES
 Key: SPARK-10834
 URL: https://issues.apache.org/jira/browse/SPARK-10834
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Antonio Piccolboni


I guess the Summary has most of it. I am testing from a custom JDBC client, but 
others have run into this. Because of the way the thrift server is written, I 
think this happens in a HiveContext. Surprisingly though, Hive Server 2 does 
support this syntax, at least as tested in the HDP2 sandbox with default 
settings. This issue was created as a fork of the discussion in SPARK-10804.
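For reference, a minimal sketch of the kind of statement that fails, issued over HiveServer2-style JDBC; the host, table and values are illustrative and this is not the reporter's client code:

{code}
// Illustrative only: host, table, and values are made up for this sketch.
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thriftserver-host:10000/default")
val stmt = conn.createStatement()
// Hive Server 2 accepts this; the Spark SQL thrift server (1.5.0) rejects it.
stmt.execute("INSERT INTO test_table VALUES (1, 'a')")
stmt.close()
conn.close()
{code}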



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10804) "LOCAL" in LOAD DATA LOCAL INPATH means "remote"

2015-09-25 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908296#comment-14908296
 ] 

Antonio Piccolboni commented on SPARK-10804:


Good suggestion

SPARK-10834

> "LOCAL" in LOAD DATA LOCAL INPATH means "remote"
> 
>
> Key: SPARK-10804
> URL: https://issues.apache.org/jira/browse/SPARK-10804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Antonio Piccolboni
>
> Connecting with a remote thriftserver with a custom JDBC client or beeline, 
> load data local inpath fails. Hiveserver2 docs explain in a quick comment 
> that local now means local to the server. I think this is just a 
> rationalization for a bug. When a user types "local" 
> # it needs to be local to him, not some server 
> # Failing 1., one needs to have a way to determine what local means and 
> create a "local" item under the new definition. 
> With the thriftserver, I have a host to connect to, but I don't have any way 
> to create a file local to that host, at least in Spark. It may not be 
> desirable to create user directories on the thriftserver host or running file 
> transfer services like scp. Moreover, it appears that this syntax is unique 
> to Hive and Spark, but its origin can be traced to LOAD DATA LOCAL INFILE in 
> Oracle and was adopted by MySQL. In the latter's docs we can read: "If LOCAL is 
> specified, the file is read by the client program on the client host and sent 
> to the server. The file can be given as a full path name to specify its exact 
> location. If given as a relative path name, the name is interpreted relative 
> to the directory in which the client program was started". This is not to say 
> that the spark or hive teams are bound to what Oracle and Mysql do, but to 
> support the idea that the meaning of LOCAL is settled. For instance, the 
> Impala documentation says: "Currently, the Impala LOAD DATA statement only 
> imports files from HDFS, not from the local filesystem. It does not support 
> the LOCAL keyword of the Hive LOAD DATA statement." I think this is a better 
> solution. The way things are in thriftserver, I developed a client under the 
> assumption that I could use LOAD DATA LOCAL INPATH and all tests were 
> passing in standalone mode, only to find with the first distributed test that 
> # LOCAL means "local to server", a.k.a. "remote"
> # INSERT INTO ... VALUES is not supported
> # There is really no workaround unless one assumes access to whatever data 
> store Spark is running against, like HDFS, and that the user can upload data 
> to it. 
> In the space of workarounds it is not terrible, but if you are trying to 
> write a self-contained spark package, that's a defeat and makes writing tests 
> particularly hard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9103) Tracking spark's memory usage

2015-09-25 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908404#comment-14908404
 ] 

Zhang, Liye commented on SPARK-9103:


Hi @Imran Rashid, thanks for reviewing the doc. 
{quote}
1) Will the proposed design cover SPARK-9111, getting the memory when the 
executor dies abnormally, (esp when killed by yarn)? It seems to me the answer 
is "no", which is fine, that can be tackled separately, I just wanted to 
clarify.
{quote}
You are right, the answer is "no". This design is for phase 1, we can move it 
on later to cover [SPARK-9111|https://issues.apache.org/jira/browse/SPARK-9111].

{quote}
I see the complexity of having overlapping stages, but I wonder if it could be 
simplified somewhat. It seems to me you just need to maintain a 
executorToLatestMetrics: Map[executor, metrics], and then on every stage 
complete, you just log them all?
{quote}
Since we want to reduce the number of events to log, I didn't find a way to 
simplify this for overlapping stages. In the current implementation, we log the 
ExecutorMetrics of all the executors when a stage completes. I think this can be 
simplified by only logging the ExecutorMetrics of the executors related to that 
stage instead of all the executors. This will greatly reduce the number of 
events to log if there are many stages running on different executors.

{quote}
but it seems like there is more state & a bit more logging going on
{quote}
I don't quite understand what you mean by "*more state and more logging 
going on*"; can you explain it further?

{quote}
 I don't fully understand why you need to log both "CHB1" and "HB3" in your 
example.
{quote}
That is because the "CHB1" is the combined event, and "HB3" is the real event, 
we have to log "HB3" because there might be no heartbeat received for the stage 
that after "HB3" (just like stage2 in figure-1 described in the doc). And for 
that stage, it will use "HB3" instead of "CHB1" because "CHB1" is not the 
correct event it should refer to. 
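For concreteness, a minimal sketch of the executorToLatestMetrics idea discussed above; the ExecutorMetrics shape and the listener hooks are placeholders, not the types defined in the design doc:

{code}
// Placeholder types; the real design defines its own ExecutorMetrics/events.
import scala.collection.mutable

case class ExecutorMetrics(executorId: String, heapUsed: Long, offHeapUsed: Long)

class MemoryTrackingListener(logEvent: ExecutorMetrics => Unit) {
  private val executorToLatestMetrics = mutable.Map.empty[String, ExecutorMetrics]

  // On every heartbeat, keep only the latest sample per executor.
  def onExecutorHeartbeat(m: ExecutorMetrics): Unit =
    executorToLatestMetrics(m.executorId) = m

  // On stage completion, flush the latest sample of every known executor.
  def onStageCompleted(): Unit =
    executorToLatestMetrics.values.foreach(logEvent)
}
{code}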

> Tracking spark's memory usage
> -
>
> Key: SPARK-9103
> URL: https://issues.apache.org/jira/browse/SPARK-9103
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
> Attachments: Tracking Spark Memory Usage - Phase 1.pdf
>
>
> Currently Spark only provides a little memory usage information (RDD cache on 
> the web UI) for the executors. Users have no idea what the memory consumption 
> is when they are running Spark applications that use a lot of memory in the 
> executors. Especially when they encounter an OOM, it's really hard to know 
> what the cause of the problem is. So it would be helpful to give out detailed 
> memory consumption information for each part of Spark, so that users can have 
> a clear picture of where exactly the memory is used. 
> The memory usage info to expose should include, but is not limited to, 
> shuffle, cache, network, serializer, etc.
> Users can optionally choose to enable this functionality since this is mainly 
> for debugging and tuning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2015-09-25 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908389#comment-14908389
 ] 

Shivaram Venkataraman commented on SPARK-7736:
--

[~ztoth] Could you open a new JIRA for the SparkR problem?

> Exception not failing Python applications (in yarn cluster mode)
> 
>
> Key: SPARK-7736
> URL: https://issues.apache.org/jira/browse/SPARK-7736
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
> Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
>Reporter: Shay Rojansky
>Assignee: Marcelo Vanzin
> Fix For: 1.5.1, 1.6.0
>
>
> It seems that exceptions thrown in Python spark apps after the SparkContext 
> is instantiated don't cause the application to fail, at least in Yarn: the 
> application is marked as SUCCEEDED.
> Note that any exception right before the SparkContext correctly places the 
> application in FAILED state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2015-09-25 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908395#comment-14908395
 ] 

Meihua Wu commented on SPARK-7129:
--

[~josephkb] [~sethah]
I have compiled a doc for AdaBoost. 
https://docs.google.com/document/d/1Neo5_6po9ap7dZuT3fwT6ptJa_XvkUUdRgCqB51lcy4/edit#heading=h.d4mq6f37je6x

Thank you very much for reviewing it. I am looking forward to your comments.

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10828) Can we use the accumulo data RDD created from JAVA in spark, in sparkR?Is there any other way to proceed with it to create RRDD from a source RDD other than text RDD?O

2015-09-25 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908399#comment-14908399
 ] 

Shivaram Venkataraman commented on SPARK-10828:
---

I don't think we want to support new ways to read HDFS formats into SparkR 
-- IMHO the DataSource API is the right way to solve this problem, as it's well 
established now and works across Python, Scala, R, etc. 

You can check with the Accumulo project to see if they have plans to add a 
DataSource implementation. Also, the DataSource implementation does not need to 
live in the Spark source tree (see http://github.com/databricks/spark-avro for 
an example), so we don't need a JIRA in Spark to track this.
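For illustration, this is roughly what reading through a third-party data source looks like from Scala, using the spark-avro package mentioned above as the example; an Accumulo data source, if one existed, would plug in the same way. The path is a placeholder.

{code}
// In spark-shell (with the spark-avro package on the classpath), sqlContext
// already exists; otherwise create one from the SparkContext first.
val df = sqlContext.read
  .format("com.databricks.spark.avro")   // fully qualified data source name
  .load("/path/to/records.avro")         // placeholder path
df.printSchema()
{code}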

> Can we use the accumulo data RDD created from JAVA in spark, in sparkR?Is 
> there any other way to proceed with it to create RRDD from a source RDD other 
> than text RDD?Or to use any other format of data stored in HDFS in sparkR?
> --
>
> Key: SPARK-10828
> URL: https://issues.apache.org/jira/browse/SPARK-10828
> Project: Spark
>  Issue Type: Question
>  Components: R
>Affects Versions: 1.5.0
> Environment: ubuntu 12.04,8GB RAM,accumulo 1.6.3,hadoop 2.6
>Reporter: madhvi gupta
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2015-09-25 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908455#comment-14908455
 ] 

Seth Hendrickson commented on SPARK-7129:
-

A couple of quick comments:
* The design doc implies that we will have several different boosting 
predictors, whereas I initially thought this JIRA called for a single generic 
boosting predictor. So it seems like we'll have {{AdaBoostClassifier}}, 
{{LogitBoostClassifier}}, {{GradientBoostClassifier}} all separate boosting 
implementations instead of a single {{BoostedClassifier}} implementation that 
has a param like {{setAlgo("AdaBoost")}}. Personally, I think that a single 
generic implementation doesn't make as much sense, so I like the separation 
of different algorithms better, but I wanted to clarify.
* What are the base learners in the design doc? It looks like you propose to 
create a new {{Learner}} class. How will that interact with existing 
predictors? 
* I think {{AdaBoostClassifier}} is better than {{SAMMEClassifier}} since it is 
the classification analog of {{AdaBoostRegressor}}, plus we'll keep in line 
with the scikit-learn API. 
* Is {{setNumberOfBaseLearners}} equivalent to setting the number of boosting 
iterations? I ask because in the R mboost package, they accept a set of P 
candidate base learners where, at each boosting iteration, they train each one 
and select only the "best" base learner. If that were the case here, we would 
want to allow the user to specify multiple base learners. It seems as if we 
will not be doing that under the proposed architecture. Just wanted to clarify.

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6281) Support incremental updates for Graph

2015-09-25 Thread Rohit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908349#comment-14908349
 ] 

Rohit commented on SPARK-6281:
--

Hi, I see the issue is resolved with resolution "Won't Fix"!
I am currently using GraphX and I also have this requirement of updating the 
graph incrementally, especially adding new edges and deleting old edges. Is it 
possible somehow in the current version? As I understand it, this can currently 
only be done by taking a union of the new edge RDDs with the graph's edge RDD 
and creating the graph again (see the sketch below), which does not seem very 
efficient, since the new edges arrive in a stream at quite frequent intervals. 
Any plans to support this in the future?
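A minimal sketch of that rebuild-by-union workaround; the attribute types and default vertex attribute are illustrative:

{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Rebuilds the whole graph from the union of old and new edges -- the
// inefficiency discussed above, since every micro-batch triggers a full rebuild.
def appendEdges[VD: ClassTag, ED: ClassTag](
    graph: Graph[VD, ED],
    newEdges: RDD[Edge[ED]],
    defaultVertexAttr: VD): Graph[VD, ED] =
  Graph(graph.vertices, graph.edges.union(newEdges), defaultVertexAttr)
{code}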

> Support incremental updates for Graph
> -
>
> Key: SPARK-6281
> URL: https://issues.apache.org/jira/browse/SPARK-6281
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> Add api to efficiently append new vertices and edges into existing Graph,
> e.g., Graph#append(newVerts: RDD[(VertexId, VD)], newEdges: RDD[Edge[ED]], 
> defaultVertexAttr: VD)
> This is useful for time-evolving graphs; new vertices and edges are built from
> streaming data thru Spark Streaming, and then incrementally appended
> into a existing graph.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10835) Change Output of NGram to Array(String, True)

2015-09-25 Thread Sumit Chawla (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Chawla updated SPARK-10835:
-
Description: 
Currently the output type of NGram is Array(String, false), which is not 
compatible with LDA, since its input type is Array(String, true). 



  was:
Currently output type of Tokenizer is Array(String, false), which is not 
compatible with Word2Vec and Other transformers since their input type is 
Array(String, true). Seq[String] in udf will be treated as Array(String, true) 
by default. 

I'm also thinking for Nullable columns, maybe tokenizer should return 
Array(null) for null value in the input.


> Change Output of NGram to Array(String, True)
> -
>
> Key: SPARK-10835
> URL: https://issues.apache.org/jira/browse/SPARK-10835
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Sumit Chawla
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 1.5.0
>
>
> Currently the output type of NGram is Array(String, false), which is not 
> compatible with LDA, since its input type is Array(String, true). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9103) Tracking spark's memory usage

2015-09-25 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908404#comment-14908404
 ] 

Zhang, Liye edited comment on SPARK-9103 at 9/25/15 5:56 PM:
-

Hi [~irashid], thanks for reviewing the doc. 
{quote}
1) Will the proposed design cover SPARK-9111, getting the memory when the 
executor dies abnormally, (esp when killed by yarn)? It seems to me the answer 
is "no", which is fine, that can be tackled separately, I just wanted to 
clarify.
{quote}
You are right, the answer is "no". This design is for phase 1, we can move it 
on later to cover [SPARK-9111|https://issues.apache.org/jira/browse/SPARK-9111].

{quote}
I see the complexity of having overlapping stages, but I wonder if it could be 
simplified somewhat. It seems to me you just need to maintain a 
executorToLatestMetrics: Map[executor, metrics], and then on every stage 
complete, you just log them all?
{quote}
Since we want to reduce the number of events to log, I didn't find a way to 
simplify this for overlapping stages. In the current implementation, we log the 
ExecutorMetrics of all the executors when a stage completes. I think this can be 
simplified by only logging the ExecutorMetrics of the executors related to that 
stage instead of all the executors. This will greatly reduce the number of 
events to log if there are many stages running on different executors.

{quote}
but it seems like there is more state & a bit more logging going on
{quote}
I don't quite understand what you mean by "*more state and more logging 
going on*"; can you explain it further?

{quote}
 I don't fully understand why you need to log both "CHB1" and "HB3" in your 
example.
{quote}
That is because the "CHB1" is the combined event, and "HB3" is the real event, 
we have to log "HB3" because there might be no heartbeat received for the stage 
that after "HB3" (just like stage2 in figure-1 described in the doc). And for 
that stage, it will use "HB3" instead of "CHB1" because "CHB1" is not the 
correct event it should refer to. 


was (Author: liyezhang556520):
Hi @Imran Rashid, thanks for reviewing the doc. 
{quote}
1) Will the proposed design cover SPARK-9111, getting the memory when the 
executor dies abnormally, (esp when killed by yarn)? It seems to me the answer 
is "no", which is fine, that can be tackled separately, I just wanted to 
clarify.
{quote}
You are right, the answer is "no". This design is for phase 1, we can move it 
on later to cover [SPARK-9111|https://issues.apache.org/jira/browse/SPARK-9111].

{quote}
I see the complexity of having overlapping stages, but I wonder if it could be 
simplified somewhat. It seems to me you just need to maintain a 
executorToLatestMetrics: Map[executor, metrics], and then on every stage 
complete, you just log them all?
{quote}
Since we want to reduce the number of events to log, I didn't find a way to 
simplify this for overlapping stages. And in the current implementation, we log 
all the ExectorMetrics of all the executors when executor complete. I think 
this can be simplified by only log ExecutorMetrics of executors that is related 
to the stage instead of all the executors. This will reduce a lot of events to 
log if there are many stages running on different executors.

{quote}
but it seems like there is more state & a bit more logging going on
{quote}
I don't quite understand, what do you mean about "*more state and more logging 
going on*", can you explain it further?

{quote}
 I don't fully understand why you need to log both "CHB1" and "HB3" in your 
example.
{quote}
That is because the "CHB1" is the combined event, and "HB3" is the real event, 
we have to log "HB3" because there might be no heartbeat received for the stage 
that after "HB3" (just like stage2 in figure-1 described in the doc). And for 
that stage, it will use "HB3" instead of "CHB1" because "CHB1" is not the 
correct event it should refer to. 

> Tracking spark's memory usage
> -
>
> Key: SPARK-9103
> URL: https://issues.apache.org/jira/browse/SPARK-9103
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
> Attachments: Tracking Spark Memory Usage - Phase 1.pdf
>
>
> Currently Spark only provides a little memory usage information (RDD cache on 
> the web UI) for the executors. Users have no idea what the memory consumption 
> is when they are running Spark applications that use a lot of memory in the 
> executors. Especially when they encounter an OOM, it's really hard to know 
> what the cause of the problem is. So it would be helpful to give out detailed 
> memory consumption information for each part of Spark, so that users can have 
> a clear picture of where exactly the memory is used. 
> The memory usage info to expose should include but not limited to 

[jira] [Commented] (SPARK-10824) DataFrame show method - show(df) should show first N number of rows, similar to R

2015-09-25 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908431#comment-14908431
 ] 

Narine Kokhlikyan commented on SPARK-10824:
---

Thanks Shivaram, I see! I'll look at it and watch the jira. 

> DataFrame show method - show(df) should show first N number of rows, similar 
> to R
> -
>
> Key: SPARK-10824
> URL: https://issues.apache.org/jira/browse/SPARK-10824
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Hi everyone,
> currently, the show(dataframe) method shows some information about the columns 
> and their datatypes; however, R shows the first N rows of the dataframe, 
> basically the same as showDF. Right now I changed it so that show calls showDF.
> Also, the default number of rows was hard-coded in DataFrame.R; I set it as an 
> environment variable in sparkR.R. We can change it if you have other, better 
> suggestions.
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9883) Distance to each cluster given a point

2015-09-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908884#comment-14908884
 ] 

Joseph K. Bradley commented on SPARK-9883:
--

OK!  I'll be traveling next week, but I'll try to take a look soon.

> Distance to each cluster given a point
> --
>
> Key: SPARK-9883
> URL: https://issues.apache.org/jira/browse/SPARK-9883
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Bertrand Dechoux
>Priority: Minor
>
> Right now KMeansModel provides only a 'predict 'method which returns the 
> index of the closest cluster.
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector)
> It would be nice to have a method giving the distance to all clusters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10635) pyspark - running on a different host

2015-09-25 Thread Ben Duffield (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908738#comment-14908738
 ] 

Ben Duffield commented on SPARK-10635:
--

OK, good flag that there are other places where this would need to be considered.
How open would you be to a PR that addresses this? I.e., sure, it's an 
assumption now, but could we move away from that?

> pyspark - running on a different host
> -
>
> Key: SPARK-10635
> URL: https://issues.apache.org/jira/browse/SPARK-10635
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Ben Duffield
>
> At various points we assume we only ever talk to a driver on the same host.
> e.g. 
> https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615
> We use pyspark to connect to an existing driver (i.e. we do not let pyspark 
> launch the driver itself, but instead construct the SparkContext with the 
> gateway and jsc arguments).
> There are a few reasons for this, but essentially it's to allow more 
> flexibility when running in AWS.
> Before 1.3.1 we were able to monkeypatch around this:  
> {code}
> def _load_from_socket(port, serializer):
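> # Note: 'host' is not a parameter here; the patched function deliberately
> # closes over a host chosen by the caller instead of assuming the driver
> # runs on the same machine.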
> sock = socket.socket()
> sock.settimeout(3)
> try:
> sock.connect((host, port))
> rf = sock.makefile("rb", 65536)
> for item in serializer.load_stream(rf):
> yield item
> finally:
> sock.close()
> pyspark.rdd._load_from_socket = _load_from_socket
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10821) RandomForest serialization OOM during findBestSplits

2015-09-25 Thread Jay Luan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Luan updated SPARK-10821:
-
Description: 
I am getting an OOM during serialization for a relatively small dataset for a 
RandomForest. Even with spark.serializer.objectStreamReset set to 1, it is still 
running out of memory when attempting to serialize my data.

Stack Trace:
Traceback (most recent call last):
  File "/root/random_forest/random_forest_spark.py", line 198, in 
main()
  File "/root/random_forest/random_forest_spark.py", line 166, in main
trainModel(dset)
  File "/root/random_forest/random_forest_spark.py", line 191, in trainModel
impurity='gini', maxDepth=4, maxBins=32)
  File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 352, in 
trainClassifier
  File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 270, in 
_train
  File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 130, 
in callMLlibFunc
  File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 123, 
in callJavaFunc
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 
538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 
300, in get_return_value
py4j.protocol.Py4JJavaError
15/09/25 00:44:41 DEBUG BlockManagerSlaveEndpoint: Done removing RDD 7, response is 0
15/09/25 00:44:41 DEBUG BlockManagerSlaveEndpoint: Sent response: 0 to 
AkkaRpcEndpointRef(Actor[akka://sparkDriver/temp/$Mj])
: An error occurred while calling o89.trainRandomForestModel.
: java.lang.OutOfMemoryError
at 
java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2021)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
at 
org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:625)
at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235)
at 
org.apache.spark.mllib.tree.RandomForest$.trainClassifier(RandomForest.scala:291)
at 
org.apache.spark.mllib.api.python.PythonMLLibAPI.trainRandomForestModel(PythonMLLibAPI.scala:742)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

Details:

My RDD contains MLlib LabeledPoint objects, each holding a sparse vector. This 
RDD has a total size of roughly 45MB. My sparse vectors have a total length of 
~15 million, while only about 3000 or so entries are non-zero. It works fine for 
sparse vector sizes up to 10 million. 

My cluster is set up on AWS such that my master is an r3.8xlarge along with two 
r3.4xlarge workers. The driver has ~190GB allocated to it while my 

[jira] [Assigned] (SPARK-10836) Add SparkR sort function to dataframe

2015-09-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10836:


Assignee: Apache Spark

> Add SparkR sort function to dataframe
> -
>
> Key: SPARK-10836
> URL: https://issues.apache.org/jira/browse/SPARK-10836
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Assignee: Apache Spark
>Priority: Minor
>
> Hi everyone,
> the sort function can be used as an alternative to arrange(... ).
> As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of 
> orderings for columns and the list of columns, represented as string names
> for example: 
>  sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to 
> sort some of the columns in the same order
>  sort(df, decreasing=TRUE, "col1")
>  sort(df, decreasing=c(TRUE,FALSE), "col1","col2")
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10836) Add SparkR sort function to dataframe

2015-09-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10836:


Assignee: (was: Apache Spark)

> Add SparkR sort function to dataframe
> -
>
> Key: SPARK-10836
> URL: https://issues.apache.org/jira/browse/SPARK-10836
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Hi everyone,
> the sort function can be used as an alternative to arrange(... ).
> As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of 
> orderings for columns and the list of columns, represented as string names
> for example: 
>  sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to 
> sort some of the columns in the same order
>  sort(df, decreasing=TRUE, "col1")
>  sort(df, decreasing=c(TRUE,FALSE), "col1","col2")
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10837) TimeStamp could not work on sparksql very well

2015-09-25 Thread Yun Zhao (JIRA)
Yun Zhao created SPARK-10837:


 Summary: TimeStamp could not work on sparksql very well
 Key: SPARK-10837
 URL: https://issues.apache.org/jira/browse/SPARK-10837
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yun Zhao


create a file as follows:
{quote}
2015-09-02 09:06:00.000
2015-09-02 09:06:00.001
2015-09-02 09:06:00.100
2015-09-02 09:06:01.000
{quote}

Then upload it to HDFS; for example, put it at /test/testTable.

create table:
{quote}
CREATE EXTERNAL TABLE `testTable`(`createtime` timestamp) LOCATION 
'/test/testTable';
{quote}

Run the following SQL statements:
{quote}
select * from testTable where createtime = "2015-09-02 09:06:00.000";
select * from testTable where createtime > "2015-09-02 09:06:00.000";
select * from testTable where createtime >= "2015-09-02 09:06:00.000";
{quote}
The set of ">=" is not union set of "=" and ">".

But if the SQL statements are written as follows:
{quote}
select * from testTable where createtime = timestamp("2015-09-02 09:06:00.000");
select * from testTable where createtime > timestamp("2015-09-02 09:06:00.000");
select * from testTable where createtime >= timestamp("2015-09-02 
09:06:00.000");
{quote}
there is no such problem. 

Use *explain extended* to see the difference between the queries:
When "=" is used, "2015-09-02 09:06:00.000" is converted to a timestamp.
When ">" or ">=" is used, createtime is converted to a String.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10839) SPARK_DAEMON_MEMORY has effect on heap size of thriftserver

2015-09-25 Thread Yun Zhao (JIRA)
Yun Zhao created SPARK-10839:


 Summary: SPARK_DAEMON_MEMORY has effect on heap size of 
thriftserver
 Key: SPARK-10839
 URL: https://issues.apache.org/jira/browse/SPARK-10839
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.5.0, 1.4.1
Reporter: Yun Zhao


When SPARK_DAEMON_MEMORY is set in spark-env.sh to modify the memory of the 
Master or Worker, it also affects the heap size of the thriftserver; furthermore, 
this effect cannot be overridden by spark.driver.memory or --driver-memory. 
Version 1.3.1 does not have the same problem. 

in org.apache.spark.launcher.SparkSubmitCommandBuilder:
{quote}
String tsMemory =
isThriftServer(mainClass) ? System.getenv("SPARK_DAEMON_MEMORY") : null;
  String memory = firstNonEmpty(tsMemory,
firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, conf, props),
System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), 
DEFAULT_MEM);
  cmd.add("-Xms" + memory);
  cmd.add("-Xmx" + memory);
{quote}   

SPARK_DAEMON_MEMORY has the highest priority.

It can be modified like this:
{quote}
String memory = firstNonEmpty(firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, 
conf, props),
  System.getenv("SPARK_DRIVER_MEMORY"), tsMemory, 
System.getenv("SPARK_MEM"), DEFAULT_MEM);
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10836) Add SparkR sort function to dataframe

2015-09-25 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-10836:
-

 Summary: Add SparkR sort function to dataframe
 Key: SPARK-10836
 URL: https://issues.apache.org/jira/browse/SPARK-10836
 Project: Spark
  Issue Type: Sub-task
Reporter: Narine Kokhlikyan
Priority: Minor


Hi everyone,

the sort function can be used as an alternative to arrange(... ).
As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of 
orderings for columns and the list of columns, represented as string names

for example: 
 sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to sort 
some of the columns in the same order

 sort(df, decreasing=TRUE, "col1")
 sort(df, decreasing=c(TRUE,FALSE), "col1","col2")


Thanks,
Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10836) Add SparkR sort function to dataframe

2015-09-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908907#comment-14908907
 ] 

Apache Spark commented on SPARK-10836:
--

User 'NarineK' has created a pull request for this issue:
https://github.com/apache/spark/pull/8920

> Add SparkR sort function to dataframe
> -
>
> Key: SPARK-10836
> URL: https://issues.apache.org/jira/browse/SPARK-10836
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Hi everyone,
> the sort function can be used as an alternative to arrange(... ).
> As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of 
> orderings for columns and the list of columns, represented as string names
> for example: 
>  sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to 
> sort some of the columns in the same order
>  sort(df, decreasing=TRUE, "col1")
>  sort(df, decreasing=c(TRUE,FALSE), "col1","col2")
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10838) Repeat to join one DataFrame twice,there will be AnalysisException.

2015-09-25 Thread Yun Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yun Zhao updated SPARK-10838:
-
Description: 
The details of the exception are:
{quote}
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
attribute(s) col_a#1 missing from col_a#0,col_b#2,col_a#3,col_b#4 in operator 
!Join Inner, Some((col_b#2 = col_a#1));
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:908)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
at org.apache.spark.sql.DataFrame.join(DataFrame.scala:554)
at org.apache.spark.sql.DataFrame.join(DataFrame.scala:521)
{quote}

The related code is:
{quote}
import org.apache.spark.sql.SQLContext
import org.apache.spark.\{SparkContext, SparkConf}

object DFJoinTest extends App \{

  case class Foo(col_a: String)

  case class Bar(col_a: String, col_b: String)

  val sc = new SparkContext(new 
SparkConf().setMaster("local").setAppName("DFJoinTest"))
  val sqlContext = new SQLContext(sc)

  import sqlContext.implicits._

  val df1 = sc.parallelize(Array("1")).map(_.split(",")).map(p => 
Foo(p(0))).toDF()
  val df2 = sc.parallelize(Array("1,1")).map(_.split(",")).map(p => Bar(p(0), 
p(1))).toDF()

  val df3 = df1.join(df2, df1("col_a") === df2("col_a")).select(df1("col_a"), 
$"col_b")

  df3.join(df2, df3("col_b") === df2("col_a")).show()

  //  val df4 = df2.as("df4")
  //  df3.join(df4, df3("col_b") === df4("col_a")).show()

  //  df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show()

  sc.stop()
}
{quote}

When using
{quote}
val df4 = df2.as("df4")
df3.join(df4, df3("col_b") === df4("col_a")).show()
{quote}
there are errors, but when using
{quote}
df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show()
{quote}
it works normally.

  was:
The detail of exception is:
{quote}
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
attribute(s) col_a#1 missing from col_a#0,col_b#2,col_a#3,col_b#4 in operator 
!Join Inner, Some((col_b#2 = col_a#1));
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:908)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
at org.apache.spark.sql.DataFrame.join(DataFrame.scala:554)
at org.apache.spark.sql.DataFrame.join(DataFrame.scala:521)
{quote}

The related codes are:
{quote}
object DFJoinTest extends App {

  case class Foo(col_a: String)

  case class Bar(col_a: String, col_b: String)

  val sc = new SparkContext(new 
SparkConf().setMaster("local").setAppName("DFJoinTest"))
  val sqlContext = new SQLContext(sc)

  import sqlContext.implicits._

  val df1 = sc.parallelize(Array("1")).map(_.split(",")).map(p => 
Foo(p(0))).toDF()
  val df2 = sc.parallelize(Array("1,1")).map(_.split(",")).map(p => Bar(p(0), 
p(1))).toDF()

  val df3 = df1.join(df2, df1("col_a") === df2("col_a")).select(df1("col_a"), 
$"col_b")

  df3.join(df2, df3("col_b") === df2("col_a")).show()

  //  val df4 = df2.as("df4")
  //  df3.join(df4, df3("col_b") === df4("col_a")).show()

  //  df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show()

  sc.stop()
}
{quote}

When uses
{quote}
val df4 = df2.as("df4")
df3.join(df4, df3("col_b") === 

[jira] [Created] (SPARK-10838) Repeat to join one DataFrame twice,there will be AnalysisException.

2015-09-25 Thread Yun Zhao (JIRA)
Yun Zhao created SPARK-10838:


 Summary: Repeat to join one DataFrame twice,there will be 
AnalysisException.
 Key: SPARK-10838
 URL: https://issues.apache.org/jira/browse/SPARK-10838
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: Yun Zhao


The details of the exception are:
{quote}
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
attribute(s) col_a#1 missing from col_a#0,col_b#2,col_a#3,col_b#4 in operator 
!Join Inner, Some((col_b#2 = col_a#1));
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:908)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
at org.apache.spark.sql.DataFrame.join(DataFrame.scala:554)
at org.apache.spark.sql.DataFrame.join(DataFrame.scala:521)
{quote}

The related code is:
{quote}
object DFJoinTest extends App {

  case class Foo(col_a: String)

  case class Bar(col_a: String, col_b: String)

  val sc = new SparkContext(new 
SparkConf().setMaster("local").setAppName("DFJoinTest"))
  val sqlContext = new SQLContext(sc)

  import sqlContext.implicits._

  val df1 = sc.parallelize(Array("1")).map(_.split(",")).map(p => 
Foo(p(0))).toDF()
  val df2 = sc.parallelize(Array("1,1")).map(_.split(",")).map(p => Bar(p(0), 
p(1))).toDF()

  val df3 = df1.join(df2, df1("col_a") === df2("col_a")).select(df1("col_a"), 
$"col_b")

  df3.join(df2, df3("col_b") === df2("col_a")).show()

  //  val df4 = df2.as("df4")
  //  df3.join(df4, df3("col_b") === df4("col_a")).show()

  //  df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show()

  sc.stop()
}
{quote}

When using
{quote}
val df4 = df2.as("df4")
df3.join(df4, df3("col_b") === df4("col_a")).show()
{quote}
there are errors, but when using
{quote}
df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show()
{quote}
it works normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10839) SPARK_DAEMON_MEMORY has effect on heap size of thriftserver

2015-09-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908998#comment-14908998
 ] 

Apache Spark commented on SPARK-10839:
--

User 'xiaowen147' has created a pull request for this issue:
https://github.com/apache/spark/pull/8921

> SPARK_DAEMON_MEMORY has effect on heap size of thriftserver
> ---
>
> Key: SPARK-10839
> URL: https://issues.apache.org/jira/browse/SPARK-10839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Yun Zhao
>
> When SPARK_DAEMON_MEMORY is set in spark-env.sh to modify the memory of the 
> Master or Worker, it also affects the heap size of the thriftserver; 
> furthermore, this effect cannot be overridden by spark.driver.memory or 
> --driver-memory. Version 1.3.1 does not have the same problem. 
> in org.apache.spark.launcher.SparkSubmitCommandBuilder:
> {quote}
> String tsMemory =
> isThriftServer(mainClass) ? System.getenv("SPARK_DAEMON_MEMORY") : 
> null;
>   String memory = firstNonEmpty(tsMemory,
> firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, conf, props),
> System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), 
> DEFAULT_MEM);
>   cmd.add("-Xms" + memory);
>   cmd.add("-Xmx" + memory);
> {quote} 
> SPARK_DAEMON_MEMORY has the highest priority.
> It can be modified like this:
> {quote}
> String memory = firstNonEmpty(firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, 
> conf, props),
>   System.getenv("SPARK_DRIVER_MEMORY"), tsMemory, 
> System.getenv("SPARK_MEM"), DEFAULT_MEM);
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10839) SPARK_DAEMON_MEMORY has effect on heap size of thriftserver

2015-09-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10839:


Assignee: Apache Spark

> SPARK_DAEMON_MEMORY has effect on heap size of thriftserver
> ---
>
> Key: SPARK-10839
> URL: https://issues.apache.org/jira/browse/SPARK-10839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Yun Zhao
>Assignee: Apache Spark
>
> When SPARK_DAEMON_MEMORY is set in spark-env.sh to modify the memory of the 
> Master or Worker, it also affects the heap size of the thriftserver; 
> furthermore, this effect cannot be overridden by spark.driver.memory or 
> --driver-memory. Version 1.3.1 does not have the same problem. 
> in org.apache.spark.launcher.SparkSubmitCommandBuilder:
> {quote}
> String tsMemory =
> isThriftServer(mainClass) ? System.getenv("SPARK_DAEMON_MEMORY") : 
> null;
>   String memory = firstNonEmpty(tsMemory,
> firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, conf, props),
> System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), 
> DEFAULT_MEM);
>   cmd.add("-Xms" + memory);
>   cmd.add("-Xmx" + memory);
> {quote} 
> SPARK_DAEMON_MEMORY has the highest priority.
> It can be modified like this:
> {quote}
> String memory = firstNonEmpty(firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, 
> conf, props),
>   System.getenv("SPARK_DRIVER_MEMORY"), tsMemory, 
> System.getenv("SPARK_MEM"), DEFAULT_MEM);
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10839) SPARK_DAEMON_MEMORY has effect on heap size of thriftserver

2015-09-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10839:


Assignee: (was: Apache Spark)

> SPARK_DAEMON_MEMORY has effect on heap size of thriftserver
> ---
>
> Key: SPARK-10839
> URL: https://issues.apache.org/jira/browse/SPARK-10839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Yun Zhao
>
> When SPARK_DAEMON_MEMORY is set in spark-env.sh to change the memory of the Master
> or Worker, it also changes the heap size of the Thrift server, and this effect
> cannot be overridden by spark.driver.memory or --driver-memory. Version 1.3.1
> does not have this problem.
> In org.apache.spark.launcher.SparkSubmitCommandBuilder:
> {code}
> String tsMemory =
>   isThriftServer(mainClass) ? System.getenv("SPARK_DAEMON_MEMORY") : null;
> String memory = firstNonEmpty(tsMemory,
>   firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, conf, props),
>   System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), DEFAULT_MEM);
> cmd.add("-Xms" + memory);
> cmd.add("-Xmx" + memory);
> {code}
> SPARK_DAEMON_MEMORY therefore has the highest priority.
> It could be changed like this:
> {code}
> String memory = firstNonEmpty(
>   firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, conf, props),
>   System.getenv("SPARK_DRIVER_MEMORY"), tsMemory,
>   System.getenv("SPARK_MEM"), DEFAULT_MEM);
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10836) SparkR: Add sort function to dataframe

2015-09-25 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-10836:
--
Summary: SparkR: Add sort function to dataframe  (was: Add SparkR sort 
function to dataframe)

> SparkR: Add sort function to dataframe
> --
>
> Key: SPARK-10836
> URL: https://issues.apache.org/jira/browse/SPARK-10836
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Hi everyone,
> the sort function can be used as an alternative to arrange(...).
> It accepts x (a DataFrame), decreasing (TRUE/FALSE, or a vector of per-column
> orderings), and the columns to sort by, given as string names.
> For example:
>  sort(df, TRUE, "col1", "col2", "col3", "col5")  # sort several columns in the same order
>  sort(df, decreasing=TRUE, "col1")
>  sort(df, decreasing=c(TRUE,FALSE), "col1", "col2")
> Thanks,
> Narine
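> 
> For comparison only, the Scala DataFrame API already exposes equivalent per-column ordering; a minimal sketch (data and column names are illustrative):
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.functions.{asc, desc}
> 
> object SortSketch {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sort-sketch"))
>     val sqlContext = new SQLContext(sc)
>     import sqlContext.implicits._
> 
>     val df = sc.parallelize(Seq((3, "a"), (1, "b"), (2, "c"))).toDF("col1", "col2")
> 
>     // Roughly the Scala counterpart of the proposed SparkR call
>     //   sort(df, decreasing = c(TRUE, FALSE), "col1", "col2")
>     df.sort(desc("col1"), asc("col2")).show()
> 
>     sc.stop()
>   }
> }
> {code}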



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work

2015-09-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10829:


Assignee: Apache Spark

> Scan DataSource with predicate expression combine partition key and 
> attributes doesn't work
> ---
>
> Key: SPARK-10829
> URL: https://issues.apache.org/jira/browse/SPARK-10829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Apache Spark
>Priority: Blocker
>
> To reproduce, run the following code:
> {code}
> withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
>   withTempPath { dir =>
>     val path = s"${dir.getCanonicalPath}/part=1"
>     (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
>     // If the "part = 1" filter gets pushed down, this query will throw an
>     // exception since "part" is not a valid column in the actual Parquet file
>     checkAnswer(
>       sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 1)"),
>       (2 to 3).map(i => Row(i, i.toString, 1)))
>   }
> }
> {code}
> The expected result is:
> {code}
> 2, 1
> 3, 1
> {code}
> But we get:
> {code}
> 1, 1
> 2, 1
> 3, 1
> {code}
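> 
> The snippet above relies on test-suite helpers (withSQLConf, withTempPath, checkAnswer). A standalone sketch of the same scenario, with an illustrative temporary path and local master:
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.SQLContext
> 
> object PartitionPredicateRepro {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("spark-10829-repro"))
>     val sqlContext = new SQLContext(sc)
>     import sqlContext.implicits._
> 
>     // Same layout as in the report; the temporary directory is illustrative.
>     val path = java.nio.file.Files.createTempDirectory("spark-10829").toString + "/part=1"
>     sc.parallelize(1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
> 
>     // Filter pushdown enabled, as in the original test.
>     sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
> 
>     // Expected: only the rows with a > 1, since part is 1 and "part = 0" is false.
>     sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 1)").show()
> 
>     sc.stop()
>   }
> }
> {code}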



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work

2015-09-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907646#comment-14907646
 ] 

Apache Spark commented on SPARK-10829:
--

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/8916

> Scan DataSource with predicate expression combine partition key and 
> attributes doesn't work
> ---
>
> Key: SPARK-10829
> URL: https://issues.apache.org/jira/browse/SPARK-10829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Blocker
>
> To reproduce, run the following code:
> {code}
> withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
>   withTempPath { dir =>
>     val path = s"${dir.getCanonicalPath}/part=1"
>     (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
>     // If the "part = 1" filter gets pushed down, this query will throw an
>     // exception since "part" is not a valid column in the actual Parquet file
>     checkAnswer(
>       sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 1)"),
>       (2 to 3).map(i => Row(i, i.toString, 1)))
>   }
> }
> {code}
> The expected result is:
> {code}
> 2, 1
> 3, 1
> {code}
> But we get:
> {code}
> 1, 1
> 2, 1
> 3, 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work

2015-09-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10829:


Assignee: (was: Apache Spark)

> Scan DataSource with predicate expression combine partition key and 
> attributes doesn't work
> ---
>
> Key: SPARK-10829
> URL: https://issues.apache.org/jira/browse/SPARK-10829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Blocker
>
> To reproduce, run the following code:
> {code}
> withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
>   withTempPath { dir =>
>     val path = s"${dir.getCanonicalPath}/part=1"
>     (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
>     // If the "part = 1" filter gets pushed down, this query will throw an
>     // exception since "part" is not a valid column in the actual Parquet file
>     checkAnswer(
>       sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 1)"),
>       (2 to 3).map(i => Row(i, i.toString, 1)))
>   }
> }
> {code}
> The expected result is:
> {code}
> 2, 1
> 3, 1
> {code}
> But we get:
> {code}
> 1, 1
> 2, 1
> 3, 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10802) Let ALS recommend for subset of data

2015-09-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10802:
--
Priority: Minor  (was: Major)

> Let ALS recommend for subset of data
> 
>
> Key: SPARK-10802
> URL: https://issues.apache.org/jira/browse/SPARK-10802
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> Currently MatrixFactorizationModel allows getting recommendations for
> - a single user
> - a single product
> - all users
> - all products
> Recommendations for all users/products do a cartesian join internally.
> It would be useful in some cases to get recommendations for a subset of
> users/products by providing an RDD with which MatrixFactorizationModel could
> do an intersection before the cartesian join. This would be much faster when
> recommendations are needed only for a subset of users/products, and when that
> subset is still too large to make recommending one-by-one feasible.
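> 
> A minimal sketch of what recommending for a user subset could look like today by joining against the model's factor RDDs directly (the helper below is hypothetical, not an existing MLlib API):
> {code}
> import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
> import org.apache.spark.rdd.RDD
> 
> // Hypothetical helper: score products only for the users in `userSubset` by
> // joining the subset with userFeatures before the cartesian join against
> // productFeatures, instead of scoring every user in the model.
> def recommendForUserSubset(
>     model: MatrixFactorizationModel,
>     userSubset: RDD[Int],
>     num: Int): RDD[(Int, Array[(Int, Double)])] = {
>   val subsetFeatures = userSubset.map(u => (u, ())).join(model.userFeatures).mapValues(_._2)
>   subsetFeatures.cartesian(model.productFeatures).map {
>     case ((user, uFeats), (product, pFeats)) =>
>       val score = uFeats.zip(pFeats).map { case (u, p) => u * p }.sum
>       (user, (product, score))
>   }.groupByKey()
>    .mapValues(_.toArray.sortBy(-_._2).take(num))
> }
> {code}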



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10828) Can we use the accumulo data RDD created from JAVA in spark, in sparkR?Is there any other way to proceed with it to create RRDD from a source RDD other than text RDD?Or

2015-09-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10828.
---
Resolution: Invalid

Please ask questions on u...@spark.apache.org
See https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> Can we use the accumulo data RDD created from JAVA in spark, in sparkR?Is 
> there any other way to proceed with it to create RRDD from a source RDD other 
> than text RDD?Or to use any other format of data stored in HDFS in sparkR?
> --
>
> Key: SPARK-10828
> URL: https://issues.apache.org/jira/browse/SPARK-10828
> Project: Spark
>  Issue Type: Question
>  Components: R
>Affects Versions: 1.5.0
> Environment: ubuntu 12.04,8GB RAM,accumulo 1.6.3,hadoop 2.6
>Reporter: madhvi gupta
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10828) Can we use the accumulo data RDD created from JAVA in spark, in sparkR?Is there any other way to proceed with it to create RRDD from a source RDD other than text RDD?O

2015-09-25 Thread madhvi gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907725#comment-14907725
 ] 

madhvi gupta commented on SPARK-10828:
--

Hey, I have posted this question on the Spark mailing list as well. There I was told 
to post it as an issue here to allow a better discussion.
See https://www.mail-archive.com/user@spark.apache.org/msg37450.html


> Can we use the accumulo data RDD created from JAVA in spark, in sparkR?Is 
> there any other way to proceed with it to create RRDD from a source RDD other 
> than text RDD?Or to use any other format of data stored in HDFS in sparkR?
> --
>
> Key: SPARK-10828
> URL: https://issues.apache.org/jira/browse/SPARK-10828
> Project: Spark
>  Issue Type: Question
>  Components: R
>Affects Versions: 1.5.0
> Environment: ubuntu 12.04,8GB RAM,accumulo 1.6.3,hadoop 2.6
>Reporter: madhvi gupta
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-09-25 Thread Sebastian YEPES FERNANDEZ (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907747#comment-14907747
 ] 

Sebastian YEPES FERNANDEZ commented on SPARK-10309:
---

Is there currently any workaround for this issue?

I am also hitting it with the latest 1.5.1:
{code:title=Error|borderStyle=solid}
Caused by: java.io.IOException: Unable to acquire 33554432 bytes of memory
 at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
 at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:138)
 at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
 at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:74)
 at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:56)
 at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:339)
 ... 8 more
{code}

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>
> While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on 
> executor):
> {code}
> java.io.IOException: Unable to acquire 33554432 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The task could finish after a retry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9681) Support R feature interactions in RFormula

2015-09-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-9681.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8830
[https://github.com/apache/spark/pull/8830]

> Support R feature interactions in RFormula
> --
>
> Key: SPARK-9681
> URL: https://issues.apache.org/jira/browse/SPARK-9681
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 1.6.0
>
>
> Support the interaction (":") operator in the RFormula feature transformer, so that 
> it is available for use in SparkR's glm.
> Umbrella design doc for RFormula integration: 
> https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?pli=1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark

2015-09-25 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907817#comment-14907817
 ] 

Ondřej Smola commented on SPARK-8734:
-

I am going to work on this - I need at least the ability to set the Docker network 
type. Any suggestions on what else will be useful?

> Expose all Mesos DockerInfo options to Spark
> 
>
> Key: SPARK-8734
> URL: https://issues.apache.org/jira/browse/SPARK-8734
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Chris Heller
>Priority: Minor
>
> SPARK-2691 only exposed a few options from the DockerInfo message. It would 
> be reasonable to expose them all, especially given one can now specify 
> arbitrary parameters to docker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8734) Expose all Mesos DockerInfo options to Spark

2015-09-25 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907817#comment-14907817
 ] 

Ondřej Smola edited comment on SPARK-8734 at 9/25/15 8:53 AM:
--

I am going to work on this - I need at least the ability to set the Docker network 
type. Any suggestions on what else can be useful?


was (Author: ondrej.smola):
I am going to work on this - i need at least ability to set docker network 
type. Any suggestions on what else will be useful?

> Expose all Mesos DockerInfo options to Spark
> 
>
> Key: SPARK-8734
> URL: https://issues.apache.org/jira/browse/SPARK-8734
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Chris Heller
>Priority: Minor
>
> SPARK-2691 only exposed a few options from the DockerInfo message. It would 
> be reasonable to expose them all, especially given one can now specify 
> arbitrary parameters to docker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10830) Install unittest-xml-reporting on Jenkins

2015-09-25 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10830:
-

 Summary: Install unittest-xml-reporting on Jenkins
 Key: SPARK-10830
 URL: https://issues.apache.org/jira/browse/SPARK-10830
 Project: Spark
  Issue Type: New Feature
  Components: Tests
Reporter: Xiangrui Meng
Assignee: shane knapp


SPARK-7021 uses https://pypi.python.org/pypi/unittest-xml-reporting/1.12.0 to 
report Python unit test results to Jenkins. It requires unittest-xml-reporting 
to be installed on the Jenkins workers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10831) Spark SQL Configuration missing in the doc

2015-09-25 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10831:
-

 Summary: Spark SQL Configuration missing in the doc
 Key: SPARK-10831
 URL: https://issues.apache.org/jira/browse/SPARK-10831
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Reporter: Cheng Hao


E.g.
spark.sql.codegen
spark.sql.planner.sortMergeJoin
spark.sql.dialect
spark.sql.caseSensitive
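
A minimal sketch of setting and reading these keys programmatically (values are illustrative; whether each key is still honored depends on the Spark version in use):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlConfSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sqlconf-sketch"))
    val sqlContext = new SQLContext(sc)

    // Illustrative values for the keys listed above.
    sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")
    sqlContext.setConf("spark.sql.caseSensitive", "true")
    // The same keys can also be set from SQL.
    sqlContext.sql("SET spark.sql.dialect=sql")

    println(sqlContext.getConf("spark.sql.caseSensitive"))
    sc.stop()
  }
}
{code}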



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10797) RDD's coalesce should not write out the temporary key

2015-09-25 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Zvara updated SPARK-10797:
-
Description: 
It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle files) 
temporary keys used on the shuffle code path. Consider the following code:

{code:title=RDD.scala|borderStyle=solid}
if (shuffle) {
  /** Distributes elements evenly across output partitions, starting from a 
random partition. */
  val distributePartition = (index: Int, items: Iterator[T]) => {
var position = (new Random(index)).nextInt(numPartitions)
items.map { t =>
  // Note that the hash code of the key will just be the key itself.
  // The HashPartitioner will mod it with the number of total partitions.
  position = position + 1
  (position, t)
}
  } : Iterator[(Int, T)]

  // include a shuffle step so that our upstream tasks are still distributed
  new CoalescedRDD(
new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
new HashPartitioner(numPartitions)),
numPartitions).values
} else {
{code}

{{ShuffledRDD}} will hash using {{position}} as keys as in the 
{{distributePartition}} function. After the bucket has been chosen by the 
sorter {{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}, the 
{{DiskBlockObjectWriter}} writes out both the (temporary) key and value to the 
specified partition. On the next stage, after reading we take only the values 
with {{PairRDDFunctions}}.

This certainly has a performance impact, as we unnecessarily write/read {{Int}} 
and transform the data.

  was:
It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle files) 
temporary keys used on the shuffle code path. Consider the following code:

{code:title=RDD.scala|borderStyle=solid}
if (shuffle) {
  /** Distributes elements evenly across output partitions, starting from a 
random partition. */
  val distributePartition = (index: Int, items: Iterator[T]) => {
var position = (new Random(index)).nextInt(numPartitions)
items.map { t =>
  // Note that the hash code of the key will just be the key itself. 
The HashPartitioner
  // will mod it with the number of total partitions.
  position = position + 1
  (position, t)
}
  } : Iterator[(Int, T)]

  // include a shuffle step so that our upstream tasks are still distributed
  new CoalescedRDD(
new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
new HashPartitioner(numPartitions)),
numPartitions).values
} else {
{code}

{{ShuffledRDD}} will hash using {{position}} as keys as in the 
{{distributePartition}} function. After the bucket has been chosen by the 
sorter {{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}, the 
{{DiskBlockObjectWriter}} writes out both the (temporary) key and value to the 
specified partition. On the next stage, after reading we take only the values 
with {{PairRDDFunctions}}.

This certainly has a performance impact, as we unnecessarily write/read 
{{Int}}s and transform the data.
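
A minimal usage sketch (local mode; sizes are illustrative) of the code path described above: passing shuffle = true, which is required when coalesce is used to increase the partition count, routes every element through the keyed ShuffledRDD, so the temporary Int key is written to the shuffle files even though only the values are kept afterwards.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceShuffleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("coalesce-shuffle-sketch"))
    val data = sc.parallelize(1 to 1000, 2)

    // shuffle = true takes the (Int, T) keyed shuffle path shown above.
    val widened = data.coalesce(8, shuffle = true)
    println(widened.partitions.length) // 8

    sc.stop()
  }
}
{code}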


> RDD's coalesce should not write out the temporary key
> -
>
> Key: SPARK-10797
> URL: https://issues.apache.org/jira/browse/SPARK-10797
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Zoltán Zvara
>
> It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle 
> files) temporary keys used on the shuffle code path. Consider the following 
> code:
> {code:title=RDD.scala|borderStyle=solid}
> if (shuffle) {
>   /** Distributes elements evenly across output partitions, starting from 
> a random partition. */
>   val distributePartition = (index: Int, items: Iterator[T]) => {
> var position = (new Random(index)).nextInt(numPartitions)
> items.map { t =>
>   // Note that the hash code of the key will just be the key itself.
>   // The HashPartitioner will mod it with the number of total partitions.
>   position = position + 1
>   (position, t)
> }
>   } : Iterator[(Int, T)]
>   // include a shuffle step so that our upstream tasks are still 
> distributed
>   new CoalescedRDD(
> new ShuffledRDD[Int, T, 
> T](mapPartitionsWithIndex(distributePartition),
> new HashPartitioner(numPartitions)),
> numPartitions).values
> } else {
> {code}
> {{ShuffledRDD}} will hash using {{position}} as keys as in the 
> {{distributePartition}} function. After the bucket has been chosen by the 
> sorter {{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}, the 
> {{DiskBlockObjectWriter}} writes out both the (temporary) key and value to 
> the specified partition. On 

[jira] [Assigned] (SPARK-10831) Spark SQL Configuration missing in the doc

2015-09-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10831:


Assignee: (was: Apache Spark)

> Spark SQL Configuration missing in the doc
> --
>
> Key: SPARK-10831
> URL: https://issues.apache.org/jira/browse/SPARK-10831
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Cheng Hao
>
> E.g.
> spark.sql.codegen
> spark.sql.planner.sortMergeJoin
> spark.sql.dialect
> spark.sql.caseSensitive



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10831) Spark SQL Configuration missing in the doc

2015-09-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907796#comment-14907796
 ] 

Apache Spark commented on SPARK-10831:
--

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/8917

> Spark SQL Configuration missing in the doc
> --
>
> Key: SPARK-10831
> URL: https://issues.apache.org/jira/browse/SPARK-10831
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Cheng Hao
>
> E.g.
> spark.sql.codegen
> spark.sql.planner.sortMergeJoin
> spark.sql.dialect
> spark.sql.caseSensitive



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10831) Spark SQL Configuration missing in the doc

2015-09-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10831:


Assignee: Apache Spark

> Spark SQL Configuration missing in the doc
> --
>
> Key: SPARK-10831
> URL: https://issues.apache.org/jira/browse/SPARK-10831
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Apache Spark
>
> E.g.
> spark.sql.codegen
> spark.sql.planner.sortMergeJoin
> spark.sql.dialect
> spark.sql.caseSensitive



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark

2015-09-25 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907963#comment-14907963
 ] 

Ondřej Smola commented on SPARK-8734:
-

+1   spark.mesos.executor.docker.parameter.

> Expose all Mesos DockerInfo options to Spark
> 
>
> Key: SPARK-8734
> URL: https://issues.apache.org/jira/browse/SPARK-8734
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Chris Heller
>Priority: Minor
> Attachments: network.diff
>
>
> SPARK-2691 only exposed a few options from the DockerInfo message. It would 
> be reasonable to expose them all, especially given one can now specify 
> arbitrary parameters to docker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Description: 
hi all,
   when using the JavaSparkSQL example, the code was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

unfortunately, sometimes the completed applications web page shows "No event 
logs found for application", but the majority of runs of the same application are normal.

the wrong log picture is 






  was:
hi all,
   when  using JavaSparkSQL example,the code was submit many times as following:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

unfortunately , sometimes completed applications web shows has"No event 
logs found for application",but  a majority of same application is nomal .

the wrong  log picture is 






> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: screenshot-1.png
>
>
> hi all,
>when  using JavaSparkSQL example,the code was submit many times as 
> following:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> unfortunately , sometimes completed applications web shows has"No event 
> logs found for application",but  a majority of same application is nomal .
> the wrong  log picture is 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Description: 
hi all,
   when using the JavaSparkSQL example, the code was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

unfortunately, sometimes the completed applications web page shows "No event 
logs found for application", but the majority of runs of the same application are normal.

the wrong log picture is 





  was:

hi all,
   when  using JavaSparkSQL example,the code was submit many times as following:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

unfortunately , sometimes completed applications web shows has"No event 
logs found for application",but  a majority of same application is nomal .

the wrong  log picture is 





> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: screenshot-1.png
>
>
> hi all,
>when  using JavaSparkSQL example,the code was submit many times as 
> following:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> unfortunately , sometimes completed applications web shows has"No event 
> logs found for application",but  a majority of same application is nomal .
> the wrong  log picture is 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Attachment: 1.jpg

> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: 1.jpg, screenshot-1.png
>
>
> hi all,
>when  using JavaSparkSQL example,the code was submit many times as 
> following:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> unfortunately , sometimes completed applications web shows has"No event 
> logs found for application",but  a majority of same application is nomal .
> the wrong  log picture is 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907984#comment-14907984
 ] 

Ricky Yang commented on SPARK-10832:


15/09/25 19:00:09 INFO Master: Registering app JavaSparkSQL
15/09/25 19:00:09 INFO Master: Registered app JavaSparkSQL with ID 
app-20150925190009-0242
15/09/25 19:00:09 INFO Master: Launching executor app-20150925190009-0242/0 on 
worker worker-20150923201210-10.27.1.142-8
079
15/09/25 19:00:09 INFO Master: Launching executor app-20150925190009-0242/1 on 
worker worker-20150923201210-10.27.1.138-8
15/09/25 19:00:09 INFO Master: Launching executor app-20150925190009-0242/1 on 
worker worker-20150923201210-10.27.1.138-8
079
15/09/25 19:00:11 INFO Master: akka.tcp://driverClient@10.27.1.143:47123 got 
disassociated, removing it.
15/09/25 19:00:11 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://driverClient@10.27.1.143:47
123] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/09/25 19:00:11 INFO Master: akka.tcp://driverClient@10.27.1.143:47123 got 
disassociated, removing it.
15/09/25 19:00:14 INFO Master: Driver submitted 
org.apache.spark.deploy.worker.DriverWrapper
15/09/25 19:00:14 INFO Master: Launching driver driver-20150925190014-0205 on 
worker worker-20150923201210-10.27.1.143-80
79
15/09/25 19:00:17 INFO Master: Registering app JavaSparkPi
15/09/25 19:00:17 INFO Master: Registered app JavaSparkPi with ID 
app-20150925190017-0243
15/09/25 19:00:17 INFO Master: Launching executor app-20150925190017-0243/0 on 
worker worker-20150923201210-10.27.1.142-8
079
15/09/25 19:00:17 INFO Master: Launching executor app-20150925190017-0243/1 on 
worker worker-20150923201210-10.27.1.138-8
079
15/09/25 19:00:20 INFO Master: akka.tcp://driverClient@10.27.1.143:44975 got 
disassociated, removing it.
15/09/25 19:00:20 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://driverClient@10.27.1.143:44
975] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/09/25 19:00:20 INFO Master: akka.tcp://driverClient@10.27.1.143:44975 got 
disassociated, removing it.
15/09/25 19:00:21 INFO Master: Received unregister request from application 
app-20150925190009-0242
15/09/25 19:00:21 INFO Master: Removing app app-20150925190009-0242
15/09/25 19:00:21 WARN Master: Application JavaSparkSQL is still in progress, 
it may be terminated abnormally.
15/09/25 19:00:21 WARN Master: No event logs found for application JavaSparkSQL 
in hdfs://SuningHadoop2/sparklogs/sparklo
gshistorylog.
15/09/25 19:00:21 INFO Master: akka.tcp://sparkDriver@10.27.1.143:57388 got 
disassociated, removing it.
15/09/25 19:00:22 WARN Master: Got status update for unknown executor 
app-20150925190009-0242/1
15/09/25 19:00:21 INFO Master: Removing app app-20150925190009-0242
15/09/25 19:00:21 WARN Master: Application JavaSparkSQL is still in progress, 
it may be terminated abnormally.
15/09/25 19:00:21 WARN Master: No event logs found for application JavaSparkSQL 
in hdfs://SuningHadoop2/sparklogs/sparklo
gshistorylog.
15/09/25 19:00:21 INFO Master: akka.tcp://sparkDriver@10.27.1.143:57388 got 
disassociated, removing it.
15/09/25 19:00:22 WARN Master: Got status update for unknown executor 
app-20150925190009-0242/1
15/09/25 19:00:22 WARN Master: Got status update for unknown executor 
app-20150925190009-0242/0

> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: screenshot-1-1.png, screenshot-1.png, screenshot-2.png
>
>
> hi all,
>when  using JavaSparkSQL example,the code was submit many times as 
> following:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> unfortunately , sometimes completed applications web shows has"No event 
> logs found for application",but  a majority of same application is nomal .
> the wrong  log picture is screenshot-1.png and screenshot-1-1.png
> and The following is master log  as screenshot-2.png



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Description: 
hi all,
   when using the JavaSparkSQL example, the code was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

unfortunately, sometimes the completed applications web page shows "No event 
logs found for application", but the majority of runs of the same application are normal.

the wrong log pictures are screenshot-1.png and screenshot-1-1.png,
and the master log is shown in screenshot-2.png






  was:
hi all,
   when  using JavaSparkSQL example,the code was submit many times as following:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

unfortunately , sometimes completed applications web shows has"No event 
logs found for application",but  a majority of same application is nomal .

the wrong  log picture is screenshot-1.png
and The following is master log  as screenshot-2.png







> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: screenshot-1-1.png, screenshot-1.png, screenshot-2.png
>
>
> hi all,
>when  using JavaSparkSQL example,the code was submit many times as 
> following:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> unfortunately , sometimes completed applications web shows has"No event 
> logs found for application",but  a majority of same application is nomal .
> the wrong  log picture is screenshot-1.png and screenshot-1-1.png
> and The following is master log  as screenshot-2.png



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2015-09-25 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908001#comment-14908001
 ] 

Zsolt Tóth commented on SPARK-7736:
---

As I see, this is also a problem for SparkR applications in yarn-cluster mode. 
Is there an open JIRA for that?

> Exception not failing Python applications (in yarn cluster mode)
> 
>
> Key: SPARK-7736
> URL: https://issues.apache.org/jira/browse/SPARK-7736
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
> Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
>Reporter: Shay Rojansky
>Assignee: Marcelo Vanzin
> Fix For: 1.5.1, 1.6.0
>
>
> It seems that exceptions thrown in Python spark apps after the SparkContext 
> is instantiated don't cause the application to fail, at least in Yarn: the 
> application is marked as SUCCEEDED.
> Note that any exception right before the SparkContext correctly places the 
> application in FAILED state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10791) Optimize MLlib LDA topic distribution query performance

2015-09-25 Thread Marko Asplund (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907975#comment-14907975
 ] 

Marko Asplund commented on SPARK-10791:
---

This performance issue was actually discussed on the spark mailing list.
Please see full discussion here: 
https://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/browser

My tests were performed on a single node.

> Optimize MLlib LDA topic distribution query performance
> ---
>
> Key: SPARK-10791
> URL: https://issues.apache.org/jira/browse/SPARK-10791
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
> Environment: Ubuntu 13.10, Oracle Java 8
>Reporter: Marko Asplund
>
> I've been testing MLlib LDA training with 100 topics, 105 K vocabulary size 
> and ~3.4 M documents using EMLDAOptimizer.
> Training the model took ~2.5 hours with MLlib, whereas with Vowpal Wabbit, 
> training with the same data and on the same system setup took ~5 minutes. 
> Loading the persisted model from disk (~2 minutes), as well as querying LDA 
> model topic distributions (~4 seconds for one document) are also quite slow 
> operations.
> Our application is querying LDA model topic distribution (for one doc at a 
> time) as part of end-user operation execution flow, so a ~4 second execution 
> time is very problematic.
> The log includes the following message, which AFAIK, should mean that 
> netlib-java is using machine optimised native implementation: 
> "com.github.fommil.jni.JniLoader - successfully loaded 
> /tmp/jniloader4682745056459314976netlib-native_system-linux-x86_64.so"
> My test code can be found here:
> https://github.com/marko-asplund/tech-protos/blob/08e9819a2108bf6bd4d878253c4aa32510a0a9ce/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala#L56-L57
> I also tried using the OnlineLDAOptimizer, but there wasn't a noticeable 
> change in training performance. Model loading time was reduced to ~ 5 seconds 
> from ~ 2 minutes (now persisted as LocalLDAModel). However, query / 
> prediction time was unchanged.
> Unfortunately, this is the critical performance characteristic in our case.
> I did some profiling for my LDA prototype code that requests topic 
> distributions from a model. According to Java Mission Control more than 80 % 
> of execution time during sample interval is spent in the following methods:
> - org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%
> - org.apache.commons.math3.special.Gamma.digamma(double); count: 164; 22.91%
> - org.apache.commons.math3.util.FastMath.log(double, double[]); count: 50;
> 6.98%
> - java.lang.Double.valueOf(double); count: 31; 4.33%
> Is there any way of using the API more optimally?
> Are there any opportunities for optimising the "topicDistributions" code
> path in MLlib?
> My query test code looks like this essentially:
> // executed once
> val model = LocalLDAModel.load(ctx, ModelFileName)
> // executed four times
> val samples = Transformers.toSparseVectors(vocabularySize,
> ctx.parallelize(Seq(input))) // fast
> model.topicDistributions(samples.zipWithIndex.map(_.swap)) // <== this
> seems to take about 4 seconds to execute



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Description: 

hi all,
   when using the JavaSparkSQL example, the code was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

unfortunately, sometimes the completed applications web page shows "No event 
logs found for application", but the majority of runs of the same application are normal.

the wrong log picture is 




  was:
hi all,
   when  using JavaSparkSQL example,the code was submit many times as following:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

unfortunately , sometimes completed applications web shows has"No event 
logs found for application",but  a majority of same application is nomal .

the wrong  log picture is 





> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: screenshot-1.png
>
>
> hi all,
>when  using JavaSparkSQL example,the code was submit many times as 
> following:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> unfortunately , sometimes completed applications web shows has"No event 
> logs found for application",but  a majority of same application is nomal .
> the wrong  log picture is 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Description: 
hi all,
   when using the JavaSparkSQL example, the code was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

unfortunately, sometimes the completed applications web page shows "No event 
logs found for application", but the majority of runs of the same application are normal.

the wrong log picture is 




  was:
hi all,
   when  using JavaSparkSQL example,the code was submit many times as following:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

unfortunately , sometimes completed applications web shows has"No event 
logs found for application",but  a majority of same application is nomal .

the wrong  log picture is 




> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: screenshot-1.png
>
>
> hi all,
>when  using JavaSparkSQL example,the code was submit many times as 
> following:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> unfortunately , sometimes completed applications web shows has"No event 
> logs found for application",but  a majority of same application is nomal .
> the wrong  log picture is 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Attachment: screenshot-1.png

> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: screenshot-1.png
>
>
> hi all,
>when  using JavaSparkSQL example,the code was submit many times as 
> following:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> unfortunately , sometimes completed applications web shows has"No event 
> logs found for application",but  a majority of same application is nomal .
> the wrong  log picture is 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Attachment: (was: 1.jpg)

> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> hi all,
>when  using JavaSparkSQL example,the code was submit many times as 
> following:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> unfortunately , sometimes completed applications web shows has"No event 
> logs found for application",but  a majority of same application is nomal .
> the wrong  log picture is screenshot-1.png
> and The following is master log  as 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-09-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907986#comment-14907986
 ] 

Sean Owen commented on SPARK-10390:
---

My guess is it ends up building in a different Guava dependency when built via 
SBT? I'm still not entirely sure. I do know the dependency resolution rules are 
different and that's why only the Maven build 'counts'. I'd try Maven, anyway, 
just to see if it works. If not then we know this guess isn't correct.

> Py4JJavaError java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
> 
>
> Key: SPARK-10390
> URL: https://issues.apache.org/jira/browse/SPARK-10390
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Zoltán Zvara
>
> While running PySpark through iPython.
> {code}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsedMillis()J
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
>   at 
> org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {{spark-env.sh}}
> {code}
> export IPYTHON=1
> export PYSPARK_PYTHON=/usr/bin/python3
> export PYSPARK_DRIVER_PYTHON=ipython3
> export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
> {code}
> Spark built with:
> {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}}
> Not a problem when built against {{Hadoop 2.4}}!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark

2015-09-25 Thread Martin Tapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907966#comment-14907966
 ] 

Martin Tapp commented on SPARK-8734:


Same here, spark.mesos.executor.docker.parameter. is fine by me.
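
For illustration, a hedged sketch of how the naming scheme discussed here could look from a SparkConf; the {{spark.mesos.executor.docker.parameter.*}} keys are the proposal in this thread, not a released option, and the image name is a placeholder:

{code}
import org.apache.spark.SparkConf

// Hypothetical usage of the proposed prefix: each arbitrary `docker run`
// flag becomes one suffixed configuration key.
val conf = new SparkConf()
  .set("spark.mesos.executor.docker.image", "example/spark-executor:latest")
  .set("spark.mesos.executor.docker.parameter.memory-swap", "2g")
  .set("spark.mesos.executor.docker.parameter.dns", "8.8.8.8")
{code}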




> Expose all Mesos DockerInfo options to Spark
> 
>
> Key: SPARK-8734
> URL: https://issues.apache.org/jira/browse/SPARK-8734
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Chris Heller
>Priority: Minor
> Attachments: network.diff
>
>
> SPARK-2691 only exposed a few options from the DockerInfo message. It would 
> be reasonable to expose them all, especially given one can now specify 
> arbitrary parameters to docker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Description: 
Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event 
logs found for application", but the majority of runs of the same application are normal.

The screenshot of the failure is 


  was:
Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event 
logs found for application", but the majority of runs of the same application are normal.


> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
>
> Hi all,
> When using the JavaSparkSQL example, the job was submitted many times as 
> follows:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> Unfortunately, the completed-applications web page sometimes shows "No event 
> logs found for application", but the majority of runs of the same application are normal.
> The screenshot of the failure is 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)
Ricky Yang created SPARK-10832:
--

 Summary: sometimes No event logs found for application using same 
JavaSparkSQL  example
 Key: SPARK-10832
 URL: https://issues.apache.org/jira/browse/SPARK-10832
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Ricky Yang


Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event 
logs found for application", but the majority of runs of the same application are normal.
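
For context: the completed-applications page can typically only reconstruct an application's UI when event logging was enabled for that run and written somewhere the master can read. A minimal sketch of the relevant settings follows (the HDFS log directory is a placeholder, not taken from this report):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// spark.eventLog.enabled / spark.eventLog.dir are the standard settings;
// the path below is an assumption for illustration only.
val conf = new SparkConf()
  .setAppName("JavaSparkSQL")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs://SuningHadoop2/user/spark/spark-events")
val sc = new SparkContext(conf)
{code}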



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Description: 
Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event 
logs found for application", but the majority of runs of the same application are normal.

The screenshot of the failure is 



  was:
Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event 
logs found for application", but the majority of runs of the same application are normal.

The screenshot of the failure is 



> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
>
> Hi all,
> When using the JavaSparkSQL example, the job was submitted many times as 
> follows:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> Unfortunately, the completed-applications web page sometimes shows "No event 
> logs found for application", but the majority of runs of the same application are normal.
> The screenshot of the failure is 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Description: 
Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event 
logs found for application", but the majority of runs of the same application are normal.

The screenshot of the failure is screenshot-1.png.






  was:
Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event 
logs found for application", but the majority of runs of the same application are normal.

The screenshot of the failure is 







> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: 1.jpg, screenshot-1.png
>
>
> Hi all,
> When using the JavaSparkSQL example, the job was submitted many times as 
> follows:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> Unfortunately, the completed-applications web page sometimes shows "No event 
> logs found for application", but the majority of runs of the same application are normal.
> The screenshot of the failure is screenshot-1.png.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Description: 
Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event 
logs found for application", but the majority of runs of the same application are normal.

The screenshot of the failure is screenshot-1.png,
and the master log is shown in screenshot-2.png.






  was:
Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event 
logs found for application", but the majority of runs of the same application are normal.

The screenshot of the failure is screenshot-1.png,
and the master log is shown in 







> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Hi all,
> When using the JavaSparkSQL example, the job was submitted many times as 
> follows:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> Unfortunately, the completed-applications web page sometimes shows "No event 
> logs found for application", but the majority of runs of the same application are normal.
> The screenshot of the failure is screenshot-1.png,
> and the master log is shown in screenshot-2.png.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Attachment: screenshot-2.png

> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Hi all,
> When using the JavaSparkSQL example, the job was submitted many times as 
> follows:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> Unfortunately, the completed-applications web page sometimes shows "No event 
> logs found for application", but the majority of runs of the same application are normal.
> The screenshot of the failure is screenshot-1.png,
> and the master log is shown in 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Description: 
Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event 
logs found for application", but the majority of runs of the same application are normal.

The screenshot of the failure is screenshot-1.png,
and the master log is shown in 






  was:
Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event 
logs found for application", but the majority of runs of the same application are normal.

The screenshot of the failure is screenshot-1.png.







> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Hi all,
> When using the JavaSparkSQL example, the job was submitted many times as 
> follows:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> Unfortunately, the completed-applications web page sometimes shows "No event 
> logs found for application", but the majority of runs of the same application are normal.
> The screenshot of the failure is screenshot-1.png,
> and the master log is shown in 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Attachment: screenshot-1-1.png

> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: screenshot-1-1.png, screenshot-1.png, screenshot-2.png
>
>
> Hi all,
> When using the JavaSparkSQL example, the job was submitted many times as 
> follows:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> Unfortunately, the completed-applications web page sometimes shows "No event 
> logs found for application", but the majority of runs of the same application are normal.
> The screenshot of the failure is screenshot-1.png,
> and the master log is shown in screenshot-2.png.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Description: 
Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event logs found 
for application", but the majority of runs of the same application are normal.

The screenshots of the failure are screenshot-1.png and screenshot-1-1.png,
and the master log is shown in screenshot-2.png.






  was:
Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event logs found 
for application", but the majority of runs of the same application are normal.

The screenshots of the failure are screenshot-1.png and screenshot-1-1.png,
and the master log is shown in screenshot-2.png.







> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: screenshot-1-1.png, screenshot-1.png, screenshot-2.png
>
>
> Hi all,
> When using the JavaSparkSQL example, the job was submitted many times as 
> follows:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> Unfortunately, the completed-applications web page sometimes shows "No event logs 
> found for application", but the majority of runs of the same application are normal.
> The screenshots of the failure are screenshot-1.png and screenshot-1-1.png,
> and the master log is shown in screenshot-2.png.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example

2015-09-25 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang updated SPARK-10832:
---
Description: 
Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event logs found 
for application", but the majority of runs of the same application are normal.

The screenshots of the failure are screenshot-1.png and screenshot-1-1.png,
and the master log is shown in screenshot-2.png.






  was:
Hi all,
When using the JavaSparkSQL example, the job was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.sql.JavaSparkSQL 
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar

Unfortunately, the completed-applications web page sometimes shows "No event 
logs found for application", but the majority of runs of the same application are normal.

The screenshots of the failure are screenshot-1.png and screenshot-1-1.png,
and the master log is shown in screenshot-2.png.







> sometimes No event logs found for application using same JavaSparkSQL  example
> --
>
> Key: SPARK-10832
> URL: https://issues.apache.org/jira/browse/SPARK-10832
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Ricky Yang
> Attachments: screenshot-1-1.png, screenshot-1.png, screenshot-2.png
>
>
> Hi all,
> When using the JavaSparkSQL example, the job was submitted many times as 
> follows:
> /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.sql.JavaSparkSQL 
> hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
> Unfortunately, the completed-applications web page sometimes shows "No event logs 
> found for application", but the majority of runs of the same application are normal.
> The screenshots of the failure are screenshot-1.png and screenshot-1-1.png,
> and the master log is shown in screenshot-2.png.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10802) Let ALS recommend for subset of data

2015-09-25 Thread Tomasz Bartczak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907669#comment-14907669
 ] 

Tomasz Bartczak commented on SPARK-10802:
-

Hmm, you are probably referring to the method

predict(usersProducts: RDD[(Int, Int)]): RDD[Rating]

but what I am referring to is top-K recommendations for a subset of users.

Using
 recommendProducts(user: Int, num: Int): Array[Rating] is quite slow when called 
in a loop for many users, and
 recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] is overhead 
when I only need results for a subset of users.

I imagine a method:

 recommendProductsForUsers(users: RDD[Int], num: Int): RDD[(Int, Array[Rating])]
that would first retain the user features for the given users and then do a 
cartesian join with the product features.
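
To make that concrete, here is a rough sketch of what such a method could do against the public {{userFeatures}}/{{productFeatures}} RDDs (an illustration of the idea, not the actual MLlib implementation):

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

// Sketch: keep only the requested users' factors, then score and take top-K.
def recommendProductsForUsers(model: MatrixFactorizationModel,
                              users: RDD[Int],
                              num: Int): RDD[(Int, Array[Rating])] = {
  // retain feature vectors for the requested users only
  val userSubset = model.userFeatures
    .join(users.map(u => (u, ())))
    .mapValues { case (features, _) => features }

  userSubset.cartesian(model.productFeatures)
    .map { case ((user, uf), (product, pf)) =>
      val score = uf.zip(pf).map { case (a, b) => a * b }.sum
      (user, Rating(user, product, score))
    }
    .groupByKey()
    .mapValues(_.toArray.sortBy(-_.rating).take(num))
}
{code}

The built-in recommendProductsForUsers(num) blockifies the factor RDDs before joining, so an actual implementation would presumably filter userFeatures and then reuse that blocked path rather than a plain cartesian.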

> Let ALS recommend for subset of data
> 
>
> Key: SPARK-10802
> URL: https://issues.apache.org/jira/browse/SPARK-10802
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Tomasz Bartczak
>
> Currently MatrixFactorizationModel allows getting recommendations for
> - a single user 
> - a single product 
> - all users
> - all products
> Recommendations for all users/products do a cartesian join internally.
> It would be useful in some cases to get recommendations for a subset of 
> users/products by providing an RDD with which MatrixFactorizationModel could 
> do an intersection before doing the cartesian join. This would make it much 
> faster in situations where recommendations are needed only for a subset of 
> users/products, and where the subset is still too large to make it feasible to 
> recommend one-by-one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10833) Inline, organize BSD/MIT licenses in LICENSE

2015-09-25 Thread Sean Owen (JIRA)
Sean Owen created SPARK-10833:
-

 Summary: Inline, organize BSD/MIT licenses in LICENSE
 Key: SPARK-10833
 URL: https://issues.apache.org/jira/browse/SPARK-10833
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.5.0
Reporter: Sean Owen
Assignee: Sean Owen


In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to 
light that the guidance at 
http://www.apache.org/dev/licensing-howto.html#permissive-deps on 
permissively-licensed dependencies has a different interpretation than the one 
we (er, I) had been operating under: "pointer ... to the license within the source 
tree" specifically means a copy of the license within Spark's own distribution, 
whereas at the moment Spark's LICENSE points to each project's license 
in the *other project's* source tree.

The remedy is simply to inline all such license references (i.e. BSD/MIT 
licenses) or to include their text in a "licenses" subdirectory and point to that.

Along the way, we can also treat other BSD/MIT licenses, whose text has been 
inlined into LICENSE, in the same way.

The LICENSE file can continue to provide a helpful list of BSD/MIT licensed 
projects and a pointer to their sites. This would be over and above including 
license text in the distro, which is the essential thing.

I do not think this blocks a current release, since there's a good-faith 
argument that the current practice satisfies the terms of the third-party 
licenses as well. (If it didn't, this would be a blocker for any further 
release.) However, of course it's better to follow the best practice going 
forward.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10833) Inline, organize BSD/MIT licenses in LICENSE

2015-09-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10833:


Assignee: Apache Spark  (was: Sean Owen)

> Inline, organize BSD/MIT licenses in LICENSE
> 
>
> Key: SPARK-10833
> URL: https://issues.apache.org/jira/browse/SPARK-10833
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>
> In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to 
> light that the guidance at 
> http://www.apache.org/dev/licensing-howto.html#permissive-deps on 
> permissively-licensed dependencies has a different interpretation than the one 
> we (er, I) had been operating under: "pointer ... to the license within the 
> source tree" specifically means a copy of the license within Spark's own 
> distribution, whereas at the moment Spark's LICENSE points to each 
> project's license in the *other project's* source tree.
> The remedy is simply to inline all such license references (i.e. BSD/MIT 
> licenses) or to include their text in a "licenses" subdirectory and point to that.
> Along the way, we can also treat other BSD/MIT licenses, whose text has been 
> inlined into LICENSE, in the same way.
> The LICENSE file can continue to provide a helpful list of BSD/MIT licensed 
> projects and a pointer to their sites. This would be over and above including 
> license text in the distro, which is the essential thing.
> I do not think this blocks a current release, since there's a good-faith 
> argument that the current practice satisfies the terms of the third-party 
> licenses as well. (If it didn't, this would be a blocker for any further 
> release.) However, of course it's better to follow the best practice going 
> forward.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10833) Inline, organize BSD/MIT licenses in LICENSE

2015-09-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908038#comment-14908038
 ] 

Apache Spark commented on SPARK-10833:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8919

> Inline, organize BSD/MIT licenses in LICENSE
> 
>
> Key: SPARK-10833
> URL: https://issues.apache.org/jira/browse/SPARK-10833
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to 
> light that the guidance at 
> http://www.apache.org/dev/licensing-howto.html#permissive-deps on 
> permissively-licensed dependencies has a different interpretation than the one 
> we (er, I) had been operating under: "pointer ... to the license within the 
> source tree" specifically means a copy of the license within Spark's own 
> distribution, whereas at the moment Spark's LICENSE points to each 
> project's license in the *other project's* source tree.
> The remedy is simply to inline all such license references (i.e. BSD/MIT 
> licenses) or to include their text in a "licenses" subdirectory and point to that.
> Along the way, we can also treat other BSD/MIT licenses, whose text has been 
> inlined into LICENSE, in the same way.
> The LICENSE file can continue to provide a helpful list of BSD/MIT licensed 
> projects and a pointer to their sites. This would be over and above including 
> license text in the distro, which is the essential thing.
> I do not think this blocks a current release, since there's a good-faith 
> argument that the current practice satisfies the terms of the third-party 
> licenses as well. (If it didn't, this would be a blocker for any further 
> release.) However, of course it's better to follow the best practice going 
> forward.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark

2015-09-25 Thread Chris Heller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908050#comment-14908050
 ] 

Chris Heller commented on SPARK-8734:
-

I pushed some code for the parameters up to my branch, though it's untested at 
the moment ... I tried to rebase the code onto master and am now getting build 
errors.

> Expose all Mesos DockerInfo options to Spark
> 
>
> Key: SPARK-8734
> URL: https://issues.apache.org/jira/browse/SPARK-8734
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Chris Heller
>Priority: Minor
> Attachments: network.diff
>
>
> SPARK-2691 only exposed a few options from the DockerInfo message. It would 
> be reasonable to expose them all, especially given one can now specify 
> arbitrary parameters to docker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9941) Try ML pipeline API on Kaggle competitions

2015-09-25 Thread Kristina Plazonic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908036#comment-14908036
 ] 

Kristina Plazonic commented on SPARK-9941:
--

I would love to do Avito Context Ad Clicks - 
https://www.kaggle.com/c/avito-context-ad-clicks - but it involves a lot of 
feature engineering and preprocessing. I would love to split this with somebody 
else if anybody is interested in working on it. 

Thanks!
Kristina

> Try ML pipeline API on Kaggle competitions
> --
>
> Key: SPARK-9941
> URL: https://issues.apache.org/jira/browse/SPARK-9941
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This is an umbrella JIRA to track some fun tasks :)
> We have built many features under the ML pipeline API, and we want to see how 
> it works on real-world datasets, e.g., Kaggle competition datasets 
> (https://www.kaggle.com/competitions). We want to invite community members to 
> help test. The goal is NOT to win the competitions but to provide code 
> examples and to find out missing features and other issues to help shape the 
> roadmap.
> For people who are interested, please do the following:
> 1. Create a subtask (or leave a comment if you cannot create a subtask) to 
> claim a Kaggle dataset.
> 2. Use the ML pipeline API to build and tune an ML pipeline that works for 
> the Kaggle dataset (a minimal skeleton is sketched after this list).
> 3. Paste the code to gist (https://gist.github.com/) and provide the link 
> here.
> 4. Report missing features, issues, running times, and accuracy.
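
For anyone claiming a dataset, a minimal Spark 1.5-style skeleton of step 2 might look like the following ({{training}} and the "text"/"label" column names are placeholders, not tied to any particular competition):

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// `training` is assumed to be a DataFrame with "text" and "label" columns.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Basic tuning over a small grid, evaluated with 3-fold cross-validation.
val grid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1 << 16, 1 << 18))
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
val model = cv.fit(training)
{code}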



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10833) Inline, organize BSD/MIT licenses in LICENSE

2015-09-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10833:


Assignee: Sean Owen  (was: Apache Spark)

> Inline, organize BSD/MIT licenses in LICENSE
> 
>
> Key: SPARK-10833
> URL: https://issues.apache.org/jira/browse/SPARK-10833
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to 
> light that the guidance at 
> http://www.apache.org/dev/licensing-howto.html#permissive-deps on 
> permissively-licensed dependencies has a different interpretation than the one 
> we (er, I) had been operating under: "pointer ... to the license within the 
> source tree" specifically means a copy of the license within Spark's own 
> distribution, whereas at the moment Spark's LICENSE points to each 
> project's license in the *other project's* source tree.
> The remedy is simply to inline all such license references (i.e. BSD/MIT 
> licenses) or to include their text in a "licenses" subdirectory and point to that.
> Along the way, we can also treat other BSD/MIT licenses, whose text has been 
> inlined into LICENSE, in the same way.
> The LICENSE file can continue to provide a helpful list of BSD/MIT licensed 
> projects and a pointer to their sites. This would be over and above including 
> license text in the distro, which is the essential thing.
> I do not think this blocks a current release, since there's a good-faith 
> argument that the current practice satisfies the terms of the third-party 
> licenses as well. (If it didn't, this would be a blocker for any further 
> release.) However, of course it's better to follow the best practice going 
> forward.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10766) Add some configurations for the client process in yarn-cluster mode.

2015-09-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10766:


Assignee: Apache Spark

> Add some configurations for the client process in yarn-cluster mode. 
> -
>
> Key: SPARK-10766
> URL: https://issues.apache.org/jira/browse/SPARK-10766
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>Assignee: Apache Spark
>
> In yarn-cluster mode it's hard to find the correct configuration for the 
> client process, 
> but it is sometimes necessary, for example for the client process's classpath: if I want to 
> use HBase on Spark, I have to include the HBase jars in the client's classpath.
> *spark.driver.extraClassPath* doesn't take effect there. The only way I have found is to put 
> the HBase jars into the SPARK_CLASSPATH environment variable. 
> That isn't a good approach, so I want to add some configuration for this client 
> process.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


