Re: Null pointer exception when using com.databricks.spark.csv

2016-03-29 Thread Hyukjin Kwon
Hi, I guess this is not a CSV-datasource-specific problem. Does loading any file (e.g. via textFile()) work as well? I think this is related to this thread: http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-example-scala-application-using-spark-submit-td10056.html .
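A minimal sketch of the sanity check suggested above, assuming a Spark 1.x setup with the spark-csv package on the classpath; the input path is a placeholder, not taken from the post:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("csv-check").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // 1) Does a plain text read of the same file work at all?
    val lines = sc.textFile("/path/to/data.csv")   // placeholder path
    println(s"line count: ${lines.count()}")

    // 2) Then try the CSV datasource itself (spark-csv, Spark 1.x style).
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/path/to/data.csv")
    df.show(5)

If the plain textFile() read fails the same way, the problem is likely in the environment or classpath rather than in the CSV datasource.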

aggregateByKey on PairRDD

2016-03-29 Thread Suniti Singh
Hi All, I have an RDD with data in the following form: tempRDD: RDD[(String, (String, String))], i.e. (brand, (product, key)): ("amazon",("book1","tech")), ("eBay",("book1","tech")), ("barns",("book","tech")), ("amazon",("book2","tech")). I would like to group the data by brand and would
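A minimal sketch of one way to collect the (product, key) pairs per brand with aggregateByKey, assuming the RDD shape quoted above (tempRDD and the sample tuples are from the post; everything else is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("aggregateByKey-example").setMaster("local[*]"))

    // (brand, (product, key))
    val tempRDD: RDD[(String, (String, String))] = sc.parallelize(Seq(
      ("amazon", ("book1", "tech")),
      ("eBay",   ("book1", "tech")),
      ("barns",  ("book",  "tech")),
      ("amazon", ("book2", "tech"))
    ))

    // Gather all (product, key) pairs for each brand.
    val byBrand: RDD[(String, List[(String, String)])] =
      tempRDD.aggregateByKey(List.empty[(String, String)])(
        (acc, value) => value :: acc,    // seqOp: fold a value into a partition-local list
        (left, right) => left ::: right  // combOp: merge lists from different partitions
      )

    byBrand.collect().foreach(println)
    // e.g. (amazon,List((book2,tech), (book1,tech)))

groupByKey would produce a similar grouping; aggregateByKey is usually preferred when the aggregated value is smaller than the raw values, since values are combined per key on the map side before the shuffle.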

Any documentation on Spark's security model beyond YARN?

2016-03-29 Thread Michael Segel
Hi, So yeah, I know that Spark jobs running on a Hadoop cluster will inherit their security from the underlying YARN job. However… that’s not really saying much when you think about some use cases, like using the thrift service… I’m wondering what else is new and what people have been

Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-29 Thread Joseph Bradley
This is great feedback to hear. I think there was discussion about moving Pipelines outside of ML at some point, but I'll have to spend more time to dig it up. In the meantime, I thought I'd mention this JIRA here in case people have feedback: https://issues.apache.org/jira/browse/SPARK-14033

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-29 Thread Tom Graves
+1. Tom On Tuesday, March 29, 2016 1:17 PM, Reynold Xin wrote: They work. On Tue, Mar 29, 2016 at 10:01 AM, Koert Kuipers wrote: if scala prior to 2.10.4 didn't support java 8, does that mean that 3rd party scala libraries compiled with a

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-29 Thread Reynold Xin
They work. On Tue, Mar 29, 2016 at 10:01 AM, Koert Kuipers wrote: > if scala prior to 2.10.4 didn't support java 8, does that mean that > 3rd party scala libraries compiled with a scala version < 2.10.4 might not > work on java 8? > > > On Mon, Mar 28, 2016 at 7:06 PM,

Fwd: Master options Cluster/Client discrepancies.

2016-03-29 Thread satyajit vegesna
Hi All, I have written a Spark program on my dev box (IDE: IntelliJ, Scala version: 2.11.7, Spark version: 1.6.1). It runs fine from the IDE when I provide proper input and output paths, including the master. But when I try to deploy the code to my cluster, made up of the below, Spark
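A frequent source of IDE-versus-cluster discrepancies is hard-coding the master URL in the program; a hedged sketch (class name, paths, and the spark-submit invocation are illustrative, not taken from the post) of leaving master and deploy mode to spark-submit:

    import org.apache.spark.{SparkConf, SparkContext}

    object MyJob {
      def main(args: Array[String]): Unit = {
        // No setMaster() here: let spark-submit supply --master and
        // --deploy-mode (client or cluster) at launch time.
        val conf = new SparkConf().setAppName("MyJob")
        val sc = new SparkContext(conf)

        val input  = args(0)   // input path from the command line
        val output = args(1)   // output path from the command line
        sc.textFile(input).saveAsTextFile(output)

        sc.stop()
      }
    }

    // Launched with, for example:
    //   spark-submit --class MyJob --master yarn --deploy-mode cluster myjob.jar <in> <out>

In client mode the driver runs where spark-submit is invoked, while in cluster mode it runs inside the cluster, so local paths that work from the IDE may not be visible to a cluster-mode driver.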

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-29 Thread Koert Kuipers
if scala prior to 2.10.4 didn't support java 8, does that mean that 3rd party scala libraries compiled with a scala version < 2.10.4 might not work on java 8? On Mon, Mar 28, 2016 at 7:06 PM, Kostas Sakellis wrote: > Also, +1 on dropping jdk7 in Spark 2.0. > > Kostas >

Understanding PySpark Internals

2016-03-29 Thread Adam Roberts
Hi, I'm interested in figuring out how the Python API for Spark works. I've come to the following conclusion and want to share it with the community; it could be of use in the PySpark docs here, specifically the "Execution and pipelining" part. Any sanity checking would be much appreciated,

Re: SPARK-13843 Next steps

2016-03-29 Thread Steve Loughran
While Sonatype is utterly strict about the org.apache namespace (it guarantees that all such artifacts have come through the ASF release process, ideally including code-signing), nobody checks the org.apache internals or worries too much about them. Note that Spark itself has some bits of

SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
Hi, I have a web service that provides a REST API to train a random forest algorithm. I train the random forest on a 5-node Spark cluster with enough memory; everything is cached (~22 GB). On small datasets of up to 100k samples everything is fine, but with the biggest one (400k samples and ~70k features)
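One workaround that is sometimes suggested for a StackOverflowError in deep tree ensembles is enabling checkpointing during training to truncate the lineage; a minimal sketch assuming Spark ML's RandomForestClassifier (dataset, paths, and parameter values are illustrative, not from the post):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.ml.classification.RandomForestClassifier

    val sc = new SparkContext(new SparkConf().setAppName("rf-train"))
    val sqlContext = new SQLContext(sc)

    // Checkpointing truncates the RDD lineage that grows as trees get deeper.
    sc.setCheckpointDir("/tmp/rf-checkpoints")   // illustrative path

    // Assumes a DataFrame with "label" and "features" columns already prepared.
    val training = sqlContext.read.parquet("/path/to/training.parquet")   // illustrative

    val rf = new RandomForestClassifier()
      .setNumTrees(100)
      .setMaxDepth(10)
      .setCacheNodeIds(true)        // cache node IDs per instance; needed for checkpointing
      .setCheckpointInterval(10)    // checkpoint intermediate state every 10 iterations

    val model = rf.fit(training)

Increasing the JVM thread stack size (an -Xss setting passed through spark.driver.extraJavaOptions / spark.executor.extraJavaOptions) is another mitigation that comes up for this error; whether either helps depends on where the deep recursion actually happens.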