Re: SQLContext.read.json(path) throws java.io.IOException

2015-08-26 Thread gsvic
Yes, it contains one line. On Wed, Aug 26, 2015 at 8:20 PM, Yin Huai-2 [via Apache Spark Developers List] ml-node+s1001551n13852...@n3.nabble.com wrote: The JSON support in Spark SQL handles a file with one JSON object per line or one JSON array of objects per line. What is the format of your file?

Re: Maven issues with 1.5-RC

2015-08-26 Thread shane knapp
we build on jenkins w/3.1.1, but also have 3.0.4. On Wed, Aug 26, 2015 at 8:18 AM, Sean Owen so...@cloudera.com wrote: It sounds like you're doing the right things. I believe the Jenkins test machines also have 3.0.4, but successfully build by using build/mvn --force. Not sure what to make of

Re: SQLContext.read.json(path) throws java.io.IOException

2015-08-26 Thread Yin Huai
The JSON support in Spark SQL handles a file with one JSON object per line or one JSON array of objects per line. What is the format of your file? Does it only contain a single line? On Wed, Aug 26, 2015 at 6:47 AM, gsvic victora...@gmail.com wrote: Hi, I have the following issue. I am trying to

Re: SQLContext.read.json(path) throws java.io.IOException

2015-08-26 Thread Reynold Xin
Any reason why you have more than 2G in a single line? There is a limit of 2G in the Hadoop library we use. Also the JVM doesn't work when your string is that long. On Wed, Aug 26, 2015 at 11:38 AM, gsvic victora...@gmail.com wrote: Yes, it contains one line. On Wed, Aug 26, 2015 at 8:20 PM,
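A rough way to check whether an input file runs into the single-line limit described in this reply is to measure its longest line without loading the whole file into memory. The sketch below is illustrative only: the function name `longest_line_bytes` and the constant `ASSUMED_LIMIT` are hypothetical, and reading "2G" as the largest signed 32-bit value is an assumption, not something stated in the thread.

```python
# Assumption: "2G" is read here as the largest signed 32-bit integer.
ASSUMED_LIMIT = 2**31 - 1


def longest_line_bytes(path, chunk_size=1 << 20):
    """Return the length in bytes of the longest line in the file at `path`.

    The file is scanned in fixed-size chunks, so a multi-gigabyte line
    never has to fit in memory at once. Newline bytes are not counted.
    """
    longest = 0
    current = 0  # length of the line currently being scanned
    with open(path, "rb") as src:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            start = 0
            while True:
                newline_at = chunk.find(b"\n", start)
                if newline_at == -1:
                    # The line continues into the next chunk.
                    current += len(chunk) - start
                    break
                current += newline_at - start
                longest = max(longest, current)
                current = 0
                start = newline_at + 1
    # The last line may not end with a newline.
    return max(longest, current)


def exceeds_assumed_limit(path):
    """True if any single line is longer than ASSUMED_LIMIT bytes."""
    return longest_line_bytes(path) > ASSUMED_LIMIT
```

Running `exceeds_assumed_limit` on a local copy of the file before handing it to Spark would show whether the records were written without line breaks.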

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-26 Thread rake
rxin wrote: The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc I was looking

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-26 Thread Luc Bourlier
- tested the backpressure/rate controlling in streaming. It works as expected. - there is a problem with the Scala 2.11 sbt build: https://issues.apache.org/jira/browse/SPARK-10227 Luc Bourlier *Spark Team - Typesafe, Inc.* luc.bourl...@typesafe.com http://www.typesafe.com On

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Yu
My understanding is that people on this mailing list who are interested in helping can log comments on the GORA JIRA. HBase integration with Spark is proven to work, so the intricacies should be on the Gora side. On Wed, Aug 26, 2015 at 8:08 AM, Furkan KAMACI furkankam...@gmail.com wrote: Btw, here is

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-26 Thread Sean Owen
My quick take: no blockers at this point, except for one potential issue. Still some 'critical' bugs worth a look. The release seems to pass tests but I get a lot of spurious failures; it took about 16 hours of running tests to get everything to pass at least once. Current score: 56 issues

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-26 Thread Calvin Jia
+1, tested that 1.5.0-RC2 works with Tachyon 0.7.1 as external block store.

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-26 Thread Reynold Xin
One small update -- the vote should close Saturday Aug 29. Not Friday Aug 29. On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin r...@databricks.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-26 Thread Reynold Xin
The Scala 2.11 issue should be fixed, but doesn't need to be a blocker, since Maven builds fine. The sbt build is more aggressive to make sure we catch warnings. On Wed, Aug 26, 2015 at 10:01 AM, Sean Owen so...@cloudera.com wrote: My quick take: no blockers at this point, except for one

Re: Building with sbt impossible to get artifacts when data has not been loaded

2015-08-26 Thread Marcelo Vanzin
I ran into the same error (different dependency) earlier today. In my case, the maven pom files and the sbt dependencies had a conflict (different versions of the same artifact) and ivy got confused. Not sure whether that will help in your case or not... On Wed, Aug 26, 2015 at 2:23 PM, Holden
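The kind of conflict described in this reply (the same artifact declared with different versions in two build definitions) can be spotted by comparing the two dependency lists. The sketch below is only an illustration: `find_version_conflicts` is a hypothetical helper, the dependency maps are typed in by hand rather than parsed from a real pom or sbt build, and the 2.10.4 version is made up for the example.

```python
def find_version_conflicts(maven_deps, sbt_deps):
    """Return {artifact: (maven_version, sbt_version)} for every artifact
    that appears in both mappings with different versions."""
    conflicts = {}
    for artifact in sorted(maven_deps.keys() & sbt_deps.keys()):
        if maven_deps[artifact] != sbt_deps[artifact]:
            conflicts[artifact] = (maven_deps[artifact], sbt_deps[artifact])
    return conflicts


if __name__ == "__main__":
    # Hypothetical inputs, written by hand for illustration.
    maven = {"org.scala-lang:scala-library": "2.10.4"}
    sbt = {"org.scala-lang:scala-library": "2.10.3"}
    for name, (pom_version, sbt_version) in find_version_conflicts(maven, sbt).items():
        print(f"{name}: pom has {pom_version}, sbt has {sbt_version}")
```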

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Furkan KAMACI
Hi Ted, You can check the full stack trace log in the attachment on Jira: https://issues.apache.org/jira/browse/GORA-386 Kind Regards, Furkan KAMACI On Wed, Aug 26, 2015 at 6:55 PM, Ted Yu yuzhih...@gmail.com wrote: My understanding is that people on this mailing list who are interested to

Building with sbt impossible to get artifacts when data has not been loaded

2015-08-26 Thread Holden Karau
Has anyone else run into "impossible to get artifacts when data has not been loaded. IvyNode = org.scala-lang#scala-library;2.10.3" during hive/update when building with sbt? Working around it is pretty simple (just add it as a dependency), but I'm wondering if it's impacting anyone else and I should

Re: Introduce a sbt plugin to deploy and submit jobs to a spark cluster on ec2

2015-08-26 Thread rake
This looks promising. I'm trying to use spark-ec2 to launch a cluster with Spark 1.5.0-SNAPSHOT and failing. Where should we ask questions and report problems? A couple of questions I have already after looking through the project: - Where does the configuration file /spark-deployer.conf/ go

Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Furkan KAMACI
Hi, I start an HBase cluster for my test class. I use this helper class: https://github.com/apache/gora/blob/master/gora-hbase/src/test/java/org/apache/gora/hbase/util/HBaseClusterSingleton.java and use it like this: private static final HBaseClusterSingleton cluster =

RE: Spark builds: allow user override of project version at buildtime

2015-08-26 Thread andrew.rowson
So, I actually tried this, and it built without problems, but publishing the artifacts to artifactory ended up with some strangeness in the child poms, where the property wasn’t resolved. This leads to issues pulling them into other projects of: “Could not find

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Yu
The connection failure was to ZooKeeper. Have you verified that localhost:2181 can serve requests? What version of HBase was Gora built against? Cheers On Aug 26, 2015, at 1:50 AM, Furkan KAMACI furkankam...@gmail.com wrote: Hi, I start an Hbase cluster for my test class. I use that
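A quick first check for the question raised in this reply is whether anything accepts TCP connections on the given host and port at all. The sketch below uses only the Python standard library; `can_connect` is a hypothetical helper name. A successful connect shows only that a listener is present, not that it is a healthy ZooKeeper instance.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError (including timeouts and
        # "connection refused") when no listener can be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # localhost:2181 is the address mentioned in the thread.
    print("localhost:2181 reachable:", can_connect("localhost", 2181))
```

Calling this from inside the failing test, right before the Spark job starts, would show whether the mini cluster's port is still open at that point.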

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Yu
Can you log the contents of the Configuration you pass from Spark? The output would give you some clue. Cheers On Aug 26, 2015, at 2:30 AM, Furkan KAMACI furkankam...@gmail.com wrote: Hi Ted, I'll check Zookeeper connection but another test method which runs on hbase without Spark

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Malaska
I've always used HBaseTestingUtility and never really had much trouble. I use that for all my unit testing between Spark and HBase. Here are some code examples if you're interested --Main HBase-Spark Module https://github.com/apache/hbase/tree/master/hbase-spark --Unit tests that cover all basic

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Furkan KAMACI
Hi, Here is the test method I've ignored due to a Connection Refused failure: https://github.com/kamaci/gora/blob/master/gora-hbase/src/test/java/org/apache/gora/hbase/mapreduce/TestHBaseStoreWordCount.java#L65 I've implemented a Spark backend for Apache Gora as a GSoC project and this is

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Malaska
Where is the input format class? Whenever I use the search on your GitHub it says "We couldn’t find any issues matching 'GoraInputFormat'". On Wed, Aug 26, 2015 at 9:48 AM, Furkan KAMACI furkankam...@gmail.com wrote: Hi, Here is the MapReduceTestUtils.testSparkWordCount()

Re: Introduce a sbt plugin to deploy and submit jobs to a spark cluster on ec2

2015-08-26 Thread pishen tsai
Please ask questions in the Gitter channel for now: https://gitter.im/pishen/spark-deployer - spark-deployer.conf should be placed in your project's root directory (beside build.sbt) - To use the nightly builds, you can replace the value of spark-tgz-url in spark-deployer.conf with the tgz you want

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Furkan KAMACI
Hi Ted, I'll check the ZooKeeper connection, but another test method which runs on HBase without Spark works without any error. The HBase version is 0.98.8-hadoop2 and I use Spark 1.3.1. Kind Regards, Furkan KAMACI On 26 Aug 2015 at 12:08, Ted Yu yuzhih...@gmail.com wrote: The connection failure was

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Malaska
Where can I find the code for MapReduceTestUtils.testSparkWordCount? On Wed, Aug 26, 2015 at 9:29 AM, Furkan KAMACI furkankam...@gmail.com wrote: Hi, Here is the test method I've ignored due to Connection Refused problem failure:

SQLContext.read.json(path) throws java.io.IOException

2015-08-26 Thread gsvic
Hi, I have the following issue. I am trying to load a 2.5G JSON file from a 10-node Hadoop cluster. Actually, I am trying to create a DataFrame, using sqlContext.read.json("hdfs://master:9000/path/file.json"). The JSON file contains a parsed table (relation) from the TPCH benchmark. After

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Furkan KAMACI
Hi, Here is the MapReduceTestUtils.testSparkWordCount() https://github.com/kamaci/gora/blob/master/gora-core/src/test/java/org/apache/gora/mapreduce/MapReduceTestUtils.java#L108 Here is SparkWordCount

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Yu
I found GORA-386, "Gora Spark Backend Support". Should the discussion be continued there? Cheers On Wed, Aug 26, 2015 at 7:02 AM, Ted Malaska ted.mala...@cloudera.com wrote: Where is the input format class? Whenever I use the search on your GitHub it says We couldn’t find any issues matching

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Furkan KAMACI
Btw, here is the source code of GoraInputFormat.java: https://github.com/kamaci/gora/blob/master/gora-core/src/main/java/org/apache/gora/mapreduce/GoraInputFormat.java On 26 Aug 2015 at 18:05, Furkan KAMACI furkankam...@gmail.com wrote: I'll send an e-mail to Gora dev list too and also

Maven issues with 1.5-RC

2015-08-26 Thread Chris Freeman
Currently trying to compile 1.5-RC2 (from https://github.com/apache/spark/commit/727771352855dbb780008c449a877f5aaa5fc27a) and running into issues with the new Maven requirement. I have 3.0.4 installed at the system level, 1.5 requires 3.3.3. As Patrick has pointed out in other places, this

Re: Maven issues with 1.5-RC

2015-08-26 Thread Sean Owen
It sounds like you're doing the right things. I believe the Jenkins test machines also have 3.0.4, but successfully build by using build/mvn --force. Not sure what to make of that. On Wed, Aug 26, 2015 at 4:08 PM, Chris Freeman cfree...@alteryx.com wrote: Currently trying to compile 1.5-RC2

Re: SQLContext.read.json(path) throws java.io.IOException

2015-08-26 Thread gsvic
No, I created the file by appending each JSON record in a loop without adding a line break. I've just changed that and now it works fine. Thank you very much for your support.
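The fix described in this message (one JSON record per line instead of all records on a single line) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the poster's actual code; `write_json_lines` and `read_json_lines` are hypothetical helper names and the sample records are made up.

```python
import json


def write_json_lines(records, path):
    """Write each record as one JSON object on its own line."""
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            # With default settings json.dumps emits no raw newlines
            # (newlines inside strings are escaped), so each record
            # occupies exactly one line.
            out.write(json.dumps(record))
            out.write("\n")


def read_json_lines(path):
    """Read the file back, parsing one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as src:
        return [json.loads(line) for line in src if line.strip()]


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    write_json_lines(rows, "records.json")
    print(read_json_lines("records.json"))
```

A file written this way matches the layout described earlier in the thread, with one JSON object per line.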

Differing performance in self joins

2015-08-26 Thread David Smith
I've noticed that two queries, which return identical results, have very different performance. I'd be interested in any hints about how to avoid problems like this. The DataFrame df contains a string field series and an integer eday, the number of days since (or before) the 1970-01-01 epoch. I'm
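For readers unfamiliar with the eday encoding mentioned in this message, the conversion between a calendar date and a day count relative to 1970-01-01 can be sketched with the Python standard library. The helper names `to_eday` and `from_eday` are hypothetical and are not taken from the poster's code.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def to_eday(day):
    """Number of days from 1970-01-01 to `day` (negative before the epoch)."""
    return (day - EPOCH).days


def from_eday(eday):
    """Calendar date that lies `eday` days after (or before) 1970-01-01."""
    return EPOCH + timedelta(days=eday)


if __name__ == "__main__":
    print(to_eday(date(1970, 1, 1)))    # the epoch itself maps to 0
    print(to_eday(date(1969, 12, 31)))  # the day before the epoch maps to -1
    print(from_eday(to_eday(date(2015, 8, 26))))  # round trip returns the same date
```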

Re: Building with sbt impossible to get artifacts when data has not been loaded

2015-08-26 Thread Josh Rosen
I ran into a similar problem while working on the spark-redshift library and was able to fix it by bumping that library's ScalaTest version. I'm still fighting some mysterious Scala issues while trying to test the spark-csv library against 1.5.0-RC1, so it's possible that a build or dependency

A TPCH benchmark for Spark

2015-08-26 Thread Feng Tian
Hi, We released a package called LLQL, which is a serialization of operators of relational algebra. Spark SQL Plan is the first one supported. More interesting to the spark community probably is our test that implements TPCH. We manually rewrote some sql -- mainly pulling subqueries out and