Spark SQL with Hive error: "Conf non-local session path expected to be non-null;"

2015-10-04 Thread YaoPau
I've been experimenting with PySpark SQL to query Hive tables for the last week, and all had been smooth. But on a command I've run hundreds of times successfully (a basic SELECT * ...), this error suddenly started popping up on every sqlCtx command until I restarted my session.

Re: How to install a Spark Package?

2015-10-04 Thread Ted Yu
Are you talking about a package which is listed on http://spark-packages.org ? The package should come with installation instructions, right?
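For reference, a sketch of both modes (the package coordinates below are hypothetical examples, not from the thread): online, spark-shell can resolve a package and its dependencies with --packages; offline, download the jar ahead of time and pass it with --jars.

    # Online: resolve the package (and its dependencies) at launch time
    spark-shell --packages com.databricks:spark-csv_2.10:1.2.0

    # Offline: fetch the jar beforehand and point at it directly
    spark-shell --jars /path/to/spark-csv_2.10-1.2.0.jar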

Re: Secondary Sorting in Spark

2015-10-04 Thread Koert Kuipers
See also https://github.com/tresata/spark-sorted
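A minimal sketch of the core secondary-sort trick discussed in the thread (all names and sample data are hypothetical): partition on the primary key only, then let repartitionAndSortWithinPartitions order each partition by the full (primary, secondary) composite key.

    import org.apache.spark.{Partitioner, SparkConf, SparkContext}

    object SecondarySortSketch {
      // Route records by the first tuple field only, so all records
      // for one user land in the same partition.
      class FirstFieldPartitioner(val numPartitions: Int) extends Partitioner {
        def getPartition(key: Any): Int = key match {
          case (user: String, _) =>
            (user.hashCode % numPartitions + numPartitions) % numPartitions
        }
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("secondary-sort").setMaster("local[2]"))
        val events = sc.parallelize(Seq(
          (("bob", 3L), "click"), (("bob", 1L), "view"), (("ann", 2L), "view")))
        // Tuples already have a lexicographic Ordering, so (user, time)
        // sorts within each partition exactly as needed.
        val sorted = events.repartitionAndSortWithinPartitions(new FirstFieldPartitioner(2))
        sorted.collect().foreach(println)
        sc.stop()
      }
    }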

Re: How to optimize group by query fired using hiveContext.sql?

2015-10-04 Thread Alex Rovner
Can you at least copy-paste the error(s) you are seeing when the job fails? Without the error message(s), it's hard to even suggest anything. -- Alex Rovner, Director, Data Engineering

Spark 1.5.0 Error on startup

2015-10-04 Thread Julius Fernandes
I have Spark 1.5.0 (prebuilt for Hadoop 2.6) with JDK 1.7. NOTE: I do not have a Hadoop installation. Whenever I start spark-shell, I get the following error: Caused by: java.lang.NullPointerException at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010) at
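The OS isn't stated, but if this is on Windows, an NPE inside ProcessBuilder.start during startup is commonly the missing Hadoop winutils shim rather than Spark itself. A sketch of the usual fix, under that assumption (paths hypothetical):

    REM Assumes Windows and a missing winutils.exe:
    REM place winutils.exe under C:\hadoop\bin, then point Spark at it.
    set HADOOP_HOME=C:\hadoop
    spark-shell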

How to install a Spark Package?

2015-10-04 Thread jeff saremi
So that it is available even in offline mode? I can't seem to be able to find any notes on that. Thanks, Jeff

ml.Pipeline without train step

2015-10-04 Thread Jaonary Rabarisoa
Hi there, The Pipeline of the ml package is really a great feature and we use it in our everyday tasks. But we have some use cases where we need a Pipeline of Transformers only, and the problem is that there's no train phase in that case. For example, we have a pipeline of image analytics with the
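A sketch of how a Transformer-only Pipeline behaves today (column names hypothetical, assuming an existing DataFrame df with a "text" column): fit() still has to be called, but with no Estimator stages there is nothing to train, so it is cheap and simply wires the stages together into a PipelineModel.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Both stages are pure Transformers.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val tf = new HashingTF().setInputCol("words").setOutputCol("features")
    val pipeline = new Pipeline().setStages(Array(tokenizer, tf))

    val model = pipeline.fit(df)        // cheap: no Estimator stages to train
    val transformed = model.transform(df)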

Re: preferredNodeLocationData, SPARK-8949, and SparkContext - a leftover?

2015-10-04 Thread Sean Owen
I think it's unused as the JIRA says, but removing it from the constructors would change the API, so that's why it stays in the signature. Removing the internal field and one usage of it seems OK, though I don't think it would help much of anything.

java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-10-04 Thread t_ras
I get java.lang.OutOfMemoryError: GC overhead limit exceeded when trying a count action on a file. The file is a 217GB CSV file. I'm using 10 r3.8xlarge (Ubuntu) machines, CDH 5.3.6 and Spark 1.2.0. Configuration: spark.app.id:local-1443956477103 spark.app.name:Spark shell
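A hedged sketch of the usual first steps (path and figures hypothetical): raise the number of input partitions so each task holds less data, and give the JVM more heap. Note also that the spark.app.id of "local-..." above suggests the shell was running in local mode, so the 10 machines may not have been used at all.

    // More input partitions -> smaller per-task memory footprint.
    val lines = sc.textFile("s3n://bucket/big.csv", 2000)  // minPartitions raised
    println(lines.count())

    // Heap can be raised at launch, e.g.:
    //   spark-shell --driver-memory 8g --executor-memory 16g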

Re: Examples module not building in intellij

2015-10-04 Thread Sean Owen
It builds for me. That message usually really means you can't resolve or download from a repo. It's just the last thing that happens to fail.

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-10-04 Thread Ted Yu
1.2.0 is quite old. You may want to try 1.5.1, which was released in the past week. Cheers

Re: Kafka Direct Stream

2015-10-04 Thread varun sharma
I went through the story and, as I understood it, it is for saving data to multiple keyspaces at once. How will it work for saving data to multiple tables in the same keyspace? I think tableName: String should also be tableName: T => String. Let me know if I understood incorrectly.

Re: Examples module not building in intellij

2015-10-04 Thread Stephen Boesch
Thanks Sean. Why would a repo not be resolvable from IJ even though all modules build properly on the command line?

Examples module not building in intellij

2015-10-04 Thread Stephen Boesch
For a week or two the trunk has not been building for the examples module within IntelliJ. The other modules (including core, sql, mllib, etc.) *are* working. A portion of the error message is "Unable to get dependency information: Unable to read the metadata file for artifact

Re: Hive ORC Malformed while loading into spark data frame

2015-10-04 Thread Umesh Kacha
Thanks much Zhan Zhang. I will open a JIRA saying ORC files created using hiveContext.sql can't be read by the DataFrame reader. Regards, Umesh
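For anyone comparing the two code paths in question, a minimal round-trip sketch (table name and warehouse path are hypothetical, Spark 1.5 APIs): write ORC through hiveContext.sql, then read the same files back with the DataFrame reader.

    // Write ORC via SQL...
    hiveContext.sql("CREATE TABLE orc_table STORED AS ORC AS SELECT * FROM src")

    // ...then read the produced files with the reader.
    val df = hiveContext.read.format("orc")
      .load("/user/hive/warehouse/orc_table")  // default warehouse location, if unchanged
    df.printSchema()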

Re: preferredNodeLocationData, SPARK-8949, and SparkContext - a leftover?

2015-10-04 Thread Jacek Laskowski
Hi, You're right - it is unused, but the code does some (very little) initialization as if it were really needed, which seeds confusion. I filed https://issues.apache.org/jira/browse/SPARK-10921 to track it. The other reason I brought it up was to help myself (and hopefully others) who read the

Mini projects for spark novice

2015-10-04 Thread Rahul Jeevanandam
I am currently learning Spark and I want to solidify my knowledge of it, so I want to do some projects with it. Can you suggest some nice project ideas to work on in Spark? -- Regards, Rahul

Re: Spark Streaming over YARN

2015-10-04 Thread nibiau
4 partitions.

Re: How to use registered Hive UDF in Spark DataFrame?

2015-10-04 Thread Umesh Kacha
Hi, I tried to use callUDF in the following way and it throws an exception saying it can't recognize myUDF even though I registered it: List colList = new ArrayList(); colList.add(col("myColumn").as("modifiedColumn")); Seq colSeq = JavaConversions.asScalaBuffer(colList); //I need to do this because the
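A sketch of the registration-then-call sequence (the UDF class, table and column names are hypothetical): register the Hive UDF against the HiveContext first, then reference it by the registered name. If callUDF doesn't resolve the Hive-registered name, the SQL form should.

    import org.apache.spark.sql.functions.{callUDF, col}

    // Register the Hive UDF under the name "myUDF".
    hiveContext.sql("CREATE TEMPORARY FUNCTION myUDF AS 'com.example.MyUDF'")

    // Either call it through SQL...
    val viaSql = hiveContext.sql(
      "SELECT myUDF(myColumn) AS modifiedColumn FROM my_table")

    // ...or on a DataFrame column via callUDF (Spark 1.5; the name must
    // match the registered function exactly).
    val viaApi = df.select(callUDF("myUDF", col("myColumn")).as("modifiedColumn"))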

Re: Limiting number of cores per job in multi-threaded driver.

2015-10-04 Thread Philip Weaver
Yes, I am sharing the cluster across many jobs, and each job only needs 8 cores (in fact, because the jobs are so small and are counting uniques, it only gets slower as you add more cores). My question is how to limit each job to only use 8 cores, but have the entire cluster available for that
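One documented lever, sketched here under the assumption that each job runs as its own application (standalone or coarse-grained Mesos mode): spark.cores.max caps the cores that application may claim, leaving the rest of the cluster for concurrent jobs.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("small-unique-count")   // hypothetical job name
      .set("spark.cores.max", "8")        // ceiling for this application
    val sc = new SparkContext(conf)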

Enriching df.write.jdbc

2015-10-04 Thread Kapil Raaj
Hello folks, I would like to contribute code to enrich the DataFrame writer API for JDBC to cover an "Update table" feature, based on field names/keys passed as a List of Strings. Use case: 1. df.write.mode("Update").jdbc(connectionString, "table_name", connectionProperties, keys) Or 2.
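For contrast with the proposal, a sketch of what the writer supports today (Spark 1.5; URL and credentials hypothetical): whole-table Append/Overwrite semantics only, with no key-based UPDATE mode.

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val props = new Properties()
    props.setProperty("user", "dbuser")      // hypothetical credentials
    props.setProperty("password", "secret")

    df.write
      .mode(SaveMode.Append)                 // no "Update" mode exists yet
      .jdbc("jdbc:postgresql://host/db", "table_name", props)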

Re: Mini projects for spark novice

2015-10-04 Thread Ted Yu
See https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark FYI

Re: Spark Streaming over YARN

2015-10-04 Thread nibiau
Hello, I am using https://github.com/dibbhatt/kafka-spark-consumer I specify 4 receivers in the ReceiverLauncher, but in the YARN console I can see only one node receiving the Kafka flow. (I use Spark 1.3.1) Tks Nicolas

Re: Limiting number of cores per job in multi-threaded driver.

2015-10-04 Thread Adrian Tanase
You are absolutely correct, I apologize. My understanding was that you were sharing the machine across many jobs. That was the context in which I was making that comment. -adrian

Re: how to broadcast huge lookup table?

2015-10-04 Thread Adrian Tanase
Have a look at .transformWith - you can specify another RDD. On 02 Oct 2015, at 21:50, "saif.a.ell...@wellsfargo.com" wrote: > I tried broadcasting a key-value rdd, but
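A related sketch: transformWith pairs two DStreams, but for a static lookup dataset, plain transform plus a join avoids the broadcast entirely. Here stream is assumed to be a DStream[(String, String)] and the lookup path is hypothetical.

    import org.apache.spark.rdd.RDD

    // Load the lookup table once as a (key, value) RDD and cache it,
    // since it is reused on every micro-batch.
    val lookup: RDD[(String, String)] = sc
      .textFile("hdfs:///data/lookup.csv")
      .map { line => val Array(k, v) = line.split(','); (k, v) }
    lookup.cache()

    // Join each batch against the lookup instead of broadcasting it.
    val enriched = stream.transform(rdd => rdd.join(lookup))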

Re: Spark Streaming over YARN

2015-10-04 Thread Dibyendu Bhattacharya
How many partitions are there in your Kafka topic? Regards, Dibyendu

Re: Limiting number of cores per job in multi-threaded driver.

2015-10-04 Thread Jerry Lam
Philip, the guy is trying to help you. Calling him silly is a bit too far. He might assume your problem is IO bound, which might not be the case. If you need only 4 cores per job no matter what, there is little advantage to using Spark, in my opinion, because you can easily do this with just a worker

Secondary Sorting in Spark

2015-10-04 Thread Bill Bejeck
I've written a blog post on secondary sorting in Spark and I thought I'd share it with the group: http://codingjunkie.net/spark-secondary-sort/ Thanks, Bill

Re: Limiting number of cores per job in multi-threaded driver.

2015-10-04 Thread Philip Weaver
I believe I've described my use case clearly, and I'm being questioned on whether it's legitimate. I will assert again that if you don't understand my use case, it really doesn't make sense to make any statement about how many resources I should need. And I'm sorry, but I completely disagree with your

Re: Limiting number of cores per job in multi-threaded driver.

2015-10-04 Thread Philip Weaver
Since I'm running Spark on Mesos, to be fair I should give Mesos credit, too! And I should also put some effort into describing what I'm trying to accomplish more clearly. There are really three levels of scheduling that I'm hoping to exploit: - Scheduling in Mesos across all frameworks, where

spark-ec2 config files.

2015-10-04 Thread Renato Perini
Can someone provide the relevant config files generated by the spark-ec2 script? I'm configuring a Spark cluster on EC2 manually, and I would like to compare my config files (spark-defaults.conf, spark-env.sh) with those generated by the spark-ec2 script. Of course, hide your sensitive

Usage of transform for code reuse between Streaming and Batch job affects the performance ?

2015-10-04 Thread swetha
Hi, I have the following code for code reuse between the batch and the streaming job: val groupedAndSortedSessions = sessions.transform(rdd => JobCommon.getGroupedAndSortedSessions(rdd)) The same code without code reuse between the batch and the streaming looks like the following: val
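A sketch of the pattern itself (the body of getGroupedAndSortedSessions is a stand-in; payload types are hypothetical): transform only wraps the shared function around each batch RDD, so the lineage and shuffle behaviour match calling the same function directly in the batch job.

    import org.apache.spark.rdd.RDD

    object JobCommon {
      // Stand-in implementation: group sessions by key, sort each group.
      def getGroupedAndSortedSessions(rdd: RDD[(String, Long)]): RDD[(String, List[Long])] =
        rdd.groupByKey().mapValues(_.toList.sorted)
    }

    // Batch job: call the shared function directly.
    val batchOut = JobCommon.getGroupedAndSortedSessions(sessionsRdd)

    // Streaming job: identical logic applied to each micro-batch.
    val streamOut = sessions.transform(rdd => JobCommon.getGroupedAndSortedSessions(rdd))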

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-10-04 Thread satish chandra j
Hi Petr, Could you please let me know if I am missing anything in the code, as my code is the same as the snippet you shared, but I am still getting the below error: error: type mismatch; found: String, required: Serializable Please let me know if any fix needs to be applied. Regards, Satish
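For the thread's underlying problem (Scala 2.10's 22-argument case class limit), the usual workaround is to skip case classes and build the schema programmatically with StructType, which has no field-count ceiling. A sketch, with a hypothetical file path and generated column names:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // 30 columns: well past the 22-field case class limit.
    val schema = StructType(
      (1 to 30).map(i => StructField(s"col$i", StringType, nullable = true)))

    // Parse each line into a Row instead of a case class instance.
    val rowRdd = sc.textFile("hdfs:///data/wide.csv")
      .map(_.split(","))
      .map(fields => Row(fields: _*))

    val df = sqlContext.createDataFrame(rowRdd, schema)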