Re: "Too many open files" exception on reduceByKey

2015-10-11 Thread Tian Zhang
It turns out that Mesos can override the OS ulimit -n setting, so we have increased the ulimit -n setting on the Mesos slaves.
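A quick way to confirm which limit the executors actually see (a diagnostic sketch in PySpark, not from the original thread; the app name and partition count are arbitrary):

    import resource
    from pyspark import SparkContext

    sc = SparkContext(appName="CheckUlimit")

    # Collect the (soft, hard) RLIMIT_NOFILE pair from tasks spread
    # across the cluster; distinct() keeps one entry per unique limit.
    limits = (sc.parallelize(range(100), 100)
                .map(lambda _: resource.getrlimit(resource.RLIMIT_NOFILE))
                .distinct()
                .collect())
    print(limits)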

Why does my Spark Streaming job stop producing output after a while?

2015-10-11 Thread Uthayan Suthakar
Hello all, I have a Spark Streaming job that runs and produces results successfully. However, after a few days the job stops producing any output. I can see the job is still running (polling data from Flume, completing jobs and their subtasks); however, it is failing to produce any output. I have to

Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Subhajit Purkayastha
Can I join 3 different RDDs together in a Spark SQL DF? I can find examples for 2 RDDs but not 3. Thanks

Re: Spark cluster - use machine name in WorkerID, not IP address

2015-10-11 Thread Akhil Das
Did you try setting SPARK_LOCAL_IP in the conf/spark-env.sh file on each node? Thanks Best Regards On Fri, Oct 2, 2015 at 4:18 AM, markluk wrote: > I'm running a standalone Spark cluster of 1 master and 2 slaves. > > My slaves file under /conf lists the fully qualified

Re: Compute Real-time Visualizations using spark streaming

2015-10-11 Thread Akhil Das
The simplest approach would be to push the streaming data (after the computations) to a SQL-like DB and then let your visualization piece pull it from the DB. Another approach would be to make your visualization piece a web socket (if you are using D3JS etc.) and then from your streaming application
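A minimal PySpark sketch of the first approach, assuming a socket text source and a hypothetical save_partition sink (replace its body with a real DB client):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    def save_partition(rows):
        # Hypothetical sink: open one DB connection per partition
        # and insert each row with your real client code.
        for row in rows:
            pass  # e.g. cursor.execute("INSERT ...", row)

    sc = SparkContext(appName="StreamToDB")
    ssc = StreamingContext(sc, 5)  # 5-second batches
    lines = ssc.socketTextStream("localhost", 9999)  # stand-in source
    counts = lines.map(lambda l: (l, 1)).reduceByKey(lambda a, b: a + b)

    # After the computation, push each batch to the DB; the
    # visualization layer then polls the DB on its own schedule.
    counts.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))

    ssc.start()
    ssc.awaitTermination()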

Re: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Richard Eggert
It's the same as joining 2. Join two together, and then join the third one to the result of that. On Oct 11, 2015 2:57 PM, "Subhajit Purkayastha" wrote: > Can I join 3 different RDDs together in a Spark SQL DF? I can find > examples for 2 RDDs but not 3. > Thanks
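A minimal PySpark illustration of the chained join; the DataFrames and the shared key column 'id' are invented for the example:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="ThreeWayJoin")
    sqlContext = SQLContext(sc)

    # Toy DataFrames sharing a key column 'id'.
    df_a = sqlContext.createDataFrame([(1, "a")], ["id", "x"])
    df_b = sqlContext.createDataFrame([(1, "b")], ["id", "y"])
    df_c = sqlContext.createDataFrame([(1, "c")], ["id", "z"])

    # Join the first two, then join the third to that result.
    joined = df_a.join(df_b, "id").join(df_c, "id")
    joined.show()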

Handling expiring state in a UDF

2015-10-11 Thread brightsparc
Hi, I have created a Python UDF that calls an API requiring an OAuth token, which expires and must be refreshed every 600 seconds, longer than any given stage. Due to the nature of threads and local state, if I use a global variable, the variable goes out of scope regularly. I look
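One common workaround (a sketch only, not from the thread; fetch_token and the 600-second lifetime stand in for the real OAuth details) is to keep the token in module-level state on each executor and refresh it lazily inside the UDF:

    import time
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Executor-side cache; module-level state persists for the life
    # of the Python worker process.
    _token = None
    _expires_at = 0.0

    def fetch_token():
        return "fresh-token"  # stand-in for the real OAuth request

    def get_token():
        global _token, _expires_at
        now = time.time()
        if _token is None or now >= _expires_at:
            _token = fetch_token()
            _expires_at = now + 600  # token lifetime in seconds
        return _token

    def call_api(value):
        # Expiry is re-checked on every call, so the token can be
        # refreshed mid-stage when the 600 seconds run out.
        return "%s:%s" % (get_token(), value)  # stand-in API call

    call_api_udf = udf(call_api, StringType())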

RE: Hive with Apache Spark

2015-10-11 Thread Cheng, Hao
One option is to read the data via JDBC; however, it's probably the worst option, as you will likely need some hacky work to enable parallel reading in Spark SQL. Another option is to copy the hive-site.xml of your Hive server to $SPARK_HOME/conf; then Spark SQL will see everything that Hive
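A minimal sketch of the second option, assuming hive-site.xml is already in $SPARK_HOME/conf; the table name is hypothetical:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="HiveAccess")

    # With hive-site.xml on the classpath, HiveContext talks to the
    # same metastore as the Hive server.
    hiveContext = HiveContext(sc)

    # Any table in the Hive metastore is now queryable.
    hiveContext.sql("SELECT * FROM some_table LIMIT 10").show()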

RE: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Cheng, Hao
A join B join C === (A join B) join C. Semantically they are equivalent, right?

Re: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Ted Yu
Some weekend reading: http://stackoverflow.com/questions/20022196/are-left-outer-joins-associative Cheers On Sun, Oct 11, 2015 at 5:32 PM, Cheng, Hao wrote: > A join B join C === (A join B) join C > Semantically they are equivalent, right?

RE: Join Order Optimization

2015-10-11 Thread Cheng, Hao
Spark SQL supports very basic join reordering optimization based on the raw table data size; this was added a couple of major releases back. The "EXPLAIN EXTENDED query" command is a very informative tool to verify whether the optimization takes effect.
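For DataFrames, explain(True) prints the same extended output; a tiny sketch with invented tables:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="ExplainJoin")
    sqlContext = SQLContext(sc)

    df1 = sqlContext.createDataFrame([(1, "a")], ["id", "x"])
    df2 = sqlContext.createDataFrame([(1, "b")], ["id", "y"])

    # Prints the parsed, analyzed, optimized, and physical plans;
    # check the optimized plan to see whether the join order changed.
    df1.join(df2, "id").explain(True)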

RE: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Cheng, Hao
Thank you Ted, that's very informative. From the DB optimization point of view, cost-based join re-ordering and multi-way joins do provide better performance; but from the API design point of view, 2 arguments (relations) for JOIN in the DF API are probably enough for the multiple

Re: Join Order Optimization

2015-10-11 Thread VJ Anand
Hi - Is there a design document for those operations that have been implemented in 1.4.0? If so, where can I find them? -VJ On Sun, Oct 11, 2015 at 7:27 PM, Cheng, Hao wrote: > Yes, I think SPARK-2211 should be the right place to follow the CBO > stuff, but probably that

RE: Join Order Optimization

2015-10-11 Thread Cheng, Hao
Probably you have to read the source code; I am not sure if there are any .ppt files or slides. Hao

Re: Join Order Optimization

2015-10-11 Thread Raajay
Hi Cheng, Could you point me to the JIRA that introduced this change? Also, is SPARK-2211 the right issue to follow for cost-based optimization? Thanks Raajay On Sun, Oct 11, 2015 at 7:57 PM, Cheng, Hao wrote: > Spark SQL supports very basic join reordering

RE: Join Order Optimization

2015-10-11 Thread Cheng, Hao
Yes, I think SPARK-2211 should be the right place to follow the CBO stuff, but probably that will not happen right away. The JIRA issue that introduced the statistics info can be found at: https://issues.apache.org/jira/browse/SPARK-2393 Hao

yarn-cluster mode throwing NullPointerException

2015-10-11 Thread Rachana Srivastava
I am trying to submit a job in yarn-cluster mode using the spark-submit command. My code works fine when I use yarn-client mode. Cloudera version: CDH-5.4.7-1.cdh5.4.7.p0.3 Command submitted: spark-submit --class "com.markmonitor.antifraud.ce.KafkaURLStreaming" \ --driver-java-options

RE: Best practices to call small Spark jobs as part of a REST API

2015-10-11 Thread Nuthan Kumar
If the data is also on-demand, Spark as a back end is also a good option.