Why is there no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Jeff Zhang
Although users can use HDFS glob syntax to read multiple inputs, sometimes that isn't convenient. I'm not sure why there's no SparkContext#textFiles API; it should be easy to implement. I'd love to create a ticket and contribute a patch for it if there's no other consideration.
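For reference, a minimal sketch of the glob workaround being discussed (the path is hypothetical):

```scala
// One call reads every part file under all matching date directories.
val lines = sc.textFile("hdfs:///logs/2015-11-*/part-*")
```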

Re: thought experiment: use Spark ML for real-time prediction

2015-11-11 Thread Nirmal Fernando
As of now, we basically serialize the ML model and then deserialize it for real-time prediction. On Wed, Nov 11, 2015 at 4:39 PM, Adrian Tanase wrote: > I don’t think this answers your question but here’s how you would evaluate > the model in realtime in a streaming
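For context, a hedged sketch of that save/load round trip, using the MLlib persistence API available as of Spark 1.5 (paths and training data are made up):

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Offline: train and persist the model.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.5)),
  LabeledPoint(0.0, Vectors.dense(0.0, 1.5))))
val model = new LogisticRegressionWithLBFGS().run(training)
model.save(sc, "hdfs:///models/lr-demo")

// Serving side: deserialize once at startup, then score vectors locally.
val loaded = LogisticRegressionModel.load(sc, "hdfs:///models/lr-demo")
val prediction = loaded.predict(Vectors.dense(1.0, 0.4))
```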

Re: Spark Streaming Checkpoint help failed application

2015-11-11 Thread Gideon
Hi, I'm no expert, but short answer: yes, after restarting, your application will reread the failed messages. Longer answer: it seems like you're mixing several things together. Let me try and explain: - WAL is used to prevent your application from losing data by making the executor first write the
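To make the moving parts concrete, a hedged sketch of WAL plus checkpoint recovery (directory and batch interval are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/myapp" // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("wal-demo")
    // WAL: executors log received blocks durably before processing them.
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define input streams and transformations here ...
  ssc
}

// On restart, recover the failed application's state from the checkpoint
// instead of building a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```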

Re: thought experiment: use Spark ML for real-time prediction

2015-11-11 Thread Adrian Tanase
I don’t think this answers your question but here’s how you would evaluate the model in realtime in a streaming app https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html Maybe you can find a way to extract portions of MLLib and run them

Re: Start python script with SparkLauncher

2015-11-11 Thread Andrejs
Thanks Ted, that helped me. It turned out that I had formatted the server name wrongly; I had to add spark:// in front of the server name. Cheers, Andrejs On 11/11/15 14:26, Ted Yu wrote: Please take a look at launcher/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java to see how
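The fix in a nutshell, as a minimal sketch (the host name is hypothetical): a standalone master URL needs the spark:// scheme, not a bare host name.

```scala
import org.apache.spark.launcher.SparkLauncher

val launcher = new SparkLauncher()
  .setSparkHome("/home/user/spark-1.4.1-bin-hadoop2.6")
  .setMaster("spark://my-server:7077") // a bare "my-server:7077" will fail
```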

Re: How to configure logging...

2015-11-11 Thread Andy Davidson
Hi Hitoshi, Looks like you have read http://spark.apache.org/docs/latest/configuration.html#configuring-logging. On my EC2 cluster I also need to do the following. I think my notes are not complete; you may also need to restart your cluster. Hope this helps, Andy # # setting up logger so
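For anyone following along, a hedged sketch of the kind of conf/log4j.properties tweak being described, based on Spark's bundled log4j.properties.template (the application package name is made up):

```properties
# Quiet the root logger down to WARN on the console.
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Keep your own application's logging at INFO (hypothetical package name).
log4j.logger.com.mycompany.myapp=INFO
```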

Re: Why is there no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Shixiong Zhu
In addition, if you have more than two text files, you can just put them into a Seq and use "reduce(_ ++ _)". Best Regards, Shixiong Zhu 2015-11-11 10:21 GMT-08:00 Jakob Odersky : > Hey Jeff, > Do you mean reading from multiple text files? In that case, as a > workaround,
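Shixiong's suggestion as a runnable sketch (file names are hypothetical):

```scala
// Pairwise union over an arbitrary number of inputs.
val rdds = Seq("f1.txt", "f2.txt", "f3.txt").map(sc.textFile(_))
val all = rdds.reduce(_ ++ _)
```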

Re: Spark on YARN using Java 1.8 fails

2015-11-11 Thread mvle
Unfortunately, no. I switched back to OpenJDK 1.7. Didn't get a chance to dig deeper. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-using-Java-1-8-fails-tp24925p25360.html Sent from the Apache Spark User List mailing list archive at

Re: Why is there no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Jakob Odersky
Hey Jeff, Do you mean reading from multiple text files? In that case, as a workaround, you can use the RDD#union() (or ++) method to concatenate multiple RDDs. For example: val lines1 = sc.textFile("file1") val lines2 = sc.textFile("file2") val rdd = lines1 union lines2 regards, --Jakob On 11

Re: Spark on YARN using Java 1.8 fails

2015-11-11 Thread Abel Rincón
Hi, There was another related question: https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201506.mbox/%3CCAJ2peNeruM2Y2Tbf8-Wiras-weE586LM_o25FsN=+z1-bfw...@mail.gmail.com%3E Some months ago, if I remember correctly, using Spark 1.3 + YARN + Java 8 we had the same problem.

Creating new Spark context when running in Secure YARN fails

2015-11-11 Thread mvle
Hi, I've deployed a Secure YARN 2.7.1 cluster with HDFS encryption and am trying to run the pyspark shell using Spark 1.5.1. The pyspark shell works and I can run sample code to calculate Pi just fine. However, when I try to stop the current context (e.g., sc.stop()) and then create a new context

Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread PhuDuc Nguyen
Dean, Thanks for the reply. I'm searching (via the Spark mailing list archive and Google) and can't find the previous thread you mentioned. I've stumbled upon a few, but they may not be the thread you're referring to. I'm very interested in reading that discussion, and any links/keywords would be greatly

Re: NullPointerException with joda time

2015-11-11 Thread Ted Yu
In case you need to adjust log4j properties, see the following thread: http://search-hadoop.com/m/q3RTtJHkzb1t0J66=Re+Spark+Streaming+Log4j+Inside+Eclipse Cheers On Tue, Nov 10, 2015 at 1:28 PM, Ted Yu wrote: > I took a look at >

Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread Dean Wampler
Dynamic allocation doesn't work yet with Spark Streaming in any cluster scenario. There was a previous thread on this topic which discusses the issues that need to be resolved. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly)

Re: anyone using netlib-java with sparkR on yarn spark1.6?

2015-11-11 Thread Tom Graves
Is there anything other than the spark assembly that needs to be in the classpath? I verified the assembly was built right and it's in the classpath (else nothing would work). Thanks, Tom On Tuesday, November 10, 2015 8:29 PM, Shivaram Venkataraman wrote:

Re: Start python script with SparkLauncher

2015-11-11 Thread Ted Yu
Please take a look at launcher/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java to see how app.getInputStream() and app.getErrorStream() are handled. In master branch, the Suite is located at core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java FYI On Wed, Nov 11,
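In the same spirit as that test suite, a hedged sketch of draining both streams (paths and master are hypothetical): if nothing reads stdout/stderr, the child process can stall once the OS pipe buffers fill up.

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import org.apache.spark.launcher.SparkLauncher

val app = new SparkLauncher()
  .setAppResource("/path/to/script.py")
  .setMaster("local[*]")
  .launch()

// Pump a stream to the console on a daemon thread.
def drain(in: InputStream, tag: String): Unit = {
  val t = new Thread(new Runnable {
    def run(): Unit = {
      val reader = new BufferedReader(new InputStreamReader(in))
      Iterator.continually(reader.readLine()).takeWhile(_ != null)
        .foreach(line => println(s"$tag> $line"))
    }
  })
  t.setDaemon(true)
  t.start()
}

drain(app.getInputStream, "stdout")
drain(app.getErrorStream, "stderr")
app.waitFor()
```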

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-11 Thread java8964
Any reason that the Spark Cassandra Connector won't work for you? Yong To: bryan.jeff...@gmail.com; user@spark.apache.org From: bryan.jeff...@gmail.com Subject: RE: Cassandra via SparkSQL/Hive JDBC Date: Tue, 10 Nov 2015 22:42:13 -0500 Anyone have thoughts or a similar use-case for SparkSQL /

Start python script with SparkLauncher

2015-11-11 Thread Andrejs
Hi all, I'm trying to call a Python script from a Scala application. Below is part of my code. My problem is that it doesn't work, but it also doesn't provide any error message, so I can't debug it. val spark = new SparkLauncher().setSparkHome("/home/user/spark-1.4.1-bin-hadoop2.6")

dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread PhuDuc Nguyen
I'm trying to get Spark Streaming to scale up/down its number of executors within Mesos based on workload. It's not scaling down. I'm using Spark 1.5.1 reading from Kafka using the direct (receiver-less) approach. Based on this ticket https://issues.apache.org/jira/browse/SPARK-6287 with the
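For context, the standard knobs involved in enabling dynamic allocation (values are illustrative); the replies below explain why the default heuristics don't fit micro-batch workloads.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // external shuffle service is required
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
```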

Re: Status of 2.11 support?

2015-11-11 Thread Jakob Odersky
Hi Sukant, Regarding the first point: when building spark during my daily work, I always use Scala 2.11 and have only run into build problems once. Assuming a working build I have never had any issues with the resulting artifacts. More generally however, I would advise you to go with Scala 2.11

graphx - trianglecount of 2B edges

2015-11-11 Thread Vinod Mangipudi
I was attempting to use the GraphX triangle count method on a 2B-edge graph (the Friendster dataset on SNAP). I have access to a 60-node cluster with 90 GB of memory and 30 vcores per node. I am running into memory issues. I am using 1000 partitions with the RandomVertexCut. Here’s my submit
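A hedged sketch of the setup being described (the path and partition count are illustrative); note that triangleCount() requires a canonically oriented graph with deduplicated edges:

```scala
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

val graph = GraphLoader
  .edgeListFile(sc, "hdfs:///friendster/edges.txt",
    canonicalOrientation = true, numEdgePartitions = 1000)
  .partitionBy(PartitionStrategy.RandomVertexCut)

// (vertexId, number of triangles through that vertex)
val triangles = graph.triangleCount().vertices
```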

Re: Why is there no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Pradeep Gollakota
IIRC, TextInputFormat supports an input path that is a comma separated list. I haven't tried this, but I think you should just be able to do sc.textFile("file1,file2,...") On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang wrote: > I know these workaround, but wouldn't it be more

RE: hdfs-ha on mesos - odd bug

2015-11-11 Thread Buttler, David
I have verified that this error exists on my system as well, and the suggested workaround also works. Spark version: 1.5.1; 1.5.2 Mesos version: 0.21.1 CDH version: 4.7 I have set up the spark-env.sh to contain HADOOP_CONF_DIR pointing to the correct place, and I have also linked in the

Re: Why is there no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Pradeep Gollakota
Looks like what I was suggesting doesn't work. :/ On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang wrote: > Yes, that's what I suggest. TextInputFormat supports multiple inputs, so on > the Spark side we just need to provide an API for that. > > On Thu, Nov 12, 2015 at 8:45 AM, Pradeep

Re: how to run unit test for specific component only

2015-11-11 Thread Ted Yu
Have you tried the following ? build/sbt "sql/test-only *" Cheers On Wed, Nov 11, 2015 at 7:13 PM, weoccc wrote: > Hi, > > I am wondering how to run unit test for specific spark component only. > > mvn test -DwildcardSuites="org.apache.spark.sql.*" -Dtest=none > > The above

Re: Why is there no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Jeff Zhang
I know these workarounds, but wouldn't it be more convenient and straightforward to use SparkContext#textFiles? On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra wrote: > For more than a small number of files, you'd be better off using > SparkContext#union instead of

how to run unit test for specific component only

2015-11-11 Thread weoccc
Hi, I am wondering how to run unit tests for a specific Spark component only. mvn test -DwildcardSuites="org.apache.spark.sql.*" -Dtest=none The above command doesn't seem to work. I'm using Spark 1.5. Thanks, Weide

Re: Status of 2.11 support?

2015-11-11 Thread Ted Yu
I started playing with Scala 2.12.0-M3 but the compilation didn't pass (as expected). Planning to get back to 2.12 once it is released. FYI On Wed, Nov 11, 2015 at 4:34 PM, Jakob Odersky wrote: > Hi Sukant, > > Regarding the first point: when building spark during my daily

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-11 Thread Mohammed Guller
Short answer: yes. The Spark Cassandra Connector supports the data source API, so you can create a DataFrame that points directly to a Cassandra table. You can query it using the DataFrame API or the SQL/HiveQL interface. If you want to see an example, see slides 27 and 28 in this deck that I
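A hedged sketch of that data source API usage, assuming the Spark Cassandra Connector is on the classpath (keyspace, table, and column are made up):

```scala
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .load()

// Query through the DataFrame API; filters like this can be pushed down.
df.filter(df("user_id") === 42).show()
```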

Re: thought experiment: use Spark ML for real-time prediction

2015-11-11 Thread DB Tsai
Do you think it will be useful to separate those models and model loader/writer code into another spark-ml-common jar without any spark platform dependencies so users can load the models trained by Spark ML in their application and run the prediction? Sincerely, DB Tsai

Re: Why is there no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Mark Hamstra
For more than a small number of files, you'd be better off using SparkContext#union instead of RDD#union. That will avoid building up a lengthy lineage. On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky wrote: > Hey Jeff, > Do you mean reading from multiple text files? In
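Mark's alternative as a one-line sketch (file names hypothetical): a single n-ary union node instead of a chain of binary ones.

```scala
val all = sc.union(Seq("f1.txt", "f2.txt", "f3.txt").map(sc.textFile(_)))
```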

Re: Spark Packages Configuration Not Found

2015-11-11 Thread Jakob Odersky
As another, general question, are spark packages the go-to way of extending spark functionality? In my specific use-case I would like to start spark (be it spark-shell or other) and hook into the listener API. Since I wasn't able to find much documentation about spark packages, I was wondering if
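For the listener use case specifically, a minimal sketch of hooking into the API directly, without any package (the handler body is just illustrative):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

sc.addSparkListener(new SparkListener {
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    println(s"stage ${stage.stageInfo.stageId} completed")
})
```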

Spark cluster with Java 8 using ./spark-ec2

2015-11-11 Thread Philipp Grulich
Hey, I just saw this post: http://qnalist.com/questions/5627042/spark-cluster-with-java-8-using-spark-ec2 and I have the same question: how can I use Java 8 with the ./spark-ec2 script? Does anybody have a solution? Philipp

Re: Creating new Spark context when running in Secure YARN fails

2015-11-11 Thread Michael V Le
Hi Ted, Thanks for the reply. I tried your patch but am having the same problem. I ran: ./bin/pyspark --master yarn-client >> sc.stop() >> sc = SparkContext() Same error dump as below. Do I need to pass something to the new SparkContext? Thanks, Mike From: Ted Yu

Porting R code to SparkR

2015-11-11 Thread Sanjay Subramanian
Hi guys, This is possibly going to sound like a vague, stupid question, but I have a problem to solve and I need help. So any which way I go is only up :-) I have a bunch of R scripts (I am not an R expert) and we are currently evaluating how to translate these R scripts to SparkR data frame

Re: Slow stage?

2015-11-11 Thread Jakob Odersky
Hi Simone, I'm afraid I don't have an answer to your question. However I noticed the DAG figures in the attachment. How did you generate these? I am myself working on a project in which I am trying to generate visual representations of the spark scheduler DAG. If such a tool already exists, I

Re: Spark Thrift doesn't start

2015-11-11 Thread Zhan Zhang
In hive-site.xml, you can remove all configuration related to Tez and give it a try again. Thanks. Zhan Zhang On Nov 10, 2015, at 10:47 PM, DaeHyun Ryu wrote: Hi folks, I configured Tez as the execution engine of Hive. After doing that, whenever I
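Alternatively, a hedged sketch of pinning the engine back in hive-site.xml rather than deleting the Tez entries; this assumes the engine setting is what trips up the Thrift server:

```xml
<property>
  <name>hive.execution.engine</name>
  <value>mr</value> <!-- revert from "tez" to the MapReduce default -->
</property>
```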

Re: Creating new Spark context when running in Secure YARN fails

2015-11-11 Thread Ted Yu
Looks like the delegation token should be renewed. Mind trying the following ? Thanks diff --git a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerB index 20771f6..e3c4a5a 100644

Re: Slow stage?

2015-11-11 Thread Mark Hamstra
Those are from the Application Web UI -- look for the "DAG Visualization" and "Event Timeline" elements on Job and Stage pages. On Wed, Nov 11, 2015 at 10:58 AM, Jakob Odersky wrote: > Hi Simone, > I'm afraid I don't have an answer to your question. However I noticed the >

Re: Why is there no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Jeff Zhang
Yes, that's what I suggest. TextInputFormat supports multiple inputs, so on the Spark side we just need to provide an API for that. On Thu, Nov 12, 2015 at 8:45 AM, Pradeep Gollakota wrote: > IIRC, TextInputFormat supports an input path that is a comma separated > list. I

Re: Anybody hit this issue in spark shell?

2015-11-11 Thread Ted Yu
I searched the code base and confirmed that there is no class from com.google.common.annotations being used. However, there are classes from com.google.common in use, e.g. import com.google.common.io.{ByteStreams, Files} import com.google.common.net.InetAddresses FYI On Tue, Nov 10, 2015 at 11:22 AM,

Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread Saisai Shao
I think for receiver-less streaming connectors, like the direct Kafka input stream or the HDFS connector, dynamic allocation could work, compared to receiver-based streaming connectors, since with receiver-less connectors the behavior of the streaming app is more like that of a normal Spark app, so dynamic

Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread Saisai Shao
Yeah, agreed. Only some extreme streaming workloads fit the pattern of dynamic allocation and could work very well with it. In normal cases, no executor will remain idle for a long time, so frequent scale-up and ramp-down of executors will bring large overhead and latency to

Re: Creating new Spark context when running in Secure YARN fails

2015-11-11 Thread Ted Yu
I assume your config contains "spark.yarn.credentials.file" - otherwise the startExecutorDelegationTokenRenewer(conf) call would be skipped. On Wed, Nov 11, 2015 at 12:16 PM, Michael V Le wrote: > Hi Ted, > > Thanks for the reply. > > I tried your patch but am having the same problem.

Re: Slow stage?

2015-11-11 Thread Koert Kuipers
I am a person that usually hates UIs, and I have to say I love these. Very useful. On Wed, Nov 11, 2015 at 3:23 PM, Mark Hamstra wrote: > Those are from the Application Web UI -- look for the "DAG Visualization" > and "Event Timeline" elements on Job and Stage pages. > >

Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread Tathagata Das
The reason the existing dynamic allocation does not work out of the box for Spark Streaming is that the heuristics used to decide when to scale up/down are not the right ones for micro-batch workloads. They work great for typical batch workloads. However, you can use the underlying developer API
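A sketch of that developer API, which lives on SparkContext (the counts and executor id are made up):

```scala
// Ask the cluster manager for two more executors.
sc.requestExecutors(2)

// Release a specific executor once the workload ramps down.
sc.killExecutors(Seq("3"))
```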

Re: anyone using netlib-java with sparkR on yarn spark1.6?

2015-11-11 Thread Shivaram Venkataraman
Nothing more -- The only two things I can think of are: (a) is there something else on the classpath that comes before this LGPL JAR? I've seen cases where two versions of netlib-java on the classpath can mess things up. (b) There is something about the way SparkR is using reflection to invoke

Different classpath across stages?

2015-11-11 Thread John Meehan
I’ve been running into a strange class not found problem, but only when my job has more than one phase. I have an RDD[ProtobufClass] which behaves as expected in a single-stage job (e.g. serialize to JSON and export). But when I try to groupByKey, the first stage runs (essentially a keyBy),

Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread PhuDuc Nguyen
Awesome, thanks for the tip! On Wed, Nov 11, 2015 at 2:25 PM, Tathagata Das wrote: > The reason the existing dynamic allocation does not work out of the box > for Spark Streaming is that the heuristics used to decide when to > scale up/down are not the right ones for

Re: Spark Packages Configuration Not Found

2015-11-11 Thread Burak Yavuz
Hi Jakob, > As another, general question, are spark packages the go-to way of extending spark functionality? Definitely. There are ~150 Spark Packages out there on spark-packages.org. I use a lot of them in everyday Spark work. The number of released packages has steadily increased over
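For anyone new to the ecosystem, packages are typically pulled in with the --packages flag; the coordinate below is one example from spark-packages.org (the version is illustrative):

```
spark-shell --packages com.databricks:spark-csv_2.10:1.2.0
```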

Re: Creating new Spark context when running in Secure YARN fails

2015-11-11 Thread Michael V Le
It looks like my config does not have "spark.yarn.credentials.file". I executed: sc._conf.getAll() [(u'spark.ssl.keyStore', u'xxx.keystore'), (u'spark.eventLog.enabled', u'true'), (u'spark.ssl.keyStorePassword', u'XXX'), (u'spark.yarn.principal', u'XXX'), (u'spark.master', u'yarn-client'),

How can you sort wordcounts by counts in stateful_network_wordcount.py example

2015-11-11 Thread Amir Rahnama
Hey, Does anybody know how one can sort the result in the stateful example? Python would be preferred. https://github.com/apache/spark/blob/859dff56eb0f8c63c86e7e900a12340c199e6247/examples/src/main/python/streaming/stateful_network_wordcount.py -- Thanks and Regards, Amir Hossein Rahnama *Tel:

Re: Creating new Spark context when running in Secure YARN fails

2015-11-11 Thread Ted Yu
Please take a look at yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala where this config is described Cheers On Wed, Nov 11, 2015 at 1:45 PM, Michael V Le wrote: > It looks like my config does not have "spark.yarn.credentials.file". > > I

Upgrading Spark in EC2 clusters

2015-11-11 Thread Augustus Hong
Hey All, I have a Spark cluster(running version 1.5.0) on EC2 launched with the provided spark-ec2 scripts. If I want to upgrade Spark to 1.5.2 in the same cluster, what's the safest / recommended way to do that? I know I can spin up a new cluster running 1.5.2, but it doesn't seem efficient to

Status of 2.11 support?

2015-11-11 Thread shajra-cogscale
Hi, My company isn't using Spark in production yet, but we are using a bit of Scala. There are a few people who have wanted to be conservative and keep our Scala at 2.10 in the event we start using Spark. There are others who want to move to 2.11, with the idea that by the time we're using Spark

Re: Status of 2.11 support?

2015-11-11 Thread Ted Yu
For #1, the published jars are usable. However, you should build from source for your specific combination of profiles. Cheers On Wed, Nov 11, 2015 at 3:22 PM, shajra-cogscale wrote: > Hi, > > My company isn't using Spark in production yet, but we are using a bit of

Re: How can you sort wordcounts by counts in stateful_network_wordcount.py example

2015-11-11 Thread ayan guha
how about this? sorted = running_counts.map(lambda t: (t[1], t[0])).sortByKey() Basically, swap key and value of the RDD and then sort? On Thu, Nov 12, 2015 at 8:53 AM, Amir Rahnama wrote: > Hey, > > Does anybody know how one can sort the result in the stateful example? > >