unsubscribe

2020-06-28 Thread stephen

Spark Training Scripts (from Strata 09/13) Ec2 Deployment scripts having errors

2014-03-08 Thread Stephen Boesch
The spark-training scripts are not presently working 100%: the errors displayed when starting the slaves are shown below. Possibly a newer location for the files exists (I pulled from https://github.com/amplab/training-scripts and it is nearly 6 months old) cp: cannot create regular file

Re: Spark Training Scripts (from Strata 09/13) Ec2 Deployment scripts having errors

2014-03-08 Thread Stephen Boesch
: link_stat /root/mesos-ec2 failed: No such file or directory (2) But in this latest version the mesos errors appear not to be fatal: the cluster is in the process of coming up (copying wikipedia data now..) . 2014-03-08 6:26 GMT-08:00 Stephen Boesch java...@gmail.com: The spark-training scripts

Optimal Server Design for Spark

2014-04-02 Thread Stephen Watt
Hi Folks I'm looking to buy some gear to run Spark. I'm quite well versed in Hadoop Server design but there does not seem to be much Spark related collateral around infrastructure guidelines (or at least I haven't been able to find them). My current thinking for server design is something

Invoke spark-shell without attempting to start the http server

2014-05-02 Thread Stephen Boesch
We have a spark server already running. When invoking spark-shell a new http server is attempted to be started spark.HttpServer: Starting HTTP Server But that attempt results in a BindException due to the preexisting server: java.net.BindException: Address already in use What is the

Re: How to use spark-submit

2014-05-11 Thread Stephen Boesch
of spark-submit. Thanks On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com wrote: I have a spark streaming application that uses the external streaming modules (e.g. kafka, mqtt, ..) as well. It is not clear how to properly invoke the spark-submit script: what

Re: How to use spark-submit

2014-05-12 Thread Stephen Boesch
@Sonal - makes sense. Is the maven shade plugin runnable within sbt? If so would you care to share those build.sbt (or .scala) lines? If not, are you aware of a similar plugin for sbt? 2014-05-11 23:53 GMT-07:00 Sonal Goyal sonalgoy...@gmail.com: Hi Stephen, I am using maven shade

Re: Equivalent of collect() on DStream

2014-05-15 Thread Stephen Boesch
It seems the concept I had been missing is to invoke the DStream foreach method. This method takes a function expecting an RDD and applies the function to each RDD within the DStream. 2014-05-14 21:33 GMT-07:00 Stephen Boesch java...@gmail.com: Looking further it appears the functionality I
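
A minimal sketch of that pattern, using the foreachRDD name the method carries in later releases; the socket source and batch interval here are assumptions, not taken from the thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("collect-per-batch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // hypothetical source; any DStream behaves the same way
    val lines = ssc.socketTextStream("localhost", 9999)

    // the supplied function is applied to each RDD (micro-batch) in the DStream,
    // so the per-batch collect() happens on the driver
    lines.foreachRDD { rdd =>
      val batch = rdd.collect()
      println(s"collected ${batch.length} records in this batch")
    }

    ssc.start()
    ssc.awaitTermination()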

Re: Equivalent of collect() on DStream

2014-05-15 Thread Stephen Boesch
) = Unit* ) extends DStream[Unit](parent.ssc) { I would like to have access to this structure - particularly the ability to define a foreachFunc that gets applied to each RDD within the DStream. Is there a means to do so? 2014-05-14 21:25 GMT-07:00 Stephen Boesch java...@gmail.com: Given

Re: Express VMs - good idea?

2014-05-16 Thread Stephen Boesch
Hi Marco, Hive itself is not working in the CDH5.0 VM (due to FNFE's on the third party jars). While you did not mention using Shark, you may keep that in mind. I will try out spark-only commands late today and report what I find. 2014-05-14 5:00 GMT-07:00 Marco Shaw marco.s...@gmail.com:

Re: How to Run Machine Learning Examples

2014-05-22 Thread Stephen Boesch
There is a bin/run-example.sh example-class [args] 2014-05-22 12:48 GMT-07:00 yxzhao yxz...@ualr.edu: I want to run the LR, SVM, and NaiveBayes algorithms implemented in the following directory on my data set. But I did not find the sample command line to run them. Anybody help? Thanks.

Sources for kafka-0.7.2-spark

2014-05-23 Thread Stephen Boesch
We are using a back version of spark (0.8.1) that depends on a customized version of kafka 0.7.2-spark. Where are the sources for it - either svn/github or simply the sources..jar For reference here is the maven repo location for the binaries:

Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-05-29 Thread Stephen Boesch
The MergeStrategy combined with sbt assembly did work for me. This is not painless: some trial and error and the assembly may take multiple minutes. You will likely want to filter out some additional classes from the generated jar file. Here is an SOF answer to explain that and with IMHO the
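
A minimal build.sbt sketch of that approach, assuming the sbt-assembly plugin is already applied; the setting key differs across sbt-assembly versions, and which rules to use depends on your dependency tree:

    // merge/filter rules for the uberjar (sbt-assembly)
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard // drop duplicate manifests/signatures
      case _                             => MergeStrategy.first   // keep the first copy of anything else
    }

    // keep Spark itself out of the assembly; the cluster already provides it
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"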

HBase 0.96+ with Spark 1.0+

2014-06-27 Thread Stephen Boesch
The present trunk is built and tested against HBase 0.94. I have tried various combinations of versions of HBase 0.96+ and Spark 1.0+ and all end up with 14/06/27 20:11:15 INFO HttpServer: Starting HTTP Server [error] (run-main-0) java.lang.SecurityException: class

Re: HBase 0.96+ with Spark 1.0+

2014-06-28 Thread Stephen Boesch
GMT-07:00 Sean Owen so...@cloudera.com: This sounds like an instance of roughly the same item as in https://issues.apache.org/jira/browse/SPARK-1949 Have a look at adding that exclude to see if it works. On Fri, Jun 27, 2014 at 10:21 PM, Stephen Boesch java...@gmail.com wrote: The present

Re: Potential bugs in SparkSQL

2014-07-10 Thread Stephen Boesch
Hi Jerry, To add to your question: The following does work (from master) - notice that registerAsTable is commented out (I took the liberty of adding the order by clause): val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) import hiveContext._ hql(USE test) // hql(select id from
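
For reference, a cleaned-up sketch of that sequence (Spark 1.0-era API, where HiveContext exposes hql(); the table name is a placeholder):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import hiveContext._

    hql("USE test")
    // registerAsTable is not needed for tables already known to the metastore
    val rows = hql("SELECT id FROM my_table ORDER BY id")
    rows.collect().foreach(println)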

Re: Error while running example/scala application using spark-submit

2014-07-17 Thread Stephen Boesch
Hi Sean RE: Windows and hadoop 2.4.x HortonWorks - all the hype aside - only supports Windows Server 2008/2012. So this general concept of supporting Windows is bunk. Given that - and since the vast majority of Windows users do not happen to have Windows Server on their laptop - do you have any

Re: What does @developerApi means?

2014-07-20 Thread Stephen Boesch
The javaDoc seems reasonably helpful: /** * A lower-level, unstable API intended for developers. * * Developer API's might change or be removed in minor versions of Spark. * */ These would be contrasted with non-Developer (more or less production?) API's that are deemed to be stable within a

Implicit conversion RDD - SchemaRDD

2014-10-02 Thread Stephen Boesch
I am noticing disparities in behavior between the REPL and my standalone program in terms of implicit conversion of an RDD to SchemaRDD. In the REPL the following sequence works: import sqlContext._ val mySchemaRDD = myNormalRDD.where(1=1) However when attempting similar in a standalone
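
A minimal standalone sketch of what usually makes the implicit conversion available outside the REPL (Spark 1.x SchemaRDD API): the case class defined at the top level and the sqlContext implicits imported after the context is created. Class, table and column names here are placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    // must be top-level, not nested inside the method that uses it
    case class Record(col1: Int, col2: String)

    object SchemaRddExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "schema-rdd-example")
        val sqlContext = new SQLContext(sc)
        import sqlContext._   // brings createSchemaRDD (RDD of case classes => SchemaRDD) into scope

        val myNormalRDD = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
        myNormalRDD.registerTempTable("records")   // compiles only because of the implicit conversion
        sqlContext.sql("SELECT col1, col2 FROM records WHERE 1=1").collect().foreach(println)
      }
    }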

Re: Implicit conversion RDD - SchemaRDD

2014-10-02 Thread Stephen Boesch
at SchemaRDD.scala:102 == Query Plan == == Physical Plan == Filter 1=1 ExistingRdd [col1#8,col2#9], MapPartitionsRDD[27] at mapPartitions at basicOperators.scala:219 So .. what is the magic formula for setting up the imports for the SchemaRDD imports to work properly? 2014-10-02 2:00 GMT-07:00 Stephen

Setup/Cleanup for RDD closures?

2014-10-02 Thread Stephen Boesch
Consider there is some connection / external resource allocation required to be accessed/mutated by each of the rows from within a single worker thread. That connection should only be opened/closed before the first row is accessed / after the last row is completed. It is my understanding that
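
One common way to get that per-partition setup/teardown is mapPartitions, which hands the closure the whole iterator for a partition. A minimal sketch, with Resource standing in for whatever connection is involved:

    import org.apache.spark.SparkContext

    // hypothetical stand-in for a DB connection, socket, etc.
    class Resource {
      def lookup(x: Int): Int = x * 2
      def close(): Unit = ()
    }

    def perPartitionResource(sc: SparkContext): Array[Int] = {
      val rdd = sc.parallelize(1 to 100, numSlices = 4)
      rdd.mapPartitions { iter =>
        val res = new Resource()              // setup: once per partition, before the first row
        val out = iter.map(res.lookup).toList // materialize before closing the resource
        res.close()                           // cleanup: after the last row of the partition
        out.iterator
      }.collect()
    }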

Building pyspark with maven?

2014-10-08 Thread Stephen Boesch
The build instructions for pyspark appear to be: sbt/sbt assembly Given that maven is the preferred build tool since July 1, presumably I have overlooked the instructions for building via maven? Could anyone please point them out? Thanks

Re: Building pyspark with maven?

2014-10-08 Thread Stephen Boesch
/classes to the python module search path. 2014-10-08 14:01 GMT-07:00 Stephen Boesch java...@gmail.com: The build instructions for pyspark appear to be: sbt/sbt assembly Given that maven is the preferred build tool since July 1, presumably I have overlooked the instructions for building via

Re: distributing Scala Map datatypes to RDD

2014-10-13 Thread Stephen Boesch
is the following what you are looking for? scala sc.parallelize(myMap.map{ case (k,v) => (k,v) }.toSeq) res2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at console:21 2014-10-13 14:02 GMT-07:00 jon.g.massey jon.g.mas...@gmail.com: Hi guys, Just
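
The same idea, slightly expanded (assuming sc from the shell); a Map's toSeq already yields the (key, value) pairs, so the extra map over the entries is optional:

    val myMap = Map("a" -> 1, "b" -> 2, "c" -> 3)
    val pairs: org.apache.spark.rdd.RDD[(String, Int)] = sc.parallelize(myMap.toSeq)
    pairs.reduceByKey(_ + _).collect()   // pair-RDD operations work as usual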

NoClassDefFoundError on ThreadFactoryBuilder in Intellij

2014-10-23 Thread Stephen Boesch
After having checked out from master/head the following error occurs when attempting to run any test in Intellij Exception in thread main java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder at org.apache.spark.util.Utils$.init(Utils.scala:648) There appears to

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-26 Thread Stephen Boesch
Yes it is necessary to do a mvn clean when encountering this issue. Typically you would have changed one or more of the profiles/options - which leads to this occurring. 2014-10-22 22:00 GMT-07:00 Ryan Williams ryan.blake.willi...@gmail.com: I started building Spark / running Spark tests this

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-26 Thread Stephen Boesch
; if anyone can confirm whether they've seen it on Linux, that would be good to know. Stephen: good to know re: profiles/options. I don't think changing them is a necessary condition as I believe I've run into it without doing that, but any set of steps to reproduce this would be welcome so that we

How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Stephen Boesch
I seem to recall there were some specific requirements on how to import the implicits. Here is the issue: scala import org.apache.spark.mllib.rdd.RDDFunctions._ console:10: error: object RDDFunctions in package rdd cannot be accessed in package org.apache.spark.mllib.rdd import

Re: How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Stephen Boesch
be called by function in mllib. 2014-10-28 17:09 GMT+08:00 Stephen Boesch java...@gmail.com: I seem to recall there were some specific requirements on how to import the implicits. Here is the issue: scala import org.apache.spark.mllib.rdd.RDDFunctions._ console:10: error: object RDDFunctions

Re: NoClassDefFoundError on ThreadFactoryBuilder in Intellij

2014-10-28 Thread Stephen Boesch
at 2:13 PM, Stephen Boesch java...@gmail.com wrote: After having checked out from master/head the following error occurs when attempting to run any test in Intellij Exception in thread main java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder

Re: NoClassDefFoundError on ThreadFactoryBuilder in Intellij

2014-10-28 Thread Stephen Boesch
I have checked out from master, cleaned/rebuilt on command line in maven, then cleaned/rebuilt in intellij many times. This error persists through it all. Anyone have a solution? 2014-10-23 1:43 GMT-07:00 Stephen Boesch java...@gmail.com: After having checked out from master/head

Returned type of Broadcast variable is byte array

2014-10-30 Thread Stephen Boesch
As a template for creating a broadcast variable, the following code snippet within mllib was used: val bcIdf = dataset.context.broadcast(idf) dataset.mapPartitions { iter => val thisIdf = bcIdf.value The new code follows that model: import org.apache.spark.mllib.linalg.{Vector =
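
A minimal sketch of that broadcast pattern, assuming sc is a live SparkContext: the value handed to broadcast() comes back from .value with its original type inside the closure:

    val lookup   = Map(1 -> "a", 2 -> "b")
    val bcLookup = sc.broadcast(lookup)

    val decoded = sc.parallelize(Seq(1, 2, 1)).map { k =>
      bcLookup.value.getOrElse(k, "?")   // .value is the original Map, not a serialized byte array
    }
    decoded.collect()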

Re: Returned type of Broadcast variable is byte array

2014-10-30 Thread Stephen Boesch
= sc.broadcast(crows) .. val arrayVect = bcRows.value 2014-10-30 7:42 GMT-07:00 Stephen Boesch java...@gmail.com: As a template for creating a broadcast variable, the following code snippet within mllib was used: val bcIdf = dataset.context.broadcast(idf) dataset.mapPartitions

Using Intellij for pyspark

2014-12-10 Thread Stephen Boesch
Anyone have luck with this? An issue encountered is handling multiple languages - python, java, scala - within one module: it is unclear how to select two module SDKs. Both Python and Scala facets were added to the spark-parent module. But when the Project level SDK is not set to Python then the

GraphX for large scale PageRank (~4 billion nodes, ~128 billion edges)

2014-12-12 Thread Stephen Merity
) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) -- Regards, Stephen Merity Data Scientist @ Common Crawl

GraphX for large scale PageRank (~4 billion nodes, ~128 billion edges)

2014-12-12 Thread Stephen Merity
Hi! tldr; We're looking at potentially using Spark+GraphX to compute PageRank over a 4 billion node + 128 billion edge graph on a regular (monthly) basis, possibly growing larger in size over time. If anyone has hints / tips / upcoming optimizations I should test use (or wants to contribute --

sbt assembly with hive

2014-12-12 Thread Stephen Boesch
What is the proper way to build with hive from sbt? The SPARK_HIVE is deprecated. However after running the following: sbt -Pyarn -Phadoop-2.3 -Phive assembly/assembly And then bin/pyspark hivectx = HiveContext(sc) hivectx.hiveql(select * from my_table) Exception: (You must build

Re: Using the DataStax Cassandra Connector from PySpark

2014-12-26 Thread Stephen Boesch
Did you receive any response on this? I am trying to load hbase classes and getting the same error py4j.protocol.Py4JError: Trying to call a package. Even though the $HBASE_HOME/lib/* had already been added to the compute-classpath.sh 2014-10-21 16:02 GMT-07:00 Mike Sukmanowsky

recent join/iterator fix

2014-12-28 Thread Stephen Haberman
/best practice for cogroup code? Thanks, Stephen

Re: recent join/iterator fix

2014-12-29 Thread Stephen Haberman
be safer in this regard, but I don't understand the nuances yet. - Stephen

Re: recent join/iterator fix

2014-12-29 Thread Stephen Haberman
Hi Shixiong, The Iterable from cogroup is CompactBuffer, which is already materialized. It's not a lazy Iterable. So now Spark cannot handle skewed data that some key has too many values that cannot be fit into the memory. Cool, thanks for the confirmation. - Stephen

Re: spark-shell working in scala-2.11 (breaking change?)

2015-01-30 Thread Stephen Haberman
like a breaking change to the spark.eventLog.dir config property. Perhaps it should be patched to convert the previously supported just a file path values to HDFS-compatible file://... URIs for backwards compatibility? - Stephen On Wed, 28 Jan 2015 12:27:17 -0800 Krishna Sankar ksanka

spark-shell working in scala-2.11

2015-01-28 Thread Stephen Haberman
? It is possible I did something dumb while compiling master, but I'm not sure what it would be. Thanks, Stephen

Re: groupByKey is not working

2015-01-30 Thread Stephen Boesch
Amit - IJ will not find it until you add the import as Sean mentioned. It includes implicits that intellij will not know about otherwise. 2015-01-30 12:44 GMT-08:00 Amit Behera amit.bd...@gmail.com: I am sorry Sean. I am developing code in intelliJ Idea. so with the above dependencies I am

Re: Cheapest way to materialize an RDD?

2015-01-30 Thread Stephen Boesch
Theoretically your approach would require less overhead - i.e. a collect on the driver is not required as the last step. But maybe the difference is small and that particular path may or may not have been properly optimized vs the count(). Do you have a biggish data set to compare the timings?
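
For concreteness, the two variants being compared, assuming rdd is the dataset to materialize:

    rdd.cache()
    rdd.count()                   // forces evaluation, but also ships a count back to the driver
    rdd.foreachPartition(_ => ()) // forces evaluation with nothing returned to the driver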

Re: spark-shell working in scala-2.11 (breaking change?)

2015-01-31 Thread Stephen Haberman
Looking at https://github.com/apache/spark/pull/1222/files , the following change may have caused what Stephen described: + if (!fileSystem.isDirectory(new Path(logBaseDir))) { When there is no schema associated with logBaseDir, local path should be assumed. Yes, that looks right

Example of partitionBy in pyspark

2015-03-11 Thread Stephen Boesch
I am finding that partitionBy is hanging - and it is not clear whether the custom partitioner is even being invoked (I put an exception in there and cannot see it in the worker logs). The structure is similar to the following: inputPairedRdd = sc.parallelize([{0:Entry1,1,Entry2}]) def

Re: Performance tuning in Spark SQL.

2015-03-02 Thread Stephen Boesch
You have sent four questions that are very general in nature. They might be better answered if you googled for those topics: there is a wealth of materials available. 2015-03-02 2:01 GMT-08:00 dubey_a abhishek.du...@xoriant.com: What are the ways to tune query performance in Spark SQL? --

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Stephen Boesch
into the directory structure boolean recursive = job.getBoolean(INPUT_DIR_RECURSIVE, false); where: public static final String INPUT_DIR_RECURSIVE = mapreduce.input.fileinputformat.input.dir.recursive; FYI On Tue, Mar 3, 2015 at 3:14 PM, Stephen Boesch java...@gmail.com wrote

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Stephen Boesch
The sc.textFile() invokes the Hadoop FileInputFormat via the (subclass) TextInputFormat. Inside, the logic does exist to do recursive directory reading - i.e. first detecting whether an entry is a directory and, if so, descending: for (FileStatus
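
A minimal sketch of setting that flag explicitly before reading, assuming sc is a live SparkContext; the path is a placeholder for a root directory with nested subdirectories:

    sc.hadoopConfiguration
      .set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

    val lines = sc.textFile("/data/toplevel")   // now descends into nested directories
    println(lines.count())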

Re: Spark code development practice

2015-03-05 Thread Stephen Boesch
Hi Xi, yes, you can do the following: val sc = new SparkContext("local[2]", "mptest") // or .. val sc = new SparkContext("spark://master:7070", "mptest") val fileDataRdd = sc.textFile("/path/to/dir") val fileLines = fileDataRdd.take(100) The key here - i.e. the answer to your specific question -

SchemaRDD/DataFrame result partitioned according to the underlying datasource partitions

2015-03-23 Thread Stephen Boesch
Is there a way to take advantage of the underlying datasource partitions when generating a DataFrame/SchemaRDD via catalyst? It seems from the sql module that the only options are RangePartitioner and HashPartitioner - and further that those are selected automatically by the code . It was not

Spark on Mesos

2015-04-24 Thread Stephen Carman
So I can’t for the life of me get something even simple working for Spark on Mesos. I installed a 3 master, 3 slave mesos cluster, which is all configured, but I can’t for the life of me even get the spark shell to work properly. I get errors like this org.apache.spark.SparkException: Job

Re: Spark on Mesos

2015-04-27 Thread Stephen Carman
So I installed spark on each of the slaves 1.3.1 built with hadoop2.6 I just basically got the pre-built from the spark website… I placed those compiled spark installs on each slave at /opt/spark My spark properties seem to be getting picked up on my side fine…

Re: JAVA for SPARK certification

2015-05-05 Thread Stephen Boesch
There are questions in all three languages. 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com: I too have similar question. My understanding is since Spark written in scala, having done in Scala will be ok for certification. If someone who has done certification can confirm.

Re: Building Spark

2015-05-13 Thread Stephen Boesch
Hi Akhil, Building with sbt tends to need around 3.5GB whereas maven requirements are much lower, around 1.7GB. So try using maven. For reference I have the following settings and both do compile. sbt would not work with lower values. $echo $SBT_OPTS -Xmx3012m -XX:MaxPermSize=512m

RE: swap tuple

2015-05-14 Thread Stephen Carman
Yea, I wouldn't try to modify the current one since RDDs are supposed to be immutable, just create a new one... val newRdd = oldRdd.map(r => (r._2, r._1)) or something of that nature... Steve From: Evo Eftimov [evo.efti...@isecc.com] Sent: Thursday, May 14, 2015

Re: Spark on Mesos

2015-05-13 Thread Stephen Carman
didn’t exist there. When run as root, it ran totally fine with no problems whatsoever. Hopefully this works for you too, Steve On May 13, 2015, at 11:45 AM, Sander van Dijk sgvand...@gmail.com wrote: Hey all, I seem to be experiencing the same thing as Stephen. I run Spark 1.2.1 with Mesos

Re: Spark on Mesos

2015-05-13 Thread Stephen Carman
Yup, exactly as Tim mentioned on it too. I went back and tried what you just suggested and that was also perfectly fine. Steve On May 13, 2015, at 1:58 PM, Tim Chen t...@mesosphere.iomailto:t...@mesosphere.io wrote: Hi Stephen, You probably didn't run the Spark driver/shell as root, as Mesos

Re: Spark on Windows

2015-04-16 Thread Stephen Boesch
The hadoop support from HortonWorks only *actually* works with Windows Server - well at least as of Spark Summit last year: and AFAIK that has not changed since 2015-04-16 15:18 GMT-07:00 Dean Wampler deanwamp...@gmail.com: If you're running Hadoop, too, now that Hortonworks supports Spark,

Spark Memory Utilities

2015-04-03 Thread Stephen Carman
they can be made public inside the library or have some interface to them such that children classes can make use of them? Thanks, Stephen Carman, M.S. AI Engineer, Coldlight Solutions, LLC Cell - 267 240 0363 This e-mail is intended solely for the above-mentioned recipient and it may contain

Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-20 Thread Stephen Boesch
What conditions would cause the following delays / failure for a standalone machine/cluster to have the Worker contact the Master? 15/05/20 02:02:53 INFO WorkerWebUI: Started WorkerWebUI at http://10.0.0.3:8081 15/05/20 02:02:53 INFO Worker: Connecting to master

Linear Regression with SGD

2015-06-09 Thread Stephen Carman
Hi User group, We are using spark Linear Regression with SGD as the optimization technique and we are achieving very sub-optimal results. Can anyone shed some light on why this implementation seems to produce such poor results vs our own implementation? We are using a very small dataset, but

Re: Velox Model Server

2015-06-21 Thread Stephen Boesch
Oryx 2 has a scala client https://github.com/OryxProject/oryx/blob/master/framework/oryx-api/src/main/scala/com/cloudera/oryx/api/ 2015-06-20 11:39 GMT-07:00 Debasish Das debasish.da...@gmail.com: After getting used to Scala, writing Java is too much work :-) I am looking for scala based

Spark 1.3.1 bundle does not build - unresolved dependency

2015-06-01 Thread Stephen Boesch
I downloaded the 1.3.1 distro tarball $ll ../spark-1.3.1.tar.gz -rw-r-@ 1 steve staff 8500861 Apr 23 09:58 ../spark-1.3.1.tar.gz However the build on it is failing with an unresolved dependency: *configuration not public* $ build/sbt assembly -Dhadoop.version=2.5.2 -Pyarn -Phadoop-2.4

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Stephen Boesch
(the same btw applies for the Node where you run the driver app – all other nodes must be able to resolve its name) *From:* Stephen Boesch [mailto:java...@gmail.com] *Sent:* Wednesday, May 20, 2015 10:07 AM *To:* user *Subject:* Intermittent difficulties for Worker to contact Master on same

Where does partitioning and data loading happen?

2015-05-27 Thread Stephen Carman
A colleague and I were having a discussion and we were disagreeing about something in Spark/Mesos that perhaps someone can shed some light into. We have a mesos cluster that runs spark via a sparkHome, rather than downloading an executable and such. My colleague says that say we have parquet

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Stephen Boesch
TestRunner: power-iteration-clustering 8 512.0 MB 2015/05/27 12:44:03 steve FINISHED 6 s app-20150527123822- TestRunner: power-iteration-clustering 8 512.0 MB 2015/05/27 12:38:22 steve FINISHED 6 s 2015-05-27 11:42 GMT-07:00 Stephen Boesch java...@gmail.com: Thanks Yana, My current

Re: What does Spark is not just MapReduce mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Stephen Boesch
Vanilla map/reduce does not expose it: but hive on top of map/reduce has superior partitioning (and bucketing) support to Spark. 2015-06-28 13:44 GMT-07:00 Koert Kuipers ko...@tresata.com: spark is partitioner aware, so it can exploit a situation where 2 datasets are partitioned the same way

Re: Code error

2015-05-19 Thread Stephen Boesch
Hi Ricardo, providing the error output would help. But in any case you need to do a collect() on the rdd returned from computeCost. 2015-05-19 11:59 GMT-07:00 Ricardo Goncalves da Silva ricardog.si...@telefonica.com: Hi, Can anybody see what’s wrong in this piece of code:

Spark on scala 2.11 build fails due to incorrect jline dependency in REPL

2015-08-16 Thread Stephen Boesch
I am building spark with the following options - most notably the **scala-2.11**: . dev/switch-to-scala-2.11.sh mvn -Phive -Pyarn -Phadoop-2.6 -Dhadoop2.6.2 -Pscala-2.11 -DskipTests -Dmaven.javadoc.skip=true clean package The build goes pretty far but fails in one of the minor modules

Re: Spark on scala 2.11 build fails due to incorrect jline dependency in REPL

2015-08-17 Thread Stephen Boesch
. FYI On Sun, Aug 16, 2015 at 11:12 AM, Stephen Boesch java...@gmail.com wrote: I am building spark with the following options - most notably the **scala-2.11**: . dev/switch-to-scala-2.11.sh mvn -Phive -Pyarn -Phadoop-2.6 -Dhadoop2.6.2 -Pscala-2.11 -DskipTests -Dmaven.javadoc.skip

Spark-submit not finding main class and the error reflects different path to jar file than specified

2015-08-06 Thread Stephen Boesch
Given the following command line to spark-submit: bin/spark-submit --verbose --master local[2] --class org.yardstick.spark.SparkCoreRDDBenchmark /shared/ysgood/target/yardstick-spark-uber-0.0.1.jar Here is the output: NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes

Retrieving offsets from previous spark streaming checkpoint

2015-08-13 Thread Stephen Durfey
When deploying a spark streaming application I want to be able to retrieve the latest kafka offsets that were processed by the pipeline, and create my kafka direct streams from those offsets. Because the checkpoint directory isn't guaranteed to be compatible between job deployments, I don't want

Re: Error: Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

2015-08-14 Thread Stephen Boesch
The NoClassDefFoundError differs from ClassNotFoundException: it indicates an error while initializing that class, but the class is found on the classpath. Please provide the full stack trace. 2015-08-14 4:59 GMT-07:00 stelsavva stel...@avocarrot.com: Hello, I am just starting out with

Which directory contains third party libraries for Spark

2015-07-27 Thread Stephen Boesch
when using spark-submit: which directory contains third party libraries that will be loaded on each of the slaves? I would like to scp one or more libraries to each of the slaves instead of shipping the contents in the application uber-jar. Note: I did try adding to $SPARK_HOME/lib_managed/jars.

Re: spark benchmarking

2015-07-08 Thread Stephen Boesch
One option is the databricks/spark-perf project https://github.com/databricks/spark-perf 2015-07-08 11:23 GMT-07:00 MrAsanjar . afsan...@gmail.com: Hi all, What is the most common used tool/product to benchmark spark job?

Catalyst Errors when building spark from trunk

2015-07-07 Thread Stephen Boesch
The following errors are occurring upon building using mvn options clean package Are there some requirements/restrictions on profiles/settings for catalyst to build properly? [error] /shared/sparkup2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala:138: value

Re: How to restrict java unit tests from the maven command line

2015-09-10 Thread Stephen Boesch
Yes, adding that flag does the trick. thanks. 2015-09-10 13:47 GMT-07:00 Sean Owen <so...@cloudera.com>: > -Dtest=none ? > > > https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningIndividualTests > > On Thu, Sep 10, 2015 at

How to restrict java unit tests from the maven command line

2015-09-10 Thread Stephen Boesch
I have invoked mvn test with the -DwildcardSuites option to specify a single BinarizerSuite scalatest suite. The command line is mvn -pl mllib -Pyarn -Phadoop-2.6 -Dhadoop2.7.1 -Dscala-2.11 -Dmaven.javadoc.skip=true -DwildcardSuites=org.apache.spark.ml.feature.BinarizerSuite test The scala

UnknownHostException with Mesos and custom Jar

2015-09-28 Thread Stephen Hankinson
n. 15/09/28 20:07:46 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 15/09/28 20:07:46 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down. Any thoughts? Thanks Stephen

Re: Examples module not building in intellij

2015-10-04 Thread Stephen Boesch
ust the last thing that happens to > fail. > > On Sun, Oct 4, 2015 at 7:06 AM, Stephen Boesch <java...@gmail.com> wrote: > > > > For a week or two the trunk has not been building for the examples module > > within intellij. The other modules - including core, sql,

Examples module not building in intellij

2015-10-04 Thread Stephen Boesch
For a week or two the trunk has not been building for the examples module within intellij. The other modules - including core, sql, mllib, etc *are * working. A portion of the error message is "Unable to get dependency information: Unable to read the metadata file for artifact

Re: Breakpoints not hit with Scalatest + intelliJ

2015-09-18 Thread Stephen Boesch
Hi Michel, please try local[1] and report back if the breakpoint were hit. 2015-09-18 7:37 GMT-07:00 Michel Lemay : > Hi, > > I'm adding unit tests to some utility functions that are using > SparkContext but I'm unable to debug code and hit breakpoints when running > under

Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Stephen Boesch
@Yu Fengdong: Your approach - specifically the groupBy - results in a shuffle, does it not? 2015-12-04 2:02 GMT-08:00 Fengdong Yu : > There are many ways, one simple is: > > such as: you want to know how many rows for each month: > > >

Re: Scala VS Java VS Python

2015-12-16 Thread Stephen Boesch
There are solid reasons to have built spark on the jvm vs python. The question for Daniel appears to be at this point scala vs java8. For that there are many comparisons already available: but in the case of working with spark there is the additional benefit for the scala side that the core

Re: Recommendations using Spark

2016-01-07 Thread Stephen Boesch
Alternating least squares takes an RDD of (user/product/ratings) tuples and the resulting Model provides predict(user, product) or predictProducts methods among others.
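
A minimal mllib sketch of that flow; the ratings are toy values and rank/iterations are arbitrary:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.parallelize(Seq(
      Rating(1, 10, 4.0),   // (user, product, rating)
      Rating(1, 11, 2.0),
      Rating(2, 10, 5.0)))

    val model = ALS.train(ratings, 10, 5)   // rank = 10, iterations = 5
    val score = model.predict(2, 11)        // predicted rating for one (user, product) pair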

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Stephen Boesch
The postgres jdbc driver needs to be added to the classpath of your spark workers. You can do a search for how to do that (multiple ways). 2015-12-22 17:22 GMT-08:00 b2k70 : > I see in the Spark SQL documentation that a temporary table can be created > directly onto a
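
One way to do that (a sketch, not the only option): place the driver jar at the same local path on every node and point both driver and executor classpaths at it, or hand the jar to spark-submit with --jars. The jar path below is a placeholder:

    val conf = new org.apache.spark.SparkConf()
      // placeholder path; the jar must exist at this location on every node
      .set("spark.driver.extraClassPath",   "/opt/jars/postgresql-9.4-1201.jdbc41.jar")
      .set("spark.executor.extraClassPath", "/opt/jars/postgresql-9.4-1201.jdbc41.jar")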

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Stephen Boesch
HI Benjamin, yes by adding to the thrift server then the create table would work. But querying is performed by the workers: so you need to add to the classpath of all nodes for reads to work. 2015-12-22 18:35 GMT-08:00 Benjamin Kim <bbuil...@gmail.com>: > Hi Stephen, > > I fo

Re: Spark 1.6 Build

2015-11-24 Thread Stephen Boesch
r.t. building locally, please specify -Pscala-2.11 > > Cheers > > On Tue, Nov 24, 2015 at 9:58 AM, Stephen Boesch <java...@gmail.com> wrote: > >> HI Madabhattula >> Scala 2.11 requires building from source. Prebuilt binaries are >> available only for sca

Re: Spark 1.6 Build

2015-11-24 Thread Stephen Boesch
Hi Madabhattula, Scala 2.11 requires building from source. Prebuilt binaries are available only for scala 2.10. From the src folder: dev/change-scala-version.sh 2.11 Then build as you would normally, either from mvn or sbt. The above info *is* included in the spark docs but a little hard

Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
The following works against a hive table from spark sql hc.sql("select id,r from (select id, name, rank() over (order by name) as r from tt2) v where v.r >= 1 and v.r <= 12") But when using a standard sql context against a temporary table the following occurs: Exception in thread "main"

Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
Checked out 1.6.0-SNAPSHOT 60 minutes ago 2015-11-18 19:19 GMT-08:00 Jack Yang <j...@uow.edu.au>: > Which version of spark are you using? > > > > *From:* Stephen Boesch [mailto:java...@gmail.com] > *Sent:* Thursday, 19 November 2015 2:12 PM > *To:* user > *Su

Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
But to focus the attention properly: I had already tried out 1.5.2. 2015-11-18 19:46 GMT-08:00 Stephen Boesch <java...@gmail.com>: > Checked out 1.6.0-SNAPSHOT 60 minutes ago > > 2015-11-18 19:19 GMT-08:00 Jack Yang <j...@uow.edu.au>: > >> Which version of spark ar

Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
Why is the same query (and actually i tried several variations) working against a hivecontext and not against the sql context? 2015-11-18 19:57 GMT-08:00 Michael Armbrust <mich...@databricks.com>: > Yes they do. > > On Wed, Nov 18, 2015 at 7:49 PM, Stephen Boesch <java..

Re: Spark-SQL idiomatic way of adding a new partition or writing to Partitioned Persistent Table

2015-11-22 Thread Stephen Boesch
>> and then use the Hive's dynamic partitioned insert syntax What does this entail? Same sql but you need to do set hive.exec.dynamic.partition = true; in the hive/sql context (along with several other related dynamic partition settings.) Is there anything else/special
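
A rough sketch of the settings and insert being referred to, issued through a HiveContext (table and column names are placeholders):

    hiveContext.sql("SET hive.exec.dynamic.partition = true")
    hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // the partition column (dt) is taken from the SELECT rather than a literal value
    hiveContext.sql(
      "INSERT OVERWRITE TABLE events PARTITION (dt) " +
      "SELECT id, payload, dt FROM staging_events")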

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Stephen Boesch
ooc are the tables partitioned on a.pk and b.fk? Hive might be using copartitioning in that case: it is one of hive's strengths. 2016-06-09 7:28 GMT-07:00 Gourav Sengupta : > Hi Mich, > > does not Hive use map-reduce? I thought it to be so. And since I am > running

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread Stephen Boesch
How many workers (/cpu cores) are assigned to this job? 2016-06-09 13:01 GMT-07:00 SRK : > Hi, > > How to insert data into 2000 partitions(directories) of ORC/parquet at a > time using Spark SQL? It seems to be not performant when I try to insert > 2000 directories of

Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Presently only the mllib version has the one-vs-all approach for multinomial support. The ml version with ElasticNet support only allows binary regression. With feature parity of ml vs mllib having been stated as an objective for 2.0.0 - is there a projected availability of the multinomial

Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Followup: just encountered the "OneVsRest" classifier in ml.classsification: I will look into using it with the binary LogisticRegression as the provided classifier. 2016-05-28 9:06 GMT-07:00 Stephen Boesch <java...@gmail.com>: > > Presently only the mllib version has t
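
A minimal spark.ml sketch of that idea, assuming trainingDF and testDF are DataFrames with the usual label/features columns:

    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

    val lr = new LogisticRegression()
      .setMaxIter(50)
      .setElasticNetParam(0.5)   // elastic-net mixing on the binary base classifier

    val ovr   = new OneVsRest().setClassifier(lr)
    val model = ovr.fit(trainingDF)     // one binary model per class under the hood
    val preds = model.transform(testDF)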
