Re: Spark 2.0 with Hadoop 3.0?

2016-10-28 Thread Sean Owen
I don't think it works, but there is no Hadoop 3.0 right now either. As the version implies, it's going to be somewhat different API-wise. On Thu, Oct 27, 2016 at 11:04 PM adam kramer wrote: > Is the version of Spark built for Hadoop 2.7 and later only for 2.x > releases? > >

Re: Executor shutdown hook and initialization

2016-10-27 Thread Sean Owen
Init is easy -- initialize them in your singleton. Shutdown is harder; a shutdown hook is probably the only reliable way to go. Global state is not ideal in Spark. Consider initializing things like connections per partition, and open/close them with the lifecycle of a computation on a partition
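
A minimal sketch of the per-partition lifecycle described above (url and write are placeholder names, and a JDBC connection stands in for whatever resource you manage):

    import java.sql.DriverManager

    rdd.foreachPartition { rows =>
      val conn = DriverManager.getConnection(url)  // opened on the executor, once per partition
      try {
        rows.foreach(row => write(conn, row))      // write is a hypothetical helper
      } finally {
        conn.close()                               // closed when the partition finishes
      }
    }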

Re: HiveContext is Serialized?

2016-10-26 Thread Sean Owen
erty which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 26 October 2016 at 09:06, Sean Owen <so...@cloudera.com> wro

Re: HiveContext is Serialized?

2016-10-26 Thread Sean Owen
is email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 26 October 2016 at 06:43, Ajay Chander <itsche...@gmail.com> wrote: > > Sean, thank you for mak

Re: HiveContext is Serialized?

2016-10-25 Thread Sean Owen
This usage is fine, because you are only using the HiveContext locally on the driver. It's applied in a function that's used on a Scala collection. You can't use the HiveContext or SparkContext in a distributed operation. It has nothing to do with for loops. The fact that they're serializable
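
To illustrate the distinction (hiveContext and the table names are assumed to exist):

    // Fine: the context is only touched on the driver, inside an ordinary Scala loop
    val tables = Seq("t1", "t2")
    tables.foreach { t =>
      hiveContext.sql(s"SELECT COUNT(*) FROM $t").show()
    }

    // Not fine: this drags the context into a distributed operation
    // rdd.map { x => hiveContext.sql("...") }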

Re: Spark 1.2

2016-10-25 Thread Sean Owen
archive.apache.org will always have all the releases: http://archive.apache.org/dist/spark/ On Tue, Oct 25, 2016 at 1:17 PM ayan guha wrote: > Just in case, anyone knows how I can download Spark 1.2? It is not showing > up in Spark download page drop down > > > -- > Best

Re: Generate random numbers from Normal Distribution with Specific Mean and Variance

2016-10-24 Thread Sean Owen
In the context of Spark, there are already things like RandomRDD and SQL randn() to generate random standard normal variables. If you want to do it directly, Commons Math is a good choice in the JVM, among others. Once you have a standard normal, just multiply by the stdev and add the mean to
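
A sketch of the SQL route (mean 10.0 and stdev 2.5 are arbitrary examples):

    import org.apache.spark.sql.functions.randn

    // randn() yields standard normals; rescale to the desired mean and stdev
    val df = spark.range(100000).select((randn(42) * 2.5 + 10.0).as("x"))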

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Sean Owen
I believe it will be too late to set it there, and these are JVM flags, not app or Spark flags. See spark.driver.extraJavaOptions and likewise for the executor. On Mon, Oct 24, 2016 at 4:04 PM Pietro Pugni wrote: > Thank you! > > I tried again setting locale options in
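
A config sketch of what that might look like (my_app.py and the locale flags are illustrative):

    spark-submit \
      --conf "spark.driver.extraJavaOptions=-Duser.language=en -Duser.country=US" \
      --conf "spark.executor.extraJavaOptions=-Duser.language=en -Duser.country=US" \
      my_app.py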

Re: reading info from spark 2.0 application UI

2016-10-24 Thread Sean Owen
ach type of machine. Maybe I could go > beyond the limitation in the cluster. I just want to make sure I understand > correctly that when allocating vcores, it means vcores not the threads. > > Thanks a lot. > > Best > > > > On Mon, Oct 24, 2016 at 4:55 PM, Sean Owen <

Re: reading info from spark 2.0 application UI

2016-10-24 Thread Sean Owen
If you're really sure that 4 executors are on 1 machine, then it means your resource manager allowed it. What are you using, YARN? check that you really are limited to 40 cores per machine in the YARN config. On Mon, Oct 24, 2016 at 3:33 PM TheGeorge1918 . wrote: > Hi

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Sean Owen
k into this too > within coming few days.. > > 2016-10-24 21:32 GMT+09:00 Sean Owen <so...@cloudera.com>: > > I actually think this is a general problem with usage of DateFormat and > SimpleDateFormat across the code, in that it relies on the default locale > of the

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Sean Owen
I actually think this is a general problem with usage of DateFormat and SimpleDateFormat across the code, in that it relies on the default locale of the JVM. I believe this needs to, at least, default consistently to Locale.US so that behavior is consistent; otherwise it's possible that parsing
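
A small sketch of pinning the locale when parsing such dates:

    import java.text.SimpleDateFormat
    import java.util.Locale

    // "MMM" parses month names, which depend on the JVM's default locale;
    // Locale.US makes "Dec" parse the same everywhere
    val fmt = new SimpleDateFormat("yyyyMMMdd", Locale.US)
    val date = fmt.parse("1989Dec31")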

Re: Spark Streaming 2 Kafka 0.10 Integration for Aggregating Data

2016-10-18 Thread Sean Owen
Try adding the spark-streaming_2.11 artifact as a dependency too. You will be directly depending on it. On Tue, Oct 18, 2016 at 2:16 PM Furkan KAMACI wrote: > Hi, > > I have a search application and want to monitor queries per second for it. > I have Kafka at my backend
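
In sbt terms the extra dependency might look like this (2.0.1 is an example; match your Spark version):

    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.0.1" % "provided"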

Re: Couchbase-Spark 2.0.0

2016-10-17 Thread Sean Owen
You're now asking about couchbase code, so this isn't the best place to ask. Head to couchbase forums. On Mon, Oct 17, 2016 at 10:14 AM Devi P.V wrote: > Hi, > I tried with the following code > > import com.couchbase.spark._ > val conf = new SparkConf() >

Re: Question about the offiicial binary Spark 2 package

2016-10-17 Thread Sean Owen
You can take the "with user-provided Hadoop" binary from the download page, and yes that should mean it does not drag in a Hive dependency of its own. On Mon, Oct 17, 2016 at 7:08 AM Xi Shen wrote: > Hi, > > I want to configure my Hive to use Spark 2 as its engine.

Re: Possible memory leak after closing spark context in v2.0.1

2016-10-17 Thread Sean Owen
Did you unpersist the broadcast objects? On Mon, Oct 17, 2016 at 10:02 AM lev wrote: > Hello, > > I'm in the process of migrating my application to spark 2.0.1, > And I think there is some memory leaks related to Broadcast joins. > > the application has many unit tests, > and
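
A sketch of explicitly releasing a broadcast (lookupTable is a stand-in for the broadcast data):

    val bc = sc.broadcast(lookupTable)

    // ... run the jobs that read bc.value ...

    bc.unpersist()  // drop executor-side copies
    bc.destroy()    // once fully done, release the driver-side copy too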

Re: Resizing Image with Scrimage in Spark

2016-10-17 Thread Sean Owen
It pretty much means what it says. Objects you send across machines must be serializable, and the object from the library is not. You can write a wrapper object that is serializable and knows how to serialize it. Or ask the library dev to consider making this object serializable. On Mon, Oct 17,
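
One common variant of the wrapper idea -- recreating the object on each JVM rather than actually serializing it -- might look like this (SerializableWrapper and the usage line are made up for illustration):

    // Only the (serializable) factory closure is shipped; the object itself
    // is marked @transient and rebuilt lazily on whichever JVM touches it
    class SerializableWrapper[T](make: () => T) extends Serializable {
      @transient lazy val value: T = make()
    }

    // usage sketch: val w = new SerializableWrapper(() => new NonSerializableResizer(...))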

Re: 回复:Spark-submit Problems

2016-10-16 Thread Sean Owen
Is it just a typo in the email or are you missing a space after your --master argument? The logs here actually don't say much but "something went wrong". It seems fairly low-level, like the gateway process failed or didn't start, rather than a problem with the program. It's hard to say more

Re: OOM when running Spark SQL by PySpark on Java 8

2016-10-13 Thread Sean Owen
You can specify it; it just doesn't do anything but cause a warning in Java 8. It won't work in general to have such a tiny PermGen. If it's working it means you're on Java 8 because it's ignored. You should set MaxPermSize if anything, not PermSize. However the error indicates you are not using
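
On Java 7 the fix would look something like this (512m is an arbitrary example):

    spark-submit --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=512m" ...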

Re: OOM when running Spark SQL by PySpark on Java 8

2016-10-13 Thread Sean Owen
The error doesn't say you're out of memory, but says you're out of PermGen. If you see this, you aren't running Java 8 AFAIK, because 8 has no PermGen. But if you're running Java 7, and you go investigate what this error means, you'll find you need to increase PermGen. This is mentioned in the

Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread Sean Owen
I don't believe that's been released yet. It looks like it was merged into branches about a week ago. You're looking at unreleased docs too - have a look at http://spark.apache.org/docs/latest/ for the latest released docs. On Thu, Oct 13, 2016 at 9:24 AM JayKay

Re: Linear Regression Error

2016-10-12 Thread Sean Owen
See https://issues.apache.org/jira/browse/SPARK-17588 On Wed, Oct 12, 2016 at 9:07 PM Meeraj Kunnumpurath < mee...@servicesymphony.com> wrote: > If I drop the last feature on the third model, the error seems to go away. > > On Wed, Oct 12, 2016 at 11:52 PM, Meeraj Kunnumpurath < >

Re: mllib model in production web API

2016-10-11 Thread Sean Owen
I don't believe it will ever scale to spin up a whole distributed job to serve one request. You can possibly look at the bits in mllib-local. You might do well to export as something like PMML either with Spark's export or JPMML and then load it into a web container and score it, without Spark
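
A sketch of the PMML route in the RDD-based API (data is an assumed RDD[Vector]; k-means is just one of the models mixing in PMMLExportable):

    import org.apache.spark.mllib.clustering.KMeans

    val model = KMeans.train(data, k = 10, maxIterations = 20)
    model.toPMML("/tmp/kmeans.pmml")  // load this into a web container and score without Spark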

Re: Pls assist: Spark 2.0 build failure on Ubuntu 16.06

2016-10-01 Thread Sean Owen
"Compile failed via zinc server" Try shutting down zinc. Something's funny about your compile server. It's not required anyway. On Sat, Oct 1, 2016 at 3:24 PM, Marco Mistroni wrote: > Hi guys > sorry to annoy you on this but i am getting nowhere. So far i have tried to >

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-29 Thread Sean Owen
No, I think that's what dependencyManagement (or equivalent) is definitely for. On Thu, Sep 29, 2016 at 5:37 AM, Olivier Girardot wrote: > I know that the code itself would not be the same, but it would be useful to > at least have the pom/build.sbt transitive
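
For example, something like this in the application's pom.xml (2.6.5 is an illustrative Hadoop version):

    <dependencyManagement>
      <dependencies>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
          <version>2.6.5</version>
        </dependency>
      </dependencies>
    </dependencyManagement>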

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-28 Thread Sean Owen
with the > ones defined in the pom profile > > > > On Thu, Sep 22, 2016 11:17 AM, Sean Owen so...@cloudera.com wrote: > >> There can be just one published version of the Spark artifacts and they >> have to depend on something, though in truth they'd be binary-compatible >

Re: Large-scale matrix inverse in Spark

2016-09-27 Thread Sean Owen
I don't recall any code in Spark that computes a matrix inverse. There is code that solves linear systems Ax = b with a decomposition. For example from looking at the code recently, I think the regression implementation actually solves AtAx = Atb using a Cholesky decomposition. But, A = n x k,
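
In symbols, with A of size n x k (n examples, k features, n >> k):

    A^T A x = A^T b

where A^T A is only k x k, so a Cholesky factorization is cheap and no large matrix is ever inverted.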

Re: MLib Documentation Update Needed

2016-09-26 Thread Sean Owen
Yes I think that footnote could be a lot more prominent, or pulled up right under the table. I also think it would be fine to present the {0,1} formulation. It's actually more recognizable, I think, for log-loss in that form. It's probably less recognizable for hinge loss, but, consistency is

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Sean Owen
I don't think I'd enable swap on a cluster. You'd rather processes fail than grind everything to a halt. You'd buy more memory or optimize memory before trading it for I/O. On Thu, Sep 22, 2016 at 6:29 PM, Michael Segel wrote: > Ok… gotcha… wasn’t sure that YARN just

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Sean Owen
wrote: > Thanks for the response Sean. > > But how does YARN know about the off-heap memory usage? > That’s the piece that I’m missing. > > Thx again, > > -Mike > >> On Sep 21, 2016, at 10:09 PM, Sean Owen <so...@cloudera.com> wrote: >> >> No, Xmx o

Re: Open source Spark based projects

2016-09-22 Thread Sean Owen
https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects and maybe related ... https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark On Thu, Sep 22, 2016 at 11:15 AM, tahirhn wrote: > I am planning to write a thesis on certain aspects (i.e testing,

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-22 Thread Sean Owen
There can be just one published version of the Spark artifacts and they have to depend on something, though in truth they'd be binary-compatible with anything 2.2+. So you merely manage the dependency versions up to the desired version in your <dependencyManagement>. On Thu, Sep 22, 2016 at 7:05 AM, Olivier Girardot <

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-21 Thread Sean Owen
No, Xmx only controls the maximum size of on-heap allocated memory. The JVM doesn't manage/limit off-heap (how could it? it doesn't know when it can be released). The answer is that YARN will kill the process because it's using more memory than it asked for. A JVM is always going to use a little
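
The usual knob for that headroom on Spark-on-YARN at the time (not mentioned in the snippet above, and the value is just an example) is the overhead setting:

    --conf spark.yarn.executor.memoryOverhead=1024   (MiB of off-heap headroom per executor)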

Re: Israel Spark Meetup

2016-09-21 Thread Sean Owen
Done. On Wed, Sep 21, 2016 at 5:53 AM, Romi Kuntsman wrote: > Hello, > Please add a link in Spark Community page > (https://spark.apache.org/community.html) > To Israel Spark Meetup (https://www.meetup.com/israel-spark-users/) > We're an active meetup group, unifying the

Re: SPARK-10835 in 2.0

2016-09-20 Thread Sean Owen
>> Thanks Sean. >> >> On Sep 20, 2016 7:45 AM, "Sean Owen" <so...@cloudera.com> wrote: >>> >>> Ah, I think that this was supposed to be changed with SPARK-9062. Let >>> me see about reopening 10835 and addressing it. >>&

Re: SPARK-10835 in 2.0

2016-09-20 Thread Sean Owen
Ah, I think that this was supposed to be changed with SPARK-9062. Let me see about reopening 10835 and addressing it. On Tue, Sep 20, 2016 at 3:24 PM, janardhan shetty wrote: > Is this a bug? > > On Sep 19, 2016 10:10 PM, "janardhan shetty" wrote:

Re: Java Compatibity Problems when we install rJava

2016-09-19 Thread Sean Owen
This isn't a Spark question, so I don't think this is the right place. It shows that compilation of rJava failed for lack of some other shared libraries (not Java-related). I think you'd have to get those packages installed locally too. If it ends up being Anaconda specific, you should try

Re: off heap to alluxio/tachyon in Spark 2

2016-09-19 Thread Sean Owen
It backed the "OFF_HEAP" storage level for RDDs. That's not quite the same thing that off-heap Tungsten allocation refers to. It's also worth pointing out that things like HDFS also can put data into memory already. On Mon, Sep 19, 2016 at 7:48 PM, Richard Catlin

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-19 Thread Sean Owen
Yes, relevance is always 1. The label is not a relevance score so don't think it's valid to use it as such. On Mon, Sep 19, 2016 at 4:42 AM, Jong Wook Kim wrote: > Hi, > > I'm trying to evaluate a recommendation model, and found that Spark and > Rival give different results,

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Sean Owen
Alluxio isn't a database though; it's storage. I may be still harping on the wrong solution for you, but as we discussed offline, that's also what Impala, Drill et al are for. Sorry if this was mentioned before but Ignite is what GridGain became, if that helps. On Sat, Sep 17, 2016 at 11:00 PM,

Re: NoSuchField Error : INSTANCE specify user defined httpclient jar

2016-09-18 Thread Sean Owen
NoSuchFieldError in an HTTP client class? This almost always means you have conflicting versions of an unshaded dependency on your classpath, and in this case it could be httpclient. You can often work around this with the userClassPathFirst options for driver and executor. On Sun, Sep 18, 2016 at
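
The workaround mentioned, as spark-submit flags (these are experimental options, so test carefully):

    spark-submit \
      --conf spark.driver.userClassPathFirst=true \
      --conf spark.executor.userClassPathFirst=true \
      ...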

Re: How PolynomialExpansion works

2016-09-16 Thread Sean Owen
The result includes, essentially, all the terms in (x+y) and (x+y)^2, and so on up if you chose a higher power. It is not just the second-degree terms. On Fri, Sep 16, 2016 at 7:43 PM, Nirav Patel wrote: > Doc says: > > Take a 2-variable feature vector as an example: (x,
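
A small illustration in the 2.0 ml API; the output ordering follows the documented example:

    import org.apache.spark.ml.feature.PolynomialExpansion
    import org.apache.spark.ml.linalg.Vectors

    val df = spark.createDataFrame(Seq(Tuple1(Vectors.dense(2.0, 3.0)))).toDF("features")
    val poly = new PolynomialExpansion()
      .setInputCol("features").setOutputCol("expanded").setDegree(2)
    // For (x, y) = (2, 3): [x, x^2, y, x*y, y^2] = [2.0, 4.0, 3.0, 6.0, 9.0]
    poly.transform(df).show(false)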

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Sean Owen
code given with spark > to run ALS on movie lens dataset. I did not change anything in the code. > However I am running this example on Netflix dataset (1.5 gb) > > Thanks, > Roshani > > > On Friday, September 16, 2016, Sean Owen <so...@cloudera.com> wrote: >> >

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Sean Owen
You may have to decrease the checkpoint interval to say 5 if you're getting StackOverflowError. You may have a particularly deep lineage being created during iterations. No space left on device means you don't have enough local disk to accommodate the big shuffles in some stage. You can add more
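
A sketch of the checkpointing remedy on the ml ALS (the checkpoint path is a placeholder):

    import org.apache.spark.ml.recommendation.ALS

    spark.sparkContext.setCheckpointDir("/tmp/als-checkpoints")
    val als = new ALS()
      .setCheckpointInterval(5)  // default is 10; lower it to truncate lineage sooner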

Re: countApprox

2016-09-16 Thread Sean Owen
countApprox gives the best answer within some timeout. Is it possible that 1ms is more than enough to count this exactly? then the confidence wouldn't matter. Although that seems way too fast, you're counting ranges whose values don't actually matter, and maybe the Python side is smart enough to
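
For reference, a usage sketch of countApprox:

    // wait at most 1000 ms for an estimate at 95% confidence
    val partial = rdd.countApprox(timeout = 1000L, confidence = 0.95)
    val estimate = partial.initialValue  // a BoundedDouble
    println(s"~${estimate.mean} in [${estimate.low}, ${estimate.high}]")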

Re: Best way to present data collected by Flume through Spark

2016-09-16 Thread Sean Owen
Why Hive, and why precompute data at 15-minute latency? There are several ways to query the source data directly with no extra step or latency. Even Spark SQL is real-time-ish for queries on the source data, and Impala (or heck Drill etc.) are. On Thu, Sep 15, 2016 at 10:56 PM, Mich

Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Sean Owen
If your core requirement is ad-hoc real-time queries over the data, then the standard Hadoop-centric answer would be: Ingest via Kafka, maybe using Flume, or possibly Spark Streaming, to read and land the data, in... Parquet on HDFS or possibly Kudu, and Impala to query >> On 15 September 2016

Re: Please assist: migrating RandomForestExample from MLLib to ML

2016-09-14 Thread Sean Owen
If it helps, I've already updated that code for the 2nd edition, which will be based on ~Spark 2.1: https://github.com/sryza/aas/blob/master/ch04-rdf/src/main/scala/com/cloudera/datascience/rdf/RunRDF.scala#L220 This should be an equivalent working example that deals with categoricals via

Re: RMSE in ALS

2016-09-14 Thread Sean Owen
, Pasquinell Urbani <pasquinell.urb...@exalitica.com> wrote: > The implicit rankings are the output of Tf-idf. I.e.: > Each_ranking= frecuency of an ítem * log(amount of total customers/amount of > customers buying the ítem) > > > El 14 sept. 2016 17:14, "Sean Owen" <so.

Re: RMSE in ALS

2016-09-14 Thread Sean Owen
enerated by TF-IDF), can this affect > the error? (I'm currently using trainImplicit in ALS, spark 1.6.2) > > Thank you. > > > > 2016-09-14 16:49 GMT-03:00 Sean Owen <so...@cloudera.com>: > >> There is no way to answer this without knowing what your inpu

Re: RMSE in ALS

2016-09-14 Thread Sean Owen
There is no way to answer this without knowing what your inputs are like. If they're on the scale of thousands, that's small (good). If they're on the scale of 1-5, that's extremely poor. What's RMS vs RMSE? On Wed, Sep 14, 2016 at 8:33 PM, Pasquinell Urbani

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Sean Owen
is defined as > abs( C1/C1 - C2/C1 ) + abs (D1/D1 - D2/D1) > One cannot do > abs( (C1/C1 + D1/D1) - (C2/C1 + D2/ D1) ) > > > Any further tips? > > Best, > Rex > > > > On Tue, Sep 13, 2016 at 11:09 AM, Sean Owen <so...@cloudera.com> wrote: >>

Re: Character encoding corruption in Spark JDBC connector

2016-09-13 Thread Sean Owen
Based on your description, this isn't a problem in Spark. It means your JDBC connector isn't interpreting bytes from the database according to the encoding in which they were written. It could be Latin1, sure. But if "new String(ResultSet.getBytes())" works, it's only because your platform's
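
A sketch of decoding explicitly (resultSet, the column name, and Latin-1 are assumptions):

    import java.nio.charset.StandardCharsets

    // decode with the encoding the column was actually written in, rather than
    // the platform default that new String(bytes) silently uses
    val s = new String(resultSet.getBytes("name"), StandardCharsets.ISO_8859_1)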

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Sean Owen
e/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> *Disclaimer:* Use it at your own risk. Any and a

Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Sean Owen
The key is really to specify the distance metric that defines "closeness" for you. You have features that aren't on the same scale, and some that aren't continuous. You might look to clustering for ideas here, though mostly you just want to normalize the scale of dimensions to make them

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Sean Owen
But you're in the shell there, which already has a SparkContext for you as sc. On Tue, Sep 13, 2016 at 6:49 PM, Kevin Burton wrote: > I'm rather confused here as to what to do about creating a new > SparkContext. > > Spark 2.0 prevents it... (exception included below) > >

Re: I noticed LinearRegression sometimes produces negative R^2 values

2016-09-07 Thread Sean Owen
be > TRUE right? > > On Tue, Sep 6, 2016 at 1:38 PM Sean Owen <so...@cloudera.com> wrote: >> >> Are you not fitting an intercept / regressing through the origin? with >> that constraint it's no longer true that R^2 is necessarily >> nonnegative. It basically me

Re: I noticed LinearRegression sometimes produces negative R^2 values

2016-09-06 Thread Sean Owen
Are you not fitting an intercept / regressing through the origin? with that constraint it's no longer true that R^2 is necessarily nonnegative. It basically means that the errors are even bigger than what you'd get by predicting the data's mean value as a constant model. On Tue, Sep 6, 2016 at
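
For reference, the definition behind that statement:

    R^2 = 1 - SS_res / SS_tot,   where SS_tot = sum_i (y_i - mean(y))^2

With no intercept, SS_res can exceed SS_tot, so R^2 goes negative.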

Re: Why there is no top method in dataset api

2016-09-05 Thread Sean Owen
ze of dataset is (unsurprisingly) big. > > To be honest I do not really understand what do you mean by b). Since > DataFrame is now only an alias for Dataset[Row] what do you mean by > "DataFrame-like counterpart"? > > Thanks > > On Thu, Sep 1, 2016 at 2:31 PM, Sea

Re: How to detect when a JavaSparkContext gets stopped

2016-09-05 Thread Sean Owen
You can look into the SparkListener interface to get some of those messages. Losing the master though is pretty fatal to all apps. On Mon, Sep 5, 2016 at 7:30 AM, Hough, Stephen C wrote: > I have a long running application, configured to be HA, whereby only the >
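
A minimal sketch of listening for shutdown (jsc is an assumed JavaSparkContext; .sc exposes the underlying SparkContext):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

    jsc.sc.addSparkListener(new SparkListener {
      override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
        // react to the context/application shutting down
      }
    })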

Re: BinaryClassificationMetrics - get raw tp/fp/tn/fn stats per threshold?

2016-09-02 Thread Sean Owen
Given recall by threshold, you can compute true positive count per threshold by just multiplying through by the count of elements where label = 1. From that you can get false negatives by subtracting from that same count. Given precision by threshold, and true positives count by threshold, you
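
A sketch of that derivation (scoreAndLabels is an assumed RDD[(Double, Double)] of score and label):

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val numPos = scoreAndLabels.filter(_._2 == 1.0).count()

    val tp = metrics.recallByThreshold.mapValues(_ * numPos)  // recall = tp / numPos
    val fn = tp.mapValues(numPos - _)                         // fn = positives - tp
    val fp = metrics.precisionByThreshold.join(tp)            // precision = tp / (tp + fp)
      .mapValues { case (p, t) => t / p - t }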

Re: PySpark: preference for Python 2.7 or Python 3.5?

2016-09-02 Thread Sean Owen
Spark should work fine with Python 3. I'm not a Python person, but all else equal I'd use 3.5 too. I assume the issue could be libraries you want that don't support Python 3. I don't think that changes with CDH. It includes a version of Anaconda from Continuum, but that lays down Python 2.7.11. I

Re: Difference between Data set and Data Frame in Spark 2

2016-09-01 Thread Sean Owen
On Thu, Sep 1, 2016 at 4:56 PM, Mich Talebzadeh wrote: > Data Frame built on top of RDD to create as tabular format that we all love > to make the original build easily usable (say SQL like queries, column > headings etc). The drawback is it restricts you with what you

Re: Difference between Data set and Data Frame in Spark 2

2016-09-01 Thread Sean Owen
Here's my paraphrase: Datasets are really the new RDDs. They have a similar nature (container of strongly-typed objects) but bring some optimizations via Encoders for common types. DataFrames are different from RDDs and Datasets and do not replace and are not replaced by them. They're
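
A tiny illustration of the two (people.json and the schema are placeholders):

    case class Person(name: String, age: Long)
    import spark.implicits._

    val ds = spark.read.json("people.json").as[Person]  // typed Dataset[Person], RDD-like
    val df = ds.toDF()                                  // untyped DataFrame = Dataset[Row]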

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Sean Owen
Yeah there's a method to predict one Vector in the .mllib API but not the newer one. You could possibly hack your way into calling it anyway, or just clone the logic. On Thu, Sep 1, 2016 at 2:37 PM, Nick Pentreath wrote: > Right now you are correct that Spark ML APIs do
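
A sketch of the single-vector path in the RDD-based API (model is an assumed mllib model, e.g. a LogisticRegressionModel):

    import org.apache.spark.mllib.linalg.Vectors

    // scores one example synchronously on the caller's thread; no job is launched
    val score = model.predict(Vectors.dense(0.1, 0.2, 0.3))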

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Sean Owen
use it on a > row by row basis? > > Thanks for your inputs. > > On Thu, Sep 1, 2016 at 6:15 PM, Sean Owen <so...@cloudera.com> wrote: >> >> If you're trying to score a single example by way of an RDD or >> Dataset, then no it will never be that fast. It's a whole dis

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Sean Owen
If you're trying to score a single example by way of an RDD or Dataset, then no it will never be that fast. It's a whole distributed operation, and while you might manage low latency for one job at a time, consider what will happen when hundreds of them are running at once. It's just huge overkill

Re: Why there is no top method in dataset api

2016-09-01 Thread Sean Owen
You can always call .rdd.top(n) of course. Although it's slightly clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an easier way. I don't think there's a strong reason other than it wasn't worth it to write this and many other utility wrappers that a) already exist on the

Re: Spark 2.0.0 - Java vs Scala performance difference

2016-09-01 Thread Sean Owen
I can't think of a situation where it would be materially different. Both are using the JVM-based APIs directly. Here and there there's a tiny bit of overhead in using the Java APIs because something is translated from a Java-style object to a Scala-style object, but this is generally trivial. On

Re: difference between package and jar Option in Spark

2016-09-01 Thread Sean Owen
--jars includes a local JAR file in the application's classpath. --packages references the Maven coordinates of a dependency, retrieves all of its JAR files, and includes them in the app classpath. On Thu, Sep 1, 2016 at 10:24 AM, Divya Gehlot wrote: > Hi, >
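
Side by side (the jar path and coordinates are illustrative):

    spark-submit --jars /path/to/local-lib.jar ...
    spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.1 ...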

Re: Model abstract class in spark ml

2016-08-31 Thread Sean Owen
Weird, I recompiled Spark with a similar change to Model and it seemed to work but maybe I missed a step in there. On Wed, Aug 31, 2016 at 6:33 AM, Mohit Jaggi wrote: > I think I figured it out. There is indeed "something deeper in Scala” :-) > > abstract class A { > def

Re: Model abstract class in spark ml

2016-08-30 Thread Sean Owen
I think it's imitating, for example, how Enum is declared in Java: abstract class Enum<E extends Enum<E>>. This is done so that Enum can refer to the actual type of the derived enum class when declaring things like public final int compareTo(E o) to implement Comparable. The type is redundant in a sense, because
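
A sketch of the same self-referential ("F-bounded") pattern in Scala, mirroring spark.ml's Model[M <: Model[M]] (class names are made up):

    abstract class Model[M <: Model[M]] {
      def copy(): M  // subclasses get back their own concrete type, not just Model
    }

    class MyModel extends Model[MyModel] {
      override def copy(): MyModel = new MyModel
    }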

Re: Coding in the Spark ml "ecosystem" why is everything private?!

2016-08-29 Thread Sean Owen
If something isn't public, then it could change across even maintenance releases. Although you can indeed still access it in some cases by writing code in the same package, you're taking some risk that it will stop working across releases. If it's not public, the message is that you should build

Re: How can we connect RDD from previous job to next job

2016-08-29 Thread Sean Owen
oint. > > Also which option would be better, store the output of RDD to a persistent > storage, or store the new RDD of that ouput itself using checkpoint. > > Thanks > Sachin > > > > > On Mon, Aug 29, 2016 at 1:39 PM, Sean Owen <so...@cloudera.com>

Re: How can we connect RDD from previous job to next job

2016-08-29 Thread Sean Owen
You just save the data in the RDD in whatever form you want to whatever persistent storage you want, and then re-read it from another job. This could be Parquet format on HDFS for example. Parquet is just a common file format. There is no need to keep the job running just to keep an RDD alive. On
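
A minimal sketch (df and the path are stand-ins):

    // job 1: persist the result
    df.write.parquet("hdfs:///data/results")

    // job 2, possibly a different application entirely:
    val df2 = spark.read.parquet("hdfs:///data/results")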

Re: Spark StringType could hold how many characters ?

2016-08-28 Thread Sean Owen
No, it is just being truncated for display as the ... implies. Pass truncate=false to the show command. On Sun, Aug 28, 2016, 15:24 Kevin Tran wrote: > Hi, > I wrote to parquet file as following: > > ++ > |word| > ++ >
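
In the Scala API that is the second argument to show (PySpark spells it truncate=False):

    df.show(20, false)  // print up to 20 rows without truncating long values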

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Sean Owen
Without a distributed storage system, your application can only create data on the driver and send it out to the workers, and collect data back from the workers. You can't read or write data in a distributed way. There are use cases for this, but pretty limited (unless you're running on 1

Re: Sqoop vs spark jdbc

2016-08-25 Thread Sean Owen
Sqoop is probably the more mature tool for the job. It also just does one thing. The argument for doing it in Spark would be wanting to integrate it with a larger workflow. I imagine Sqoop would be more efficient and flexible for just the task of ingest, including continuously pulling deltas which

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread Sean Owen
We're probably mixing up some semantics here. An RDD is indeed, really, just some bookkeeping that records how a certain result is computed. It is not the data itself. However we often talk about "persisting an RDD" which means "persisting the result of computing the RDD" in which case that

Re: Dataframe corrupted when sqlContext.read.json on a Gzipped file that contains more than one file

2016-08-21 Thread Sean Owen
You are attempting to read a tar file. That won't work. A compressed JSON file would. On Sun, Aug 21, 2016, 12:52 Chua Jie Sheng wrote: > Hi Spark user list! > > I have been encountering corrupted records when reading Gzipped files that > contains more than one file. > >

Re: Spark 2.0 regression when querying very wide data frames

2016-08-20 Thread Sean Owen
Yes, have a look through JIRA in cases like this. https://issues.apache.org/jira/browse/SPARK-16664 On Sat, Aug 20, 2016 at 1:57 AM, mhornbech wrote: > I did some extra digging. Running the query "select column1 from myTable" I > can reproduce the problem on a frame with a

Re: 2.0.1/2.1.x release dates

2016-08-18 Thread Sean Owen
Historically, minor releases happen every ~4 months, and maintenance releases are a bit ad hoc but come about a month after the minor release. It's up to the release manager to decide to do them but maybe realistic to expect 2.0.1 in early September. On Thu, Aug 18, 2016 at 10:35 AM, Adrian

Re: DataFrame use case

2016-08-16 Thread Sean Owen
I'd say that Datasets, not DataFrames, are the natural evolution of RDDs. DataFrames are for inherently tabular data, and most naturally manipulated by SQL-like operations. Datasets operate on programming language objects like RDDs. So, RDDs to DataFrames isn't quite apples-to-apples to begin

Re: Number of tasks on executors become negative after executor failures

2016-08-15 Thread Sean Owen
-dev (this is appropriate for user@) Probably https://issues.apache.org/jira/browse/SPARK-10141 or https://issues.apache.org/jira/browse/SPARK-11334 but those aren't resolved. Feel free to jump in. On Mon, Aug 15, 2016 at 8:13 PM, Rachana Srivastava < rachana.srivast...@markmonitor.com> wrote:

Re: spark ml : auc on extreme distributed data

2016-08-15 Thread Sean Owen
Class imbalance can be an issue for algorithms, but decision forests should in general cope reasonably well with imbalanced classes. By default, positive and negative classes are treated 'equally' however, and that may not reflect reality in some cases. Upsampling the under-represented case is a
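
A sketch of upsampling with the DataFrame API (a label column with 1.0 for the minority class is assumed):

    import spark.implicits._

    val minority = df.filter($"label" === 1.0)
    // draw ~9x with replacement and append, for roughly 10x the minority overall
    val balanced = df.union(minority.sample(withReplacement = true, fraction = 9.0))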

Re: Standardization with Sparse Vectors

2016-08-11 Thread Sean Owen
11, 2016 at 11:02 AM, Sean Owen <so...@cloudera.com> wrote: > No, that doesn't describe the change being discussed, since you've > copied the discussion about adding an 'offset'. That's orthogonal. > You're also suggesting making withMean=True the default, which we > don

Re: Standardization with Sparse Vectors

2016-08-11 Thread Sean Owen
seFeatures) > > Thanks, > Tobi > > > On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: >> >> Ah right, got it. As you say for storage it helps significantly, but for >> operations I suspect it puts one back in a "dense

Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
an optimization. On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentre...@gmail.com> wrote: > Sean by 'offset' do you mean basically subtracting the mean but only from > the non-zero elements in each row? > On Wed, 10 Aug 2016 at 19:02, Sean Owen <so...@cloudera.com> wrote: &g

Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
> standardization, as opposed to people thinking they are standardizing when > they actually are not. > > Can anyone confirm whether there is a jira already? > > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <so...@cloudera.com> wrote: >> >> Dense vs sparse is

Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
Dense vs sparse is just a question of representation, so doesn't make an operation on a vector more or less important as a result. You've identified the reason that subtracting the mean can be undesirable: a notionally billion-element sparse vector becomes too big to fit in memory at once. I know
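
The ml StandardScaler reflects exactly this trade-off; a sketch:

    import org.apache.spark.ml.feature.StandardScaler

    val scaler = new StandardScaler()
      .setInputCol("features").setOutputCol("scaled")
      .setWithMean(false)  // the default: skip mean subtraction, which would densify sparse input
      .setWithStd(true)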

Re: Machine learning question (suing spark)- removing redundant factors while doing clustering

2016-08-10 Thread Sean Owen
t I am confused based on the results above > and I am wondering what factors should be removed to get a meaningful result > (may be with 5% less accuracy) > > Will appreciate any help here. > > -Rohit > > On Tue, Aug 9, 2016 at 12:55 PM, Sean Owen <so...@cloudera.com>

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Sean Owen
Nightlies are built and made available in the ASF snapshot repo, from master. This is noted at the bottom of the downloads page, and at https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-NightlyBuilds . This hasn't changed in as long as I can recall.

Re: Machine learning question (suing spark)- removing redundant factors while doing clustering

2016-08-09 Thread Sean Owen
Chaddha <rohitchaddha1...@gmail.com> wrote: > I would rather have less features to make better inferences on the data > based on the smaller number of factors, > Any suggestions Sean ? > > On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen <so...@cloudera.com> wrote: > &

Re: FW: Have I done everything correctly when subscribing to Spark User List

2016-08-08 Thread Sean Owen
I also don't know what's going on with the "This post has NOT been accepted by the mailing list yet" message, because actually the messages always do post. In fact this has been sent to the list 4 times: https://www.mail-archive.com/search?l=user%40spark.apache.org=dueckm=0=0 On Mon, Aug 8, 2016

Re: Source format for Apache Spark logo

2016-08-08 Thread Sean Owen
In case the attachments don't come through, BTW those are indeed downloadable from the directory http://spark.apache.org/images/ On Mon, Aug 8, 2016 at 6:09 PM, Sivakumaran S wrote: > Found these from the spark.apache.org website. > > HTH, > > Sivakumaran S > > > > > > On

Re: Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-04 Thread Sean Owen
That message is a warning, not an error. It is just because you're cross-compiling with Java 8. If something failed, it was elsewhere. On Thu, Aug 4, 2016, 07:09 Richard Siebeling wrote: > Hi, > > spark 2.0 with mapr hadoop libraries was succesfully build using the > following

Re: 2.0.0 packages for twitter streaming, flume and other connectors

2016-08-03 Thread Sean Owen
You're looking for http://bahir.apache.org/ On Wed, Aug 3, 2016 at 8:40 PM, Kiran Chitturi wrote: > Hi, > > When Spark 2.0.0 is released, the 'spark-streaming-twitter' package and > several other packages are not released/published to maven central. It looks > like

Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Sean Owen
3:0.0" > I thought I need to follow the same numbering while creating vector too. > > thanks a bunch > > > On Thu, Aug 4, 2016 at 12:39 AM, Sean Owen <so...@cloudera.com> wrote: >> >> You mean "new int[] {0,1,2}" because vectors are 0-indexed. &

Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Sean Owen
model with a 3 dimension vector ? > I am not sre what is wrong in this approach. i am missing a point ? > > Tony > > On Wed, Aug 3, 2016 at 11:22 PM, Sean Owen <so...@cloudera.com> wrote: >> >> You declare that the vector has 3 dimensions, but then refer to its >&

Re: java.net.URISyntaxException: Relative path in absolute URI:

2016-08-03 Thread Sean Owen
file: "absolute directory" does not sound like a valid URI On Wed, Aug 3, 2016 at 11:05 AM, Flavio wrote: > Hello everyone, > > I am try to run a very easy example but unfortunately I am stuck on the > follow exception: > > Exception in thread "main"
