Setting the vote rate in a Random Forest in MLlib

2015-12-16 Thread Young, Matthew T
One of our data scientists is interested in using Spark to improve performance in some random forest binary classifications, but isn't getting good enough results from MLlib's random forest implementation, compared to R's randomForest library, with the available parameters. She suggested

RE: is repartition very cost

2015-12-08 Thread Young, Matthew T
Shuffling large amounts of data over the network is expensive, yes. The cost is lower if you are just using a single node where no networking needs to be involved to do the repartition (using Spark as a multithreading engine). In general you need to do performance testing to see if a

RE: capture video with spark streaming

2015-11-30 Thread Young, Matthew T
Unless it’s a network camera with the ability to request specific frame numbers for read, the answer is that you will just read from the camera like you normally would without Spark, inside of a foreachRDD(), and parallelize the result out for processing once you have it in a collection in the

RE: How can you sort wordcounts by counts in stateful_network_wordcount.py example

2015-11-12 Thread Young, Matthew T
You can use foreachRDD to get access to the batch API in streaming jobs.
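
The pattern suggested here — dropping from the streaming API into the batch RDD API inside foreachRDD and sorting there — can be sketched in plain Python, with an ordinary list of (word, count) pairs standing in for the batch RDD (in pyspark this would be something like `rdd.sortBy(lambda pair: -pair[1])`; all names here are illustrative):

```python
# Plain-Python stand-in for the sort you would do inside foreachRDD:
# take one batch's (word, count) pairs and order them by count, descending.
def sort_wordcounts(pairs):
    """Sort (word, count) pairs by count, highest first; ties broken by word."""
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))

counts = [("spark", 3), ("streaming", 1), ("kafka", 2)]
print(sort_wordcounts(counts))  # [('spark', 3), ('kafka', 2), ('streaming', 1)]
```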

RE: Very slow performance on very small record counts

2015-11-03 Thread Young, Matthew T
stream API. From: Cody Koeninger [mailto:c...@koeninger.org] Sent: Saturday, October 31, 2015 2:00 PM To: YOUNG, MATTHEW, T (Intel Corp) <matthew.t.yo...@intel.com> Subject: Re: Very slow performance on very small record counts Have you looked at jstack or the thread dump from the Spark UI

RE: Pulling data from a secured SQL database

2015-10-30 Thread Young, Matthew T
> Can the driver pull data and then distribute execution? Yes, as long as your dataset will fit in the driver's memory. Execute arbitrary code to read the data on the driver as you normally would if you were writing a single-node application. Once you have the data in a collection on the
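
The read-then-distribute shape described above can be sketched in plain Python, with a thread pool standing in for `sc.parallelize` plus executor work, and a stub list standing in for the rows a JDBC read would pull on the driver (no real database is involved; all names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for rows pulled single-node style on the driver from a
# secured SQL database; in a real job this would be a JDBC read.
rows = [{"id": i, "value": i * 10} for i in range(8)]

def process(row):
    # Per-record work that Spark would run on executors after
    # sc.parallelize(rows); here a thread pool plays that role.
    return row["value"] + 1

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, rows))

print(results)  # [1, 11, 21, 31, 41, 51, 61, 71]
```

Note the same limitation the reply states: this only works when the whole dataset fits in the driver's memory, since the driver materializes every row before distributing.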

RE: save DF to JDBC

2015-10-05 Thread Young, Matthew T
I’ve gotten it to work with SQL Server (with limitations; it’s buggy and doesn’t work with some types/operations). https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html is the Java API you are looking for; the JDBC method lets you write to JDBC databases. I

Reasonable performance numbers?

2015-09-24 Thread Young, Matthew T
Hello, I am doing performance testing with Spark Streaming. I want to know if the throughput numbers I am encountering are reasonable for the power of my cluster and Spark's performance characteristics. My job has the following processing steps: 1. Read 600 Byte JSON strings from a 7

Getting number of physical machines in Spark

2015-08-27 Thread Young, Matthew T
What's the canonical way to find out the number of physical machines in a cluster at runtime in Spark? I believe SparkContext.defaultParallelism will give me the number of cores, but I'm interested in the number of NICs. I'm writing a Spark streaming application to ingest from Kafka with the

RE: How to read a Json file with a specific format?

2015-07-29 Thread Young, Matthew T
Can you give an example with my extract? Mélanie Gallois 2015-07-29 16:55 GMT+02:00 Young, Matthew T matthew.t.yo

RE: IP2Location within spark jobs

2015-07-29 Thread Young, Matthew T
You can put the database files in a central location accessible to all the workers and build the GeoIP object once per partition when you do a mapPartitions across your dataset, loading from the central location.
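
The point of the mapPartitions pattern is that the expensive lookup object is constructed once per partition, not once per record. A minimal plain-Python sketch, with `FakeGeoIP` as a hypothetical stand-in for a real GeoIP database handle and plain lists standing in for partitions:

```python
# Sketch of the mapPartitions pattern: build one expensive lookup object
# per partition, then reuse it for every record in that partition.
class FakeGeoIP:
    instances = 0  # count how many times the "database" was opened

    def __init__(self):
        FakeGeoIP.instances += 1
        self.table = {"1.1.1.1": "AU", "8.8.8.8": "US"}

    def lookup(self, ip):
        return self.table.get(ip, "unknown")

def map_partition(records):
    geo = FakeGeoIP()  # one construction per partition, not per record
    return [geo.lookup(ip) for ip in records]

partitions = [["1.1.1.1", "8.8.8.8"], ["8.8.8.8", "2.2.2.2"]]
results = [map_partition(p) for p in partitions]
print(results, FakeGeoIP.instances)  # two partitions -> two constructions
```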

RE: How to read a Json file with a specific format?

2015-07-29 Thread Young, Matthew T
The built-in Spark JSON functionality cannot read normal JSON arrays. The format it expects is a bunch of individual JSON objects without any outer array syntax, with one complete JSON object per line of the input file. AFAIK your options are to read the JSON in the driver and parallelize it
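
The difference between the two formats can be shown with the standard library alone: a normal JSON array on one side, and the one-complete-object-per-line layout Spark's reader expects on the other. Converting in the driver before parallelizing is one of the options the reply mentions:

```python
import json

# A normal JSON array, which Spark's built-in JSON reader (circa 1.x)
# will not parse as one row per element:
array_text = '[{"a": 1}, {"a": 2}]'

# Convert it to the format Spark expects: one complete JSON object
# per line of the input file, no outer array syntax.
records = json.loads(array_text)
json_lines = "\n".join(json.dumps(r) for r in records)

print(json_lines)
```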

RE: Issue with column named count in a DataFrame

2015-07-23 Thread Young, Matthew T
:26 PM To: Young, Matthew T Cc: user@spark.apache.org Subject: Re: Issue with column named count in a DataFrame Additionally have you tried enclosing count in `backticks`? On Wed, Jul 22, 2015 at 4:25 PM, Michael Armbrust mich...@databricks.com wrote: I believe

Issue with column named count in a DataFrame

2015-07-22 Thread Young, Matthew T
I'm trying to do some simple counting and aggregation in an IPython notebook with Spark 1.4.0 and I have encountered behavior that looks like a bug. When I try to filter rows out of an RDD with a column name of count I get a large error message. I would just avoid naming things count, except
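
The fix suggested in the replies is to quote the column name in backticks so `count` is treated as an identifier rather than the aggregate function. SQLite happens to accept the same backtick quoting, so it can stand in here for a runnable illustration of the idea (the table and data are made up):

```python
import sqlite3

# Quoting a column literally named "count" in backticks, the same fix
# suggested for Spark SQL; SQLite accepts backtick-quoted identifiers too.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, `count` INTEGER)")
conn.executemany("INSERT INTO words VALUES (?, ?)", [("a", 3), ("b", 1)])

rows = conn.execute(
    "SELECT word, `count` FROM words WHERE `count` > 2"
).fetchall()
print(rows)  # [('a', 3)]
```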

RE: Would driver shutdown cause app dead?

2015-07-21 Thread Young, Matthew T
ZhuGe, If you run your program in the cluster deploy-mode you get resiliency against driver failure, though there are some steps you have to take in how you write your streaming job to allow for transparent resume. Netflix did a nice writeup of this resiliency

RE: Spark and SQL Server

2015-07-20 Thread Young, Matthew T
Liu [dav...@databricks.com] Sent: Monday, July 20, 2015 9:08 AM To: Young, Matthew T Cc: user@spark.apache.org Subject: Re: Spark and SQL Server Sorry for the confusion. What are the other issues? On Mon, Jul 20, 2015 at 8:26 AM, Young, Matthew T matthew.t.yo...@intel.com wrote: Thanks Davies

RE: Spark and SQL Server

2015-07-20 Thread Young, Matthew T
...@databricks.com] Sent: Saturday, July 18, 2015 12:45 AM To: Young, Matthew T Cc: user@spark.apache.org Subject: Re: Spark and SQL Server I think you have a mistake in your call to jdbc(); it should be: jdbc(self, url, table, mode, properties). You had used properties as the third parameter. On Fri, Jul 17, 2015

Spark and SQL Server

2015-07-17 Thread Young, Matthew T
Hello, I am testing Spark interoperation with SQL Server via JDBC with Microsoft’s 4.2 JDBC Driver. Reading from the database works ok, but I have encountered a couple of issues writing back. In Scala 2.10 I can write back to the database except for a couple of types. 1. When I read a