How does predict() work in LogisticRegressionModel?

2015-11-13 Thread MEETHU MATHEW
Hi all, Can somebody point me to the implementation of predict() in LogisticRegressionModel of Spark MLlib? I could find a predictPoint() in the class LogisticRegressionModel, but where is predict()? Thanks & Regards, Meethu M
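For reference: in Spark 1.x MLlib, predict() is not declared in LogisticRegressionModel itself; it is inherited from GeneralizedLinearModel, which delegates to the subclass's predictPoint(). A minimal usage sketch, assuming a trained model and an RDD[Vector] named testFeatures:

    import org.apache.spark.mllib.linalg.Vectors

    // predict() lives in the parent GeneralizedLinearModel and calls
    // predictPoint() with this model's weights and intercept
    val one: Double = model.predict(Vectors.dense(0.5, 1.2)) // single vector
    val all = model.predict(testFeatures)                    // RDD[Vector] -> RDD[Double]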

Re: how to run unit test for specific component only

2015-11-13 Thread Steve Loughran
try: mvn test -pl sql -DwildcardSuites=org.apache.spark.sql -Dtest=none On 12 Nov 2015, at 03:13, weoccc wrote: Hi, I am wondering how to run unit test for specific spark component only. mvn test -DwildcardSuites="org.apache.spark.sql.*"

Re: large, dense matrix multiplication

2015-11-13 Thread Sabarish Sasidharan
We have done this by blocking but without using BlockMatrix. We used our own blocking mechanism because BlockMatrix didn't exist in Spark 1.2. What is the size of your block? How much memory are you giving to the executors? I assume you are running on YARN, if so you would want to make sure your

Re: HiveServer2 Thrift OOM

2015-11-13 Thread Steve Loughran
looks suspiciously like some thrift transport unmarshalling problem, THRIFT-2660 Spark 1.5 uses hive 1.2.1; it should have the relevant thrift JAR too. Otherwise, you could play with thrift JAR versions yourself; maybe it will work, maybe not... On 13 Nov 2015, at 00:29, Yana Kadiyska

Joining HDFS and JDBC data sources - benchmarks

2015-11-13 Thread Eran Medan
Hi, I'm looking for some benchmarks on joining data frames where most of the data is in HDFS (e.g. in Parquet) and some "reference" or "metadata" is still in an RDBMS. I am only looking at the very first join before any caching happens, and I assume there will be loss of parallelization because

a way to allow spark job to continue despite task failures?

2015-11-13 Thread Nicolae Marasoiu
Hi, I know a task can fail 2 times and only the 3rd breaks the entire job. I am good with this number of attempts. I would like that after trying a task 3 times, it continues with the other tasks. The job can be "failed", but I want all tasks run. Please see my use case. I read a hadoop

Please add us to the Powered by Spark page

2015-11-13 Thread Sujit Pal
Hello, We have been using Spark at Elsevier Labs for a while now. Would love to be added to the “Powered By Spark” page. Organization Name: Elsevier Labs URL: http://labs.elsevier.com Spark components: Spark Core, Spark SQL, MLLib, GraphX. Use Case: Building Machine Reading Pipeline, Knowledge

Re: spark 1.4 GC issue

2015-11-13 Thread Gaurav Kumar
Please have a look at http://spark.apache.org/docs/1.4.0/tuning.html You may also want to use the latest build of JDK 7/8 and use G1GC instead. I saw considerable reductions in GC time just by doing that. Rest of the tuning parameters are better explained in the link above. Best Regards, Gaurav
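A hedged sketch of the G1GC switch for the executors (driver-side JVM flags have to be passed at submit time instead, since the driver JVM is already running when SparkConf is read):

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC") // executor JVMs use G1GC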

Re: Kafka Offsets after application is restarted using Spark Streaming Checkpointing

2015-11-13 Thread Cody Koeninger
Unless you change maxRatePerPartition, a batch is going to contain all of the offsets from the last known processed to the highest available. Offsets are not time-based, and Kafka's time-based api currently has very poor granularity (it's based on filesystem timestamp of the log segment). There's
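For reference, a sketch of the rate cap mentioned above; the conf key is spark.streaming.kafka.maxRatePerPartition (messages per second per partition) and the value is illustrative:

    val conf = new SparkConf()
      .set("spark.streaming.kafka.maxRatePerPartition", "1000") // per-partition cap per second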

Re: large, dense matrix multiplication

2015-11-13 Thread Eilidh Troup
Hi Sab, Thanks for your response. We’re thinking of trying a bigger cluster, because we just started with 2 nodes. What we really want to know is whether the code will scale up with larger matrices and more nodes. I’d be interested to hear how large a matrix multiplication you managed to do?

hang correlated to number of shards Re: Checkpointing with Kinesis hangs with socket timeouts when driver is relaunched while transforming on a 0 event batch

2015-11-13 Thread Hster Geguri
Just an update that the Kinesis checkpointing works well with orderly and kill -9 driver shutdowns when there are fewer than 4 shards. We use 20+. I created a case with Amazon support since it is the AWS Kinesis getRecords API which is hanging. Regards, Heji On Thu, Nov 12, 2015 at 10:37 AM,

SequenceFile and object reuse

2015-11-13 Thread jeff saremi
So we tried reading a SequenceFile in Spark and realized that all our records have ended up becoming the same. Then one of us found this: Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an
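The usual workaround is to copy each record out of the reused Writable before caching. A minimal sketch, assuming Text keys and values and an existing SparkContext sc (the path is a placeholder):

    import org.apache.hadoop.io.Text

    val records = sc.sequenceFile("/data/in", classOf[Text], classOf[Text])
      .map { case (k, v) => (k.toString, v.toString) } // fresh copies, safe to cache
      .cache()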

Re: very slow parquet file write

2015-11-13 Thread Davies Liu
Have you used any partitioned columns when writing as json or parquet? On Fri, Nov 6, 2015 at 6:53 AM, Rok Roskar wrote: > yes I was expecting that too because of all the metadata generation and > compression. But I have not seen performance this bad for other parquet > files

Columnar Statistics

2015-11-13 Thread sara mustafa
Hi, I am using Spark 1.5.2 and I notice the existence of the class org.apache.spark.sql.columnar.ColumnStatisticsSchema. How can I use it to calculate column statistics of a DataFrame? Thanks,

Re: What is the difference btw reduce & fold?

2015-11-13 Thread firemonk9
This is very well explained. Thank you. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-difference-btw-reduce-fold-tp22653p25376.html

SparkException: Could not read until the end sequence number of the range

2015-11-13 Thread Alan Dipert
Hi all, We're running Spark 1.5.0 on EMR 4.1.0 in AWS and consuming from Kinesis. We saw the following exception today - it killed the Spark "step": org.apache.spark.SparkException: Could not read until the end sequence number of the range We guessed it was because our Kinesis stream didn't

pyspark sql: number of partitions and partition by size?

2015-11-13 Thread Wei Chen
Hey Friends, I am trying to use sqlContext.write.parquet() to write a dataframe to parquet files. I have the following questions. 1. number of partitions The default number of partitions seems to be 200. Is there any way other than using df.repartition(n) to change this number? I was told
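If the 200 partitions come from a shuffle, the relevant knob is likely spark.sql.shuffle.partitions; a hedged Scala sketch (the same conf applies from PySpark; values and path are illustrative):

    sqlContext.setConf("spark.sql.shuffle.partitions", "50") // partitions produced by shuffles
    df.coalesce(10).write.parquet("/path/out")               // shrink partitions without a full shuffle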

does spark ML have something like createDataPartition() in R caret package?

2015-11-13 Thread Andy Davidson
In R, it's easy to split a data set into training, crossValidation, and test sets. Is there something like this in spark.ml? I am using Python for now. My real problem is I want to randomly select a relatively small data set to do some initial data exploration. It's not clear to me how using Spark I

Join and HashPartitioner question

2015-11-13 Thread Alexander Pivovarov
Hi Everyone Is there any difference in performance btw the following two joins? val r1: RDD[(String, String)] = ??? val r2: RDD[(String, String)] = ??? val partNum = 80 val partitioner = new HashPartitioner(partNum) // Join 1 val res1 =
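A sketch of how the comparison presumably continues, completing the truncated code under that assumption:

    // Join 1: pass the partitioner directly; both sides are shuffled into it
    val res1 = r1.join(r2, partitioner)

    // Join 2: pre-partition and cache both sides, so this and any later joins
    // reuse the co-partitioned layout without another shuffle
    val p1 = r1.partitionBy(partitioner).cache()
    val p2 = r2.partitionBy(partitioner).cache()
    val res2 = p1.join(p2)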

Re: Distributing Python code packaged as tar balls

2015-11-13 Thread Davies Liu
Python does not support libraries as tar balls, so PySpark may not support that either. On Wed, Nov 4, 2015 at 5:40 AM, Praveen Chundi wrote: > Hi, > > Pyspark/spark-submit offers a --py-files handle to distribute python code > for execution. Currently(version 1.5) only zip

Re: bin/pyspark SparkContext is missing?

2015-11-13 Thread Davies Liu
You forgot to create a SparkContext instance: sc = SparkContext() On Tue, Nov 3, 2015 at 9:59 AM, Andy Davidson wrote: > I am having a heck of a time getting Ipython notebooks to work on my 1.5.1 > AWS cluster I created using

Re: a way to allow spark job to continue despite task failures?

2015-11-13 Thread Ted Yu
I searched the code base and looked at: https://spark.apache.org/docs/latest/running-on-yarn.html I didn't find mapred.max.map.failures.percent or its counterpart. FYI On Fri, Nov 13, 2015 at 9:05 AM, Nicolae Marasoiu < nicolae.maras...@adswizz.com> wrote: > Hi, > > > I know a task can fail 2
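The closest Spark setting appears to be spark.task.maxFailures, which raises the number of attempts per task rather than letting a job skip failed tasks and continue; a sketch:

    val conf = new SparkConf()
      .set("spark.task.maxFailures", "8") // attempts per task before the whole job fails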

Re: very slow parquet file write

2015-11-13 Thread Rok Roskar
I'm not sure what you mean? I didn't do anything specifically to partition the columns On Nov 14, 2015 00:38, "Davies Liu" wrote: > Do you have partitioned columns? > > On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar wrote: > > I'm writing a ~100 Gb

send transformed RDD to s3 from slaves

2015-11-13 Thread Walrus theCat
Hi, I have an RDD which crashes the driver when being collected. I want to send the data on its partitions out to S3 without bringing it back to the driver. I try calling rdd.foreachPartition, but the data that gets sent has not gone through the chain of transformations that I need. It's the
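One way to keep the driver out of the loop entirely: call a save action on the final transformed RDD, which runs the whole lineage on the executors and writes straight to S3. A sketch under those assumptions (the map is a stand-in for the real transformation chain; the bucket name is hypothetical; use s3a or s3n per your Hadoop version, with credentials configured on the cluster):

    val transformed = rdd.map(_.toString)                // stand-in transformations
    transformed.saveAsTextFile("s3a://my-bucket/output") // written from the executors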

Re: very slow parquet file write

2015-11-13 Thread Davies Liu
Do you have partitioned columns? On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar wrote: > I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a > parquet file on HDFS. I've got a few hundred nodes in the cluster, so for > the size of file this is way
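For context, "partitioned columns" refers to a write like the following (column name hypothetical); a high-cardinality partition column produces one directory per distinct value and can slow writes dramatically:

    df.write.partitionBy("date").parquet("/path/out") // one subdirectory per distinct date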

out of memory error with Parquet

2015-11-13 Thread AlexG
I'm using Spark to read in data from many files and write it back out in Parquet format for ease of use later on. Currently, I'm using this code: val fnamesRDD = sc.parallelize(fnames, ceil(fnames.length.toFloat/numfilesperpartition).toInt) val results =

Spark file streaming issue

2015-11-13 Thread ravi.gawai
Hi, I am trying a simple file streaming example using Spark Streaming (spark-streaming_2.10, version 1.5.1) public class DStreamExample { public static void main(final String[] args) { final SparkConf sparkConf = new SparkConf(); sparkConf.setAppName("SparkJob");
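For comparison, a minimal self-contained sketch of the same setup in Scala (directory and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("SparkJob")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.textFileStream("hdfs:///incoming").print() // only sees files newly moved into the directory
    ssc.start()
    ssc.awaitTermination()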

Re: Save GraphX to disk

2015-11-13 Thread SLiZn Liu
Hi Gaurav, Your graph can be saved to graph databases like Neo4j or Titan through their drivers, which eventually persist it to disk. BR, Todd Gaurav Kumar gauravkuma...@gmail.com wrote on Fri, 13 Nov 2015 at 22:08: > Hi, > > I was wondering how to save a graph to disk and load it back again. I know > how

Re: does spark ML have something like createDataPartition() in R caret package?

2015-11-13 Thread Sonal Goyal
The RDD has a takeSample method where you can supply the flag for replacement or not as well as the fraction to sample. On Nov 14, 2015 2:51 AM, "Andy Davidson" wrote: > In R, its easy to split a data set into training, crossValidation, and > test set. Is there
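randomSplit is the closer analogue to caret's createDataPartition, and sample covers the small exploration set; a sketch with illustrative weights (both methods exist on RDDs and DataFrames):

    val Array(train, cv, test) = df.randomSplit(Array(0.6, 0.2, 0.2), seed = 42L)
    val explore = df.sample(withReplacement = false, fraction = 0.01, seed = 42L)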

Re: large, dense matrix multiplication

2015-11-13 Thread Burak Yavuz
Hi, The BlockMatrix multiplication should be much more efficient on the current master (and will be available with Spark 1.6). Could you please give that a try if you have the chance? Thanks, Burak On Fri, Nov 13, 2015 at 10:11 AM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote:
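A sketch of the BlockMatrix route, assuming rowsA and rowsB are RDD[IndexedRow] and with an arbitrary 1024x1024 block size:

    import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

    val a = new IndexedRowMatrix(rowsA).toBlockMatrix(1024, 1024)
    val b = new IndexedRowMatrix(rowsB).toBlockMatrix(1024, 1024)
    val c = a.multiply(b) // block-distributed matrix product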

RE: Connecting SparkR through Yarn

2015-11-13 Thread Sun, Rui
I guess this is not related to SparkR. It seems that Spark can’t pick up the hostname/IP address of the RM. Make sure you have correctly set the YARN_CONF_DIR env var and have configured the address of the RM in yarn-site.xml. From: Amit Behera [mailto:amit.bd...@gmail.com] Sent: Friday, November 13, 2015 9:38 PM

Re: out of memory error with Parquet

2015-11-13 Thread AlexG
Never mind; when I switched to Spark 1.5.0, my code works as written and is pretty fast! Looking at some Parquet related Spark jiras, it seems that Parquet is known to have some memory issues with buffering and writing, and that at least some were resolved in Spark 1.5.0.

Re: out of memory error with Parquet

2015-11-13 Thread Josh Rosen
Tip: jump straight to 1.5.2; it has some key bug fixes. Sent from my phone > On Nov 13, 2015, at 10:02 PM, AlexG wrote: > > Never mind; when I switched to Spark 1.5.0, my code works as written and is > pretty fast! Looking at some Parquet related Spark jiras, it seems that

Kafka Offsets after application is restarted using Spark Streaming Checkpointing

2015-11-13 Thread kundan kumar
Hi, I am using the Spark Streaming checkpointing mechanism and reading the data from Kafka. The window duration for my application is 2 hrs with a sliding interval of 15 minutes. So, my batches run at the following intervals... 09:45 10:00 10:15 10:30 and so on Suppose, my running batch dies at 09:55

Spark and Spring Integrations

2015-11-13 Thread Netai Biswas
Hi, I am facing an issue while integrating Spark with Spring. I am getting "java.lang.IllegalStateException: Cannot deserialize BeanFactory with id" errors for all beans. I have tried a few solutions available on the web. Please help me out to solve this issue. Few details: Java : 8 Spark : 1.5.1

Spark Executors off-heap memory usage keeps increasing

2015-11-13 Thread Balthasar Schopman
Hi, The off-heap memory usage of the 3 Spark executor processes keeps increasing constantly until the boundaries of the physical RAM are hit. This happened two weeks ago, at which point the system comes to a grinding halt, because it's unable to spawn new processes. At such a moment restarting

Training data sets storage requirement

2015-11-13 Thread Veluru, Aruna
Hi All, Just started understanding / getting hands on with Spark, Streaming and MLLIb. We are in the design phase and need suggestions on the training data storage requirement. Batch Layer: Our core systems generate data which we will be using as batch data, currently SQL

spark 1.4 GC issue

2015-11-13 Thread Renu Yadav
I am using Spark 1.4 and my application is spending much time in GC, around 60-70% of the time for each task. I am using the parallel GC. Somebody please help as soon as possible. Thanks, Renu

Re: Stack Overflow Question

2015-11-13 Thread Sabarish Sasidharan
The reserved cores are to prevent starvation so that user B can run jobs when user A's job is already running and using almost all of the cluster. You can change your scheduler configuration to use more cores. Regards Sab On 13-Nov-2015 6:56 pm, "Parin Choganwala" wrote: >

Save GraphX to disk

2015-11-13 Thread Gaurav Kumar
Hi, I was wondering how to save a graph to disk and load it back again. I know how to save vertices and edges to disk and construct the graph from them, not sure if there's any method to save the graph itself to disk. Best Regards, Gaurav Kumar
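One sketch building on what the question already describes: persist the vertex and edge RDDs as object files (which preserve their types) and rebuild with Graph(); it assumes a Graph[String, Int] named graph, an existing SparkContext sc, and placeholder paths:

    import org.apache.spark.graphx.{Edge, Graph, VertexId}

    graph.vertices.saveAsObjectFile("hdfs:///g/vertices")
    graph.edges.saveAsObjectFile("hdfs:///g/edges")

    val vs = sc.objectFile[(VertexId, String)]("hdfs:///g/vertices")
    val es = sc.objectFile[Edge[Int]]("hdfs:///g/edges")
    val restored = Graph(vs, es)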

Stack Overflow Question

2015-11-13 Thread Parin Choganwala
EMR 4.1.0 + Spark 1.5.0 + YARN Resource Allocation http://stackoverflow.com/q/33488869/1366507?sem=2