Incremental model update

2016-09-27 Thread Debasish Ghosh
Hello - I have a question on how to handle incremental model updates in Spark .. We have a time series where we predict the future conditioned on the past. We can train a model offline based on historical data and then use that model during prediction. But say, if the underlying process is

Problems with new experimental Kafka Consumer for 0.10

2016-09-27 Thread Matthias Niehoff
Hi everybody, I am using the new Kafka Receiver for Spark Streaming for my job. When running with the old consumer it runs fine. The job consumes 3 topics, saves the data to Cassandra, cogroups the topics, calls mapWithState and stores the results in Cassandra. After that I manually commit the Kafka

pyspark cluster mode on standalone deployment

2016-09-27 Thread Ofer Eliassaf
Is there any plan to support python spark running in "cluster mode" on a standalone deployment? There is this famous survey mentioning that more than 50% of the users are using the standalone configuration. Making pyspark work in cluster mode with standalone will help a lot for high availability

Parquet compression jars not found - both snappy and lzo - PySpark 2.0.0

2016-09-27 Thread Russell Jurney
In PySpark 2.0.0, despite adding snappy and lzo to my spark.jars path, I get errors that say these classes can't be found when I save to a parquet file. I tried switching from default snappy to lzo and added that jar and I get the same error. What am I to do? I can't figure out any other steps

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Adrian Bridgett
We use spark-csv (a successor of which is built into Spark 2.0) for this. It doesn't cause crashes; failed parsing is logged. We run on Mesos, so I have to pull back all the logs from all the executors and search for failed lines (so that we can ensure that the failure rate isn't too

Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread Chanh Le
The difference between stream vs micro batch is about ordering of messages: > Spark Streaming guarantees ordered processing of RDDs in one DStream. Since > each RDD is processed in parallel, there is no order guaranteed within the > RDD. This is a tradeoff design Spark made. If you want to

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Mich Talebzadeh
Thanks guys. Actually these are the 7 rogue rows. The column 0 is the Volume column, which means there were no trades on those days:

cat stock.csv | grep ",0"
SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
SAP SE,SAP,

CGroups and Spark

2016-09-27 Thread Harut
Hi. I'm running Spark on YARN without CGroups turned on, and have 2 questions: 1. Does either Spark or YARN guarantee that my Spark tasks won't eat up more CPU cores than I've assigned? (I assume there are no guarantees, correct?) 2. What is the effect of setting --executor-cores when submitting

Trying to fetch S3 data

2016-09-27 Thread Hitesh Goyal
Hi team, I want to fetch data from Amazon S3 bucket. For this, I am trying to access it using scala. I have tried the basic wordcount application in scala. Now I want to retrieve s3 data using it. I have gone through the tutorials and I found solutions for uploading files to S3. Please tell me

Help required in validating an architecture using Structured Streaming

2016-09-27 Thread Aravindh
Hi, We are building an internal analytics application. Kind of an event store. We have all the basic analytics use cases like filtering, aggregation, segmentation etc. So far our architecture used ElasticSearch extensively but that is not scaling anymore. One unique requirement we have is an event

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Mike Metzger
Hi Mich - Can you run a filter command on df1 prior to your map for any rows where p(3).toString != '-' then run your map command? Thanks Mike On Tue, Sep 27, 2016 at 5:06 PM, Mich Talebzadeh wrote: > Thanks guys > > Actually these are the 7 rogue rows. The
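Mike's filter-before-map suggestion can be sketched with plain Python lists standing in for the DataFrame/RDD (the sample rows and the variable names here are made up; index 3 is the Open column, as in the thread):

```python
# Hedged sketch: drop rows whose Open column (index 3) is the "-" placeholder
# before any float() parsing, mirroring the filter-then-map suggestion.
raw = [
    "SAP SE,SAP,23-Dec-11,-,-,-,40.56,0",             # rogue row: no trades
    "SAP SE,SAP,22-Dec-11,41.10,41.50,40.80,41.20,5"  # normal row
]

rows = [line.split(",") for line in raw]
clean = [p for p in rows if p[3] != "-"]              # the suggested filter
parsed = [(p[2], float(p[3]), float(p[6])) for p in clean]
print(parsed)
```

The same predicate would go into a `filter` on the real RDD/DataFrame before the `map` that calls `.toFloat`.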

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Hyukjin Kwon
Hi Mich, I guess you could use the nullValue option by setting it to null. If you are reading them in as strings in the first place, then you would hit https://github.com/apache/spark/pull/14118 first, which is resolved in 2.0.1. Unfortunately, this bug also exists in the external csv library for

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread ayan guha
You can read as string, write a map to fix rows and then convert back to your desired Dataframe. On 28 Sep 2016 06:49, "Mich Talebzadeh" wrote: > > I have historical prices for various stocks. > > Each csv file has 10 years trade one row per each day. > > These are the
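The read-as-string, fix, then parse approach can be sketched in a few lines of plain Python (the sample line is made up, and substituting "0" for "-" is one possible repair, not necessarily the one you want):

```python
# Hedged sketch: repair each raw csv line first ("-" fields become "0"),
# then parse into typed columns afterwards.
def fix(line):
    return ",".join("0" if f.strip() == "-" else f for f in line.split(","))

print(fix("SAP SE,SAP,23-Dec-11,-,-,-,40.56,0"))
```

On the real data this `fix` would run inside the map over the string-typed rows, before converting back to the desired DataFrame schema.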

Re: unsubscribe

2016-09-27 Thread Daniel Lopes
To unsubscribe e-mail: user-unsubscr...@spark.apache.org *Daniel Lopes* Chief Data and Analytics Officer | OneMatch c: +55 (18) 99764-2733 | http://www.daniellopes.com.br www.onematch.com.br On Mon, Sep 26, 2016 at 12:24

Question about executor memory setting

2016-09-27 Thread Dogtail L
Hi all, May I ask a question about executor memory setting? I was running PageRank with input size 2.8GB on one workstation for testing. I gave PageRank one executor. In case 1, I set --executor-cores to 4, and --executor-memory to 1GB, and the stage (stage 2) completion time is 14 min, the

Re: Large-scale matrix inverse in Spark

2016-09-27 Thread Anastasios Zouzias
Hi there, As Edward noted, if you ask a numerical analyst about matrix inversion, they will respond "you never invert a matrix, but you solve the linear system associated with the matrix". Linear system solving is usually done with iterative methods or matrix decompositions (as noted above). The

Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread kant kodali
I understand the difference between fraud detection and fraud prevention in general but I am not interested in the semantic war on what these terms precisely mean. I am more interested in understanding the difference between mini-batch vs real time streaming from CS perspective. On Tue, Sep

Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread Alonso Isidoro Roman
Mini batch or near real time: processing frames within 500 ms or more. Real time: processing frames in 5 ms-10 ms. The main difference is processing velocity, I think. Apache Spark Streaming is mini batch, not true real time. Alonso Isidoro Roman about.me/alonso.isidoro.roman

Re: Problems with new experimental Kafka Consumer for 0.10

2016-09-27 Thread Cody Koeninger
What's the actual stacktrace / exception you're getting related to commit failure? On Tue, Sep 27, 2016 at 9:37 AM, Matthias Niehoff wrote: > Hi everybody, > > i am using the new Kafka Receiver for Spark Streaming for my Job. When > running with old consumer it

Re: read multiple files

2016-09-27 Thread Peter Figliozzi
If you're up for a fancy but excellent solution:
- Store your data in Cassandra.
- Use the expiring data feature (TTL) so data will automatically be removed a month later.
- Now in your Spark process, just read

Re: Newbie Q: Issue related to connecting Spark Master Standalone through Scala app

2016-09-27 Thread Reth RM
Hi Ayan, Thank you for the response. I tried to connect to the same "stand alone spark master" through spark-shell and it works as intended. On the shell, I tried ./spark-shell --master spark://host:7077. The connection was established and it wrote an info line on the console: 'Spark context available as 'sc' (master

Re: Newbie Q: Issue related to connecting Spark Master Standalone through Scala app

2016-09-27 Thread Ding Fei
Have you checked the version of the Spark library referenced in the IntelliJ project and compared it with the binary distribution version? On Tue, 2016-09-27 at 09:13 -0700, Reth RM wrote: > Hi Ayan, > > > Thank you for the response. I tried to connect to same "stand alone > spark master" through

log4j custom properties for spark project

2016-09-27 Thread KhajaAsmath Mohammed
Hello everyone, I am using the log4j properties below and they display the logs well, but I still want to remove the Spark logs and have only application logs printed on my console and file. Could you please let me know the changes that are required for it? log4j.rootCategory=INFO, console
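The usual way to quiet Spark while keeping application logs is to raise the level of Spark's own logger rather than the root. A minimal sketch, assuming the Log4j 1.x properties format that Spark ships with (logger names below are the common ones; adjust to your packages):

```properties
# root stays at INFO so your application classes still log normally
log4j.rootCategory=INFO, console

# quiet Spark internals down to warnings only
log4j.logger.org.apache.spark=WARN
log4j.logger.org.spark_project.jetty=WARN
```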

Access S3 buckets in multiple accounts

2016-09-27 Thread Daniel Siegmann
I am running Spark on Amazon EMR and writing data to an S3 bucket. However, the data is read from an S3 bucket in a separate AWS account. Setting the fs.s3a.access.key and fs.s3a.secret.key values is sufficient to get access to the other account (using the s3a protocol), however I then won't have
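One option worth checking is s3a's per-bucket configuration, which lets the cross-account bucket carry its own keys while everything else uses the EMR account's credentials. This arrived in newer Hadoop s3a clients, so verify your EMR's Hadoop version supports it; bucket names and keys below are placeholders:

```properties
# default credentials (instance profile or fs.s3a.access.key) cover the
# buckets in the EMR account; override only the other account's bucket:
spark.hadoop.fs.s3a.bucket.other-account-bucket.access.key=AKIAEXAMPLE
spark.hadoop.fs.s3a.bucket.other-account-bucket.secret.key=EXAMPLESECRET
```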

Re: Large-scale matrix inverse in Spark

2016-09-27 Thread Sean Owen
I don't recall any code in Spark that computes a matrix inverse. There is code that solves linear systems Ax = b with a decomposition. For example from looking at the code recently, I think the regression implementation actually solves AtAx = Atb using a Cholesky decomposition. But, A = n x k,
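The AtAx = Atb route Sean describes can be sketched for tiny dense matrices. This is toy plain-Python code, not Spark's actual solver: it forms the k x k normal equations and solves them with a hand-rolled Cholesky factorization.

```python
# Toy sketch of least squares via the normal equations (A^T A) x = A^T b,
# solved with a Cholesky factorization A^T A = L L^T.
# Small dense matrices only; NOT Spark's implementation.
import math

def solve_normal_equations(a, b):
    m, n = len(a), len(a[0])
    # form A^T A (n x n) and A^T b (n)
    ata = [[sum(a[r][i] * a[r][j] for r in range(m)) for j in range(n)]
           for i in range(n)]
    atb = [sum(a[r][i] * b[r] for r in range(m)) for i in range(n)]
    # Cholesky (valid because A^T A is symmetric positive definite)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = ata[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    # forward substitution: L y = A^T b
    y = [0.0] * n
    for i in range(n):
        y[i] = (atb[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    # back substitution: L^T x = y
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x

x = solve_normal_equations([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                           [1.0, 2.0, 3.0])
print(x)  # ~[1.0, 2.0]
```

As the thread notes, this scales because only the k x k Gram matrix is ever factorized, never an n x n inverse.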

Re: Newbie Q: Issue related to connecting Spark Master Standalone through Scala app

2016-09-27 Thread ayan guha
Can you run spark-shell and try what you are trying? It is probably an IntelliJ issue. On Tue, Sep 27, 2016 at 3:59 PM, Reth RM wrote: > Hi, > > I have issue connecting spark master, receiving a RuntimeException: > java.io.InvalidClassException:

Re: Large-scale matrix inverse in Spark

2016-09-27 Thread Edward Fine
I have not found matrix inversion algorithms in Spark and I would be surprised to see them. Except for matrices with very special structure (like those nearly the identity), inverting an n*n matrix is slower than O(n^2), which does not scale. Whenever a matrix is inverted, usually a

Re: read multiple files

2016-09-27 Thread Mich Talebzadeh
Hi Divya, There are a number of ways you can do this. Get today's date in epoch format. These are my package imports:

import java.util.Calendar
import org.joda.time._
import java.math.BigDecimal
import java.sql.{Timestamp, Date}
import org.joda.time.format.DateTimeFormat

// Get epoch time now
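Language aside, the selection logic itself is small: parse the epoch suffix out of each batch-<epoch> folder name and keep the recent ones. A hedged Python sketch with made-up folder names and a "now" passed in explicitly so the logic is deterministic:

```python
# Hedged sketch: keep only input folders whose batch-<epochSeconds> suffix
# falls within the last max_age_seconds of a given now_epoch.
def recent(paths, now_epoch, max_age_seconds):
    return [p for p in paths
            if now_epoch - int(p.rsplit("-", 1)[1]) <= max_age_seconds]

print(recent(["InputFolder/batch-1474959600"], 1475000000, 30 * 24 * 3600))
```

The surviving paths can then be handed to the reader in one call instead of reading the whole folder.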

Re: Pyspark not working on yarn-cluster mode

2016-09-27 Thread ofer
I advise you to use Livy for this purpose. Livy works well with YARN and it will decouple Spark from your web app. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-not-working-on-yarn-cluster-mode-tp23755p27799.html Sent from the Apache Spark User

Incremental model update

2016-09-27 Thread debasishg
Hello - I have a question on how to handle incremental model updates in Spark ML .. We have a time series where we predict the future conditioned on the past. We can train a model offline based on historical data and then use that model during prediction. But say, if the underlying process is

Question about single/multi-pass execution in Spark-2.0 dataset/dataframe

2016-09-27 Thread Spark User
case class Record(keyAttr: String, attr1: String, attr2: String, attr3: String)
val ds = sparkSession.createDataset(rdd).as[Record]
val attr1Counts = ds.groupBy("keyAttr", "attr1").count()
val attr2Counts = ds.groupBy("keyAttr", "attr2").count()
val attr3Counts = ds.groupBy("keyAttr",

ORC file stripe statistics in Spark

2016-09-27 Thread Sudhir Babu Pothineni
I am trying to get the number of rows in each stripe of an ORC file. hivecontext.orcFile doesn't exist anymore? I am using Spark 1.6.0.

scala> val hiveSqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveSqlContext: org.apache.spark.sql.hive.HiveContext =

Re: How does chaining of Windowed Dstreams work?

2016-09-27 Thread Hemalatha A
Hello, Can anyone please answer the below question and help me understand the windowing operations. On Sun, Sep 4, 2016 at 4:42 PM, Hemalatha A < hemalatha.amru...@googlemail.com> wrote: > Hello, > > I have a set of Dstreams on which I'm performing some computation on each > Dstreams which is

Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread Mich Talebzadeh
Replace mini-batch with micro-batching and do a search again. What is your understanding of fraud detection? Spark Streaming can be used for risk calculation and fraud detection (including stopping fraud going through, for example credit card fraud) effectively "in practice". It can even be used

why spark ml package doesn't contain svm algorithm

2016-09-27 Thread hxw黄祥为
I have found that the spark ml package implements the naive bayes algorithm and the source code is simple. I am confused why the spark ml package doesn't contain an svm algorithm; it seems not very hard to do that.

What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread kant kodali
What is the difference between mini-batch vs real time streaming in practice (not theory)? In theory, I understand mini batch is something that batches in the given time frame whereas real time streaming is more like do something as the data arrives but my biggest question is why not have mini

DataFrame Rejection Directory

2016-09-27 Thread Mostafa Alaa Mohamed
Hi All, I have a dataframe that contains some data and I need to insert it into a hive table. My questions: 1- Where will Spark save the rejected rows from the insert statements? 2- Can Spark fail if some rows are rejected? 3- How can I specify the rejection directory? Regards,

read multiple files

2016-09-27 Thread Divya Gehlot
Hi, The input data files for my Spark job are generated every five minutes. File names follow the epoch time convention below:

InputFolder/batch-147495960
InputFolder/batch-147495990
InputFolder/batch-147496020
InputFolder/batch-147496050
InputFolder/batch-147496080

Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Mich Talebzadeh
I have historical prices for various stocks. Each csv file has 10 years of trades, one row per day. These are the columns defined in the class: case class columns(Stock: String, Ticker: String, TradeDate: String, Open: Float, High: Float, Low: Float, Close: Float, Volume: Integer) The issue is
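Since the price columns arrive as "-" on no-trade days, one defensive option is to parse them into an optional value instead of converting unconditionally. A small Python sketch of the idea (the helper name is made up; the real job would do the equivalent in its Scala map):

```python
# Hedged sketch: map the "-" placeholder (or any other junk) to None
# rather than letting the numeric conversion raise on rogue rows.
def to_float_or_none(s):
    try:
        return float(s.strip())
    except ValueError:
        return None

print(to_float_or_none("40.56"))  # 40.56
print(to_float_or_none("-"))      # None
```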

Re: why spark ml package doesn't contain svm algorithm

2016-09-27 Thread Nick Pentreath
There is a JIRA and PR for it - https://issues.apache.org/jira/browse/SPARK-14709 On Tue, 27 Sep 2016 at 09:10 hxw黄祥为 wrote: > I have found spark ml package have implement naivebayes algorithm and the > source code is simple,. > > I am confusing why spark ml package doesn’t