Trying to fetch S3 data

2016-09-27 Thread Hitesh Goyal
Hi team,

I want to fetch data from an Amazon S3 bucket. For this, I am trying to access it
using Scala.
I have tried the basic word-count application in Scala.
Now I want to retrieve S3 data using it.
I have gone through the tutorials and I found solutions for uploading files to
S3.
Please tell me how I can retrieve data from the buckets stored in S3.
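
For reference, a minimal sketch of reading S3 data from Scala through the s3a
filesystem. The bucket name, key prefix and credential values are placeholders,
and the hadoop-aws / AWS SDK jars are assumed to be on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ReadFromS3").getOrCreate()

// Credentials can also come from IAM roles or environment variables; hard-coded
// placeholders are shown only to make the required keys explicit.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

// Once s3a is configured, an S3 path can be read like any HDFS path.
val lines = spark.sparkContext.textFile("s3a://my-bucket/path/to/data/*.txt")
println(lines.count())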

Regards,
Hitesh Goyal
Simpli5d Technologies
Cont No.: 9996588220



CGroups and Spark

2016-09-27 Thread Harut
Hi.

I'm running Spark on YARN without CGroups turned on, and have 2 questions:

1. Does either Spark or YARN guarantee that my Spark tasks won't eat up more
CPU cores than I've assigned? (I assume there are no guarantees, correct?)
2. What is the effect of setting --executor-cores when submitting the job?
Apart from parallelism, does it constrain the executor process in any way?
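
For reference, the flag in question is set at submit time like this (class name,
jar and sizes are placeholders). As far as I know, --executor-cores controls the
cores requested per executor and therefore the number of concurrent tasks per
executor, while hard CPU capping is a separate YARN/CGroups concern:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 4g \
  --class com.example.MyJob \
  my-job.jar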



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/CGroups-and-Spark-tp27802.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Help required in validating an architecture using Structured Streaming

2016-09-27 Thread Aravindh
Hi, We are building an internal analytics application, kind of an event
store. We have all the basic analytics use cases like filtering,
aggregation, segmentation, etc. So far our architecture has used ElasticSearch
extensively, but that is not scaling anymore. One unique requirement we have
is that an event should be available for querying within 5 seconds of the event.
We were thinking of a lambda architecture where streaming data still goes to
Elasticsearch (only 1 day's data) and the batch pipeline goes to S3. Once a
day, a Spark job would transform that data and store it again in S3. One problem
we were not able to solve was, when a query comes, how to aggregate results
from 2 data sources (ES for current data & S3 for old data). We felt this
approach won't scale.

Spark Structured Streaming seems to solve this. Correct me if I am wrong.
With Structured Streaming, will the following architecture work?
Read data from Kafka using Spark. For every batch of data, do the
transformations and store the results in S3. But when a query comes, query both S3
& the in-memory batch at the same time. Will this approach work? Also, one more
condition is that querying should respond immediately, with a max latency of 1s
for simple queries and 5s for complex queries. If the above method is not
the right way, please suggest an alternative to solve this.
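
For what it's worth, a very rough sketch of the Kafka -> transform -> S3 leg
described above. The spark-sql-kafka-0-10 source is assumed to be available
(it shipped shortly after the initial 2.0.0 release); broker, topic, bucket and
column names are placeholders:

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")   // transformations would go here

val query = events.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/events/")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
  .start()

Whether ad-hoc queries can be served from the in-flight batch and S3 at the
same time is exactly the open question here; this only covers the ingestion
side.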

Thanks
Aravindh.S



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Help-required-in-validating-an-architecture-using-Structured-Streaming-tp27801.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Mike Metzger
Hi Mich -

   Can you run a filter command on df1 prior to your map, keeping only the rows
where p(3).toString != "-", and then run your map command?
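
A rough sketch of that suggestion, reusing the case class and column positions
from the earlier snippet (treat it as illustrative only):

val df2 = df1
  .filter(p => p(3).toString != "-")   // drop rows whose Open column is the "-" placeholder
  .map(p => columns(p(0).toString, p(1).toString, p(2).toString,
    p(3).toString.toFloat, p(4).toString.toFloat,
    p(5).toString.toFloat, p(6).toString.toFloat, p(7).toString.toInt))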

Thanks

Mike

On Tue, Sep 27, 2016 at 5:06 PM, Mich Talebzadeh 
wrote:

> Thanks guys
>
> Actually these are the 7 rogue rows. The column 0 is the Volume column
> which means there was no trades on those days
>
>
> *cat stock.csv|grep ",0"*SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
> SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
> SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
> SAP SE,SAP, 23-Dec-10,-,-,-,38.36,0
> SAP SE,SAP, 30-Apr-08,-,-,-,32.39,0
> SAP SE,SAP, 29-Apr-08,-,-,-,33.05,0
> SAP SE,SAP, 28-Apr-08,-,-,-,32.60,0
>
> So one way would be to exclude the rows that there was no volume of trade
> that day when cleaning up the csv file
>
> *cat stock.csv|grep -v **",0"*
>
> and that works. Bearing in mind that putting 0s in place of "-" will skew
> the price plot.
>
> BTW I am using Spark csv as well
>
> val df1 = spark.read.option("header", true).csv(location)
>
> This is the class and the mapping
>
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
> val df2 = df1.map(p => columns(p(0).toString, p(1).toString,
> p(2).toString, p(3).toString.toFloat, p(4).toString.toFloat,
> p(5).toString.toFloat, p(6).toString.toFloat, p(7).toString.toInt))
>
>
> In here I have
>
> p(3).toString.toFloat
>
> How can one check for rogue data in p(3)?
>
>
> Thanks
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 September 2016 at 21:49, Mich Talebzadeh 
> wrote:
>
>>
>> I have historical prices for various stocks.
>>
>> Each csv file has 10 years trade one row per each day.
>>
>> These are the columns defined in the class
>>
>> case class columns(Stock: String, Ticker: String, TradeDate: String,
>> Open: Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>>
>> The issue is with Open, High, Low, Close columns that all are defined as
>> Float.
>>
>> Most rows are OK like below but the red one with "-" defined as Float
>> causes issues
>>
>>   Date Open High  Low   Close Volume
>> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
>> 23-Dec-11   - --40.56 0
>>
>> Because the prices are defined as Float, these rows cause the application
>> to crash
>> scala> val rs = df2.filter(changeToDate("TradeDate") >=
>> monthsago).select((changeToDate("TradeDate").as("TradeDate")
>> ),(('Close+'Open)/2).as("AverageDailyPrice"), 'Low.as("Day's Low"),
>> 'High.as("Day's High")).orderBy("TradeDate").collect
>> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0
>> (TID 260)
>> java.lang.NumberFormatException: For input string: "-"
>>
>>
>> One way is to define the prices as Strings but that is not
>> meaningful. Alternatively do the clean up before putting csv in HDFS but
>> that becomes tedious and error prone.
>>
>> Any ideas will be appreciated.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Question about executor memory setting

2016-09-27 Thread Dogtail L
Hi all,

May I ask a question about executor memory setting? I was running PageRank
with input size 2.8GB on one workstation for testing. I gave PageRank one
executor.

In case 1, I set --executor-cores to 4 and --executor-memory to 1GB; the
stage (stage 2) completion time is 14 min. The detailed stage info is below:

[stage info screenshot not included in the archive]

In case 2, I set --executor-cores to 4 and --executor-memory to 6GB; the
stage (stage 2) completion time is 34 min. The detailed stage info is below:

[stage info screenshot not included in the archive]

I am totally confused: why is the stage completion time more than two times
longer when executor-memory gets larger? From the web UI, I found that when
executor memory is 6GB, the shuffle spill (disk) per task is smaller, which
means fewer I/O operations, but strangely the task completion time is longer.
Could anyone give me some hints? Many thanks!


Re: unsubscribe

2016-09-27 Thread Daniel Lopes
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | http://www.daniellopes.com.br

www.onematch.com.br


On Mon, Sep 26, 2016 at 12:24 PM, Karthikeyan Vasuki Balasubramaniam <
kvasu...@eng.ucsd.edu> wrote:

> unsubscribe
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Hyukjin Kwon
Hi Mich,

I guess you could use the nullValue option by setting it to null.

If you are reading them in as strings in the first place, then you would
hit https://github.com/apache/spark/pull/14118 first, which is resolved
in 2.0.1.

Unfortunately, this bug also exists in the external CSV library for strings, if
I recall correctly.

However, it'd be fine if you set the schema explicitly when you load, as
this bug does not exist for floats at least.
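
For illustration, a sketch of loading with an explicit schema plus a nullValue
option. Mapping the "-" placeholder to null is my assumption about the intent
here, and the column names simply follow the earlier case class:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("Stock", StringType),
  StructField("Ticker", StringType),
  StructField("TradeDate", StringType),
  StructField("Open", FloatType),
  StructField("High", FloatType),
  StructField("Low", FloatType),
  StructField("Close", FloatType),
  StructField("Volume", IntegerType)))

val df1 = spark.read
  .option("header", "true")
  .option("nullValue", "-")   // treat "-" cells as null instead of failing the Float parse
  .schema(schema)
  .csv(location)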

I hope this is helpful.

Thanks!

On 28 Sep 2016 7:06 a.m., "Mich Talebzadeh" 
wrote:

> Thanks guys
>
> Actually these are the 7 rogue rows. The column 0 is the Volume column
> which means there was no trades on those days
>
>
> *cat stock.csv|grep ",0"*SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
> SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
> SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
> SAP SE,SAP, 23-Dec-10,-,-,-,38.36,0
> SAP SE,SAP, 30-Apr-08,-,-,-,32.39,0
> SAP SE,SAP, 29-Apr-08,-,-,-,33.05,0
> SAP SE,SAP, 28-Apr-08,-,-,-,32.60,0
>
> So one way would be to exclude the rows that there was no volume of trade
> that day when cleaning up the csv file
>
> *cat stock.csv|grep -v **",0"*
>
> and that works. Bearing in mind that putting 0s in place of "-" will skew
> the price plot.
>
> BTW I am using Spark csv as well
>
> val df1 = spark.read.option("header", true).csv(location)
>
> This is the class and the mapping
>
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
> val df2 = df1.map(p => columns(p(0).toString, p(1).toString,
> p(2).toString, p(3).toString.toFloat, p(4).toString.toFloat,
> p(5).toString.toFloat, p(6).toString.toFloat, p(7).toString.toInt))
>
>
> In here I have
>
> p(3).toString.toFloat
>
> How can one check for rogue data in p(3)?
>
>
> Thanks
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 September 2016 at 21:49, Mich Talebzadeh 
> wrote:
>
>>
>> I have historical prices for various stocks.
>>
>> Each csv file has 10 years trade one row per each day.
>>
>> These are the columns defined in the class
>>
>> case class columns(Stock: String, Ticker: String, TradeDate: String,
>> Open: Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>>
>> The issue is with Open, High, Low, Close columns that all are defined as
>> Float.
>>
>> Most rows are OK like below but the red one with "-" defined as Float
>> causes issues
>>
>>   Date Open High  Low   Close Volume
>> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
>> 23-Dec-11   - --40.56 0
>>
>> Because the prices are defined as Float, these rows cause the application
>> to crash
>> scala> val rs = df2.filter(changeToDate("TradeDate") >=
>> monthsago).select((changeToDate("TradeDate").as("TradeDate")
>> ),(('Close+'Open)/2).as("AverageDailyPrice"), 'Low.as("Day's Low"),
>> 'High.as("Day's High")).orderBy("TradeDate").collect
>> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0
>> (TID 260)
>> java.lang.NumberFormatException: For input string: "-"
>>
>>
>> One way is to define the prices as Strings but that is not
>> meaningful. Alternatively do the clean up before putting csv in HDFS but
>> that becomes tedious and error prone.
>>
>> Any ideas will be appreciated.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Mich Talebzadeh
Thanks guys

Actually these are the 7 rogue rows. The 0 in the last column is the Volume,
which means there were no trades on those days:


cat stock.csv | grep ",0"
SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
SAP SE,SAP, 23-Dec-10,-,-,-,38.36,0
SAP SE,SAP, 30-Apr-08,-,-,-,32.39,0
SAP SE,SAP, 29-Apr-08,-,-,-,33.05,0
SAP SE,SAP, 28-Apr-08,-,-,-,32.60,0

So one way would be to exclude the rows where there was no trading volume on
that day when cleaning up the csv file

cat stock.csv | grep -v ",0"

and that works, bearing in mind that putting 0s in place of "-" would skew
the price plot.

BTW I am using Spark csv as well

val df1 = spark.read.option("header", true).csv(location)

This is the class and the mapping


case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
Float, High: Float, Low: Float, Close: Float, Volume: Integer)
val df2 = df1.map(p => columns(p(0).toString, p(1).toString, p(2).toString,
p(3).toString.toFloat, p(4).toString.toFloat, p(5).toString.toFloat,
p(6).toString.toFloat, p(7).toString.toInt))


Here I have

p(3).toString.toFloat

How can one check for rogue data in p(3)?
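
For illustration only, a hedged sketch of one way to guard those conversions
with scala.util.Try; the helper name safeFloat and the choice of 0.0f as the
fallback value are made up for the example:

import scala.util.Try

// Returns the fallback when the cell is not a parsable float, e.g. the "-" placeholder.
def safeFloat(s: String, fallback: Float = 0.0f): Float =
  Try(s.trim.toFloat).getOrElse(fallback)

val df2 = df1.map(p => columns(p(0).toString, p(1).toString, p(2).toString,
  safeFloat(p(3).toString), safeFloat(p(4).toString),
  safeFloat(p(5).toString), safeFloat(p(6).toString),
  Try(p(7).toString.toInt).getOrElse(0)))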


Thanks





Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 September 2016 at 21:49, Mich Talebzadeh 
wrote:

>
> I have historical prices for various stocks.
>
> Each csv file has 10 years trade one row per each day.
>
> These are the columns defined in the class
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>
> The issue is with Open, High, Low, Close columns that all are defined as
> Float.
>
> Most rows are OK like below but the red one with "-" defined as Float
> causes issues
>
>   Date Open High  Low   Close Volume
> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
> 23-Dec-11   - --40.56 0
>
> Because the prices are defined as Float, these rows cause the application
> to crash
> scala> val rs = df2.filter(changeToDate("TradeDate") >=
> monthsago).select((changeToDate("TradeDate").as("
> TradeDate")),(('Close+'Open)/2).as("AverageDailyPrice"), 'Low.as("Day's
> Low"), 'High.as("Day's High")).orderBy("TradeDate").collect
> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0 (TID
> 260)
> java.lang.NumberFormatException: For input string: "-"
>
>
> One way is to define the prices as Strings but that is not
> meaningful. Alternatively do the clean up before putting csv in HDFS but
> that becomes tedious and error prone.
>
> Any ideas will be appreciated.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread ayan guha
You can read them as strings, write a map to fix the rows, and then convert
back to your desired DataFrame.
On 28 Sep 2016 06:49, "Mich Talebzadeh"  wrote:

>
> I have historical prices for various stocks.
>
> Each csv file has 10 years trade one row per each day.
>
> These are the columns defined in the class
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>
> The issue is with Open, High, Low, Close columns that all are defined as
> Float.
>
> Most rows are OK like below but the red one with "-" defined as Float
> causes issues
>
>   Date Open High  Low   Close Volume
> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
> 23-Dec-11   - --40.56 0
>
> Because the prices are defined as Float, these rows cause the application
> to crash
> scala> val rs = df2.filter(changeToDate("TradeDate") >=
> monthsago).select((changeToDate("TradeDate").as("
> TradeDate")),(('Close+'Open)/2).as("AverageDailyPrice"), 'Low.as("Day's
> Low"), 'High.as("Day's High")).orderBy("TradeDate").collect
> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0 (TID
> 260)
> java.lang.NumberFormatException: For input string: "-"
>
>
> One way is to define the prices as Strings but that is not
> meaningful. Alternatively do the clean up before putting csv in HDFS but
> that becomes tedious and error prone.
>
> Any ideas will be appreciated.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Adrian Bridgett
We use spark-csv (a successor of which is built into Spark 2.0) for
this.  It doesn't cause crashes; failed parsing is logged.  We run on
Mesos, so I have to pull back all the logs from all the executors and
search for failed lines (so that we can ensure that the failure rate
isn't too high).
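
For reference, the behaviour on malformed lines is controlled by the parse-mode
option in both spark-csv and the built-in Spark 2.0 CSV reader; a small hedged
example (the file name is a placeholder):

val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")   // drop unparsable rows; "PERMISSIVE" keeps them with nulls, "FAILFAST" aborts
  .csv("stock.csv")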


Hope this helps.

Adrian


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Parquet compression jars not found - both snappy and lzo - PySpark 2.0.0

2016-09-27 Thread Russell Jurney
In PySpark 2.0.0, despite adding snappy and lzo to my spark.jars path, I
get errors that say these classes can't be found when I save to a parquet
file. I tried switching from default snappy to lzo and added that jar and I
get the same error.

What am I to do?

I can't figure out any other steps to take. Note that this bug appeared
when I upgraded from Spark 1.6 to 2.0.0. Other than spark.jars, what steps
can I take to give Parquet access to its jars? Why isn't PySpark just
handling this, since Parquet is included in Spark? Is this a bug?

A gist of my code and the error is here:
https://gist.github.com/rjurney/6783d19397cf3b4b88af3603d6e256bd
-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io


Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Mich Talebzadeh
I have historical prices for various stocks.

Each csv file has 10 years of trades, one row per day.

These are the columns defined in the class

case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
Float, High: Float, Low: Float, Close: Float, Volume: Integer)

The issue is with Open, High, Low, Close columns that all are defined as
Float.

Most rows are OK, like the first one below, but the second one, with "-" in
columns defined as Float, causes issues:

Date       Open   High   Low    Close  Volume
27-Sep-16  80.91  80.93  79.87  80.85  1873158
23-Dec-11  -      -      -      40.56  0

Because the prices are defined as Float, these rows cause the application
to crash
scala> val rs = df2.filter(changeToDate("TradeDate") >=
monthsago).select((changeToDate("TradeDate").as("TradeDate")),(('Close+'Open)/2).as("AverageDailyPrice"),
'Low.as("Day's Low"), 'High.as("Day's High")).orderBy("TradeDate").collect
16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0 (TID
260)
java.lang.NumberFormatException: For input string: "-"


One way is to define the prices as Strings, but that is not meaningful.
Alternatively, the clean-up could be done before putting the csv in HDFS, but
that becomes tedious and error prone.

Any ideas will be appreciated.


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Question about single/multi-pass execution in Spark-2.0 dataset/dataframe

2016-09-27 Thread Spark User
case class Record(keyAttr: String, attr1: String, attr2: String, attr3:
String)

val ds = sparkSession.createDataset(rdd).as[Record]

val attr1Counts = ds.groupBy("keyAttr", "attr1").count()

val attr2Counts = ds.groupBy("keyAttr", "attr2").count()

val attr3Counts = ds.groupBy("keyAttr", "attr3").count()

//similar counts for 20 attributes

//code to merge attr1Counts and attr2Counts and attr3Counts
//translate it to desired output format and save the result.

Some more details:
1) The application is a spark streaming application with batch interval in
the order of 5 - 10 mins
2) Data set is large in the order of millions of records per batch
3) I'm using spark 2.0

The above implementation doesn't seem to be efficient at all if the dataset
goes through all the rows once for every count aggregation when computing
attr1Counts, attr2Counts and attr3Counts. I'm concerned about the
performance.

Questions:
1) Does the catalyst optimization handle such queries and does a single
pass on the dataset under the hood?
2) Is there a better way to do such aggregations, maybe using UDAFs? Or is
it better to do RDD.reduceByKey for this use case?
RDD.reduceByKey performs well for this data and a batch interval of 5 - 10
mins. Not sure if the Dataset implementation as explained above will be
equivalent or better.
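
For question 2, one hedged alternative (names are illustrative only, and
whether it beats separate groupBys should be measured rather than assumed) is
to "melt" each record into (attrName, keyAttr, attrValue) rows so that all the
attribute counts come out of a single shuffle:

import sparkSession.implicits._

val melted = ds.flatMap(r => Seq(
    ("attr1", r.keyAttr, r.attr1),
    ("attr2", r.keyAttr, r.attr2),
    ("attr3", r.keyAttr, r.attr3)))
  .toDF("attrName", "keyAttr", "attrValue")

// One aggregation over the melted rows replaces the separate groupBy/count per attribute.
val allCounts = melted.groupBy("attrName", "keyAttr", "attrValue").count()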

Thanks,
Bharath


ORC file stripe statistics in Spark

2016-09-27 Thread Sudhir Babu Pothineni
I am trying to get the number of rows in each stripe of an ORC file.

hiveContext.orcFile doesn't seem to exist anymore? I am using Spark 1.6.0.

scala> val hiveSqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveSqlContext: org.apache.spark.sql.hive.HiveContext =
org.apache.spark.sql.hive.HiveContext@83223af

scala> hiveSqlContext.
analyze              applySchema          asInstanceOf         baseRelationToDataFrame
cacheTable           clearCache           createDataFrame      createDataset
createExternalTable  dropTempTable        emptyDataFrame       experimental
getAllConfs          getConf              implicits            isCached
isInstanceOf         isRootContext        jdbc                 jsonFile
jsonRDD              listenerManager      load                 newSession
parquetFile          range                read                 refreshTable
setConf              sparkContext         sql                  table
tableNames           tables               toString             udf
uncacheTable
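
For what it's worth, per-stripe row counts can be read outside of the Spark API
with the Hive ORC reader classes directly. A hedged sketch follows; the path is
a placeholder, hive-exec is assumed to be on the classpath, and the
"hive --orcfiledump" CLI prints the same stripe statistics:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.io.orc.OrcFile
import scala.collection.JavaConverters._

val reader = OrcFile.createReader(
  new Path("/path/to/file.orc"), OrcFile.readerOptions(new Configuration()))

reader.getStripes.asScala.zipWithIndex.foreach { case (stripe, i) =>
  println(s"stripe $i: ${stripe.getNumberOfRows} rows")
}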


Thanks
Sudhir


Incremental model update

2016-09-27 Thread debasishg
Hello -

I have a question on how to handle incremental model updates in Spark ML ..

We have a time series where we predict the future conditioned on the past.
We can train a model offline based on historical data and then use that
model during prediction. 

But say, if the underlying process is non-stationary, the probability
distribution changes with time. In such cases we need to update the model so
as to reflect the current change in distribution.

We have 2 options -

a. retrain periodically
b. update the model incrementally

a. is expensive. 

What about b.? I think in Spark we have StreamingKMeans, which takes care of
this incremental model update for that algorithm. Is this true?
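
For reference, a minimal sketch of the incremental-update pattern with
StreamingKMeans from spark.mllib (k, the decay factor, the vector dimension and
the input DStream are all placeholders; trainingStream is assumed to be a
DStream[Vector] built elsewhere):

import org.apache.spark.mllib.clustering.StreamingKMeans

val model = new StreamingKMeans()
  .setK(5)                      // number of clusters
  .setDecayFactor(0.9)          // how quickly older batches are forgotten
  .setRandomCenters(10, 0.0)    // feature dimension and initial weight

model.trainOn(trainingStream)             // cluster centers are updated batch by batch
model.predictOn(trainingStream).print()   // predictions use the latest centers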

But what about other classifiers that don't have these Streaming
counterparts ? How do we handle such incremental model changes with those
classifiers where the underlying distribution changes ?

regards.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Incremental-model-update-tp27800.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: read multiple files

2016-09-27 Thread Mich Talebzadeh
Hi Divya,

There are a number of ways you can do this

Get today's date in epoch format. These are my package imports

import java.util.Calendar
import org.joda.time._
import java.math.BigDecimal
import java.sql.{Timestamp, Date}
import org.joda.time.format.DateTimeFormat

// Get epoch time now

scala> val epoch = System.currentTimeMillis
epoch: Long = 1474996552292

//get thirty days ago in epoch time

scala> val thirtydaysago = epoch - (30 * 24 * 60 * 60 * 1000L)
thirtydaysago: Long = 1472404552292

// *note that L for Long at the end*

// Define a function to convert date to str to double check if indeed it is
30 days ago

scala> def timeToStr(epochMillis: Long): String = {
 | DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss").print(epochMillis)}
timeToStr: (epochMillis: Long)String


scala> timeToStr(epoch)
res4: String = 2016-09-27 18:15:52

So you need to pick files >= file_thirtydaysago UP to  file_epoch

Regardless, I think you can do better with partitioning of directories. With
a file created every 5 minutes you will have 288 files generated daily
(12*24). Just partition into a sub-directory per day. Flume can do that for you,
or you can do it in a shell script.
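
A rough sketch of selecting only the recent batch files before reading them,
assuming the digits after the last "-" in each file name are an epoch-seconds
timestamp (the directory name and the 30-day cutoff are illustrative only):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val cutoffSeconds = (System.currentTimeMillis / 1000L) - 30L * 24 * 60 * 60

val recent = fs.listStatus(new Path("InputFolder"))
  .map(_.getPath.toString)
  .filter { p =>
    val suffix = p.substring(p.lastIndexOf('-') + 1)
    suffix.nonEmpty && suffix.forall(_.isDigit) && suffix.toLong >= cutoffSeconds
  }

// sc.textFile accepts a comma-separated list of paths.
val rdd = sc.textFile(recent.mkString(","))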

HTH











Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 September 2016 at 15:53, Peter Figliozzi 
wrote:

> If you're up for a fancy but excellent solution:
>
>- Store your data in Cassandra.
>- Use the expiring data feature (TTL)
> so
>data will automatically be removed a month later.
>- Now in your Spark process, just read from the database and you don't
>have to worry about the timestamp.
>- You'll still have all your old files if you need to refer back them.
>
> Pete
>
> On Tue, Sep 27, 2016 at 2:52 AM, Divya Gehlot 
> wrote:
>
>> Hi,
>> The input data files for my spark job generated at every five minutes
>> file name follows epoch time convention  as below :
>>
>> InputFolder/batch-147495960
>> InputFolder/batch-147495990
>> InputFolder/batch-147496020
>> InputFolder/batch-147496050
>> InputFolder/batch-147496080
>> InputFolder/batch-147496110
>> InputFolder/batch-147496140
>> InputFolder/batch-147496170
>> InputFolder/batch-147496200
>> InputFolder/batch-147496230
>>
>> As per requirement I need to read one month of data from current
>> timestamp.
>>
>> Would really appreciate if anybody could help me .
>>
>> Thanks,
>> Divya
>>
>
>


Re: Pyspark not working on yarn-cluster mode

2016-09-27 Thread ofer
I advise you to use Livy for this purpose.
Livy works well with YARN and it will decouple Spark from your web app.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-not-working-on-yarn-cluster-mode-tp23755p27799.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Problems with new experimental Kafka Consumer for 0.10

2016-09-27 Thread Cody Koeninger
What's the actual stacktrace / exception you're getting related to
commit failure?

On Tue, Sep 27, 2016 at 9:37 AM, Matthias Niehoff
 wrote:
> Hi everybody,
>
> i am using the new Kafka Receiver for Spark Streaming for my Job. When
> running with old consumer it runs fine.
>
> The Job consumes 3 Topics, saves the data to Cassandra, cogroups the topic,
> calls mapWithState and stores the results in cassandra. After that I
> manually commit the Kafka offsets using the commitAsync method of the
> KafkaDStream.
>
> With the new consumer I experience the following problem:
>
> After a certain amount of time (about 4-5 minutes, might be more or less)
> there are exceptions that the offset commit failed. The processing takes
> less than the batch interval. I also adjusted the session.timeout and
> request.timeout as well as the max.poll.records setting which did not help.
>
> After the first offset commit failed the time it takes from kafka until the
> microbatch is started increases, the processing time is constantly below the
> batch interval. Moreover further offset commits also fail and as result the
> delay time builds up.
>
> Has anybody made this experience as well?
>
> Thank you
>
> Relevant Kafka Parameters:
>
> "session.timeout.ms" -> s"${1 * 60 * 1000}",
> "request.timeout.ms" -> s"${2 * 60 * 1000}",
> "auto.offset.reset" -> "largest",
> "enable.auto.commit" -> "false",
> "max.poll.records" -> "1000"
>
>
>
> --
> Matthias Niehoff | IT-Consultant | Agile Software Factory  | Consulting
> codecentric AG | Zeppelinstr 2 | 76185 Karlsruhe | Deutschland
> tel: +49 (0) 721.9595-681 | fax: +49 (0) 721.9595-666 | mobil: +49 (0)
> 172.1702676
> www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
> www.more4fi.de
>
> Sitz der Gesellschaft: Solingen | HRB 25917| Amtsgericht Wuppertal
> Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
> Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz
>
> Diese E-Mail einschließlich evtl. beigefügter Dateien enthält vertrauliche
> und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige
> Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie
> bitte sofort den Absender und löschen Sie diese E-Mail und evtl. beigefügter
> Dateien umgehend. Das unerlaubte Kopieren, Nutzen oder Öffnen evtl.
> beigefügter Dateien sowie die unbefugte Weitergabe dieser E-Mail ist nicht
> gestattet

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Newbie Q: Issue related to connecting Spark Master Standalone through Scala app

2016-09-27 Thread Ding Fei
Have you checked the version of the Spark library referenced in the IntelliJ
project and compared it with the binary distribution version?
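
For example, a hedged build.sbt snippet pinning the dependency to the same
version and Scala build as the downloaded distribution (2.0.0 built for Scala
2.11 in this case):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0",
  "org.apache.spark" %% "spark-sql"  % "2.0.0")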

On Tue, 2016-09-27 at 09:13 -0700, Reth RM wrote:
> Hi Ayan,
> 
> 
> Thank you for the response. I tried to connect to same "stand alone
> spark master" through spark-shell and it works as intended
> 
> 
> On shell, tried ./spark-shell --master spark://host:7077
> Connection was established and It wrote an info on console as 'Spark
> context available as 'sc' (master = spark://host:7077)
> 
> 
> But its issue when trying to connect to the same stand alone spark
> master through intellij scala code.
> val conf = new SparkConf()
> .setAppName("scala spark")
> .setMaster("spark://host:7077")
> 
> 
> when setMaster is local, it just works, but when trying to connect
> spark stand alone (passing spark master url), it reports the above
> error. 
> 
> 
> 
> 
> 
> 
> 
> On Mon, Sep 26, 2016 at 11:23 PM, ayan guha 
> wrote:
> can you run spark-shell and try what you are trying? It is
> probably intellij issue
> 
> On Tue, Sep 27, 2016 at 3:59 PM, Reth RM
>  wrote:
> Hi,
> 
> 
>  I have issue connecting spark master, receiving a
> RuntimeException: java.io.InvalidClassException:
> org.apache.spark.rpc.netty.RequestMessage.
> 
> 
> Followed the steps mentioned below. Can you please
> point me to where am I doing wrong?
> 
> 
> 1. Downloaded spark
> (version spark-2.0.0-bin-hadoop2.7)
> 2. Have scala installed (version  2.11.8)
> 3. navigated to /spark-2.0.0-bin-hadoop2.7/sbin
> 4../start-master.sh
> 5../start-slave.sh spark://http://host:7077/
> 6. Intellij has simple 2 lines code for scala as it
> is here 
> 
> 
> Error
> https://jpst.it/NOUE
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha
> 
> 
> 



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Newbie Q: Issue related to connecting Spark Master Standalone through Scala app

2016-09-27 Thread Reth RM
Hi Ayan,

Thank you for the response. I tried to connect to the same "stand alone spark
master" through spark-shell and it works as intended.

On the shell, I tried ./spark-shell --master spark://host:7077
The connection was established and it wrote an info message on the console:
'Spark context available as 'sc' (master = spark://host:7077)'

But it's an issue when trying to connect to the same standalone Spark master
through IntelliJ Scala code.

val conf = new SparkConf()
  .setAppName("scala spark")
  .setMaster("spark://host:7077")

When setMaster is "local", it just works, but when trying to connect to the
Spark standalone master (passing the Spark master URL), it reports the above
error.




On Mon, Sep 26, 2016 at 11:23 PM, ayan guha  wrote:

> can you run spark-shell and try what you are trying? It is probably
> intellij issue
>
> On Tue, Sep 27, 2016 at 3:59 PM, Reth RM  wrote:
>
>> Hi,
>>
>>  I have issue connecting spark master, receiving a RuntimeException:
>> java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage.
>>
>> Followed the steps mentioned below. Can you please point me to where am I
>> doing wrong?
>>
>> 1. Downloaded spark (version spark-2.0.0-bin-hadoop2.7)
>> 2. Have scala installed (version  2.11.8)
>> 3. navigated to /spark-2.0.0-bin-hadoop2.7/sbin
>> 4../start-master.sh
>> 5../start-slave.sh spark://http://host:7077/
>> 6. Intellij has simple 2 lines code for scala as it is here
>> 
>>
>> Error
>> https://jpst.it/NOUE
>>
>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


log4j custom properties for spark project

2016-09-27 Thread KhajaAsmath Mohammed
Hello Everyone,

I am using the log4j properties below and they are displaying the logs well, but
I still want to remove the Spark logs and have only application logs
printed to my console and file. Could you please let me know the changes
that are required for this.

log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss}
%p %c{1}: %m%n


log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
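
For illustration, one hedged way to quiet Spark's own loggers while keeping
your application's output is to add logger entries like the following (the
application package name com.mycompany.myapp is a placeholder):

# Quiet Spark and Hadoop internals
log4j.logger.org.apache.spark=WARN
log4j.logger.org.apache.hadoop=WARN
# Keep your own application's packages at INFO
log4j.logger.com.mycompany.myapp=INFO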


Thanks,

Asmath


Re: read multiple files

2016-09-27 Thread Peter Figliozzi
If you're up for a fancy but excellent solution:

   - Store your data in Cassandra.
   - Use the expiring data feature (TTL) so data will automatically be removed
   a month later.
   - Now in your Spark process, just read from the database (see the sketch
   below) and you don't have to worry about the timestamp.
   - You'll still have all your old files if you need to refer back to them.
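
A rough sketch of the read side, assuming the spark-cassandra-connector is on
the classpath (keyspace and table names are placeholders, and the TTL itself
would be set on the Cassandra table or at write time):

val events = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "analytics", "table" -> "events"))
  .load()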

Pete

On Tue, Sep 27, 2016 at 2:52 AM, Divya Gehlot 
wrote:

> Hi,
> The input data files for my spark job generated at every five minutes file
> name follows epoch time convention  as below :
>
> InputFolder/batch-147495960
> InputFolder/batch-147495990
> InputFolder/batch-147496020
> InputFolder/batch-147496050
> InputFolder/batch-147496080
> InputFolder/batch-147496110
> InputFolder/batch-147496140
> InputFolder/batch-147496170
> InputFolder/batch-147496200
> InputFolder/batch-147496230
>
> As per requirement I need to read one month of data from current timestamp.
>
> Would really appreciate if anybody could help me .
>
> Thanks,
> Divya
>


Access S3 buckets in multiple accounts

2016-09-27 Thread Daniel Siegmann
I am running Spark on Amazon EMR and writing data to an S3 bucket. However,
the data is read from an S3 bucket in a separate AWS account. Setting the
fs.s3a.access.key and fs.s3a.secret.key values is sufficient to get access
to the other account (using the s3a protocol), however I then won't have
access to the S3 bucket in the EMR cluster's AWS account.

Is there any way for Spark to access S3 buckets in multiple accounts? If
not, is there any best practice for how to work around this?

--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001


Problems with new experimental Kafka Consumer for 0.10

2016-09-27 Thread Matthias Niehoff
Hi everybody,

I am using the new Kafka consumer integration for Spark Streaming for my job.
When running with the old consumer it runs fine.

The Job consumes 3 Topics, saves the data to Cassandra, cogroups the topic,
calls mapWithState and stores the results in cassandra. After that I
manually commit the Kafka offsets using the commitAsync method of the
KafkaDStream.
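
For context, the manual-commit pattern with the 0-10 integration typically
looks something like this (a sketch only; stream creation and the actual
processing are omitted, and stream is assumed to be the DStream returned by
KafkaUtils.createDirectStream):

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch (the job's Cassandra writes etc. go here) ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}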

With the new consumer I experience the following problem:

After a certain amount of time (about 4-5 minutes, might be more or less)
there are exceptions saying that the offset commit failed. The processing takes
less than the batch interval. I also adjusted the session.timeout and
request.timeout as well as the max.poll.records setting, which did not help.

After the first offset commit fails, the time it takes from Kafka until the
microbatch is started increases, while the processing time stays constantly below
the batch interval. Moreover, further offset commits also fail, and as a result
the delay builds up.

Has anybody made this experience as well?

Thank you

Relevant Kafka Parameters:

"session.timeout.ms" -> s"${1 * 60 * 1000}",
"request.timeout.ms" -> s"${2 * 60 * 1000}",
"auto.offset.reset" -> "largest",
"enable.auto.commit" -> "false",
"max.poll.records" -> "1000"



-- 
Matthias Niehoff | IT-Consultant | Agile Software Factory  | Consulting
codecentric AG | Zeppelinstr 2 | 76185 Karlsruhe | Deutschland
tel: +49 (0) 721.9595-681 | fax: +49 (0) 721.9595-666 | mobil: +49 (0)
172.1702676
www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
www.more4fi.de

Sitz der Gesellschaft: Solingen | HRB 25917| Amtsgericht Wuppertal
Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz

This e-mail, including any attached files, contains confidential and/or
legally protected information. If you are not the intended recipient or have
received this e-mail in error, please inform the sender immediately and delete
this e-mail and any attached files. The unauthorised copying, use or opening
of attached files, as well as the unauthorised forwarding of this e-mail, is
not permitted.


pyspark cluster mode on standalone deployment

2016-09-27 Thread Ofer Eliassaf
Is there any plan to support Python Spark running in "cluster mode" on a
standalone deployment?

There is this famous survey mentioning that more than 50% of the users are
using the standalone configuration.
Making PySpark work in cluster mode with standalone would help a lot with
high availability in Python Spark.

Currently only the YARN deployment supports it. Bringing in the huge YARN
installation just for this feature is not fun at all.

Does someone have a time estimate for this?



-- 
Regards,
Ofer Eliassaf


Incremental model update

2016-09-27 Thread Debasish Ghosh
Hello -

I have a question on how to handle incremental model updates in Spark ..

We have a time series where we predict the future conditioned on the past.
We can train a model offline based on historical data and then use that
model during prediction.

But say, if the underlying process is non-stationary, the probability
distribution changes with time. In such cases we need to update the model
so as to reflect the current change in distribution.

We have 2 options -

a. retrain periodically
b. update the model incrementally

a. is expensive.

What about b. ? I think in Spark we have StreamingKMeans that takes care of
this incremental model update for this classifier. Is this true ?

But what about other classifiers that don't have these Streaming
counterparts ? How do we handle such incremental model changes with those
classifiers where the underlying distribution changes ?

regards.

-- 
Debasish Ghosh
http://manning.com/ghosh2
http://manning.com/ghosh

Twttr: @debasishg
Blog: http://debasishg.blogspot.com
Code: http://github.com/debasishg


Re: why spark ml package doesn't contain svm algorithm

2016-09-27 Thread Nick Pentreath
There is a JIRA and PR for it -
https://issues.apache.org/jira/browse/SPARK-14709

On Tue, 27 Sep 2016 at 09:10 hxw黄祥为  wrote:

> I have found spark ml package have implement naivebayes algorithm and the
> source code is simple,.
>
> I am confusing why spark ml package doesn’t contain svm algorithm,it seems
> not very hard to do that.
>


Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread Chanh Le
The difference between streaming and micro-batch is about the ordering of messages:
> Spark Streaming guarantees ordered processing of RDDs in one DStream. Since 
> each RDD is processed in parallel, there is not order guaranteed within the 
> RDD. This is a tradeoff design Spark made. If you want to process the 
> messages in order within the RDD, you have to process them in one thread, 
> which does not have the benefit of parallelism.

More about that 
http://samza.apache.org/learn/documentation/0.10/comparisons/spark-streaming.html
 






> On Sep 27, 2016, at 2:12 PM, kant kodali  wrote:
> 
> What is the difference between mini-batch vs real time streaming in practice 
> (not theory)? In theory, I understand mini batch is something that batches in 
> the given time frame whereas real time streaming is more like do something as 
> the data arrives but my biggest question is why not have mini batch with 
> epsilon time frame (say one millisecond) or I would like to understand reason 
> why one would be an effective solution than other?
> I recently came across one example where mini-batch (Apache Spark) is used 
> for Fraud detection and real time streaming (Apache Flink) used for Fraud 
> Prevention. Someone also commented saying mini-batches would not be an 
> effective solution for fraud prevention (since the goal is to prevent the 
> transaction from occurring as it happened) Now I wonder why this wouldn't be 
> so effective with mini batch (Spark) ? Why is it not effective to run 
> mini-batch with 1 millisecond latency? Batching is a technique used 
> everywhere including the OS and the Kernel TCP/IP stack where the data to the 
> disk or network are indeed buffered so what is the convincing factor here to 
> say one is more effective than other?
> Thanks,
> kant
> 



Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread Alonso Isidoro Roman
mini batch or near real time: processing frames within 500 ms or more

real time: processing frames in 5 ms - 10 ms.

The main difference is processing velocity, I think.

Apache Spark Streaming is mini-batch, not true real time.

Alonso Isidoro Roman
about.me/alonso.isidoro.roman


2016-09-27 11:15 GMT+02:00 kant kodali :

> I understand the difference between fraud detection and fraud prevention
> in general but I am not interested in the semantic war on what these terms
> precisely mean. I am more interested in understanding the difference
> between mini-batch vs real time streaming from CS perspective.
>
>
>
> On Tue, Sep 27, 2016 12:54 AM, Mich Talebzadeh mich.talebza...@gmail.com
> wrote:
>
>> Replace mini-batch with micro-batching and do a search again. what is
>> your understanding of fraud detection?
>>
>> Spark streaming can be used for risk calculation and fraud detection
>> (including stopping fraud going through for example credit card
>> fraud) effectively "in practice". it can even be used for Complex Event
>> Processing.
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 27 September 2016 at 08:12, kant kodali  wrote:
>>
>> What is the difference between mini-batch vs real time streaming in
>> practice (not theory)? In theory, I understand mini batch is something that
>> batches in the given time frame whereas real time streaming is more like do
>> something as the data arrives but my biggest question is why not have mini
>> batch with epsilon time frame (say one millisecond) or I would like to
>> understand reason why one would be an effective solution than other?
>> I recently came across one example where mini-batch (Apache Spark) is
>> used for Fraud detection and real time streaming (Apache Flink) used for
>> Fraud Prevention. Someone also commented saying mini-batches would not be
>> an effective solution for fraud prevention (since the goal is to prevent
>> the transaction from occurring as it happened) Now I wonder why this
>> wouldn't be so effective with mini batch (Spark) ? Why is it not effective
>> to run mini-batch with 1 millisecond latency? Batching is a technique used
>> everywhere including the OS and the Kernel TCP/IP stack where the data to
>> the disk or network are indeed buffered so what is the convincing factor
>> here to say one is more effective than other?
>> Thanks,
>> kant
>>
>>
>>


Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread kant kodali
I understand the difference between fraud detection and fraud prevention in
general, but I am not interested in a semantic war over what these terms
precisely mean. I am more interested in understanding the difference
between mini-batch and real-time streaming from a CS perspective.





On Tue, Sep 27, 2016 12:54 AM, Mich Talebzadeh mich.talebza...@gmail.com
wrote:
Replace mini-batch with micro-batching and do a search again. what is your
understanding of fraud detection?
Spark streaming can be used for risk calculation and fraud detection (including
stopping fraud going through for example credit card fraud) effectively "in
practice". it can even be used for Complex Event Processing.

HTH
Dr Mich Talebzadeh



LinkedIn 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Disclaimer: Use it at your own risk.  Any and all responsibility for any loss,
damage or destruction
of data or any other property which may arise from relying on this
email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction. 




On 27 September 2016 at 08:12, kant kodali   wrote:
What is the difference between mini-batch vs real time streaming in practice
(not theory)? In theory, I understand mini batch is something that batches in
the given time frame whereas real time streaming is more like do something as
the data arrives but my biggest question is why not have mini batch with epsilon
time frame (say one millisecond) or I would like to understand reason why one
would be an effective solution than other?I recently came across one example
where mini-batch (Apache Spark) is used for Fraud detection and real time
streaming (Apache Flink) used for Fraud Prevention. Someone also commented
saying mini-batches would not be an effective solution for fraud prevention
(since the goal is to prevent the transaction from occurring as it happened) Now
I wonder why this wouldn't be so effective with mini batch (Spark) ? Why is it
not effective to run mini-batch with 1 millisecond latency? Batching is a
technique used everywhere including the OS and the Kernel TCP/IP stack where the
data to the disk or network are indeed buffered so what is the convincing factor
here to say one is more effective than other?Thanks,kant

Re: Large-scale matrix inverse in Spark

2016-09-27 Thread Anastasios Zouzias
Hi there,

As Edward noted, if you ask a numerical analyst about matrix inversion,
they will respond "you never invert a matrix, but you solve the linear
system associated with the matrix". Linear system solving is usually done
with iterative methods or matrix decompositions (as noted above). The
reason why people avoid matrix inversion is its inherent poor numerical
stability.
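
As a tiny local illustration of "solve, don't invert" using Breeze (which ships
with Spark MLlib); the matrix and right-hand side here are made up:

import breeze.linalg.{DenseMatrix, DenseVector}

val A = DenseMatrix((4.0, 1.0), (1.0, 3.0))
val b = DenseVector(1.0, 2.0)

// The \ operator solves A * x = b via a factorization, without ever forming A's inverse.
val x = A \ b
println(x)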

Best,
Anastasios

On Tue, Sep 27, 2016 at 8:42 AM, Edward Fine  wrote:

> I have not found matrix inversion algorithms in Spark and I would be
> surprised to see them.  Except for matrices with very special structure
> (like those nearly the identity), inverting and n*n matrix is slower than
> O(n^2), which does not scale.  Whenever a matrix is inverted, usually a
> decomposition or a low rank approximation is used, just as Sean pointed
> out.  See further https://en.wikipedia.org/wiki/Computational_
> complexity_of_mathematical_operations#Matrix_algebra
> or if you really want to dig into it
> Stoer and Bulirsch http://www.springer.com/us/book/9780387954523
>
> On Mon, Sep 26, 2016 at 11:00 PM Sean Owen  wrote:
>
>> I don't recall any code in Spark that computes a matrix inverse. There is
>> code that solves linear systems Ax = b with a decomposition. For example
>> from looking at the code recently, I think the regression implementation
>> actually solves AtAx = Atb using a Cholesky decomposition. But, A = n x k,
>> where n is large but k is smallish (number of features), so AtA is k x k
>> and can be solved in-memory with a library.
>>
>> On Tue, Sep 27, 2016 at 3:05 AM, Cooper  wrote:
>> > How is the problem of large-scale matrix inversion approached in Apache
>> Spark
>> > ?
>> >
>> > This linear algebra operation is obviously the very base of a lot of
>> other
>> > algorithms (regression, classification, etc). However, I have not been
>> able
>> > to find a Spark API on parallel implementation of matrix inversion. Can
>> you
>> > please clarify approaching this operation on the Spark internals ?
>> >
>> > Here    is a
>> paper on
>> > the parallelized matrix inversion in Spark, however I am trying to use
>> an
>> > existing code instead of implementing one from scratch, if available.
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Large-scale-matrix-inverse-in-Spark-tp27796.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>>
>>


-- 
-- Anastasios Zouzias



Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread Mich Talebzadeh
Replace mini-batch with micro-batching and do a search again. What is your
understanding of fraud detection?

Spark Streaming can be used for risk calculation and fraud detection
(including stopping fraud going through, for example credit card
fraud) effectively "in practice". It can even be used for Complex Event
Processing.


HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 September 2016 at 08:12, kant kodali  wrote:

> What is the difference between mini-batch vs real time streaming in
> practice (not theory)? In theory, I understand mini batch is something that
> batches in the given time frame whereas real time streaming is more like do
> something as the data arrives but my biggest question is why not have mini
> batch with epsilon time frame (say one millisecond) or I would like to
> understand reason why one would be an effective solution than other?
> I recently came across one example where mini-batch (Apache Spark) is used
> for Fraud detection and real time streaming (Apache Flink) used for Fraud
> Prevention. Someone also commented saying mini-batches would not be an
> effective solution for fraud prevention (since the goal is to prevent the
> transaction from occurring as it happened) Now I wonder why this wouldn't
> be so effective with mini batch (Spark) ? Why is it not effective to run
> mini-batch with 1 millisecond latency? Batching is a technique used
> everywhere including the OS and the Kernel TCP/IP stack where the data to
> the disk or network are indeed buffered so what is the convincing factor
> here to say one is more effective than other?
> Thanks,
> kant
>


DataFrame Rejection Directory

2016-09-27 Thread Mostafa Alaa Mohamed
Hi All,
I have a DataFrame that contains some data and I need to insert it into a Hive
table. My questions:

1- Where will Spark save the rejected rows from the insertion statements?
2- Can Spark fail if some rows are rejected?
3- How can I specify the rejection directory?

Regards,


The content of this email together with any attachments, statements and 
opinions expressed herein contains information that is private and confidential 
are intended for the named addressee(s) only. If you are not the addressee of 
this email you may not copy, forward, disclose or otherwise use it or any part 
of it in any form whatsoever. If you have received this message in error please 
notify postmas...@etisalat.ae by email immediately and delete the message 
without making any copies.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



read multiple files

2016-09-27 Thread Divya Gehlot
Hi,
The input data files for my Spark job are generated every five minutes; the file
names follow the epoch time convention, as below:

InputFolder/batch-147495960
InputFolder/batch-147495990
InputFolder/batch-147496020
InputFolder/batch-147496050
InputFolder/batch-147496080
InputFolder/batch-147496110
InputFolder/batch-147496140
InputFolder/batch-147496170
InputFolder/batch-147496200
InputFolder/batch-147496230

As per requirement I need to read one month of data from current timestamp.

Would really appreciate if anybody could help me .

Thanks,
Divya


Re: How does chaining of Windowed Dstreams work?

2016-09-27 Thread Hemalatha A
Hello,

Can anyone please answer the below question and help me understand the
windowing operations.

On Sun, Sep 4, 2016 at 4:42 PM, Hemalatha A <
hemalatha.amru...@googlemail.com> wrote:

> Hello,
>
> I have a set of DStreams on which I'm performing some computation; each
> DStream is windowed on another, starting from the base stream, based on the
> order of window intervals. I want to find out the best stream on which I
> could window a particular stream.
>
> Suppose, I have a spark Dstream, with batch interval as 10sec and other
> streams are windowed on base steams as below:
>
> Stream    Window   Sliding   Windowed On
> StreamA   30       10        Base Stream
> StreamB   20       20        Base Stream
> StreamC   90       20        ?
>
>
>
> Now, should I base the StreamC on StreamA since its window is multiple of
> StreamA or base it on  StreamB since it has a higher and same sliding
> interval. Which would be a better choice?
>
>
> Or is it the same as window on Base stream? How does it basically work?
>
>
> --
>
>
> Regards
> Hemalatha
>



-- 


Regards
Hemalatha
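
For reference, the set-up in the table maps onto the DStream API roughly like
this (a sketch only, using the durations from the table; baseStream is the
10-second base DStream, and whether StreamC should be built from the base
stream, StreamA or StreamB is exactly the open question):

import org.apache.spark.streaming.Seconds

val streamA = baseStream.window(Seconds(30), Seconds(10))
val streamB = baseStream.window(Seconds(20), Seconds(20))
val streamC = baseStream.window(Seconds(90), Seconds(20))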


What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread kant kodali
What is the difference between mini-batch and real-time streaming in practice
(not theory)? In theory, I understand mini-batch is something that batches in
a given time frame, whereas real-time streaming is more like doing something as
the data arrives. But my biggest question is: why not have mini-batch with an
epsilon time frame (say one millisecond)? I would like to understand why one
would be a more effective solution than the other.

I recently came across one example where mini-batch (Apache Spark) is used for
fraud detection and real-time streaming (Apache Flink) is used for fraud
prevention. Someone also commented saying mini-batches would not be an
effective solution for fraud prevention (since the goal is to prevent the
transaction from occurring as it happened). Now I wonder why this wouldn't be
so effective with mini-batch (Spark)? Why is it not effective to run mini-batch
with 1 millisecond latency? Batching is a technique used everywhere, including
the OS and the kernel TCP/IP stack, where the data going to the disk or network
is indeed buffered. So what is the convincing factor here to say one is more
effective than the other?

Thanks,
kant

why spark ml package doesn't contain svm algorithm

2016-09-27 Thread hxw黄祥为
I have found that the spark.ml package implements the Naive Bayes algorithm and
the source code is simple.
I am confused about why the spark.ml package doesn't contain an SVM algorithm;
it seems not very hard to do that.


Re: Large-scale matrix inverse in Spark

2016-09-27 Thread Edward Fine
I have not found matrix inversion algorithms in Spark and I would be
surprised to see them.  Except for matrices with very special structure
(like those nearly the identity), inverting an n*n matrix is slower than
O(n^2), which does not scale.  Whenever a matrix is inverted, usually a
decomposition or a low-rank approximation is used, just as Sean pointed
out.  See further
https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra

or if you really want to dig into it
Stoer and Bulirsch http://www.springer.com/us/book/9780387954523

On Mon, Sep 26, 2016 at 11:00 PM Sean Owen  wrote:

> I don't recall any code in Spark that computes a matrix inverse. There is
> code that solves linear systems Ax = b with a decomposition. For example
> from looking at the code recently, I think the regression implementation
> actually solves AtAx = Atb using a Cholesky decomposition. But, A = n x k,
> where n is large but k is smallish (number of features), so AtA is k x k
> and can be solved in-memory with a library.
>
> On Tue, Sep 27, 2016 at 3:05 AM, Cooper  wrote:
> > How is the problem of large-scale matrix inversion approached in Apache
> Spark
> > ?
> >
> > This linear algebra operation is obviously the very base of a lot of
> other
> > algorithms (regression, classification, etc). However, I have not been
> able
> > to find a Spark API on parallel implementation of matrix inversion. Can
> you
> > please clarify approaching this operation on the Spark internals ?
> >
> > Here    is a
> paper on
> > the parallelized matrix inversion in Spark, however I am trying to use an
> > existing code instead of implementing one from scratch, if available.
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-matrix-inverse-in-Spark-tp27796.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
>


Re: Newbie Q: Issue related to connecting Spark Master Standalone through Scala app

2016-09-27 Thread ayan guha
Can you run spark-shell and try what you are trying? It is probably an
IntelliJ issue.

On Tue, Sep 27, 2016 at 3:59 PM, Reth RM  wrote:

> Hi,
>
>  I have issue connecting spark master, receiving a RuntimeException:
> java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage.
>
> Followed the steps mentioned below. Can you please point me to where am I
> doing wrong?
>
> 1. Downloaded spark (version spark-2.0.0-bin-hadoop2.7)
> 2. Have scala installed (version  2.11.8)
> 3. navigated to /spark-2.0.0-bin-hadoop2.7/sbin
> 4../start-master.sh
> 5../start-slave.sh spark://http://host:7077/
> 6. Intellij has simple 2 lines code for scala as it is here
> 
>
> Error
> https://jpst.it/NOUE
>
>
>
>


-- 
Best Regards,
Ayan Guha


Re: Large-scale matrix inverse in Spark

2016-09-27 Thread Sean Owen
I don't recall any code in Spark that computes a matrix inverse. There is
code that solves linear systems Ax = b with a decomposition. For example
from looking at the code recently, I think the regression implementation
actually solves AtAx = Atb using a Cholesky decomposition. But, A = n x k,
where n is large but k is smallish (number of features), so AtA is k x k
and can be solved in-memory with a library.

On Tue, Sep 27, 2016 at 3:05 AM, Cooper  wrote:
> How is the problem of large-scale matrix inversion approached in Apache
Spark
> ?
>
> This linear algebra operation is obviously the very base of a lot of other
> algorithms (regression, classification, etc). However, I have not been
able
> to find a Spark API on parallel implementation of matrix inversion. Can
you
> please clarify approaching this operation on the Spark internals ?
>
> Here    is a paper
on
> the parallelized matrix inversion in Spark, however I am trying to use an
> existing code instead of implementing one from scratch, if available.
>
>
>
> --
> View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-matrix-inverse-in-Spark-tp27796.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>