Re: How can we connect RDD from previous job to next job

2016-08-28 Thread Roger Marin
Hi Sachin, Have a look at the Spark Job Server project. It allows you to share RDDs and DataFrames between Spark jobs running in the same context; the catch is that you have to implement your Spark job as a Spark Job Server job.
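
A minimal sketch of what that looks like, assuming spark-jobserver's SparkJob and NamedRddSupport APIs (the object names and RDD contents here are illustrative, not from the thread):

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

    // First job: build an RDD and publish it under a name in the shared context.
    object ProducerJob extends SparkJob with NamedRddSupport {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
      override def runJob(sc: SparkContext, config: Config): Any = {
        val rdd = sc.parallelize(1 to 100).map(_ * 2)
        namedRdds.update("shared-rdd", rdd)  // cached; visible to later jobs in this context
      }
    }

    // Second job, submitted to the same context: fetch the RDD by name.
    object ConsumerJob extends SparkJob with NamedRddSupport {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
      override def runJob(sc: SparkContext, config: Config): Any = {
        val rdd = namedRdds.get[Int]("shared-rdd").getOrElse(sys.error("RDD not found"))
        rdd.sum()
      }
    }

Both jobs must run in the same long-lived context for the named RDD to survive between them.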

How can we connect RDD from previous job to next job

2016-08-28 Thread Sachin Mittal
Hi, I need some thoughts, inputs, or a starting point for achieving the following scenario. I submit a job using spark-submit with a certain set of parameters. It reads data from a source, does some processing on RDDs, generates some output, and completes. Then I submit the same job again with
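
If a shared context is not an option, one simple alternative (an assumption about the workflow, not something stated in the thread) is to persist the RDD to durable storage at the end of one run and reload it at the start of the next; the paths and transformation below are placeholders:

    // End of the first spark-submit run: save the computed RDD.
    val result = sc.textFile("hdfs:///input").map(_.toUpperCase)
    result.saveAsObjectFile("hdfs:///tmp/intermediate-rdd")

    // Start of the next run: reload and continue processing.
    val previous = sc.objectFile[String]("hdfs:///tmp/intermediate-rdd")
    previous.take(10).foreach(println)

The point is that saveAsObjectFile/objectFile round-trips an RDD across independent jobs.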

Re: Automating lengthy command to pyspark with configuration?

2016-08-28 Thread ayan guha
Best to create an alias and place it in your bashrc. On 29 Aug 2016 08:30, "Russell Jurney" wrote: > In order to use PySpark with MongoDB and ElasticSearch, I currently run > the rather long commands of: > > 1) pyspark --executor-memory 10g --jars
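
For example, a hypothetical alias in ~/.bashrc (the flags and jars are from the quoted command; the alias name is made up):

    # Wrap the long pyspark invocation in one short command.
    alias pyspark-mongo-es='pyspark --executor-memory 10g --jars ../lib/mongo-hadoop-spark-2.0.0-rc0.jar,../lib/mongo-java-driver-3.2.2.jar,../lib/mongo-hadoop-2.0.0-rc0.jar'

After sourcing ~/.bashrc, running pyspark-mongo-es launches the shell with the same options.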

Issue with Spark HBase connector streamBulkGet method

2016-08-28 Thread BiksN
I'm trying to use the streamBulkGet method in my Spark application. I want to save the result of this streamBulkGet to a JavaDStream object, but I'm unable to get the code to compile. I get the error: Required: org.apache...JavaDStream, Found: void. Here is the code: JavaDStream x =

Re: Best practises to storing data in Parquet files

2016-08-28 Thread Chanh Le
> Does a Parquet file have a size limit (1 TB)? I didn't see any problem, but 1 TB is too big to operate on; you need to divide it into smaller pieces. > Should we use SaveMode.Append for a long-running streaming app? Yes, but you need to partition it by time so it is easy to maintain, e.g. to update or delete a
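
A minimal sketch of that advice, with illustrative column names and path:

    import org.apache.spark.sql.SaveMode

    // Append each batch, partitioned by time so old partitions can be
    // compacted, updated, or dropped independently.
    df.write
      .mode(SaveMode.Append)
      .partitionBy("year", "month", "day")
      .parquet("hdfs:///data/events")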

Re: Best practises to storing data in Parquet files

2016-08-28 Thread Kevin Tran
Hi Mich, My stack is as follows:

Data sources:
* IBM MQ
* Oracle database

Kafka stores all messages from the data sources. Spark Streaming fetches messages from Kafka, does a bit of transformation, and writes Parquet files to HDFS. Hive / SparkSQL / Impala will query the Parquet files. Do you have any
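
A rough sketch of the write path in that stack, assuming Spark 1.6-era streaming and Kafka 0.8 APIs (the broker, topic, and path names are made up):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}

    val ssc = new StreamingContext(sc, Seconds(60))
    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mq-oracle-events"))

    stream.map(_._2).foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      // Light transform, then append Parquet for Hive/SparkSQL/Impala to query.
      rdd.toDF("value").write.mode(SaveMode.Append).parquet("hdfs:///warehouse/messages")
    }
    ssc.start()
    ssc.awaitTermination()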

Re: S3A + EMR failure when writing Parquet?

2016-08-28 Thread Everett Anderson
(Sorry, typo -- I was using spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 not 'hadooop', of course) On Sun, Aug 28, 2016 at 12:51 PM, Everett Anderson wrote: > Hi, > > I'm having some trouble figuring out a failure when using S3A when writing > a DataFrame as

Automating lengthy command to pyspark with configuration?

2016-08-28 Thread Russell Jurney
In order to use PySpark with MongoDB and ElasticSearch, I currently run the rather long commands of: 1) pyspark --executor-memory 10g --jars ../lib/mongo-hadoop-spark-2.0.0-rc0.jar,../lib/mongo-java-driver-3.2.2.jar,../lib/mongo-hadoop-2.0.0-rc0.jar --driver-class-path

S3A + EMR failure when writing Parquet?

2016-08-28 Thread Everett Anderson
Hi, I'm having some trouble figuring out a failure when using S3A when writing a DataFrame as Parquet on EMR 4.7.2 (which is Hadoop 2.7.2 and Spark 1.6.2). It works when using EMRFS (s3://), though. I'm using these extra conf params, though I've also tried without everything but the encryption
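
For reference, the kind of configuration under discussion; the exact params in the original message are truncated, so these values are illustrative, not the poster's:

    spark-submit \
      --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
      --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256 \
      my-job.jar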

Suggestions for calculating MAU/WAU/DAU

2016-08-28 Thread Tal Grynbaum
Hi, I'm struggling with the following issue. I need to build a cube with 6 dimensions for app usage, for example:

    -----+-----+----+----+----+----
    user | app | d3 | d4 | d5 | d6
    -----+-----+----+----+----+----
    u1   | a1  | x  | y  | z  | 5
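
One way to aggregate over such dimensions, sketched with the DataFrame cube operator under the column names above (the poster's actual schema is unknown):

    import org.apache.spark.sql.functions.countDistinct

    // Distinct users per combination of dimensions, including all subtotal
    // groupings; filter by a day/week/month window first for DAU/WAU/MAU.
    val usage = df.cube("app", "d3", "d4", "d5", "d6")
      .agg(countDistinct("user").as("active_users"))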

Re: Design patterns involving Spark

2016-08-28 Thread Sivakumaran S
Spark fits best for processing. But depending on the use case, you could expand the scope of Spark to moving data using the native connectors. The one thing that Spark is not is storage. Connectors are available for most storage options, though. Regards, Sivakumaran S > On 28-Aug-2016, at 6:04

Re: Best practises to storing data in Parquet files

2016-08-28 Thread Mich Talebzadeh
Hi, Can you explain your particular stack? For example, what is the source of the streaming data, and what role does Spark play? Are you dealing with real time and batch, and why Parquet and not something like HBase to ingest data in real time? HTH Dr Mich Talebzadeh LinkedIn *

Design patterns involving Spark

2016-08-28 Thread Ashok Kumar
Hi, There are design patterns that use Spark extensively. I am new to this area, so I would appreciate it if someone could explain where Spark fits in, especially within fast or streaming use cases. What are the best practices involving Spark? Is it always best to deploy it as the processing engine? For

Re: Spark StringType could hold how many characters ?

2016-08-28 Thread Sean Owen
No, it is just being truncated for display as the ... implies. Pass truncate=false to the show command. On Sun, Aug 28, 2016, 15:24 Kevin Tran wrote: > Hi, > I wrote to parquet file as following: > > ++ > |word| > ++ >
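
Concretely, with the Scala API:

    // show() truncates each cell to 20 characters by default;
    // passing truncate = false prints the full values.
    df.show(20, false)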

Best practises to storing data in Parquet files

2016-08-28 Thread Kevin Tran
Hi, Does anyone know the best practices for storing data in Parquet files? Do Parquet files have a size limit (1 TB)? Should we use SaveMode.Append for a long-running streaming app? How should we store the files in HDFS (directory structure, ...)? Thanks, Kevin.

Spark StringType could hold how many characters ?

2016-08-28 Thread Kevin Tran
Hi, I wrote to a Parquet file as follows:

    +--------------------------+
    |word                      |
    +--------------------------+
    |THIS IS MY CHARACTERS ...|
    |// ANOTHER LINE OF CHAC...|
    +--------------------------+

These lines are not the full text; it is being trimmed down. Does anyone know how many characters StringType

UDF/UDAF performance

2016-08-28 Thread AssafMendelson
I am trying to do high-performance calculations which require custom functions. As a first stage I am trying to profile the effect of using a UDF, and I am getting weird results. I created a simple test (in
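
The message is cut off, but a minimal version of that kind of comparison (assumed, not the poster's actual test) looks like:

    import org.apache.spark.sql.functions.{col, udf}

    // A trivial UDF vs. the equivalent built-in Column expression.
    val plusOneUdf = udf((x: Long) => x + 1)
    val withUdf     = df.withColumn("y", plusOneUdf(col("x")))  // opaque to the optimizer
    val withBuiltin = df.withColumn("y", col("x") + 1)          // optimizable built-in

    // Rough timing via a forced action.
    def time[T](label: String)(f: => T): T = {
      val start = System.nanoTime()
      val result = f
      println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
      result
    }
    time("udf")(withUdf.count())
    time("builtin")(withBuiltin.count())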

Equivalent of "predict" function from LogisticRegressionWithLBFGS in OneVsRest with LogisticRegression classifier (Spark 2.0)

2016-08-28 Thread yaroslav
Hi, We use this kind of logic for training our model:

    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(3)
      .run(train)

Next, during Spark Streaming, we load the model and apply incoming data to it to get a specific class, for example:
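
In the spark.ml API, the model's transform method plays the role of the old predict call, appending a prediction column. A minimal sketch, assuming train and test are DataFrames with the default features/label columns:

    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

    // Train: OneVsRest wraps the binary classifier to cover the 3 classes.
    val ovr = new OneVsRest().setClassifier(new LogisticRegression())
    val model = ovr.fit(train)

    // Predict: transform appends a "prediction" column per row.
    val predictions = model.transform(test).select("prediction")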

Re: Spark 2.0 - Join statement compile error

2016-08-28 Thread Mich Talebzadeh
Yes, I realised that. Actually I thought it was the s, not the $. It has been around in shell for years, say for actual values --> ${LOG_FILE}, for substitution 's/ etc: cat ${LOG_FILE} | egrep -v 'rows affected|return status|&&&' | sed -e 's/^[]*//g' -e 's/^//g' -e '/^$/d' > temp.out Dr

Re: Spark 2.0 - Join statement compile error

2016-08-28 Thread Jacek Laskowski
Hi Mich, This is Scala's string interpolation, which allows replacing $-prefixed expressions with their values. It's what the cool kids use in Scala to do templating and concatenation. Jacek On 23 Aug 2016 9:21 a.m., "Mich Talebzadeh" wrote: > What is --> s below
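
For example:

    val table = "sales"
    val minAmount = 100
    // s-interpolation substitutes $name and ${expression} with their values.
    val query = s"SELECT * FROM $table WHERE amount > ${minAmount * 10}"
    println(query)  // SELECT * FROM sales WHERE amount > 1000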