Hi Sachin,
Have a look at the Spark Job Server project. It allows you to share RDDs and
DataFrames between Spark jobs running in the same context; the catch is that
you have to implement your Spark job as a Spark Job Server job.
Hi,
I would need some thoughts, inputs, or any starting point to achieve the
following scenario.
I submit a job using spark-submit with a certain set of parameters.
It reads data from a source, does some processing on RDDs, generates
some output, and completes.
Then I submit the same job again with
Best to create an alias and place it in your .bashrc.
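A minimal sketch of the alias approach (the alias name is made up, and the jar paths are placeholders taken from the quoted command below; adjust to your installation):

```shell
# Hypothetical alias wrapping the long pyspark invocation.
# Jar paths are illustrative; point them at your actual lib directory.
alias pyspark_mongo_es='pyspark --executor-memory 10g --jars ../lib/mongo-hadoop-spark-2.0.0-rc0.jar,../lib/mongo-java-driver-3.2.2.jar'
```

After sourcing your .bashrc, `pyspark_mongo_es` launches the shell with the jars attached.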
On 29 Aug 2016 08:30, "Russell Jurney" wrote:
> In order to use PySpark with MongoDB and ElasticSearch, I currently run
> the rather long commands of:
>
> 1) pyspark --executor-memory 10g --jars
I'm trying to use the streamBulkGet method in my Spark application. I want to
save the result of this streamBulkGet to a JavaDStream object, but I can't
get the code to compile. I get the error:
Required: org.apache.spark.streaming.api.java.JavaDStream
Found: void
Here is the code:
JavaDStream x =
> Does a parquet file have a size limit (1TB)?
I didn't see any problem, but 1TB is too big to operate on; you need to divide
it into smaller pieces.
> Should we use SaveMode.APPEND for a long-running streaming app?
Yes, but you need to partition it by time so it is easy to maintain, such as
updating or deleting a
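The time-partitioning idea above can be sketched with a small helper that builds the Hive-style partition layout Spark produces (the base path and column names here are illustrative, not from the original thread):

```python
# Hypothetical helper: Hive-style time partitions, so an appending
# streaming job can update or delete one day's data independently.
from datetime import datetime

def partition_path(base, ts):
    # Spark writes the same layout via something like:
    #   df.write.mode("append").partitionBy("year", "month", "day").parquet(base)
    return f"{base}/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"

print(partition_path("/data/events", datetime(2016, 8, 28)))
# → /data/events/year=2016/month=08/day=28
```

Dropping or rewriting a single `day=...` directory then maps to deleting or updating that day's data.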
Hi Mich,
My stack is as following:
Data sources:
* IBM MQ
* Oracle database
Kafka to store all messages from the data sources.
Spark Streaming fetches messages from Kafka, does a bit of transformation, and
writes parquet files to HDFS.
Hive / SparkSQL / Impala will query the parquet files.
Do you have any
(Sorry, typo -- I was using
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
not 'hadooop', of course)
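For reference, the corrected setting as a spark-submit config fragment (a sketch; other flags elided):

```shell
# Config fragment: note the correct 'hadoop' spelling in the prefix.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  your_job.py
```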
On Sun, Aug 28, 2016 at 12:51 PM, Everett Anderson wrote:
> Hi,
>
> I'm having some trouble figuring out a failure when using S3A when writing
> a DataFrame as
In order to use PySpark with MongoDB and ElasticSearch, I currently run the
rather long commands of:
1) pyspark --executor-memory 10g --jars
../lib/mongo-hadoop-spark-2.0.0-rc0.jar,../lib/mongo-java-driver-3.2.2.jar,../lib/mongo-hadoop-2.0.0-rc0.jar
--driver-class-path
Hi,
I'm having some trouble figuring out a failure when using S3A when writing
a DataFrame as Parquet on EMR 4.7.2 (which is Hadoop 2.7.2 and Spark
1.6.2). It works when using EMRFS (s3://), though.
I'm using these extra conf params, though I've also tried without
everything but the encryption
Hi
I'm struggling with the following issue.
I need to build a cube with 6 dimensions for app usage
for example:
-----+-----+----+----+----+----
user | app | d3 | d4 | d5 | d6
-----+-----+----+----+----+----
 u1  | a1  | x  | y  | z  | 5
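Spark's DataFrame.cube handles this natively, but the underlying idea can be sketched in plain Python: aggregate the measure (d6) over every subset of the dimensions. The rows below are illustrative, not from the original data:

```python
# Pure-Python sketch of the CUBE idea: one subtotal per subset of dims.
from itertools import combinations
from collections import defaultdict

rows = [
    {"user": "u1", "app": "a1", "d3": "x", "d4": "y", "d5": "z", "d6": 5},
    {"user": "u1", "app": "a2", "d3": "x", "d4": "y", "d5": "z", "d6": 3},
]
dims = ["user", "app", "d3", "d4", "d5"]

cube = defaultdict(int)
for r in rows:
    # Every subset of dimensions, from the grand total () up to all five.
    for k in range(len(dims) + 1):
        for subset in combinations(dims, k):
            key = tuple((d, r[d]) for d in subset)
            cube[key] += r["d6"]

print(cube[(("user", "u1"),)])  # subtotal for user u1 → 8
print(cube[()])                 # grand total → 8
```

With 5 grouping dimensions this materializes 2^5 grouping sets per row, which is exactly why pushing the cube into Spark (which distributes this work) is preferable at scale.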
Spark best fits processing. But depending on the use case, you could expand
the scope of Spark to moving data using the native connectors. The one thing
that Spark is not is storage. Connectors are available for most storage
options, though.
Regards,
Sivakumaran S
> On 28-Aug-2016, at 6:04
Hi,
Can you explain your particular stack?
For example, what is the source of the streaming data, and what role does Spark play?
Are you dealing with real time and batch, and why Parquet and not something
like HBase to ingest data in real time?
HTH
Dr Mich Talebzadeh
Hi,
There are design patterns that use Spark extensively. I am new to this area, so
I would appreciate it if someone could explain where Spark fits in, especially
within fast or streaming use cases.
What are the best practices involving Spark? Is it always best to deploy it as
the processing engine,
For
No, it is just being truncated for display as the ... implies. Pass
truncate=false to the show command.
On Sun, Aug 28, 2016, 15:24 Kevin Tran wrote:
> Hi,
> I wrote to parquet file as following:
>
> +--------------------------+
> |word                      |
> +--------------------------+
>
Hi,
Does anyone know the best practices for storing data in parquet files?
Does a parquet file have a size limit (1TB)?
Should we use SaveMode.APPEND for a long-running streaming app?
How should we store it in HDFS (directory structure, ...)?
Thanks,
Kevin.
Hi,
I wrote to a parquet file as follows:
+--------------------------+
|word                      |
+--------------------------+
|THIS IS MY CHARACTERS ... |
|// ANOTHER LINE OF CHAC...|
+--------------------------+
These lines are not the full text; it is being trimmed down.
Does anyone know how many characters StringType
I am trying to do high-performance calculations which require custom
functions.
As a first stage I am trying to profile the effect of using a UDF, and I am
getting weird results.
I created a simple test (in
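One hedged way to frame such a profiling test (a plain-Python micro-benchmark standing in for the UDF-vs-builtin comparison; all names and sizes here are illustrative):

```python
# Sketch: compare a per-element function call (UDF-style dispatch)
# against the inlined expression. Absolute numbers vary by machine;
# only the relative cost of the extra call is of interest.
import timeit

data = list(range(100_000))

def plus_one(x):  # stands in for the custom UDF
    return x + 1

udf_style = timeit.timeit(lambda: [plus_one(v) for v in data], number=5)
inlined = timeit.timeit(lambda: [v + 1 for v in data], number=5)
print(udf_style > 0 and inlined > 0)
```

In Spark proper the gap is usually larger still, since a (non-codegen) UDF also forfeits Catalyst optimizations on that expression.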
Hi,
We use this kind of logic for training our model:
val model = new LogisticRegressionWithLBFGS()
.setNumClasses(3)
.run(train)
Next, during Spark Streaming, we load the model and apply incoming data to it
to get a specific class, for example:
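The pattern described above can be sketched in plain Python (a stand-in for the MLlib code, not the poster's actual example; class and variable names are made up):

```python
# Stand-in for the pattern: load the trained model once, then classify
# each incoming micro-batch with it.
class Model:
    def __init__(self, weights):
        self.weights = weights

    def predict(self, features):
        # Linear score + threshold, mimicking a trained classifier.
        score = sum(w * f for w, f in zip(self.weights, features))
        return 1 if score > 0 else 0

model = Model([0.5, -0.25])        # stands in for the LBFGS-trained model
batch = [[1.0, 1.0], [0.0, 4.0]]   # stands in for one streaming micro-batch
print([model.predict(f) for f in batch])  # → [1, 0]
```

The key point is that the model is loaded once on the driver and reused across batches, rather than retrained per batch.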
Yes, I realised that. Actually I thought it was s, not $. It has been around
in shell for years, say for actual values --> ${LOG_FILE}, for substitutions 's/'
etc.
cat ${LOG_FILE} | egrep -v 'rows affected|return status|&&&' | sed -e
's/^[ ]*//g' -e 's/^//g' -e '/^$/d' > temp.out
Dr
Hi Mich,
This is Scala's string interpolation, which allows for replacing $-prefixed
expressions with their values.
It's what the cool kids use in Scala for templating and concatenation.
Jacek
On 23 Aug 2016 9:21 a.m., "Mich Talebzadeh"
wrote:
> What is --> s below
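For readers more familiar with Python, the s-interpolator behaves much like an f-string (a Python analogy only, not Scala; the values are illustrative):

```python
# Python f-string analogy: Scala's s"..." substitutes $name / ${expr}
# much as Python substitutes {name} / {expr}.
name = "Mich"
print(f"Hello {name}, sum = {2 + 3}")
# Scala equivalent: s"Hello $name, sum = ${2 + 3}"
```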