I mentioned parquet as the input format.
On Oct 10, 2016 11:06 PM, "ayan guha" wrote:
> It really depends on the input format used.
> On 11 Oct 2016 08:46, "Selvam Raman" wrote:
>
>> Hi,
>>
>> How does Spark read data from S3 and run parallel tasks?
>>
>> Assume I
Hello,
I have a set of data representing various network connections. Each vertex
is represented by a single id, while the edges have a source id,
destination id, and a relationship (peer to peer, customer to provider, or
provider to customer). I am trying to create a subgraph built around a
Hi All,
I am running some Spark Scala code in Zeppelin on CDH 5.5.1 (Spark version
1.5.0). I customized the Spark interpreter to use
org.apache.spark.serializer.KryoSerializer as spark.serializer, and in the
dependencies I added Kryo 3.0.3 as follows:
com.esotericsoftware:kryo:3.0.3
When I
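For reference, a minimal sketch of the same Kryo setup done programmatically rather than through the interpreter settings; the registered class is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical user class to register with Kryo
case class MyEvent(id: Long, payload: String)

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyEvent]))
val sc = new SparkContext(conf)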
Hi,
Your RDD is 5 GB; perhaps it is too large to fit into the executors' storage
memory. You can check the Executors tab in the Spark UI to see the memory
available for storage on each executor.
Regards,
Chin Wei
On Tue, Oct 11, 2016 at 6:14 AM, diplomatic Guru
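A small sketch of the spill-friendly alternative, assuming the Spark 1.6 RDD API and that `rdd` is the 5 GB RDD in question:

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK)  // blocks that don't fit in memory spill to disk
rdd.count()                                // an action materializes the cache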
FYI, it supports
[{...}, {...} ...]
or
{...}
format as input.
On 11 Oct 2016 3:19 a.m., "Jean Georges Perrin" wrote:
> Thanks Luciano - I think this is my issue :(
>
> On Oct 10, 2016, at 2:08 PM, Luciano Resende wrote:
>
> Please take a look at
>
Repartition almost always involves a shuffle.
Let me see if I can explain the recovery stuff...
Say you start with two kafka partitions, topic-0 and topic-1.
You shuffle those across 3 Spark partitions; we'll label them A, B, and C.
Your job has written
fileA: results for A, offset ranges
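A sketch of capturing those offset ranges before the shuffle, assuming a direct Kafka stream (package name is the 0-8 one; the 0-10 package mirrors it under kafka010):

import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges  // capture before shuffling
  val shuffled = rdd.repartition(3)  // Spark partitions A, B, C as above
  // write each partition's results together with `offsets`
}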
Just thought of another potential issue: you should use the "provided"
scope when depending on Spark, i.e., in your project's pom:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.0.1</version>
  <scope>provided</scope>
</dependency>
On Mon, Oct 10, 2016 at 2:00 PM, Jakob Odersky
Hi,
I have been working on some Lambda Architecture for trading systems.
I think I have completed the dry runs for testing the modules.
For the batch layer the criterion is a day's lag (one-day-old data). This is
acceptable for users who come from a BI background using Tableau, but I
think we can
Hello team,
Spark version: 1.6.0
I'm trying to persist some data in memory for reuse. However, when
I call rdd.cache() or rdd.persist(StorageLevel.MEMORY_ONLY()) it does not
store the data, as I cannot see any RDD information under the Web UI (Storage
tab).
Therefore I tried
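One thing worth checking, sketched here assuming Spark 1.6 and a hypothetical input path: persistence is lazy, so nothing shows under Storage until an action materializes the RDD.

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///some/input")  // hypothetical path
rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.count()  // forces evaluation; the RDD should now appear in the Storage tab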
It really depends on the input format used.
On 11 Oct 2016 08:46, "Selvam Raman" wrote:
> Hi,
>
> How does Spark read data from S3 and run parallel tasks?
>
> Assume I have an S3 bucket of size 35 GB (parquet files).
>
> How will the SparkSession read the data and process the data
Your file sizes are too small; this has a significant impact on the NameNode.
Use HBase or maybe HAWQ to store small writes.
> On 10 Oct 2016, at 16:25, Kevin Mellott wrote:
>
> Whilst working on this application, I found a setting that drastically
> improved the
Hi,
How does Spark read data from S3 and run parallel tasks?
Assume I have an S3 bucket of size 35 GB (parquet files).
How will the SparkSession read the data and process it in parallel? How
does it split the S3 data and assign it to each executor task?
Please share your thoughts.
Note:
if we have
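A rough way to observe the split, sketched with a hypothetical bucket path; Spark plans roughly one scan task per parquet file block / row group:

val df = spark.read.parquet("s3a://my-bucket/data/")  // hypothetical path
println(df.rdd.getNumPartitions)  // number of parallel tasks for the scan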
How do you submit the application? A version mismatch between the launcher,
driver, and workers could lead to the bug you're seeing. A common reason for
a mismatch is the SPARK_HOME environment variable being set; this will
cause the spark-submit script to use the launcher determined by that
The batch interval was set to 30 seconds; however, after getting the
parquet files to save faster I lowered the interval to 10 seconds. The
number of log messages contained in each batch varied from just a few up to
around 3500, with the number of partitions ranging from 1 to around 15.
I will
Thanks Luciano - I think this is my issue :(
> On Oct 10, 2016, at 2:08 PM, Luciano Resende wrote:
>
> Please take a look at
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
This should give you hints on the necessary cast:
http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html#tab_java_2
The main ugly thing there is that the Java RDD wraps the Scala
RDD, so you need to unwrap one layer via rdd.rdd().
If anyone wants to work on a PR to update
Thanks!
I am ok with strict rules (despite being French), but even:
[{
"red": "#f00",
"green": "#0f0"
},{
"red": "#f01",
"green": "#0f1"
}]
is not going through…
Is there a way to see what it does not like?
the JSON parser has been pretty good to me until
Glad it was helpful :)
As far as executors, my expectation is that if you have multiple
executors running, and one of them crashes, the failed task will be
submitted on a different executor. That is typically what I observe
in Spark apps; if that's not what you're seeing I'd try to get help on
Please take a look at
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
Particularly the note about the required format:
Note that the file that is offered as *a json file* is not a typical JSON
file. Each line must contain a separate, self-contained valid JSON object.
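A workaround sketch for a whole-file JSON array, assuming flat objects like the color example earlier in the thread and a hypothetical file name: read the file whole, split the array into one object per string, and hand those to the JSON reader.

val raw = spark.sparkContext.wholeTextFiles("colors.json").map(_._2)
// naive split of "[{...},{...}]" between objects; fine for flat objects only
val objects = raw.flatMap(
  _.trim.stripPrefix("[").stripSuffix("]").split("(?<=\\})\\s*,\\s*(?=\\{)"))
val df = spark.read.json(objects)  // RDD[String] overload
df.show()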
Hi,
I am trying to do some operations on an HBase table that is being populated
by Spark Streaming.
Now this is just Spark on HBase, as opposed to Spark on Hive -> view on
HBase, etc. I also have a Phoenix view on this HBase table.
This is sample code
scala> val tableName =
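For reference, a minimal Spark-on-HBase read sketch, assuming the stock TableInputFormat and a hypothetical table name:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hconf = HBaseConfiguration.create()
hconf.set(TableInputFormat.INPUT_TABLE, "my_table")  // hypothetical table
val hbaseRdd = sc.newAPIHadoopRDD(hconf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println(hbaseRdd.count())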
Hi All,
I am seeing a huge variation in Spark Task Deserialization Time for my
collect and reduce operations. While most tasks complete within 100 ms, a few
take more than a couple of seconds, which slows the entire program down. I
have attached a screenshot of the web UI where you can see the
Hello,
I am having some interesting issues with a consistent error in Spark that
occurs when I'm working with DataFrames that are the result of a number
of joins and other transformations.
PartitioningCollection requires all of its partitionings have the same
numPartitions.
It seems
I just noticed that you're the author of the code I linked in my previous
email. :) It's helpful.
When using `foreachPartition` or `mapPartitions`, I noticed I can't ask Spark
to write the data to disk using `df.write()`; I need to use the iterator
to do so, which means losing the
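A sketch of the iterator-based pattern being described, with a hypothetical sink client standing in for a real connection:

// Hypothetical stand-in for a real connection/writer
class Sink { def write(s: String): Unit = println(s); def close(): Unit = () }
def openSink(): Sink = new Sink

df.foreachPartition { rows =>
  val sink = openSink()  // one connection per partition
  try rows.foreach(row => sink.write(row.toString))
  finally sink.close()
}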
Jean,
I see your point. For the incremental data, which is very small, I should make
sure that the PARTITION BY in the OVER(PARTITION BY ...) is left out so that
all the data will be in one partition when assigned a row number. The query
below should avoid any problems.
“SELECT ROW_NUMBER()
Hi folks,
I am trying to parse JSON arrays and it’s getting a little crazy (for me at
least)…
1)
If my JSON is:
{"vals":[100,500,600,700,800,200,900,300]}
I get:
+--------------------+
|                vals|
+--------------------+
|[100, 500, 600, 7...|
+--------------------+
root
 |-- vals:
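A sketch of flattening such an array column, assuming the DataFrame above is bound to `df`:

import org.apache.spark.sql.functions.explode

val flat = df.select(explode(df("vals")).as("v"))
flat.show()  // one row per array element: 100, 500, 600, ...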
That's enough. Did you see any error?
On Mon, Oct 10, 2016 at 5:08 AM, Joey Echeverria wrote:
> Hi Ryan!
>
> Do you know where I need to configure Kryo for this? I already have
> spark.serializer=org.apache.spark.serializer.KryoSerializer in my
> SparkConf and I registered the
Hi Kevin,
What is the streaming interval (batch interval) above?
I do analytics on streaming trade data, but after manipulation of the
individual messages I store the selected ones in HBase. Very fast.
HTH
Dr Mich Talebzadeh
LinkedIn *
AFAIK, we can guarantee that, with or without standardization, the models
always converge to the same solution if there is no regularization. You can
refer to the test cases at:
Yes, without committing the offsets the consumer rebalances.
The job consumes 3 streams and processes them. When consuming only one stream
it runs fine. But when consuming three streams, even without joining them,
just deserializing the payload and triggering an output action, it fails.
I will prepare code
Is there only one process adding rows? Because this seems a little risky if
you have multiple threads doing that…
> On Oct 8, 2016, at 1:43 PM, Benjamin Kim wrote:
>
> Mich,
>
> After much searching, I found and am trying to use “SELECT ROW_NUMBER()
> OVER() + b.id_max
Hi,
by following this article I managed to consume messages from Kafka 0.10 in
Spark 2.0:
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
However, the Java examples are missing and I would like to commit the
offset myself after processing the RDD. Does anybody have a
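For the Scala side (the Java call mirrors it), the 0-10 integration guide's manual-commit pattern looks roughly like this, assuming `stream` is the direct stream:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the RDD ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}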
Yeah, I should be more precise. Those are two direct dependencies.
On Mon, Oct 10, 2016 at 1:15 PM, Sean Owen wrote:
> Usually this sort of thing happens because the two versions are in
> different namespaces in different major versions and both are needed. That
> is true of
What is it you're actually trying to accomplish?
On Mon, Oct 10, 2016 at 5:26 AM, Samy Dindane wrote:
> I managed to make a specific executor crash by using
> TaskContext.get.partitionId and throwing an exception for a specific
> executor.
>
> The issue I have now is that the
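A sketch of the crash-injection trick described above; the partition id chosen is arbitrary and `rdd` is assumed:

import org.apache.spark.TaskContext

rdd.map { x =>
  if (TaskContext.get.partitionId == 0)  // fail tasks for one partition
    throw new RuntimeException("simulated failure")
  x
}.count()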
Whilst working on this application, I found a setting that drastically
improved the performance of my particular Spark Streaming application. I'm
sharing the details in hopes that it may help somebody in a similar
situation.
As my program ingested information into HDFS (as parquet files), I
I have a question regarding how the default standardization in the ML
version of Logistic Regression (Spark 1.6) works.
Specifically, about the following comments in the Spark code:
/**
* Whether to standardize the training features before fitting the model.
* The coefficients of models will be