I am using the latest Spark version, 1.6.
I have increased the maximum number of open files using this command: *sysctl
-w fs.file-max=3275782*
I also increased the limit for the user who runs the Spark job by updating
the /etc/security/limits.conf file. The soft limit is 1024 and the hard limit
is 65536.
Thanks, Reynold, for the explanation of why sortByKey invokes a job.
When you say "DataFrame/Dataset does not have this issue", is it right to
assume you are referring to Spark 2.0, or does the Spark 1.6 DataFrame
already have this built in?
Thanking You
--
Hi,
I have a streaming program with the block as below [ref:
https://github.com/agsachin/streamingBenchmark/blob/master/spark-benchmarks/src/main/scala/TwitterStreaming.scala
]
val lines = messages.map(_._2)
val hashTags = lines.flatMap(status => status.split(" ").filter(_.startsWith("#")))
Usually no - but sortByKey does, because it needs the range boundaries to be
computed in order to build the RDD. It is a long-standing problem that's
unfortunately very difficult to solve without breaking the RDD API.
In DataFrame/Dataset we don't have this issue though.
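A minimal sketch of the behaviour, assuming a Spark 1.6 spark-shell session (the variable names are illustrative, not from this thread):

// sortByKey needs a RangePartitioner, which samples the data to pick
// range boundaries, so a job runs as soon as sortByKey is called.
val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))
val sorted = pairs.sortByKey()        // sampling job runs here, before any action

// The DataFrame API plans the sort lazily; nothing runs until an action.
import sqlContext.implicits._
val df = pairs.toDF("key", "value")
val sortedDf = df.sort("key")         // no job yet
sortedDf.count()                      // job runs here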
On Sun, Apr 24, 2016 at 10:54 P
Hi,
After joining two DataFrames, I am saving the result using Spark CSV.
But all the result data is being written to only one part file, while
200 part files are created and the remaining 199 part files are empty.
What is the cause of the uneven partitioning? How can I distribute the
data evenly?
Would
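Not an answer from the thread, just a rough sketch of one way to rebalance before writing; the DataFrame and output path are placeholders, and the 200 files come from the default spark.sql.shuffle.partitions = 200:

// Repartition to a fixed number of partitions (or by a well-distributed
// column) so the output part files are roughly even in size.
val balanced = joined.repartition(16)
balanced.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/tmp/joined-output")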
Hi,
You can use YourKit to profile workloads; please see:
https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit
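Not from the wiki page itself, just a rough sketch of how the agent is commonly attached; the agent path and mode are assumptions, so adjust them for your YourKit installation:

// Attach the YourKit agent to the driver and executor JVMs; the same
// options can also be passed via --conf on spark-submit.
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so=sampling")
  .set("spark.driver.extraJavaOptions",
       "-agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so=sampling")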
// maropu
On Mon, Apr 25, 2016 at 10:24 AM, Edmon Begoli wrote:
> I am working on an experimental research into memory use and profiling of
I am working on experimental research into memory use, profiling the memory
use and allocation of machine learning functions across a number of
popular libraries.
Is there a facility within Spark, and MLlib specifically, to track the
allocation and use of data frames/memory by MLlib?
Please adv
Maybe this is due to the config spark.scheduler.minRegisteredResourcesRatio;
you can try setting it to 1 to see the behavior.
// Submit tasks only after (registered resources / total expected resources)
// is equal to at least this value, that is double between 0 and 1.
var minRegisteredRatio =
math.m
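Not part of the quoted source, just a sketch of setting the suggested value (1.0 means wait for all expected resources to register):

// Wait until all expected executor resources have registered before
// the scheduler starts submitting tasks.
val conf = new org.apache.spark.SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")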
Could you change numPartitions to {16, 32, 64} and run your program for
each to see how many partitions are allocated to each worker? Let's see if
you experience an all-or-nothing imbalance that way; if so, my guess is that
something else is odd in your program logic or Spark runtime environment,
but
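A small sketch (the RDD name is a placeholder) of one way to see how records spread across partitions for a given numPartitions:

// Print the record count per partition to spot an all-or-nothing imbalance.
rdd.mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx: $n records") }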
I have data in a DataFrame loaded from a CSV file. I need to load this data
into HBase using an RDD formatted in a certain way.
val rdd = sc.parallelize(
Array(key1,
(ColumnFamily, ColumnName1, Value1),
(ColumnFamily, ColumnName2, Value2),
(
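Not from the thread, but a rough sketch (the column family, qualifiers, and key column index are placeholders) of deriving that structure from the DataFrame instead of building it by hand:

// Turn each DataFrame row into (rowKey, Array((family, qualifier, value), ...)),
// mirroring the hand-built RDD above.
val hbaseRdd = df.rdd.map { row =>
  val key = row.getString(0)
  (key, Array(
    ("cf", "ColumnName1", row.getString(1)),
    ("cf", "ColumnName2", row.getString(2))))
}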
Mike, All,
It turns out that the second time we encountered the uneven-partition issue,
it was not due to spark-submit. It was resolved by changing the placement of
count().
Case-1:
val numPartitions = 8
// read uAxioms from HDFS, use hash partitioner on it and persist it
// read type1Axioms from
Have you taken a look at:
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
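For reference, a minimal sketch of the pattern that suite exercises, assuming a Spark 1.6 shell where sc and sqlContext are available (the case class and file path are illustrative):

case class Person(name: String, age: Int)

// Parse the text file into Persons and turn the RDD into a typed Dataset,
// the Scala analogue of the quoted Java code below.
import sqlContext.implicits._
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDS()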
On Sun, Apr 24, 2016 at 8:18 AM, coder wrote:
> JavaRDD<Person> prdd = sc.textFile("c:\\fls\\people.txt").map(
> new Function<String, Person>() {
> public Person call(String line) throws Exception {
>
JavaRDD<Person> prdd = sc.textFile("c:\\fls\\people.txt").map(
    new Function<String, Person>() {
        public Person call(String line) throws Exception {
            String[] parts = line.split(",");
            Person person = new Person();
            person.setName(parts[0]);
            person.setAge(Integer.parseInt(parts[1].trim()));
            return person;
        }
    });
We have similar jobs consuming from Kafka and writing to Elasticsearch, and the
culprit is usually not enough memory for the executor or driver, or not enough
executors in general to process the job. Try using dynamic allocation if you're
not too sure about how many cores/executors you actually ne
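Not from the thread, just a rough sketch of the relevant settings (the values are placeholders; dynamic allocation also needs the external shuffle service running on each worker):

// Let Spark scale the number of executors up and down with the workload.
val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")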
Hi Daniel,
How did you get the Phoenix plugin to work? I have CDH 5.5.2 installed which
comes with HBase 1.0.0 and Phoenix 4.5.2. Do you think this will work?
Thanks,
Ben
> On Apr 24, 2016, at 1:43 AM, Daniel Haviv wrote:
>
> Hi,
> I tried saving DF to HBase using a hive table with hbase s
So in that case the result will be the following:
[1,[1,1]] [3,[3,1]] [2,[2,1]]
Thanks for explaining the meaning of it. But the question is: how can first()
be [3,[1,1]]? In fact, if there were any ordering in the final result, it
would be [1,[1,1]] instead of [3,[1,1]], correct?
Yong
Which version of Spark are you using ?
How did you increase the open file limit ?
Which operating system do you use ?
Please see Example 6. ulimit Settings on Ubuntu under:
http://hbase.apache.org/book.html#basic.prerequisites
On Sun, Apr 24, 2016 at 2:34 AM, fanooos wrote:
> I have a spark s
2016-04-24 13:38 GMT+02:00 Stefan Falk :
> sc.parallelize(cfile.toString()
> .split("\n"), 1)
Try `sc.textFile(pathToFile)` instead.
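A small sketch of that suggestion against the original snippet (the paths are placeholders):

// Read the file as a distributed RDD instead of materialising the whole
// 5 GB as one string on the driver and calling parallelize on it.
import org.apache.hadoop.fs.Path
val pathToFile = "/data/input.txt"   // placeholder input
val path = "/data/output"            // placeholder output prefix
val lines = sc.textFile(pathToFile)
lines.saveAsTextFile(new Path(path + ".cs", "data").toUri.toString)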
>java.io.IOException: Broken pipe
>at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>at sun.nio.ch.SocketDispatcher.write(SocketD
I am trying to save a large text file of approx. 5 GB
sc.parallelize(cfile.toString()
.split("\n"), 1)
.saveAsTextFile(new Path(path+".cs", "data").toUri.toString)
but I keep getting
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
I have a Spark Streaming job that reads the tweet stream from Gnip and writes it
to Kafka.
Spark and Kafka are running on the same cluster.
My cluster consists of 5 nodes: Kafka-b01 ... Kafka-b05.
The Spark master is running on Kafka-b05.
Here is how we submit the Spark job:
*nohup sh $SPARK_HOME/bin/s
Hi,
I tried saving a DF to HBase using a Hive table with the HBase storage handler and
HiveContext, but it failed due to a bug.
I was able to persist the DF to HBase using Apache Phoenix, which was pretty
simple.
Thank you.
Daniel
> On 21 Apr 2016, at 16:52, Benjamin Kim wrote:
>
> Has anyone found