Thanks Krishna, but I believe the memory on the executors is being exhausted in my case. I've allocated the max 10g that I can to both the driver and the executors. Are there any alternative solutions for fetching the top 1M rows after ordering the dataset?
Thanks!
On Fri, Apr 29, 2016 at
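One alternative worth sketching here (only a sketch; the column name c1 and the output path are assumptions, not confirmed by the thread): instead of collecting the limited result through the driver, sort the DataFrame, number the rows with zipWithIndex, keep the first million, and write them out to storage so no single JVM has to hold the full result.

import org.apache.spark.sql.functions.col

// sort, number the rows, keep the first 1M, and write out instead of collecting
val sorted = sqlContext.table("tbl").orderBy(col("c1"))
val top = sorted.rdd.zipWithIndex
  .filter { case (_, idx) => idx < 1000000L }
  .map { case (row, _) => row }
sqlContext.createDataFrame(top, sorted.schema)
  .write.parquet("hdfs:///tmp/top_1m_rows")   // hypothetical output path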
I recently encountered similar network-related errors and was able to fix them by applying the ethtool updates described here [
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-5085]
On Friday, April 29, 2016, Buntu Dev wrote:
> Just to provide more details, I
So connecting to the cluster (port 7077) works; that is to say, the SparkContext is created. The timeout occurs on the worker side when I run any command with .collect(). The client (local machine) basically waits forever. I'm wondering if maybe I'm not understanding the architecture correctly.
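One thing to check, sketched below with placeholder values (this is a guess about the setup, not a confirmed diagnosis): when the shell runs on your local machine, the driver runs there too, and every executor must be able to connect back to it, so the driver's host and port need to be reachable from the EC2 workers.

import org.apache.spark.{SparkConf, SparkContext}

// hypothetical addresses/ports; the point is that workers must be able to reach the driver
val conf = new SparkConf()
  .setAppName("remote-shell-test")
  .setMaster("spark://<master-ip>:7077")
  .set("spark.driver.host", "<driver-ip-reachable-from-workers>")
  .set("spark.driver.port", "51000")   // pin the port so it can be opened in the firewall
val sc = new SparkContext(conf)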
Have you tried connecting to port 7077 on the cluster from your local machine to see if it works OK?
Sent from my iPhone
> On Apr 29, 2016, at 5:58 PM, Richard Han wrote:
>
> I have an EC2 installation of Spark Standalone Master/Worker set up. The two
> can talk to
I have an EC2 installation of Spark Standalone Master/Worker set up. The
two can talk to one another, and all ports are open in the security group
(just to make sure it isn't the port). When I run spark-shell on the master
node (setting it to --master spark://ip:7077) it runs everything correctly.
I'd like to echo a question that was asked earlier this year:
If we do a global sort of a dataframe (with two columns: col_1, col_2) by
(col_1, col_2/desc) and then dropDuplicates on col_1, will it retain the
first row of each sorted group? i.e. Will it return the row with the
greatest value of
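As far as I know, dropDuplicates makes no guarantee about which row of a group it keeps, so if the goal is the row with the greatest col_2 per col_1, a window function states that explicitly. A minimal sketch (df stands for the two-column DataFrame; on older 1.x releases window functions may require a HiveContext):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// keep exactly the first row of each group as defined by the explicit ordering
val w = Window.partitionBy("col_1").orderBy(col("col_2").desc)
val firstPerGroup = df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")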
Just to provide more details, I have 200 blocks (parquet files) with an average block size of 70M. Limiting the result set to 100k ("select * from tbl order by c1 limit 10") works, but when increasing it to, say, 1M I keep running into this error:
Connection reset by peer: socket write error
I
I have the following simple example that I can't get to work correctly.
In [1]:
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import asc, desc, sum, count
sqlContext = SQLContext(sc)
error_schema
Do you need 2 --num-executors ?
Sent from my iPhone
> On 29 Apr 2016, at 20:25, Ashish Sharma [via Apache Spark User List]
> wrote:
>
> Submit Command1:
>
> spark-submit --class working.path.to.Main \
> --master yarn \
>
OK. Thanks Cody!
On Fri, Apr 29, 2016 at 12:41 PM, Cody Koeninger wrote:
> If worker to broker communication breaks down, the worker will sleep
> for refresh.leader.backoff.ms before throwing an error, at which point
> normal spark task retry (spark.task.maxFailures) comes
If worker to broker communication breaks down, the worker will sleep
for refresh.leader.backoff.ms before throwing an error, at which point
normal spark task retry (spark.task.maxFailures) comes into play.
If driver to broker communication breaks down, the driver will sleep
for
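For reference, a minimal sketch of where these two knobs live on each side; the values are illustrative rather than recommendations:

import org.apache.spark.SparkConf

// Spark side: how many times a failed task is retried before the stage fails
val conf = new SparkConf()
  .set("spark.task.maxFailures", "8")

// Kafka consumer side: how long to back off before refreshing leader metadata
val kafkaParams = Map(
  "metadata.broker.list"      -> "broker1:9092,broker2:9092",   // hypothetical brokers
  "refresh.leader.backoff.ms" -> "2000"
)
// kafkaParams is then passed to KafkaUtils.createDirectStream(...)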
Hello, I think I may have jumped to the wrong conclusion about symlinks,
and I was able to get what I want working perfectly.
I added these two settings in my importer application:
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Any thoughts?
I can explain more about the problem, but basically the shuffle data doesn't seem to fit in reducer memory (32GB) and I am looking for ways to process it on disk+memory.
Thanks
On Thu, Apr 28, 2016 at 10:07 AM, Nirav Patel wrote:
> Hi,
>
> I tried to convert a
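For framing, a rough sketch of the usual levers when shuffle data outgrows executor memory (bigRdd and the partition counts are placeholders): more, smaller reduce-side tasks, plus a storage level that is allowed to spill to disk.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// more reduce-side tasks means less shuffle data held per task
val conf = new SparkConf()
  .set("spark.sql.shuffle.partitions", "2000")   // DataFrame/SQL shuffles
  .set("spark.default.parallelism", "2000")      // RDD shuffles

// let cached intermediate data overflow to local disk instead of failing in memory
val spread = bigRdd.repartition(2000)            // bigRdd is hypothetical
spread.persist(StorageLevel.MEMORY_AND_DISK)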
Hello,
I have a parquet dataset, partitioned by a column 'a'. I want to take
advantage
of Spark SQL's ability to filter to the partition when you filter on 'a'. I
also
want to periodically update individual partitions without disrupting any
jobs
that are querying the data.
The obvious solution
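A hedged sketch of one way to do the per-partition refresh (paths and the partition value are made up): write the dataset partitioned by 'a', then rebuild a single partition by overwriting just its directory, leaving the other partition directories untouched.

// initial write, partitioned by 'a' so filters on 'a' can prune partitions
df.write.partitionBy("a").parquet("hdfs:///data/events")

// later, rebuild only one partition by overwriting its directory;
// drop 'a' first because the value lives in the path, not in the files
updatedPartition.drop("a")
  .write.mode("overwrite")
  .parquet("hdfs:///data/events/a=42")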
What does the WebUI show? What do you see when you click on the "stderr" and "stdout" links? These links should contain the stdout and stderr output for each executor.
About your custom logging in the executor, are you sure you checked "${spark.yarn.app.container.log.dir}/spark-app.log"?
Actual location of this
If you’re getting the logs, then it really isn’t CEP unless you consider the
event to be the log from the bus.
This doesn’t sound like there is a time constraint.
Your bus schedule is fairly fixed and changes infrequently.
Your bus stops are relatively fixed points. (Within a couple of
OK. So on the Kafka side they use a rebalance.backoff.ms of 2000, which is the default for rebalancing, and they say that a refresh.leader.backoff.ms of 200 for refreshing the leader is very aggressive, and suggested we increase it to 2000. Even after increasing it to 2500 I still get Leader Lost errors.
Is
Hello, Spark is executing a create table statement (using JDBC) against MSSQLServer with a column type mapping like ColName Bit(1) for boolean types. This create table cannot be executed on MSSQLServer.
In the JdbcDialect class the mapping for the Boolean type is Bit(1), so the question is, is this a problem
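If the Bit(1) mapping is indeed the culprit, one workaround to sketch (the object name here is made up) is registering a custom dialect that overrides the Boolean mapping for SQL Server URLs:

import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{BooleanType, DataType}

// map BooleanType to plain BIT for jdbc:sqlserver URLs instead of the default Bit(1)
object MsSqlServerBitDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case BooleanType => Some(JdbcType("BIT", Types.BIT))
    case _           => None
  }
}
JdbcDialects.registerDialect(MsSqlServerBitDialect)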
Hi Ted & Nguyen,
@Ted, I was under the belief that the log4j.properties file would be taken from the application classpath if a file path is not specified. Please correct me if I am wrong. I tried your approach as well, but I still couldn't find the logs.
@nguyen, I am running it on a YARN cluster
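For what it's worth, the pattern I usually see for getting a custom log4j.properties onto a YARN cluster is to ship the file with the job and point both driver and executors at it; a sketch with illustrative class names and paths:

spark-submit --class your.package.Main \
  --master yarn --deploy-mode cluster \
  --files /local/path/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  your-app.jar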
These are the executor's logs, not the driver logs. To see these log files, you have to go to the executor machines where the tasks are running. To see what you print to stdout or stderr you can either go to the executor machines directly (it will be stored in "stdout" and "stderr" files somewhere in the executor
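If YARN log aggregation is enabled on the cluster, the per-container stdout/stderr can also be pulled afterwards with the YARN CLI (the application id below is a placeholder for whatever YARN assigned to the job):

yarn logs -applicationId application_1461888000000_0042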
Hi Spark Team,
I have asked the same question on Stack Overflow, no luck yet.
http://stackoverflow.com/questions/36923949/where-to-find-logs-within-spark-rdd-processing-function-yarn-cluster-mode?noredirect=1#comment61419406_36923949
I am running my Spark application on a YARN cluster. No
OK, like any piece of work you need to start this from a simple model. Take one bus only (identified by its bus number, which is unique).
For any bus number N you have the logs LOG A and LOG B, plus LOG C, the coordinator log from the central computer that sends the estimated time of arrival (ETA) to the bus stops. Pretty simple.
On 28 Apr 2016, at 22:59, Alexander Pivovarov
> wrote:
Spark works well with S3 (read and write). However, it's recommended to set spark.speculation to true (it's expected that some tasks will fail if you read a large S3 folder, so speculation should
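A minimal sketch of turning it on programmatically (the extra tuning values are illustrative, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")            // re-launch suspiciously slow tasks speculatively
  .set("spark.speculation.quantile", "0.75")   // illustrative: fraction of tasks finished before speculating
  .set("spark.speculation.multiplier", "1.5")  // illustrative: how much slower than the median counts as slow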
Hi
Let me try to explain my case.
The situation with my logs and solution is not so simple. There are many types of logs and they come from many sources.
They are in CSV format, and the header line includes the names of the columns.
This is a simplified description of the input logs.
LOG A's: bus coordinate logs
Hi,
Two points here.
1) Beeline uses a JDBC connection to connect to the Hive server. You can actually put your code in an hql file and run it as
beeline -u jdbc:hive2://:10010/default -d org.apache.hive.jdbc.HiveDriver -n hduser -p x -f mycode.hql
2) How do you want to split your output file? Is
Hi ,
I observed that if I use a subset of the same dataset, or the dataset is small, it writes to many part files.
If the dataset grows, it writes to only a few part files and all the remaining part files are empty.
Thanks,
Divya
On 25 April 2016 at 23:15, nguyen duc tuan wrote:
> Maybe the problem is the