Currently, I use rdd.isEmpty()
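For example, a minimal sketch of that guard (the output path is an assumption of mine):

// Skip the save entirely when the RDD is empty, so no empty files land in HDFS.
if (!rdd.isEmpty()) {
  rdd.saveAsTextFile("hdfs:///output/batch-0001")
}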
Thanks,
Patanachai
On 08/06/2015 12:02 PM, gpatcham wrote:
Is there a way to filter out empty partitions before I write to HDFS, other
than using repartition and coalesce?
I'm having the same problem here.
Well, I managed to solve that issue after running my tests on a Linux
system instead of Windows (which I was originally using). However, now I
have an error when I try to reset the Hive context using hc.reset(). It
tries to create a file inside the directory /user/my_user_name instead of the
usual
Code:
import java.text.SimpleDateFormat
import java.util.Calendar
import java.sql.Date
import org.apache.spark.storage.StorageLevel
def extract(array: Array[String], index: Int) = {
  if (index < array.length) {
    array(index).replaceAll("\"", "")  // strip quote characters from the field
  } else {
    ""                                 // index out of range: return an empty string
  }
}
case class GuidSess(
Well, I tried this approach and still have issues. Apparently TestHive cannot
delete the Hive metastore directory. The complete error that I have is:
15/08/06 15:01:29 ERROR Driver: FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
Not that I'm aware of. We ran into a similar issue where we didn't want
to keep accumulating all these empty part files in storage on S3 or HDFS.
There didn't seem to be any way to do it with an RDD without a performance
cost, so we just run a non-Spark post-batch operation to delete empty files from the
Hi All,
I am using Spark 1.4.1, and I want to know how I can find the complete
list of functions supported in Spark SQL; currently I only know
'sum', 'count', 'min', 'max'. Thanks a lot.
Hi All,
I am working with Spark Streaming and Twitter's user API.
I used this code to stop streaming:
ssc.addStreamingListener(new StreamingListener {
  var count = 1
  override def onBatchCompleted(batchCompleted:
      StreamingListenerBatchCompleted) {
    count += 1
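A complete version of this pattern might look like the sketch below; the
five-batch threshold and the asynchronous stop are assumptions of mine, not
from the original post:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Stop the streaming context after a fixed number of completed batches.
// Listener callbacks arrive on a single listener-bus thread, so a plain var is fine.
var batches = 0
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    batches += 1
    if (batches >= 5) {
      // Stop from a separate thread: stopping from inside the listener
      // callback can block the listener bus.
      new Thread("stopper") {
        override def run(): Unit =
          ssc.stop(stopSparkContext = true, stopGracefully = true)
      }.start()
    }
  }
})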
What is the JIRA number, if a JIRA has been logged for this?
Thanks
On Jan 20, 2015, at 11:30 AM, Cheng Lian lian.cs@gmail.com wrote:
Hey Yi,
I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but I would like
to investigate this issue later. Would you please open an
Any inputs?
In the case of the following message, is there a way to check which resource is
insufficient through some logs?
[Timer-0] WARN org.apache.spark.scheduler.TaskSchedulerImpl -
Initial job has not accepted any resources; check your cluster UI to
ensure that workers are
Hi andy,
Is there any method to convert an IPython notebook file (.ipynb) to a Spark Notebook
file (.snb), or vice versa?
BR
Jun
At 2015-07-13 02:45:57, andy petrella andy.petre...@gmail.com wrote:
Heya,
You might be looking for something like this I guess:
Figured out the root cause. The master was randomly assigning a port to the worker
for communication. Because of the firewall on the master, the worker couldn't
send messages out to the master (resource details, perhaps). Weirdly, the worker
didn't even bother to throw any error.
On 8/6/2015 3:24 PM, Kushal
Yep, most of the things will work just by renaming it :-D
You can even use nbconvert afterwards
On Thu, Aug 6, 2015 at 12:09 PM jun kit...@126.com wrote:
Hi andy,
Is there any method to convert an IPython notebook file (.ipynb) to a Spark
Notebook file (.snb), or vice versa?
BR
Jun
At
I got it running by myself
On Wed, Aug 5, 2015 at 10:27 PM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:
Have you tried reading the spark documentation?
http://spark.apache.org/docs/latest/programming-guide.html
Thank you,
Ilya Ganelin
-Original Message-
*From: *ÐΞ€ρ@Ҝ
This isn't really a Spark question. You're trying to parse a string to an
integer, but it contains an invalid character. The exception message
explains this.
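As an illustration (my example, not from the thread), guarding the parse
avoids a NumberFormatException killing the task:

import scala.util.Try

// Returns None for malformed strings like "12a" instead of throwing.
def safeToInt(s: String): Option[Int] = Try(s.trim.toInt).toOption

val parsed = Seq("42", "12a", " 7 ").map(safeToInt)
// => List(Some(42), None, Some(7))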
On Wed, Aug 5, 2015 at 11:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Code:
import java.text.SimpleDateFormat
import
Worth noting that Spark 1.5 is extending that list of Spark SQL functions
quite a bit. Not sure where in the docs they would be yet, but the JIRA is
here: https://issues.apache.org/jira/browse/SPARK-8159
On Thu, Aug 6, 2015 at 7:27 PM, Netwaver wanglong_...@163.com wrote:
Thanks for your kindly
Thanks for your kind help
At 2015-08-06 19:28:10, Todd Nist tsind...@gmail.com wrote:
They are covered here in the docs:
http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.functions$
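As a quick illustration (the DataFrame df and the column names here are made
up), once you import that functions object there is much more than
sum/count/min/max:

import org.apache.spark.sql.functions._

// avg and countDistinct are among the many aggregates available in 1.4.
val stats = df.groupBy("department").agg(
  sum("salary"),
  avg("salary"),
  countDistinct("employee_id")
)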
On Thu, Aug 6, 2015 at 5:52 AM, Netwaver wanglong_...@163.com wrote:
Hi
Hi All,
We're trying to run Spark with Mesos and Docker in client mode (since Mesos
doesn't support cluster mode) and load the application JAR from HDFS. The
following is the command we're running:
We're getting the following warning, followed by an exception, from that command:
Before I debug
I am starting off with classification models (logistic regression, random
forest); basically I wanted to learn machine learning.
Since I have a Java background I started off with MLlib, but later heard that R
works as well (with scaling issues, though).
So, with SparkR, I was wondering whether the scaling issue would be resolved.
Hi
I am running an application using Activator where I retrieve tweets
and store them in a MySQL database using the code below.
I get an OOM error after 5-6 hours with -Xmx1048M. If I increase the memory, the
OOM is only delayed.
Can anybody give me a clue? Here is the code:
var tweetStream =
Hi
I am using Spark Streaming 1.3 and a custom checkpoint to save Kafka
offsets.
1. Is doing
Runtime.getRuntime().addShutdownHook(new Thread() {
  @Override
  public void run() {
    jssc.stop(true, true);
    System.out.println("Inside Add Shutdown Hook");
  }
});
to handle stop safe?
2. And I
Hi
I have a Spark/Cassandra setup where I am using the Spark Cassandra Java
connector to query a table. So far, I have 1 Spark master node (2
cores) and 1 worker node (4 cores). Both of them have the following
spark-env.sh under conf/:
#!/usr/bin/env bash
export SPARK_LOCAL_IP=127.0.0.1
I think you can. Give it a try and see.
Thanks
Best Regards
On Tue, Aug 4, 2015 at 7:02 AM, swetha swethakasire...@gmail.com wrote:
Hi,
Can I use multiple updateStateByKey functions in a Streaming job? Suppose
I need to maintain the state of the user session in the form of a JSON, and
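For what it's worth, a minimal sketch of two independent updateStateByKey
calls; the pair types and the merge logic are assumptions of mine:

import org.apache.spark.streaming.dstream.DStream

// Assume events is a DStream of (userId, eventJson) pairs, and that
// ssc.checkpoint(...) has been set, as stateful operations require it.
def twoStates(events: DStream[(String, String)]) = {
  // State 1: a running session blob per user (merge logic assumed).
  val sessions = events.updateStateByKey[String] { (newEvents, current) =>
    Some(current.getOrElse("") + newEvents.mkString)
  }
  // State 2: a running event count per user.
  val counts = events.updateStateByKey[Long] { (newEvents, current) =>
    Some(current.getOrElse(0L) + newEvents.size)
  }
  (sessions, counts)
}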
I can reproduce this issue, so it looks like a bug in Random Forest; I will
try to find some clues.
2015-08-05 1:34 GMT+08:00 Patrick Lam pkph...@gmail.com:
Yes, I rechecked and the label is correct. As you can see in the code
posted, I used the exact same features for all the classifiers so
Code:
import java.text.SimpleDateFormat
import java.util.Calendar
import java.sql.Date
import org.apache.spark.storage.StorageLevel
def formatStringAsDate(dateStr: String) = new java.sql.Date(new
  SimpleDateFormat("yyyy-MM-dd").parse(dateStr).getTime())
For some reason you have two different versions of the Spark jars in your
classpath.
Thanks
Best Regards
On Tue, Aug 4, 2015 at 12:37 PM, Deepesh Maheshwari
deepesh.maheshwar...@gmail.com wrote:
Hi,
I am trying to read data from Kafka and process it using Spark.
I have attached my source
Hi,
Recently I gave a talk, a deep dive into the DataFrame API and SQL Catalyst.
A video of it is available on YouTube, with slides and code. Please
have a look if you are interested:
http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/
Hi Sadaf
Which version of Spark are you using? And what's in the spark-env.sh file?
I think you are using both SPARK_CLASSPATH (which is deprecated) and
spark.executor.extraClassPath (maybe set in the spark-defaults.conf file).
Thanks
Best Regards
On Wed, Aug 5, 2015 at 6:22 PM, Sadaf Khan
Works like a charm. Thanks Reynold for the quick and efficient response!
Alexis
2015-08-05 19:19 GMT+02:00 Reynold Xin r...@databricks.com:
In Spark 1.5, we have a new way to manage memory (part of Project
Tungsten). The default unit of memory allocation is 64MB, which is way too
high when
Yes, you should use ORC; it is much faster and more compact. Additionally, you
can apply compression (Snappy) to increase performance. Your data
processing pipeline seems to be not very optimized. You should use the
newest Hive version, enabling storage indexes and bloom filters on
appropriate
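As a quick illustration of the ORC + Snappy + bloom filter suggestion, issued
through a HiveContext in Scala; the table and column names are made up:

// DDL for an ORC table with Snappy compression, storage indexes,
// and a bloom filter on a frequently filtered column.
sqlContext.sql("""
  CREATE TABLE events_orc (
    id INT,
    user_id INT,
    amount DOUBLE
  )
  STORED AS ORC
  TBLPROPERTIES (
    'orc.compress' = 'SNAPPY',
    'orc.create.index' = 'true',
    'orc.bloom.filter.columns' = 'user_id'
  )
""")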
Given the following command line to spark-submit:
bin/spark-submit --verbose --master local[2] --class
org.yardstick.spark.SparkCoreRDDBenchmark
/shared/ysgood/target/yardstick-spark-uber-0.0.1.jar
Here is the output:
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes
Well, the creation of a Thrift server would be to allow external access to
the data from JDBC/ODBC type connections. The sparkstreaming-sql project
leverages a standard Spark SQL context and then provides a means of
converting an incoming DStream into a Row; look at the MessageToRow trait
in KafkaSource
Hi, here I have two things to ask.
FIRST:
In our project we use Hive.
We get new data daily, and we need to process this new data only once, then send
the processed data to an RDBMS. In the processing we mostly use many
complex queries with joins, WHERE conditions, and grouping functions.
There are
Additionally, it is of key importance to use the right data types for the
columns: use int for ids, and int, decimal, float, double, etc. for
numeric values. A bad data model using varchars and strings where not
appropriate is a significant bottleneck.
Furthermore, include partition columns
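To illustrate the data-type point in Spark SQL terms (the schema and field
names are my own example):

import org.apache.spark.sql.types._

// Declare precise types instead of falling back to strings everywhere.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),      // int id, not varchar
  StructField("amount", DoubleType, nullable = true),    // numeric measure
  StructField("event_date", DateType, nullable = false)  // natural partition column
))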
They are covered here in the docs:
http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.functions$
On Thu, Aug 6, 2015 at 5:52 AM, Netwaver wanglong_...@163.com wrote:
Hi All,
I am using Spark 1.4.1, and I want to know how can I find the
complete function
Thank you Todd,
How is the sparkstreaming-sql project different from starting a Thrift
server on a streaming app?
Thanks again.
Daniel
On Thu, Aug 6, 2015 at 1:53 AM, Todd Nist tsind...@gmail.com wrote:
Hi Daniel,
It is possible to create an instance of the SparkSQL Thrift server,
Have you looked at this?
http://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.functions$
On Aug 6, 2015, at 2:52 AM, Netwaver wanglong_...@163.com wrote:
Hi All,
I am using Spark 1.4.1, and I want to know how can I find the
complete function list
Hi,
Try using coalesce(1) before calling saveAsTextFile().
Thanks & Regards,
Meethu M
On Wednesday, 5 August 2015 7:53 AM, Brandon White
bwwintheho...@gmail.com wrote:
What is the best way to make saveAsTextFile save as only a single file?
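For example (the output path is made up), a single partition yields a single
part file:

// Coalesce to one partition; the output directory will contain one part-00000 file.
rdd.coalesce(1).saveAsTextFile("hdfs:///tmp/single-file-output")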
Hi,
I actually ran into the same problem, although our endpoint is not
ElasticSearch. When the Spark job is dead, we lose some data because the
Kinesis checkpoint is already beyond the last point that Spark has processed.
Currently, our workaround is to use Spark's checkpoint mechanism with
write
+Xiangrui
I am not sure exposing the entire GraphX API would make sense, as it
contains a lot of low-level functions. However, we could expose some
high-level functions like PageRank. Xiangrui, who has been working
on similar techniques to expose MLlib functions like GLM, might have
more to add.
There's no support for IAM roles in the s3n:// client code in Apache Hadoop
(HADOOP-9384); Amazon's modified EMR distro may have it.
The s3a filesystem adds it; this is ready for production use in Hadoop 2.7.1+
(implicitly HDP 2.3; CDH 5.4 has cherry-picked the relevant patches). I don't
With DEBUG, the log output was over 10MB, so I opted for just INFO output.
The (sanitized) log is attached.
The driver is essentially this code:
info("A")
val t = System.currentTimeMillis
val df = sqlContext.read.parquet(dir).select(...).cache
val elapsed =
Hi! I'm using Spark + Kinesis ASL to process and persist stream data to
ElasticSearch. For the most part it works nicely.
There is a subtle issue I'm running into about how failures are handled.
For example's sake, let's say I am processing a Kinesis stream that produces
400 records per second,
I wanted to use GraphX from SparkR; is there a way to do it? I think as
of now it is not possible. I was thinking one could write a wrapper in R
that can call the Scala GraphX libraries.
Any thoughts on this, please?
I'm really sorry; by mistake I posted to the Spark mailing list.
Jörn Franke, thanks for your reply.
I have many joins and many complex queries, and all are table scans. So I think
HBase will not work for me.
On Thursday, August 6, 2015, Jörn Franke jornfra...@gmail.com wrote:
Additionally it is of key
Welcome aboard!
Thanks
Best Regards
On Thu, Aug 6, 2015 at 11:21 AM, Franc Carter franc.car...@rozettatech.com
wrote:
subscribe
See http://spark.apache.org/community.html
Cheers
On Aug 5, 2015, at 10:51 PM, Franc Carter franc.car...@rozettatech.com
wrote:
subscribe
Thanks a lot Igor,
the following hashCode function is stable:
@Override
public int hashCode() {
  int hash = 5;
  hash = 41 * hash + this.myEnum.ordinal();
  return hash;
}
For anyone having the same problem:
Hi - yes, it's great that you wrote it yourself; it means you have more
control. I have the feeling that the most efficient place to discard as
much data as possible - or even to modify your subscription protocol so that
your Spark input source does not even receive the other 50 seconds of data - is the
Re-reading your description - I guess you could potentially make your input
source connect for 10 seconds, pause for 50, and then reconnect.
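A rough sketch of that duty cycle as a custom receiver; connect() and the
read loop are placeholders of mine, not from the thread:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Receive for 10s out of every 60s, then stay silent for the remaining 50s.
class DutyCycleReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  override def onStart(): Unit = {
    new Thread("duty-cycle-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          val conn = connect()                     // hypothetical connection
          val deadline = System.currentTimeMillis + 10000
          while (System.currentTimeMillis < deadline && !isStopped()) {
            store(conn.readLine())                 // hypothetical read
          }
          conn.close()
          Thread.sleep(50000)                      // pause for the other 50 seconds
        }
      }
    }.start()
  }
  override def onStop(): Unit = {}                 // the loop above checks isStopped()
  private def connect(): java.io.BufferedReader = ??? // placeholder
}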
On Thu, Aug 6, 2015 at 10:32 AM, Dimitris Kouzis - Loukas look...@gmail.com
wrote:
Hi, - yes - it's great that you wrote it yourself - it means you
Hi team,
I am very new to Spark; actually, today is my first day.
I have a nested JSON string which contains a timestamp and lots of
other details. I have JSON messages from which I need to write multiple
aggregations, but for now I need to write one aggregation. If code
You just pasted your Twitter credentials; consider changing them. :/
Thanks
Best Regards
On Wed, Aug 5, 2015 at 10:07 PM, narendra narencs...@gmail.com wrote:
Thanks Akash for the answer. I added endpoint to the listener and now it is
working.
I built spark from the v1.5.0-snapshot-20150803 tag in the repo and tried
again.
The initialization time is about 1 minute now, which is still pretty
terrible.
On Wed, Aug 5, 2015 at 9:08 PM, Philip Weaver philip.wea...@gmail.com
wrote:
Absolutely, thanks!
On Wed, Aug 5, 2015 at 9:07 PM,
Hi,
Is there a way to instantiate multiple Thrift servers on one Spark Cluster?
Hi everyone,
I have been working with Spark for a little while now and have encountered a very
strange behaviour that caused me a lot of headaches:
I have written my own POJOs to encapsulate my data and this data is held in
some JavaRDDs. Part of these POJOs is a member variable of a custom enum
type.
An enum's hashCode is JVM-instance specific (i.e. different JVMs will give you
different values), so you can use the ordinal in the hashCode computation, or use
the hashCode of the enum's ordinal as part of the hashCode computation.
On 6 August 2015 at 11:41, Warfish sebastian.ka...@gmail.com wrote:
Hi everyone,
I was
Thanks. Repartitioning to a smaller number of partitions seems to fix my
issue, but I'll keep broadcasting in mind (droprows is an integer array
with about 4 million entries).
On Wed, Aug 5, 2015 at 12:34 PM, Philip Weaver philip.wea...@gmail.com
wrote:
How big is droprows?
Try explicitly
Hello,
I am using the Python API to perform a grid search and train models using
LogisticRegressionWithSGD.
I am using r3.xl machines in EC2, running on top of YARN in cluster mode.
The training RDD is persisted in memory and on disk. Some of the models
train successfully, but then at some
I have a set of data based on which I want to create a classification
model. Each row has the following form:
user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1
etc
There are about 1M users, 2 classes, and 1M
Hello, everyone!
I have a case when running a standalone cluster: when
stop-all.sh/start-all.sh are invoked on the master, the streaming app loses all its
executors but does not interrupt.
Since it is a streaming app, expected to deliver its results ASAP,
downtime is undesirable.
Is there any workaround
Would you mind providing the driver log?
On 8/6/15 3:58 PM, Philip Weaver wrote:
I built spark from the v1.5.0-snapshot-20150803 tag in the repo and
tried again.
The initialization time is about 1 minute now, which is still pretty
terrible.
On Wed, Aug 5, 2015 at 9:08 PM, Philip Weaver