What about /blah/*/blah/out*.avro?
On 27 May 2015 18:08, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I am doing that now.
Is there no other way ?
On Wed, May 27, 2015 at 12:40 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
How about creating two and union [ sc.union(first, second) ] them?
Try something like that:
def readGenericRecords(sc: SparkContext, inputDir: String, startDate:
Date, endDate: Date) = {
// assuming a list of paths
val paths: Seq[String] = getInputPaths(inputDir, startDate, endDate)
val job = Job.getInstance(new
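For reference, a rough self-contained sketch of the union idea (untested; the second path and the imports are just an example, not the original code):
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
// read each day separately, then union the two RDDs
val first = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]]("/A/B/C/D/D/2015/05/21/out-r-*.avro")
val second = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]]("/A/B/C/D/D/2015/05/22/out-r-*.avro")
val both = sc.union(first, second)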
You can request number of cores and amount of memory for each executor.
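For example, something like this in the application's SparkConf (values are placeholders; the same settings can also be passed as --conf flags to spark-submit):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.executor.memory", "4g") // memory per executor
  .set("spark.executor.cores", "2")   // cores per executor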
On 27 May 2015 18:25, canan chen ccn...@gmail.com wrote:
Thanks Arush.
My scenario is that in standalone mode, if I have one worker, when I start
spark-shell, there will be one executor launched. But if I have 2 workers,
def readGenericRecords(sc: SparkContext, inputDir: String, startDate: Date, endDate: Date) = {
  val path = getInputPaths(inputDir, startDate, endDate)
  sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
    AvroKeyInputFormat[GenericRecord]]("/A/B/C/D/D/2015/05/22/out-r-*.avro")
}
Thanks Arush.
My scenario is that in standalone mode, if I have one worker, when I start
spark-shell, there will be one executor launched. But if I have 2 workers,
there will be 2 executors launched, so I am wondering about the mechanism of
executor allocation.
Is it possible to specify how many
Till 1.3, you have to prepare the DF appropriately
def setupCondition(t):
    if t[1] > 100:
        v = 1
    else:
        v = 0
    return Row(col1=t[0], col2=t[1], col3=t[2], col4=v)

d1 = [[1001,100,50],[1001,200,100],[1002,100,99]]
d1RDD = sc.parallelize(d1).map(setupCondition)
d1DF =
Can someone share some code to use CombineInputFormat to read Avro data?
Today I use
def readGenericRecords(sc: SparkContext, inputDir: String, startDate: Date,
endDate: Date) = {
val path = getInputPaths(inputDir, startDate, endDate)
sc.newAPIHadoopFile[AvroKey[GenericRecord],
Hi Akhil and all
My previous code has some problems: all the executors are looping and
running the same command. That's not what I am expecting. Previous code:
val shellcompare = List("run", "sort.sh")
val shellcompareRDD = sc.makeRDD(shellcompare)
val result = List("aggregate", "result")
You can do that using FileInputFormat.addInputPath
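A rough sketch of that approach (untested; the paths are made up and the Avro imports are the same as in the earlier snippets):
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
val job = Job.getInstance(sc.hadoopConfiguration)
FileInputFormat.addInputPath(job, new Path("/A/B/C/D/D/2015/05/21"))
FileInputFormat.addInputPath(job, new Path("/A/B/C/D/D/2015/05/22"))
val records = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])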
2015-05-27 10:41 GMT+02:00 ayan guha guha.a...@gmail.com:
What about /blah/*/blah/out*.avro?
On 27 May 2015 18:08, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I am doing that now.
Is there no other way ?
On Wed, May 27, 2015 at 12:40 PM,
No one using History server? :)
Am I the only one who needs to see all users' logs?
Jianshi
On Thu, May 21, 2015 at 1:29 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I'm using Spark 1.4.0-rc1 and I'm using default settings for history
server.
But I can only see my own logs. Is it
Name: Objectway.com
URL: www.objectway.com
Description:
We're building a Big Data solution on Spark. We use Apache Flume for the
parallel message queuing infrastructure and Apache Spark Streaming for
near-real-time data stream processing, combined with a rule engine for
catching complex events.
Like you, there are lots of people coming from the MapReduce world trying to
understand the internals of Spark. Hope the notes below help in some way.
End users only have the concept of a Job: "I want to run a word count job on
this one big file" -- that is the job I want to run. How many
Evan,
could you please look into this post. Below is the link. Any thoughts or
suggestions are really appreciated.
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-partition-issue-with-Stanford-NLP-td23048.html
Thanks Yana,
My current experience here is that after running some small spark-submit-based
tests, the Master once again stopped being reachable. No change in
the test setup. I restarted the Master/Worker and it is still not reachable.
What might be the variables here in which association with the
Dear Spark users,
Given a DataFrame df with a column named "foo bar", I can call a Spark SQL
built-in function on it like so:
df.select(functions.max(df("foo bar")))
However, if I want to apply a Hive UDF named myCustomFunction, I need to
write
df.selectExpr("myCustomFunction(`foo bar`)")
which
Hi Zhipeng,
Yes, your understanding is correct. The SER portion just refers to how
it's stored in memory. On disk, the data always has to be serialized.
On Fri, May 22, 2015 at 10:40 PM, Jiang, Zhipeng zhipeng.ji...@intel.com
wrote:
Hi Todd, Howard,
Thanks for your reply, I might not
You may be the only one not seeing all the logs. Are you sure all the users
are writing to the same log directory? The HS can only read from a single
log directory.
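In case it helps, the settings involved look roughly like this (the HDFS path is a placeholder): every application writes its event log to the one directory the history server reads.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///shared/spark-logs")
// and on the history server side:
//   spark.history.fs.logDirectory = hdfs:///shared/spark-logs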
On Wed, May 27, 2015 at 5:33 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
No one using History server? :)
Am I the only one
Yes, all written to the same directory on HDFS.
Jianshi
On Wed, May 27, 2015 at 11:57 PM, Marcelo Vanzin van...@cloudera.com
wrote:
You may be the only one not seeing all the logs. Are you sure all the
users are writing to the same log directory? The HS can only read from a
single log
Then:
- Are all files readable by the user running the history server?
- Did all applications call sc.stop() correctly (i.e. files do not have the
.inprogress suffix)?
Other than that, always look at the logs first, looking for any errors that
may be thrown.
On Wed, May 27, 2015 at 9:10 AM,
I have not run into this particular issue but I'm not using latest bits in
production. However, testing your theory should be easy -- MySQL is just a
database, so you should be able to use a regular mysql client and see how
many connections are active. You can then compare to the maximum allowed
Hi all
I believe I have created a multi-label classifier using LogisticRegression
but there is one snag. No matter what features I use to get the prediction,
it will always return the label. I feel like I need to set a threshold but
can't seem to figure out how to do that. I attached the code
Hi.
I think that it's possible to do:
df.select($"*", lit(null).as("col17"), lit(null).as("col18"),
lit(null).as("col19"), ..., lit(null).as("col26"))
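A sketch of the same idea that pads the missing columns programmatically instead of listing them by hand (untested; the cast to string is an assumption, you may need the real column types so unionAll lines up):
import org.apache.spark.sql.functions.lit
val missing = df2.columns.diff(df1.columns)
val df1Padded = missing.foldLeft(df1)((df, c) => df.withColumn(c, lit(null).cast("string")))
// column order must also match df2 before unionAll
val result = df1Padded.select(df2.columns.map(df1Padded(_)): _*).unionAll(df2)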
Any other advice?
Miguel.
On Wed, May 27, 2015 at 5:02 PM, Masf masfwo...@gmail.com wrote:
Hi.
I have a DataFrame with 16 columns (df1) and
I'm wondering how logging works in Spark.
I see that there's the log4j.properties.template file in the conf directory.
Safe to assume Spark is using log4j 1? What's the approach if we're using
log4j 2? I've got a log4j2.xml file in the job jar which seems to be
working for my log statements
I'm new to Spark and I'm getting bad performance with classification
methods on Spark MLlib (worse than R in terms of AUC).
I am trying to put my own parameters rather than the default parameters.
Here is the method I want to use :
train(RDD
I'm new to Spark and I'm getting bad performance with classification methods
on Spark MLlib (worse than R in terms of AUC).
I am trying to put my own parameters rather than the default parameters.
Here is the method I want to use :
train(RDD<LabeledPoint> input,
int numIterations,
Hi.
I have a DataFrame with 16 columns (df1) and another with 26 columns (df2).
I want to do a UnionAll. So, I want to add 10 columns to df1 in order to
have the same number of columns in both dataframes.
Is there some alternative to withColumn?
Thanks
--
Regards.
Miguel Ángel
only an answer to one of your questions:
What about log statements in the partition processing functions? Will their
log statements get logged into a file residing on a given 'slave' machine, or
will Spark capture this log output and divert it into the log file of the
driver's machine?
A colleague and I were having a discussion and we were disagreeing about
something in Spark/Mesos that perhaps someone can shed some light on.
We have a Mesos cluster that runs Spark via a sparkHome, rather than
downloading an executable and such.
My colleague says that say we have parquet
Hi all,
I am setting up a Spark standalone server with an external Hive metastore
(using MySQL); there is an issue that after 5 minutes of inactivity, if I try
to reconnect to the metastore (i.e. by executing a new query), it hangs for
about 10 mins and then times out. My guess is that DataNucleus does
I am facing a similar issue. When the job runs with partitions num of cores,
sometimes the executors are getting lost and the job doesn't complete.
Is there any additional logging that can be turned on to see the exact cause
of this issue?
Hi There,
I am trying to process millions of records with Spark/Scala integrated with
Stanford NLP (3.4.1).
Since I am using social media data, I have to use NLP for theme generation
(POS tagging) and sentiment calculation.
I have to deal with Twitter data and non-Twitter data separately. So I
Hi All,
I'm new to Spark and I'd like some help understanding if a particular use
case would be a good fit for Spark Streaming.
I have an imaginary stream of sensor data consisting of integers 1-10.
Every time the sensor reads 10 I'd like to average all the numbers that
were received since the
It looks like you are training each model i (for label i) by only using
data with label i. You need to use all of your data to train each model so
the models can compare each label i with the other labels (roughly
speaking).
However, what you're doing is multiclass (not multilabel)
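For what it's worth, a minimal sketch of a single multiclass model trained on all of the data (the helper and variable names here are made up):
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
def trainMulticlass(training: RDD[LabeledPoint], numClasses: Int) =
  new LogisticRegressionWithLBFGS()
    .setNumClasses(numClasses) // one model that scores every label
    .run(training)             // training must contain examples of every label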
Can you tell us a bit more about (schema of) your JSON ?
You can find sample JSON
in sql/core/src/test//scala/org/apache/spark/sql/json/TestJsonData.scala
Cheers
On Wed, May 27, 2015 at 12:33 PM, Michael Stone mst...@mathom.us wrote:
Can anyone provide some suggestions on how to debug this?
Here is example after git clone-ing latest 1.4.0-SNAPSHOT. The first 3
runs (FINISHED) were successful and connected quickly. Fourth run (ALIVE)
is failing on connection/association.
URL: spark://mellyrn.local:7077
REST URL: spark://mellyrn.local:6066 (cluster mode)
Workers: 1
Cores: 8 Total,
Can anyone provide some suggestions on how to debug this? Using Spark
1.3.1. The JSON itself seems to be valid (other programs can parse it),
and the problem seems to lie in jsonRDD trying to infer a
schema.
scala> sqlContext.jsonRDD(rdd).count()
java.util.NoSuchElementException:
Looks like the exception was caused by resolved.get(prefix ++ a) returning
None :
a = StructField(a.head, resolved.get(prefix ++ a).get, nullable = true)
There are three occurrences of resolved.get() in createSchema() - None
should be better handled in these places.
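For example, a more defensive variant inside that createSchema code might look something like this (just a sketch; the error message is made up):
val fieldType = resolved.getOrElse(prefix ++ a,
  sys.error(s"Could not resolve a type for field '${a.mkString(".")}'"))
val field = StructField(a.head, fieldType, nullable = true)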
My two cents.
On
The model is learned using an iterative convex optimization algorithm.
numIterations, stepSize and miniBatchFraction are parameters of that
algorithm; you can see details here:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer
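For example, the Scala overload that takes them explicitly (the values below are placeholders to tune, not recommendations):
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD
def fit(data: RDD[LabeledPoint]) =
  LinearRegressionWithSGD.train(
    data,
    200,   // numIterations
    0.01,  // stepSize (learning rate)
    1.0)   // miniBatchFraction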
On Wed, May 27, 2015 at 01:13:43PM -0700, Ted Yu wrote:
Can you tell us a bit more about (schema of) your JSON ?
It's fairly simple, consisting of 22 fields with values that are mostly
strings or integers, except that some of the fields are objects
with http header/value pairs. I'd guess it's
http://www.meetup.com/Bay-Area-Stream-Processing/events/219086133/
Thursday, June 4, 2015
6:45 PM
TubeMogul
http://maps.google.com/maps?f=qhl=enq=1250+53rd%2C+Emeryville%2C+CA%2C+94608%2C+us
1250 53rd
St #1
Emeryville, CA
6:45PM to 7:00PM - Socializing
7:00PM to 8:00PM - Talks
8:00PM to
After submitting the job, if you do a ps aux | grep spark-submit then you
can see all JVM params. Are you using the highlevel consumer (receiver
based) for receiving data from Kafka? In that case if your throughput is
high and the processing delay exceeds batch interval then you will hit this
How about creating two and union [ sc.union(first, second) ] them?
Thanks
Best Regards
On Wed, May 27, 2015 at 11:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I have this piece
sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
AvroKeyInputFormat[GenericRecord]](
It depends on how you use Spark. If you use Spark with YARN and enable
dynamic allocation, the number of executors is not fixed; it will change
dynamically according to the load.
Thanks
Jerry
2015-05-27 14:44 GMT+08:00 canan chen ccn...@gmail.com:
It seems the executor number is fixed for the
How does dynamic allocation work? I mean, is it related to the
parallelism of my RDD, and how does the driver know how many executors it
needs?
On Wed, May 27, 2015 at 2:49 PM, Saisai Shao sai.sai.s...@gmail.com wrote:
It depends on how you use Spark, if you use Spark with Yarn and enable
Yes. I think that your sql solution will work but I was looking for a
solution with DataFrame API. I had thought to use UDF such as:
val myFunc = udf { (input: Int) => if (input > 100) 1 else 0 }
Although I'd like to know if it's possible to do it directly in the
aggregation inserting a lambda
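Applying that UDF with the DataFrame API could look roughly like this (a sketch; df and the column names are made up):
import org.apache.spark.sql.functions.udf
val myFunc = udf { (input: Int) => if (input > 100) 1 else 0 }
val withFlag = df.withColumn("col4", myFunc(df("col2")))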
You can throttle the no receiver direct Kafka stream using
spark.streaming.kafka.maxRatePerPartition
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
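For example (the value is a placeholder):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  // upper bound, in messages per second per Kafka partition, for the direct stream
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")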
On Wed, May 27, 2015 at 4:34 PM, Ted Yu yuzhih...@gmail.com wrote:
Have you seen
Thanks Yong, this is very helpful. I also found ShuffleMemoryManager, which is
used to allocate memory across tasks in one executor.
These 2 tasks have to share the 2G heap memory. I don't think specifying
the memory per task is a good idea, as tasks run at the thread level,
and memory only
Could you try to see which phase is causing the hang? i.e. if you do a
count() after flatMap, does that work correctly? My guess is that the hang
is somehow related to data not fitting in the R process memory, but it's hard
to say without more diagnostic information.
Thanks
Shivaram
On Tue, May
Got it, thank you, Tathagata and Ted.
Could you comment on my other question
http://apache-spark-user-list.1001560.n3.nabble.com/Autoscaling-Spark-cluster-based-on-topic-sizes-rate-of-growth-in-Kafka-or-Spark-s-metrics-tt23062.html
as well? Basically, I'm trying to get a handle on a good
My executor has the following spec (lscpu):
CPU(s): 16
Core(s) per socket: 4
Socket(s): 2
Thread(s) per core: 2
The CPU count is obviously 4*2*2 = 16. My question is what value is Spark
expecting in SPARK_EXECUTOR_CORES ? The CPU count (16) or total # of cores
(2 * 2 = 4) ?
Thanks
What does your master log say -- normally the master should NEVER shut
down...you should be able to spark-submit to infinity with no issues...So
the question about high variance on upstart is one issue, but the other
thing that's puzzling to me is why your master is ever down to begin
Thanks for the information. I'll try that out with Spark 1.4.
On Thu, May 28, 2015 at 9:54 AM, DB Tsai dbt...@dbtsai.com wrote:
LinearRegressionWithSGD requires tuning the step size and # of iterations
very carefully. Please try the Linear Regression with elastic net
implementation in Spark 1.4
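For reference, a small sketch of that 1.4 ML API (parameter values are placeholders; training is assumed to be a DataFrame with "label" and "features" columns):
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression()
  .setMaxIter(100)
  .setRegParam(0.1)
  .setElasticNetParam(0.5) // 0.0 = L2 (ridge), 1.0 = L1 (lasso)
val model = lr.fit(training)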
Have you seen
http://stackoverflow.com/questions/29051579/pausing-throttling-spark-spark-streaming-application
?
Cheers
On Wed, May 27, 2015 at 4:11 PM, dgoldenberg dgoldenberg...@gmail.com
wrote:
Hi,
With the no receivers approach to streaming from Kafka, is there a way to
set something
Yes, you are on the right path. The only thing to remember is placing the
hive-site.xml in the correct path so Spark can talk to the Hive metastore.
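A minimal sketch of what that looks like once hive-site.xml is picked up (table and column names are made up):
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
val joined = hiveContext.sql(
  "SELECT a.id, b.name FROM mydb.table_a a JOIN mydb.table_b b ON a.id = b.id")
joined.show()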
Best
Ayan
On 28 May 2015 10:53, Sanjay Subramanian
sanjaysubraman...@yahoo.com.invalid wrote:
hey guys
On the Hive/Hadoop ecosystem we have using Cloudera
bq. detect the presence of a new node and start utilizing it
My understanding is that Spark is concerned with managing executors.
Whether request for an executor is fulfilled on an existing node or a new
node is up to the underlying cluster manager (YARN e.g.).
Assuming the cluster is single
Sorry. I mean the parameter step.
2015-05-28 12:21 GMT+08:00 Maheshakya Wijewardena mahesha...@wso2.com:
What is the parameter for the learning rate alpha? LinearRegressionWithSGD
has only the following parameters.
@param data: The training data.
@param iterations: The
Yes, but be sure you put hive-site.xml on your classpath.
Did you run into any problems?
Cheng Hao
From: Sanjay Subramanian [mailto:sanjaysubraman...@yahoo.com.INVALID]
Sent: Thursday, May 28, 2015 8:53 AM
To: user
Subject: Pointing SparkSQL to existing Hive Metadata with data file locations
in
Hi,
I'm trying to use Spark's *LinearRegressionWithSGD* in PySpark with the
attached dataset. The code is attached. When I check the model weights
vector after training, it contains `nan` values.
[nan,nan,nan,nan,nan,nan,nan,nan]
But for some data sets, this problem does not occur. What might
LinearRegressionWithSGD requires tuning the step size and # of iterations
very carefully. Please try the Linear Regression with elastic net
implementation in Spark 1.4's ML framework, which uses a quasi-Newton method
and determines the step size automatically. That
implementation also matches the
Hi Joseph,
I looked at that but it seems that LogisticRegressionWithLBFGS's run
method takes RDD[LabeledPoint] objects so I'm not sure it's exactly
how one would use it in the way I think you're describing
On Wed, May 27, 2015 at 4:04 PM, Joseph Bradley jos...@databricks.com wrote:
It looks
hey guys
On the Hive/Hadoop ecosystem we have, using Cloudera distribution CDH 5.2.x,
there are about 300+ Hive tables. The data is stored as text (moving slowly to
Parquet) on HDFS. I want to use SparkSQL, point to the Hive metadata, and be
able to define JOINs etc. using a programming
Did you check https://drive.google.com/file/d/0B7tmGAdbfMI2OXl6azYySk5iTGM/edit
and
http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
Not sure if the spark kafka receiver emits metrics on the lag, check this link
out
Hi,
Yes, I'm using createStream, but the storageLevel param is by default
MEMORY_AND_DISK_SER_2. Besides, the driver's memory is also growing. I
don't think Kafka messages will be cached in driver.
On Thu, May 28, 2015 at 12:24 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Are you using the
Hi,
With the no receivers approach to streaming from Kafka, is there a way to
set something like spark.streaming.receiver.maxRate so as not to overwhelm
the Spark consumers?
What would be some of the ways to throttle the streamed messages so that the
consumers don't run out of memory?
--
Hi,
I'm trying to understand if there are design patterns for autoscaling Spark
(add/remove slave machines to the cluster) based on the throughput.
Assuming we can throttle Spark consumers, the respective Kafka topics we
stream data from would start growing. What are some of the ways to
The driver has a heuristic mechanism to decide the number of executors at
runtime according to the pending tasks. You can enable it with
configuration; refer to the Spark documentation to find the details.
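The settings involved look roughly like this (values are placeholders; on YARN the external shuffle service also has to be enabled on the NodeManagers):
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")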
2015-05-27 15:00 GMT+08:00 canan chen ccn...@gmail.com:
How does the dynamic allocation
If with Mesos, how do we control the number of executors? In our cluster,
each node only has one executor with a very big JVM. Sometimes, if the
executor dies, all the concurrently running tasks are gone. We would like
to have multiple executors on one node but cannot figure out a way to do
it in
Typo. We cannot figure out a way to increase the number of executors in one
node in Mesos.
On Wednesday, May 27, 2015, DB Tsai dbt...@dbtsai.com wrote:
If with mesos, how do we control the number of executors? In our cluster,
each node only has one executor with very big JVM. Sometimes, if the
Hi Akhil,
Thanks for your reply. According to the Streaming tab of the Web UI, the
Processing Time is around 400ms, and there's no Scheduling Delay, so I
suppose it's not the Kafka messages that eat up the off-heap memory. Or
maybe it is, but how to tell?
I googled about how to check the off-heap
I'm not sure about Mesos; maybe someone who has Mesos experience can help
answer this.
Thanks
Jerry
2015-05-27 15:21 GMT+08:00 DB Tsai dbt...@dbtsai.com:
Typo. We cannot figure out a way to increase the number of executors in one
node in Mesos.
On Wednesday, May 27, 2015, DB Tsai
Can anyone answer my question? I am curious to know, if there are
multiple reducer tasks in one executor, how memory is allocated between
these reducer tasks, since each shuffle will consume a lot of memory.
On Tue, May 26, 2015 at 7:27 PM, Evo Eftimov evo.efti...@isecc.com wrote:
the link
Hello,
I am trying out 1.4.0 and notice there are some differences in behavior
with Timestamp between 1.3.1 and 1.4.0.
In 1.3.1, I can compare a Timestamp with string.
scala> val df = sqlContext.createDataFrame(Seq((1,
Timestamp.valueOf("2015-01-01 00:00:00")), (2,
Timestamp.valueOf("2014-01-01
I am doing that now.
Is there no other way ?
On Wed, May 27, 2015 at 12:40 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
How about creating two and union [ sc.union(first, second) ] them?
Thanks
Best Regards
On Wed, May 27, 2015 at 11:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
wrote:
I