Hey all,
In working with the DecisionTree classifier, I found it difficult to extract
rules in a form that could easily be visualized with libraries like D3.
For example, using print(model.toDebugString()), I get the following
result:
If (feature 0 <= -35.0)
 If (feature 24 <= 176.0)
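For what it's worth, a hedged Scala sketch of scraping (featureIndex, threshold)
pairs out of such a debug string so they can be serialized for D3 — not an
official API, and it assumes the "If (feature N <= X)" line format shown above
(model is a trained DecisionTreeModel):
```
// Extract (featureIndex, threshold) pairs from toDebugString output.
// Categorical splits ("feature N in {...}") simply won't match the regex.
val splitRe = """If \(feature (\d+) <= (.+)\)""".r
val rules = model.toDebugString.split("\n").map(_.trim).collect {
  case splitRe(feature, threshold) => (feature.toInt, threshold.toDouble)
}
rules.foreach(println)  // e.g. (0,-35.0), (24,176.0)
```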
Interestingly, if there is nothing running on the dev spark-shell, it recovers
successfully and regains the lost executors. Attaching the log for that.
Notice the "Registering block manager ..." statements at the very end, after
all executors were lost.
On Wed, Aug 26, 2015 at 11:27 AM, Sadhan Sood
I'd be less concerned about what the streaming ui shows than what's
actually going on with the job. When you say you were losing messages, how
were you observing that? The UI, or actual job output?
The log lines you posted indicate that the checkpoint was restored and
those offsets were
Using TABLESAMPLE(0.1) is actually way worse. Spark first spends 12 minutes
to discover all split files on all hosts (for some reason) before it even
starts the job, and then it creates 3.5 million tasks (the partition has
~32k split files).
On Wed, Aug 26, 2015 at 9:36 AM, Jörn Franke
Hi,
I'm trying to query a Hive table that is based on Avro in Spark SQL and am
seeing the errors below.
15/08/26 17:51:12 WARN avro.AvroSerdeUtils: Encountered AvroSerdeException
determining schema. Returning signal schema to indicate problem
org.apache.hadoop.hive.serde2.avro.AvroSerdeException:
I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from
and I don't particularly care which rows. Doing a LIMIT unfortunately
results in two stages where the first stage reads the whole table, and the
second then performs the limit with a single worker, which is not very
Have you tried TABLESAMPLE? You'll find the exact syntax in the documentation,
but it does exactly what you want.
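For example, a hedged sketch assuming a HiveContext; "big_table" and the
0.1 PERCENT figure are placeholders, and whether the sampling is actually
pushed into the scan depends on the input format (as Thomas's later message
in this thread shows):
```
// Percent-based Hive sampling, issued through Spark SQL.
val sample = sqlContext.sql(
  "SELECT * FROM big_table TABLESAMPLE(0.1 PERCENT) s")
sample.count()
```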
On Wed, Aug 26, 2015 at 6:12 PM, Thomas Dudziak tom...@gmail.com wrote:
Sorry, I meant without reading from all splits. This is a single partition
in the table.
On Wed, Aug 26,
spark-shell-hang-on-exit.tdump
http://apache-spark-user-list.1001560.n3.nabble.com/file/n24461/spark-shell-hang-on-exit.tdump
Thanks for the suggestions! I tried the following:
I removed
createOnError = true
And reran the same process to reproduce. Double checked that checkpoint is
loading:
15/08/26 10:10:40 INFO
DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData: Restoring
KafkaRDD for time 1440608825000
Compared offsets, and it continues from checkpoint loading:
15/08/26 11:24:54 INFO
DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData: Restoring
KafkaRDD for time 1440612035000 ms [(install-json,5,826112083,826112446),
(install-json,4,825772921,825773536),
Attaching log for when the dev job gets stuck (once all its executors are
lost due to preemption). This is a spark-shell job running in yarn-client
mode.
On Wed, Aug 26, 2015 at 10:45 AM, Sadhan Sood sadhan.s...@gmail.com wrote:
Hi All,
We've set up our spark cluster on aws running on yarn
Hi, Guys,
Is it possible for an RDD created by driver A to be used by driver B?
Thanks!
Thanks Sonal.. I shall try doing that..
On 26-Aug-2015, at 1:05 pm, Sonal Goyal sonalgoy...@gmail.com wrote:
You can try using wholeTextFiles, which will give you a pair RDD of (fileName,
content). flatMap through this and manipulate the content.
Best Regards,
Sonal
Founder, Nube
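A minimal sketch of the wholeTextFiles approach Sonal suggests above, assuming
a hypothetical HDFS input directory:
```
// wholeTextFiles yields an RDD[(fileName, fileContent)]; flatMap lets you
// emit any number of records per file while keeping the file name around.
val files = sc.wholeTextFiles("hdfs:///data/input")
val records = files.flatMap { case (fileName, content) =>
  content.split("\n").map(line => (fileName, line))
}
```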
Note:
In the code (org.apache.spark.sql.parquet.DefaultSource) I've found this:
val relation = if (doInsertion) {
  // This is a hack. We always set nullable/containsNull/valueContainsNull
  // to true for the schema of a parquet data.
  val df =
    sqlContext.createDataFrame(
Hi,
I have applied mapToPair and then a reduceByKey on a DStream to obtain a
JavaPairDStream<String, Map<String, Object>>.
I have to apply a flatMapToPair and reduceByKey on the DStream obtained
above.
But I do not see any logs from the reduceByKey operation.
Can anyone explain why this is happening?
I am using Tachyon in the Spark program below, but I encounter a
BlockNotFoundException.
Does someone know what's wrong, and is there a guide on how to configure
Spark to work with Tachyon? Thanks!
conf.set("spark.externalBlockStore.url", "tachyon://10.18.19.33:19998")
A first use case of gap constraint is included in the article.
Another application would be customer-shopping sequence analysis where you
want to put a constraint on the duration between two purchases for them to
be considered as a pertinent sequence.
Additional question regarding the code:
Sometime back I was playing with Spark and Tachyon and I also found this
issue. The issue here is that TachyonBlockManager puts the blocks with the
WriteType.TRY_CACHE configuration. Because of this, blocks are evicted
from the Tachyon cache when memory is full, and when Spark tries to find the
block it throws
I believe that is done explicitly while the final API is being figured out.
For the moment you could use DataFrame read.df()
From: csgilles...@gmail.com
Date: Tue, 25 Aug 2015 18:26:50 +0100
Subject: SparkR: exported functions
To: user@spark.apache.org
Hi,
I've just started playing
The URL seems to have changed .. here is the one ..
http://tachyon-project.org/documentation/Tiered-Storage-on-Tachyon.html
On Wed, Aug 26, 2015 at 12:32 PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Sometime back I was playing with Spark and Tachyon and I also found this
ReversedPrefix is used because Scala's List is a linked list, which has
constant-time prepend to the head but linear-time append to the tail.
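A tiny illustration of the cost difference being described:
```
// Prepending to an immutable List is O(1); appending is O(n) because the
// whole spine must be copied. So the prefix is accumulated in reverse and
// flipped once at the end.
val reversedPrefix = List(3, 2, 1)    // built by repeated prepending
val extended = 4 :: reversedPrefix    // O(1)
val prefix = extended.reverse         // one O(n) pass when finished
```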
I'm aware that there are use cases for the gap constraints. My question was
more about whether any users of Spark/MLlib have an immediate application
for these
Hi All,
We've set up our spark cluster on aws running on yarn (running on hadoop
2.3) with fair scheduling and preemption turned on. The cluster is shared
for prod and dev work where prod runs with a higher fair share and can
preempt dev jobs if there are not enough resources available for it.
It
Yes
On Wed, Aug 26, 2015 at 10:23 AM, Chen Song chen.song...@gmail.com wrote:
Thanks Cody.
Are you suggesting to put the cache in global context in each executor
JVM, in a Scala object for example. Then have a scheduled task to refresh
the cache (or triggered by the expiry if Guava)?
Chen
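A hedged sketch of the executor-local cache pattern Chen describes (and Cody
confirms above): a singleton object initialized once per executor JVM, backed
by Guava. loadFromStore is a hypothetical fetch function and the 10-minute
expiry is arbitrary:
```
import java.util.concurrent.TimeUnit
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

// The lazy val is initialized once per executor JVM. expireAfterWrite makes
// Guava reload an entry (via the CacheLoader) on first access after expiry.
object ExecutorCache {
  lazy val cache: LoadingCache[String, String] =
    CacheBuilder.newBuilder()
      .expireAfterWrite(10, TimeUnit.MINUTES)
      .build(new CacheLoader[String, String] {
        override def load(key: String): String = loadFromStore(key)
      })

  def loadFromStore(key: String): String = ??? // hypothetical external fetch
}
```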
What's the default buffer in Spark Streaming 1.3 for Kafka messages?
Say in this run it has to fetch messages from offset 1 to 1; will it
fetch them all in one go, or does it internally fetch messages in smaller
batches?
Is there any setting to configure the number of offsets fetched in one batch?
Can you provide a bit more information?
Do the Spark artifacts packaged by you have the same names / paths (in the
maven repo) as the ones published by Apache Spark?
Is Zinc running on the machine where you performed the build ?
Cheers
On Wed, Aug 26, 2015 at 7:56 AM, Muhammad Haseeb Javed
Thanks Davies. HiveContext seems neat to use :)
On Thu, Aug 20, 2015 at 3:02 PM, Davies Liu dav...@databricks.com wrote:
As Aram said, there are two options in Spark 1.4:
1) Use the HiveContext; then you get datediff from Hive:
df.selectExpr("datediff(d2, d1)")
2) Use Python UDF:
```
from
see http://kafka.apache.org/documentation.html#consumerconfigs
fetch.message.max.bytes
in the kafka params passed to the constructor
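A minimal sketch of where that setting goes, assuming ssc is an existing
StreamingContext; the broker address, topic, and 1 MB figure are placeholders:
```
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// fetch.message.max.bytes is a plain Kafka consumer config; the direct
// stream passes it straight through to the underlying simple consumer.
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",
  "fetch.message.max.bytes" -> "1048576")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("install-json"))
```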
On Wed, Aug 26, 2015 at 10:39 AM, Shushant Arora shushantaror...@gmail.com
wrote:
What's the default buffer in Spark Streaming 1.3 for Kafka messages?
Say in
Sorry, I meant without reading from all splits. This is a single partition
in the table.
On Wed, Aug 26, 2015 at 8:53 AM, Thomas Dudziak tom...@gmail.com wrote:
I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from
and I don't particularly care which rows. Doing a LIMIT
I want to persist a large _sorted_ table to Parquet on S3 and then read
this in and join it using the Sorted Merge Join strategy against another
large sorted table.
The problem is: even though I sort these tables on the join key beforehand,
once I persist them to Parquet, they lose the
Thanks for your response Yana,
I can increase the MaxPermSize parameter and it will allow me to run the
unit test a few more times before I run out of memory.
However, the primary issue is that running the same unit test in the same
JVM (multiple times) results in increased memory (each run of
Dear All,
I have just released version 1.0.4 of the Low Level Receiver based
Kafka-Spark-Consumer on spark-packages.org. You can find the latest
release here:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Here is github location :
I would use Sqoop. It has been designed exactly for these types of
scenarios. Spark streaming does not make sense here
On Sun, Jul 5, 2015 at 1:59 AM, ayan guha guha.a...@gmail.com wrote:
Hi All
I have a requirement to connect to a DB every few minutes and bring data to
HBase. Can anyone
On Wed, Aug 26, 2015 at 2:03 PM, Jerry jerry.c...@gmail.com wrote:
Assuming you're submitting the job from a terminal: when main() is called, if I
try to open a file locally, can I assume the machine is always the one I
submitted the job from?
See the --deploy-mode option. client works as you
This simple command call:
val final_df = data.select("month_balance").withColumn("month_date",
data.col("month_date_curr"))
Is throwing:
org.apache.spark.sql.AnalysisException: resolved attribute(s)
month_date_curr#324 missing from month_balance#234 in operator !Project
[month_balance#234,
Ah, I was using the UI coupled with the job logs indicating that offsets
were being processed even though it corresponded to 0 events. Looks like
I wasn't matching up timestamps correctly: the 0 event batches were
queued/processed when offsets were getting skipped:
15/08/26 11:26:05 INFO
Hi, I have an Ubuntu box with 4GB memory and dual cores. Do you think it
won't be enough to run Spark Streaming and Kafka? I am trying to install
standalone-mode Spark and Kafka so I can debug them in an IDE. Do I need to
install Hadoop?
Thanks!
J
Thanks!
On Wed, Aug 26, 2015 at 2:06 PM, Marcelo Vanzin van...@cloudera.com wrote:
On Wed, Aug 26, 2015 at 2:03 PM, Jerry jerry.c...@gmail.com wrote:
Assuming you're submitting the job from a terminal: when main() is called,
if I
try to open a file locally, can I assume the machine is always
I can reproduce this even simpler with the following:
val gf = sc.parallelize(Array(3,6,4,7,3,4,5,5,31,4,5,2)).toDF("ASD")
val ff = sc.parallelize(Array(4,6,2,3,5,1,4,6,23,6,4,7)).toDF("GFD")
gf.withColumn("DSA", ff.col("GFD"))
org.apache.spark.sql.AnalysisException: resolved attribute(s) GFD#421
Hi,
We are in the process of developing a new product/Spark application. While
the official Spark 1.4.1 page
http://spark.apache.org/docs/latest/ml-guide.html invites users and
developers to use *Spark.mllib* and optionally contribute to *Spark.ml*,
this
Hey all,
I'm trying to do some stuff with a YAML file in the Spark driver using
SnakeYAML library in scala.
When I put the snakeyaml v1.14 jar on the SPARK_DIST_CLASSPATH and try to
de-serialize some objects from YAML into classes in my app JAR on the
driver (only the driver), I get the
You can try using wholeTextFiles, which will give you a pair RDD of (fileName,
content). flatMap through this and manipulate the content.
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co
Check out Reifier at Spark Summit 2015
Sorry for the noise, It's my bad...I have worked it out now.
At 2015-08-26 13:20:57, Todd bit1...@163.com wrote:
I think the answer is no. I only see such messages on the console, and #2 is the
thread stack trace.
My thinking is that Spark SQL Perf forks many dsdgen processes to generate
The same as
https://mail.google.com/mail/#label/Spark%2Fuser/14f64c75c15f5ccd
Please follow the discussion there.
On Tue, Aug 25, 2015 at 5:02 PM, Petr Novak oss.mli...@gmail.com wrote:
Hi all,
when I read parquet files with required fields aka nullable=false they
are read correctly. Then I
Hi Samya,
When submitting an application with spark-submit, the cores per executor can
be set with --executor-cores, meaning you can run that many tasks per
executor concurrently. The page below has some more details on submitting
applications:
Thanks Jem, I do understand your suggestion. Actually, --executor-cores alone
doesn't control the number of tasks; it is also governed by spark.task.cpus
(the number of cores dedicated to each task's execution).
Reframing my question: how many threads can be spawned per executor core? Is it
in
Hi,
I'm trying to perform an ETL using Spark, but as soon as I start performing
joins performance degrades a lot. Let me explain what I'm doing and what I
found out until now.
First of all, I'm reading avro files that are on a Cloudera cluster, using
commands like this:
val tab1 =
Spark joins are different from traditional database joins because of the
lack of index support. Spark has to shuffle data between various
nodes to perform joins. Hence joins are bound to be much slower than count,
which is just a parallel scan of the data.
Still, to ensure that nothing is
Mind sharing how you fixed the issue ?
Cheers
On Aug 26, 2015, at 1:50 AM, Todd bit1...@163.com wrote:
Sorry for the noise, It's my bad...I have worked it out now.
At 2015-08-26 13:20:57, Todd bit1...@163.com wrote:
I think the answer is No. I only see such message on the
Hi Folks,
My Spark application interacts with Kafka to get data through the Java API.
I am using the Direct Approach (No Receivers), which uses Kafka's simple
consumer API to read data.
So, Kafka offsets need to be handled explicitly.
In case of Spark failure I need to save the offset state of
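For reference, a hedged Scala sketch of the usual pattern for this;
saveOffsets is a hypothetical function writing to your store of choice, and
stream is the direct stream created earlier:
```
import org.apache.spark.streaming.kafka.HasOffsetRanges

// Each batch RDD produced by the direct stream carries the Kafka offset
// ranges it covers; cast to HasOffsetRanges to read them out and persist.
stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { o =>
    saveOffsets(o.topic, o.partition, o.fromOffset, o.untilOffset)
  }
}
```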
Dear all,
I'm trying to find an efficient way to build a k-NN graph for a large
dataset. Precisely, I have a large set of high-dimensional vectors (say d >>
1) and I want to build a graph where those high-dimensional points
are the vertices and each one is linked to the k nearest neighbors based
Sam,
This may be of interest; as far as I can see it suggests that a Spark
'task' is always executed as a single thread in the JVM.
http://0x0fff.com/spark-architecture/
Thanks,
Jem
On Wed, Aug 26, 2015 at 10:06 AM Samya MAITI samya.ma...@amadeus.com
wrote:
Thanks Jem, I do understand
You could try dimensionality reduction (PCA or SVD) first. I would imagine that
even if you could successfully compute similarities in the high-dimensional
space you would probably run into the curse of dimensionality.
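A hedged sketch of that suggestion with MLlib, assuming `vectors` is an
RDD[Vector] of the points and 50 is an arbitrary target dimension:
```
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Project the high-dimensional points onto their top principal components
// before doing any similarity computation.
val mat = new RowMatrix(vectors)
val pc = mat.computePrincipalComponents(50)
val projected = mat.multiply(pc)  // rows are the points in the reduced space
```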
On 26 Aug 2015, at 12:35, Jaonary Rabarisoa jaon...@gmail.com wrote:
Dear
Hi,
I've developed a ScalaCheck property for testing Spark Streaming
transformations. To do that I had to develop a custom InputDStream, which
is very similar to QueueInputDStream but has a method for adding new test
cases for dstreams, which are objects of type Seq[Seq[A]], to the DStream.
You
Have you run dev/change-version-to-2.11.sh?
Cheers
On Wed, Aug 26, 2015 at 7:07 AM, Felix Neutatz neut...@googlemail.com
wrote:
Hi everybody,
I tried to build Spark v1.4.1-rc4 with Scala 2.11:
../apache-maven-3.3.3/bin/mvn -Dscala-2.11 -DskipTests clean install
Before running this, I
As I was writing a long-ish message to explain how it doesn't work, it
dawned on me that maybe the driver connects to executors only after there's
some work to do (while I was trying to find the number of executors BEFORE
starting the actual work).
So the solution was to simply execute a dummy task (
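Presumably something along these lines (a hedged sketch; the range and
partition count are arbitrary, and getExecutorMemoryStatus is just one way to
inspect the result):
```
// A throwaway action forces the driver to actually acquire executors.
sc.parallelize(1 to 1000, 10).count()
// After which the executor list is populated.
val executorHosts = sc.getExecutorMemoryStatus.keys
println(executorHosts.mkString(", "))
```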
When running a long-lived job on YARN, like Spark Streaming, I found that
container logs are gone after days on executor nodes, although the job itself
is still running.
I am using cdh5.4.0 and have aggregated logs enabled. Because the local
logs are gone on executor nodes, I don't see any aggregated
I have a simple test that is hanging when using s3a with spark 1.3.1. Is
there something I need to do to cleanup the S3A file system? The write to S3
appears to have worked but this job hangs in the spark-shell and using
spark-submit. Any help would be greatly appreciated. TIA.
import
Could you please show the jstack result of the hung process? Thanks!
Cheng
On 8/26/15 10:46 PM, cingram wrote:
I have a simple test that is hanging when using s3a with spark 1.3.1. Is
there something I need to do to cleanup the S3A file system? The write to S3
appears to have worked but this job
Hi everybody,
I tried to build Spark v1.4.1-rc4 with Scala 2.11:
../apache-maven-3.3.3/bin/mvn -Dscala-2.11 -DskipTests clean install
Before running this, I deleted:
../.m2/repository/org/apache/spark
../.m2/repository/org/spark-project
My changes to the code:
I just changed line 174 of
I don't see that you invoke any action in this code. It won't do
anything unless you tell it to perform an action that requires the
transformations.
On Wed, Aug 26, 2015 at 7:05 AM, Deepesh Maheshwari
deepesh.maheshwar...@gmail.com wrote:
Hi,
I have applied mapToPair and then a reduceByKey on a
Hi All,
Few basic queries :-
1. Is there a way we can control the number of threads per executor core?
2. Does this parameter “executor-cores” also have a say in deciding how many
threads are run?
Regards,
Sam
Increase the number of executors, :-)
At 2015-08-26 16:57:48, Ted Yu yuzhih...@gmail.com wrote:
Mind sharing how you fixed the issue ?
Cheers
On Aug 26, 2015, at 1:50 AM, Todd bit1...@163.com wrote:
Sorry for the noise, It's my bad...I have worked it out now.
At 2015-08-26 13:20:57,
The first thing that stands out to me is
createOnError = true
Are you sure the checkpoint is actually loading, as opposed to failing and
starting the job anyway? There should be info lines that look like
INFO DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData:
Restoring KafkaRDD for
If you don't want to compute all N^2 similarities, you need to implement
some kind of blocking first, for example LSH (locality-sensitive hashing).
A quick search gave this link to a Spark implementation:
That argument takes a function from MessageAndMetadata to whatever you want
your stream to contain.
See
https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala#L57
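In rough outline, a hedged sketch; kafkaParams and fromOffsets are assumed to
be defined already, and the (topic, partition, payload) result type is just an
example:
```
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// The last argument maps each MessageAndMetadata to whatever element type
// the resulting stream should carry.
val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, Int, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) =>
    (mmd.topic, mmd.partition, mmd.message))
```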
On Wed, Aug 26, 2015 at 7:55 AM, Deepesh Maheshwari
Hey,
I need to set the number of cores from inside the topology. It's working
fine by setting it in spark-env.sh, but I am unable to do it via setting a
key/value on the conf.
SparkConf sparkConf = new
SparkConf().setAppName("JavaCustomReceiver").setMaster("local[4]");
if (toponame.equals("IdentityTopology"))
I checked out the master branch and started playing around with the
examples. I want to build a jar of the examples as I wish run them using
the modified spark jar that I have. However, packaging spark-examples takes
too much time as maven tries to download the jar dependencies rather than
use
Yes. And a paper that describes using grids (actually varying grids) is
http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf
In the Spark GraphX In Action book that Robin East and I are writing, we
implement a drastically simplified version of this in chapter
+1 to all of the above, esp. dimensionality reduction and locality-sensitive
hashing / min hashing.
There's also an algorithm implemented in MLlib called DIMSUM which was
developed at Twitter for this purpose. I've been meaning to try it and would be
interested to hear about results you get.
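For reference, a hedged sketch of how DIMSUM is exposed in MLlib; the
threshold is arbitrary, and `rows` is an assumed RDD[Vector] in which the
items to compare are the columns:
```
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// With a positive threshold, columnSimilarities uses the approximate
// DIMSUM sampling scheme instead of computing all pairs exactly.
val mat = new RowMatrix(rows)
val sims = mat.columnSimilarities(0.1)  // CoordinateMatrix of cosine similarities
```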
Thanks Michael, much appreciated!
Nothing should be held in memory for a query like this (other than a single
count per partition), so I don't think that is the problem. There is
likely an error buried somewhere.
Regarding your comments above: I don't get any error, but just get NULL as the
return
Hi,
I am having issues with /tmp space filling up during Spark jobs because
Spark-on-YARN uses the yarn.nodemanager.local-dirs for shuffle space. I noticed
this message appears when submitting Spark-on-YARN jobs:
WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by
the
Hello,
I'm seeing a strange behavior where count() on a DataFrame errors as shown
below but collect() works fine.
This is what I tried from spark-shell. solrRDD.queryShards() return a
javaRDD.
val rdd = solrRDD.queryShards(sc, query, _version_, 2).rdd
rdd:
Would be interested to know the answer too.
On Wed, Aug 26, 2015 at 11:45 AM, Sadhan Sood sadhan.s...@gmail.com wrote:
Interestingly, if there is nothing running on dev spark-shell, it recovers
successfully and regains the lost executors. Attaching the log for that.
Notice, the Registering
-dev +user
I'd suggest running .explain() on both dataframes to understand the
performance better. The problem is likely that we have a pattern that
looks for cases where you have an equality predicate where either side can
be evaluated using one side of the join. We turn this into a hash join.
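i.e., something like this sketch, where the table and column names are
placeholders:
```
// Print the physical plan to see whether the equality predicate was turned
// into a hash join or fell back to a slower strategy.
tab1.join(tab2, tab1("id") === tab2("id")).explain()
```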
Hi Saif,
In both cases you're referencing columns that don't exist in the current
DataFrame.
In the first email you did a select and then a withColumn for 'month_date_cur' on
the resulting DF, but that column does not exist, because you selected
only 'month_balance'.
In the second email
Davies, I created an issue - SPARK-10246
https://issues.apache.org/jira/browse/SPARK-10246
On Tue, Aug 25, 2015 at 12:53 PM, Davies Liu dav...@databricks.com wrote:
It's good to support this, could you create a JIRA for it and target for
1.6?
On Tue, Aug 25, 2015 at 11:21 AM, Michal
Hi all, a question on an issue I'm having with a VertexRDD. If I kick off my
spark shell with something like this:
then run:
it will finish and give me the count, but I see a few errors (see below).
This is okay for this small dataset, but when trying with a large dataset it
doesn't finish because
I'd suggest looking at
http://spark-packages.org/package/databricks/spark-avro
On Wed, Aug 26, 2015 at 11:32 AM, gpatcham gpatc...@gmail.com wrote:
Hi,
I'm trying to query a Hive table that is based on Avro in Spark SQL and
seeing the errors below.
15/08/26 17:51:12 WARN avro.AvroSerdeUtils:
I'd suggest setting sbt to fork when running tests.
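i.e., in build.sbt (sbt 0.13 syntax), so each test run gets a fresh JVM and
its PermGen is reclaimed when that JVM exits:
```
fork in Test := true
```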
On Wed, Aug 26, 2015 at 10:51 AM, Mike Trienis mike.trie...@orcsol.com
wrote:
Thanks for your response Yana,
I can increase the MaxPermSize parameter and it will allow me to run the
unit test a few more times before I run out of memory.
Thanks Cody.
Are you suggesting to put the cache in global context in each executor JVM,
in a Scala object for example. Then have a scheduled task to refresh the
cache (or triggered by the expiry if Guava)?
Chen
On Wed, Aug 26, 2015 at 10:51 AM, Cody Koeninger c...@koeninger.org wrote:
If
Hi
My streaming application gets killed with below error
15/08/26 21:55:20 ERROR kafka.DirectKafkaInputDStream:
ArrayBuffer(kafka.common.NotLeaderForPartitionException,
kafka.common.NotLeaderForPartitionException,
kafka.common.NotLeaderForPartitionException,
Can I change the param fetch.message.max.bytes or
spark.streaming.kafka.maxRatePerPartition
at runtime, across batches?
Say I detect some failure condition in my system and decide to consume in the
next batch interval only 10 messages per partition; if that succeeds, I
reset the max limit to