Feedback: Feature request

2015-08-26 Thread Murphy, James
Hey all, In working with the DecisionTree classifier, I found it difficult to extract rules that could easily facilitate visualization with libraries like D3. So for example, using print(model.toDebugString()), I get the following result = If (feature 0 <= -35.0) If (feature 24 <= 176.0)
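
For readers who want the rules in a form D3 can consume, here is a minimal sketch of walking the trained model instead of parsing toDebugString() (assuming MLlib's DecisionTreeModel/Node API in Spark 1.x; only continuous splits are handled and the JSON shape is illustrative):

    import org.apache.spark.mllib.tree.model.{DecisionTreeModel, Node}

    // Recursively turn the tree into nested JSON that D3 can lay out.
    def toJson(node: Node): String = {
      if (node.isLeaf) {
        s"""{"predict": ${node.predict.predict}}"""
      } else {
        val split = node.split.get   // continuous split: feature <= threshold
        s"""{"feature": ${split.feature}, "threshold": ${split.threshold},
           | "left": ${toJson(node.leftNode.get)},
           | "right": ${toJson(node.rightNode.get)}}""".stripMargin
      }
    }

    def modelToJson(model: DecisionTreeModel): String = toJson(model.topNode)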

Re: Spark cluster multi tenancy

2015-08-26 Thread Sadhan Sood
Interestingly, if there is nothing running on dev spark-shell, it recovers successfully and regains the lost executors. Attaching the log for that. Notice the "Registering block manager ..." statements at the very end, after all executors were lost. On Wed, Aug 26, 2015 at 11:27 AM, Sadhan Sood

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-26 Thread Cody Koeninger
I'd be less concerned about what the streaming ui shows than what's actually going on with the job. When you say you were losing messages, how were you observing that? The UI, or actual job output? The log lines you posted indicate that the checkpoint was restored and those offsets were

Re: Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
Using TABLESAMPLE(0.1) is actually way worse. Spark first spends 12 minutes to discover all split files on all hosts (for some reason) before it even starts the job, and then it creates 3.5 million tasks (the partition has ~32k split files). On Wed, Aug 26, 2015 at 9:36 AM, Jörn Franke

query avro hive table in spark sql

2015-08-26 Thread gpatcham
Hi, I'm trying to query hive table which is based on avro in spark SQL and seeing below errors. 15/08/26 17:51:12 WARN avro.AvroSerdeUtils: Encountered AvroSerdeException determining schema. Returning signal schema to indicate problem org.apache.hadoop.hive.serde2.avro.AvroSerdeException:

Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from and I don't particularly care which rows. Doing a LIMIT unfortunately results in two stages where the first stage reads the whole table, and the second then performs the limit with a single worker, which is not very
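
A minimal sketch of one way to avoid the single-worker LIMIT stage, using DataFrame.sample (the table name and fraction are illustrative; note this still scans the data, which is the limitation discussed elsewhere in this thread):

    // ~100m out of ~1b rows is roughly a 10% sample; sample() runs per
    // partition, so there is no second stage funnelling through one worker.
    val df = sqlContext.table("my_big_table")            // hypothetical table
    val sampled = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)
    sampled.write.parquet("/tmp/sampled")                // or use it directly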

Re: Efficient sampling from a Hive table

2015-08-26 Thread Jörn Franke
Have you tried tablesample? You find the exact syntax in the documentation, but it does exactly what you want. On Wed, Aug 26, 2015 at 18:12, Thomas Dudziak tom...@gmail.com wrote: Sorry, I meant without reading from all splits. This is a single partition in the table. On Wed, Aug 26,

Re: Spark 1.3.1 saveAsParquetFile hangs on app exit

2015-08-26 Thread cingram
spark-shell-hang-on-exit.tdump http://apache-spark-user-list.1001560.n3.nabble.com/file/n24461/spark-shell-hang-on-exit.tdump -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-saveAsParquetFile-hangs-on-app-exit-tp24460p24461.html Sent from

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-26 Thread Susan Zhang
Thanks for the suggestions! I tried the following: I removed createOnError = true and reran the same process to reproduce. Double-checked that the checkpoint is loading: 15/08/26 10:10:40 INFO DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData: Restoring KafkaRDD for time 1440608825000

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-26 Thread Susan Zhang
Compared offsets, and it continues from checkpoint loading: 15/08/26 11:24:54 INFO DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData: Restoring KafkaRDD for time 1440612035000 ms [(install-json,5,826112083,826112446), (install-json,4,825772921,825773536),

Re: Spark cluster multi tenancy

2015-08-26 Thread Sadhan Sood
Attaching log for when the dev job gets stuck (once all its executors are lost due to preemption). This is a spark-shell job running in yarn-client mode. On Wed, Aug 26, 2015 at 10:45 AM, Sadhan Sood sadhan.s...@gmail.com wrote: Hi All, We've set up our spark cluster on aws running on yarn

Can RDD be shared accross the cluster by other drivers?

2015-08-26 Thread Tao Lu
Hi, Guys, Is it possible that an RDD created by driver A be used by driver B? Thanks!

Re: Question on take function - Spark Java API

2015-08-26 Thread Pankaj Wahane
Thanks Sonal.. I shall try doing that.. On 26-Aug-2015, at 1:05 pm, Sonal Goyal sonalgoy...@gmail.com wrote: You can try using wholeTextFile which will give you a pair rdd of fileName, content. flatMap through this and manipulate the content. Best Regards, Sonal Founder, Nube

Re: SparkSQL saveAsParquetFile does not preserve AVRO schema

2015-08-26 Thread storm
Note: In the code (org.apache.spark.sql.parquet.DefaultSource) I've found this: val relation = if (doInsertion) { // This is a hack. We always set nullable/containsNull/valueContainsNull to true // for the schema of a parquet data. val df = sqlContext.createDataFrame(

reduceByKey not working on JavaPairDStream

2015-08-26 Thread Deepesh Maheshwari
Hi, I have applied mapToPair and then a reduceByKey on a DStream to obtain a JavaPairDStream<String, Map<String, Object>>. I have to apply a flatMapToPair and reduceByKey on the DStream obtained above. But I do not see any logs from the reduceByKey operation. Can anyone explain why this is happening?

BlockNotFoundException when running spark word count on Tachyon

2015-08-26 Thread Todd
I am using tachyon in the spark program below, but I encounter a BlockNotFoundException. Does someone know what's wrong, and also is there a guide on how to configure spark to work with Tachyon? Thanks! conf.set("spark.externalBlockStore.url", "tachyon://10.18.19.33:19998")
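
For reference, a rough sketch of the kind of configuration being asked about (the Tachyon master address, input path and storage level are illustrative; this assumes the Spark 1.5-era external block store settings):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("TachyonWordCount")
      // point Spark's external block store at the Tachyon master
      .set("spark.externalBlockStore.url", "tachyon://10.18.19.33:19998")
    val sc = new SparkContext(conf)

    // OFF_HEAP keeps cached partitions in Tachyon instead of executor memory
    val counts = sc.textFile("/path/to/input.txt")       // hypothetical input
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .persist(StorageLevel.OFF_HEAP)
    counts.count()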

Re: MLlib Prefixspan implementation

2015-08-26 Thread alexis GILLAIN
A first use case of gap constraint is included in the article. Another application would be customer-shopping sequence analysis where you want to put a constraint on the duration between two purchases for them to be considered as a pertinent sequence. Additional question regarding the code :

Re: BlockNotFoundException when running spark word count on Tachyon

2015-08-26 Thread Dibyendu Bhattacharya
Sometime back I was playing with Spark and Tachyon and I also found this issue. The issue here is that TachyonBlockManager puts the blocks with the WriteType.TRY_CACHE configuration. And because of this, blocks are evicted from the Tachyon cache when memory is full, and when Spark tries to find the block it throws

RE: SparkR: exported functions

2015-08-26 Thread Felix Cheung
I believe that is done explicitly while the final API is being figured out. For the moment you could use DataFrame read.df() From: csgilles...@gmail.com Date: Tue, 25 Aug 2015 18:26:50 +0100 Subject: SparkR: exported functions To: user@spark.apache.org Hi, I've just started playing

Re: BlockNotFoundException when running spark word count on Tachyon

2015-08-26 Thread Dibyendu Bhattacharya
The URL seems to have changed .. here is the one .. http://tachyon-project.org/documentation/Tiered-Storage-on-Tachyon.html On Wed, Aug 26, 2015 at 12:32 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Sometime back I was playing with Spark and Tachyon and I also found this

Re: MLlib Prefixspan implementation

2015-08-26 Thread Feynman Liang
ReversedPrefix is used because Scala's List is a linked list, which has constant-time prepend to the head but linear-time append to the tail. I'm aware that there are use cases for the gap constraints. My question was more about whether any users of Spark/MLlib have an immediate application for these

Spark cluster multi tenancy

2015-08-26 Thread Sadhan Sood
Hi All, We've set up our spark cluster on aws running on yarn (running on hadoop 2.3) with fair scheduling and preemption turned on. The cluster is shared for prod and dev work where prod runs with a higher fair share and can preempt dev jobs if there are not enough resources available for it. It

Re: JDBC Streams

2015-08-26 Thread Cody Koeninger
Yes On Wed, Aug 26, 2015 at 10:23 AM, Chen Song chen.song...@gmail.com wrote: Thanks Cody. Are you suggesting to put the cache in global context in each executor JVM, in a Scala object for example. Then have a scheduled task to refresh the cache (or triggered by the expiry if Guava)? Chen
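
A rough sketch of the pattern being confirmed here, a per-executor cache held in a Scala object with a background refresh (the loader, key/value types and refresh interval are all placeholders; a Guava cache with expiry would be an alternative):

    import java.util.concurrent.{Executors, TimeUnit}

    // A Scala object is initialized once per executor JVM, so every task on
    // that executor shares this cache.
    object ReferenceDataCache {
      @volatile private var cache: Map[String, String] = loadFromDb()

      private def loadFromDb(): Map[String, String] = {
        // hypothetical JDBC lookup; replace with the real query
        Map.empty[String, String]
      }

      // refresh in the background every 5 minutes
      private val scheduler = Executors.newSingleThreadScheduledExecutor()
      scheduler.scheduleAtFixedRate(new Runnable {
        def run(): Unit = { cache = loadFromDb() }
      }, 5, 5, TimeUnit.MINUTES)

      def get(key: String): Option[String] = cache.get(key)
    }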

spark streaming 1.3 kafka buffer size

2015-08-26 Thread Shushant Arora
What's the default buffer in spark streaming 1.3 for kafka messages? Say in this run it has to fetch messages from offset 1 to 1. Will it fetch all in one go, or does it internally fetch messages in smaller batches? Is there any setting to configure the number of offsets fetched in one batch?

Re: Building spark-examples takes too much time using Maven

2015-08-26 Thread Ted Yu
Can you provide a bit more information? Do the Spark artifacts packaged by you have the same names / paths (in the maven repo) as the ones published by Apache Spark? Is Zinc running on the machine where you performed the build? Cheers On Wed, Aug 26, 2015 at 7:56 AM, Muhammad Haseeb Javed

Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-26 Thread Dhaval Patel
Thanks Davies. HiveContext seems neat to use :) On Thu, Aug 20, 2015 at 3:02 PM, Davies Liu dav...@databricks.com wrote: As Aram said, there are two options in Spark 1.4, 1) Use the HiveContext, then you get datediff from Hive, df.selectExpr("datediff(d2, d1)") 2) Use a Python UDF: ``` from
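
A minimal, self-contained Scala version of option 1 (the sample data and column names are illustrative):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // hypothetical frame with two date columns
    val df = Seq(("2015-08-01", "2015-08-26")).toDF("d1", "d2")

    // datediff comes from Hive when the query runs through HiveContext
    df.selectExpr("datediff(d2, d1) as duration_days").show()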

Re: spark streaming 1.3 kafka buffer size

2015-08-26 Thread Cody Koeninger
see http://kafka.apache.org/documentation.html#consumerconfigs fetch.message.max.bytes in the kafka params passed to the constructor On Wed, Aug 26, 2015 at 10:39 AM, Shushant Arora shushantaror...@gmail.com wrote: whats the default buffer in spark streaming 1.3 for kafka messages. Say In
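
A minimal sketch of passing that consumer setting through the direct stream's kafkaParams (broker list, topic and fetch size are placeholders; ssc is an existing StreamingContext):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "metadata.broker.list"    -> "broker1:9092,broker2:9092",
      "fetch.message.max.bytes" -> (8 * 1024 * 1024).toString  // per fetch request
    )
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))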

Re: Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
Sorry, I meant without reading from all splits. This is a single partition in the table. On Wed, Aug 26, 2015 at 8:53 AM, Thomas Dudziak tom...@gmail.com wrote: I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from and I don't particularly care which rows. Doing a LIMIT

Persisting sorted parquet tables for future sort merge joins

2015-08-26 Thread Jason
I want to persist a large _sorted_ table to Parquet on S3 and then read this in and join it using the Sorted Merge Join strategy against another large sorted table. The problem is: even though I sort these tables on the join key beforehand, once I persist them to Parquet, they lose the

Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-26 Thread Mike Trienis
Thanks for your response Yana, I can increase the MaxPermSize parameter and it will allow me to run the unit test a few more times before I run out of memory. However, the primary issue is that running the same unit test in the same JVM (multiple times) results in increased memory (each run of

Just Released V1.0.4 Low Level Receiver Based Kafka-Spark-Consumer in Spark Packages having built-in Back Pressure Controller

2015-08-26 Thread Dibyendu Bhattacharya
Dear All, Just now released the 1.0.4 version of Low Level Receiver based Kafka-Spark-Consumer in spark-packages.org . You can find the latest release here : http://spark-packages.org/package/dibbhatt/kafka-spark-consumer Here is github location :

Re: JDBC Streams

2015-08-26 Thread Jörn Franke
I would use Sqoop. It has been designed exactly for these types of scenarios. Spark streaming does not make sense here. On Sun, Jul 5, 2015 at 1:59, ayan guha guha.a...@gmail.com wrote: Hi All I have a requirement to connect to a DB every few minutes and bring data to HBase. Can anyone

Re: Does the driver program always run local to where you submit the job from?

2015-08-26 Thread Marcelo Vanzin
On Wed, Aug 26, 2015 at 2:03 PM, Jerry jerry.c...@gmail.com wrote: Assuming you're submitting the job from a terminal; when main() is called, if I try to open a file locally, can I assume the machine is always the one I submitted the job from? See the --deploy-mode option. client works as you

Help! Stuck using withColumn

2015-08-26 Thread Saif.A.Ellafi
This simple command call: val final_df = data.select("month_balance").withColumn("month_date", data.col("month_date_curr")) is throwing: org.apache.spark.sql.AnalysisException: resolved attribute(s) month_date_curr#324 missing from month_balance#234 in operator !Project [month_balance#234,

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-26 Thread Susan Zhang
Ah, I was using the UI coupled with the job logs indicating that offsets were being processed even though it corresponded to 0 events. Looks like I wasn't matching up timestamps correctly: the 0 event batches were queued/processed when offsets were getting skipped: 15/08/26 11:26:05 INFO

suggest configuration for debugging spark streaming, kafka

2015-08-26 Thread Joanne Contact
Hi I have an Ubuntu box with 4GB memory and dual cores. Do you think it won't be enough to run spark streaming and kafka? I am trying to install standalone-mode spark and kafka so I can debug them in an IDE. Do I need to install hadoop? Thanks! J

Re: Does the driver program always run local to where you submit the job from?

2015-08-26 Thread Jerry
Thanks! On Wed, Aug 26, 2015 at 2:06 PM, Marcelo Vanzin van...@cloudera.com wrote: On Wed, Aug 26, 2015 at 2:03 PM, Jerry jerry.c...@gmail.com wrote: Assuming you're submitting the job from a terminal; when main() is called, if I try to open a file locally, can I assume the machine is always

RE: Help! Stuck using withColumn

2015-08-26 Thread Saif.A.Ellafi
I can reproduce this even simpler with the following: val gf = sc.parallelize(Array(3,6,4,7,3,4,5,5,31,4,5,2)).toDF("ASD") val ff = sc.parallelize(Array(4,6,2,3,5,1,4,6,23,6,4,7)).toDF("GFD") gf.withColumn("DSA", ff.col("GFD")) org.apache.spark.sql.AnalysisException: resolved attribute(s) GFD#421

Spark.ml vs Spark.mllib

2015-08-26 Thread njoshi
Hi, We are in the process of developing a new product/Spark application. While the official Spark 1.4.1 page http://spark.apache.org/docs/latest/ml-guide.html invites users and developers to use *Spark.mllib* and optionally contribute to *Spark.ml*, this

SPARK_DIST_CLASSPATH, primordial class loader app ClassNotFound

2015-08-26 Thread Night Wolf
Hey all, I'm trying to do some stuff with a YAML file in the Spark driver using the SnakeYAML library in scala. When I put the snakeyaml v1.14 jar on the SPARK_DIST_CLASSPATH and try to de-serialize some objects from YAML into classes in my app JAR on the driver (only the driver), I get the

Re: Question on take function - Spark Java API

2015-08-26 Thread Sonal Goyal
You can try using wholeTextFile which will give you a pair rdd of fileName, content. flatMap through this and manipulate the content. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co Check out Reifier at Spark Summit 2015
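
A minimal sketch of that suggestion (the input path and per-file parsing are placeholders):

    // wholeTextFiles returns an RDD of (fileName, fileContent) pairs.
    val files = sc.wholeTextFiles("hdfs:///path/to/input")
    val records = files.flatMap { case (fileName, content) =>
      // manipulate the whole file content here, e.g. split it into lines
      content.split("\n").map(line => (fileName, line))
    }
    records.take(5).foreach(println)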

Re:Re:Re: How to increase data scale in Spark SQL Perf

2015-08-26 Thread Todd
Sorry for the noise, it's my bad... I have worked it out now. At 2015-08-26 13:20:57, Todd bit1...@163.com wrote: I think the answer is No. I only see such message on the console, and #2 is the thread stack trace. I am thinking that Spark SQL Perf forks many dsdgen processes to generate

Re: DataFrame Parquet Writer doesn't keep schema

2015-08-26 Thread Petr Novak
The same as https://mail.google.com/mail/#label/Spark%2Fuser/14f64c75c15f5ccd Please follow the discussion there. On Tue, Aug 25, 2015 at 5:02 PM, Petr Novak oss.mli...@gmail.com wrote: Hi all, when I read parquet files with required fields aka nullable=false they are read correctly. Then I

Re: Relation between threads and executor core

2015-08-26 Thread Jem Tucker
Hi Samya, When submitting an application with spark-submit the cores per executor can be set with --executor-cores, meaning you can run that many tasks per executor concurrently. The page below has some more details on submitting applications:

RE: Relation between threads and executor core

2015-08-26 Thread Samya MAITI
Thanks Jem, I do understand your suggestion. Actually --executor-cores alone doesn't control the number of concurrent tasks; it is also governed by spark.task.cpus (the number of cores dedicated to each task's execution). Reframing my question: how many threads can be spawned per executor core? Is it in

Performance issue with Spark join

2015-08-26 Thread lucap
Hi, I'm trying to perform an ETL using Spark, but as soon as I start performing joins performance degrades a lot. Let me explain what I'm doing and what I found out until now. First of all, I'm reading avro files that are on a Cloudera cluster, using commands like this: val tab1 =

Re: Performance issue with Spark join

2015-08-26 Thread Hemant Bhanawat
Spark joins are different than traditional database joins because of the lack of support of indexes. Spark has to shuffle data between various nodes to perform joins. Hence joins are bound to be much slower than count which is just a parallel scan of the data. Still, to ensure that nothing is

Re: How to increase data scale in Spark SQL Perf

2015-08-26 Thread Ted Yu
Mind sharing how you fixed the issue ? Cheers On Aug 26, 2015, at 1:50 AM, Todd bit1...@163.com wrote: Sorry for the noise, It's my bad...I have worked it out now. At 2015-08-26 13:20:57, Todd bit1...@163.com wrote: I think the answer is No. I only see such message on the

Custom Offset Management

2015-08-26 Thread Deepesh Maheshwari
Hi Folks, My Spark application interacts with kafka for getting data through the Java API. I am using the Direct Approach (No Receivers), which uses Kafka's simple consumer API to read data. So, kafka offsets need to be handled explicitly. In case of Spark failure I need to save the offset state of

Build k-NN graph for large dataset

2015-08-26 Thread Jaonary Rabarisoa
Dear all, I'm trying to find an efficient way to build a k-NN graph for a large dataset. Precisely, I have a large set of high dimensional vectors (say d 1) and I want to build a graph where those high dimensional points are the vertices and each one is linked to the k-nearest neighbor based

Re: Relation between threads and executor core

2015-08-26 Thread Jem Tucker
Sam, This may be of interest, as far as i can see it suggests that a spark 'task' is always executed as a single thread in the JVM. http://0x0fff.com/spark-architecture/ Thanks, Jem On Wed, Aug 26, 2015 at 10:06 AM Samya MAITI samya.ma...@amadeus.com wrote: Thanks Jem, I do understand

Re: Build k-NN graph for large dataset

2015-08-26 Thread Robin East
You could try dimensionality reduction (PCA or SVD) first. I would imagine that even if you could successfully compute similarities in the high-dimensional space you would probably run into the curse of dimensionality. On 26 Aug 2015, at 12:35, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear

JobScheduler: Error generating jobs for time for custom InputDStream

2015-08-26 Thread Juan Rodríguez Hortalá
Hi, I've developed a ScalaCheck property for testing Spark Streaming transformations. To do that I had to develop a custom InputDStream, which is very similar to QueueInputDStream but has a method for adding new test cases for dstreams, which are objects of type Seq[Seq[A]], to the DStream. You

Re: Issue with building Spark v1.4.1-rc4 with Scala 2.11

2015-08-26 Thread Ted Yu
Have you run dev/change-version-to-2.11.sh ? Cheers On Wed, Aug 26, 2015 at 7:07 AM, Felix Neutatz neut...@googlemail.com wrote: Hi everybody, I tried to build Spark v1.4.1-rc4 with Scala 2.11: ../apache-maven-3.3.3/bin/mvn -Dscala-2.11 -DskipTests clean install Before running this, I

Re: Finding the number of executors.

2015-08-26 Thread Virgil Palanciuc
As I was writing a long-ish message to explain how it doesn't work, it dawned on me that maybe driver connects to executors only after there's some work to do (while I was trying to find the number of executors BEFORE starting the actual work). So the solution was to simply execute a dummy task (

application logs for long lived job on YARN

2015-08-26 Thread Chen Song
When running a long-lived job on YARN like Spark Streaming, I found that container logs are gone after days on executor nodes, although the job itself is still running. I am using cdh5.4.0 and have aggregated logs enabled. Because the local logs are gone on executor nodes, I don't see any aggregated

Spark 1.3.1 saveAsParquetFile hangs on app exit

2015-08-26 Thread cingram
I have a simple test that is hanging when using s3a with spark 1.3.1. Is there something I need to do to cleanup the S3A file system? The write to S3 appears to have worked but this job hangs in the spark-shell and using spark-submit. Any help would be greatly appreciated. TIA. import

Re: Spark 1.3.1 saveAsParquetFile hangs on app exit

2015-08-26 Thread Cheng Lian
Could you please show jstack result of the hanged process? Thanks! Cheng On 8/26/15 10:46 PM, cingram wrote: I have a simple test that is hanging when using s3a with spark 1.3.1. Is there something I need to do to cleanup the S3A file system? The write to S3 appears to have worked but this job

Fwd: Issue with building Spark v1.4.1-rc4 with Scala 2.11

2015-08-26 Thread Felix Neutatz
Hi everybody, I tried to build Spark v1.4.1-rc4 with Scala 2.11: ../apache-maven-3.3.3/bin/mvn -Dscala-2.11 -DskipTests clean install Before running this, I deleted: ../.m2/repository/org/apache/spark ../.m2/repository/org/spark-project My changes to the code: I just changed line 174 of

Re: reduceByKey not working on JavaPairDStream

2015-08-26 Thread Sean Owen
I don't see that you invoke any action in this code. It won't do anything unless you tell it to perform an action that requires the transformations. On Wed, Aug 26, 2015 at 7:05 AM, Deepesh Maheshwari deepesh.maheshwar...@gmail.com wrote: Hi, I have applied mapToPair and then a reduceByKey on a
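
For illustration, a minimal Scala sketch of the point: DStream transformations are lazy until an output operation is registered ('pairs' and 'ssc' are hypothetical):

    // reduceByKey alone does nothing; an output operation triggers execution
    val counts = pairs.reduceByKey(_ + _)
    counts.print()                                        // or:
    counts.foreachRDD { rdd => rdd.take(10).foreach(println) }
    ssc.start()
    ssc.awaitTermination()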

Relation between threads and executor core

2015-08-26 Thread Samya
Hi All, Few basic queries :- 1. Is there a way we can control the number of threads per executor core? 2. Does this parameter “executor-cores” also has say in deciding how many threads to be run? Regards, Sam -- View this message in context:

Re:Re: How to increase data scale in Spark SQL Perf

2015-08-26 Thread Todd
Increase the number of executors, :-) At 2015-08-26 16:57:48, Ted Yu yuzhih...@gmail.com wrote: Mind sharing how you fixed the issue ? Cheers On Aug 26, 2015, at 1:50 AM, Todd bit1...@163.com wrote: Sorry for the noise, It's my bad...I have worked it out now. At 2015-08-26 13:20:57,

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-26 Thread Cody Koeninger
The first thing that stands out to me is createOnError = true Are you sure the checkpoint is actually loading, as opposed to failing and starting the job anyway? There should be info lines that look like INFO DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData: Restoring KafkaRDD for

Re: Build k-NN graph for large dataset

2015-08-26 Thread Kristina Rogale Plazonic
If you don't want to compute all N^2 similarities, you need to implement some kind of blocking first. For example, LSH (locality-sensitive hashing). A quick search gave this link to a Spark implementation:

Re: Custom Offset Management

2015-08-26 Thread Cody Koeninger
That argument takes a function from MessageAndMetadata to whatever you want your stream to contain. See https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala#L57 On Wed, Aug 26, 2015 at 7:55 AM, Deepesh Maheshwari
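
A minimal sketch of that overload, the Spark 1.3 direct stream constructor that takes explicit fromOffsets plus a messageHandler (topic, partition and offset values are placeholders; kafkaParams and ssc as elsewhere in the thread):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // offsets restored from your own store
    val fromOffsets = Map(TopicAndPartition("my-topic", 0) -> 12345L)

    // the messageHandler decides what each stream record contains; here we
    // keep the offset alongside the payload so it can be saved downstream
    val messageHandler = (mmd: MessageAndMetadata[String, String]) =>
      (mmd.topic, mmd.partition, mmd.offset, mmd.message())

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, Int, Long, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)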

Setting number of CORES from inside the Topology (JAVA code )

2015-08-26 Thread anshu shukla
Hey, I need to set the number of cores from inside the topology. It's working fine by setting it in spark-env.sh, but I'm unable to do it via setting a key/value on the conf. SparkConf sparkConf = new SparkConf().setAppName("JavaCustomReceiver").setMaster("local[4]"); if (toponame.equals("IdentityTopology"))

Building spark-examples takes too much time using Maven

2015-08-26 Thread Muhammad Haseeb Javed
I checked out the master branch and started playing around with the examples. I want to build a jar of the examples as I wish run them using the modified spark jar that I have. However, packaging spark-examples takes too much time as maven tries to download the jar dependencies rather than use

Re: Build k-NN graph for large dataset

2015-08-26 Thread Michael Malak
Yes. And a paper that describes using grids (actually varying grids) is  http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf  In the Spark GraphX In Action book that Robin East and I are writing, we implement a drastically simplified version of this in chapter

Re: Build k-NN graph for large dataset

2015-08-26 Thread Charlie Hack
+1 to all of the above esp.  Dimensionality reduction and locality sensitive hashing / min hashing.  There's also an algorithm implemented in MLlib called DIMSUM which was developed at Twitter for this purpose. I've been meaning to try it and would be interested to hear about results you get. 
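
For reference, a minimal sketch of the DIMSUM call in MLlib (toy vectors; the threshold trades accuracy for speed, and note that columnSimilarities compares the columns of the matrix, so points would need to be laid out as columns for a point-to-point k-NN graph):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 2.0),
      Vectors.dense(0.0, 1.0, 1.0)))
    val mat = new RowMatrix(rows)

    // DIMSUM: approximate all-pairs cosine similarities above the threshold
    val similarities = mat.columnSimilarities(0.1)
    similarities.entries.take(10).foreach(println)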

Re: DataFrame/JDBC very slow performance

2015-08-26 Thread Dhaval Patel
Thanks Michael, much appreciated! Nothing should be held in memory for a query like this (other than a single count per partition), so I don't think that is the problem. There is likely an error buried somewhere. For your above comments - I don't get any error but just get the NULL as return

Spark-on-YARN LOCAL_DIRS location

2015-08-26 Thread michael.england
Hi, I am having issues with /tmp space filling up during Spark jobs because Spark-on-YARN uses the yarn.nodemanager.local-dirs for shuffle space. I noticed this message appears when submitting Spark-on-YARN jobs: WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the

Dataframe collect() work but count() fails

2015-08-26 Thread Srikanth
Hello, I'm seeing a strange behavior where count() on a DataFrame errors as shown below but collect() works fine. This is what I tried from spark-shell. solrRDD.queryShards() returns a JavaRDD. val rdd = solrRDD.queryShards(sc, query, "_version_", 2).rdd rdd:

Re: Spark cluster multi tenancy

2015-08-26 Thread Jerrick Hoang
Would be interested to know the answer too. On Wed, Aug 26, 2015 at 11:45 AM, Sadhan Sood sadhan.s...@gmail.com wrote: Interestingly, if there is nothing running on dev spark-shell, it recovers successfully and regains the lost executors. Attaching the log for that. Notice, the Registering

Re: Differing performance in self joins

2015-08-26 Thread Michael Armbrust
-dev +user I'd suggest running .explain() on both dataframes to understand the performance better. The problem is likely that we have a pattern that looks for cases where you have an equality predicate where either side can be evaluated using one side of the join. We turn this into a hash join.
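
For reference, a one-liner for inspecting the plans (DataFrame names are placeholders; pass true to also print the logical plans):

    df1.join(df2, df1("id") === df2("id")).explain(true)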

Re: Help! Stuck using withColumn

2015-08-26 Thread Silvio Fiorito
Hi Saif, In both cases you’re referencing columns that don’t exist in the current DataFrame. In the first email you did a select and then a withColumn for ‘month_date_curr' on the resulting DF, but that column does not exist, because you did a select for only ‘month_balance’. In the second email
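
A minimal sketch of the fix being described, keeping the source column in the projection so withColumn can reference it (column names follow the thread):

    import org.apache.spark.sql.functions.col

    val final_df = data
      .select("month_balance", "month_date_curr")
      .withColumn("month_date", col("month_date_curr"))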

Re: Join with multiple conditions (In reference to SPARK-7197)

2015-08-26 Thread Michal Monselise
Davies, I created an issue - SPARK-10246 https://issues.apache.org/jira/browse/SPARK-10246 On Tue, Aug 25, 2015 at 12:53 PM, Davies Liu dav...@databricks.com wrote: It's good to support this, could you create a JIRA for it and target for 1.6? On Tue, Aug 25, 2015 at 11:21 AM, Michal

error accessing vertexRDD

2015-08-26 Thread dizzy5112
Hi all, question on an issue I'm having with a vertexRDD. If I kick off my spark shell with something like this: then run: it will finish and give me the count but I see a few errors (see below). This is okay for this small dataset but when trying with a large data set it doesn't finish because

Re: query avro hive table in spark sql

2015-08-26 Thread Michael Armbrust
I'd suggest looking at http://spark-packages.org/package/databricks/spark-avro On Wed, Aug 26, 2015 at 11:32 AM, gpatcham gpatc...@gmail.com wrote: Hi, I'm trying to query hive table which is based on avro in spark SQL and seeing below errors. 15/08/26 17:51:12 WARN avro.AvroSerdeUtils:
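
A minimal sketch of reading the Avro data through spark-avro instead of the Hive SerDe (package version and path are illustrative):

    // launch with: spark-shell --packages com.databricks:spark-avro_2.10:1.0.0
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("/path/to/avro/files")
    df.registerTempTable("my_avro_table")
    sqlContext.sql("SELECT COUNT(*) FROM my_avro_table").show()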

Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-26 Thread Michael Armbrust
I'd suggest setting sbt to fork when running tests. On Wed, Aug 26, 2015 at 10:51 AM, Mike Trienis mike.trie...@orcsol.com wrote: Thanks for your response Yana, I can increase the MaxPermSize parameter and it will allow me to run the unit test a few more times before I run out of memory.
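
For reference, the sbt settings in question, in build.sbt (heap and PermGen sizes are illustrative):

    // run tests in a forked JVM so memory from HiveContext-heavy suites is
    // released between runs instead of accumulating in the sbt JVM
    fork in Test := true
    javaOptions in Test ++= Seq("-Xmx2g", "-XX:MaxPermSize=512m")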

Re: JDBC Streams

2015-08-26 Thread Chen Song
Thanks Cody. Are you suggesting to put the cache in a global context in each executor JVM, in a Scala object for example, and then have a scheduled task refresh the cache (or have it triggered by the expiry in Guava)? Chen On Wed, Aug 26, 2015 at 10:51 AM, Cody Koeninger c...@koeninger.org wrote: If

spark streaming 1.3 kafka topic error

2015-08-26 Thread Shushant Arora
Hi, My streaming application gets killed with the below error: 15/08/26 21:55:20 ERROR kafka.DirectKafkaInputDStream: ArrayBuffer(kafka.common.NotLeaderForPartitionException, kafka.common.NotLeaderForPartitionException, kafka.common.NotLeaderForPartitionException,

Re: spark streaming 1.3 kafka buffer size

2015-08-26 Thread Shushant Arora
Can I change this param fetch.message.max.bytes or spark.streaming.kafka.maxRatePerPartition at run time across batches? Say I detected some fail condition in my system and I decided to consume in the next batch interval only 10 messages per partition, and if that succeeds I reset the max limit to