Re: Apache Beam Spark runner

2016-03-19 Thread Jean-Baptiste Onofré
Hi Amit, well done ;) I'm reviewing it now (as I didn't have time to do it before, sorry about that). Regards JB On 03/17/2016 06:26 PM, Sela, Amit wrote: Hi all, The Apache Beam Spark runner is now available at: https://github.com/apache/incubator-beam/tree/master/runners/spark Check it out!

RE: Extra libs in executor classpath

2016-03-19 Thread Silvio Fiorito
Could you publish it as a library (to an internal repo)? Then you can simply use the "--packages" option. It will also help with versioning as you make changes, so you're not having to manually ship JARs around to your machines and users. From: Леонид Поляков

Re: sql timestamp timezone bug

2016-03-19 Thread Andy Davidson
Hi Davies > > What's the type of `created`? TimestampType? The 'created' column in cassandra is a timestamp https://docs.datastax.com/en/cql/3.0/cql/cql_reference/timestamp_type_r.html In the spark data frame it is a timestamp > > If yes, when created is compared to a string, it will be

Re: sql timestamp timezone bug

2016-03-19 Thread Andy Davidson
For completeness: clearly spark sql returned a different data set. In [4]: rawDF.selectExpr("count(row_key) as num_samples", "sum(count) as total", "max(count) as max").show() |num_samples|total|max|

Re: help coercing types

2016-03-19 Thread Jacek Laskowski
Hi, Just a side question: why do you convert the DataFrame to an RDD? It's like driving backwards (possible, but ineffective and dangerous at times). P.S. I'd even go for Dataset. Jacek 18.03.2016 5:20 PM "Bauer, Robert" wrote: > I have data that I pull in using a sql

Re: [discuss] making SparkEnv private in Spark 2.0

2016-03-19 Thread Mridul Muralidharan
We use it in executors to get to : a) spark conf (for getting to hadoop config in map doing custom writing of side-files) b) Shuffle manager (to get shuffle reader) Not sure if there are alternative ways to get to these. Regards, Mridul On Wed, Mar 16, 2016 at 2:52 PM, Reynold Xin

Fwd: Extra libs in executor classpath

2016-03-19 Thread Леонид Поляков
It makes no sense for the worker; the issue is with the executor classpath, not the driver classpath. Please answer the actual question, which is not in the "P.S." (that one is just a note about the driver). Thanks, Leonid On Wed, Mar 16, 2016 at 6:21 PM, Ted Yu wrote: > For your last point,

Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Mich Talebzadeh
I did some tests on Hive running on MR to get rid of Spark effects. In an ORC table that has been partitioned, partition elimination with predicate push down works and the query is narrowed to the partition itself. I can see that from the number of rows within that partition. For example below

Re: The error to read HDFS custom file in spark.

2016-03-19 Thread Benyi Wang
I would say change class RawDataInputFormat[LW <: LongWritable, RD <: RDRawDataRecord] extends FileInputFormat to class RawDataInputFormat[LongWritable, RDRawDataRecord] extends FileInputFormat On Thu, Mar 17, 2016 at 9:48 AM, Mich Talebzadeh wrote: > Hi Tony, >

Re: Extra libs in executor classpath

2016-03-19 Thread Ted Yu
For your last point, spark-submit has: if [ -z "${SPARK_HOME}" ]; then export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)" fi Meaning the script would determine the proper SPARK_HOME variable. FYI On Wed, Mar 16, 2016 at 4:22 AM, Леонид Поляков wrote: > Hello, guys! > > >

Re: installing packages with pyspark

2016-03-19 Thread Felix Cheung
For some, like graphframes that are Spark packages, you could also use --packages in the command line of spark-submit or pyspark. See http://spark.apache.org/docs/latest/submitting-applications.html From: Jakob Odersky Sent: Thursday, March

help coercing types

2016-03-19 Thread Bauer, Robert
I have data that I pull in using a sql context and then I convert to an rdd. The problem is that the type in the rdd is [Any, Iterable[Any]] And I need to have the type RDD[Array[String]] -- convert the Iterable to an Array. Here’s more detail: val zdata = sqlContext.read.parquet("s3://..
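A minimal sketch of the coercion, assuming a spark-shell session (sc available); the parallelized data is a stand-in for the grouped RDD from the question:

    import org.apache.spark.rdd.RDD
    // Stand-in for the RDD[(Any, Iterable[Any])] produced by the groupBy.
    val pairs: RDD[(Any, Any)] = sc.parallelize(Seq[(Any, Any)](("a", 1), ("a", 2), ("b", 3)))
    val grouped: RDD[(Any, Iterable[Any])] = pairs.groupByKey()
    // Coerce each Iterable[Any] into an Array[String].
    val asArrays: RDD[Array[String]] = grouped.map { case (_, vs) => vs.map(_.toString).toArray }
    asArrays.collect().foreach(a => println(a.mkString(",")))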

Re: [Spark-1.5.2]Column renaming with withColumnRenamed has no effect

2016-03-19 Thread Ted Yu
Can you give a bit more detail? Release of Spark? Symptom of the renamed column not being recognized? Please take a look at the "withColumnRenamed" test in: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala On Thu, Mar 17, 2016 at 2:02 AM, Divya Gehlot wrote:

Re: df.dtypes -> pyspark.sql.types

2016-03-19 Thread Ruslan Dautkhanov
Spark 1.5 is the latest that I have access to and where this problem happens. I don't see it fixed in master but I might be wrong. diff attached. https://raw.githubusercontent.com/apache/spark/branch-1.5/python/pyspark/sql/types.py

Re: PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-03-19 Thread craigiggy
Slight update I suppose? For some reason, sometimes it will connect and continue and the job will be completed. But most of the time I still run into this error and the job is killed and the application doesn't finish. Still have no idea why this is happening. I could really use some help here.

Re: installing packages with pyspark

2016-03-19 Thread Franc Carter
Thanks - I'll give that a try cheers On 20 March 2016 at 09:42, Felix Cheung wrote: > You are running pyspark in Spark client deploy mode. I have ran into the > same error as well and I'm not sure if this is graphframes specific - the > python process can't find the

Re: df.dtypes -> pyspark.sql.types

2016-03-19 Thread Ruslan Dautkhanov
Running the following: #fix schema for gaid which should not be Double > from pyspark.sql.types import * > customSchema = StructType() > for (col,typ) in tsp_orig.dtypes: > if col=='Agility_GAID': > typ='string' > customSchema.add(col,typ,True) Getting ValueError: Could not parse

Re: installing packages with pyspark

2016-03-19 Thread Felix Cheung
You are running pyspark in Spark client deploy mode. I have ran into the same error as well and I'm not sure if this is graphframes specific - the python process can't find the graphframes Python code when it is loaded as a Spark package. To workaround this, I extract the graphframes Python

Re: Spark 2.0 Shell -csv package weirdness

2016-03-19 Thread Mich Talebzadeh
Hi Vince, We had a similar case a while back. I tried two solutions, in both Spark on Hive metastore and Hive on Spark engine. Hive version 2, Spark 1.3.1 as Hive engine. Basically: 1) Move .CSV data into HDFS; 2) Create an external table (all columns as string); 3) Create the ORC table

sql timestamp timezone bug

2016-03-19 Thread Andy Davidson
I am using pyspark 1.6.0 and datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series data The data is originally captured by a spark streaming app and written to Cassandra. The value of the timestamp comes from Rdd.foreachRDD(new VoidFunction2()

DistributedLDAModel missing APIs in org.apache.spark.ml

2016-03-19 Thread cindymc
I like using the new DataFrame APIs on Spark ML, compared to using RDDs in the older SparkMLlib. But it seems some of the older APIs are missing. In particular, '*.mllib.clustering.DistributedLDAModel' had two APIs that I need now: topDocumentsPerTopic topTopicsPerDocument How can I get at the

Re: ClassNotFoundException in RDD.map

2016-03-19 Thread Dirceu Semighini Filho
Hi Ted, thanks for answering. The map is just that, whenever I try inside the map it throws this ClassNotFoundException, even if I do map(f => f) it throws the exception. What is bothering me is that when I do a take or a first it returns the result, which makes me conclude that the previous code

Re: Can't zip RDDs with unequal numbers of partitions

2016-03-19 Thread Jiří Syrový
Unfortunately I can't share any snippet quickly as the code is generated, but for now I can at least share the plan. (See it here - http://pastebin.dqd.cz/RAhm/) After I've increased spark.sql.autoBroadcastJoinThreshold to 30 from 10 it went through without any problems. With 10 it was

RE: Error in "java.io.IOException: No input paths specified in job"

2016-03-19 Thread Sun, Rui
It complains about the file path "./examples/src/main/resources/people.json" You can try to use an absolute path instead of a relative path, and make sure the absolute path is correct. If that still does not work, you can prefix the path with "file://" in case the default file scheme for Hadoop is
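A minimal sketch of that suggestion, assuming a spark-shell session; the absolute path is a placeholder and must exist on every node:

    // Force the local filesystem scheme so Hadoop does not resolve the path
    // against the default (e.g. HDFS) filesystem.
    val people = sqlContext.read.json("file:///opt/spark/examples/src/main/resources/people.json")
    people.show()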

Re: Stop spark application when the job is complete.

2016-03-19 Thread Ted Yu
Can you call sc.stop() after indexing into elastic search? > On Mar 16, 2016, at 9:17 PM, Imre Nagi wrote: > > Hi, > > I have a spark application for batch processing in standalone cluster. The > job is to query the database and then do some transformation,
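A minimal sketch of that pattern, assuming a spark-shell-style context; the parallelize call is a stand-in for the real query/transform/index work:

    try {
      val result = sc.parallelize(1 to 100).map(_ * 2).sum()  // stand-in for the batch job
      println(result)
    } finally {
      sc.stop()  // releases executors so the standalone application can terminate
    }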

Re: Extra libs in executor classpath

2016-03-19 Thread Леонид Поляков
No, sadly, it's not an option. End users are not my team members, it's for customers, so I have to bundle the framework and ship it. There is more to my project than just libs, so end users will have to use bundle anyway. On Wed, Mar 16, 2016 at 6:41 PM, Silvio Fiorito <

Re: Spark configuration with 5 nodes

2016-03-19 Thread Mich Talebzadeh
Thanks Steve, For NN it all depends how fast you want a start-up. 1GB of NameNode memory accommodates around 42T, so if you are talking about 100GB of NN memory then SSD may make sense to speed up the start-up. RAID 10 is the best one can get assuming all internal disks. In general it

Re: Spark configuration with 5 nodes

2016-03-19 Thread Steve Loughran
> On 17 Mar 2016, at 12:28, Mich Talebzadeh wrote: > > Thanks Steve, > > For NN it all depends how fast you want a start-up. 1GB of NameNode memory > accommodates around 42T so if you are talking about 100GB of NN memory then > SSD may make sense to speed up the

CfP 11th Workshop on Virtualization in High-Performance Cloud Computing (VHPC '16)

2016-03-19 Thread VHPC 16
CALL FOR PAPERS 11th Workshop on Virtualization in High-Performance Cloud Computing (VHPC '16) held in conjunction with the International Supercomputing Conference - High Performance, June 19-23, 2016, Frankfurt, Germany.

Re: ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-19 Thread Cody Koeninger
Sounds like you're using one of the KafkaUtils.createDirectStream overloads that needs to do some broker communication in order to even construct the stream, because you aren't providing topicpartitions? Just wrap your construction attempt in a try / catch and retry in whatever way makes sense for
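A minimal sketch of that retry loop, assuming Spark 1.x's Kafka direct API and that ssc, kafkaParams and topics are already defined:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkException
    import org.apache.spark.streaming.dstream.InputDStream
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Retry stream construction while the brokers are unreachable.
    def createStreamWithRetry(attemptsLeft: Int): InputDStream[(String, String)] =
      try {
        KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics)
      } catch {
        case e: SparkException if attemptsLeft > 1 =>
          Thread.sleep(5000)  // back off before the next attempt
          createStreamWithRetry(attemptsLeft - 1)
      }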

Re: Subquery performance

2016-03-19 Thread Michael Armbrust
Try running EXPLAIN on both versions of the query. Likely when you cache the subquery we know that it's going to be small, so a broadcast join is used instead of shuffling the data. On Thu, Mar 17, 2016 at 5:53 PM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi all, > > > > I'm running

Re: unix_timestamp() time zone problem

2016-03-19 Thread Davies Liu
Could you try to cast the timestamp as long? Internally, timestamps are stored as microseconds in UTC; you will get seconds in UTC if you cast it to long. On Thu, Mar 17, 2016 at 1:28 PM, Andy Davidson < a...@santacruzintegration.com> wrote: > I am using python spark 1.6 and the --packages >
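A minimal sketch in Scala (the same functions exist in pyspark), using the rawDF/created names from this thread:

    import org.apache.spark.sql.functions.col
    // Timestamps are stored internally as UTC, so the long value is epoch
    // seconds regardless of the session time zone.
    val epochSeconds = rawDF.select(col("created").cast("long").alias("created_epoch"))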

Re: Fwd: Spark 2.0 Shell -csv package weirdness

2016-03-19 Thread Marco Mistroni
Have you tried df.saveAsParquetFile? I think that method is on the df API. HTH, Marco On 19 Mar 2016 7:18 pm, "Vincent Ohprecio" wrote: > > For some reason writing data from Spark shell to csv using the `csv > package` takes almost an hour to dump to disk. Am I going crazy or did I

Fwd: Spark 2.0 Shell -csv package weirdness

2016-03-19 Thread Vincent Ohprecio
For some reason writing data from Spark shell to csv using the `csv package` takes almost an hour to dump to disk. Am I going crazy or did I do this wrong? I tried writing to parquet first and it's fast as normal. On my MacBook Pro (16GB, 2.2 GHz Intel Core i7, 1TB) the machine's CPUs go crazy and

Re: installing packages with pyspark

2016-03-19 Thread Ajinkya Kale
Thanks Jakob, Felix. I am aware you can do it with --packages but I was wondering if there is a way to do something like "!pip install " like I do for other packages from a jupyter notebook for python. But I guess I cannot add a package once I launch the pyspark context, right? On Thu, Mar 17, 2016

Re: Checkpoint of DStream joined with RDD

2016-03-19 Thread Ted Yu
I looked at the places in SparkContext.scala where NewHadoopRDD is constructed. It seems the Configuration object shouldn't be null. Which hbase release are you using (so that I can see which line the NPE came from)? Thanks On Fri, Mar 18, 2016 at 8:05 AM, Lubomir Nerad

Re: Spark streaming with akka association with remote system failure

2016-03-19 Thread David Gomez Saavedra
Solved the issue by setting up the same heartbeat interval and pauses in both actor systems akka { loggers = ["akka.event.slf4j.Slf4jLogger"] loglevel = DEBUG logging-filter = "akka.event.slf4j.Slf4jLoggingFilter" log-dead-letters = on log-dead-letters-during-shutdown = on daemonic =

Issue regarding removal of duplicates from RDD

2016-03-19 Thread Thamme Gowda N.
Hi, I am facing an issue while deduplicating the keys in an RDD (code snippet below). I have a few sequence files, some of which have duplicate entries. I am trying to drop duplicate values for each key. Here are two methods with code snippets: val path = "path/to/sequence/file" val rdd1 =
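One common fix is reduceByKey, which arbitrarily keeps one value per key without materializing the groups; a minimal sketch assuming a spark-shell session (the parallelize call stands in for the sequence-file RDD):

    // Stand-in for the RDD loaded from the sequence files.
    val rdd = sc.parallelize(Seq(("k1", "v1"), ("k1", "v1-dup"), ("k2", "v2")))
    // Keep the first value seen for each key; duplicates are dropped.
    val deduped = rdd.reduceByKey((first, _) => first)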

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2016-03-19 Thread Sebastian Piu
Try to troubleshoot why it is happening; maybe some messages are too big to be read from the topic? I remember getting that error and that was the cause. On Fri, Mar 18, 2016 at 11:16 AM Ramkumar Venkataraman < ram.the.m...@gmail.com> wrote: > I am using Spark streaming and reading data from Kafka

Re: sql timestamp timezone bug

2016-03-19 Thread Davies Liu
On Thu, Mar 17, 2016 at 3:02 PM, Andy Davidson wrote: > I am using pyspark 1.6.0 and > datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series > data > > The data is originally captured by a spark streaming app and written to > Cassandra. The value

ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-19 Thread Ascot Moss
Hi, I have a Spark Streaming (with Kafka) job; after running for several days, it failed with the following errors: ERROR DirectKafkaInputDStream: ArrayBuffer(java.nio.channels.ClosedChannelException) Any idea what would be wrong? Will it be a Spark Streaming buffer overflow issue? Regards *** from

Re: The error to read HDFS custom file in spark.

2016-03-19 Thread Tony Liu
Sorry for the late reply. Yep, RDRawDataRecord is my object. It is defined in another java project (jar); I get it with maven. My MapReduce program also uses it and works. On Fri, Mar 18, 2016 at 12:48 AM, Mich Talebzadeh wrote: > Hi Tony, > > Is > >

Re: How to add an accumulator for a Set in Spark

2016-03-19 Thread Adrien Mogenet
Btw, here is a great article about accumulators and all their related traps! http://imranrashid.com/posts/Spark-Accumulators/ (I'm not the author) On 16 March 2016 at 18:24, swetha kasireddy wrote: > OK. I did take a look at them. So once I have an accumulater for a >

Spark jobs running in serial mode on yarn client

2016-03-19 Thread Mich Talebzadeh
Hi, I am testing some parallel processing of Spark applications. I have a two node spark cluster and currently running two worker processes on each in Yarn-client mode. The master has 12 cores and 24GB of RAM. The worker node has 4GB of RAM and 2 cores (well an old 32 bit host). The OS on both

Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
See the instructions in the Spark documentation: https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 On Wed, Mar 16, 2016 at 7:05 PM satyajit vegesna wrote: > > > Hi, > > Scala version:2.11.7(had to upgrade the scala verison to enable case

Re: Restarting an executor during execution causes it to lose AWS credentials (anyone seen this?)

2016-03-19 Thread Steve Loughran
On 17 Mar 2016, at 16:01, Allen George wrote: Hi guys, I'm having a problem where respawning a failed executor during a job that reads/writes parquet on S3 causes subsequent tasks to fail because of missing AWS keys. Setup: I'm using

Zip File and XML parsing with Spark Streaming

2016-03-19 Thread tjb305
Hello, I am trying to figure out how to unzip zip files in Spark Streaming. Within each zip file will be a series of xml files which will also need parsing. Are there libraries that work with DStream that parse a zip or parse an xml file? I have seen the databricks xml library but I do not

Re: Unit test with sqlContext

2016-03-19 Thread Vikas Kawadia
If you prefer the py.test framework, I just wrote a blog post with some examples: Unit testing Apache Spark with py.test https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b On Fri, Feb 5, 2016 at 11:43 AM, Steve Annessa wrote: > Thanks for

best practices: running multi user jupyter notebook server

2016-03-19 Thread Andy Davidson
We are considering deploying a notebook server for use by two kinds of users: 1. Interactive dashboards (i.e. forms allow users to select data sets and visualizations; review real-time graphs of data captured by our spark streams). 2. General notebooks for Data Scientists. My concern is

Fwd: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread satyajit vegesna
Hi, Scala version: 2.11.7 (had to upgrade the scala version to enable case classes to accept more than 22 parameters.) Spark version: 1.6.1. PFB pom.xml Getting below error when trying to set up spark on the intellij IDE, 16/03/16 18:36:44 INFO spark.SparkContext: Running Spark version 1.6.1

df.dtypes -> pyspark.sql.types

2016-03-19 Thread Ruslan Dautkhanov
Hello, Looking at https://spark.apache.org/docs/1.5.1/api/python/_modules/pyspark/sql/types.html and can't wrap my head around how to convert string data type names to actual pyspark.sql.types data types? Does pyspark.sql.types have an interface to return StringType() for "string",

Joins in Spark

2016-03-19 Thread Stuti Awasthi
Hi All, I have to join 2 files, both not very big, say a few MBs only, but the result can be huge, say generating 500GBs to TBs of data. Now I have tried using spark's join() function but I'm noticing that the join is executing on only 1 or 2 nodes at the max. Since I have a cluster size of 5 nodes, I

Extra libs in executor classpath

2016-03-19 Thread Леонид Поляков
Hello, guys! I've been developing a kind of framework on top of spark, and my idea is to bundle the framework jars and some extra configs with spark and pass it to other developers for their needs, so that devs can use this bundle and run the usual spark stuff but with the extra flavor that

Re: The error to read HDFS custom file in spark.

2016-03-19 Thread Jakob Odersky
Doesn't FileInputFormat require type parameters? Like so: class RawDataInputFormat[LW <: LongWritable, RD <: RDRawDataRecord] extends FileInputFormat[LW, RD] I haven't verified this but it could be related to the compile error you're getting. On Thu, Mar 17, 2016 at 9:53 AM, Benyi Wang
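A minimal sketch of that signature, assuming the Hadoop mapreduce API and that RDRawDataRecord is the user's own Writable record class; RawDataRecordReader is a hypothetical reader:

    import org.apache.hadoop.io.LongWritable
    import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    // The key/value types are concrete type arguments, not abstract type bounds.
    class RawDataInputFormat extends FileInputFormat[LongWritable, RDRawDataRecord] {
      override def createRecordReader(split: InputSplit, ctx: TaskAttemptContext)
          : RecordReader[LongWritable, RDRawDataRecord] =
        new RawDataRecordReader()  // hypothetical custom reader
    }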

Improving Spark Scheduler Delay

2016-03-19 Thread Prabhu Joseph
Hi All, On running concurrent Spark jobs (huge number of tasks) with the same Spark context, there is high scheduler delay. We have the FAIR schedulingPolicy set and we also tried a different pool for each job, but still no improvement. What are the ways to tune scheduler delay? Thanks,

Re: Checkpoint of DStream joined with RDD

2016-03-19 Thread Lubomir Nerad
The HBase version is 1.0.1.1. Thanks, Lubo On 18.3.2016 17:29, Ted Yu wrote: I looked at the places in SparkContext.scala where NewHadoopRDD is constructed. It seems the Configuration object shouldn't be null. Which hbase release are you using (so that I can see which line the NPE came

Re: spark shuffle service on yarn

2016-03-19 Thread Steve Loughran
On 19 Mar 2016, at 02:25, Koert Kuipers wrote: spark on yarn is nice because I can bring my own spark. I am worried that the shuffle service forces me to use some "sanctioned" spark version that is officially "installed" on the cluster. so... can

Re: Error using collectAsMap() in scala

2016-03-19 Thread Ted Yu
It is defined in: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala On Thu, Mar 17, 2016 at 8:55 PM, Shishir Anshuman wrote: > I am using following code snippet in scala: > > > *val dict: RDD[String] = sc.textFile("path/to/csv/file")* > *val
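A minimal sketch, assuming a spark-shell session; collectAsMap becomes available once the RDD is a pair RDD (via the implicit PairRDDFunctions conversion):

    // Stand-in for sc.textFile("path/to/csv/file") with "key,value" lines.
    val dict = sc.parallelize(Seq("a,1", "b,2"))
    val asMap: Map[String, String] = dict
      .map(_.split(","))
      .map(parts => (parts(0), parts(1)))  // RDD[(String, String)] -> pair RDD
      .collectAsMap()
      .toMap  // collectAsMap returns scala.collection.Map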

Setting up spark to run on two nodes

2016-03-19 Thread Ashok Kumar
Experts. Please share your valued advice. I have spark 1.5.2 set up as standalone for now and I have started the master as below: start-master.sh I have also modified the config/slave file to have: # A Spark Worker will be started on each of the machines listed below. localhost workerhost On the localhost

Re: [Error] : dynamically union All + adding new column

2016-03-19 Thread Ted Yu
bq. .drop("Col9") Could it be due to the above ? On Wed, Mar 16, 2016 at 7:29 PM, Divya Gehlot wrote: > Hi, > I am dynamically doing union all and adding new column too > > val dfresult = >>

Re: Incomplete data when reading from S3

2016-03-19 Thread DB Tsai
You need to use wholeTextFiles to read the whole file at once. Otherwise, it can be split. DB Tsai - Sent From My Phone On Mar 17, 2016 12:45 AM, "Blaž Šnuderl" wrote: > Hi. > > We have json data stored in S3 (json record per line). When reading the > data from s3 using the
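A minimal sketch, assuming a spark-shell session; the bucket and prefix are placeholders:

    // Each element is (path, entire file content), so a record spanning
    // lines can never be split across partitions.
    val whole = sc.wholeTextFiles("s3n://my-bucket/json-data/")
    val fileContents = whole.map { case (_, content) => content }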

Re: ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-19 Thread Bryan Jeffrey
Cody et al., I am seeing a similar error. I've increased the number of retries. Once I've got a job up and running I'm seeing it retry correctly. However, I am having trouble getting the job started - the number of retries does not seem to help with startup behavior. Thoughts? Regards, Bryan

Re: Unit testing framework for Spark Jobs?

2016-03-19 Thread Vikas Kawadia
I just wrote a blog post on Unit testing Apache Spark with py.test https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b If you prefer using the py.test framework, then it might be useful. -vikas On Wed, Mar 2, 2016 at 10:59 AM, radoburansky

Apache Beam Spark runner

2016-03-19 Thread Sela, Amit
Hi all, The Apache Beam Spark runner is now available at: https://github.com/apache/incubator-beam/tree/master/runners/spark Check it out! The Apache Beam (http://beam.incubator.apache.org/) project is a unified model for building data pipelines using Google’s Dataflow programming model, and

Re: installing packages with pyspark

2016-03-19 Thread Jakob Odersky
Hi, regarding 1, packages are resolved locally. That means that when you specify a package, spark-submit will resolve the dependencies and download any jars on the local machine, before shipping* them to the cluster. So, without a priori knowledge of dataproc clusters, it should be no different to

Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Jörn Franke
How much data are you querying? What is the query? How selective is it supposed to be? What is the block size? > On 16 Mar 2016, at 11:23, Joseph wrote: > > Hi all, > > I have known that ORC provides three levels of indexes within each file, file > level, stripe level, and

Stop spark application when the job is complete.

2016-03-19 Thread Imre Nagi
Hi, I have a spark application for batch processing in a standalone cluster. The job is to query the database and then do some transformation, aggregation, and several actions such as indexing the result into elasticsearch. If I don't call sc.stop(), the spark application won't stop and take

Setting up log4j2/logback with Spark 1.6.0

2016-03-19 Thread Yuval.Itzchakov
I've been trying to get log4j2 and logback to get to play nice with Spark 1.6.0 so I can properly offload my logs to a remote server. I've attempted the following things: 1. Setting logback/log4j2 on the class path for both the driver and worker nodes 2. Passing -Dlog4j.configurationFile= and

Re: [discuss] making SparkEnv private in Spark 2.0

2016-03-19 Thread Reynold Xin
On Wed, Mar 16, 2016 at 3:29 PM, Mridul Muralidharan wrote: > b) Shuffle manager (to get shuffle reader) > What's the use case for shuffle manager/reader? This seems like using super internal APIs in applications.

Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
Err, whoops, looks like this is a user app and not building Spark itself, so you'll have to change your deps to use the 2.11 versions of Spark. e.g. spark-streaming_2.10 -> spark-streaming_2.11. On Wed, Mar 16, 2016 at 7:07 PM Josh Rosen wrote: > See the instructions

Re: Joins in Spark

2016-03-19 Thread Rishi Mishra
My suspicion is that your input file partitions are small. Hence a small number of tasks is started. Can you provide some more details, like how you load the files and how the result size gets to around 500GBs? Regards, Rishitesh Mishra, SnappyData . (http://www.snappydata.io/)
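A minimal sketch of working around small inputs, assuming a spark-shell session with stand-in data; 40 is an illustrative partition count (size it to the cluster's total cores):

    val left  = sc.parallelize(1 to 1000).map(i => (i % 100, i))
    val right = sc.parallelize(1 to 1000).map(i => (i % 100, i * 2))
    // An explicit partition count spreads the join shuffle over more tasks.
    val joined = left.join(right, 40)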

How to gracefully handle Kafka OffsetOutOfRangeException

2016-03-19 Thread Ramkumar Venkataraman
I am using Spark streaming and reading data from Kafka using KafkaUtils.createDirectStream. I have the "auto.offset.reset" set to smallest. But in some Kafka partitions, I get kafka.common.OffsetOutOfRangeException and my spark job crashes. I want to understand if there is a graceful way to

Re: ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-19 Thread Cody Koeninger
That's a networking error when the driver is attempting to contact leaders to get the latest available offsets. If it's a transient error, you can look at increasing the value of spark.streaming.kafka.maxRetries, see http://spark.apache.org/docs/latest/configuration.html If it's not a transient

Re: Spark configuration with 5 nodes

2016-03-19 Thread Steve Loughran
On 11 Mar 2016, at 16:25, Mich Talebzadeh wrote: Hi Steve, My argument has always been that if one is going to use Solid State Disks (SSD), it makes sense to have them for the NN disks: start-up from fsimage etc. Obviously NN lives in

Re: Checkpoint of DStream joined with RDD

2016-03-19 Thread Ted Yu
This is the line where the NPE came from: if (conf.get(SCAN) != null) { So the Configuration instance was null. On Fri, Mar 18, 2016 at 9:58 AM, Lubomir Nerad wrote: > The HBase version is 1.0.1.1. > > Thanks, > Lubo > > > On 18.3.2016 17:29, Ted Yu wrote: > > I looked

What is the most efficient and scalable way to get all the recommendation results from ALS model ?

2016-03-19 Thread Hiroyuki Yamada
Hi, I'm testing Collaborative Filtering with MLlib. Making a model by ALS.trainImplicit (or train) seems scalable as far as I have tested, but I'm wondering how I can get all the recommendation results efficiently. The predictAll method can get all the results, but it needs the whole

RE: Subquery performance

2016-03-19 Thread Younes Naguib
Any way to cache the subquery or force a broadcast join without persisting it? y From: Michael Armbrust [mailto:mich...@databricks.com] Sent: March-17-16 8:59 PM To: Younes Naguib Cc: user@spark.apache.org Subject: Re: Subquery performance Try running EXPLAIN on both versions of the query.
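One option (a sketch, assuming Spark 1.5+ where the broadcast() hint exists) is to mark the subquery side of the join as broadcastable without persisting it; tab1/tab2/col1/col2 are the names from this thread:

    import org.apache.spark.sql.functions.broadcast
    val small = sqlContext.sql("SELECT col2, count(1) AS cnt FROM tab2 GROUP BY col2")
    val big = sqlContext.table("tab1")
    // The hint tells the planner to broadcast `small` instead of shuffling both sides.
    val joined = big.join(broadcast(small), big("col1") === small("col2"))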

[Spark-1.5.2]Column renaming with withColumnRenamed has no effect

2016-03-19 Thread Divya Gehlot
Hi, I am adding a new column and renaming it at the same time, but the renaming doesn't have any effect. dffiltered = >
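One common cause (not confirmed in this thread) is that DataFrames are immutable, so the renamed result must be captured; a minimal sketch with hypothetical column names:

    // withColumnRenamed returns a new DataFrame and never mutates the original.
    val renamed = dffiltered.withColumnRenamed("Col9", "Col9Renamed")
    renamed.printSchema()  // shows the new name; dffiltered itself is unchanged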

Re: Spark configuration with 5 nodes

2016-03-19 Thread Mich Talebzadeh
Thank you for the info, Steve. I always believed (IMO) that there is an optimal position where one can plot the projected NN memory (assuming 1GB -> 40TB of data) against the number of nodes. For example, heuristically, how many nodes would be sufficient for 1PB of storage with nodes each having 512GB of

Re: best way to do deep learning on spark ?

2016-03-19 Thread charles li
Hi, Alexander, that's awesome, and when will that feature be released? Since I want to know the opportunity cost between waiting for that release and using Caffe or TensorFlow. Great thanks again On Thu, Mar 17, 2016 at 10:32 AM, Ulanov, Alexander < alexander.ula...@hpe.com> wrote: > Hi

RE: The build-in indexes in ORC file does not work.

2016-03-19 Thread Wietsma, Tristan A.
Regarding bloom filters, https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-12417 From: Joseph Sent: Wednesday, March 16, 2016 9:46:25 AM To: user Cc: user; user Subject: Re: Re: The build-in

Re: best way to do deep learning on spark ?

2016-03-19 Thread Daniel Darabos
On Thu, Mar 17, 2016 at 3:51 AM, charles li wrote: > Hi, Alexander, > > that's awesome, and when will that feature be released ? Since I want to > know the opportunity cost between waiting for that release and use caffe or > tensorFlow ? > I don't expect MLlib will be

Rename Several Aggregated Columns

2016-03-19 Thread Andres.Fernandez
Good morning. I have a dataframe and would like to group by two fields, and perform a sum aggregation on more than 500 fields, though I would like to keep the same names for the 500 fields (instead of sum(Field)). I do have the field names in an array. Could anybody help with this

best way to do deep learning on spark ?

2016-03-19 Thread charles li
Hi, guys, I'm new to MLlib on spark. After reading the documentation, it seems that MLlib does not support deep learning. I want to know whether there is any way to implement deep learning on spark. *Must I use a 3rd-party package like Caffe or TensorFlow?* or *Does deep learning module list in the MLlib

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2016-03-19 Thread Cody Koeninger
Is that happening only at startup, or during processing? If that's happening during normal operation of the stream, you don't have enough resources to process the stream in time. There's not a clean way to deal with that situation, because it's a violation of preconditions. If you want to

unix_timestamp() time zone problem

2016-03-19 Thread Andy Davidson
I am using python spark 1.6 and the --packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 I need to convert a time stamp string into a unix epoch time stamp. The function unix_timestamp() assumes the current time zone. However, my string data is UTC and encodes the time zone as zero.
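A minimal sketch of one workaround, assuming Java 7+ SimpleDateFormat patterns and an ISO-8601-like string; the column name and format are illustrative:

    import org.apache.spark.sql.functions.{col, unix_timestamp}
    // 'X' parses the zone designator (including a trailing 'Z'), so the UTC
    // offset in the string wins over the JVM default time zone.
    val withEpoch = df.withColumn(
      "epoch", unix_timestamp(col("utc_string"), "yyyy-MM-dd'T'HH:mm:ss.SSSX"))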

[discuss] making SparkEnv private in Spark 2.0

2016-03-19 Thread Reynold Xin
Any objections? Please articulate your use case. SparkEnv is a weird one because it was documented as "private" but not marked as so in class visibility. * NOTE: This is not intended for external use. This is exposed for Shark and may be made private * in a future release. I do see Hive

Re: Unit testing framework for Spark Jobs?

2016-03-19 Thread Lars Albertsson
I would recommend against writing unit tests for Spark programs, and instead focus on integration tests of jobs or pipelines of several jobs. You can still use a unit test framework to execute them. Perhaps this is what you meant. You can use any of the popular unit test frameworks to drive your

Subquery performance

2016-03-19 Thread Younes Naguib
Hi all, I'm running a query that looks like the following: Select col1, count(1) From (Select col2, count(1) from tab2 group by col2) Inner join tab1 on (col1=col2) Group by col1 This creates a very large shuffle, 10 times the data size, as if the subquery was executed for each row. Anything

Re: installing packages with pyspark

2016-03-19 Thread Franc Carter
I'm having trouble with that for pyspark, yarn and graphframes. I'm using: pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5 which starts and gives me a REPL, but when I try from graphframes import * I get "No module named graphframes". Without '--master yarn' it

Re: How to add an accumulator for a Set in Spark

2016-03-19 Thread swetha kasireddy
OK. I did take a look at them. So once I have an accumulator for a HashSet, how can I check if a particular key is already present in the HashSet accumulator? I don't see any .contains method there. My requirement is that I need to keep accumulating the keys in the HashSet across all the tasks in
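A minimal sketch of a Set accumulator, assuming Spark 1.x's AccumulableParam; note that .value (and hence .contains) is only meaningful on the driver after an action completes, while tasks can only add:

    import org.apache.spark.AccumulableParam
    import scala.collection.mutable

    object HashSetParam extends AccumulableParam[mutable.HashSet[String], String] {
      def addAccumulator(acc: mutable.HashSet[String], elem: String) = { acc += elem; acc }
      def addInPlace(a: mutable.HashSet[String], b: mutable.HashSet[String]) = { a ++= b; a }
      def zero(initial: mutable.HashSet[String]) = new mutable.HashSet[String]()
    }

    val keys = sc.accumulable(new mutable.HashSet[String]())(HashSetParam)
    sc.parallelize(Seq("a", "b", "a")).foreach(k => keys += k)
    println(keys.value.contains("a"))  // driver-side check only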

Checkpoint of DStream joined with RDD

2016-03-19 Thread Lubomir Nerad
Hi, I tried to replicate the example of joining a DStream with a lookup RDD from http://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation. It works fine, but when I enable checkpointing for the StreamingContext and let the application recover from a previously

Re: [Error] : dynamically union All + adding new column

2016-03-19 Thread Ted Yu
It turned out that Col1 appeared twice in the select :-) > On Mar 16, 2016, at 7:29 PM, Divya Gehlot wrote: > > Hi, > I am dynamically doing union all and adding new column too > >> val dfresult = >>

Re: Get Pair of Topic and Message from Kafka + Spark Streaming

2016-03-19 Thread Cody Koeninger
There's 1 topic per partition, so you're probably better off dealing with topics that way rather than at the individual message level. http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers Look at the discussion of "HasOffsetRanges" If you
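A minimal sketch of that pattern from the integration guide, assuming a direct stream named stream; process() is a placeholder:

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      // One Kafka topic-partition per Spark partition for a direct stream.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val topic = offsetRanges(TaskContext.get.partitionId).topic
        iter.foreach(msg => process(topic, msg))  // process() is hypothetical
      }
    }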

Re: installing packages with pyspark

2016-03-19 Thread Jakob Odersky
> But I guess I cannot add a package once I launch the pyspark context, right? Correct. Potentially, if you really really wanted to, you could maybe (with lots of pain) load packages dynamically with some class-loader black magic, but Spark does not provide that functionality. On Thu, Mar 17,

Re: Rename Several Aggregated Columns

2016-03-19 Thread Sunitha Kambhampati
One way is to rename the columns using the toDF For eg: val df = Seq((1, 2),(1,4),(2,3) ).toDF("a","b") df.printSchema() val renamedf = df.groupBy('a).agg(sum('b)).toDF("mycola", "mycolb") renamedf.printSchema() Best regards, Sunitha > On Mar 18, 2016, at 9:10 AM,
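For hundreds of fields, aliasing inside agg avoids listing the names twice; a minimal sketch where aggCols stands in for the real field-name array and the grouping keys are hypothetical:

    import org.apache.spark.sql.functions.sum
    val aggCols = Array("f1", "f2", "f3")  // stand-in for the ~500 field names
    val exprs = aggCols.map(c => sum(c).alias(c))  // sum(f1) keeps the name f1
    val result = df.groupBy("key1", "key2").agg(exprs.head, exprs.tail: _*)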

Re: HP customer support @ www.globalpccure.com/Support/Support-for-HP.aspx

2016-03-19 Thread nsalian
Please refrain from posting such messages on this email thread. This is specific to the Spark ecosystem and not an avenue to advertise an entity/company. Thank you. - Neelesh S. Salian Cloudera -- View this message in context:

Re: ClassNotFoundException in RDD.map

2016-03-19 Thread Ted Yu
bq. $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1 Do you mind showing more of your code involving the map() ? On Thu, Mar 17, 2016 at 8:32 AM, Dirceu Semighini Filho < dirceu.semigh...@gmail.com> wrote: > Hello, > I found a strange behavior after executing a prediction with MLIB. > My code

[Error] : dynamically union All + adding new column

2016-03-19 Thread Divya Gehlot
Hi, I am dynamically doing union all and adding new column too val dfresult = > dfAcStamp.select("Col1","Col1","Col3","Col4","Col5","Col6","col7","col8","col9") > val schemaL = dfresult.schema > var dffiltered = sqlContext.createDataFrame(sc.emptyRDD[Row], schemaL) > for ((key,values) <- lcrMap)
