Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

2018-04-05 Thread Andy Davidson
ser @spark" <user@spark.apache.org> Subject: Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk( > FYI, there is a PR and JIRA for virtualEnv support in PySpark > > https://issues.apache.org/jira/browse/SPARK-13587 > https://github.com/apache/spark/pull/1359

Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

2018-04-05 Thread Andy Davidson
FYI http://www.learn4master.com/algorithms/pyspark-unit-test-set-up-sparkcontext From: Andrew Davidson Date: Wednesday, April 4, 2018 at 5:36 PM To: "user @spark" Subject: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError:

Re: Union of multiple data frames

2018-04-05 Thread Andy Davidson
Hi Cesar, I have used Brandon's approach in the past without any problem. Andy From: Brandon Geise Date: Thursday, April 5, 2018 at 11:23 AM To: Cesar , "user @spark" Subject: Re: Union of multiple data frames > Maybe

how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

2018-04-04 Thread Andy Davidson
I am having a heck of a time setting up my development environment. I used pip to install pyspark. I also downloaded spark from apache. My eclipse pyDev interpreter is configured as a python3 virtualenv. I have a simple unit test that loads a small dataframe. df.show() generates the following
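
For reference, a minimal sketch of the kind of unit-test fixture described here, creating a local SparkContext inside a python3 virtualenv (the class, test names, and the local[2] master are assumptions, not from the thread):

    import unittest
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    class DataFrameTest(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            # run locally with 2 threads; no cluster needed for unit tests
            conf = SparkConf().setMaster("local[2]").setAppName("unit-test")
            cls.sc = SparkContext(conf=conf)
            cls.sqlContext = SQLContext(cls.sc)

        @classmethod
        def tearDownClass(cls):
            cls.sc.stop()

        def test_load_small_dataframe(self):
            df = self.sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "tag"])
            self.assertEqual(df.count(), 2)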

trouble with 'pip pyspark' pyspark.sql.functions. "unresolved import" for col() and lit()

2018-04-04 Thread Andy Davidson
I am having trouble setting up my python3 virtualenv. I created a virtualenv 'spark-2.3.0' and installed pyspark using pip; however I am not able to import pyspark.sql.functions. I get "unresolved import" when I try to import col() and lit(): from pyspark.sql.functions import * I found if I

Re: how to create all possible combinations from an array? how to join and explode row array?

2018-03-30 Thread Andy Davidson
> +----+-----+
> |   a|    b|
> +----+-----+
> |john|  red|
> |john| blue|
> |john|  red|
> |bill| blue|
> |bill|  red|
> | sam|green|
> +----+-----+
>
> distData.as("tbl1").join(distData.as("tbl2"), Seq("a"), "fullouter")

Re: how to create all possible combinations from an array? how to join and explode row array?

2018-03-30 Thread Andy Davidson
I was a little sloppy when I created the sample output. It's missing a few pairs. Assume for a given row I have [a, b, c]; I want to create something like the cartesian join From: Andrew Davidson Date: Friday, March 30, 2018 at 5:54 PM To: "user @spark"

how to create all possible combinations from an array? how to join and explode row array?

2018-03-30 Thread Andy Davidson
I have a dataframe and execute df.groupBy("xyzy").agg( collect_list("abc") ). This produces a column of type array. Now for each row I want to create multiple pairs/tuples from the array so that I can build a contingency table. Any idea how I can transform my data so that I can call crosstab()? The
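
One hedged sketch of such a transformation (the column names xyzy/abc come from the message; the self-join approach is an assumption, not the thread's final answer): explode the collected array twice and join on the group key to get every pair, then call crosstab().

    from pyspark.sql import functions as F

    grouped = df.groupBy("xyzy").agg(F.collect_list("abc").alias("items"))

    # explode the array twice and join on the group key to get the
    # cartesian product of each row's array with itself
    left = grouped.select("xyzy", F.explode("items").alias("left"))
    right = grouped.select("xyzy", F.explode("items").alias("right"))
    pairs = left.join(right, "xyzy")

    # contingency table over the generated pairs
    pairs.crosstab("left", "right").show()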

newbie: how to partition data on file system. What are best practices?

2017-11-22 Thread Andy Davidson
I am working on a deep learning project. Currently we do everything on a single machine. I am trying to figure out how we might be able to move to a clustered spark environment. Clearly it's possible a machine or job on the cluster might fail, so I assume that the data needs to be replicated to

does "Deep Learning Pipelines" scale out linearly?

2017-11-22 Thread Andy Davidson
I am starting a new deep learning project. Currently we do all of our work on a single machine using a combination of Keras and TensorFlow. https://databricks.github.io/spark-deep-learning/site/index.html looks very promising. Any idea how performance is likely to improve as I add machines to my

anyone know what the status of spark-ec2 is?

2016-09-06 Thread Andy Davidson
Spark-ec2 used to be part of the spark distribution. It now seems to be split into a separate repo https://github.com/amplab/spark-ec2 It does not seem to be listed on https://spark-packages.org/ Does anyone know what the status is? There is a readme.md; however I am unable to find any release

Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andy Davidson
pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp > Do you have a file called tmp at / on HDFS? > > On Thu, Aug 18, 2016 at 2:57 PM -0700, "Andy Davidson" > <a...@santacru

pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andy Davidson
For some unknown reason I cannot create a UDF when I run the attached notebook on my cluster. I get the following error: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path

Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-18 Thread Andy Davidson
Hi, I am using python3, Java8 and spark-1.6.1. I am running my code in a Jupyter notebook. The following code runs fine on my mac: udfRetType = ArrayType(StringType(), True) findEmojiUDF = udf(lambda s : re.findall(emojiPattern2, s), udfRetType) retDF = (emojiSpecialDF # convert into a

py4j.Py4JException: Method lower([class java.lang.String]) does not exist

2016-08-18 Thread Andy Davidson
I am still working on spark-1.6.1. I am using java8. Any idea why (df.select("*", functions.lower("rawTag").alias("tag")) would raise py4j.Py4JException: Method lower([class java.lang.String]) does not exist Thanks in advance Andy

FW: [jupyter] newbie. apache spark python3 'Jupyter' data frame problem with auto completion and accessing documentation

2016-08-02 Thread Andy Davidson
…'Jupyter' data frame problem with auto completion and accessing documentation > Hi Andy, > > On 1 August 2016 at 22:46, Andy Davidson <a...@santacruzintegration.com> > wrote: >> I wonder if the auto completion problem has to do with the way code is >> typically written in

Re: spark 1.6.0 read s3 files error.

2016-08-02 Thread Andy Davidson
Hi Freedafeng, I have been reading and writing to s3 using spark-1.6.x without any problems. Can you post a little code example and any error messages? Andy From: freedafeng Date: Tuesday, August 2, 2016 at 9:26 AM To: "user @spark" Subject:

python 'Jupyter' data frame problem with autocompletion

2016-08-01 Thread Andy Davidson
I started using python3 and jupyter in a chrome browser. I seem to be having trouble with data frame code completion. Regular python functions seem to work correctly. I wonder if I need to import something so the notebook knows about data frames? Kind regards Andy

Re: how to copy local files to hdfs quickly?

2016-07-30 Thread Andy Davidson
For lack of a better solution I am using 'AWS s3 copy' to copy my files locally and 'hadoop fs -put ./tmp/*' to transfer them. In general put works much better with a smaller number of big files compared to a large number of small files. Your mileage may vary Andy From: Andrew Davidson

Re: use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming.

2016-07-30 Thread Andy Davidson
>>> found this worked, but lacking in a few ways so I started this project: >>> https://github.com/EntilZha/spark-s3 >>> >>> This takes that idea further by: >>> 1. Rather than sc.parallelize, implement the RDD interface where each >>> partition is d

use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming.

2016-07-29 Thread Andy Davidson
cket", "file1", >> "folder2").regularRDDOperationsHere or import implicits and do >> sc.s3.textFileByPrefix >> >> At present, I am battle testing and benchmarking it at my current job and >> results are promising with significant improvements to

Re: spark 1.6.0 read s3 files error.

2016-07-28 Thread Andy Davidson
Hi Freedafeng, Can you tell us a little more? I.e., can you paste your code and error message? Andy From: freedafeng Date: Thursday, July 28, 2016 at 9:21 AM To: "user @spark" Subject: Re: spark 1.6.0 read s3 files error. > The question is, what

Re: performance problem when reading lots of small files created by spark streaming.

2016-07-28 Thread Andy Davidson
> Since I hadn't intended to advertise this quite yet the documentation is not > super polished but exists here: > http://spark-s3.entilzha.io/latest/api/#io.entilzha.spark.s3.S3Context > > I am completing the sonatype process for publishing artifacts on maven central > (t

performance problem when reading lots of small files created by spark streaming.

2016-07-27 Thread Andy Davidson
I have a relatively small data set; however it is split into many small JSON files. Each file is between maybe 4K and 400K. This is probably a very common issue for anyone using spark streaming. My streaming app works fine; however my batch application takes several hours to run. All I am doing

Re: A question about Spark Cluster vs Local Mode

2016-07-27 Thread Andy Davidson
Hi Ascot, When you run in cluster mode it means your cluster manager will cause your driver to execute on one of the workers in your cluster. The advantage of this is you can log on to a machine in your cluster, submit your application, and then log out. The application will continue to run.

how to copy local files to hdfs quickly?

2016-07-27 Thread Andy Davidson
I have a spark streaming app that saves JSON files to s3:// . It works fine. Now I need to calculate some basic summary stats and am running into horrible performance problems. I want to run a test to see if reading from hdfs instead of s3 makes a difference. I am able to quickly copy the data

spark-2.x what is the default version of java ?

2016-07-27 Thread Andy Davidson
I currently have to configure spark-1.x to use Java 8 and python 3.x. I noticed that http://spark.apache.org/releases/spark-release-2-0-0.html#removals mentions java 7 is deprecated. Is the default now Java 8? Thanks Andy Deprecations: The following features have been deprecated in Spark

Re: spark 1.6.0 read s3 files error.

2016-07-27 Thread Andy Davidson
Hi Freedafeng, The following works for me. df will be a data frame; fullPath is a list of the various part files stored in s3. fullPath = ['s3n:///json/StreamingKafkaCollector/s1/2016-07-10/146817304/part-r -0-a2121800-fa5b-44b1-a994-67795' ] from pyspark.sql import SQLContext
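
For context, a minimal sketch of that approach (the bucket name and glob pattern here are hypothetical; the truncated path above is left exactly as posted):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    # a glob also works, instead of enumerating every part file
    df = sqlContext.read.json("s3n://my-bucket/json/StreamingKafkaCollector/s1/2016-07-10/*/part-r-*")
    df.show()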

spark-2.0 support for spark-ec2 ?

2016-07-27 Thread Andy Davidson
Congratulations on releasing 2.0! spark-2.0.0-bin-hadoop2.7 no longer includes the spark-ec2 script. However http://spark.apache.org/docs/latest/index.html has a link to the spark-ec2 github repo https://github.com/amplab/spark-ec2 Is this the right group to discuss spark-ec2? Any idea how

getting more concurrency best practices

2016-07-26 Thread Andy Davidson
Below is a very simple application. It runs very slowly. It does not look like I am getting a lot of parallel execution. I imagine this is a very common workflow. Periodically I want to run some standard summary statistics across several different data sets. Any suggestions would be greatly

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Andy Davidson
Yup, in cluster mode you need to figure out what machine the driver is running on. That is the machine the UI will run on. https://issues.apache.org/jira/browse/SPARK-15829 You may also have firewall issues. Here are some notes I made about how to figure out what machine the driver is running on

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread Andy Davidson
Hi Kevin, Just a heads up: at the recent spark summit in S.F. there was a presentation about streaming in 2.0. They said that streaming was not going to be production ready in 2.0. I am not sure if the older 1.6.x version will be supported. My project will not be able to upgrade with streaming

Re: Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space

2016-07-22 Thread Andy Davidson
ead "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space > How much heap memory do you give the driver ? > > On Fri, Jul 22, 2016 at 2:17 PM, Andy Davidson <a...@santacruzintegration.com> > wrote: >> Given I get a stack trace in my python notebook I am

Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space

2016-07-22 Thread Andy Davidson
Given I get a stack trace in my python notebook, I am guessing the driver is running out of memory? My app is simple: it creates a list of dataFrames from s3:// and counts each one. I would not think this would take a lot of driver memory. I am not running my code locally. It's using 12 cores. Each

running jupyter notebook server Re: spark and plot data

2016-07-22 Thread Andy Davidson
in our cluster (1.5.0) I can't get this. > > Do you have any solution? > > Thanks > > 2016-07-21 18:44 GMT+02:00 Andy Davidson <a...@santacruzintegration.com>: >> Hi Pseudo >> >> Plotting, graphing, data visualization, report generation are common ne

Re: How to submit app in cluster mode? port 7077 or 6066

2016-07-21 Thread Andy Davidson
local:6066 (cluster mode) > > Thanks > Saisai > > On Fri, Jul 22, 2016 at 6:44 AM, Andy Davidson <a...@santacruzintegration.com> > wrote: >> I have some very long lived streaming apps. They have been running for >> several months. I wonder if something has change

How to submit app in cluster mode? port 7077 or 6066

2016-07-21 Thread Andy Davidson
I have some very long lived streaming apps. They have been running for several months. I wonder if something has changed recently? I first started working with spark-1.3. I am using the standalone cluster manager. The way I would submit my app to run in cluster mode was port 6066. Looking at

Re: spark and plot data

2016-07-21 Thread Andy Davidson
Hi Pseudo, Plotting, graphing, data visualization, and report generation are common needs in scientific and enterprise computing. Can you tell me more about your use case? What is it about the current process / workflow that you think could be improved by pushing plotting (I assume you mean plotting

Re: write and call UDF in spark dataframe

2016-07-20 Thread Andy Davidson
Hi Divya, In general you will get better performance if you can minimize your use of UDFs. Spark 2.0/Tungsten does a lot of code generation. It will have to treat your UDF as a black box. Andy From: Rishabh Bhardwaj Date: Wednesday, July 20, 2016 at 4:22 AM To: Rabin

Re: Role-based S3 access outside of EMR

2016-07-19 Thread Andy Davidson
Hi Everett, I always do my initial data exploration and all our product development in my local dev env. I typically select a small data set and copy it to my local machine. My main() has an optional command line argument '--runLocal'. Normally I load data from either hdfs:/// or S3n:// . If the
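
A minimal sketch of the pattern Andy describes (the --runLocal flag name is from the message; the argparse wiring and paths are assumptions):

    import argparse
    from pyspark import SparkContext

    parser = argparse.ArgumentParser()
    parser.add_argument("--runLocal", action="store_true")
    args, _ = parser.parse_known_args()

    sc = SparkContext(appName="exploration")
    # read a small local sample in dev, the full data set otherwise
    path = "file:///tmp/sample.json" if args.runLocal else "s3n://my-bucket/full/"
    raw = sc.textFile(path)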

Re: Trouble while running spark at ec2 cluster

2016-07-18 Thread Andy Davidson
Hi Hassan, Typically I log on to my master to submit my app.
[ec2-user@ip-172-31-11-222 bin]$ echo $SPARK_ROOT
/root/spark
[ec2-user@ip-172-31-11-222 bin]$ echo $MASTER_URL
spark://ec2-54-215-11-222.us-west-1.compute.amazonaws.com:7077
[ec2-user@ip-172-31-11-222 bin]$

/spark-ec2 script: trouble using ganglia web ui spark 1.6.1

2016-07-11 Thread Andy Davidson
I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 script. The script output shows ganglia started; however I am not able to access http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:5080/ganglia. I have tried using the private IP from within my data center. I do not see anything listening

Re: Spark Streaming - Direct Approach

2016-07-11 Thread Andy Davidson
Hi Pradeep, I cannot comment about experimental vs. production; however I recently started a POC using the direct approach. It's been running off and on for about 2 weeks. In general it seems to work really well. One thing that is not clear to me is how the cursor is managed. E.g. I have my topic set

trouble accessing driver log files using rest-api

2016-07-11 Thread Andy Davidson
I am running spark-1.6.1 and the standalone cluster manager. I am running into performance problems with spark streaming and added some extra metrics to my log files. I submit my app in cluster mode (i.e. the driver runs on a slave, not the master). I am not able to get the driver log files while

WARN FileOutputCommitter: Failed to delete the temporary output directory of task: attempt_201607111453_128606_m_000000_0 - s3n://

2016-07-11 Thread Andy Davidson
I am running into serious performance problems with my spark 1.6 streaming app. As it runs it gets slower and slower. My app is simple:
* It receives fairly large and complex JSON files (twitter data)
* Converts the RDD to a DataFrame
* Splits the data frame into maybe 20 different data sets
*

spark UI what does storage memory x/y mean

2016-07-11 Thread Andy Davidson
My stream app is running into problems; it seems to slow down over time. How can I interpret the storage memory column? I wonder if I have a GC problem? Any idea how I can get GC stats? Thanks Andy Executors (3) * Memory: 9.4 GB Used (1533.4 MB Total) * Disk: 0.0 B Used Executor ID | Address | RDD

can I use ExectorService in my driver? was: is dataframe.write() async? Streaming performance problem

2016-07-08 Thread Andy Davidson
> -----Original Message----- > From: Cody Koeninger [mailto:c...@koeninger.org] > Sent: 08 July 2016 15:31 > To: Andy Davidson <a...@santacruzintegration.com> > Cc: user @spark <user@spark.apache.org> > Subject: Re: is dataframe.write() async? Streaming performance prob

Re: Multiple aggregations over streaming dataframes

2016-07-07 Thread Andy Davidson
Kafka has an interesting model that might be applicable. You can think of kafka as enabling a queue system. Writers are called producers, and readers are called consumers. The server is called a broker. A "topic" is like a named queue. Producers are independent. They can write to a "topic" at

is dataframe.write() async? Streaming performance problem

2016-07-07 Thread Andy Davidson
I am running Spark 1.6.1 built for Hadoop 2.0.0-mr1-cdh4.2.0 and using the kafka direct stream approach. I am running into performance problems. My processing time is greater than my window size. Changing window sizes, adding cores and executor memory does not change performance. I am having a lot of

Re: strange behavior when I chain data frame transformations

2016-05-13 Thread Andy Davidson
… To: Andrew Davidson <a...@santacruzintegration.com> Cc: "user @spark" <user@spark.apache.org> Subject: Re: strange behavior when I chain data frame transformations > In the structure shown, tag is under element. > > I wonder if that was a factor. > > On F

strange behavior when I chain data frame transformations

2016-05-13 Thread Andy Davidson
I am using spark-1.6.1. I create a data frame from a very complicated JSON file. I would assume that the query planner would treat both versions of my transformation chains the same way. // org.apache.spark.sql.AnalysisException: Cannot resolve column name "tag" among (actor, body, generator, pip,

How to transform a JSON string into a Java HashMap<> java.io.NotSerializableException

2016-05-11 Thread Andy Davidson
I have a streaming app that receives very complicated JSON (twitter status). I would like to work with it as a hash map. It would be very difficult to define a POJO for this JSON. (I cannot use twitter4j.) // map json string to map JavaRDD> jsonMapRDD =

Re: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-03 Thread Andy Davidson
Hi Tobias, I am very interested in implementing a REST-based API on top of spark. My REST-based system would make predictions from data provided in the request using models trained in batch. My SLA is 250 ms. Would you mind sharing how you implemented your REST server? I am using spark-1.6.1. I have

java.io.NotSerializableException: org.apache.spark.sql.types.LongType

2016-04-21 Thread Andy Davidson
I started using http://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html#fp-growth in python. It was really easy to get the frequent item sets. Unfortunately association rules are not implemented in python. Here is my python code; it works great: rawJsonRDD = jsonToPythonDictionaries(sc,
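
For reference, the Python frequent-itemset part that does work in 1.6, as a minimal sketch (the transactions are toy data; association-rule extraction, as the message says, is not exposed in the Python API of that release):

    from pyspark.mllib.fpm import FPGrowth

    transactions = sc.parallelize([["a", "b"], ["a", "c"], ["a", "b", "c"]])
    model = FPGrowth.train(transactions, minSupport=0.5, numPartitions=2)
    for itemset in model.freqItemsets().collect():
        print(itemset.items, itemset.freq)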

custom transformer pipeline sample code

2016-04-20 Thread Andy Davidson
Someone recently asked me for a code example of how to write a custom pipeline transformer in Java. Enjoy, Share Andy https://github.com/AEDWIP/Spark-Naive-Bayes-text-classification/blob/260a6b9b67d7da42c1d0f767417627da342c8a49/src/test/java/com/santacruzintegration/spa

Re: Spark replacing Hadoop

2016-04-14 Thread Andy Davidson
Hi Ashok, In general if I were starting a new project and had not invested heavily in hadoop (i.e. had a large staff that was trained on hadoop, had a lot of existing projects implemented on hadoop, …) I would probably start using spark. It's faster and easier to use. Your mileage may vary Andy

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-06 Thread Andy Davidson
+1 From: Matei Zaharia Date: Tuesday, April 5, 2016 at 4:58 PM To: Xiangrui Meng Cc: Shivaram Venkataraman , Sean Owen , Xiangrui Meng , dev , "user @spark"

Re: Saving Spark streaming RDD with saveAsTextFiles ends up creating empty files on HDFS

2016-04-05 Thread Andy Davidson
t; > > > > http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> > > > > On 5 April 2016 at 23:59, Andy Davidson <a...@santacruzintegration.com> wrote: >> In my experience my streaming I was getting tens of thousands of emp

Re: Saving Spark streaming RDD with saveAsTextFiles ends up creating empty files on HDFS

2016-04-05 Thread Andy Davidson
t; > > > http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> > > > > On 5 April 2016 at 23:35, Andy Davidson <a...@santacruzintegration.com> wrote: >> Hi Mich >> >> Yup I was surprised to find empty files. Its easy to wor

Re: Saving Spark streaming RDD with saveAsTextFiles ends up creating empty files on HDFS

2016-04-05 Thread Andy Davidson
Hi Mich, Yup, I was surprised to find empty files. It's easy to work around. Note I should probably use coalesce() and not repartition(). In general I found I almost always need to repartition. I was getting thousands of empty partitions. It was really slowing my system down. private static void
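
A minimal sketch of the workaround described (the partition count and output path are arbitrary examples): shrink the number of partitions before saving so each output file has data.

    # coalesce() avoids a full shuffle when only reducing the partition count,
    # so empty partitions do not become empty output files
    rdd.coalesce(10).saveAsTextFile("hdfs:///output/tweets")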

Re: Can spark somehow help with this usecase?

2016-04-05 Thread Andy Davidson
Hi Marco, You might consider setting up some sort of ELT pipeline. One of your stages might be to create a file of all the FTP URLs. You could then write a spark app that just fetches the URLs and stores the data in some sort of database or on the file system (hdfs?). My guess would be to maybe

Re: pyspark unable to convert dataframe column to a vector: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2016-04-04 Thread Andy Davidson
…:617) ... 1 more From: Jeff Zhang <zjf...@gmail.com> Date: Tuesday, March 29, 2016 at 10:34 PM To: Andrew Davidson <a...@santacruzintegration.com> Cc: "user @spark" <user@spark.apache.org> Subject: Re: pyspark unable to convert dataframe column to a vector: Unabl

data frame problem preserving sort order with repartition() and coalesce()

2016-03-29 Thread Andy Davidson
I have a requirement to write my results out into a series of CSV files. No file may have more than 100 rows of data. In the past my data was not sorted, and I was able to use repartition() or coalesce() to meet the file length requirement. I realize that repartition() causes the data to be

Vectors.sparse exception: TypeError: indices array must be sorted

2016-03-29 Thread Andy Davidson
I am using pyspark 1.6.1 and python3. Any idea what my bug is? Clearly the indices are being sorted? Could it be that numDimensions = 713912692155621377 and my indices are longs, not ints? import numpy as np from pyspark.mllib.linalg import Vectors from pyspark.mllib.linalg import VectorUDT #sv1
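
For comparison, a sketch of a Vectors.sparse() call that satisfies the constraint named in the error: the size is a small int and the indices are ints in strictly increasing order (the values are toy data).

    from pyspark.mllib.linalg import Vectors

    # size=5, indices sorted ascending, one value per index
    sv = Vectors.sparse(5, [0, 2, 4], [1.0, 3.0, 5.0])
    print(sv)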

Re: looking for an easy to to find the max value of a column in a data frame

2016-03-29 Thread Andy Davidson
>> 713912692155621376 >>> >>> From: Alexander Krasnukhin <the.malk...@gmail.com> >>> Date: Monday, March 28, 2016 at 5:55 PM >>> To: Andrew Davidson <a...@santacruzintegration.com> >>> Cc: "user @spark" <user@spa

Re: looking for an easy to to find the max value of a column in a data frame

2016-03-29 Thread Andy Davidson
ct(max(col("foo"))).show() > > On Tue, Mar 29, 2016 at 2:15 AM, Andy Davidson <a...@santacruzintegration.com> > wrote: >> I am using pyspark 1.6.1 and python3. >> >> >> Given: >> >> idDF2 = idDF.select(idDF.id, idDF.col.id <http

Re: Sending events to Kafka from spark job

2016-03-29 Thread Andy Davidson
Hi Fanoos, I would be careful about using collect(). You need to make sure your local computer has enough memory to hold your entire data set. Eventually I will need to do something similar. I have not written the code yet. My plan is to load the data into a data frame and then write a UDF that
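
A commonly suggested alternative to collect() is to push rows from the executors with foreachPartition(), so no single machine has to hold the whole data set. A hedged sketch, assuming the kafka-python client and hypothetical broker/topic names:

    def send_partition(rows):
        # one producer per partition, created on the executor
        from kafka import KafkaProducer
        producer = KafkaProducer(bootstrap_servers="broker1:9092")
        for row in rows:
            producer.send("events", str(row).encode("utf-8"))
        producer.flush()

    df.rdd.foreachPartition(send_partition)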

pyspark unable to convert dataframe column to a vector: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2016-03-28 Thread Andy Davidson
I am using pyspark spark-1.6.1-bin-hadoop2.6 and python3. I have a data frame with a column I need to convert to a sparse vector. I get an exception. Any idea what my bug is? Kind regards Andy Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. :

looking for an easy to to find the max value of a column in a data frame

2016-03-28 Thread Andy Davidson
I am using pyspark 1.6.1 and python3. Given:
idDF2 = idDF.select(idDF.id, idDF.col.id)
idDF2.printSchema()
idDF2.show()

root
 |-- id: string (nullable = true)
 |-- col[id]: long (nullable = true)

+----------+---------+
|        id|  col[id]|
+----------+---------+
|1008930924|534494917|
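
The reply above boils down to aggregating with max() and collecting the single-row result. A minimal sketch (renaming the bracketed column first is my addition, to make it easy to reference):

    from pyspark.sql import functions as F

    # rename the bracketed column so it is easy to reference
    df = idDF2.withColumnRenamed("col[id]", "cid")
    row = df.select(F.max("cid").alias("max_cid")).collect()[0]
    print(row["max_cid"])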

Re: --packages configuration equivalent item name?

2016-03-28 Thread Andy Davidson
Hi Russell, I use Jupyter python notebooks a lot. Here is how I start the server:
set -x   # turn debugging on
#set +x  # turn debugging off
# https://github.com/databricks/spark-csv
# http://spark-packages.org/package/datastax/spark-cassandra-connector

Re: pyspark sql convert long to timestamp?

2016-03-22 Thread Andy Davidson
<user@spark.apache.org> Subject: Re: pyspark sql convert long to timestamp? > Have a look at the from_unixtime() functions. > https://spark.apache.org/docs/1.5.0/api/python/_modules/pyspark/sql/functions.html#from_unixtime > > Thanks > Best Regards >

pyspark sql convert long to timestamp?

2016-03-21 Thread Andy Davidson
I have a col in a data frame that is of type long. Any idea how I create a column whose type is timestamp? The long is a unix epoch in ms. Thanks Andy
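
The reply above points at from_unixtime(); since the column here is in milliseconds, divide by 1000 first. A minimal sketch (column names are hypothetical):

    from pyspark.sql import functions as F

    # ms holds epoch milliseconds; casting seconds-since-epoch to timestamp
    # is supported by Spark SQL
    df2 = df.withColumn("ts", (F.col("ms") / 1000).cast("timestamp"))
    df2.printSchema()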

bug spark should not use java.sql.timestamp was: sql timestamp timezone bug

2016-03-20 Thread Andy Davidson
Here is a nice analysis of the issue from the Cassandra mailing list. (Datastax is the Databricks for Cassandra.) Should I file a bug? Kind regards Andy http://stackoverflow.com/questions/2305973/java-util-date-vs-java-sql-date and this one On Fri, Mar 18, 2016 at 11:35 AM Russell Spitzer

Re: sql timestamp timezone bug

2016-03-19 Thread Andy Davidson
Hi Davies > > What's the type of `created`? TimestampType? The 'created' column in cassandra is a timestamp https://docs.datastax.com/en/cql/3.0/cql/cql_reference/timestamp_type_r.html In the spark data frame it is a timestamp > > If yes, when created is compared to a string, it will be

Re: sql timestamp timezone bug

2016-03-19 Thread Andy Davidson
For completeness: clearly spark sql returned a different data set. In [4]: rawDF.selectExpr("count(row_key) as num_samples", "sum(count) as total", "max(count) as max ").show()
+-----------+-----+---+
|num_samples|total|max|

sql timestamp timezone bug

2016-03-19 Thread Andy Davidson
I am using pyspark 1.6.0 and datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series data. The data is originally captured by a spark streaming app and written to Cassandra. The value of the timestamp comes from Rdd.foreachRDD(new VoidFunction2()

best practices: running multi user jupyter notebook server

2016-03-19 Thread Andy Davidson
We are considering deploying a notebook server for use by two kinds of users:
1. Interactive dashboards
   1. I.e. forms allow users to select data sets and visualizations
   2. Review real-time graphs of data captured by our spark streams
2. General notebooks for Data Scientists
My concern is

unix_timestamp() time zone problem

2016-03-19 Thread Andy Davidson
I am using python spark 1.6 and the --packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 I need to convert a time stamp string into a unix epoch time stamp. The unix_timestamp() function assumes the current time zone. However my string data is UTC and encodes the time zone as zero.

what is the pyspark inverse of registerTempTable()?

2016-03-15 Thread Andy Davidson
Thanks Andy
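
For the record, the 1.x API has a direct inverse; a minimal sketch (the table name is hypothetical):

    df.registerTempTable("people")
    # ... run SQL against it ...
    sqlContext.dropTempTable("people")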

Re: newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
ls files. > > Frank Austin Nothaft > fnoth...@berkeley.edu > fnoth...@eecs.berkeley.edu > 202-340-0466 > >> On Mar 15, 2016, at 11:45 AM, Andy Davidson <a...@santacruzintegration.com> >> wrote: >> >> We use the spark-ec2 script to create AWS clusters as needed (we

newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
We use the spark-ec2 script to create AWS clusters as needed (we do not use AWS EMR) 1. will we get better performance if we copy data to HDFS before we run instead of reading directly from S3? 2. What is a good way to move results from HDFS to S3? It seems like there are many ways to bulk copy

Re: trouble with NUMPY constructor in UDF

2016-03-10 Thread Andy Davidson
… Date: Wednesday, March 9, 2016 at 3:28 PM > To: Andrew Davidson <a...@santacruzintegration.com> > Cc: "user @spark" <user@spark.apache.org> > Subject: Re: trouble with NUMPY constructor in UDF > >> bq. epoch2numUDF = udf(foo, FloatType()) >> >

Re: trouble with NUMPY constructor in UDF

2016-03-10 Thread Andy Davidson
"user @spark" <user@spark.apache.org> Subject: Re: trouble with NUMPY constructor in UDF > bq. epoch2numUDF = udf(foo, FloatType()) > > Is it possible that return value from foo is not FloatType ? > > On Wed, Mar 9, 2016 at 3:09 PM, Andy Davidson <a...@santacruz

Re: Spark Streaming, very slow processing and increasing scheduling delay of kafka input stream

2016-03-10 Thread Andy Davidson
In my experience I would try the following. I use the standalone cluster manager. Each app gets its own performance web page. The streaming tab is really helpful. If processing time is greater than your mini batch length you are going to run into performance problems. Use the "stages" tab to

Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-09 Thread Andy Davidson
…set settings separately > spark.cassandra.connection.host > spark.cassandra.connection.port > https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-connection-parameters > > Looking at the logs, it seems your port config is not being set and it'

Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Andy Davidson
…connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042 > Have you contacted the spark-cassandra-connector mailing list? > > I wonder where the port 9042 came from. > > Cheers > > On Tue, Mar 8, 2016 at 6:02 PM, Andy Davidson

pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Andy Davidson
I am using spark-1.6.0-bin-hadoop2.6. I am trying to write a python notebook that reads a data frame from Cassandra. I connect to Cassandra using an ssh tunnel running on port 9043. CQLSH works; however I cannot figure out how to configure my notebook. I have tried various hacks. Any idea what I

streaming will I loose data if spark.streaming.backpressure.enabled=true

2016-03-07 Thread Andy Davidson
http://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications gives a brief discussion about max rate and back pressure. It's not clear to me what will happen. I use an unreliable receiver. Let's say my app is running and process time is less than window length. Happy
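
For reference, enabling the settings the guide above discusses is a one-line conf change; a minimal sketch (the maxRate value is an arbitrary example):

    from pyspark import SparkConf

    conf = (SparkConf()
            .set("spark.streaming.backpressure.enabled", "true")
            # optional ceiling per receiver, in records/sec
            .set("spark.streaming.receiver.maxRate", "10000"))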

how to implement and deploy robust streaming apps

2016-03-07 Thread Andy Davidson
One of the challenges we need to prepare for with streaming apps is bursty data. Typically we need to estimate our worst case data load and make sure we have enough capacity. It is not obvious what the best practices are with spark streaming. * we have implemented checkpointing as described in the
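
A minimal checkpointing sketch in the shape the programming guide describes (the checkpoint path and batch interval are hypothetical):

    from pyspark.streaming import StreamingContext

    checkpoint_dir = "hdfs:///checkpoints/myApp"

    def create_context():
        ssc = StreamingContext(sc, batchDuration=5)
        ssc.checkpoint(checkpoint_dir)
        # ... build the DStream graph here ...
        return ssc

    # recover from the checkpoint after a driver restart, or build fresh
    ssc = StreamingContext.getOrCreate(checkpoint_dir, create_context)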

streaming app performance when would increasing execution size or adding more cores

2016-03-07 Thread Andy Davidson
We just deployed our first streaming apps. The next step is making sure they run reliably. We have spent a lot of time reading the various prog guides and looking at the standalone cluster manager app performance web pages. Looking at the streaming tab and the stages tab has been the most

Re: newbie unable to write to S3 403 forbidden error

2016-02-24 Thread Andy Davidson
would access my > bucket this way s3n://bucketname/foldername. > > You can test privileges using the s3 cmd line client. > > Also, if you are using instance profiles you don't need to specify access and > secret keys. No harm in specifying though. >

streaming spark is writing results to S3 a good idea?

2016-02-23 Thread Andy Davidson
Currently our stream apps write results to hdfs. We are running into problems with HDFS becoming corrupted and running out of space. It seems like a better solution might be to write directly to S3. Is this a good idea? We plan to continue to write our checkpoints to hdfs. Are there any issues to

spark-1.6.0-bin-hadoop2.6/ec2/spark-ec2 uses old version of hadoop

2016-02-23 Thread Andy Davidson
I do not have any hadoop legacy code. My goal is to run spark on top of HDFS. Recently I have been having HDFS corruption problems. I was also never able to access S3 even though I used --copy-aws-credentials. I noticed that by default the spark-ec2 script uses hadoop 1.0.4. I ran help and

Re: GroupedDataset needs a mapValues

2016-02-14 Thread Andy Davidson
Hi Michael From: Michael Armbrust Date: Saturday, February 13, 2016 at 9:31 PM To: Koert Kuipers Cc: "user @spark" Subject: Re: GroupedDataset needs a mapValues > Instead of grouping with a lambda function, you can do it

Re: best practices? spark streaming writing output detecting disk full error

2016-02-12 Thread Andy Davidson
> or applications errors, but I think it is general dev ops work not > very specific to spark or hadoop. > > BR, > > Arkadiusz Bicz > https://www.linkedin.com/in/arkadiuszbicz > > On Thu, Feb 11, 2016 at 7:09 PM, Andy Davidson > <a...@santacruzintegration.com> wrot

Re: newbie unable to write to S3 403 forbidden error

2016-02-12 Thread Andy Davidson
"s3n://s3-us-west-1.amazonaws.com/com.pws.twitter/ <http://s3-us-west-1.amazonaws.com/com.pws.twitter/> json" > > not sure, but > can you try to remove s3-us-west-1.amazonaws.com > <http://s3-us-west-1.amazonaws.com/com.pws.twitter/> from path ? > > On 11 February

Re: Question on Spark architecture and DAG

2016-02-12 Thread Andy Davidson
From: Mich Talebzadeh Date: Thursday, February 11, 2016 at 2:30 PM To: "user @spark" Subject: Question on Spark architecture and DAG > Hi, > > I have used Hive on Spark engine and of course Hive tables and it's pretty > impressive compared to Hive

org.apache.spark.sql.AnalysisException: undefined function lit;

2016-02-12 Thread Andy Davidson
I am trying to add a column with a constant value to my data frame. Any idea what I am doing wrong? Kind regards Andy
DataFrame result = …
String exprStr = "lit(" + time.milliseconds() + ") as ms";
logger.warn("AEDWIP expr: {}", exprStr);
result.selectExpr("*", exprStr).show(false);
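
Two hedged workarounds, sketched in Python: lit() is a DataFrame function rather than a registered SQL function in that release, so either call it through withColumn(), or write the constant directly in the SQL expression.

    from pyspark.sql import functions as F

    # option 1: use the DataFrame API instead of selectExpr
    result2 = result.withColumn("ms", F.lit(1455300000000))

    # option 2: a bare literal works inside selectExpr without lit()
    result3 = result.selectExpr("*", "1455300000000 as ms")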

best practices? spark streaming writing output detecting disk full error

2016-02-11 Thread Andy Davidson
We recently started a Spark/Spark Streaming POC. We wrote a simple streaming app in java to collect tweets. We chose twitter because we knew we would get a lot of data and probably lots of bursts. Good for stress testing. We spun up a couple of small clusters using the spark-ec2 script. In one cluster

newbie unable to write to S3 403 forbidden error

2016-02-11 Thread Andy Davidson
I am using spark 1.6.0 in a cluster created using the spark-ec2 script. I am using the standalone cluster manager. My java streaming app is not able to write to s3. It appears to be some form of permission problem. Any idea what the problem might be? I tried using the IAM simulator to test the
