What is the minimum value allowed for StreamingContext's Seconds parameter?

2016-05-23 Thread YaoPau
Just wondering how small the microbatches can be, and any best practices on the smallest value that should be used in production. For example, any issue with running it at 0.01 seconds?
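A minimal PySpark sketch of setting a sub-second batch interval (the app name and interval are arbitrary examples; whether such small batches keep up depends on per-batch scheduling overhead):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="TinyBatches")
    # The batch interval is given in seconds; fractional values such as 0.01
    # are accepted, but per-batch scheduling overhead usually makes intervals
    # much below roughly half a second hard to sustain in practice.
    ssc = StreamingContext(sc, 0.01)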

A number of issues when running spark-ec2

2016-04-16 Thread YaoPau
I launched a cluster with: "./spark-ec2 --key-pair my_pem --identity-file ../../ssh/my_pem.pem launch jg_spark2" and I got the "Spark standalone cluster started at http://ec2-54-88-249-255.compute-1.amazonaws.com:8080" and "Done!" success confirmations at the end. I confirmed on EC2 that 1 Master

Why do I need to handle dependencies on EMR but not on-prem Hadoop?

2016-04-08 Thread YaoPau
On-prem I'm running PySpark on Cloudera's distribution, and I've never had to worry about dependency issues. I import my libraries on my driver node only using pip or conda, run my jobs in yarn-client mode, and everything works (I just assumed the relevant libraries are copied temporarily to each
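A hedged sketch of one way to ship Python dependencies explicitly when they are not pre-installed on every node (the archive name deps.zip is a hypothetical bundle of the needed packages):

    from pyspark import SparkContext

    sc = SparkContext(appName="ShipDeps")
    # addPyFile distributes the archive to every executor and puts it on the
    # Python path there, so imports inside map/filter functions can resolve
    # even when the library is only installed on the driver node.
    sc.addPyFile("deps.zip")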

Spark Streaming, PySpark 1.3, randomly losing connection

2015-12-18 Thread YaoPau
Hi - I'm running Spark Streaming using PySpark 1.3 in yarn-client mode on CDH 5.4.4. The job sometimes runs a full 24hrs, but more often it fails sometime during the day. I'm getting several vague errors that I don't see much about when searching online: - py4j.Py4JException: Error while

Python 3.x support

2015-12-17 Thread YaoPau
I found the jira for Python 3 support here, but it looks like support for 3.4 was still unresolved. Which Python 3 versions are supported by Spark 1.5?

sortByKey not spilling to disk? (PySpark 1.3)

2015-12-09 Thread YaoPau
I'm running sortByKey on a dataset that's nearly the size of the memory I've provided to the executors (I'd like to keep the amount of used memory low so other jobs can run), and I'm getting the vague "filesystem closed" error. When I re-run with more memory it runs fine. By default shouldn't
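A hedged sketch of the Spark 1.3-era knobs that govern shuffle spilling (the values shown are just the defaults, for illustration; whether they explain the "filesystem closed" error above is a separate question):

    from pyspark import SparkConf, SparkContext

    # spark.shuffle.spill (default true) lets shuffles such as sortByKey spill
    # to disk once spark.shuffle.memoryFraction of executor memory is used;
    # later Spark versions replaced these with unified memory management.
    conf = (SparkConf()
            .set("spark.shuffle.spill", "true")
            .set("spark.shuffle.memoryFraction", "0.2"))
    sc = SparkContext(conf=conf)
    print(sc.parallelize([(3, "c"), (1, "a"), (2, "b")]).sortByKey().collect())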

Spark SQL 1.3 not finding attribute in DF

2015-12-06 Thread YaoPau
When I run df.printSchema() I get: root |-- durable_key: string (nullable = true) |-- code: string (nullable = true) |-- desc: string (nullable = true) |-- city: string (nullable = true) |-- state_code: string (nullable = true) |-- zip_code: string (nullable = true) |-- county: string

Re: Spark SQL 1.3 not finding attribute in DF

2015-12-06 Thread YaoPau
If anyone runs into the same issue, I found a workaround: >>> df.where('state_code = "NY"') works for me. >>> df.where(df.state_code == "NY").collect() fails with the error from the first post.

Re: sparkavro for PySpark 1.3

2015-12-05 Thread YaoPau
Here's what I'm currently trying: -- I'm including --packages com.databricks:spark-avro_2.10:1.0.0 in my pyspark call. This seems to work: Ivy Default Cache set to: /home/jrgregg/.ivy2/cache The jars for the packages stored in: /home/jrgregg/.ivy2/jars ::

sparkavro for PySpark 1.3

2015-12-03 Thread YaoPau
How can I read from and write to Avro using PySpark in 1.3? I can only find the 1.4 documentation, which uses a sqlContext.read method that isn't available to me in 1.3.
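A hedged sketch of the 1.3-era load/save API (before sqlContext.read/write existed), assuming sc already exists as in the pyspark shell, that spark-avro is on the classpath via --packages com.databricks:spark-avro_2.10:1.0.0, and that the paths are placeholders:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    # load()/save() take a data source name instead of the reader/writer
    # builders introduced in 1.4.
    df = sqlContext.load("/path/to/input.avro", source="com.databricks.spark.avro")
    df.save("/path/to/output.avro", source="com.databricks.spark.avro")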

Getting different DESCRIBE results between SparkSQL and Hive

2015-11-23 Thread YaoPau
Example below. The partition columns show up as regular columns. I'll note that SHOW PARTITIONS works correctly in Spark SQL, so it's aware of the partitions but it does not show them in DESCRIBE. In Hive: "DESCRIBE pub.inventory_daily"

Spark SQL: filter if column substring does not contain a string

2015-11-14 Thread YaoPau
I'm using pyspark 1.3.0, and struggling with what should be simple. Basically, I'd like to run this: site_logs.filter(lambda r: 'page_row' in r.request[:20]) meaning that I want to keep rows that have 'page_row' in the first 20 characters of the request column. The following is the closest
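A hedged sketch using column expressions available in 1.3, approximating the substring test with substr() plus a LIKE pattern (the table name is hypothetical and sc is assumed to exist):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    site_logs = sqlContext.table("site_logs")
    # Keep rows whose first 20 characters of `request` contain 'page_row';
    # wrap the condition in ~ to drop them instead.
    kept = site_logs.filter(site_logs.request.substr(1, 20).like("%page_row%"))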

sqlCtx.sql('some_hive_table') works in pyspark but not spark-submit

2015-11-07 Thread YaoPau
Within a pyspark shell, both of these work for me: print hc.sql("SELECT * from raw.location_tbl LIMIT 10").collect() print sqlCtx.sql("SELECT * from raw.location_tbl LIMIT 10").collect() But when I submit both of those in batch mode (hc and sqlCtx both exist), I get the following error. Why is
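A hedged sketch of the batch-mode setup this usually requires: building a HiveContext in the script itself and making sure hive-site.xml is visible to the driver (e.g., submitted alongside the job), since without the metastore config the same SQL that works in the shell can fail under spark-submit:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="HiveQueryBatch")
    hc = HiveContext(sc)
    print(hc.sql("SELECT * FROM raw.location_tbl LIMIT 10").collect())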

Vague Spark SQL error message with saveAsParquetFile

2015-11-03 Thread YaoPau
I'm using Spark SQL to query one partition at a time of a Hive external table that sits atop .gzip data, and then I'm saving that partition to a new HDFS location as a set of snappy-compressed Parquet files using .saveAsParquetFile(). The query completes successfully, but then I get a vague error message I
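A hedged sketch of the flow described above, with hypothetical table and path names; hc is assumed to be an existing HiveContext, saveAsParquetFile is the 1.3-era writer, and the codec setting requests snappy output:

    hc.setConf("spark.sql.parquet.compression.codec", "snappy")
    part = hc.sql("SELECT * FROM ext_db.gz_table WHERE dt = '2015-11-01'")
    part.saveAsParquetFile("hdfs:///warehouse/parquet_copy/dt=2015-11-01")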

Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-19 Thread YaoPau
I've connected Spark SQL to the Hive Metastore and currently I'm running SQL code via pyspark. Typically everything works fine, but sometimes after a long-running Spark SQL job I get the error below, and from then on I can no longer run Spark SQL commands. I still do have both my sc and my

Any plans to support Spark Streaming within an interactive shell?

2015-10-13 Thread YaoPau
I'm seeing products that allow you to interact with a stream in realtime (write code, and see the streaming output automatically change), which I think makes it easier to test streaming code, although running it on batch then turning streaming on certainly is a good way as well. I played around

Does Spark use more memory than MapReduce?

2015-10-12 Thread YaoPau
I had this question come up and I'm not sure how to answer it. A user said that, for a big job, he thought it would be better to use MapReduce since it writes to disk between iterations instead of keeping the data in memory the entire time like Spark generally does. I mentioned that Spark can
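A minimal sketch of the point being made, that Spark does not have to hold everything in memory: MEMORY_AND_DISK spills partitions that don't fit (the input path is a placeholder and sc is assumed to exist):

    from pyspark import StorageLevel

    logs = sc.textFile("hdfs:///big/input")
    # Partitions that don't fit in executor memory are written to local disk
    # rather than being recomputed on each use.
    logs.persist(StorageLevel.MEMORY_AND_DISK)
    print(logs.count())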

Spark SQL "SELECT ... LIMIT" scans the entire Hive table?

2015-10-05 Thread YaoPau
I'm using SqlCtx connected to Hive in CDH 5.4.4. When I run "SELECT * FROM my_db.my_tbl LIMIT 5", it scans the entire table like Hive would instead of doing a .take(5) on it and returning results immediately. Is there a way to get Spark SQL to use .take(5) instead of the Hive logic of scanning

Spark SQL with Hive error: "Conf non-local session path expected to be non-null;"

2015-10-04 Thread YaoPau
I've been experimenting with using PySpark SQL to query Hive tables for the last week and all has been smooth, but on a command I've run hundreds of times successfully (a basic SELECT * ...), suddenly this error started popping up every time I ran a sqlCtx command until I restarted my session.

Pyspark: "Error: No main class set in JAR; please specify one with --class"

2015-10-01 Thread YaoPau
I'm trying to add multiple SerDe jars to my pyspark session. I got the first one working by changing my PYSPARK_SUBMIT_ARGS to: "--master yarn-client --executor-cores 5 --num-executors 5 --driver-memory 3g --executor-memory 5g --jars /home/me/jars/csv-serde-1.1.2.jar" But when I tried to add a

Spark SQL deprecating Hive? How will I access Hive metadata in the future?

2015-09-29 Thread YaoPau
I've heard that Spark SQL will deprecate, or has already started deprecating, HQL. We have Spark SQL + Python jobs that currently read from the Hive metastore to get things like table location and partition values. Will we have to re-code these functions in future releases of Spark (maybe by connecting

Run Spark job from within iPython+Spark?

2015-08-24 Thread YaoPau
I set up iPython Notebook to work with the pyspark shell, and now I'd like to use %run to basically 'spark-submit' another Python Spark file, and leave the objects accessible within the Notebook. I tried this, but got a "ValueError: Cannot run multiple SparkContexts at once" error. I then tried

Re: collect() works, take() returns ImportError: No module named iter

2015-08-13 Thread YaoPau
In case anyone runs into this issue in the future, we got it working: the following variable must be set on the edge node: export PYSPARK_PYTHON=/your/path/to/whatever/python/you/want/to/run/bin/python I didn't realize that variable gets passed to every worker node. All I saw when searching for

Spark 1.3 + Parquet: Skipping data using statistics

2015-08-12 Thread YaoPau
I've seen this feature referenced in a couple of places: first in this forum post https://forums.databricks.com/questions/951/why-should-i-use-parquet.html and then in this talk by Michael Armbrust https://www.youtube.com/watch?v=6axUqHCu__Y around the 42nd minute. As I understand it, if you create a
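A hedged sketch of turning the feature on in the 1.3/1.4 line, where Parquet filter pushdown was off by default (paths and column names are hypothetical, and sqlContext is assumed to exist):

    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
    events = sqlContext.parquetFile("/data/events_parquet")   # 1.3-era reader
    # With pushdown enabled, row-group statistics in the Parquet footers let
    # Spark skip chunks whose min/max range cannot satisfy the predicate.
    events.filter(events.event_date == "2015-08-01").count()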

collect() works, take() returns ImportError: No module named iter

2015-08-10 Thread YaoPau
I'm running Spark 1.3 on CDH 5.4.4, and trying to set up Spark to run via iPython Notebook. I'm getting collect() to work just fine, but take() errors. (I'm having issues with collect() on other datasets ... but take() seems to break every time I run it.) My code is below. Any thoughts? sc

Error when running pyspark/shell.py to set up iPython notebook

2015-08-09 Thread YaoPau
I'm trying to set up iPython notebook on an edge node with port forwarding so I can run pyspark off my laptop's browser. I've mostly been following the Cloudera guide here: http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/ I got this working on one cluster

What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread YaoPau
I've heard "Spark is not just MapReduce" mentioned during Spark talks, but it seems like every method that Spark has is really doing something like (Map -> Reduce) or (Map -> Map -> Map -> Reduce) etc behind the scenes, with the performance benefit of keeping RDDs in memory between stages. Am I wrong

What is needed to integrate Spark with Pandas and scikit-learn?

2015-06-19 Thread YaoPau
I'm running Spark on YARN, will be upgrading to 1.3 soon. For the integration, will I need to install Pandas and scikit-learn on every node in my cluster, or is the integration just something that takes place on the edge node after a collect in yarn-client mode?
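A hedged sketch of the collect-then-convert path, assuming df is an existing Spark DataFrame: for this path pandas only needs to be importable where the driver runs, whereas anything executed inside executor-side functions (UDFs, mapPartitions) would need the library on every node.

    # toPandas() collects the rows to the driver and builds a local
    # pandas.DataFrame, so keep the input small (limit or sample first).
    pdf = df.limit(1000).toPandas()
    print(pdf.describe())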

Are there ways to restrict what parameters users can set for a Spark job?

2015-06-12 Thread YaoPau
For example, Hive lets you set a whole bunch of parameters (# of reducers, # of mappers, size of reducers, cache size, max memory to use for a join), while Impala gives users a much smaller subset of parameters to work with, which makes it nice to give to a BI team. Is there a way to restrict

How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-05 Thread YaoPau
I have a file badFullIPs.csv of bad IP addresses used for filtering. In yarn-client mode, I simply read it off the edge node, transform it, and then broadcast it: val badIPs = fromFile(edgeDir + "badfullIPs.csv") val badIPsLines = badIPs.getLines val badIpSet = badIPsLines.toSet

Read from file and broadcast before every Spark Streaming bucket?

2015-01-29 Thread YaoPau
I'm creating a real-time visualization of counts of ads shown on my website, using data pushed through by Spark Streaming. To avoid clutter, it only looks good to show 4 or 5 lines on my visualization at once (corresponding to 4 or 5 different ads), but there are 50+ different ads that show

reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread YaoPau
The TwitterPopularTags example works great: the Twitter firehose keeps messages pretty well in order by timestamp, and so to get the most popular hashtags over the last 60 seconds, reduceByKeyAndWindow works well. My stream pulls Apache weblogs from Kafka, and so it's not as simple: messages can

Connect to Hive metastore (on YARN) from Spark Shell?

2015-01-21 Thread YaoPau
Is this possible, and if so what steps do I need to take to make this happen? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Connect-to-Hive-metastore-on-YARN-from-Spark-Shell-tp21300.html

How to set UI port #?

2015-01-11 Thread YaoPau
I have multiple Spark Streaming jobs running all day, and so when I run my hourly batch job, I always get a "java.net.BindException: Address already in use"; the UI port search starts at 4040, then goes to 4041, 4042, 4043 before settling at 4044. That slows down my hourly job, and time is critical. Is there a
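A hedged sketch: pinning the batch job's UI port with spark.ui.port (4050 is an arbitrary example of a known-free port) avoids walking up from 4040 through ports already held by the streaming jobs; raising spark.port.maxRetries is the other common knob.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("HourlyBatch")
            .set("spark.ui.port", "4050"))   # bind the UI directly to a free port
    sc = SparkContext(conf=conf)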

Does Spark automatically run different stages concurrently when possible?

2015-01-10 Thread YaoPau
I'm looking for ways to reduce the runtime of my Spark job. My code is a single file of scala code and is written in this order: (1) val lines = Import full dataset using sc.textFile (2) val ABonly = Parse out all rows that are not of type A or B (3) val processA = Process only the A rows from

How to convert RDD to JSON?

2014-12-08 Thread YaoPau
Pretty straightforward: Using Scala, I have an RDD that represents a table with four columns. What is the recommended way to convert the entire RDD to one JSON object? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-convert-RDD-to-JSON-tp20585.html

Appending with saveAsTextFile?

2014-11-29 Thread YaoPau
I am using Spark to aggregate logs that land in HDFS throughout the day. The job kicks off 15min after the hour and processes anything that landed the previous hour. For example, the 2:15pm job will process anything that came in from 1:00pm-2:00pm. 99.9% of that data will consist of logs
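A hedged sketch of the usual workaround, since saveAsTextFile refuses to write into an existing directory: give each hourly run its own output path and treat the parent directory as the dataset (hourly_logs and the base path are hypothetical):

    from datetime import datetime, timedelta

    # The 2:15pm run writes the 1:00pm-2:00pm hour under its own directory.
    hour = (datetime.now() - timedelta(hours=1)).strftime("%Y%m%d_%H")
    hourly_logs.saveAsTextFile("hdfs:///logs/aggregated/%s" % hour)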

Missing parents for stage (Spark Streaming)

2014-11-21 Thread YaoPau
When I submit a Spark Streaming job, I see these INFO logs printing frequently: 14/11/21 18:53:17 INFO DAGScheduler: waiting: Set(Stage 216) 14/11/21 18:53:17 INFO DAGScheduler: failed: Set() 14/11/21 18:53:17 INFO DAGScheduler: Missing parents for Stage 216: List() 14/11/21 18:53:17 INFO

Cannot access data after a join (error: value _1 is not a member of Product with Serializable)

2014-11-19 Thread YaoPau
I joined two datasets together, and my resulting logs look like this: (975894369,((72364,20141112T170627,web,MEMPHIS,AR,US,Central),(Male,John,Smith))) (253142991,((30058,20141112T171246,web,ATLANTA16,GA,US,Southeast),(Male,Bob,Jones)))

Joining DStream with static file

2014-11-19 Thread YaoPau
Here is my attempt: val sparkConf = new SparkConf().setAppName("LogCounter") val ssc = new StreamingContext(sparkConf, Seconds(2)) val sc = new SparkContext() val geoData = sc.textFile("data/geoRegion.csv") .map(_.split(',')) .map(line => (line(0),

How to broadcast a textFile?

2014-11-17 Thread YaoPau
I have a 1 million row file that I'd like to read from my edge node, and then send a copy of it to each Hadoop machine's memory in order to run JOINs in my spark streaming code. I see examples in the docs of how to use broadcast() for a simple array, but how about when the data is in a textFile?

Re: How to broadcast a textFile?

2014-11-17 Thread YaoPau
OK then I'd still need to write the code (within my spark streaming code I'm guessing) to convert my text file into an object like a HashMap before broadcasting. How can I make sure only the HashMap is being broadcast while all the pre-processing to create the HashMap is only performed once?
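A hedged PySpark-flavoured sketch of that pattern: the parsing runs once on the driver (it is ordinary driver-side code, not inside a transformation), and only the finished dict is broadcast. The file path, some_rdd, and the lookup key are placeholders.

    geo_map = dict(
        sc.textFile("hdfs:///data/geoRegion.csv")
          .map(lambda line: line.split(","))
          .map(lambda fields: (fields[0], fields[1]))
          .collect()
    )
    geo_bc = sc.broadcast(geo_map)

    # Executors then do cheap in-memory lookups against the broadcast value.
    enriched = some_rdd.map(lambda ip: (ip, geo_bc.value.get(ip)))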

Given multiple .filter()'s, is there a way to set the order?

2014-11-14 Thread YaoPau
I have an RDD x of millions of STRINGs, each of which I want to pass through a set of filters. My filtering code looks like this: x.filter(filter#1, which will filter out 40% of data). filter(filter#2, which will filter out 20% of data). filter(filter#3, which will filter out 2% of data).
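A hedged sketch of the same pipeline with the predicates combined: chained narrow filters are applied per element in the order written, so putting the cheapest or most selective test first (or joining them with short-circuiting `and`) keeps later predicates from running on rows an earlier one would drop. The predicate names below are hypothetical.

    kept = x.filter(lambda s: drops_40_percent(s)
                    and drops_20_percent(s)
                    and drops_2_percent(s))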

Building a hash table from a csv file using yarn-cluster, and giving it to each executor

2014-11-13 Thread YaoPau
I built my Spark Streaming app on my local machine, and an initial step in log processing is filtering out rows with spam IPs. I use the following code which works locally: // Creates a HashSet for badIPs read in from file val badIpSource = scala.io.Source.fromFile("wrongIPlist.csv")

Converting Apache log string into map using delimiter

2014-11-11 Thread YaoPau
I have an RDD of logs that look like this:

Re: Converting Apache log string into map using delimiter

2014-11-11 Thread YaoPau
OK I got it working with: z.map(row => (row.map(element => element.split("=")(0)) zip row.map(element => element.split("=")(1))).toMap) But I'm guessing there is a more efficient way than creating two separate lists, zipping them together, and then converting the result into a map.
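A hedged PySpark-flavoured sketch of the single-pass version, assuming z is an RDD whose rows are lists of "key=value" strings as in the original: split each token once and build the map directly, instead of materialising two parallel lists and zipping them.

    # Assumes every token contains '='; split("=", 1) keeps any '=' in the value.
    parsed = z.map(lambda row: dict(element.split("=", 1) for element in row))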

SparkPi endlessly in yarnAppState: ACCEPTED

2014-11-07 Thread YaoPau
I'm using Cloudera 5.1.3, and I'm repeatedly getting the following output after submitting the SparkPi example in yarn-cluster mode (http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_running_spark_apps.html) using: spark-submit --class