Re: Apache Spark MLlib

2017-02-24 Thread Jon Gregg
ing-guide.html#broadcast-variables> that would allow you to put flagged IP ranges into an array and make that available on every node. Then you can use filters to detect users who've logged in from a flagged IP range. Jon Gregg On Thu, Feb 23, 2017 at 9:19 PM, Mina Aslani <aslanim...@gmail.c
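
A minimal sketch of that pattern, assuming the flagged ranges are simple prefixes in a hypothetical badRanges array (a real CIDR-range check would need more than startsWith):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("flagged-ips"))

    // Hypothetical flagged prefixes; in practice these would be read from a file.
    val badRanges = Array("10.0.", "192.168.")
    val badRangesBC = sc.broadcast(badRanges)

    // Placeholder (user, ip) login events.
    val logins = sc.parallelize(Seq(("alice", "10.0.3.7"), ("bob", "8.8.8.8")))

    // Keep only logins whose IP falls in a flagged range.
    val flagged = logins.filter { case (_, ip) =>
      badRangesBC.value.exists(prefix => ip.startsWith(prefix))
    }
    flagged.collect().foreach(println)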

Re: Spark executors in streaming app always uses 2 executors

2017-02-22 Thread Jon Gregg
A fix might be as simple as switching to the direct approach <https://spark.apache.org/docs/2.1.0/streaming-kafka-0-8-integration.html#approach-2-direct-approach-no-receivers> ? Jon Gregg On Wed, Feb 22, 2017 at 12:37 AM, satishl <satish.la...@gmail.com> wrote: > I am reading fro
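
A sketch of the direct approach from that page, assuming the spark-streaming-kafka-0-8 artifact and placeholder broker/topic names:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(10))

    // No receivers: executors read Kafka partitions directly, so parallelism
    // follows the topic's partition count rather than the receiver count.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my_topic"))

    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()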

Re: Query data in subdirectories in Hive Partitions using Spark SQL

2017-02-18 Thread Jon Gregg
ories, and then use Spark SQL to query that Hive table. There might be a cleaner way to do this in Spark 2.0+ but that is a common pattern for me in Spark 1.6 when I know the directory structure but don't have "=" signs in the paths. Jon Gregg On Fri, Feb 17, 2017 at 7:02 PM, 颜
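
A sketch of that Spark 1.6 pattern with hypothetical paths and table names - the ALTER TABLE ... LOCATION step is what maps plain subdirectories (no "=" in the path) onto partitions:

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)   // assumes an existing SparkContext `sc`

    hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS logs (line STRING)
              PARTITIONED BY (dt STRING)""")

    // Point each partition at its directory by hand, since the path has no "dt=".
    hc.sql("ALTER TABLE logs ADD IF NOT EXISTS PARTITION (dt='2017-02-17') " +
           "LOCATION '/data/logs/20170217'")

    // Then query the Hive table with Spark SQL as usual.
    hc.sql("SELECT dt, count(*) FROM logs GROUP BY dt").show()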

Re: skewed data in join

2017-02-17 Thread Jon Gregg
It depends how you salt it. See slide 40 and onwards from a Spark Summit talk here: http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications The speakers use a mod-8 integer salt appended to the end of the key; the salt that works best for you might be
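
A sketch of that mod-8 salt with toy data (the `big` side is skewed on a few hot keys; the small side is replicated once per salt value so every salted key still finds its match):

    import scala.util.Random

    val big = sc.parallelize(Seq(("hot", 1), ("hot", 2), ("cold", 3)))   // skewed keys
    val small = sc.parallelize(Seq(("hot", "a"), ("cold", "b")))
    val numSalts = 8

    // Append a random 0-7 salt to each key on the skewed side.
    val saltedBig = big.map { case (k, v) => (s"${k}_${Random.nextInt(numSalts)}", v) }

    // Replicate the small side once per salt value.
    val saltedSmall = small.flatMap { case (k, v) =>
      (0 until numSalts).map(i => (s"${k}_$i", v))
    }

    // Join on salted keys, then strip the salt back off.
    val joined = saltedBig.join(saltedSmall).map { case (saltedKey, pair) =>
      (saltedKey.substring(0, saltedKey.lastIndexOf('_')), pair)
    }
    joined.collect().foreach(println)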

Re: notebook connecting Spark On Yarn

2017-02-15 Thread Jon Gregg
Could you just make Hadoop's resource manager (port 8088) available to your users, and they can check available containers that way if they see the launch is stalling? Another option is to reduce the default # of executors and memory per executor in the launch script to some small fraction of
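
For the second option, roughly the equivalent settings expressed as SparkConf entries (the values are placeholders standing in for "a small fraction" of the cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    // Small defaults so an idle notebook session doesn't hold the whole cluster.
    val conf = new SparkConf()
      .setAppName("notebook-session")
      .set("spark.executor.instances", "2")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)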

Re: Order of rows not preserved after cache + count + coalesce

2017-02-13 Thread Jon Gregg
Spark has a zipWithIndex function for RDDs (http://stackoverflow.com/a/26081548) that pairs each element with its index right after you create an RDD, and I believe it preserves order. Then you can sort it by the index after the cache step. I haven't tried this with a Dataframe but this answer seems
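
A minimal sketch of that idea (the index survives the cache, and a sort on it restores the original order):

    val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

    // Pair each element with its position, then swap to (index, element).
    val indexed = rdd.zipWithIndex().map(_.swap)
    indexed.cache()
    indexed.count()   // materialize the cache

    // After caching (and any coalesce), sort by the saved index to restore order.
    val restored = indexed.sortByKey().values
    restored.collect().foreach(println)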

Re: Lost executor 4 Container killed by YARN for exceeding memory limits.

2017-02-13 Thread Jon Gregg
Your logs recommend setting Spark's memoryOverhead configuration variable, and it has helped me with these issues in the past. Search for "memoryOverhead" here: http://spark.apache.org/docs/latest/running-on-yarn.html That said, you're running on a huge cluster as it is. If it's possible
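
As a sketch, the setting from that page applied through SparkConf (the 2048 MB value is a placeholder; newer Spark releases renamed the key to spark.executor.memoryOverhead):

    import org.apache.spark.SparkConf

    // Extra off-heap headroom per executor container, in MB.
    val conf = new SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "2048")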

Re: does persistence required for single action ?

2017-02-08 Thread Jon Gregg
Hard to say without more context around where your job is stalling, what file sizes you're working with etc. Best answer would be to test and see, but in general for simple DAGs, I find that not persisting anything typically runs the fastest. If I persist anything it would be rdd6 because it took
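
For illustration, persisting that one intermediate would look roughly like this (`rdd6` here is a placeholder for the expensive RDD in the original question's DAG):

    import org.apache.spark.storage.StorageLevel

    val rdd6 = sc.parallelize(1 to 1000).map(x => x * x)   // stand-in for the costly step

    rdd6.persist(StorageLevel.MEMORY_AND_DISK)
    rdd6.count()   // materialize once; later actions reuse it instead of recomputing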

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Jon Gregg
Strange that it's working for some directories but not others. Looks like wholeTextFiles maybe doesn't work with S3? https://issues.apache.org/jira/browse/SPARK-4414 . If it's possible to load the data into EMR and run Spark from there that may be a workaround. This blog post shows a Python

Re: Cannot read Hive Views in Spark SQL

2017-02-06 Thread Jon Gregg
Confirming that Spark can read newly created views - I just created a test view in Hive and I was able to query it in Spark 1.5 immediately afterward without a refresh. Possibly an issue with your Spark-Hive connection? Jon On Sun, Feb 5, 2017 at 9:31 PM, KhajaAsmath Mohammed <
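
The check described amounts to something like this (hypothetical view name, Spark 1.x HiveContext):

    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    // Query the Hive view directly; no refresh was needed in my test.
    val df = sqlContext.sql("SELECT * FROM my_db.my_view LIMIT 10")
    df.show()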

Re: Starting a new Spark codebase, Python or Scala / Java?

2016-11-21 Thread Jon Gregg
Spark is written in Scala, so yes it's still the strongest option. You also get the Dataset type with Scala (compile time type-safety), and that's not an available feature with Python. That said, I think the Python API is a viable candidate if you use Pandas for Data Science. There are
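
A short illustration of the compile-time point, using the Spark 2.x Dataset API with placeholder names:

    import org.apache.spark.sql.SparkSession

    case class User(name: String, age: Int)

    val spark = SparkSession.builder.appName("ds-example").getOrCreate()
    import spark.implicits._

    // Dataset[User]: field names and types are checked by the compiler.
    val users = Seq(User("alice", 30), User("bob", 25)).toDS()
    val adults = users.filter(_.age >= 18)

    // users.map(_.agee) would not compile, while the untyped
    // df.select("agee") equivalent only fails at runtime.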

Re: How do I access the nested field in a dataframe, spark Streaming app... Please help.

2016-11-20 Thread Jon Gregg
In these cases it might help to just flatten the DataFrame. Here's a helper function from the tutorial (scroll down to the "Flattening" header):
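
The helper itself is cut off above; as a sketch of the flattening idea, nested fields can be selected by dotted path and aliased to top-level columns:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder.appName("flatten").getOrCreate()

    // Tiny nested record standing in for the streaming app's data.
    val json = """{"user": {"name": "alice", "address": {"city": "nyc"}}}"""
    val df = spark.read.json(spark.sparkContext.parallelize(Seq(json)))

    // Dotted paths plus aliases pull struct fields up to flat columns.
    val flat = df.select(
      col("user.name").alias("user_name"),
      col("user.address.city").alias("user_city"))
    flat.show()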

Re: Spark AVRO S3 read not working for partitioned data

2016-11-17 Thread Jon Gregg
Making a guess here: you need to add s3:ListBucket? http://stackoverflow.com/questions/35803808/spark-saveastextfile-to-s3-fails On Thu, Nov 17, 2016 at 2:11 PM, Jain, Nishit wrote: > When I read a specific file it works: > > val filePath=

Re: Need guidelines in Spark Streaming and Kafka integration

2016-11-16 Thread Jon Gregg
Since you're completely new to Kafka, I would start with the Kafka docs ( https://kafka.apache.org/documentation). You should be able to get through the Getting Started part easily and there are some examples for setting up a basic Kafka server. You don't need Kafka to start working with Spark

Re: Newbie question - Best way to bootstrap with Spark

2016-11-14 Thread Jon Gregg
Piggybacking off this - how are you guys teaching DataFrames and Datasets to new users? I haven't taken the edX courses but I don't see Spark SQL covered heavily in the syllabus. I've dug through the Databricks documentation but it's a lot of information for a new user, I think - hoping there is

Re: A number of issues when running spark-ec2

2016-04-16 Thread Jon Gregg
That link points to hadoop2.6.tgz. I tried changing the URL to https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz and I get a NoSuchKey error. Should I just go with it even though it says hadoop2.6? On Sat, Apr 16, 2016 at 5:37 PM, Ted Yu wrote:

Re: Spark SQL 1.3 not finding attribute in DF

2015-12-07 Thread Jon Gregg
I'm working with a Hadoop distribution that doesn't support 1.5 yet; we'll be able to upgrade in probably two months. For now I'm seeing the same issue with Spark not recognizing an existing column name in many hive-table-to-dataframe situations: Py4JJavaError: An error occurred while calling

Re: Spark SQL "SELECT ... LIMIT" scans the entire Hive table?

2015-11-05 Thread Jon Gregg
Here's my code: my_data = sqlCtx.sql("SELECT * FROM raw.site_activity_data LIMIT 2") my_data.collect() raw.site_activity_data is a Hive external table atop daily-partitioned .gzip data. When I execute the command I start seeing many of these pop up in the logs (below is a small subset)

Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-20 Thread Jon Gregg
1.3 on cdh 5.4.4 ... I'll take the responses to mean that the fix will be probably a few months away for us. Not a huge problem but something I've run into a number of times. On Tue, Oct 20, 2015 at 3:01 PM, Yin Huai wrote: > btw, what version of Spark did you use? > > On

Re: collect() works, take() returns ImportError: No module named iter

2015-08-10 Thread Jon Gregg
We did have 2.7 on the driver, 2.6 on the edge nodes and figured that was the issue, so we've tried many combinations since then with all three of 2.6.6, 2.7.5, and Anaconda's 2.7.10 on each node with different PATHs and PYTHONPATHs each time. Every combination has produced the same error. We

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Jon Gregg
You can call collect() to pull in the contents of an RDD into the driver: val badIPsLines = badIPs.collect() On Fri, Feb 6, 2015 at 12:19 PM, Jon Gregg jonrgr...@gmail.com wrote: OK I tried that, but how do I convert an RDD to a Set that I can then broadcast and cache? val badIPs

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Jon Gregg
Is the SparkContext you're using the same one that the StreamingContext wraps? If not, I don't think using two is supported. -Sandy On Tue, Feb 10, 2015 at 9:58 AM, Jon Gregg jonrgr...@gmail.com wrote: I'm still getting an error. Here's my code, which works successfully when tested using

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-10 Thread Jon Gregg
wrote: You should be able to replace that second line with val sc = ssc.sparkContext On Tue, Feb 10, 2015 at 10:04 AM, Jon Gregg jonrgr...@gmail.com wrote: They're separate in my code, how can I combine them? Here's what I have: val sparkConf = new SparkConf() val ssc = new
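
Put together, the suggested structure looks roughly like this (batch interval and broadcast contents are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sparkConf = new SparkConf().setAppName("streaming-app")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    // Reuse the StreamingContext's own SparkContext instead of building a second one.
    val sc = ssc.sparkContext
    val badIPsBC = sc.broadcast(Set("1.2.3.4"))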

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-06 Thread Jon Gregg
OK I tried that, but how do I convert an RDD to a Set that I can then broadcast and cache? val badIPs = sc.textFile("hdfs:///user/jon/" + "badfullIPs.csv") val badIPsLines = badIPs.getLines val badIpSet = badIPsLines.toSet val badIPsBC = sc.broadcast(badIpSet) produces the
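
The fix suggested in the replies above, sketched against this code (RDDs have no getLines; the lines are collected to the driver first, then broadcast):

    // collect() replaces the invalid getLines call.
    val badIPs = sc.textFile("hdfs:///user/jon/" + "badfullIPs.csv")
    val badIpSet = badIPs.collect().toSet
    val badIPsBC = sc.broadcast(badIpSet)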

Re: how to get actual count from as long from JavaDStream ?

2014-09-30 Thread Jon Gregg
Hi Andy, I'm new to Spark and have been working with Scala, not Java, but I see there's a dstream() method to convert from JavaDStream to DStream. Then within DStream http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/api/java/org/apache/spark/streaming/dstream/DStream.html there is a
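
On the Scala side, getting the per-batch count as a plain long looks roughly like this (the socket source and port are placeholders; a JavaDStream would first go through dstream() as described):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("count-example"), Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    // count() yields a DStream[Long] with one element per batch, not a single
    // long, so the value is extracted inside foreachRDD.
    lines.count().foreachRDD { rdd =>
      val n: Long = rdd.take(1).headOption.getOrElse(0L)
      println("records this batch: " + n)
    }
    ssc.start()
    ssc.awaitTermination()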