ing-guide.html#broadcast-variables>
that
would allow you to put flagged IP ranges into an array and make that
available on every node. Then you can filter to detect users who've
logged in from a flagged IP range.
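A minimal sketch of that pattern (paths are placeholders, and it matches exact IPs rather than ranges for brevity):

import org.apache.spark.SparkContext

// Sketch only: broadcast the flagged-IP set once, then filter logins on the executors.
def suspiciousLogins(sc: SparkContext) = {
  val flaggedIPs = sc.textFile("hdfs:///path/to/flagged_ips.txt").collect().toSet
  val flaggedBC  = sc.broadcast(flaggedIPs)            // shipped to every node once

  sc.textFile("hdfs:///path/to/logins")                // lines assumed to look like "user,ip"
    .map(_.split(","))
    .filter(f => f.length >= 2 && flaggedBC.value.contains(f(1)))
    .map(f => f(0))                                    // users who logged in from a flagged IP
}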
Jon Gregg
On Thu, Feb 23, 2017 at 9:19 PM, Mina Aslani <aslanim...@gmail.com> wrote:
A fix might be as simple as switching to the direct approach
<https://spark.apache.org/docs/2.1.0/streaming-kafka-0-8-integration.html#approach-2-direct-approach-no-receivers>?
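For reference, the direct approach from that page looks roughly like this (broker list and topic are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-less direct stream against Kafka 0.8; values here are placeholders.
val conf = new SparkConf().setAppName("DirectKafkaExample")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("my-topic")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map(_._2).count().print()   // message values per batch

ssc.start()
ssc.awaitTermination()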
Jon Gregg
On Wed, Feb 22, 2017 at 12:37 AM, satishl <satish.la...@gmail.com> wrote:
> I am reading fro
ories, and then use Spark SQL to
query that Hive table. There might be a cleaner way to do this in Spark
2.0+ but that is a common pattern for me in Spark 1.6 when I know the
directory structure but don't have "=" signs in the paths.
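Roughly, the pattern is something like this (table name, schema, and locations are made up; Spark 1.6 HiveContext):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Define the external table once, then point each partition at an existing directory.
hiveContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events (line STRING)
  PARTITIONED BY (dt STRING)
""")
hiveContext.sql(
  "ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2017-02-17') " +
  "LOCATION 'hdfs:///data/events/20170217'")

// Now the directories are queryable through Spark SQL.
hiveContext.sql("SELECT dt, COUNT(*) FROM events GROUP BY dt").show()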
Jon Gregg
On Fri, Feb 17, 2017 at 7:02 PM, 颜
It depends how you salt it. See slide 40 and onwards from a Spark Summit
talk here:
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
The speakers use a mod8 integer salt appended to the end of the key; the salt that works best for
you might be
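As a sketch of the mod8 idea (RDD name and types are assumptions):

import scala.util.Random

// Append a 0-7 suffix to spread a hot key over 8 partial aggregates,
// then strip the suffix and combine the partial results.
val counts = pairs                                    // pairs: RDD[(String, Long)], hypothetical
  .map { case (k, v) => (s"$k#${Random.nextInt(8)}", v) }
  .reduceByKey(_ + _)                                 // partial sums per salted key
  .map { case (salted, v) => (salted.split("#")(0), v) }
  .reduceByKey(_ + _)                                 // final sums per original key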
Could you just make Hadoop's resource manager UI (port 8088) available to your
users, so they can check available containers themselves if they see the
launch is stalling?
Another option is to reduce the default # of executors and memory per
executor in the launch script to some small fraction of
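In SparkConf terms that might look like this (values are arbitrary examples):

import org.apache.spark.SparkConf

// Small defaults for a shared launch script; users can raise them when needed.
val conf = new SparkConf()
  .set("spark.executor.instances", "4")
  .set("spark.executor.memory", "2g")
  .set("spark.executor.cores", "1")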
Spark has a zipWithIndex function for RDDs (
http://stackoverflow.com/a/26081548) that pairs each element with its index right
after you create an RDD, and I believe it preserves order. Then you can sort
by the index after the cache step.
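Something along these lines (input path is a placeholder):

// Remember the original order with zipWithIndex, cache, then restore order by sorting.
val rdd = sc.textFile("hdfs:///data/input")
val indexed = rdd.zipWithIndex()                      // (element, index), index follows input order
  .map { case (value, idx) => (idx, value) }
  .cache()
val backInOrder = indexed.sortByKey().values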
I haven't tried this with a DataFrame but this answer seems
Setting Spark's memoryOverhead configuration variable is recommended in
your logs, and has helped me with these issues in the past. Search for
"memoryOverhead" here:
http://spark.apache.org/docs/latest/running-on-yarn.html
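For example (values in MB, and the right numbers depend on your job):

import org.apache.spark.SparkConf

// Bump the off-heap overhead YARN allots per container.
val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "2048")
  .set("spark.yarn.driver.memoryOverhead", "1024")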
That said, you're running on a huge cluster as it is. If it's possible
Hard to say without more context around where your job is stalling, what
file sizes you're working with etc.
Best answer would be to test and see, but in general for simple DAGs, I
find that not persisting anything typically runs the fastest. If I persist
anything it would be rdd6 because it took
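If you do persist it, something like this is usually enough (rdd6 here is just the hypothetical RDD from the question):

import org.apache.spark.storage.StorageLevel

rdd6.persist(StorageLevel.MEMORY_AND_DISK)   // persist only the RDD that several branches reuse
// ... run the downstream actions ...
rdd6.unpersist()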
Strange that it's working for some directories but not others. Looks like
wholeTextFiles maybe doesn't work with S3?
https://issues.apache.org/jira/browse/SPARK-4414 .
If it's possible to load the data into EMR and run Spark from there that
may be a workaround. This blogspot shows a python
Confirming that Spark can read newly created views - I just created a test
view in HDFS and I was able to query it in Spark 1.5 immediately after
without a refresh. Possibly an issue with your Spark-Hive connection?
Jon
On Sun, Feb 5, 2017 at 9:31 PM, KhajaAsmath Mohammed <
Spark is written in Scala, so yes, it's still the strongest option. You
also get the Dataset type with Scala (compile-time type safety), and that's
not an available feature in Python.
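A small illustration of what that buys you (assumes Spark 2.x; class, path, and field names are invented):

import org.apache.spark.sql.SparkSession

case class Login(user: String, ip: String, ts: Long)

val spark = SparkSession.builder().appName("DatasetExample").getOrCreate()
import spark.implicits._

// as[Login] gives a typed Dataset; a typo like l.usr fails at compile time, not at runtime.
val logins = spark.read.json("hdfs:///data/logins.json").as[Login]
val activeUsers = logins.filter(l => l.user.nonEmpty).map(_.user).distinct()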
That said, I think the Python API is a viable candidate if you use Pandas
for Data Science. There are
In these cases it might help to just flatten the DataFrame. Here's a
helper function from the tutorial (scroll down to the "Flattening" header):
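The tutorial's helper isn't quoted here, but a rough one-level version (not the tutorial's code) might look like:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Promote each struct field to a top-level column named parent_child.
def flattenOneLevel(df: DataFrame): DataFrame = {
  val cols = df.schema.fields.flatMap { f =>
    f.dataType match {
      case s: StructType => s.fieldNames.map(n => df(s"${f.name}.$n").as(s"${f.name}_$n"))
      case _             => Array(df(f.name))
    }
  }
  df.select(cols: _*)
}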
Making a guess here: you need to add s3:ListBucket?
http://stackoverflow.com/questions/35803808/spark-saveastextfile-to-s3-fails
On Thu, Nov 17, 2016 at 2:11 PM, Jain, Nishit
wrote:
> When I read a specific file it works:
>
> val filePath=
Since you're completely new to Kafka, I would start with the Kafka docs (
https://kafka.apache.org/documentation). You should be able to get through
the Getting Started part easily and there are some examples for setting up
a basic Kafka server.
You don't need Kafka to start working with Spark
Piggybacking off this - how are you guys teaching DataFrames and Datasets
to new users? I haven't taken the edx courses but I don't see Spark SQL
covered heavily in the syllabus. I've dug through the Databricks
documentation but it's a lot of information for a new user I think - hoping
there is
That link points to hadoop2.6.tgz. I tried changing the URL to
https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz
and I get a NoSuchKey error.
Should I just go with it even though it says hadoop2.6?
On Sat, Apr 16, 2016 at 5:37 PM, Ted Yu wrote:
I'm working with a Hadoop distribution that doesn't support 1.5 yet; we'll
be able to upgrade in probably two months. For now I'm seeing the same
issue with Spark not recognizing an existing column name in many
hive-table-to-dataframe situations:
Py4JJavaError: An error occurred while calling
Here's my code:
my_data = sqlCtx.sql("SELECT * FROM raw.site_activity_data LIMIT 2")
my_data.collect()
raw.site_activity_data is a Hive external table atop daily-partitioned
.gzip data. When I execute the command I start seeing many of these pop up
in the logs (below is a small subset)
1.3 on CDH 5.4.4 ... I'll take the responses to mean that the fix will
probably be a few months away for us. Not a huge problem but something I've
run into a number of times.
On Tue, Oct 20, 2015 at 3:01 PM, Yin Huai wrote:
> btw, what version of Spark did you use?
>
> On
We did have 2.7 on the driver, 2.6 on the edge nodes and figured that was
the issue, so we've tried many combinations since then with all three of
2.6.6, 2.7.5, and Anaconda's 2.7.10 on each node with different PATHs and
PYTHONPATHs each time. Every combination has produced the same error.
We
You can call collect() to pull in the contents of an RDD into the driver:
val badIPsLines = badIPs.collect()
On Fri, Feb 6, 2015 at 12:19 PM, Jon Gregg jonrgr...@gmail.com wrote:
OK I tried that, but how do I convert an RDD to a Set that I can then
broadcast and cache?
val badIPs
Is the SparkContext you're using the same one that the StreamingContext
wraps? If not, I don't think using two is supported.
-Sandy
On Tue, Feb 10, 2015 at 9:58 AM, Jon Gregg jonrgr...@gmail.com wrote:
I'm still getting an error. Here's my code, which works successfully
when tested using
You should be able to replace that second line with
val sc = ssc.sparkContext
On Tue, Feb 10, 2015 at 10:04 AM, Jon Gregg jonrgr...@gmail.com wrote:
They're separate in my code; how can I combine them? Here's what I have:
val sparkConf = new SparkConf()
val ssc = new
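A combined version along the lines of Sandy's suggestion might look like this (batch interval is arbitrary here):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("StreamingApp")
val ssc = new StreamingContext(sparkConf, Seconds(30))
val sc = ssc.sparkContext          // reuse the SparkContext the StreamingContext wraps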
OK I tried that, but how do I convert an RDD to a Set that I can then
broadcast and cache?
val badIPs = sc.textFile("hdfs:///user/jon/badfullIPs.csv")
val badIPsLines = badIPs.getLines
val badIpSet = badIPsLines.toSet
val badIPsBC = sc.broadcast(badIpSet)
produces the
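A sketch of the fix discussed above: collect() the lines to the driver, build the Set there, then broadcast it (getLines isn't an RDD method):

val badIPs   = sc.textFile("hdfs:///user/jon/badfullIPs.csv")
val badIpSet = badIPs.collect().toSet     // pull the lines to the driver, build the Set there
val badIPsBC = sc.broadcast(badIpSet)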
Hi Andy
I'm new to Spark and have been working with Scala rather than Java, but I see
there's a dstream() method to convert from JavaDStream to DStream. Then within
DStream
http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/api/java/org/apache/spark/streaming/dstream/DStream.html
there is a