which database for gene alignment data ?

2015-06-05 Thread roni
I want to use Spark to read compressed .bed files containing gene
sequencing alignment data.
I want to store the BED file data in a database and then use external gene
expression data to find overlaps, etc. Which database is best for this?
Thanks
-Roni
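Whichever store ends up holding the records, the overlap test itself is simple enough to run inside a Spark join. A minimal plain-Python sketch (the field layout follows the standard BED convention of 0-based, end-exclusive coordinates; the variable names and the commented Spark usage are illustrative, not from any particular library):

```python
def parse_bed_line(line):
    """Extract the three mandatory BED columns: chrom, start, end."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return (chrom, int(start), int(end))

def overlaps(a_start, a_end, b_start, b_end):
    """Half-open interval overlap: BED coordinates are 0-based and
    end-exclusive, so intervals that merely touch do not overlap."""
    return a_start < b_end and b_start < a_end

# In Spark, one might key both datasets by chromosome and filter a join
# with this predicate, e.g.:
#   bed.keyBy(lambda r: r[0]) \
#      .join(expression.keyBy(lambda r: r[0])) \
#      .filter(lambda kv: overlaps(kv[1][0][1], kv[1][0][2],
#                                  kv[1][1][1], kv[1][1][2]))

print(parse_bed_line("chr1\t100\t200"))  # ('chr1', 100, 200)
```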


Re: Managing spark processes via supervisord

2015-06-05 Thread ayan guha
I use a simple Python script to launch the cluster. I just did it for fun,
so of course it is not the best and a lot of modifications can be done. But
I think you are looking for something similar?

import subprocess as s
from time import sleep

cmd = "D:\\spark\\spark-1.3.1-bin-hadoop2.6\\spark-1.3.1-bin-hadoop2.6\\spark-1.3.1-bin-hadoop2.6\\bin\\spark-class.cmd"

master = "org.apache.spark.deploy.master.Master"
worker = "org.apache.spark.deploy.worker.Worker"
masterUrl = "spark://BigData:7077"
cmds = {"masters": 1, "workers": 3}

masterProcess = [cmd, master]
workerProcess = [cmd, worker, masterUrl]

noWorker = 3

pMaster = s.Popen(masterProcess)
sleep(3)  # give the master a moment to start before launching workers

pWorkers = []
for i in range(noWorker):
    pw = s.Popen(workerProcess)
    pWorkers.append(pw)



On Sat, Jun 6, 2015 at 8:19 AM, Mike Trienis 
wrote:

> Thanks Igor,
>
> I managed to find a fairly simple solution. It seems that the shell
> scripts (e.g. start-master.sh, start-slave.sh) end up executing
> /bin/spark-class, which is always run in the foreground.
>
> Here is a solution I provided on stackoverflow:
>
> - http://stackoverflow.com/questions/30672648/how-to-autostart-an-apache-spark-cluster-using-supervisord/30676844#30676844
>
>
> Cheers Mike
>
>
> On Wed, Jun 3, 2015 at 12:29 PM, Igor Berman 
> wrote:
>
>> assuming you are talking about a standalone cluster:
>> imho, with workers you won't get any problems and it's straightforward,
>> since they are usually foreground processes.
>> with the master it's a bit more complicated: ./sbin/start-master.sh goes
>> to the background, which is not good for supervisord, but I think it's
>> doable (going to set it up too in a few days)
>>
>> On 3 June 2015 at 21:46, Mike Trienis  wrote:
>>
>>> Hi All,
>>>
>>> I am curious to know if anyone has successfully deployed a spark cluster
>>> using supervisord?
>>>
>>>- http://supervisord.org/
>>>
>>> Currently I am using the cluster launch scripts, which are working
>>> great; however, every time I reboot my VM or development environment I
>>> need to re-launch the cluster.
>>>
>>> I am considering using supervisord to control all the processes (worker,
>>> master, etc.) in order to have the cluster up and running after boot-up,
>>> although I'd like to understand if it will cause more issues than it
>>> solves.
>>>
>>> Thanks, Mike.
>>>
>>>
>>
>


-- 
Best Regards,
Ayan Guha


Re: Managing spark processes via supervisord

2015-06-05 Thread Mike Trienis
Thanks Igor,

I managed to find a fairly simple solution. It seems that the shell scripts
(e.g. start-master.sh, start-slave.sh) end up executing /bin/spark-class,
which is always run in the foreground.

Here is a solution I provided on stackoverflow:

- http://stackoverflow.com/questions/30672648/how-to-autostart-an-apache-spark-cluster-using-supervisord/30676844#30676844
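For reference, a supervisord program entry along those lines might look like the following (the install path, user, and master host are assumptions, not taken from the linked answer):

```ini
[program:spark-master]
command=/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
user=spark
autostart=true
autorestart=true

[program:spark-worker]
command=/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://master-host:7077
user=spark
autostart=true
autorestart=true
```

Because spark-class stays in the foreground, supervisord can track and restart the processes directly.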


Cheers Mike


On Wed, Jun 3, 2015 at 12:29 PM, Igor Berman  wrote:

> assuming you are talking about a standalone cluster:
> imho, with workers you won't get any problems and it's straightforward,
> since they are usually foreground processes.
> with the master it's a bit more complicated: ./sbin/start-master.sh goes
> to the background, which is not good for supervisord, but I think it's
> doable (going to set it up too in a few days)
>
> On 3 June 2015 at 21:46, Mike Trienis  wrote:
>
>> Hi All,
>>
>> I am curious to know if anyone has successfully deployed a spark cluster
>> using supervisord?
>>
>>- http://supervisord.org/
>>
>> Currently I am using the cluster launch scripts, which are working
>> great; however, every time I reboot my VM or development environment I
>> need to re-launch the cluster.
>>
>> I am considering using supervisord to control all the processes (worker,
>> master, etc.) in order to have the cluster up and running after boot-up,
>> although I'd like to understand if it will cause more issues than it
>> solves.
>>
>> Thanks, Mike.
>>
>>
>


Re: Spark SQL and Streaming Results

2015-06-05 Thread Tathagata Das
I am not sure. Saisai may be able to say more about it.

TD

On Fri, Jun 5, 2015 at 5:35 PM, Todd Nist  wrote:

> There used to be a project, StreamSQL (
> https://github.com/thunderain-project/StreamSQL), but it appears a bit
> dated and I do not see it in the Spark repo, though I may have missed it.
>
> @TD Is this project still active?
>
> I'm not sure what the status is, but it may provide some insights on how to
> achieve what you're looking to do.
>
> On Fri, Jun 5, 2015 at 6:34 PM, Tathagata Das  wrote:
>
>> You could take a look at the RDD *Async operations and their source code.
>> Maybe that can help with getting some early results.
>>
>> TD
>>
>> On Fri, Jun 5, 2015 at 8:41 AM, Pietro Gentile <
>> pietro.gentile89.develo...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>>
>>> what is the best way to perform Spark SQL queries and obtain the result
>>> tuples in a streaming fashion? In particular, I want to aggregate data
>>> and obtain the first, incomplete results quickly, updated until the
>>> aggregation is complete.
>>>
>>> Best Regards.
>>>
>>
>>
>
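To illustrate the kind of incremental answer being asked about, here is a plain-Python sketch (nothing Spark-specific) of an aggregation that yields an early, incomplete result after each chunk and refines it until the final value; with Spark one might approximate this with the *Async actions or a streaming window:

```python
def running_totals(chunks):
    """Yield a running sum after each chunk of values.

    Every yielded value is an early, incomplete aggregate; the last
    value yielded is the complete one.
    """
    total = 0
    for chunk in chunks:
        total += sum(chunk)
        yield total

# Early results arrive as soon as each chunk is processed:
print(list(running_totals([[1, 2], [3, 4], [5]])))  # [3, 10, 15]
```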


Running SparkSql against Hive tables

2015-06-05 Thread James Pirz
I am pretty new to Spark. Using Spark 1.3.1, I am trying to use Spark SQL
to run some SQL scripts on the cluster. I realized that for better
performance it is a good idea to use Parquet files. I have 2 questions
regarding that:

1) If I want to use Spark SQL against *partitioned & bucketed* tables with
Parquet format in Hive, does the provided Spark binary on the Apache
website support that, or do I need to build a new Spark binary with some
additional flags? (I found a note in the documentation about enabling Hive
support, but I could not fully work out what the correct way of building
is, if I need to build.)

2) Does running Spark SQL against tables in Hive degrade performance, and
is it better to load the Parquet files directly into HDFS, or is having
Hive in the picture harmless?

Thanks


Re: How to run spark streaming application on YARN?

2015-06-05 Thread Saiph Kappa
I was able to run my application using a Hadoop/YARN cluster with just one
machine. Today I tried to extend the cluster with one more machine, but I
ran into problems with the YARN node manager on the newly added machine:

Node Manager Log:
«
2015-06-06 01:41:33,379 INFO
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
Initializing user myuser
2015-06-06 01:41:33,382 INFO
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying
from
/tmp/hadoop-myuser/nm-local-dir/nmPrivate/container_1433549642381_0004_01_03.tokens
to
/tmp/hadoop-myuser/nm-local-dir/usercache/myuser/appcache/application_1433549642381_0004/container_1433549642381_0004_01_03.tokens
2015-06-06 01:41:33,382 INFO
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
Localizer CWD set to
/tmp/hadoop-myuser/nm-local-dir/usercache/myuser/appcache/application_1433549642381_0004
=
file:/tmp/hadoop-myuser/nm-local-dir/usercache/myuser/appcache/application_1433549642381_0004
2015-06-06 01:41:33,405 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
{
file:/home/myuser/my-spark/assembly/target/scala-2.10/spark-assembly-1.3.2-SNAPSHOT-hadoop2.7.0.jar,
1433441011000, FILE, null } failed: Resource
file:/home/myuser/my-spark/assembly/target/scala-2.10/spark-assembly-1.3.2-SNAPSHOT-hadoop2.7.0.jar
changed on src filesystem (expected 1433441011000, was 1433531913000
java.io.IOException: Resource
file:/home/myuser/my-spark/assembly/target/scala-2.10/spark-assembly-1.3.2-SNAPSHOT-hadoop2.7.0.jar
changed on src filesystem (expected 1433441011000, was 1433531913000
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:255)
at
org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

2015-06-06 01:41:33,405 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
Resource
file:/home/myuser/my-spark/assembly/target/scala-2.10/spark-assembly-1.3.2-SNAPSHOT-hadoop2.7.0.jar(->/tmp/hadoop-myuser/nm-local-dir/usercache/myuser/filecache/15/spark-assembly-1.3.2-SNAPSHOT-hadoop2.7.0.jar)
transitioned from DOWNLOADING to FAILED
2015-06-06 01:41:33,406 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
Container container_1433549642381_0004_01_03 transitioned from
LOCALIZING to LOCALIZATION_FAILED
2015-06-06 01:41:33,406 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl:
Container container_1433549642381_0004_01_03 sent RELEASE event on a
resource request {
file:/home/myuser/my-spark/assembly/target/scala-2.10/spark-assembly-1.3.2-SNAPSHOT-hadoop2.7.0.jar,
1433441011000, FILE, null } not present in cache.
2015-06-06 01:41:33,406 WARN org.apache.hadoop.ipc.Client: interrupted
waiting to send rpc request to server
»

I have this jar on both machines:
/home/myuser/my-spark/assembly/target/scala-2.10/spark-assembly-1.3.2-SNAPSHOT-hadoop2.7.0.jar

However, I simply copied the my-spark folder from machine1 to machine2 so
that YARN could find the jar.

Any ideas what could be wrong? Isn't this the correct way to share Spark
jars across a YARN cluster?

Thanks.

On Thu, Jun 4, 2015 at 7:20 PM, Saiph Kappa  wrote:

> Additionally, I think this document (
> https://spark.apache.org/docs/latest/building-spark.html ) should mention
> that the protobuf.version might need to be changed to match the one used in
> the chosen hadoop version. For instance, with hadoop 2.7.0 I had to change
> protobuf.version to 1.5.0 to be able to run my application.
>
> On Thu, Jun 4, 2015 at 7:14 PM, Sandy Ryza 
> wrote:
>
>> That might work, but there might also be other steps that are required.
>>
>> -Sandy
>>
>> On Thu, Jun 4, 2015 at 11:13 AM, Saiph Kappa 
>> wrote:
>>
>>> Thanks! It is working fine now with spark-submit. Just out of curiosity,
>>> how would you use org.apache.spark.deploy.yarn.Client? Adding that
>>> spark_yarn jar to the configuration inside the a

Re: Spark SQL and Streaming Results

2015-06-05 Thread Todd Nist
There used to be a project, StreamSQL (
https://github.com/thunderain-project/StreamSQL), but it appears a bit
dated and I do not see it in the Spark repo, though I may have missed it.

@TD Is this project still active?

I'm not sure what the status is, but it may provide some insights on how to
achieve what you're looking to do.

On Fri, Jun 5, 2015 at 6:34 PM, Tathagata Das  wrote:

> You could take a look at the RDD *Async operations and their source code.
> Maybe that can help with getting some early results.
>
> TD
>
> On Fri, Jun 5, 2015 at 8:41 AM, Pietro Gentile <
> pietro.gentile89.develo...@gmail.com> wrote:
>
>> Hi all,
>>
>>
>> what is the best way to perform Spark SQL queries and obtain the result
>> tuples in a streaming fashion? In particular, I want to aggregate data
>> and obtain the first, incomplete results quickly, updated until the
>> aggregation is complete.
>>
>> Best Regards.
>>
>
>


RE: Cassandra Submit

2015-06-05 Thread Mohammed Guller
Check your spark.cassandra.connection.host setting. It should be pointing to 
one of your Cassandra nodes.
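As an example, the setting can be passed on the submit line (the address below is a placeholder; use one of your actual Cassandra node addresses):

```shell
# spark.cassandra.connection.host must be a reachable Cassandra node,
# not the 127.0.1.1 loopback alias seen in the error message.
spark-submit \
  --conf spark.cassandra.connection.host=10.0.0.5 \
  your_app.py
```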

Mohammed

From: Yasemin Kaya [mailto:godo...@gmail.com]
Sent: Friday, June 5, 2015 7:31 AM
To: user@spark.apache.org
Subject: Cassandra Submit

Hi,

I am using Cassandra in my project. I get this error: Exception in thread
"main" java.io.IOException: Failed to open native connection to Cassandra at
{127.0.1.1}:9042

I think I have to modify the submit line. What should I add or remove when I
submit my project?

Best,
yasemin


--
hiç ender hiç


Re: Can you specify partitions?

2015-06-05 Thread amghost
Maybe you could try to implement your own Partitioner. As I remember, by
default Spark uses HashPartitioner.
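A small sketch of the idea in Python (the routing scheme is made up for illustration; on a PySpark pair RDD a partition function is supplied via `rdd.partitionBy(numPartitions, partitionFunc)`, while in Scala one would subclass `org.apache.spark.Partitioner`):

```python
def hash_partition(key, num_partitions):
    """Essentially what the default HashPartitioner does: a key's
    partition is its hash modulo the number of partitions."""
    return hash(key) % num_partitions

def alpha_partition(key, num_partitions=2):
    """A made-up custom scheme: keys starting with 'a'-'m' go to
    partition 0, everything else to partition 1."""
    return 0 if key[:1].lower() <= "m" else 1

# With PySpark this could be used as (sketch, not run here):
#   pairs.partitionBy(2, alpha_partition)

print(alpha_partition("apple"), alpha_partition("zebra"))  # 0 1
```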



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-you-specify-partitions-tp23156p23187.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Loading CSV to DataFrame and saving it into Parquet for speedup

2015-06-05 Thread Hossein
Why not let SparkSQL deal with parallelism? When using SparkSQL data
sources you can control parallelism by specifying mapred.min.split.size
and mapred.max.split.size in your Hadoop configuration. You can then
repartition your data as you wish and save it as Parquet.
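For instance, the split-size bounds might be set like this in the Hadoop configuration Spark picks up (the values are placeholders, not recommendations):

```xml
<!-- e.g. in the Hadoop Configuration passed to Spark -->
<property>
  <name>mapred.min.split.size</name>
  <value>134217728</value>  <!-- 128 MB -->
</property>
<property>
  <name>mapred.max.split.size</name>
  <value>268435456</value>  <!-- 256 MB -->
</property>
```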

--Hossein

On Thu, May 28, 2015 at 8:32 AM, M Rez  wrote:

> I am using Spark-CSV to load 50 GB of around 10,000 CSV files into a
> couple of unified DataFrames. Since this process is slow I have written
> this snippet:
>
> targetList.foreach { target =>
> // this is using sqlContext.load by getting list of files then
> loading them according to schema files that
> // read before and built their StructType
> getTrace(target, sqlContext)
>   .reduce(_ unionAll _)
>   .registerTempTable(target.toUpperCase())
> sqlContext.sql("SELECT * FROM " + target.toUpperCase())
>   .saveAsParquetFile(processedTraces + target)
> }
>
> to load the CSV files, then union all the CSV files with the same schema
> and write them into a single Parquet file with its parts. The problem is
> that my CPU (not all CPUs are busy) and disk (SSD, at 1 MB/s at most) are
> barely utilized. I wonder what I am doing wrong.
>
> snippet for getTrace:
>
> def getTrace(target: String, sqlContext: SQLContext): Seq[DataFrame] = {
> logFiles(mainLogFolder + target).map {
>   file =>
> sqlContext.load(
>   driver,
>   // schemaSelect builds the StructType once
>   schemaSelect(schemaFile, target, sqlContext),
>   Map("path" -> file, "header" -> "false", "delimiter" -> ","))
> }
>   }
>
> thanks for any help
>
>
>
>
> -
> regards,
> mohamad
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Loading-CSV-to-DataFrame-and-saving-it-into-Parquet-for-speedup-tp23071.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL and Streaming Results

2015-06-05 Thread Tathagata Das
You could take a look at the RDD *Async operations and their source code.
Maybe that can help with getting some early results.

TD

On Fri, Jun 5, 2015 at 8:41 AM, Pietro Gentile <
pietro.gentile89.develo...@gmail.com> wrote:

> Hi all,
>
>
> what is the best way to perform Spark SQL queries and obtain the result
> tuples in a streaming fashion? In particular, I want to aggregate data
> and obtain the first, incomplete results quickly, updated until the
> aggregation is complete.
>
> Best Regards.
>


Re: Spark 1.3.1 On Mesos Issues.

2015-06-05 Thread John Omernik
Thanks all. The answers post is me too; I multi-thread. Ted is aware of it
too, and MapR is helping me with it. I shall report the answer of that
investigation when we have it.

As for reproduction, I've installed the MapR file system, tried both
versions 4.0.2 and 4.1.0, have Mesos running alongside MapR, and then I use
standard methods for submitting Spark jobs to Mesos. I don't have my
configs now, on vacation :) but I can share them on Monday.

I appreciate the support I am getting from everyone: the Mesos community,
the Spark community, and MapR. Great to see folks solving problems, and I
will be sure to report back findings as they arise.



On Friday, June 5, 2015, Tim Chen  wrote:

> It seems like there is another thread going on:
>
>
> http://answers.mapr.com/questions/163353/spark-from-apache-downloads-site-for-mapr.html
>
> I'm not particularly sure why, seems like the problem is that getting the
> current context class loader is returning null in this instance.
>
> Do you have some repro steps or config so we can try to reproduce this?
>
> Tim
>
> On Fri, Jun 5, 2015 at 3:40 AM, Steve Loughran  > wrote:
>
>>
>>  On 2 Jun 2015, at 00:14, Dean Wampler > > wrote:
>>
>>  It would be nice to see the code for MapR FS Java API, but my google
>> foo failed me (assuming it's open source)...
>>
>>
>>  I know that MapRFS is closed source, don't know about the java JAR. Why
>> not ask Ted Dunning (cc'd)  nicely to see if he can track down the stack
>> trace for you.
>>
>>   So, shooting in the dark ;) there are a few things I would check, if
>> you haven't already:
>>
>>  1. Could there be 1.2 versions of some Spark jars that get picked up at
>> run time (but apparently not in local mode) on one or more nodes? (Side
>> question: Does your node experiment fail on all nodes?) Put another way,
>> are the classpaths good for all JVM tasks?
>> 2. Can you use just MapR and Spark 1.3.1 successfully, bypassing Mesos?
>>
>>  Incidentally, how are you combining Mesos and MapR? Are you running
>> Spark in Mesos, but accessing data in MapR-FS?
>>
>>  Perhaps the MapR "shim" library doesn't support Spark 1.3.1.
>>
>>  HTH,
>>
>>  dean
>>
>>  Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>>  (O'Reilly)
>> Typesafe 
>> @deanwampler 
>> http://polyglotprogramming.com
>>
>> On Mon, Jun 1, 2015 at 2:49 PM, John Omernik > > wrote:
>>
>>> All -
>>>
>>>  I am facing an odd issue and I am not really sure where to go for
>>> support at this point.  I am running MapR, which complicates things as it
>>> relates to Mesos; however this HAS worked in the past with no issues, so I
>>> am stumped here.
>>>
>>>  So for starters, here is what I am trying to run. This is a simple
>>> show tables using the Hive Context:
>>>
>>>  from pyspark import SparkContext, SparkConf
>>> from pyspark.sql import SQLContext, Row, HiveContext
>>> sparkhc = HiveContext(sc)
>>> test = sparkhc.sql("show tables")
>>> for r in test.collect():
>>>   print r
>>>
>>>  When I run it on 1.3.1 using ./bin/pyspark --master local  This works
>>> with no issues.
>>>
>>>  When I run it using Mesos with all the settings configured (as they
>>> had worked in the past) I get lost tasks, and when I zoom in on them, the
>>> error being reported is below.  Basically it's a NullPointerException on
>>> the com.mapr.fs.ShimLoader.  What's weird to me is that when I compared the
>>> two instances, the class path and everything else is exactly the same.
>>> Yet running in local mode works, and running in Mesos fails.  Also of note,
>>> when the task is scheduled to run on the same node as when I run locally,
>>> that fails too! (Baffling.)
>>>
>>>  Ok, for comparison, how I configured Mesos was to download the mapr4
>>> package from spark.apache.org.  Using the exact same configuration file
>>> (except for changing the executor tgz from 1.2.0 to 1.3.1) from the 1.2.0.
>>> When I run this example with the mapr4 for 1.2.0 there is no issue in
>>> Mesos, everything runs as intended. Using the same package for 1.3.1 then
>>> it fails.
>>>
>>>  (Also of note, 1.2.1 gives a 404 error, 1.2.2 fails, and 1.3.0 fails
>>> as well).
>>>
>>>  So basically, when I used 1.2.0 and followed a set of steps, it worked
>>> on Mesos, while 1.3.1 fails.  This is a "current" version of Spark, yet
>>> MapR supports only 1.2.1.  (Still working on that.)
>>>
>>>  I guess I am at a loss right now on why this would be happening, any
>>> pointers on where I could look or what I could tweak would be greatly
>>> appreciated. Additionally, if there is something I could specifically draw
>>> to the attention of MapR on this problem please let me know, I am perplexed
>>> on the change from 1.2.0 to 1.3.1.
>>>
>>>  Thank you,
>>>
>>>  John
>>>
>>>
>>>
>>>
>>>  Full Error on 1.3.1 on Mesos:
>>> 15/05/19 09:31:26 INFO MemoryStore: MemoryStore started with capacity
>>> 1060.3 MB java.lang.NullPointerException at
>>> com.mapr

Re: Spark 1.4.0-rc4 HiveContext.table("db.tbl") NoSuchTableException

2015-06-05 Thread Doug Balog
Hi Yin,
 Thanks for the suggestion.
I’m not happy about this, and I don’t agree with your position that, since
it wasn’t an “officially” supported feature, no harm was done breaking it in
the course of implementing SPARK-6908. I would still argue that it changed,
and therefore broke, .table()’s API.
(As you know, I’ve filed 2 bugs regarding this SPARK-8105 and SPARK-8107)

I’m done complaining about this issue. 
My short term plan is to change my code for 1.4.0 and 
possibility work on a cleaner solution for 1.5.0 that will be acceptable.

Thanks for looking into it and responding to my initial email.

Doug


> On Jun 5, 2015, at 3:36 PM, Yin Huai  wrote:
> 
> Hi Doug,
> 
> For now, I think you can use "sqlContext.sql("USE databaseName")" to change 
> the current database.
> 
> Thanks,
> 
> Yin
> 
> On Thu, Jun 4, 2015 at 12:04 PM, Yin Huai  wrote:
> Hi Doug,
> 
> sqlContext.table does not officially support database name. It only supports 
> table name as the parameter. We will add a method to support database name in 
> future.
> 
> Thanks,
> 
> Yin
> 
> On Thu, Jun 4, 2015 at 8:10 AM, Doug Balog  wrote:
> Hi Yin,
>  I’m very surprised to hear that its not supported in 1.3 because I’ve been 
> using it since 1.3.0.
> It worked great up until  SPARK-6908 was merged into master.
> 
> What is the supported way to get  DF for a table that is not in the default 
> database ?
> 
> IMHO, If you are not going to support “databaseName.tableName”, 
> sqlContext.table() should have a version that takes a database and a table, ie
> 
> def table(databaseName: String, tableName: String): DataFrame =
>   DataFrame(this, catalog.lookupRelation(Seq(databaseName,tableName)))
> 
> The handling of databases in Spark(sqlContext, hiveContext, Catalog) could be 
> better.
> 
> Thanks,
> 
> Doug
> 
> > On Jun 3, 2015, at 8:21 PM, Yin Huai  wrote:
> >
> > Hi Doug,
> >
> > Actually, sqlContext.table does not support database name in both Spark 1.3 
> > and Spark 1.4. We will support it in future version.
> >
> > Thanks,
> >
> > Yin
> >
> >
> >
> > On Wed, Jun 3, 2015 at 10:45 AM, Doug Balog  
> > wrote:
> > Hi,
> >
> > sqlContext.table(“db.tbl”) isn’t working for me, I get a 
> > NoSuchTableException.
> >
> > But I can access the table via
> >
> > sqlContext.sql(“select * from db.tbl”)
> >
> > So I know it has the table info from the metastore.
> >
> > Anyone else see this ?
> >
> > I’ll keep digging.
> > I compiled via make-distribution  -Pyarn -phadoop-2.4 -Phive 
> > -Phive-thriftserver
> > It worked for me in 1.3.1
> >
> > Cheers,
> >
> > Doug
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
> 
> 
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkContext & Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 2:05 PM Will Briggs  wrote:

> Your lambda expressions on the RDDs in the SecondRollup class are closing
> around the context, and Spark has special logic to ensure that all
> variables in a closure used on an RDD are Serializable - I hate linking to
> Quora, but there's a good explanation here:
> http://www.quora.com/What-does-Closure-cleaner-func-mean-in-Spark
>

Ah, I see!  So if I broke out the lambda expressions into a method on an
object it would prevent this issue.  Essentially, "don't use lambda
expressions when using threads".

Thanks again, I appreciate the help.
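A stdlib-only sketch of what is going on (names are illustrative, not from the original code): Spark has to serialize the closure, and anything the closure drags along, such as an object holding the SparkContext, must be serializable. A bound method or lambda that captures the surrounding object pulls the whole object into the closure, while a top-level function with no captured state serializes as a simple reference:

```python
import pickle
import threading

class Rollup:
    """Stands in for a job class holding a SparkContext-like handle."""
    def __init__(self):
        # Like a SparkContext, a lock cannot be serialized.
        self.lock = threading.Lock()

    def transform(self, record):
        return record.upper()

def transform(record):
    """Top-level function: captures nothing, serializes by name."""
    return record.upper()

job = Rollup()

pickle.dumps(transform)  # fine: just a reference to the function

try:
    # The bound method drags the whole Rollup instance (and its lock)
    # into the pickle, analogous to a closure capturing the context.
    pickle.dumps(job.transform)
except TypeError as err:
    print("not serializable:", err)
```

(Spark uses its own closure serializer rather than plain pickle, but the failure mode is the same: mark the context field `@transient` or keep transformation functions free of references to it.)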


Re: SparkContext & Threading

2015-06-05 Thread Will Briggs
Your lambda expressions on the RDDs in the SecondRollup class are closing 
around the context, and Spark has special logic to ensure that all variables in 
a closure used on an RDD are Serializable - I hate linking to Quora, but 
there's a good explanation here: 
http://www.quora.com/What-does-Closure-cleaner-func-mean-in-Spark


On June 5, 2015, at 4:14 PM, Lee McFadden  wrote:



On Fri, Jun 5, 2015 at 12:58 PM Marcelo Vanzin  wrote:

You didn't show the error so the only thing we can do is speculate. You're 
probably sending the object that's holding the SparkContext reference over  the 
network at some point (e.g. it's used by a task run in an executor), and that's 
why you were getting that exception.


Apologies - the full error is as follows.  All I did here was remove the 
@transient annotation from the sc variable in my class constructor.  In 
addition, the full code for the classes and launching process is included below.


Error traceback:

```

Exception in thread "pool-5-thread-1" java.lang.Error: org.apache.spark.SparkException: Task not serializable
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Task not serializable
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)
        at org.apache.spark.rdd.RDD.map(RDD.scala:288)
        at io.icebrg.analytics.spark.SecondRollup.run(ConnTransforms.scala:33)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        ... 2 more
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
        ... 7 more

```


Code:

```

class SecondRollup(sc: SparkContext, connector: CassandraConnector, scanPoint: DateTime) extends Runnable with Serializable {

  def run {
    val conn = sc.cassandraTable("alpha_test", "sensor_readings")
      .select("data")
      .where("timep = ?", scanPoint)
      .where("sensorid IN ?", System.sensors)
      .map(r => Json.parse(r.getString("data")))
      .cache()

    conn.flatMap(AffectedIp.fromJson)
      .map(a => (AffectedIp.key(a), a))
      .reduceByKey(AffectedIp.reduce)
      .map(_._2)
      .map(a => AffectedIp.reduceWithCurrent(connector, a))
      .saveToCassandra("alpha_test", "affected_hosts")

    conn.flatMap(ServiceSummary.fromnJson)
      .map(s => (ServiceSummary.key(s), s))
      .reduceByKey(ServiceSummary.reduce)
      .map(_._2)
      .saveToCassandra("alpha_test", "service_summary_rollup")
  }
}

object Transforms {
  private val appNameBase = "Transforms%s"
  private val dtFormatter = DateTimeFormat.forPattern("MMddHH")

  def main(args: Array[String]) {
    if (args.size < 2) {
      println("""Usage: ConnTransforms  
             DateTime to start processing at. Format: MMddHH
               DateTime to end processing at.  Format: MMddHH""")
      sys.exit(1)
    }

    // withZoneRetainFields gives us a UTC time as specified on the command line.
    val start = dtFormatter.parseDateTime(args(0)).withZoneRetainFields(DateTimeZone.UTC)
    val end = dtFormatter.parseDateTime(args(1)).withZoneRetainFields(DateTimeZone.UTC)

    println("Processing rollups from %s to %s".format(start, end))

    // Create the spark context.
    val conf = new SparkConf()
      .setAppName(appNameBase.format("Test"))

    val connector = CassandraConnector(conf)

    val sc = new SparkContext(conf)

    // Set up the threadpool for running Jobs
Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Davies Liu
Thanks for letting us know.

On Fri, Jun 5, 2015 at 8:34 AM, Sam Stoelinga  wrote:
> Please ignore this whole thread. It started working out of nowhere; I'm
> not sure what the root cause was. After I restarted the VM, the previous
> SIFT code also started working.
>
> On Fri, Jun 5, 2015 at 10:40 PM, Sam Stoelinga 
> wrote:
>>
>> Thanks Davies. I will file a bug later with the code and a single image as
>> the dataset. Besides that, I can give anybody access to my Vagrant VM,
>> which already has Spark with OpenCV and the dataset available.
>>
>> Or you can set up the same Vagrant machine at your place. It is all
>> automated ^^
>> git clone https://github.com/samos123/computer-vision-cloud-platform
>> cd computer-vision-cloud-platform
>> ./scripts/setup.sh
>> vagrant ssh
>>
>> (Expect failures, I haven't cleaned up and tested it for other people) btw
>> I study at Tsinghua also currently.
>>
>> On Fri, Jun 5, 2015 at 2:43 PM, Davies Liu  wrote:
>>>
>>> Please file a bug here: https://issues.apache.org/jira/browse/SPARK/
>>>
>>> Could you also provide a way to reproduce this bug (including some
>>> datasets)?
>>>
>>> On Thu, Jun 4, 2015 at 11:30 PM, Sam Stoelinga 
>>> wrote:
>>> > I've changed the SIFT feature extraction to SURF feature extraction and
>>> > it
>>> > works...
>>> >
>>> > Following line was changed:
>>> > sift = cv2.xfeatures2d.SIFT_create()
>>> >
>>> > to
>>> >
>>> > sift = cv2.xfeatures2d.SURF_create()
>>> >
>>> > Where should I file this as a bug? When not running on Spark it works
>>> > fine
>>> > so I'm saying it's a spark bug.
>>> >
>>> > On Fri, Jun 5, 2015 at 2:17 PM, Sam Stoelinga 
>>> > wrote:
>>> >>
>>> >> Yeah, I should have emphasized that. I'm running the same code on the
>>> >> same VM: a VM with Spark in standalone mode, and I run the unit test
>>> >> directly on that same VM. So OpenCV works correctly on that machine,
>>> >> but when moving the exact same OpenCV code to Spark it just crashes.
>>> >>
>>> >> On Tue, Jun 2, 2015 at 5:06 AM, Davies Liu 
>>> >> wrote:
>>> >>>
>>> >>> Could you run the single thread version in worker machine to make
>>> >>> sure
>>> >>> that OpenCV is installed and configured correctly?
>>> >>>
>>> >>> On Sat, May 30, 2015 at 6:29 AM, Sam Stoelinga
>>> >>> 
>>> >>> wrote:
>>> >>> > I've verified the issue lies within Spark running OpenCV code and
>>> >>> > not
>>> >>> > within
>>> >>> > the sequence file BytesWritable formatting.
>>> >>> >
>>> >>> > This is the code that shows Spark itself is causing the failure:
>>> >>> > it does not use the sequence file as input at all, but runs the
>>> >>> > same function with the same input on Spark, and it still fails:
>>> >>> >
>>> >>> > def extract_sift_features_opencv(imgfile_imgbytes):
>>> >>> > imgfilename, discardsequencefile = imgfile_imgbytes
>>> >>> > imgbytes = bytearray(open("/tmp/img.jpg", "rb").read())
>>> >>> > nparr = np.fromstring(buffer(imgbytes), np.uint8)
>>> >>> > img = cv2.imdecode(nparr, 1)
>>> >>> > gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
>>> >>> > sift = cv2.xfeatures2d.SIFT_create()
>>> >>> > kp, descriptors = sift.detectAndCompute(gray, None)
>>> >>> > return (imgfilename, "test")
>>> >>> >
>>> >>> > And corresponding tests.py:
>>> >>> > https://gist.github.com/samos123/d383c26f6d47d34d32d6
>>> >>> >
>>> >>> >
>>> >>> > On Sat, May 30, 2015 at 8:04 PM, Sam Stoelinga
>>> >>> > 
>>> >>> > wrote:
>>> >>> >>
>>> >>> >> Thanks for the advice! The following line causes spark to crash:
>>> >>> >>
>>> >>> >> kp, descriptors = sift.detectAndCompute(gray, None)
>>> >>> >>
>>> >>> >> But I do need this line to be executed and the code does not crash
>>> >>> >> when
>>> >>> >> running outside of Spark but passing the same parameters. You're
>>> >>> >> saying
>>> >>> >> maybe the bytes from the sequencefile got somehow transformed and
>>> >>> >> don't
>>> >>> >> represent an image anymore causing OpenCV to crash the whole
>>> >>> >> python
>>> >>> >> executor.
>>> >>> >>
>>> >>> >> On Fri, May 29, 2015 at 2:06 AM, Davies Liu
>>> >>> >> 
>>> >>> >> wrote:
>>> >>> >>>
>>> >>> >>> Could you try to comment out some lines in
>>> >>> >>> `extract_sift_features_opencv` to find which line cause the
>>> >>> >>> crash?
>>> >>> >>>
>>> >>> >>> If the bytes came from sequenceFile() is broken, it's easy to
>>> >>> >>> crash a
>>> >>> >>> C library in Python (OpenCV).
>>> >>> >>>
>>> >>> >>> On Thu, May 28, 2015 at 8:33 AM, Sam Stoelinga
>>> >>> >>> 
>>> >>> >>> wrote:
>>> >>> >>> > Hi sparkers,
>>> >>> >>> >
>>> >>> >>> > I am working on a PySpark application which uses the OpenCV
>>> >>> >>> > library. It
>>> >>> >>> > runs
>>> >>> >>> > fine when running the code locally but when I try to run it on
>>> >>> >>> > Spark on
>>> >>> >>> > the
>>> >>> >>> > same Machine it crashes the worker.
>>> >>> >>> >
>>> >>> >>> > The code can be found here:
>>> >>> >>> > https://gist.github.com/samos123/885f9fe87c8fa5abf78f
>>> >>> >>> >
>>> >>> >>> > This is the 

Job aborted

2015-06-05 Thread gibbo87
I'm running PageRank on datasets with different sizes (from 1GB to 100GB).
Sometime my job is aborted showing this error:

Job aborted due to stage failure: Task 0 in stage 4.1 failed 4 times, most
recent failure: Lost task 0.3 in stage 4.1 (TID 2051, 9.12.247.250):
java.io.FileNotFoundException:
/tmp/spark-23fcf792-e281-4d0d-a55e-4db2c31e7f10/executor-d4a67fea-3d6b-4e93-ad92-53221fa92f2b/blockmgr-96e4bd04-00ba-4893-a65a-cec09ec2dc52/17/rdd_3_0
(No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:110)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:510)
at 
org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:427)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:616)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:

I cannot find the cause of this, can someone help?
I'm running Spark 1.4.0-rc3.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Job-aborted-tp23185.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkContext & Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 1:00 PM Igor Berman  wrote:

> Lee, what cluster do you use? standalone, yarn-cluster, yarn-client, mesos?
>

Spark standalone, v1.2.1.


Re: SparkContext & Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 12:58 PM Marcelo Vanzin  wrote:

> You didn't show the error so the only thing we can do is speculate. You're
> probably sending the object that's holding the SparkContext reference over
> the network at some point (e.g. it's used by a task run in an executor),
> and that's why you were getting that exception.
>

Apologies - the full error is as follows.  All I did here was remove the
@transient annotation from the sc variable in my class constructor.  In
addition, the full code for the classes and launching process is included
below.

Error traceback:
```
Exception in thread "pool-5-thread-1" java.lang.Error:
org.apache.spark.SparkException: Task not serializable
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Task not serializable
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)
at org.apache.spark.rdd.RDD.map(RDD.scala:288)
at
io.icebrg.analytics.spark.SecondRollup.run(ConnTransforms.scala:33)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
... 2 more
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 7 more
```

Code:
```
class SecondRollup(sc: SparkContext, connector: CassandraConnector,
scanPoint: DateTime) extends Runnable with Serializable {

  def run {
val conn = sc.cassandraTable("alpha_test", "sensor_readings")
  .select("data")
  .where("timep = ?", scanPoint)
  .where("sensorid IN ?", System.sensors)
  .map(r => Json.parse(r.getString("data")))
  .cache()

conn.flatMap(AffectedIp.fromJson)
  .map(a => (AffectedIp.key(a), a))
  .reduceByKey(AffectedIp.reduce)
  .map(_._2)
  .map(a => AffectedIp.reduceWithCurrent(connector, a))
  .saveToCassandra("alpha_test", "affected_hosts")

conn.flatMap(ServiceSummary.fromJson)
  .map(s => (ServiceSummary.key(s), s))
  .reduceByKey(ServiceSummary.reduce)
  .map(_._2)
  .saveToCassandra("alpha_test", "service_summary_rollup")

  }
}

object Transforms {
  private val appNameBase = "Transforms%s"
  private val dtFormatter = DateTimeFormat.forPattern("MMddHH")

  def main(args: Array[String]) {
if (args.size < 2) {
  println("""Usage: ConnTransforms  
 DateTime to start processing at. Format: MMddHH
   DateTime to end processing at.  Format: MMddHH""")
  sys.exit(1)
}

// withZoneRetainFields gives us a UTC time as specified on the command line.
val start =
dtFormatter.parseDateTime(args(0)).withZoneRetainFields(DateTimeZone.UTC)
val end =
dtFormatter.parseDateTime(args(1)).withZoneRetainFields(DateTimeZone.UTC)

println("Processing rollups from %s to %s".format(start, end))

// Create the spark context.
val conf = new SparkConf()
  .setAppName(appNameBase.format("Test"))

val connector = CassandraConnector(conf)

val sc = new SparkContext(conf)

// Set up the threadpool for running Jobs.
val pool = Executors.newFixedThreadPool(10)

pool.execute(new SecondRollup(sc, connector, start))

//for (dt <- new TimeRanger(start, end)) {
//  // Always run the second rollups.
//  pool.execute(new SecondRollup(sc, connector, dt))
//  if (Granularity.Minute.isGranularity(dt)) pool.execute(new MinuteRollup(sc, connector, dt))
//}

// stop the pool from accepting new tasks
pool.shutdown()

// We've submitted all the tasks.

Re: SparkContext & Threading

2015-06-05 Thread Igor Berman
Lee, what cluster do you use? standalone, yarn-cluster, yarn-client, mesos?
in yarn-cluster the driver program is executed inside one of nodes in
cluster, so might be that driver code needs to be serialized to be sent to
some node

On 5 June 2015 at 22:55, Lee McFadden  wrote:

>
> On Fri, Jun 5, 2015 at 12:30 PM Marcelo Vanzin 
> wrote:
>
>> Ignoring the serialization thing (seems like a red herring):
>>
>
> People seem surprised that I'm getting the Serialization exception at all
> - I'm not convinced it's a red herring per se, but on to the blocking
> issue...
>
>
>>
>>
> You might be using this Cassandra library with an incompatible version of
>> Spark; the `TaskMetrics` class has changed in the past, and the method it's
>> looking for does not exist at least in 1.4.
>>
>>
> You are correct, I was being a bone head.  We recently downgraded to Spark
> 1.2.1 and I was running the compiled jar using Spark 1.3.1 on my local
> machine.  Running the job with threading on my 1.2.1 cluster worked.  Thank
> you for finding the obvious mistake :)
>
> Regarding serialization, I'm still confused as to why I was getting a
> serialization error in the first place as I'm executing these Runnable
> classes from a java thread pool.  I'm fairly new to Scala/JVM world and
> there doesn't seem to be any Spark documentation to explain *why* I need to
> declare the sc variable as @transient (or even that I should).
>
> I was under the impression that objects only need to be serializable when
> they are sent over the network, and that doesn't seem to be occurring as
> far as I can tell.
>
> Apologies if this is simple stuff, but I don't like "fixing things"
> without knowing the full reason why the changes I made fixed things :)
>
> Thanks again for your time!
>


Re: SparkContext & Threading

2015-06-05 Thread Marcelo Vanzin
On Fri, Jun 5, 2015 at 12:55 PM, Lee McFadden  wrote:

> Regarding serialization, I'm still confused as to why I was getting a
> serialization error in the first place as I'm executing these Runnable
> classes from a java thread pool.  I'm fairly new to Scala/JVM world and
> there doesn't seem to be any Spark documentation to explain *why* I need to
> declare the sc variable as @transient (or even that I should).
>
> I was under the impression that objects only need to be serializable when
> they are sent over the network, and that doesn't seem to be occurring as
> far as I can tell.
>

You didn't show the error so the only thing we can do is speculate. You're
probably sending the object that's holding the SparkContext reference over
the network at some point (e.g. it's used by a task run in an executor),
and that's why you were getting that exception.
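The mechanism Marcelo describes can be sketched outside Spark with plain Python pickling: an object that holds a reference to something non-serializable fails to serialize unless that field is excluded, which is roughly what the @transient annotation accomplishes on the JVM. (Illustrative only; FakeContext and Job are made-up stand-ins, not Spark classes.)

```python
import pickle

# Stand-in for something unpicklable, like a SparkContext holding live sockets.
class FakeContext:
    def __reduce__(self):
        raise TypeError("FakeContext is not serializable")

class Job:
    def __init__(self, ctx, name):
        self.ctx = ctx    # analogous to a class holding a SparkContext reference
        self.name = name

    def __getstate__(self):
        # Roughly analogous to @transient: drop the context before serializing.
        state = self.__dict__.copy()
        del state["ctx"]
        return state

job = Job(FakeContext(), "rollup")
restored = pickle.loads(pickle.dumps(job))  # succeeds because ctx is excluded
print(restored.name)                        # -> rollup
print(hasattr(restored, "ctx"))             # -> False
```

Without the __getstate__ method, pickle.dumps would raise the TypeError — the same shape of failure as the NotSerializableException on SparkContext above.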


-- 
Marcelo


Re: SparkContext & Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 12:30 PM Marcelo Vanzin  wrote:

> Ignoring the serialization thing (seems like a red herring):
>

People seem surprised that I'm getting the Serialization exception at all -
I'm not convinced it's a red herring per se, but on to the blocking issue...


>
>
You might be using this Cassandra library with an incompatible version of
> Spark; the `TaskMetrics` class has changed in the past, and the method it's
> looking for does not exist at least in 1.4.
>
>
You are correct, I was being a bone head.  We recently downgraded to Spark
1.2.1 and I was running the compiled jar using Spark 1.3.1 on my local
machine.  Running the job with threading on my 1.2.1 cluster worked.  Thank
you for finding the obvious mistake :)

Regarding serialization, I'm still confused as to why I was getting a
serialization error in the first place as I'm executing these Runnable
classes from a java thread pool.  I'm fairly new to Scala/JVM world and
there doesn't seem to be any Spark documentation to explain *why* I need to
declare the sc variable as @transient (or even that I should).

I was under the impression that objects only need to be serializable when
they are sent over the network, and that doesn't seem to be occurring as
far as I can tell.

Apologies if this is simple stuff, but I don't like "fixing things" without
knowing the full reason why the changes I made fixed things :)

Thanks again for your time!


Job aborted

2015-06-05 Thread Giovanni Paolo Gibilisco
I'm running PageRank on datasets with different sizes (from 1GB to 100GB).
Sometime my job is aborted showing this error:

Job aborted due to stage failure: Task 0 in stage 4.1 failed 4 times,
most recent failure: Lost task 0.3 in stage 4.1 (TID 2051,
9.12.247.250): java.io.FileNotFoundException:
/tmp/spark-23fcf792-e281-4d0d-a55e-4db2c31e7f10/executor-d4a67fea-3d6b-4e93-ad92-53221fa92f2b/blockmgr-96e4bd04-00ba-4893-a65a-cec09ec2dc52/17/rdd_3_0
(No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:110)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:510)
at 
org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:427)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:616)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:


I cannot find the cause of this, can someone help?
I'm running Spark 1.4.0-rc3.


Re: Spark 1.4.0-rc4 HiveContext.table("db.tbl") NoSuchTableException

2015-06-05 Thread Yin Huai
Hi Doug,

For now, I think you can use "sqlContext.sql("USE databaseName")" to change
the current database.

Thanks,

Yin

On Thu, Jun 4, 2015 at 12:04 PM, Yin Huai  wrote:

> Hi Doug,
>
> sqlContext.table does not officially support database name. It only
> supports table name as the parameter. We will add a method to support
> database name in future.
>
> Thanks,
>
> Yin
>
> On Thu, Jun 4, 2015 at 8:10 AM, Doug Balog 
> wrote:
>
>> Hi Yin,
>>  I’m very surprised to hear that its not supported in 1.3 because I’ve
>> been using it since 1.3.0.
>> It worked great up until  SPARK-6908 was merged into master.
>>
>> What is the supported way to get  DF for a table that is not in the
>> default database ?
>>
>> IMHO, if you are not going to support “databaseName.tableName”,
>> sqlContext.table() should have a version that takes a database and a table,
>> ie
>>
>> def table(databaseName: String, tableName: String): DataFrame =
>>   DataFrame(this, catalog.lookupRelation(Seq(databaseName,tableName)))
>>
>> The handling of databases in Spark(sqlContext, hiveContext, Catalog)
>> could be better.
>>
>> Thanks,
>>
>> Doug
>>
>> > On Jun 3, 2015, at 8:21 PM, Yin Huai  wrote:
>> >
>> > Hi Doug,
>> >
>> > Actually, sqlContext.table does not support database name in both Spark
>> 1.3 and Spark 1.4. We will support it in future version.
>> >
>> > Thanks,
>> >
>> > Yin
>> >
>> >
>> >
>> > On Wed, Jun 3, 2015 at 10:45 AM, Doug Balog 
>> wrote:
>> > Hi,
>> >
>> > sqlContext.table(“db.tbl”) isn’t working for me, I get a
>> NoSuchTableException.
>> >
>> > But I can access the table via
>> >
>> > sqlContext.sql(“select * from db.tbl”)
>> >
>> > So I know it has the table info from the metastore.
>> >
>> > Anyone else see this ?
>> >
>> > I’ll keep digging.
>> > I compiled via make-distribution  -Pyarn -phadoop-2.4 -Phive
>> -Phive-thriftserver
>> > It worked for me in 1.3.1
>> >
>> > Cheers,
>> >
>> > Doug
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>> >
>>
>>
>


Re: Spark 1.3.1 On Mesos Issues.

2015-06-05 Thread Tim Chen
It seems like there is another thread going on:

http://answers.mapr.com/questions/163353/spark-from-apache-downloads-site-for-mapr.html

I'm not particularly sure why; it seems the problem is that getting the
current context class loader is returning null in this instance.

Do you have some repro steps or a config with which we can try this?

Tim

On Fri, Jun 5, 2015 at 3:40 AM, Steve Loughran 
wrote:

>
>  On 2 Jun 2015, at 00:14, Dean Wampler  wrote:
>
>  It would be nice to see the code for MapR FS Java API, but my google foo
> failed me (assuming it's open source)...
>
>
>  I know that MapRFS is closed source, don't know about the java JAR. Why
> not ask Ted Dunning (cc'd)  nicely to see if he can track down the stack
> trace for you.
>
>   So, shooting in the dark ;) there are a few things I would check, if
> you haven't already:
>
>  1. Could there be 1.2 versions of some Spark jars that get picked up at
> run time (but apparently not in local mode) on one or more nodes? (Side
> question: Does your node experiment fail on all nodes?) Put another way,
> are the classpaths good for all JVM tasks?
> 2. Can you use just MapR and Spark 1.3.1 successfully, bypassing Mesos?
>
>  Incidentally, how are you combining Mesos and MapR? Are you running
> Spark in Mesos, but accessing data in MapR-FS?
>
>  Perhaps the MapR "shim" library doesn't support Spark 1.3.1.
>
>  HTH,
>
>  dean
>
>  Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Mon, Jun 1, 2015 at 2:49 PM, John Omernik  wrote:
>
>> All -
>>
>>  I am facing and odd issue and I am not really sure where to go for
>> support at this point.  I am running MapR which complicates things as it
>> relates to Mesos, however this HAS worked in the past with no issues so I
>> am stumped here.
>>
>>  So for starters, here is what I am trying to run. This is a simple show
>> tables using the Hive Context:
>>
>>  from pyspark import SparkContext, SparkConf
>> from pyspark.sql import SQLContext, Row, HiveContext
>> sparkhc = HiveContext(sc)
>> test = sparkhc.sql("show tables")
>> for r in test.collect():
>>   print r
>>
>>  When I run it on 1.3.1 using ./bin/pyspark --master local, this works
>> with no issues.
>>
>>  When I run it using Mesos with all the settings configured (as they had
>> worked in the past) I get lost tasks, and when I zoom in on them, the error
>> that is being reported is below.  Basically it's a NullPointerException on
>> the com.mapr.fs.ShimLoader.  What's weird to me is is I took each instance
>> and compared both together, the class path, everything is exactly the same.
>> Yet running in local mode works, and running in mesos fails.  Also of note,
>> when the task is scheduled to run on the same node as when I run locally,
>> that fails too! (Baffling).
>>
>>  Ok, for comparison, how I configured Mesos was to download the mapr4
>> package from spark.apache.org.  Using the exact same configuration file
>> (except for changing the executor tgz from 1.2.0 to 1.3.1) from the 1.2.0.
>> When I run this example with the mapr4 for 1.2.0 there is no issue in
>> Mesos, everything runs as intended. Using the same package for 1.3.1 then
>> it fails.
>>
>>  (Also of note, 1.2.1 gives a 404 error, 1.2.2 fails, and 1.3.0 fails as
>> well).
>>
>>  So basically, when I used 1.2.0 and followed a set of steps, it worked
>> on Mesos, while 1.3.1 fails.  Even though 1.3.1 is a "current" version of Spark,
>> MapR supports 1.2.1 only.  (Still working on that).
>>
>>  I guess I am at a loss right now on why this would be happening, any
>> pointers on where I could look or what I could tweak would be greatly
>> appreciated. Additionally, if there is something I could specifically draw
>> to the attention of MapR on this problem please let me know, I am perplexed
>> on the change from 1.2.0 to 1.3.1.
>>
>>  Thank you,
>>
>>  John
>>
>>
>>
>>
>>  Full Error on 1.3.1 on Mesos:
>> 15/05/19 09:31:26 INFO MemoryStore: MemoryStore started with capacity
>> 1060.3 MB java.lang.NullPointerException at
>> com.mapr.fs.ShimLoader.getRootClassLoader(ShimLoader.java:96) at
>> com.mapr.fs.ShimLoader.injectNativeLoader(ShimLoader.java:232) at
>> com.mapr.fs.ShimLoader.load(ShimLoader.java:194) at
>> org.apache.hadoop.conf.CoreDefaultProperties.<clinit>(CoreDefaultProperties.java:60)
>> at java.lang.Class.forName0(Native Method) at
>> java.lang.Class.forName(Class.java:274) at
>> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1847)
>> at
>> org.apache.hadoop.conf.Configuration.getProperties(Configuration.java:2062)
>> at
>> org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2272)
>> at
>> org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2224)
>> at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2141)
>> at org.apache.hadoop.conf.Configu

Re: SparkContext & Threading

2015-06-05 Thread Marcelo Vanzin
Ignoring the serialization thing (seems like a red herring):

On Fri, Jun 5, 2015 at 11:48 AM, Lee McFadden  wrote:

> 15/06/05 11:35:32 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> localhost): java.lang.NoSuchMethodError:
> org.apache.spark.executor.TaskMetrics.inputMetrics_$eq(Lscala/Option;)V
> at
> com.datastax.spark.connector.metrics.InputMetricsUpdater$.apply(InputMetricsUpdater.scala:61)
> at
> com.datastax.spark.connector.rdd.CassandraTableScanRDD.compute(CassandraTableScanRDD.scala:196)
>

You might be using this Cassandra library with an incompatible version of
Spark; the `TaskMetrics` class has changed in the past, and the method it's
looking for does not exist at least in 1.4.

-- 
Marcelo


Re: SparkContext & Threading

2015-06-05 Thread Igor Berman
+1 to the question about serialization. The SparkContext is still in the driver
process (even if it has several threads from which you submit jobs).
As for the problem: check your classpath, Scala version, Spark version, etc.
Such errors usually happen when there is some conflict in the classpath. Maybe
you compiled your jar with different versions?

On 5 June 2015 at 21:55, Lee McFadden  wrote:

> You can see an example of the constructor for the class which executes a
> job in my opening post.
>
> I'm attempting to instantiate and run the class using the code below:
>
> ```
> val conf = new SparkConf()
>   .setAppName(appNameBase.format("Test"))
>
> val connector = CassandraConnector(conf)
>
> val sc = new SparkContext(conf)
>
> // Set up the threadpool for running Jobs.
> val pool = Executors.newFixedThreadPool(10)
>
> pool.execute(new SecondRollup(sc, connector, start))
> ```
>
> There is some surrounding code that then waits for all the jobs entered
> into the thread pool to complete, although it's not really required at the
> moment as I am only submitting one job until I get this issue straightened
> out :)
>
> Thanks,
>
> Lee
>
> On Fri, Jun 5, 2015 at 11:50 AM Marcelo Vanzin 
> wrote:
>
>> On Fri, Jun 5, 2015 at 11:48 AM, Lee McFadden  wrote:
>>
>>> Initially I had issues passing the SparkContext to other threads as it
>>> is not serializable.  Eventually I found that adding the @transient
>>> annotation prevents a NotSerializableException.
>>>
>>
>> This is really puzzling. How are you passing the context around that you
>> need to do serialization?
>>
>> Threads run all in the same process so serialization should not be needed
>> at all.
>>
>> --
>> Marcelo
>>
>


Re: SparkContext & Threading

2015-06-05 Thread Lee McFadden
You can see an example of the constructor for the class which executes a
job in my opening post.

I'm attempting to instantiate and run the class using the code below:

```
val conf = new SparkConf()
  .setAppName(appNameBase.format("Test"))

val connector = CassandraConnector(conf)

val sc = new SparkContext(conf)

// Set up the threadpool for running Jobs.
val pool = Executors.newFixedThreadPool(10)

pool.execute(new SecondRollup(sc, connector, start))
```

There is some surrounding code that then waits for all the jobs entered
into the thread pool to complete, although it's not really required at the
moment as I am only submitting one job until I get this issue straightened
out :)

Thanks,

Lee

On Fri, Jun 5, 2015 at 11:50 AM Marcelo Vanzin  wrote:

> On Fri, Jun 5, 2015 at 11:48 AM, Lee McFadden  wrote:
>
>> Initially I had issues passing the SparkContext to other threads as it is
>> not serializable.  Eventually I found that adding the @transient annotation
>> prevents a NotSerializableException.
>>
>
> This is really puzzling. How are you passing the context around that you
> need to do serialization?
>
> Threads run all in the same process so serialization should not be needed
> at all.
>
> --
> Marcelo
>


Re: SparkContext & Threading

2015-06-05 Thread Marcelo Vanzin
On Fri, Jun 5, 2015 at 11:48 AM, Lee McFadden  wrote:

> Initially I had issues passing the SparkContext to other threads as it is
> not serializable.  Eventually I found that adding the @transient annotation
> prevents a NotSerializableException.
>

This is really puzzling. How are you passing the context around that you
need to do serialization?

Threads run all in the same process so serialization should not be needed
at all.
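The point can be illustrated outside Spark with a plain thread pool: every worker thread holds a direct reference to the same in-process object, so nothing is pickled or copied. (Minimal sketch; Context and run_job are made-up names, not Spark APIs.)

```python
from concurrent.futures import ThreadPoolExecutor

class Context:
    """Stand-in for a shared, process-local resource such as a SparkContext."""
    def __init__(self):
        self.submitted = []

ctx = Context()

def run_job(name):
    ctx.submitted.append(name)  # direct reference from the worker thread
    return name.upper()

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_job, ["a", "b", "c"]))

print(results)             # -> ['A', 'B', 'C']
print(len(ctx.submitted))  # -> 3
```

Serialization only enters the picture when the object is captured by a closure that Spark ships to executors over the network, which is what the ClosureCleaner stack trace earlier in the thread shows.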

-- 
Marcelo


SparkContext & Threading

2015-06-05 Thread Lee McFadden
Hi all,

I'm having some issues finding any kind of best practices when attempting
to create Spark applications which launch jobs from a thread pool.

Initially I had issues passing the SparkContext to other threads as it is
not serializable.  Eventually I found that adding the @transient annotation
prevents a NotSerializableException.

```
class SecondRollup(@transient sc: SparkContext, connector:
CassandraConnector, scanPoint: DateTime) extends Runnable with Serializable
{
...
}
```

However, now I am running into a different exception:

```
15/06/05 11:35:32 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
localhost): java.lang.NoSuchMethodError:
org.apache.spark.executor.TaskMetrics.inputMetrics_$eq(Lscala/Option;)V
at
com.datastax.spark.connector.metrics.InputMetricsUpdater$.apply(InputMetricsUpdater.scala:61)
at
com.datastax.spark.connector.rdd.CassandraTableScanRDD.compute(CassandraTableScanRDD.scala:196)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
```

The documentation (https://spark.apache.org/docs/latest/job-scheduling.html)
explicitly states that jobs can be submitted by multiple threads but I seem
to be doing *something* incorrectly and haven't found any docs to point me
in the right direction.

Does anyone have any advice on how to get jobs submitted by multiple
threads?  The jobs are fairly simple and work when I run them serially, so
I'm not exactly sure what I'm doing wrong.

Thanks,

Lee


Removing Keys from a MapType

2015-06-05 Thread chrish2312
Hello!

I am working a column of Maps with dataframes, and I was wondering if there
was a good way of removing a set of keys and their associated values from
that columns. I've been using a UDF, but if there was some built in function
that I'm missing, I'd love to know.

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Removing-Keys-from-a-MapType-tp23183.html



Re: Pregel runs slower and slower when each Pregel has data dependency

2015-06-05 Thread dash
Hi Heather,

Please check this issue https://issues.apache.org/jira/browse/SPARK-4672. I 
think you can solve this problem by checkpointing your data every several 
iterations.
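The general shape of that fix — every k iterations, write the current state out and read it back, so that later iterations no longer reference the whole chain of earlier ones — can be sketched generically. (Plain Python, not the actual GraphX checkpoint API; step() is a placeholder for one Pregel superstep.)

```python
import os
import pickle
import tempfile

CHECKPOINT_EVERY = 3

def step(state):
    # Placeholder for one superstep producing the next state.
    return state + [len(state)]

state = []
ckpt_dir = tempfile.mkdtemp()
for i in range(10):
    state = step(state)
    if (i + 1) % CHECKPOINT_EVERY == 0:
        # Materialize the state; this is where a long RDD lineage gets cut.
        path = os.path.join(ckpt_dir, "iter_%d.pkl" % i)
        with open(path, "wb") as f:
            pickle.dump(state, f)
        with open(path, "rb") as f:
            state = pickle.load(f)  # continue from the materialized copy

print(len(state))  # -> 10
```

In GraphX the equivalent move is to set a checkpoint directory and checkpoint the graph's RDDs periodically so each Pregel iteration does not grow the lineage unboundedly.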

Hope that helps.

Best regards,

Baoxu(Dash) Shi
Computer Science and Engineering Department
University of Notre Dame




> On Jun 2, 2015, at 10:13 PM, heather [via Apache Spark User List] 
>  wrote:
> 
> hello, have you solved your problem yet? I met this problem too.
> 





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pregel-runs-slower-and-slower-when-each-Pregel-has-data-dependency-tp7648p23181.html

Re: Shuffle strange error

2015-06-05 Thread octavian.ganea
Solved: it was SPARK_PID_DIR from spark-env.sh. Changing this directory from /tmp
to something different actually changed the error that I got, now showing where
the actual error was coming from (a null pointer in my program). The first
error was not helpful at all, though.
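For reference, a minimal config fragment along those lines — the paths are assumptions, the directories must exist on every node, and the daemons need a restart to pick the settings up:

```
# spark-env.sh
export SPARK_PID_DIR=/var/run/spark

# spark-defaults.conf (moves shuffle/scratch files off /tmp)
spark.local.dir    /data/spark-tmp
```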



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-strange-error-tp23179p23180.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Shuffle strange error

2015-06-05 Thread octavian.ganea
Hi all,

I'm using spark 1.3.1 and ran the following code:

sc.textFile(path)
.map(line => (getEntId(line), line))
.persist(StorageLevel.MEMORY_AND_DISK)
.groupByKey
.flatMap(x => func(x))
.reduceByKey((a,b) => (a + b).toShort)

I get the following error in flatMap() and it's very hard to understand
what's happening. Can someone please help me with this? Also, how do I
force Spark not to use /tmp? I tried changing java.io.tmpdir and
spark.local.dir, but without success. Thanks!


java.io.FileNotFoundException:
/tmp/spark-f6dd2621-6be0-4ddb-a31f-97e0f77eae87/spark-8391db48-fb2b-44a2-b34a-153b16b97292/spark-03e37c8c-95c7-47d4-9ac7-a69fbdfdf3a6/blockmgr-ed70dbea-5747-4765-8942-30dd0b1a4258/16/shuffle_2_31_0.data
(No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:221)
at
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:130)
at
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:201)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5$$anonfun$apply$2.apply(ExternalSorter.scala:759)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5$$anonfun$apply$2.apply(ExternalSorter.scala:758)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at
org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:823)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5.apply(ExternalSorter.scala:758)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5.apply(ExternalSorter.scala:754)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:754)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-strange-error-tp23179.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark Streaming for Each RDD - Exception on Empty

2015-06-05 Thread John Omernik
I am using Spark 1.3.1, so I don't have the 1.4.0 isEmpty(). I guess I am
curious about the right approach here. Like I said in my original post,
perhaps this isn't "bad", but the exceptions bother me at a programmer
level... is that wrong? :)



On Fri, Jun 5, 2015 at 11:07 AM, Ted Yu  wrote:

> John:
> Which Spark release are you using ?
> As of 1.4.0, RDD has this method:
>
>   def isEmpty(): Boolean = withScope {
>
> FYI
>
> On Fri, Jun 5, 2015 at 9:01 AM, Evo Eftimov  wrote:
>
>> Foreachpartition callback is provided with Iterator by the Spark
>> Frameowrk – while iterator.hasNext() ……
>>
>>
>>
>> Also check whether this is not some sort of Python Spark API bug – Python
>> seems to be the foster child here – Scala and Java are the darlings
>>
>>
>>
>> *From:* John Omernik [mailto:j...@omernik.com]
>> *Sent:* Friday, June 5, 2015 4:08 PM
>> *To:* user
>> *Subject:* Spark Streaming for Each RDD - Exception on Empty
>>
>>
>>
>> Is there pythonic/sparkonic way to test for an empty RDD before using the
>> foreachRDD?  Basically I am using the Python example
>> https://spark.apache.org/docs/latest/streaming-programming-guide.html to
>> "put records somewhere"  When I have data, it works fine, when I don't I
>> get an exception. I am not sure about the performance implications of just
>> throwing an exception every time there is no data, but can I just test
>> before sending it?
>>
>>
>>
>> I did see one post mentioning look for take(1) from the stream to test
>> for data, but I am not sure where I put that in this example... Is that in
>> the lambda function? or somewhere else? Looking for pointers!
>>
>> Thanks!
>>
>>
>>
>>
>>
>>
>>
>> mydstream.foreachRDD(lambda rdd: rdd.foreachPartition(parseRDD))
>>
>>
>>
>>
>>
>> Using this example code from the link above:
>>
>>
>>
>> *def* sendPartition(iter):
>>
>> connection = createNewConnection()
>>
>> *for* record *in* iter:
>>
>> connection.send(record)
>>
>> connection.close()
>>
>>
>>
>> dstream.foreachRDD(*lambda* rdd: rdd.foreachPartition(sendPartition))
>>
>>
>


Re: Spark Streaming for Each RDD - Exception on Empty

2015-06-05 Thread Ted Yu
John:
Which Spark release are you using ?
As of 1.4.0, RDD has this method:

  def isEmpty(): Boolean = withScope {

FYI

On Fri, Jun 5, 2015 at 9:01 AM, Evo Eftimov  wrote:

> Foreachpartition callback is provided with Iterator by the Spark Frameowrk
> – while iterator.hasNext() ……
>
>
>
> Also check whether this is not some sort of Python Spark API bug – Python
> seems to be the foster child here – Scala and Java are the darlings
>
>
>
> *From:* John Omernik [mailto:j...@omernik.com]
> *Sent:* Friday, June 5, 2015 4:08 PM
> *To:* user
> *Subject:* Spark Streaming for Each RDD - Exception on Empty
>
>
>
> Is there pythonic/sparkonic way to test for an empty RDD before using the
> foreachRDD?  Basically I am using the Python example
> https://spark.apache.org/docs/latest/streaming-programming-guide.html to
> "put records somewhere"  When I have data, it works fine, when I don't I
> get an exception. I am not sure about the performance implications of just
> throwing an exception every time there is no data, but can I just test
> before sending it?
>
>
>
> I did see one post mentioning look for take(1) from the stream to test for
> data, but I am not sure where I put that in this example... Is that in the
> lambda function? or somewhere else? Looking for pointers!
>
> Thanks!
>
>
>
>
>
>
>
> mydstream.foreachRDD(lambda rdd: rdd.foreachPartition(parseRDD))
>
>
>
>
>
> Using this example code from the link above:
>
>
>
> *def* sendPartition(iter):
>
> connection = createNewConnection()
>
> *for* record *in* iter:
>
> connection.send(record)
>
> connection.close()
>
>
>
> dstream.foreachRDD(*lambda* rdd: rdd.foreachPartition(sendPartition))
>
>


RE: Spark Streaming for Each RDD - Exception on Empty

2015-06-05 Thread Evo Eftimov
The foreachPartition callback is provided with an Iterator by the Spark Framework – 
while iterator.hasNext() ……

 

Also check whether this is not some sort of Python Spark API bug – Python seems 
to be the foster child here – Scala and Java are the darlings

 

From: John Omernik [mailto:j...@omernik.com] 
Sent: Friday, June 5, 2015 4:08 PM
To: user
Subject: Spark Streaming for Each RDD - Exception on Empty

 

Is there pythonic/sparkonic way to test for an empty RDD before using the 
foreachRDD?  Basically I am using the Python example 
https://spark.apache.org/docs/latest/streaming-programming-guide.html to "put 
records somewhere"  When I have data, it works fine, when I don't I get an 
exception. I am not sure about the performance implications of just throwing an 
exception every time there is no data, but can I just test before sending it?

 

I did see one post mentioning look for take(1) from the stream to test for 
data, but I am not sure where I put that in this example... Is that in the 
lambda function? or somewhere else? Looking for pointers!

Thanks!

 

 

 

mydstream.foreachRDD(lambda rdd: rdd.foreachPartition(parseRDD))

 

 

Using this example code from the link above:

 

def sendPartition(iter):
    connection = createNewConnection()
    for record in iter:
        connection.send(record)
    connection.close()

dstream.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))


Spark SQL and Streaming Results

2015-06-05 Thread Pietro Gentile
Hi all,


What is the best way to perform Spark SQL queries and obtain the result
tuples in a streaming fashion? In particular, I want to aggregate data and
obtain the first, incomplete results quickly, with the results updated
until the aggregation is complete.

Best Regards.
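One way to get incrementally updated aggregates in Spark Streaming is updateStateByKey, which re-emits the running (partial) per-key result after every batch. A sketch — the stream and key names are assumptions, and checkpointing must be enabled for stateful operations:

```python
def update_sum(new_values, running_total):
    """Fold each batch's values into the running per-key aggregate."""
    return (running_total or 0) + sum(new_values)

# kv_dstream is assumed to be a DStream of (key, value) pairs:
# running_totals = kv_dstream.updateStateByKey(update_sum)
```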


Saving compressed textFiles from a DStream in Scala

2015-06-05 Thread doki_pen
It looks like saveAsTextFiles doesn't support the compression parameter of
RDD.saveAsTextFile. Is there a way to add the functionality in my client
code without patching Spark? I tried making my own saveFunc function and
calling DStream.foreachRDD but ran into trouble with invoking rddToFileName
and making the RDD type parameter work properly. It's probably just due to my
lack of Scala knowledge. Can anyone give me a hand?
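A client-side workaround, without patching Spark, is to drop down to foreachRDD and call RDD.saveAsTextFile with a codec yourself, rebuilding the prefix-TIME.suffix naming that saveAsTextFiles uses internally. A PySpark-flavoured sketch — the prefix, codec, and naming convention are assumptions:

```python
def batch_file_name(prefix, time_ms, suffix=""):
    """Mimic Spark's internal rddToFileName convention: 'prefix-TIME[.suffix]'."""
    name = "%s-%d" % (prefix, time_ms)
    return "%s.%s" % (name, suffix) if suffix else name

# def save_compressed(time, rdd):
#     rdd.saveAsTextFile(
#         batch_file_name("hdfs:///out/events", int(time * 1000), "gz"),
#         compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
# dstream.foreachRDD(save_compressed)
```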





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Saving-compressed-textFiles-from-a-DStream-in-Scala-tp23177.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Sam Stoelinga
Please ignore this whole thread: it's suddenly working, and I'm not sure
what the root cause was. After I restarted the VM, the previous SIFT code
also started working.

On Fri, Jun 5, 2015 at 10:40 PM, Sam Stoelinga 
wrote:

> Thanks Davies. I will file a bug later with code and single image as
> dataset. Next to that I can give anybody access to my vagrant VM that
> already has spark with OpenCV and the dataset available.
>
> Or you can setup the same vagrant machine at your place. All is automated
> ^^
> git clone https://github.com/samos123/computer-vision-cloud-platform
> cd computer-vision-cloud-platform
> ./scripts/setup.sh
> vagrant ssh
>
> (Expect failures, I haven't cleaned up and tested it for other people) btw
> I study at Tsinghua also currently.
>
> On Fri, Jun 5, 2015 at 2:43 PM, Davies Liu  wrote:
>
>> Please file a bug here: https://issues.apache.org/jira/browse/SPARK/
>>
>> Could you also provide a way to reproduce this bug (including some
>> datasets)?
>>
>> On Thu, Jun 4, 2015 at 11:30 PM, Sam Stoelinga 
>> wrote:
>> > I've changed the SIFT feature extraction to SURF feature extraction and
>> it
>> > works...
>> >
>> > Following line was changed:
>> > sift = cv2.xfeatures2d.SIFT_create()
>> >
>> > to
>> >
>> > sift = cv2.xfeatures2d.SURF_create()
>> >
>> > Where should I file this as a bug? When not running on Spark it works
>> fine
>> > so I'm saying it's a spark bug.
>> >
>> > On Fri, Jun 5, 2015 at 2:17 PM, Sam Stoelinga 
>> wrote:
>> >>
>> >> Yea should have emphasized that. I'm running the same code on the same
>> VM.
>> >> It's a VM with spark in standalone mode and I run the unit test
>> directly on
>> >> that same VM. So OpenCV is working correctly on that same machine but
>> when
>> >> moving the exact same OpenCV code to spark it just crashes.
>> >>
>> >> On Tue, Jun 2, 2015 at 5:06 AM, Davies Liu 
>> wrote:
>> >>>
>> >>> Could you run the single thread version in worker machine to make sure
>> >>> that OpenCV is installed and configured correctly?
>> >>>
>> >>> On Sat, May 30, 2015 at 6:29 AM, Sam Stoelinga > >
>> >>> wrote:
>> >>> > I've verified the issue lies within Spark running OpenCV code and
>> not
>> >>> > within
>> >>> > the sequence file BytesWritable formatting.
>> >>> >
>> >>> > This is the code which can reproduce that spark is causing the
>> failure
>> >>> > by
>> >>> > not using the sequencefile as input at all but running the same
>> >>> > function
>> >>> > with same input on spark but fails:
>> >>> >
>> >>> > def extract_sift_features_opencv(imgfile_imgbytes):
>> >>> > imgfilename, discardsequencefile = imgfile_imgbytes
>> >>> > imgbytes = bytearray(open("/tmp/img.jpg", "rb").read())
>> >>> > nparr = np.fromstring(buffer(imgbytes), np.uint8)
>> >>> > img = cv2.imdecode(nparr, 1)
>> >>> > gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
>> >>> > sift = cv2.xfeatures2d.SIFT_create()
>> >>> > kp, descriptors = sift.detectAndCompute(gray, None)
>> >>> > return (imgfilename, "test")
>> >>> >
>> >>> > And corresponding tests.py:
>> >>> > https://gist.github.com/samos123/d383c26f6d47d34d32d6
>> >>> >
>> >>> >
>> >>> > On Sat, May 30, 2015 at 8:04 PM, Sam Stoelinga <
>> sammiest...@gmail.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Thanks for the advice! The following line causes spark to crash:
>> >>> >>
>> >>> >> kp, descriptors = sift.detectAndCompute(gray, None)
>> >>> >>
>> >>> >> But I do need this line to be executed and the code does not crash
>> >>> >> when
>> >>> >> running outside of Spark but passing the same parameters. You're
>> >>> >> saying
>> >>> >> maybe the bytes from the sequencefile got somehow transformed and
>> >>> >> don't
>> >>> >> represent an image anymore causing OpenCV to crash the whole python
>> >>> >> executor.
>> >>> >>
>> >>> >> On Fri, May 29, 2015 at 2:06 AM, Davies Liu > >
>> >>> >> wrote:
>> >>> >>>
>> >>> >>> Could you try to comment out some lines in
>> >>> >>> `extract_sift_features_opencv` to find which line cause the crash?
>> >>> >>>
>> >>> >>> If the bytes came from sequenceFile() is broken, it's easy to
>> crash a
>> >>> >>> C library in Python (OpenCV).
>> >>> >>>
>> >>> >>> On Thu, May 28, 2015 at 8:33 AM, Sam Stoelinga
>> >>> >>> 
>> >>> >>> wrote:
>> >>> >>> > Hi sparkers,
>> >>> >>> >
>> >>> >>> > I am working on a PySpark application which uses the OpenCV
>> >>> >>> > library. It
>> >>> >>> > runs
>> >>> >>> > fine when running the code locally but when I try to run it on
>> >>> >>> > Spark on
>> >>> >>> > the
>> >>> >>> > same Machine it crashes the worker.
>> >>> >>> >
>> >>> >>> > The code can be found here:
>> >>> >>> > https://gist.github.com/samos123/885f9fe87c8fa5abf78f
>> >>> >>> >
>> >>> >>> > This is the error message taken from STDERR of the worker log:
>> >>> >>> > https://gist.github.com/samos123/3300191684aee7fc8013
>> >>> >>> >
>> >>> >>> > Would like pointers or tips on how to debug further? Would be nice
>> >>> >>> > to know the reason why the worker crashed.

Re: Required settings for permanent HDFS Spark on EC2

2015-06-05 Thread Nicholas Chammas
If your problem is that stopping/starting the cluster resets configs, then
you may be running into this issue:

https://issues.apache.org/jira/browse/SPARK-4977

Nick

On Thu, Jun 4, 2015 at 2:46 PM barmaley  wrote:

> Hi - I'm having similar problem with switching from ephemeral to persistent
> HDFS - it always looks for 9000 port regardless of options I set for 9010
> persistent HDFS. Have you figured out a solution? Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Required-settings-for-permanent-HDFS-Spark-on-EC2-tp22860p23157.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark Streaming for Each RDD - Exception on Empty

2015-06-05 Thread John Omernik
Is there a pythonic/sparkonic way to test for an empty RDD before using
foreachRDD? Basically I am using the Python example from
https://spark.apache.org/docs/latest/streaming-programming-guide.html to
"put records somewhere". When I have data, it works fine; when I don't, I
get an exception. I am not sure about the performance implications of just
throwing an exception every time there is no data, but can I just test
before sending it?

I did see one post mentioning looking for take(1) from the stream to test
for data, but I am not sure where to put that in this example... Is that in
the lambda function, or somewhere else? Looking for pointers!
Thanks!



mydstream.foreachRDD(lambda rdd: rdd.foreachPartition(parseRDD))



Using this example code from the link above:


def sendPartition(iter):
    connection = createNewConnection()
    for record in iter:
        connection.send(record)
    connection.close()

dstream.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))
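A minimal guard along the lines of the take(1) suggestion, for releases before RDD.isEmpty() arrived in 1.4.0 — the function names are illustrative:

```python
def send_partition(records):
    # placeholder for the connection logic from the guide
    for record in records:
        pass

def send_if_nonempty(rdd):
    # take(1) returns an empty list for an empty RDD, so batches with
    # no data skip the per-partition work instead of raising.
    if rdd.take(1):
        rdd.foreachPartition(send_partition)

# mydstream.foreachRDD(send_if_nonempty)
```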


Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Sam Stoelinga
Thanks Davies. I will file a bug later with code and single image as
dataset. Next to that I can give anybody access to my vagrant VM that
already has spark with OpenCV and the dataset available.

Or you can set up the same vagrant machine at your place. All is automated ^^
git clone https://github.com/samos123/computer-vision-cloud-platform
cd computer-vision-cloud-platform
./scripts/setup.sh
vagrant ssh

(Expect failures, I haven't cleaned up and tested it for other people.) BTW,
I currently also study at Tsinghua.

On Fri, Jun 5, 2015 at 2:43 PM, Davies Liu  wrote:

> Please file a bug here: https://issues.apache.org/jira/browse/SPARK/
>
> Could you also provide a way to reproduce this bug (including some
> datasets)?
>
> On Thu, Jun 4, 2015 at 11:30 PM, Sam Stoelinga 
> wrote:
> > I've changed the SIFT feature extraction to SURF feature extraction and
> it
> > works...
> >
> > Following line was changed:
> > sift = cv2.xfeatures2d.SIFT_create()
> >
> > to
> >
> > sift = cv2.xfeatures2d.SURF_create()
> >
> > Where should I file this as a bug? When not running on Spark it works
> fine
> > so I'm saying it's a spark bug.
> >
> > On Fri, Jun 5, 2015 at 2:17 PM, Sam Stoelinga 
> wrote:
> >>
> >> Yea should have emphasized that. I'm running the same code on the same
> VM.
> >> It's a VM with spark in standalone mode and I run the unit test
> directly on
> >> that same VM. So OpenCV is working correctly on that same machine but
> when
> >> moving the exact same OpenCV code to spark it just crashes.
> >>
> >> On Tue, Jun 2, 2015 at 5:06 AM, Davies Liu 
> wrote:
> >>>
> >>> Could you run the single thread version in worker machine to make sure
> >>> that OpenCV is installed and configured correctly?
> >>>
> >>> On Sat, May 30, 2015 at 6:29 AM, Sam Stoelinga 
> >>> wrote:
> >>> > I've verified the issue lies within Spark running OpenCV code and not
> >>> > within
> >>> > the sequence file BytesWritable formatting.
> >>> >
> >>> > This is the code which can reproduce that spark is causing the
> failure
> >>> > by
> >>> > not using the sequencefile as input at all but running the same
> >>> > function
> >>> > with same input on spark but fails:
> >>> >
> >>> > def extract_sift_features_opencv(imgfile_imgbytes):
> >>> > imgfilename, discardsequencefile = imgfile_imgbytes
> >>> > imgbytes = bytearray(open("/tmp/img.jpg", "rb").read())
> >>> > nparr = np.fromstring(buffer(imgbytes), np.uint8)
> >>> > img = cv2.imdecode(nparr, 1)
> >>> > gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
> >>> > sift = cv2.xfeatures2d.SIFT_create()
> >>> > kp, descriptors = sift.detectAndCompute(gray, None)
> >>> > return (imgfilename, "test")
> >>> >
> >>> > And corresponding tests.py:
> >>> > https://gist.github.com/samos123/d383c26f6d47d34d32d6
> >>> >
> >>> >
> >>> > On Sat, May 30, 2015 at 8:04 PM, Sam Stoelinga <
> sammiest...@gmail.com>
> >>> > wrote:
> >>> >>
> >>> >> Thanks for the advice! The following line causes spark to crash:
> >>> >>
> >>> >> kp, descriptors = sift.detectAndCompute(gray, None)
> >>> >>
> >>> >> But I do need this line to be executed and the code does not crash
> >>> >> when
> >>> >> running outside of Spark but passing the same parameters. You're
> >>> >> saying
> >>> >> maybe the bytes from the sequencefile got somehow transformed and
> >>> >> don't
> >>> >> represent an image anymore causing OpenCV to crash the whole python
> >>> >> executor.
> >>> >>
> >>> >> On Fri, May 29, 2015 at 2:06 AM, Davies Liu 
> >>> >> wrote:
> >>> >>>
> >>> >>> Could you try to comment out some lines in
> >>> >>> `extract_sift_features_opencv` to find which line cause the crash?
> >>> >>>
> >>> >>> If the bytes came from sequenceFile() is broken, it's easy to
> crash a
> >>> >>> C library in Python (OpenCV).
> >>> >>>
> >>> >>> On Thu, May 28, 2015 at 8:33 AM, Sam Stoelinga
> >>> >>> 
> >>> >>> wrote:
> >>> >>> > Hi sparkers,
> >>> >>> >
> >>> >>> > I am working on a PySpark application which uses the OpenCV
> >>> >>> > library. It
> >>> >>> > runs
> >>> >>> > fine when running the code locally but when I try to run it on
> >>> >>> > Spark on
> >>> >>> > the
> >>> >>> > same Machine it crashes the worker.
> >>> >>> >
> >>> >>> > The code can be found here:
> >>> >>> > https://gist.github.com/samos123/885f9fe87c8fa5abf78f
> >>> >>> >
> >>> >>> > This is the error message taken from STDERR of the worker log:
> >>> >>> > https://gist.github.com/samos123/3300191684aee7fc8013
> >>> >>> >
> >>> >>> > Would like pointers or tips on how to debug further? Would be
> nice
> >>> >>> > to
> >>> >>> > know
> >>> >>> > the reason why the worker crashed.
> >>> >>> >
> >>> >>> > Thanks,
> >>> >>> > Sam Stoelinga
> >>> >>> >
> >>> >>> >
> >>> >>> > org.apache.spark.SparkException: Python worker exited
> unexpectedly
> >>> >>> > (crashed)
> >>> >>> > at
> >>> >>> >
> >>> >>> >
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:172)
> >>> >>> > at
> >>> >>> >
> >>> >>> >
> >>> >>> >
> org.apache.spark.

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Charles Earl
Would the IndexedRDD feature provide what the Lookup RDD does?
I've been using a broadcast-variable map for a similar kind of thing -- it's
probably within 1GB, but I'm interested to know whether the lookup (or
indexed) approach might be better.
C

On Friday, June 5, 2015, Dmitry Goldenberg  wrote:

> Thanks everyone. Evo, could you provide a link to the Lookup RDD project?
> I can't seem to locate it exactly on Github. (Yes, to your point, our
> project is Spark streaming based). Thank you.
>
> On Fri, Jun 5, 2015 at 6:04 AM, Evo Eftimov  > wrote:
>
>> Oops, @Yiannis, sorry to be a party pooper but the Job Server is for
>> Spark Batch Jobs (besides anyone can put something like that in 5 min),
>> while I am under the impression that Dmytiy is working on Spark Streaming
>> app
>>
>>
>>
>> Besides the Job Server is essentially for sharing the Spark Context
>> between multiple threads
>>
>>
>>
>> Re Dmytiis intial question – you can load large data sets as Batch
>> (Static) RDD from any Spark Streaming App and then join DStream RDDs
>> against them to emulate “lookups” , you can also try the “Lookup RDD” –
>> there is a git hub project
>>
>>
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com
>> ]
>> *Sent:* Friday, June 5, 2015 12:12 AM
>> *To:* Yiannis Gkoufas
>> *Cc:* Olivier Girardot; user@spark.apache.org
>> 
>> *Subject:* Re: How to share large resources like dictionaries while
>> processing data with Spark ?
>>
>>
>>
>> Thanks so much, Yiannis, Olivier, Huang!
>>
>>
>>
>> On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas > > wrote:
>>
>> Hi there,
>>
>>
>>
>> I would recommend checking out
>> https://github.com/spark-jobserver/spark-jobserver which I think gives
>> the functionality you are looking for.
>>
>> I haven't tested it though.
>>
>>
>>
>> BR
>>
>>
>>
>> On 5 June 2015 at 01:35, Olivier Girardot > > wrote:
>>
>> You can use it as a broadcast variable, but if it's "too" large (more
>> than 1Gb I guess), you may need to share it joining this using some kind of
>> key to the other RDDs.
>>
>> But this is the kind of thing broadcast variables were designed for.
>>
>>
>>
>> Regards,
>>
>>
>>
>> Olivier.
>>
>>
>>
>> Le jeu. 4 juin 2015 à 23:50, dgoldenberg > > a écrit :
>>
>> We have some pipelines defined where sometimes we need to load potentially
>> large resources such as dictionaries.
>>
>> What would be the best strategy for sharing such resources among the
>> transformations/actions within a consumer?  Can they be shared somehow
>> across the RDD's?
>>
>> I'm looking for a way to load such a resource once into the cluster memory
>> and have it be available throughout the lifecycle of a consumer...
>>
>> Thanks.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> 
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
>>
>>
>>
>>
>>
>
>

-- 
- Charles
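For reference, the broadcast-variable-map approach mentioned above looks roughly like this — names are illustrative, and a broadcast is read-only, shipped to each executor once:

```python
def make_enricher(dictionary_bc):
    """Return a map function that looks each key up in the broadcast dict."""
    def enrich(record):
        key, payload = record
        return (key, payload, dictionary_bc.value.get(key))
    return enrich

# dictionary_bc = sc.broadcast(load_dictionary())   # once, on the driver
# enriched = records.map(make_enricher(dictionary_bc))
```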


Re: Override Logging with spark-streaming

2015-06-05 Thread Alexander Krasheninnikov
Have you tried putting this file on the local disk of each of the executor 
nodes? That worked for me.
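Another option that avoids copying the file to every node by hand is to ship it with --files and then reference it by bare name, since each executor gets a local copy in its working directory (this is a sketch; the class name and paths are assumptions, and behavior can vary by cluster manager):

```
spark-submit --class com.example.Main \
  --files /path/to/log4j.properties \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  x.jar
```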

On 05.06.2015 16:56, nib...@free.fr wrote:

Hello,
I want to override the log4j configuration when I start my spark job.
I tried :
.../bin/spark-submit --class  --conf 
"spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/.../log4j.properties"
 x.jar
or
.../bin/spark-submit --class  --conf 
"spark.executor.extraJavaOptions=-Dlog4j.configuration=/.../log4j.properties" 
x.jar

But it doesn't work , I still have the default configuration.

Any ideas ?
Tks
Nicolas

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







Cassandra Submit

2015-06-05 Thread Yasemin Kaya
Hi,

I am using Cassandra in my project. I got this error: *Exception in thread
"main" java.io.IOException: Failed to open native connection to Cassandra
at {127.0.1.1}:9042*

I think I have to modify the submit line. What should I add or remove when
I submit my project?

Best,
yasemin
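127.0.1.1 is the loopback alias from /etc/hosts, which suggests the connector fell back to a default contact point. With the DataStax spark-cassandra-connector the host is usually passed explicitly on submit — a sketch in which the jar version, class name, and address are assumptions:

```
spark-submit --class com.example.MyApp \
  --jars spark-cassandra-connector-assembly-1.3.0.jar \
  --conf spark.cassandra.connection.host=10.0.0.5 \
  myapp.jar
```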


-- 
hiç ender hiç


RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
And RDD.lookup() cannot be invoked from transformations, e.g. maps.

Lookup() is an action which can be invoked only from the driver – if you want 
functionality like that from within transformations executed on the cluster 
nodes, try IndexedRDD.

 

Another option is to load a batch (static) RDD once in your Spark Streaming app 
and then keep joining, and then e.g. filtering, every incoming DStream RDD with 
the (big, static) batch RDD.
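A sketch of that join-against-a-static-RDD pattern — the file path and delimiter are assumptions:

```python
def to_pair(line, sep="\t"):
    """Key each dictionary line by its first field, for joining."""
    return (line.split(sep)[0], line)

# static_rdd = sc.textFile("hdfs:///dicts/terms.tsv").map(to_pair).cache()
# DStream.transform gives per-batch access to RDD operations such as join:
# joined = dstream.transform(lambda rdd: rdd.join(static_rdd))
```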

 

From: Evo Eftimov [mailto:evo.efti...@isecc.com] 
Sent: Friday, June 5, 2015 3:27 PM
To: 'Dmitry Goldenberg'
Cc: 'Yiannis Gkoufas'; 'Olivier Girardot'; 'user@spark.apache.org'
Subject: RE: How to share large resources like dictionaries while processing 
data with Spark ?

 

It is called Indexed RDD https://github.com/amplab/spark-indexedrdd 

 

From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] 
Sent: Friday, June 5, 2015 3:15 PM
To: Evo Eftimov
Cc: Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org
Subject: Re: How to share large resources like dictionaries while processing 
data with Spark ?

 

Thanks everyone. Evo, could you provide a link to the Lookup RDD project? I 
can't seem to locate it exactly on Github. (Yes, to your point, our project is 
Spark streaming based). Thank you.

 

On Fri, Jun 5, 2015 at 6:04 AM, Evo Eftimov  wrote:

Oops, @Yiannis, sorry to be a party pooper but the Job Server is for Spark 
Batch Jobs (besides anyone can put something like that in 5 min), while I am 
under the impression that Dmytiy is working on Spark Streaming app 

 

Besides the Job Server is essentially for sharing the Spark Context between 
multiple threads 

 

Re Dmytiis intial question – you can load large data sets as Batch (Static) RDD 
from any Spark Streaming App and then join DStream RDDs  against them to 
emulate “lookups” , you can also try the “Lookup RDD” – there is a git hub 
project

 

From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] 
Sent: Friday, June 5, 2015 12:12 AM
To: Yiannis Gkoufas
Cc: Olivier Girardot; user@spark.apache.org
Subject: Re: How to share large resources like dictionaries while processing 
data with Spark ?

 

Thanks so much, Yiannis, Olivier, Huang!

 

On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas  wrote:

Hi there,

 

I would recommend checking out 
https://github.com/spark-jobserver/spark-jobserver which I think gives the 
functionality you are looking for.

I haven't tested it though.

 

BR

 

On 5 June 2015 at 01:35, Olivier Girardot  wrote:

You can use it as a broadcast variable, but if it's "too" large (more than 1Gb 
I guess), you may need to share it joining this using some kind of key to the 
other RDDs.

But this is the kind of thing broadcast variables were designed for.

 

Regards, 

 

Olivier.

 

Le jeu. 4 juin 2015 à 23:50, dgoldenberg  a écrit :

We have some pipelines defined where sometimes we need to load potentially
large resources such as dictionaries.

What would be the best strategy for sharing such resources among the
transformations/actions within a consumer?  Can they be shared somehow
across the RDD's?

I'm looking for a way to load such a resource once into the cluster memory
and have it be available throughout the lifecycle of a consumer...

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

 

 

 



RE: redshift spark

2015-06-05 Thread Ewan Leith
That project is for reading data in from Redshift table exports stored in S3 by 
running commands in Redshift like this:

unload ('select * from venue')   
to 's3://mybucket/tickit/unload/'

http://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html

The path in the parameters below is the S3 bucket path.

Hope this helps,
Ewan

-Original Message-
From: Hafiz Mujadid [mailto:hafizmujadi...@gmail.com] 
Sent: 05 June 2015 15:25
To: user@spark.apache.org
Subject: redshift spark

Hi All,

I want to read and write data to AWS Redshift. I found the spark-redshift 
project at the following address:
https://github.com/databricks/spark-redshift

In its documentation, the following code is written:
import com.databricks.spark.redshift.RedshiftInputFormat

val records = sc.newAPIHadoopFile(
  path,
  classOf[RedshiftInputFormat],
  classOf[java.lang.Long],
  classOf[Array[String]])

I am unable to understand its parameters. Can somebody explain how to use 
this? What is meant by path in this case?

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/redshift-spark-tp23175.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.






RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
It is called Indexed RDD https://github.com/amplab/spark-indexedrdd 

 

From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] 
Sent: Friday, June 5, 2015 3:15 PM
To: Evo Eftimov
Cc: Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org
Subject: Re: How to share large resources like dictionaries while processing 
data with Spark ?

 

Thanks everyone. Evo, could you provide a link to the Lookup RDD project? I 
can't seem to locate it exactly on Github. (Yes, to your point, our project is 
Spark streaming based). Thank you.

 

On Fri, Jun 5, 2015 at 6:04 AM, Evo Eftimov  wrote:

Oops, @Yiannis, sorry to be a party pooper, but the Job Server is for Spark 
batch jobs (besides, anyone can put something like that together in 5 min), while I am 
under the impression that Dmitry is working on a Spark Streaming app 

 

Besides the Job Server is essentially for sharing the Spark Context between 
multiple threads 

 

Re Dmitry's initial question – you can load large data sets as a batch (static) RDD 
from any Spark Streaming app and then join DStream RDDs against them to 
emulate “lookups”; you can also try the “Lookup RDD” – there is a GitHub 
project

 

From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] 
Sent: Friday, June 5, 2015 12:12 AM
To: Yiannis Gkoufas
Cc: Olivier Girardot; user@spark.apache.org
Subject: Re: How to share large resources like dictionaries while processing 
data with Spark ?

 

Thanks so much, Yiannis, Olivier, Huang!

 

On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas  wrote:

Hi there,

 

I would recommend checking out 
https://github.com/spark-jobserver/spark-jobserver which I think gives the 
functionality you are looking for.

I haven't tested it though.

 

BR

 

On 5 June 2015 at 01:35, Olivier Girardot  wrote:

You can use it as a broadcast variable, but if it's "too" large (more than 1 GB 
I guess), you may need to share it by joining it, via some kind of key, to the 
other RDDs.

But this is the kind of thing broadcast variables were designed for.

 

Regards, 

 

Olivier.

 

On Thu, Jun 4, 2015 at 11:50 PM, dgoldenberg  wrote:

We have some pipelines defined where sometimes we need to load potentially
large resources such as dictionaries.

What would be the best strategy for sharing such resources among the
transformations/actions within a consumer?  Can they be shared somehow
across the RDD's?

I'm looking for a way to load such a resource once into the cluster memory
and have it be available throughout the lifecycle of a consumer...

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


 

 

 



redshift spark

2015-06-05 Thread Hafiz Mujadid
Hi All,

I want to read and write data to AWS Redshift. I found the spark-redshift
project at the following address:
https://github.com/databricks/spark-redshift

In its documentation, the following code is given:
import com.databricks.spark.redshift.RedshiftInputFormat

val records = sc.newAPIHadoopFile(
  path,
  classOf[RedshiftInputFormat],
  classOf[java.lang.Long],
  classOf[Array[String]])

I am unable to understand its parameters. Can somebody explain how to use
this? What is meant by path in this case?

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/redshift-spark-tp23175.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Dmitry Goldenberg
Thanks everyone. Evo, could you provide a link to the Lookup RDD project? I
can't seem to locate it exactly on Github. (Yes, to your point, our project
is Spark streaming based). Thank you.

On Fri, Jun 5, 2015 at 6:04 AM, Evo Eftimov  wrote:

> Oops, @Yiannis, sorry to be a party pooper, but the Job Server is for Spark
> batch jobs (besides, anyone can put something like that together in 5 min), while I
> am under the impression that Dmitry is working on a Spark Streaming app
>
>
>
> Besides the Job Server is essentially for sharing the Spark Context
> between multiple threads
>
>
>
> Re Dmitry's initial question – you can load large data sets as a batch
> (static) RDD from any Spark Streaming app and then join DStream RDDs
> against them to emulate “lookups”; you can also try the “Lookup RDD” –
> there is a GitHub project
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
> *Sent:* Friday, June 5, 2015 12:12 AM
> *To:* Yiannis Gkoufas
> *Cc:* Olivier Girardot; user@spark.apache.org
> *Subject:* Re: How to share large resources like dictionaries while
> processing data with Spark ?
>
>
>
> Thanks so much, Yiannis, Olivier, Huang!
>
>
>
> On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas 
> wrote:
>
> Hi there,
>
>
>
> I would recommend checking out
> https://github.com/spark-jobserver/spark-jobserver which I think gives
> the functionality you are looking for.
>
> I haven't tested it though.
>
>
>
> BR
>
>
>
> On 5 June 2015 at 01:35, Olivier Girardot  wrote:
>
> You can use it as a broadcast variable, but if it's "too" large (more than
> 1 GB I guess), you may need to share it by joining it, via some kind of key,
> to the other RDDs.
>
> But this is the kind of thing broadcast variables were designed for.
>
>
>
> Regards,
>
>
>
> Olivier.
>
>
>
> On Thu, Jun 4, 2015 at 11:50 PM, dgoldenberg  wrote:
>
> We have some pipelines defined where sometimes we need to load potentially
> large resources such as dictionaries.
>
> What would be the best strategy for sharing such resources among the
> transformations/actions within a consumer?  Can they be shared somehow
> across the RDD's?
>
> I'm looking for a way to load such a resource once into the cluster memory
> and have it be available throughout the lifecycle of a consumer...
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>


Re: Setting S3 output file grantees for spark output files

2015-06-05 Thread Justin Steigel
I figured it out.  I had to add this line to the script:

sc._jsc.hadoopConfiguration().set("fs.s3.canned.acl",
"BucketOwnerFullControl")

Basically, I had to go through the JavaSparkContext inside the SparkContext to access
the Hadoop configuration and set the permissions.

Follow-up question: Is there a better way to get the JavaSparkContext or
the Hadoop Configuration from the pyspark SparkContext?  Accessing a
protected variable directly doesn't seem right.

Thanks,
Justin

On Fri, Jun 5, 2015 at 3:02 AM, Akhil Das 
wrote:

> You could try adding the configuration in the spark-defaults.conf file.
> And once you run the application, you can check the Environment tab on the
> driver UI (which runs on port 4040) to see if the configuration is set properly.
>
> Thanks
> Best Regards
>
> On Thu, Jun 4, 2015 at 8:40 PM, Justin Steigel 
> wrote:
>
>> Hi all,
>>
>> I'm running Spark on AWS EMR and I'm having some issues getting the
>> correct permissions on the output files using
>> rdd.saveAsTextFile('').  In Hive, I would add a line at the
>> beginning of the script with
>>
>> set fs.s3.canned.acl=BucketOwnerFullControl
>>
>> and that would set the correct grantees for the files. For Spark, I tried
>> adding the permissions as a --conf option:
>>
>> hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
>> /home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master
>> yarn-cluster \
>> --conf "spark.driver.extraJavaOptions
>> -Dfs.s3.canned.acl=BucketOwnerFullControl" \
>> hdfs:///user/hadoop/spark.py
>>
>> But the permissions do not get set properly on the output files. What is
>> the proper way to pass in the 'fs.s3.canned.acl=BucketOwnerFullControl' or
>> any of the S3 canned permissions to the spark job?
>>
>> Thanks in advance
>>
>
>


Override Logging with spark-streaming

2015-06-05 Thread nibiau
Hello,
I want to override the log4j configuration when I start my spark job.
I tried :
.../bin/spark-submit --class  --conf 
"spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/.../log4j.properties"
 x.jar 
or
.../bin/spark-submit --class  --conf 
"spark.executor.extraJavaOptions=-Dlog4j.configuration=/.../log4j.properties" 
x.jar 

But it doesn't work; I still have the default configuration.

Any ideas ?
Tks
Nicolas
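A commonly suggested fix (a sketch, not verified against this exact setup; the main class and paths below are placeholders) is to ship the properties file to the executors with --files and then point -Dlog4j.configuration at the shipped copy's bare filename, since the driver-side absolute path does not exist on the executor machines. In the spirit of the Python launcher script posted earlier in this archive, the submit command can be built like this:

```python
# Sketch: build a spark-submit argv that overrides log4j on both driver and
# executors. "com.example.MyJob" and the default properties path are
# illustrative placeholders, not values from the original question.
def build_submit_command(job_jar, log4j_props="/etc/spark/log4j.properties"):
    # The file lands in each executor's working directory under its bare name.
    props_name = log4j_props.rsplit("/", 1)[-1]
    return [
        "spark-submit",
        "--class", "com.example.MyJob",          # placeholder main class
        "--files", log4j_props,                  # ship log4j.properties to executors
        "--conf",
        "spark.executor.extraJavaOptions=-Dlog4j.configuration=%s" % props_name,
        "--driver-java-options",
        "-Dlog4j.configuration=file:%s" % log4j_props,  # driver reads local copy
        job_jar,
    ]

cmd = build_submit_command("x.jar")
print(" ".join(cmd))
```

The key difference from the attempts above is the --files flag plus the relative name for executors; passing only an absolute path in spark.executor.extraJavaOptions is usually why the default configuration persists.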




Accumulator map

2015-06-05 Thread Cosmin Cătălin Sanda
Hi,

I am trying to gather some statistics from an RDD using accumulators.
Essentially, I am counting how many times specific segments appear in each
row of the RDD. This works fine and I get the expected results, but the
problem is that each time I add a new segment to look for, I have to
explicitly create an Accumulator for it and explicitly use the Accumulator
in the foreach method.

Is there a way to use a dynamic list of Accumulators in Spark? I want to
control the segments from a single place in the code and the accumulators
to be dynamically created and used based on the metrics list.
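As far as I know there is no built-in dynamic accumulator registry, but the pattern can be driven from a single list. The Spark-free sketch below (all names are illustrative) shows the counting logic; in an actual Spark job, each entry of `counts` would instead be created up front as `sc.accumulator(0)` from the same `SEGMENTS` list and incremented inside `foreach`.

```python
from collections import defaultdict

# Single place to control which segments are counted; adding a metric
# is a one-line change here, with no other code edits needed.
SEGMENTS = ["EventOne", "EventTwo", "checkout"]

def count_segments(rows, segments=SEGMENTS):
    counts = defaultdict(int)  # one counter per segment, created dynamically
    for row in rows:
        for seg in segments:
            # count how many times this segment appears in the row
            counts[seg] += row.count(seg)
    return dict(counts)

rows = ["EventOne happened", "checkout then EventOne", "nothing"]
print(count_segments(rows))
```

The same shape carries over to Spark: build `{seg: sc.accumulator(0) for seg in SEGMENTS}` once, then add to `accs[seg]` inside the `foreach` function.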

BR,

*Cosmin Catalin SANDA*
Software Systems Engineer
Phone: +45.27.30.60.35


Re: union and reduceByKey wrong shuffle?

2015-06-05 Thread Igor Berman
This JIRA seems to be connected to our issue:
https://issues.apache.org/jira/browse/SPARK-1018

On 2 June 2015 at 19:54, Josh Rosen  wrote:

> Ah, interesting.  While working on my new Tungsten shuffle manager, I came
> up with some nice testing interfaces for allowing me to manually trigger
> spills in order to deterministically test those code paths without
> requiring large amounts of data to be shuffled.  Maybe I could make similar
> test interface changes to the existing shuffle code, which might make it
> easier to reproduce this in an isolated environment.
>
> On Mon, Jun 1, 2015 at 11:41 PM, Igor Berman 
> wrote:
>
>> Hi,
>> Small mock data doesn't reproduce the problem. IMHO the problem is reproduced
>> when we make the shuffle big enough to spill data to disk.
>> We will work on understanding and reproducing the problem (not first
>> priority though...)
>>
>>
>> On 1 June 2015 at 23:02, Josh Rosen  wrote:
>>
>>> How much work is it to produce a small standalone reproduction?  Can you
>>> create an Avro file with some mock data, maybe 10 or so records, then
>>> reproduce this locally?
>>>
>>> On Mon, Jun 1, 2015 at 12:31 PM, Igor Berman 
>>> wrote:
>>>
 switching to use simple POJOs instead of Avro for Spark
 serialization solved the problem (I mean reading Avro from S3 and then
 mapping each Avro object to its serializable POJO counterpart with the same
 fields; the POJO is registered with Kryo).
 Any thoughts on where to look for a problem/misconfiguration?

 On 31 May 2015 at 22:48, Igor Berman  wrote:

> Hi
> We are using spark 1.3.1
> Avro-chill (tomorrow I will check if it's important); we register Avro
> classes from Java
> Avro 1.7.6
> On May 31, 2015 22:37, "Josh Rosen"  wrote:
>
>> Which Spark version are you using?  I'd like to understand whether
>> this change could be caused by recent Kryo serializer re-use changes in
>> master / Spark 1.4.
>>
>> On Sun, May 31, 2015 at 11:31 AM, igor.berman 
>> wrote:
>>
>>> after investigation the problem is somehow connected to avro
>>> serialization
>>> with kryo + chill-avro(mapping avro object to simple scala case
>>> class and
>>> running reduce on these case class objects solves the problem)
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/union-and-reduceByKey-wrong-shuffle-tp23092p23093.html
>>> Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>

>>>
>>
>


Slow file listing when loading records from in S3 without filename or wildcard

2015-06-05 Thread Ewan Leith
Hi all,

I'm not sure if this is a Spark issue, or an AWS/Hadoop/S3 driver issue, but 
I've noticed that I get a very slow response when I run:

val files = sc.wholeTextFiles("s3://emr-test-dgp/testfiles/").count()

(which will count all the files in the directory)

But an almost immediate response if I run this command with a wildcard added to 
the end:

val files = sc.wholeTextFiles("s3://emr-test-dgp/testfiles/*").count()

The time difference is on the order of 1 extra minute per 1000 files being 
listed from S3. The count returns the same value for each query.

This is on 1000s of files, with no sub-directories to confuse things. Has 
anyone seen anything similar?

Thanks,
Ewan
15/06/05 10:31:58 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is 
ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 
3(ms)
15/06/05 10:31:58 INFO cluster.YarnClusterScheduler: 
YarnClusterScheduler.postStartHook done
15/06/05 10:32:00 INFO metrics.MetricsSaver: Saved 3:61 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:32:02 INFO fs.EmrFileSystem: Consistency enabled, using 
com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2 as filesystem 
implementation
15/06/05 10:32:03 INFO storage.MemoryStore: ensureFreeSpace(257941) called with 
curMem=0, maxMem=280248975
15/06/05 10:32:03 INFO storage.MemoryStore: Block broadcast_0 stored as values 
in memory (estimated size 251.9 KB, free 267.0 MB)
15/06/05 10:32:03 INFO storage.MemoryStore: ensureFreeSpace(19668) called with 
curMem=257941, maxMem=280248975
15/06/05 10:32:03 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as 
bytes in memory (estimated size 19.2 KB, free 267.0 MB)
15/06/05 10:32:03 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in 
memory on ip-10-111-0-34.eu-west-1.compute.internal:50494 (size: 19.2 KB, free: 
267.2 MB)
15/06/05 10:32:03 INFO storage.BlockManagerMaster: Updated info of block 
broadcast_0_piece0
15/06/05 10:32:03 INFO spark.SparkContext: Created broadcast 0 from main at 
NativeMethodAccessorImpl.java:-2
15/06/05 10:32:10 INFO metrics.MetricsSaver: 1 aggregated 
AmazonDynamoDBv2GetItemDelay 59 raw values into 6 aggregated values, total 6
15/06/05 10:32:30 INFO metrics.MetricsSaver: Saved 17:286 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:33:00 INFO metrics.MetricsSaver: Saved 11:169 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:33:30 INFO metrics.MetricsSaver: Saved 11:175 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:33:44 INFO metrics.MetricsSaver: 101 aggregated DdbReadUnitGetItem 
53 raw values into 2 aggregated values, total 14
15/06/05 10:34:00 INFO metrics.MetricsSaver: Saved 11:172 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:34:30 INFO metrics.MetricsSaver: Saved 11:178 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:35:00 INFO metrics.MetricsSaver: Saved 11:273 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:35:13 INFO metrics.MetricsSaver: 201 aggregated DdbReadUnitGetItem 
50 raw values into 2 aggregated values, total 13
15/06/05 10:35:30 INFO metrics.MetricsSaver: Saved 11:172 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:35:35 INFO input.FileInputFormat: Total input paths to process : 
5001
15/06/05 10:36:00 INFO metrics.MetricsSaver: Saved 14:263 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:36:30 INFO metrics.MetricsSaver: Saved 11:322 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:36:33 INFO metrics.MetricsSaver: 301 aggregated 
AmazonS3GetObjectMetadataDelay 80 raw values into 3 aggregated values, total 3
15/06/05 10:37:00 INFO metrics.MetricsSaver: Saved 11:250 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:37:30 INFO metrics.MetricsSaver: Saved 11:265 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:37:46 INFO metrics.MetricsSaver: 401 aggregated 
AmazonDynamoDBv2GetItemDelay 75 raw values into 2 aggregated values, total 16
15/06/05 10:38:00 INFO metrics.MetricsSaver: Saved 11:214 records to 
/mnt/var/em/raw/i-eff91516_20150605_ApplicationMaster_04401_raw.bin
15/06/05 10:38:18 INFO input.FileInputFormat: Total input paths to process : 
5001
15/06/05 10:38:19 INFO input.CombineFileInputFormat: DEBUG: Terminated node 
allocation with : CompletedNodes: 1, size left: 0
15/06/05 10:38:21 INFO spark.SparkContext: Starting job: isEmpty at 
JsonRDD.scala:51
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Got job 0 (isEmpty at 
JsonRDD.scala:51) with 1 output partitions (allowLocal=true)
15/06/05 10:38:21 INFO scheduler.DAGScheduler: Final sta

Re: Saving calculation to single local file

2015-06-05 Thread marcos rebelo
Hi all

I have to say that this solution surprises me. This is the first time I have
had such a requirement, but I would have expected a more elegant solution.

I'm sure that many people are doing the same as I am, and they would love to
have a better solution to this problem.

Best Regards
Marcos

On Fri, Jun 5, 2015 at 1:16 PM, ayan guha  wrote:

> Another option is merge partfiles after your app ends.
> On 5 Jun 2015 20:37, "Akhil Das"  wrote:
>
>> You can simply do rdd.repartition(1).saveAsTextFile(...); it might not be
>> efficient if your output data is huge, since one task will be doing all
>> the writing.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Jun 5, 2015 at 3:46 PM, marcos rebelo  wrote:
>>
>>> Hi all
>>>
>>> I'm running spark in a single local machine, no hadoop, just reading and
>>> writing in local disk.
>>>
>>> I need to have a single file as output of my calculation.
>>>
>>> if I do "rdd.saveAsTextFile(...)" all runs OK, but I get a lot of files.
>>> Since I need a single file, I was considering doing something like:
>>>
>>>   Try {new FileWriter(outputPath)} match {
>>> case Success(writer) =>
>>>   try {
>>> rdd.toLocalIterator.foreach({line =>
>>>   val str = line.toString
>>>   writer.write(str)
>>> })
>>>   }
>>> }
>>> ...
>>>   }
>>>
>>>
>>> I get:
>>>
>>> [error] o.a.s.e.Executor - Exception in task 0.0 in stage 41.0 (TID 32)
>>> java.lang.OutOfMemoryError: Java heap space
>>> at java.util.Arrays.copyOf(Arrays.java:3236) ~[na:1.8.0_45]
>>> at
>>> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
>>> ~[na:1.8.0_45]
>>> at
>>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>> ~[na:1.8.0_45]
>>> at
>>> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>>> ~[na:1.8.0_45]
>>> at
>>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>>> ~[na:1.8.0_45]
>>> [error] o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception in
>>> thread Thread[Executor task launch worker-1,5,main]
>>> java.lang.OutOfMemoryError: Java heap space
>>> at java.util.Arrays.copyOf(Arrays.java:3236) ~[na:1.8.0_45]
>>> at
>>> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
>>> ~[na:1.8.0_45]
>>> at
>>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>> ~[na:1.8.0_45]
>>> at
>>> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>>> ~[na:1.8.0_45]
>>> at
>>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>>> ~[na:1.8.0_45]
>>> [error] o.a.s.s.TaskSetManager - Task 0 in stage 41.0 failed 1 times;
>>> aborting job
>>> [warn] application - Can't write to /tmp/err1433498283479.csv: {}
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 0 in stage 41.0 failed 1 times, most recent failure: Lost task 0.0 in stage
>>> 41.0 (TID 32, localhost): java.lang.OutOfMemoryError: Java heap space
>>> at java.util.Arrays.copyOf(Arrays.java:3236)
>>> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
>>> at
>>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>> at
>>> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>>> at
>>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>>> at
>>> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>>> at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>>> at
>>> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>>> at
>>> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>>> at
>>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> Driver stacktrace:
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
>>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
>>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>>> at
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> ~[scala-library-2.10.5.jar:na]
>>> at
>>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.sc

Nested reduceByKeyAndWindow losing data

2015-06-05 Thread Alexander Krasheninnikov

Hello, everyone!
I've experienced a problem when using nested reduceByKeyAndWindow.
My task is to parse JSON-formatted events from textFileStream and 
create aggregations for each field.

E.g. having such input:

{"type":"EventOne", "attr1":10,"attr2":20}

I have projections:

{"type":"EventOne", count:1}

{"type":"EventOne", "attr1":10, count:1}

{"type":"EventOne", "attr1":20, count:1}

{"type":"EventOne", "attr1":10, "attr2":20, count:1}

Every 150 seconds (Durations.seconds(150)) I pre-aggregate this data (to save and use 
for the larger windows). But the larger windows receive no data at all – this 
looks like a bug.


Here is the code:

Duration preAggDuration = Durations.seconds(150);
Duration windowComputationPeriod = Durations.seconds(300);
JavaStreamingContext streamingContext =
new JavaStreamingContext(sparkConf, Durations.seconds(10));

// read lines
JavaDStream lines = streamingContext.textFileStream("hdfs://my_dir");
// parse lines into models
JavaDStream eventStream = lines.repartition(350).flatMap(new MyEventParser());
// each model is split into projections
JavaPairDStream projectionStream = eventStream.flatMapToPair(new MyProjectionFunction());
// pre-aggregated data
JavaPairDStream preAggStream = projectionStream.reduceByKeyAndWindow(
new MySumFunction(), preAggDuration, preAggDuration);
preAggStream.foreachRDD(SAVE_AS_OBJECT_FILE);

// computations in large windows
for (int windowSize : new int[]{60, 1440})
{
JavaPairDStream windowStream = preAggStream.reduceByKeyAndWindow(
new MySumFunction(), Durations.minutes(windowSize), windowComputationPeriod);
windowStream.count().print(); // here I have no data :(
// here goes a merge of this window with historical data
JavaPairDStream windowMergedStream = windowStream.transformToPair(/* ... */);
windowMergedStream.count().print(); // here I have zero :(
}







RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
Spark uses Tachyon internally, i.e. all SERIALIZED IN-MEMORY RDDs are kept there – 
so if you have a BATCH RDD which is SERIALIZED IN-MEMORY then you are using 
Tachyon implicitly. The only difference is that if you use Tachyon 
explicitly, i.e. as a distributed, in-memory file system, you can share data 
between jobs, while an RDD is only ever visible within jobs using the same Spark 
Context 

 

From: Charles Earl [mailto:charles.ce...@gmail.com] 
Sent: Friday, June 5, 2015 12:10 PM
To: Evo Eftimov
Cc: Dmitry Goldenberg; Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org
Subject: Re: How to share large resources like dictionaries while processing 
data with Spark ?

 

Would Tachyon be appropriate here?

On Friday, June 5, 2015, Evo Eftimov  wrote:

Oops, @Yiannis, sorry to be a party pooper, but the Job Server is for Spark 
batch jobs (besides, anyone can put something like that together in 5 min), while I am 
under the impression that Dmitry is working on a Spark Streaming app 

 

Besides the Job Server is essentially for sharing the Spark Context between 
multiple threads 

 

Re Dmitry's initial question – you can load large data sets as a batch (static) RDD 
from any Spark Streaming app and then join DStream RDDs against them to 
emulate “lookups”; you can also try the “Lookup RDD” – there is a GitHub 
project

 

From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] 
Sent: Friday, June 5, 2015 12:12 AM
To: Yiannis Gkoufas
Cc: Olivier Girardot; user@spark.apache.org
Subject: Re: How to share large resources like dictionaries while processing 
data with Spark ?

 

Thanks so much, Yiannis, Olivier, Huang!

 

On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas  wrote:

Hi there,

 

I would recommend checking out 
https://github.com/spark-jobserver/spark-jobserver which I think gives the 
functionality you are looking for.

I haven't tested it though.

 

BR

 

On 5 June 2015 at 01:35, Olivier Girardot  wrote:

You can use it as a broadcast variable, but if it's "too" large (more than 1 GB 
I guess), you may need to share it by joining it, via some kind of key, to the 
other RDDs.

But this is the kind of thing broadcast variables were designed for.

 

Regards, 

 

Olivier.

 

On Thu, Jun 4, 2015 at 11:50 PM, dgoldenberg  wrote:

We have some pipelines defined where sometimes we need to load potentially
large resources such as dictionaries.

What would be the best strategy for sharing such resources among the
transformations/actions within a consumer?  Can they be shared somehow
across the RDD's?

I'm looking for a way to load such a resource once into the cluster memory
and have it be available throughout the lifecycle of a consumer...

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


 

 



-- 
- Charles



Re: Saving calculation to single local file

2015-06-05 Thread ayan guha
Another option is to merge the part-files after your app ends.
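This merge-afterwards option can be sketched in plain Python (paths here are illustrative; this assumes text part-files on a local disk, which matches the original local-machine setup — on HDFS the usual equivalent is `hadoop fs -getmerge`):

```python
import glob
import os
import shutil
import tempfile

def merge_part_files(spark_output_dir, merged_path):
    # saveAsTextFile writes one part-NNNNN file per partition; concatenate
    # them in partition order into a single output file.
    parts = sorted(glob.glob(os.path.join(spark_output_dir, "part-*")))
    with open(merged_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # stream copy, no full read into memory
    return merged_path

# Tiny demo with fake part-files:
d = tempfile.mkdtemp()
with open(os.path.join(d, "part-00000"), "w") as f:
    f.write("a\n")
with open(os.path.join(d, "part-00001"), "w") as f:
    f.write("b\n")
merged = merge_part_files(d, os.path.join(d, "merged.txt"))
with open(merged) as f:
    print(f.read())  # prints "a" then "b", each on its own line
```

Unlike repartition(1), this keeps the Spark job fully parallel and only pays the single-writer cost after the job finishes.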
On 5 Jun 2015 20:37, "Akhil Das"  wrote:

> You can simply do rdd.repartition(1).saveAsTextFile(...); it might not be
> efficient if your output data is huge, since one task will be doing all
> the writing.
>
> Thanks
> Best Regards
>
> On Fri, Jun 5, 2015 at 3:46 PM, marcos rebelo  wrote:
>
>> Hi all
>>
>> I'm running spark in a single local machine, no hadoop, just reading and
>> writing in local disk.
>>
>> I need to have a single file as output of my calculation.
>>
>> if I do "rdd.saveAsTextFile(...)" all runs OK, but I get a lot of files.
>> Since I need a single file, I was considering doing something like:
>>
>>   Try {new FileWriter(outputPath)} match {
>> case Success(writer) =>
>>   try {
>> rdd.toLocalIterator.foreach({line =>
>>   val str = line.toString
>>   writer.write(str)
>> })
>>   }
>> }
>> ...
>>   }
>>
>>
>> I get:
>>
>> [error] o.a.s.e.Executor - Exception in task 0.0 in stage 41.0 (TID 32)
>> java.lang.OutOfMemoryError: Java heap space
>> at java.util.Arrays.copyOf(Arrays.java:3236) ~[na:1.8.0_45]
>> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
>> ~[na:1.8.0_45]
>> at
>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>> ~[na:1.8.0_45]
>> at
>> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>> ~[na:1.8.0_45]
>> at
>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>> ~[na:1.8.0_45]
>> [error] o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception in
>> thread Thread[Executor task launch worker-1,5,main]
>> java.lang.OutOfMemoryError: Java heap space
>> at java.util.Arrays.copyOf(Arrays.java:3236) ~[na:1.8.0_45]
>> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
>> ~[na:1.8.0_45]
>> at
>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>> ~[na:1.8.0_45]
>> at
>> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>> ~[na:1.8.0_45]
>> at
>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>> ~[na:1.8.0_45]
>> [error] o.a.s.s.TaskSetManager - Task 0 in stage 41.0 failed 1 times;
>> aborting job
>> [warn] application - Can't write to /tmp/err1433498283479.csv: {}
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
>> in stage 41.0 failed 1 times, most recent failure: Lost task 0.0 in stage
>> 41.0 (TID 32, localhost): java.lang.OutOfMemoryError: Java heap space
>> at java.util.Arrays.copyOf(Arrays.java:3236)
>> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
>> at
>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>> at
>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>> at
>> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>> at
>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>> at
>> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>> at
>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Driver stacktrace:
>> at 
>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>> ~[spark-core_2.10-1.3.1.jar:1.3.1]
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> ~[scala-library-2.10.5.jar:na]
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> ~[scala-library-2.10.5.jar:na]
>>
>>
>> if this rdd.toLocalIterator.foreach(...) doesn't work, what is the better
>> solution?
>>
>> Best Regards
>> Marcos
>>
>>
>>
>


Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Charles Earl
Would tachyon be appropriate here?

On Friday, June 5, 2015, Evo Eftimov  wrote:

> Oops, @Yiannis, sorry to be a party pooper, but the Job Server is for Spark
> Batch jobs (besides, anyone can put something like that together in 5 min), while
> I am under the impression that Dmitry is working on a Spark Streaming app
>
>
>
> Besides the Job Server is essentially for sharing the Spark Context
> between multiple threads
>
>
>
> Re Dmitry's initial question – you can load large data sets as a Batch
> (Static) RDD from any Spark Streaming app and then join DStream RDDs
> against them to emulate “lookups”; you can also try the “Lookup RDD” –
> there is a GitHub project
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com
> ]
> *Sent:* Friday, June 5, 2015 12:12 AM
> *To:* Yiannis Gkoufas
> *Cc:* Olivier Girardot; user@spark.apache.org
> 
> *Subject:* Re: How to share large resources like dictionaries while
> processing data with Spark ?
>
>
>
> Thanks so much, Yiannis, Olivier, Huang!
>
>
>
> On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas  > wrote:
>
> Hi there,
>
>
>
> I would recommend checking out
> https://github.com/spark-jobserver/spark-jobserver which I think gives
> the functionality you are looking for.
>
> I haven't tested it though.
>
>
>
> BR
>
>
>
> On 5 June 2015 at 01:35, Olivier Girardot  > wrote:
>
> You can use it as a broadcast variable, but if it's "too" large (more than
> 1Gb I guess), you may need to share it joining this using some kind of key
> to the other RDDs.
>
> But this is the kind of thing broadcast variables were designed for.
>
>
>
> Regards,
>
>
>
> Olivier.
>
>
>
> Le jeu. 4 juin 2015 à 23:50, dgoldenberg  > a écrit :
>
> We have some pipelines defined where sometimes we need to load potentially
> large resources such as dictionaries.
>
> What would be the best strategy for sharing such resources among the
> transformations/actions within a consumer?  Can they be shared somehow
> across the RDD's?
>
> I'm looking for a way to load such a resource once into the cluster memory
> and have it be available throughout the lifecycle of a consumer...
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> 
> For additional commands, e-mail: user-h...@spark.apache.org
> 
>
>
>
>
>


-- 
- Charles


Re: Access several s3 buckets, with credentials containing "/"

2015-06-05 Thread Steve Loughran

> On 5 Jun 2015, at 08:03, Pierre B  
> wrote:
> 
> Hi list!
> 
> My problem is quite simple.
> I need to access several S3 buckets, using different credentials.:
> ```
> val c1 =
> sc.textFile("s3n://[ACCESS_KEY_ID1:SECRET_ACCESS_KEY1]@bucket1/file.csv").count
> val c2 =
> sc.textFile("s3n://[ACCESS_KEY_ID2:SECRET_ACCESS_KEY2]@bucket2/file.csv").count
> val c3 =
> sc.textFile("s3n://[ACCESS_KEY_ID3:SECRET_ACCESS_KEY3]@bucket3/file.csv").count
> ...
> ```
> 
> One/several of those AWS credentials might contain "/" in the private access
> key.
> This is a known problem and from my research, the only ways to deal with
> these "/" are:
> 1/ use environment variables to set the AWS credentials, then access the s3
> buckets without specifying the credentials
> 2/ set the hadoop configuration to contain the the credentials.
> 
> However, none of these solutions allow me to access different buckets, with
> different credentials.
> 
> Can anyone help me on this?
> 
> Thanks
> 
> Pierre

This is a long-known outstanding bug in Hadoop s3n that nobody has ever sat down to 
fix. One subtlety is that it's really hard to test, as you need credentials with a "/" in them.

The general best practice is to recreate your credentials

Now, if you can get the patch to work against hadoop trunk, I promise I will 
commit it
https://issues.apache.org/jira/browse/HADOOP-3733
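For illustration, here is a small, self-contained Python sketch (not Spark code; the access key, secret, and bucket names are made up) of why a "/" in the secret key breaks the `s3n://KEY:SECRET@bucket/path` URI form, and why percent-encoding it is the usual suggestion even though the s3n connector of that era did not handle the encoded form reliably:

```python
from urllib.parse import urlparse, quote

# A secret key containing "/" corrupts the s3n://KEY:SECRET@bucket/path form,
# because the first "/" is taken as the start of the URI path.
secret = "abc/def"                      # hypothetical secret access key
uri = "s3n://AKIAEXAMPLE:%s@bucket1/file.csv" % secret
parsed = urlparse(uri)
# The userinfo is mangled: the netloc ends at the first "/", so no
# password can be recovered from the URI at all.
assert parsed.password != secret

# Percent-encoding the "/" keeps the URI well-formed...
encoded = quote(secret, safe="")        # "abc%2Fdef"
uri2 = "s3n://AKIAEXAMPLE:%s@bucket1/file.csv" % encoded
assert urlparse(uri2).password == encoded
# ...but the s3n code of that era did not reliably decode it back,
# which is why the practical advice is to recreate the credentials
# (see HADOOP-3733).
```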




Re: Spark Job always cause a node to reboot

2015-06-05 Thread Steve Loughran

> On 4 Jun 2015, at 15:59, Chao Chen  wrote:
> 
> But when I try to run the Pagerank from HiBench, it always cause a node to 
> reboot during the middle of the work for all scala, java, and python 
> versions. But works fine
> with the MapReduce version from the same benchmark. 

do you mean a real server reboot? Without warning?

That's a serious problem. If it were just one server I'd look at hardware 
problems, especially memory, whether you have mixed CPUs in a dual-socket 
server, or even a potential HDD issue.

if it's all servers then it's an OS or filesystem problem.

As well as the vm.swappiness, turn off huge pages in the kernel
http://docs.hortonworks.com/HDPDocuments/Ambari-1.6.1.0/bk_using_Ambari_book/content/ambari-chap1-5-8.html
http://www.cloudera.com/content/cloudera/en/documentation/cdh4/v4-2-0/CDH4-Release-Notes/cdh4ki_topic_1_3.html

See also some Hadoop/HDFS notes on filesystems, 5 years old
http://wiki.apache.org/hadoop/DiskSetup

Everyone generally still recommends ext3 & maybe ext4 with noatime




Re: Spark 1.3.1 On Mesos Issues.

2015-06-05 Thread Steve Loughran

On 2 Jun 2015, at 00:14, Dean Wampler 
mailto:deanwamp...@gmail.com>> wrote:

It would be nice to see the code for the MapR FS Java API, but my google-fu failed 
me (assuming it's open source)...


I know that MapRFS is closed source; I don't know about the Java JAR. Why not ask 
Ted Dunning (cc'd) nicely to see if he can track down the stack trace for you.

So, shooting in the dark ;) there are a few things I would check, if you 
haven't already:

1. Could there be 1.2 versions of some Spark jars that get picked up at run 
time (but apparently not in local mode) on one or more nodes? (Side question: 
Does your node experiment fail on all nodes?) Put another way, are the 
classpaths good for all JVM tasks?
2. Can you use just MapR and Spark 1.3.1 successfully, bypassing Mesos?

Incidentally, how are you combining Mesos and MapR? Are you running Spark in 
Mesos, but accessing data in MapR-FS?

Perhaps the MapR "shim" library doesn't support Spark 1.3.1.

HTH,

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd 
Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com

On Mon, Jun 1, 2015 at 2:49 PM, John Omernik 
mailto:j...@omernik.com>> wrote:
All -

I am facing an odd issue and I am not really sure where to go for support at 
this point. I am running MapR, which complicates things as it relates to Mesos; 
however, this HAS worked in the past with no issues, so I am stumped here.

So for starters, here is what I am trying to run. This is a simple show tables 
using the Hive Context:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row, HiveContext
sparkhc = HiveContext(sc)
test = sparkhc.sql("show tables")
for r in test.collect():
  print r

When I run it on 1.3.1 using ./bin/pyspark --master local  This works with no 
issues.

When I run it using Mesos with all the settings configured (as they had worked 
in the past), I get lost tasks, and when I zoom in on them, the error being 
reported is below. Basically it's a NullPointerException in 
com.mapr.fs.ShimLoader. What's weird to me is that when I took each instance and 
compared the two, the classpath and everything else are exactly the same. Yet 
running in local mode works, and running in Mesos fails. Also of note, when 
the task is scheduled to run on the same node as where I run locally, it fails 
too! (Baffling.)

Ok, for comparison, here is how I configured Mesos: I downloaded the mapr4 package 
from spark.apache.org, using the exact same 
configuration file as for 1.2.0 (except for changing the executor tgz from 1.2.0 
to 1.3.1). When I run this example with the mapr4 package for 1.2.0 there is no 
issue in Mesos; everything runs as intended. Using the same package for 1.3.1, 
it fails.

(Also of note, 1.2.1 gives a 404 error, 1.2.2 fails, and 1.3.0 fails as well).

So basically, when I used 1.2.0 and followed a set of steps, it worked on Mesos, 
while 1.3.1 fails. And although 1.3.1 is a "current" version of Spark, MapR supports 
1.2.1 only. (Still working on that.)

I guess I am at a loss right now on why this would be happening, any pointers 
on where I could look or what I could tweak would be greatly appreciated. 
Additionally, if there is something I could specifically draw to the attention 
of MapR on this problem please let me know, I am perplexed on the change from 
1.2.0 to 1.3.1.

Thank you,

John




Full Error on 1.3.1 on Mesos:
15/05/19 09:31:26 INFO MemoryStore: MemoryStore started with capacity 1060.3 MB
java.lang.NullPointerException
	at com.mapr.fs.ShimLoader.getRootClassLoader(ShimLoader.java:96)
	at com.mapr.fs.ShimLoader.injectNativeLoader(ShimLoader.java:232)
	at com.mapr.fs.ShimLoader.load(ShimLoader.java:194)
	at org.apache.hadoop.conf.CoreDefaultProperties.(CoreDefaultProperties.java:60)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:274)
	at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1847)
	at org.apache.hadoop.conf.Configuration.getProperties(Configuration.java:2062)
	at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2272)
	at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2224)
	at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2141)
	at org.apache.hadoop.conf.Configuration.set(Configuration.java:992)
	at org.apache.hadoop.conf.Configuration.set(Configuration.java:966)
	at org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:98)
	at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:43)
	at org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:220)
	at org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
	at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1959)
	at org.apache.spark.storage.BlockManager.(BlockManager.s

Re: Saving calculation to single local file

2015-06-05 Thread Akhil Das
You can simply do rdd.repartition(1).saveAsTextFile(...), though it might not be
efficient if your output data is huge, since one task will be doing the
whole writing.
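If funnelling everything through one task is too slow, another commonly used workaround is to let Spark write its normal part files and then concatenate them into a single file outside Spark. A minimal sketch, assuming the output directory is on the local filesystem (the paths below are hypothetical):

```python
import glob
import os
import shutil
import tempfile

def merge_part_files(output_dir, merged_path):
    """Concatenate Spark's part-* files (in name order) into one local file."""
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "wb") as out:
        for p in parts:
            with open(p, "rb") as f:
                shutil.copyfileobj(f, out)  # streams chunks; never loads a whole file
    return merged_path

# Simulate a saveAsTextFile() output directory with two part files.
d = tempfile.mkdtemp()
with open(os.path.join(d, "part-00000"), "w") as f:
    f.write("a\nb\n")
with open(os.path.join(d, "part-00001"), "w") as f:
    f.write("c\n")

merged = merge_part_files(d, os.path.join(d, "merged.txt"))
with open(merged) as f:
    assert f.read() == "a\nb\nc\n"
```

This keeps the write fully parallel and only serializes the cheap local concatenation step.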

Thanks
Best Regards

On Fri, Jun 5, 2015 at 3:46 PM, marcos rebelo  wrote:

> Hi all
>
> I'm running spark in a single local machine, no hadoop, just reading and
> writing in local disk.
>
> I need to have a single file as output of my calculation.
>
> if I do "rdd.saveAsTextFile(...)" all runs ok but I get a lot of files.
> Since I need a single file I was considering to do something like:
>
>   Try {new FileWriter(outputPath)} match {
> case Success(writer) =>
>   try {
> rdd.toLocalIterator.foreach({line =>
>   val str = line.toString
>   writer.write(str)
> }
>   }
> }
> ...
>   }
>
>
> I get:
>
> [error] o.a.s.e.Executor - Exception in task 0.0 in stage 41.0 (TID 32)
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3236) ~[na:1.8.0_45]
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
> ~[na:1.8.0_45]
> at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> ~[na:1.8.0_45]
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> ~[na:1.8.0_45]
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
> ~[na:1.8.0_45]
> [error] o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception in
> thread Thread[Executor task launch worker-1,5,main]
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3236) ~[na:1.8.0_45]
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
> ~[na:1.8.0_45]
> at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> ~[na:1.8.0_45]
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> ~[na:1.8.0_45]
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
> ~[na:1.8.0_45]
> [error] o.a.s.s.TaskSetManager - Task 0 in stage 41.0 failed 1 times;
> aborting job
> [warn] application - Can't write to /tmp/err1433498283479.csv: {}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 41.0 failed 1 times, most recent failure: Lost task 0.0 in stage
> 41.0 (TID 32, localhost): java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
> at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
> at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
> ~[spark-core_2.10-1.3.1.jar:1.3.1]
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
> ~[spark-core_2.10-1.3.1.jar:1.3.1]
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
> ~[spark-core_2.10-1.3.1.jar:1.3.1]
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> ~[scala-library-2.10.5.jar:na]
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> ~[scala-library-2.10.5.jar:na]
>
>
> if this rdd.toLocalIterator.foreach(...) doesn't work, what is the better
> solution?
>
> Best Regards
> Marcos
>
>
>


Re: Saving calculation to single local file

2015-06-05 Thread ayan guha
Just repartition to 1 partition before writing.

On Fri, Jun 5, 2015 at 3:46 PM, marcos rebelo  wrote:

> Hi all
>
> I'm running spark in a single local machine, no hadoop, just reading and
> writing in local disk.
>
> I need to have a single file as output of my calculation.
>
> if I do "rdd.saveAsTextFile(...)" all runs ok but I get a lot of files.
> Since I need a single file I was considering to do something like:
>
>   Try {new FileWriter(outputPath)} match {
> case Success(writer) =>
>   try {
> rdd.toLocalIterator.foreach({line =>
>   val str = line.toString
>   writer.write(str)
> }
>   }
> }
> ...
>   }
>
>
> I get:
>
> [error] o.a.s.e.Executor - Exception in task 0.0 in stage 41.0 (TID 32)
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3236) ~[na:1.8.0_45]
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
> ~[na:1.8.0_45]
> at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> ~[na:1.8.0_45]
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> ~[na:1.8.0_45]
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
> ~[na:1.8.0_45]
> [error] o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception in
> thread Thread[Executor task launch worker-1,5,main]
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3236) ~[na:1.8.0_45]
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
> ~[na:1.8.0_45]
> at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> ~[na:1.8.0_45]
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> ~[na:1.8.0_45]
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
> ~[na:1.8.0_45]
> [error] o.a.s.s.TaskSetManager - Task 0 in stage 41.0 failed 1 times;
> aborting job
> [warn] application - Can't write to /tmp/err1433498283479.csv: {}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 41.0 failed 1 times, most recent failure: Lost task 0.0 in stage
> 41.0 (TID 32, localhost): java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
> at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
> at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
> ~[spark-core_2.10-1.3.1.jar:1.3.1]
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
> ~[spark-core_2.10-1.3.1.jar:1.3.1]
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
> ~[spark-core_2.10-1.3.1.jar:1.3.1]
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> ~[scala-library-2.10.5.jar:na]
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> ~[scala-library-2.10.5.jar:na]
>
>
> if this rdd.toLocalIterator.foreach(...) doesn't work, what is the better
> solution?
>
> Best Regards
> Marcos
>
>
>


-- 
Best Regards,
Ayan Guha


Saving calculation to single local file

2015-06-05 Thread marcos rebelo
Hi all

I'm running Spark on a single local machine, no Hadoop, just reading and
writing on the local disk.

I need to have a single file as the output of my calculation.

If I do "rdd.saveAsTextFile(...)" all runs ok, but I get a lot of files.
Since I need a single file, I was considering doing something like:

  Try { new FileWriter(outputPath) } match {
    case Success(writer) =>
      try {
        rdd.toLocalIterator.foreach { line =>
          val str = line.toString
          writer.write(str)
        }
      }
      ...
  }


I get:

[error] o.a.s.e.Executor - Exception in task 0.0 in stage 41.0 (TID 32)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236) ~[na:1.8.0_45]
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
~[na:1.8.0_45]
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
~[na:1.8.0_45]
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
~[na:1.8.0_45]
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
~[na:1.8.0_45]
[error] o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception in
thread Thread[Executor task launch worker-1,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236) ~[na:1.8.0_45]
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
~[na:1.8.0_45]
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
~[na:1.8.0_45]
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
~[na:1.8.0_45]
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
~[na:1.8.0_45]
[error] o.a.s.s.TaskSetManager - Task 0 in stage 41.0 failed 1 times;
aborting job
[warn] application - Can't write to /tmp/err1433498283479.csv: {}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 41.0 failed 1 times, most recent failure: Lost task 0.0 in stage
41.0 (TID 32, localhost): java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
~[spark-core_2.10-1.3.1.jar:1.3.1]
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
~[spark-core_2.10-1.3.1.jar:1.3.1]
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
~[spark-core_2.10-1.3.1.jar:1.3.1]
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
~[scala-library-2.10.5.jar:na]
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
~[scala-library-2.10.5.jar:na]


if this rdd.toLocalIterator.foreach(...) doesn't work, what is the better
solution?

Best Regards
Marcos


RE: How to increase the number of tasks

2015-06-05 Thread Evo Eftimov
The param is for “Default number of partitions in RDDs returned by 
transformations like join, reduceByKey, and parallelize when NOT set by user.”

 

While Deepak is setting the number of partitions EXPLICITLY 

 

From: 李铖 [mailto:lidali...@gmail.com] 
Sent: Friday, June 5, 2015 11:08 AM
To: ÐΞ€ρ@Ҝ (๏̯͡๏)
Cc: Evo Eftimov; user
Subject: Re: How to increase the number of tasks

 

Just multiply 2-4 by the CPU core count of each node.

 

2015-06-05 18:04 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) :

I did not change spark.default.parallelism,

What is recommended value for it. 

 

On Fri, Jun 5, 2015 at 3:31 PM, 李铖  wrote:

Did you try changing the value of 'spark.default.parallelism' to a bigger 
number?

 

2015-06-05 17:56 GMT+08:00 Evo Eftimov :

It may be that your system runs out of resources (ie 174 is the ceiling) due to 
the following 

 

1.   RDD Partition = (Spark) Task

2.   RDD Partition != (Spark) Executor

3.   (Spark) Task != (Spark) Executor

4.   (Spark) Task = JVM Thread

5.   (Spark) Executor = JVM instance 

 

From: ÐΞ€ρ@Ҝ (๏̯͡๏) [mailto:deepuj...@gmail.com] 
Sent: Friday, June 5, 2015 10:48 AM
To: user
Subject: How to increase the number of tasks

 

I have a  stage that spawns 174 tasks when i run repartition on avro data. 

Tasks read between 512/317/316/214/173  MB of data. Even if i increase number 
of executors/ number of partitions (when calling repartition) the number of 
tasks launched remains fixed to 174.

 

1) I want to speed up this task. How do i do it ?

2) Few tasks finish in 20 mins, few in 15 and few in less than 10. Why is this 
behavior ?

Since this is a repartition stage, it should not depend on the nature of data.

 

Its taking more than 30 mins and i want to speed it up by throwing more 
executors at it.

 

Please suggest

 

Deepak

 

 





 

-- 

Deepak

 

 



Re: How to increase the number of tasks

2015-06-05 Thread 李铖
Just multiply 2-4 by the CPU core count of each node.
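As a rough illustration of that heuristic (the helper and the cluster sizes below are hypothetical, not a Spark API):

```python
def suggested_parallelism(total_cluster_cores, factor=3):
    """Rule of thumb from the thread: 2-4 tasks per CPU core in the cluster."""
    assert 2 <= factor <= 4, "the thread suggests a factor between 2 and 4"
    return total_cluster_cores * factor

# e.g. a 4-node cluster with 8 cores per node:
assert suggested_parallelism(4 * 8) == 96

# The result would then be used as the value of spark.default.parallelism,
# e.g. (not executed here): conf.set("spark.default.parallelism", "96")
```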

2015-06-05 18:04 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) :

> I did not change spark.default.parallelism,
> What is recommended value for it.
>
> On Fri, Jun 5, 2015 at 3:31 PM, 李铖  wrote:
>
>> Did you try changing the value of 'spark.default.parallelism' to a
>> bigger number?
>>
>> 2015-06-05 17:56 GMT+08:00 Evo Eftimov :
>>
>>> It may be that your system runs out of resources (ie 174 is the ceiling)
>>> due to the following
>>>
>>>
>>>
>>> 1.   RDD Partition = (Spark) Task
>>>
>>> 2.   RDD Partition != (Spark) Executor
>>>
>>> 3.   (Spark) Task != (Spark) Executor
>>>
>>> 4.   (Spark) Task = JVM Thread
>>>
>>> 5.   (Spark) Executor = JVM instance
>>>
>>>
>>>
>>> *From:* ÐΞ€ρ@Ҝ (๏̯͡๏) [mailto:deepuj...@gmail.com]
>>> *Sent:* Friday, June 5, 2015 10:48 AM
>>> *To:* user
>>> *Subject:* How to increase the number of tasks
>>>
>>>
>>>
>>> I have a  stage that spawns 174 tasks when i run repartition on avro
>>> data.
>>>
>>> Tasks read between 512/317/316/214/173  MB of data. Even if i increase
>>> number of executors/ number of partitions (when calling repartition) the
>>> number of tasks launched remains fixed to 174.
>>>
>>>
>>>
>>> 1) I want to speed up this task. How do i do it ?
>>>
>>> 2) Few tasks finish in 20 mins, few in 15 and few in less than 10. Why
>>> is this behavior ?
>>>
>>> Since this is a repartition stage, it should not depend on the nature of
>>> data.
>>>
>>>
>>>
>>> Its taking more than 30 mins and i want to speed it up by throwing more
>>> executors at it.
>>>
>>>
>>>
>>> Please suggest
>>>
>>>
>>>
>>> Deepak
>>>
>>>
>>>
>>
>>
>
>
> --
> Deepak
>
>


Re: How to increase the number of tasks

2015-06-05 Thread ๏̯͡๏
I did not change spark.default.parallelism,
What is recommended value for it.

On Fri, Jun 5, 2015 at 3:31 PM, 李铖  wrote:

> Did you try changing the value of 'spark.default.parallelism' to a
> bigger number?
>
> 2015-06-05 17:56 GMT+08:00 Evo Eftimov :
>
>> It may be that your system runs out of resources (ie 174 is the ceiling)
>> due to the following
>>
>>
>>
>> 1.   RDD Partition = (Spark) Task
>>
>> 2.   RDD Partition != (Spark) Executor
>>
>> 3.   (Spark) Task != (Spark) Executor
>>
>> 4.   (Spark) Task = JVM Thread
>>
>> 5.   (Spark) Executor = JVM instance
>>
>>
>>
>> *From:* ÐΞ€ρ@Ҝ (๏̯͡๏) [mailto:deepuj...@gmail.com]
>> *Sent:* Friday, June 5, 2015 10:48 AM
>> *To:* user
>> *Subject:* How to increase the number of tasks
>>
>>
>>
>> I have a  stage that spawns 174 tasks when i run repartition on avro
>> data.
>>
>> Tasks read between 512/317/316/214/173  MB of data. Even if i increase
>> number of executors/ number of partitions (when calling repartition) the
>> number of tasks launched remains fixed to 174.
>>
>>
>>
>> 1) I want to speed up this task. How do i do it ?
>>
>> 2) Few tasks finish in 20 mins, few in 15 and few in less than 10. Why is
>> this behavior ?
>>
>> Since this is a repartition stage, it should not depend on the nature of
>> data.
>>
>>
>>
>> Its taking more than 30 mins and i want to speed it up by throwing more
>> executors at it.
>>
>>
>>
>> Please suggest
>>
>>
>>
>> Deepak
>>
>>
>>
>
>


-- 
Deepak


RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
Oops, @Yiannis, sorry to be a party pooper, but the Job Server is for Spark 
Batch jobs (besides, anyone can put something like that together in 5 min), while I am 
under the impression that Dmitry is working on a Spark Streaming app 

 

Besides the Job Server is essentially for sharing the Spark Context between 
multiple threads 

 

Re Dmitry's initial question – you can load large data sets as a Batch (Static) RDD 
from any Spark Streaming app and then join DStream RDDs against them to 
emulate “lookups”; you can also try the “Lookup RDD” – there is a GitHub 
project
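The join-against-a-static-RDD pattern can be sketched in plain Python. This is only an emulation of the data flow, with made-up keys and values; in Spark Streaming the real calls would be along the lines of `staticRdd = sc.textFile(...).map(parse).cache()` and `dstream.transform(lambda rdd: rdd.join(staticRdd))`:

```python
# "Static" keyed dataset standing in for a cached batch RDD (e.g. a large
# dictionary loaded once at startup).
static_data = {"u1": "premium", "u2": "free"}

def join_batch(batch, static):
    """Join one micro-batch of (key, value) pairs against the static dataset,
    keeping only keys present on both sides (inner-join semantics)."""
    return [(k, (v, static[k])) for k, v in batch if k in static]

# One incoming micro-batch; "u3" has no dictionary entry and is dropped.
batch = [("u1", 10), ("u3", 5), ("u2", 7)]
assert join_batch(batch, static_data) == [("u1", (10, "premium")),
                                          ("u2", (7, "free"))]
```

The point of the pattern is that the large dataset is loaded and cached once, while each streaming micro-batch only pays for the join.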

 

From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] 
Sent: Friday, June 5, 2015 12:12 AM
To: Yiannis Gkoufas
Cc: Olivier Girardot; user@spark.apache.org
Subject: Re: How to share large resources like dictionaries while processing 
data with Spark ?

 

Thanks so much, Yiannis, Olivier, Huang!

 

On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas  wrote:

Hi there,

 

I would recommend checking out 
https://github.com/spark-jobserver/spark-jobserver which I think gives the 
functionality you are looking for.

I haven't tested it though.

 

BR

 

On 5 June 2015 at 01:35, Olivier Girardot  wrote:

You can use it as a broadcast variable, but if it's "too" large (more than 1Gb 
I guess), you may need to share it joining this using some kind of key to the 
other RDDs.

But this is the kind of thing broadcast variables were designed for.

 

Regards, 

 

Olivier.

 

Le jeu. 4 juin 2015 à 23:50, dgoldenberg  a écrit :

We have some pipelines defined where sometimes we need to load potentially
large resources such as dictionaries.

What would be the best strategy for sharing such resources among the
transformations/actions within a consumer?  Can they be shared somehow
across the RDD's?

I'm looking for a way to load such a resource once into the cluster memory
and have it be available throughout the lifecycle of a consumer...

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


 

 



Re: How to increase the number of tasks

2015-06-05 Thread 李铖
Did you have a change of the value of 'spark.default.parallelism'?be a
bigger number.

2015-06-05 17:56 GMT+08:00 Evo Eftimov :

> It may be that your system runs out of resources (ie 174 is the ceiling)
> due to the following
>
>
>
> 1.   RDD Partition = (Spark) Task
>
> 2.   RDD Partition != (Spark) Executor
>
> 3.   (Spark) Task != (Spark) Executor
>
> 4.   (Spark) Task = JVM Thread
>
> 5.   (Spark) Executor = JVM instance
>
>
>
> *From:* ÐΞ€ρ@Ҝ (๏̯͡๏) [mailto:deepuj...@gmail.com]
> *Sent:* Friday, June 5, 2015 10:48 AM
> *To:* user
> *Subject:* How to increase the number of tasks
>
>
>
> I have a  stage that spawns 174 tasks when i run repartition on avro data.
>
> Tasks read between 512/317/316/214/173  MB of data. Even if i increase
> number of executors/ number of partitions (when calling repartition) the
> number of tasks launched remains fixed to 174.
>
>
>
> 1) I want to speed up this task. How do i do it ?
>
> 2) Few tasks finish in 20 mins, few in 15 and few in less than 10. Why is
> this behavior ?
>
> Since this is a repartition stage, it should not depend on the nature of
> data.
>
>
>
> Its taking more than 30 mins and i want to speed it up by throwing more
> executors at it.
>
>
>
> Please suggest
>
>
>
> Deepak
>
>
>


RE: How to increase the number of tasks

2015-06-05 Thread Evo Eftimov
It may be that your system runs out of resources (i.e. 174 is the ceiling) due to 
the following 

 

1.   RDD Partition = (Spark) Task

2.   RDD Partition != (Spark) Executor

3.   (Spark) Task != (Spark) Executor

4.   (Spark) Task = JVM Thread

5.   (Spark) Executor = JVM instance 

 

From: ÐΞ€ρ@Ҝ (๏̯͡๏) [mailto:deepuj...@gmail.com] 
Sent: Friday, June 5, 2015 10:48 AM
To: user
Subject: How to increase the number of tasks

 

I have a  stage that spawns 174 tasks when i run repartition on avro data. 

Tasks read between 512/317/316/214/173  MB of data. Even if i increase number 
of executors/ number of partitions (when calling repartition) the number of 
tasks launched remains fixed to 174.

 

1) I want to speed up this task. How do i do it ?

2) Few tasks finish in 20 mins, few in 15 and few in less than 10. Why is this 
behavior ?

Since this is a repartition stage, it should not depend on the nature of data.

 

Its taking more than 30 mins and i want to speed it up by throwing more 
executors at it.

 

Please suggest

 

Deepak

 



Articles related with how spark handles spark components(Driver,Worker,Executor, Task) failure

2015-06-05 Thread bit1...@163.com
Hi,
I am looking for articles/blogs about how Spark handles failures of its various
components, such as the Driver, Worker, Executor, and Task.
Are there any articles/blogs on this topic? Details down to the source code
would be best.

Thanks very much!



bit1...@163.com


How to increase the number of tasks

2015-06-05 Thread ๏̯͡๏
I have a stage that spawns 174 tasks when I run repartition on Avro data.
Tasks read between 512/317/316/214/173 MB of data. Even if I increase the
number of executors or the number of partitions (when calling repartition), the
number of tasks launched remains fixed at 174.

1) I want to speed up this stage. How do I do it?
2) A few tasks finish in 20 mins, a few in 15, and a few in less than 10. Why
is this the behavior?
Since this is a repartition stage, it should not depend on the nature of the
data.

It's taking more than 30 mins and I want to speed it up by throwing more
executors at it.

Please suggest.

Deepak


Re: FetchFailed Exception

2015-06-05 Thread patcharee

Hi,

I had this problem before, and in my case it was because the
executor/container was killed by YARN when it used more memory than
allocated. You can check whether your case is the same by looking at the
YARN node manager log.
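If the node manager log (retrievable with `yarn logs -applicationId <appId>` once the application finishes) does confirm a container kill for exceeding memory limits, one commonly used mitigation on Spark 1.x under YARN is raising the per-container memory overhead. The values below are illustrative assumptions to tune, not recommendations:

```
# spark-defaults.conf (sketch; tune the values for your cluster)
spark.yarn.executor.memoryOverhead   1024
spark.yarn.driver.memoryOverhead     1024
```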


Best,
Patcharee

On 05. juni 2015 07:25, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:

I see this

Is this a problem with my code or the cluster ? Is there any way to 
fix it ?


FetchFailed(BlockManagerId(2, phxdpehdc9dn2441.stratus.phx.ebay.com 
, 59574), shuffleId=1, 
mapId=80, reduceId=20, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
phxdpehdc9dn2441.stratus.phx.ebay.com/10.115.81.47:59574 

at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)

at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to 
phxdpehdc9dn2441.stratus.phx.ebay.com/10.115.81.47:59574 

at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
... 3 more
Caused by: java.net.ConnectException: Connection refused: 
phxdpehdc9dn2441.stratus.phx.ebay.com/10.115.81.47:59574 


at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptim

Re: Spark ML decision list

2015-06-05 Thread Sateesh Kavuri
Is there an existing way in SparkML to convert a decision tree to a
decision list?

On Thu, Jun 4, 2015 at 10:50 PM, Reza Zadeh  wrote:

> The closest algorithm to decision lists that we have is decision trees
> https://spark.apache.org/docs/latest/mllib-decision-tree.html
>
> On Thu, Jun 4, 2015 at 2:14 AM, Sateesh Kavuri 
> wrote:
>
>> Hi,
>>
>> I have used weka machine learning library for generating a model for my
>> training set. I have used the PART algorithm (decision lists) from weka.
>>
>> Now, I would like to use spark ML for the PART algo for my training set
>> and could not seem to find a parallel. Could anyone point out the
>> corresponding algorithm or even if its available in Spark ML?
>>
>> Thanks,
>> Sateesh
>>
>
>


Re: Setting S3 output file grantees for spark output files

2015-06-05 Thread Akhil Das
You could try adding the configuration to the spark-defaults.conf file. Once
you run the application, you can check the Environment tab on the driver UI
(which runs on port 4040) to see whether the configuration was picked up.

Thanks
Best Regards

On Thu, Jun 4, 2015 at 8:40 PM, Justin Steigel  wrote:

> Hi all,
>
> I'm running Spark on AWS EMR and I'm having some issues getting the
> correct permissions on the output files using
> rdd.saveAsTextFile('').  In hive, I would add a line in the
> beginning of the script with
>
> set fs.s3.canned.acl=BucketOwnerFullControl
>
> and that would set the correct grantees for the files. For Spark, I tried
> adding the permissions as a --conf option:
>
> hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
> /home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master
> yarn-cluster \
> --conf "spark.driver.extraJavaOptions
> -Dfs.s3.canned.acl=BucketOwnerFullControl" \
> hdfs:///user/hadoop/spark.py
>
> But the permissions do not get set properly on the output files. What is
> the proper way to pass in the 'fs.s3.canned.acl=BucketOwnerFullControl' or
> any of the S3 canned permissions to the spark job?
>
> Thanks in advance
>
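For reference, the spark-defaults.conf route suggested above might look like the fragment below. The property name is taken from the thread; the `spark.hadoop.` prefix (which copies the key into the Hadoop Configuration) is my assumption about how to surface it, unverified on EMR:

```
# spark-defaults.conf (sketch, not verified on EMR)
spark.hadoop.fs.s3.canned.acl   BucketOwnerFullControl
```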


Re: SparkSQL : using Hive UDF returning Map throws "rror: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)"

2015-06-05 Thread Okehee Goh
I will.. it will be great if simple UDFs can return complex types.
Thanks!

On Fri, Jun 5, 2015 at 12:17 AM, Cheng, Hao  wrote:
> Confirmed: with the latest master we don't support complex data types for
> simple Hive UDFs. Do you mind filing an issue in JIRA?
>
> -Original Message-
> From: Cheng, Hao [mailto:hao.ch...@intel.com]
> Sent: Friday, June 5, 2015 12:35 PM
> To: ogoh; user@spark.apache.org
> Subject: RE: SparkSQL : using Hive UDF returning Map throws "rror: 
> scala.MatchError: interface java.util.Map (of class java.lang.Class) 
> (state=,code=0)"
>
> Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0?
>
> -Original Message-
> From: ogoh [mailto:oke...@gmail.com]
> Sent: Friday, June 5, 2015 10:10 AM
> To: user@spark.apache.org
> Subject: SparkSQL : using Hive UDF returning Map throws "rror: 
> scala.MatchError: interface java.util.Map (of class java.lang.Class) 
> (state=,code=0)"
>
>
> Hello,
> I tested some custom UDFs on SparkSQL's ThriftServer & Beeline (Spark 1.3.1).
> Some UDFs work fine (taking an array parameter and returning an int or string
> type), but my UDF returning a map type throws an error:
> "Error: scala.MatchError: interface java.util.Map (of class java.lang.Class)
> (state=,code=0)"
>
> I converted the code into Hive's GenericUDF since I hoped that a complex-type
> parameter (array of map) and a complex return type (map) would be supported
> by GenericUDF rather than by a simple UDF.
> But SparkSQL doesn't seem to support GenericUDF (error message: Error:
> java.lang.IllegalAccessException: Class
> org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..).
>
> Below is my example udf code returning MAP type.
> I appreciate any advice.
> Thanks
>
> --
>
> public final class ArrayToMap extends UDF {
>
>     public Map<String, String> evaluate(ArrayList<String> arrayOfString) {
>         // add code to handle all index problems
>
>         Map<String, String> map = new HashMap<String, String>();
>
>         int count = 0;
>         for (String element : arrayOfString) {
>             map.put(count + "", element);
>             count++;
>         }
>         return map;
>     }
> }
>
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-using-Hive-UDF-returning-Map-throws-rror-scala-MatchError-interface-java-util-Map-of-class--tp23164.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
> commands, e-mail: user-h...@spark.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

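For what it's worth, stripped of the Hive plumbing the UDF discussed above just builds an index -> value map; a minimal Python sketch of the same logic (illustration only, not a SparkSQL workaround):

```python
def array_to_map(strings):
    """Map each element's position (as a string key) to the element,
    mirroring the ArrayToMap UDF's evaluate() logic."""
    return {str(i): s for i, s in enumerate(strings)}

print(array_to_map(["a", "b", "c"]))  # -> {'0': 'a', '1': 'b', '2': 'c'}
```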



RE: SparkSQL : using Hive UDF returning Map throws "rror: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)"

2015-06-05 Thread Cheng, Hao
Confirmed: with the latest master we don't support complex data types for
simple Hive UDFs. Do you mind filing an issue in JIRA?

-Original Message-
From: Cheng, Hao [mailto:hao.ch...@intel.com] 
Sent: Friday, June 5, 2015 12:35 PM
To: ogoh; user@spark.apache.org
Subject: RE: SparkSQL : using Hive UDF returning Map throws "rror: 
scala.MatchError: interface java.util.Map (of class java.lang.Class) 
(state=,code=0)"

Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0?

-Original Message-
From: ogoh [mailto:oke...@gmail.com] 
Sent: Friday, June 5, 2015 10:10 AM
To: user@spark.apache.org
Subject: SparkSQL : using Hive UDF returning Map throws "rror: 
scala.MatchError: interface java.util.Map (of class java.lang.Class) 
(state=,code=0)"


Hello,
I tested some custom UDFs on SparkSQL's ThriftServer & Beeline (Spark 1.3.1).
Some UDFs work fine (taking an array parameter and returning an int or string
type), but my UDF returning a map type throws an error:
"Error: scala.MatchError: interface java.util.Map (of class java.lang.Class)
(state=,code=0)"

I converted the code into Hive's GenericUDF since I hoped that a complex-type
parameter (array of map) and a complex return type (map) would be supported
by GenericUDF rather than by a simple UDF.
But SparkSQL doesn't seem to support GenericUDF (error message: Error:
java.lang.IllegalAccessException: Class
org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..).

Below is my example udf code returning MAP type.
I appreciate any advice.
Thanks

--

public final class ArrayToMap extends UDF {

    public Map<String, String> evaluate(ArrayList<String> arrayOfString) {
        // add code to handle all index problems

        Map<String, String> map = new HashMap<String, String>();

        int count = 0;
        for (String element : arrayOfString) {
            map.put(count + "", element);
            count++;
        }
        return map;
    }
}









Re: SparkSQL : using Hive UDF returning Map throws "rror: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)"

2015-06-05 Thread Okehee Goh
It is Spark 1.3.1.e (an AWS release; I think it is close to Spark 1.3.1
with some bug fixes).

My earlier report that GenericUDF does not work in SparkSQL was wrong. I
tested with an open-source GenericUDF and it worked fine; only my GenericUDF
that returns a Map type didn't work. Sorry for the false report.



On Thu, Jun 4, 2015 at 9:35 PM, Cheng, Hao  wrote:
> Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0?
>
> -Original Message-
> From: ogoh [mailto:oke...@gmail.com]
> Sent: Friday, June 5, 2015 10:10 AM
> To: user@spark.apache.org
> Subject: SparkSQL : using Hive UDF returning Map throws "rror: 
> scala.MatchError: interface java.util.Map (of class java.lang.Class) 
> (state=,code=0)"
>
>
> Hello,
> I tested some custom UDFs on SparkSQL's ThriftServer & Beeline (Spark 1.3.1).
> Some UDFs work fine (taking an array parameter and returning an int or string
> type), but my UDF returning a map type throws an error:
> "Error: scala.MatchError: interface java.util.Map (of class java.lang.Class)
> (state=,code=0)"
>
> I converted the code into Hive's GenericUDF since I hoped that a complex-type
> parameter (array of map) and a complex return type (map) would be supported
> by GenericUDF rather than by a simple UDF.
> But SparkSQL doesn't seem to support GenericUDF (error message: Error:
> java.lang.IllegalAccessException: Class
> org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..).
>
> Below is my example udf code returning MAP type.
> I appreciate any advice.
> Thanks
>
> --
>
> public final class ArrayToMap extends UDF {
>
>     public Map<String, String> evaluate(ArrayList<String> arrayOfString) {
>         // add code to handle all index problems
>
>         Map<String, String> map = new HashMap<String, String>();
>
>         int count = 0;
>         for (String element : arrayOfString) {
>             map.put(count + "", element);
>             count++;
>         }
>         return map;
>     }
> }
>
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-using-Hive-UDF-returning-Map-throws-rror-scala-MatchError-interface-java-util-Map-of-class--tp23164.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
> commands, e-mail: user-h...@spark.apache.org
>




Access several s3 buckets, with credentials containing "/"

2015-06-05 Thread Pierre B
Hi list!

My problem is quite simple.
I need to access several S3 buckets, using different credentials.:
```
val c1 =
sc.textFile("s3n://[ACCESS_KEY_ID1:SECRET_ACCESS_KEY1]@bucket1/file.csv").count
val c2 =
sc.textFile("s3n://[ACCESS_KEY_ID2:SECRET_ACCESS_KEY2]@bucket2/file.csv").count
val c3 =
sc.textFile("s3n://[ACCESS_KEY_ID3:SECRET_ACCESS_KEY3]@bucket3/file.csv").count
...
```

One or several of those AWS credentials might contain "/" in the secret access
key.
This is a known problem and, from my research, the only ways to deal with
these "/" are:
1/ use environment variables to set the AWS credentials, then access the s3
buckets without specifying the credentials
2/ set the Hadoop configuration to contain the credentials.

However, neither of these solutions allows me to access different buckets with
different credentials.

Can anyone help me on this?

Thanks

Pierre
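One further workaround that is sometimes suggested is URL-encoding the slash before embedding the secret key in the s3n:// URL. Reports on whether Hadoop's s3n handler decodes it correctly are mixed, so treat this as an assumption to test; regenerating keys until they contain no "/" is the more reliable fix. A Python 3 sketch:

```python
from urllib.parse import quote

# Hypothetical secret access key containing "/" (not a real credential).
secret = "abc/def+ghi"

# Percent-encode everything, including "/" and "+".
encoded = quote(secret, safe="")
print(encoded)  # -> abc%2Fdef%2Bghi

url = "s3n://ACCESS_KEY_ID:%s@bucket1/file.csv" % encoded
```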






Access several s3 buckets, with credentials containing "/"

2015-06-05 Thread Pierre B
Hi list!

My problem is quite simple.
I need to access several S3 buckets, using different credentials.:
```
val c1 =
sc.textFile("s3n://[ACCESS_KEY_ID1:SECRET_ACCESS_KEY1]@bucket/file1.csv").count
val c2 =
sc.textFile("s3n://[ACCESS_KEY_ID2:SECRET_ACCESS_KEY2]@bucket/file1.csv").count
val c3 =
sc.textFile("s3n://[ACCESS_KEY_ID3:SECRET_ACCESS_KEY3]@bucket/file1.csv").count
...
```

One or several of those AWS credentials might contain "/" in the secret access
key.
This is a known problem and, from my research, the only ways to deal with
these "/" are:
1/ use environment variables to set the AWS credentials, then access the s3
buckets without specifying the credentials
2/ set the Hadoop configuration to contain the credentials.

However, neither of these solutions allows me to access different buckets with
different credentials.

Can anyone help me on this?

Thanks

Pierre






Avro or Parquet ?

2015-06-05 Thread ๏̯͡๏
We currently have data in Avro format and we do joins between Avro and
sequence file data.
Will storing these datasets in Parquet make the joins any faster?

The dataset sizes are between 500 and 1000 GB.
-- 
Deepak