Re: Spark on Mesos 0.20
Hi Gurvinder,

Is there a SPARK ticket tracking the issue you describe?

On Mon, Oct 6, 2014 at 2:44 AM, Gurvinder Singh wrote:

> On 10/06/2014 08:19 AM, Fairiz Azizi wrote:
>
> > The Spark online docs indicate that Spark is compatible with Mesos 0.18.1.
> > I've gotten it to work just fine on 0.18.1 and 0.18.2.
> > Has anyone tried Spark on a newer version of Mesos, i.e. Mesos v0.20.0?
> >
> > -Fi
>
> Yeah, we are using Spark 1.1.0 with Mesos 0.20.1. It runs fine in coarse-grained mode;
> in fine-grained mode there is an issue with block manager name conflicts. I have been
> waiting for it to be fixed, but it is still there.
>
> -Gurvinder
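Since both posters report that coarse-grained mode works on Mesos 0.20.x while fine-grained mode hits the block manager name conflict, here is a minimal sketch of forcing coarse-grained mode as a workaround; the Mesos master URL and resource values are placeholders, not taken from this thread.

    import org.apache.spark.{SparkConf, SparkContext}

    // Force coarse-grained Mesos mode as a workaround for the fine-grained
    // block manager conflict reported above.
    val conf = new SparkConf()
      .setAppName("spark-on-mesos")
      .setMaster("mesos://mesos-master:5050")   // placeholder Mesos master URL
      .set("spark.mesos.coarse", "true")        // coarse-grained mode
      .set("spark.cores.max", "8")              // illustrative core cap for coarse mode
    val sc = new SparkContext(conf)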
Re: Spark on Mesos 0.20
On 10/06/2014 08:19 AM, Fairiz Azizi wrote:

> The Spark online docs indicate that Spark is compatible with Mesos 0.18.1.
> I've gotten it to work just fine on 0.18.1 and 0.18.2.
> Has anyone tried Spark on a newer version of Mesos, i.e. Mesos v0.20.0?
>
> -Fi

Yeah, we are using Spark 1.1.0 with Mesos 0.20.1. It runs fine in coarse-grained mode; in fine-grained mode there is an issue with block manager name conflicts. I have been waiting for it to be fixed, but it is still there.

-Gurvinder
Spark on Mesos 0.20
The Spark online docs indicate that Spark is compatible with Mesos 0.18.1.

I've gotten it to work just fine on 0.18.1 and 0.18.2.

Has anyone tried Spark on a newer version of Mesos, i.e. Mesos v0.20.0?

-Fi
Spark SQL on a large Hive table produces strange output on version 1.0.2
Dear Developers,

I'm currently limited to using Spark 1.0.2. I use Spark SQL on a Hive table to load the AMPLab benchmark, which is approximately 25.6 GiB.

I run:

CREATE EXTERNAL TABLE uservisits (sourceIP STRING, destURL STRING, visitDate STRING, adRevenue DOUBLE, userAgent STRING, countryCode STRING, languageCode STRING, searchWord STRING, duration INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\001" STORED AS SEQUENCEFILE LOCATION "/public//data/uservisits"

OK!

I run:

SELECT COUNT(*) FROM uservisits

OK! The result is correct. But when I run:

SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 8)

there are some error messages (I have emphasized some of the important ones), mainly two problems:

akka => Timed out
GC => Out of memory

What should I do?

...
14/10/05 23:45:18 INFO MemoryStore: Block broadcast_2 of size 158188 dropped from memory (free 308752285)
14/10/05 23:45:40 ERROR BlockManagerMaster: Failed to remove shuffle 4
akka.pattern.AskTimeoutException: Timed out
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
    at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118)
    at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455)
    at akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407)
    at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411)
    at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
    at java.lang.Thread.run(Thread.java:745)
14/10/05 23:45:47 ERROR BlockManagerMaster: Failed to remove shuffle 0
akka.pattern.AskTimeoutException: Timed out
    [identical stack trace as above]
14/10/05 23:45:46 ERROR BlockManagerMaster: Failed to remove shuffle 2
akka.pattern.AskTimeoutException: Timed out
    [identical stack trace as above]
14/10/05 23:45:45 ERROR BlockManagerMaster: Failed to remove shuffle 1
akka.pattern.AskTimeoutException: Timed out
    [identical stack trace as above]
14/10/05 23:45:40 ERROR BlockManagerMaster: Failed to remove shuffle 3
akka.pattern.AskTimeoutException: Timed out
    [identical stack trace, truncated in the original message]
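A hedged sketch of one common mitigation (not suggested in this thread): raise the number of SQL shuffle partitions so each reduce task holds less data in memory, and lengthen the Akka ask timeout. The memory, timeout, and partition values below are illustrative guesses, not recommendations from the list.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val conf = new SparkConf()
      .setAppName("uservisits-aggregation")
      .set("spark.executor.memory", "4g")      // illustrative value
      .set("spark.akka.askTimeout", "120")     // seconds; default is 30
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // More shuffle partitions => less data per reduce task for the GROUP BY.
    hiveContext.hql("SET spark.sql.shuffle.partitions=400")
    val result = hiveContext.hql(
      "SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue) FROM uservisits " +
      "GROUP BY SUBSTR(sourceIP, 1, 8)")
    result.collect().foreach(println)   // small result: one row per 8-character prefix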
Re: SPARK-3660 : Initial RDD for updateStateByKey transformation
Hello,

I have submitted a pull request ("Adding support of initial value for state update", #2665); please review and let me know. Excited to submit my first pull request.

-Soumitra.

----- Original Message -----
From: "Soumitra Kumar"
To: dev@spark.apache.org
Sent: Tuesday, September 23, 2014 1:28:21 PM
Subject: SPARK-3660 : Initial RDD for updateStateByKey transformation

Hello fellow developers,

Thanks TD for the relevant pointers. I have created an issue: https://issues.apache.org/jira/browse/SPARK-3660

Copying the description from JIRA:

"How do I initialize the state for the updateStateByKey transformation? I have word counts from a previous spark-submit run, and want to load them in the next spark-submit job to continue counting from there.

One proposal is to add the following argument to the updateStateByKey methods:

initial: Option[RDD[(K, S)]] = None

This will maintain backward compatibility as well. I have working code as well.

This thread started on the spark-user list at: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-updateStateByKey-operation-td14772.html"

Please let me know whether I should add a parameter "initial: Option[RDD[(K, S)]] = None" to all updateStateByKey methods or create new ones.

Thanks,
-Soumitra.
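A sketch of how the proposed overload might be used. The `initial` parameter shown in the final comment is the proposal under review in SPARK-3660 / PR #2665, not an existing Spark API; the paths, input source, and update function are illustrative.

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val conf = new SparkConf().setAppName("stateful-word-count")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints")   // placeholder path

    // Word counts saved by a previous spark-submit run (placeholder path and format).
    val previousCounts: RDD[(String, Int)] =
      ssc.sparkContext.objectFile[(String, Int)]("hdfs:///counts/previous")

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))

    // Today this starts counting from zero:
    val counts = words.map(w => (w, 1)).updateStateByKey[Int](updateFunc)

    // With the proposed `initial: Option[RDD[(K, S)]] = None` parameter, the same
    // call could be seeded with the saved state, e.g.:
    //   words.map(w => (w, 1)).updateStateByKey[Int](updateFunc, initial = Some(previousCounts))
    counts.print()

    ssc.start()
    ssc.awaitTermination()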
Hyper Parameter Tuning Algorithms
Found this thread from April:

http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccabjxkq6b7sfaxie4+aqtcmd8jsqbznsxsfw6v5o0wwwouob...@mail.gmail.com%3E

Wondering what the status of this is. We are thinking about implementing these algorithms; it would be a waste if they are already available. Please advise.

Thanks,
Lochana
Re: Parquet schema migrations
Hi Cody,

Assuming you are talking about 'safe' changes to the schema (i.e. existing column names are never reused with incompatible types), this is something I'd love to support. Perhaps you can describe more what sorts of changes you are making, and if simple merging of the schemas would be sufficient. If so, we can open a JIRA, though I'm not sure when we'll have resources to dedicate to this.

In the near term, I'd suggest writing converters for each version of the schema that translate to some desired master schema. You can then union all of these together and avoid the cost of batch conversion. It seems like in most cases this should be pretty efficient, at least now that we have good pushdown past union operators :)

Michael

On Sun, Oct 5, 2014 at 3:58 PM, Andrew Ash wrote:

> Hi Cody,
>
> I wasn't aware there were different versions of the parquet format. What's the difference between "raw parquet" and the Hive-written parquet files?
>
> As for your migration question, the approaches I've often seen are convert-on-read and convert-all-at-once. Apache Cassandra for example does both -- when upgrading between Cassandra versions that change the on-disk sstable format, it will do a convert-on-read as you access the sstables, or you can run the upgradesstables command to convert them all at once post-upgrade.
>
> Andrew
>
> On Fri, Oct 3, 2014 at 4:33 PM, Cody Koeninger wrote:
>
> > Wondering if anyone has thoughts on a path forward for parquet schema migrations, especially for people (like us) that are using raw parquet files rather than Hive.
> >
> > So far we've gotten away with reading old files, converting, and writing to new directories, but that obviously becomes problematic above a certain data size.
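A rough sketch of the "converter per schema version, then union" approach Michael describes, using the SchemaRDD API of that era; the paths, column names, and the placeholder value for the missing column are illustrative assumptions, not from the thread.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-schema-union"))
    val sqlContext = new SQLContext(sc)

    // Each directory holds raw parquet files written with a different historical schema.
    val v1 = sqlContext.parquetFile("hdfs:///events/v1")   // e.g. (id, name)
    val v2 = sqlContext.parquetFile("hdfs:///events/v2")   // e.g. (id, name, country)
    v1.registerTempTable("events_v1")
    v2.registerTempTable("events_v2")

    // Converters: translate each version to the desired master schema.
    val v1AsMaster = sqlContext.sql(
      "SELECT id, name, 'unknown' AS country FROM events_v1")
    val v2AsMaster = sqlContext.sql(
      "SELECT id, name, country FROM events_v2")

    // Query the union instead of rewriting the old files on disk.
    val events = v1AsMaster.unionAll(v2AsMaster)
    events.registerTempTable("events")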
Re: Jython importing pyspark?
PySpark doesn't attempt to support Jython at present. IMO while it might be a bit faster, it would lose a lot of the benefits of Python, which are the very strong data processing libraries (NumPy, SciPy, Pandas, etc). So I'm not sure it's worth supporting unless someone demonstrates a really major performance benefit.

There was actually a recent patch to add PyPy support (https://github.com/apache/spark/pull/2144), which is worth a try if you want Python applications to run faster. It might actually be faster overall than Jython.

Matei

On Oct 5, 2014, at 10:16 AM, Robert C Senkbeil wrote:

> Hi there,
>
> I wanted to ask whether or not anyone has successfully used Jython with the pyspark library. I wasn't sure if the C extension support was needed for pyspark itself or was just a bonus of using Cython.
>
> There was a claim (http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html#a7269) that using Jython would be better - if you didn't need C extension support - because the cost of serialization is lower. However, I have not been able to import pyspark into a Jython session. I'm using version 2.7b3 of Jython and version 1.1.0 of Spark for reference.
>
> Jython 2.7b3 (default:e81256215fb0, Aug 4 2014, 02:39:51)
> [Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_51
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from pyspark import SparkContext, SparkConf
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyspark/__init__.py", line 63, in <module>
>   File "pyspark/context.py", line 25, in <module>
>   File "pyspark/accumulators.py", line 94, in <module>
>   File "pyspark/serializers.py", line 341, in <module>
>   File "pyspark/serializers.py", line 328, in _hijack_namedtuple
> RuntimeError: maximum recursion depth exceeded (Java StackOverflowError)
>
> Is there something I am missing with this? Did Jython ever work for pyspark? The same error happens regardless of whether I use the Python files or compile them down to Java class files using Jython first.
>
> I know that previous documentation (0.9.1) indicated, "PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. We have not tested PySpark with Python 3 or with alternative Python interpreters, such as PyPy or Jython."
>
> In later versions, it now reflects, "Spark 1.1.0 works with Python 2.6 or higher (but not Python 3). It uses the standard CPython interpreter, so C libraries like NumPy can be used."
>
> I'm assuming this means that attempts to use other interpreters failed. If so, are there any plans to support something like Jython in the future?
>
> Signed,
> Chip Senkbeil
Re: Parquet schema migrations
Hi Cody,

I wasn't aware there were different versions of the parquet format. What's the difference between "raw parquet" and the Hive-written parquet files?

As for your migration question, the approaches I've often seen are convert-on-read and convert-all-at-once. Apache Cassandra for example does both -- when upgrading between Cassandra versions that change the on-disk sstable format, it will do a convert-on-read as you access the sstables, or you can run the upgradesstables command to convert them all at once post-upgrade.

Andrew

On Fri, Oct 3, 2014 at 4:33 PM, Cody Koeninger wrote:

> Wondering if anyone has thoughts on a path forward for parquet schema migrations, especially for people (like us) that are using raw parquet files rather than Hive.
>
> So far we've gotten away with reading old files, converting, and writing to new directories, but that obviously becomes problematic above a certain data size.
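A short sketch of the "convert all at once" path for raw parquet files — the read-old, convert, write-new approach Cody describes and Andrew compares to upgradesstables. Paths, the target schema, and the default value for the new column are illustrative assumptions.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-batch-migration"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD

    // Target (master) schema for the migrated files.
    case class EventV2(id: Long, name: String, country: String)

    // Read files written with the old schema, convert each row, and write a
    // new directory; the old directory can be retired afterwards.
    val oldEvents = sqlContext.parquetFile("hdfs:///events/v1")   // old schema: (id, name)
    val migrated = oldEvents.map(r => EventV2(r.getLong(0), r.getString(1), "unknown"))
    migrated.saveAsParquetFile("hdfs:///events/v1-migrated")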
Re: Impact of input format on timing
Hi Tom,

HDFS and Spark don't actually have a minimum block size -- so in that first dataset, the files won't each be costing you 64 MB. However, the main reason for the difference in performance here is probably the number of RDD partitions. In the first case, Spark will create an RDD with one partition per file, while in the second case it will likely have only 1-2 partitions. The number of partitions affects the level of parallelism of operations like reduceByKey (by default, reduceByKey uses as many partitions as the parent RDD it runs on), and in this case, I think it's causing reduceByKey to spill to disk within tasks in the second case and not in the first case, because each task has more input data.

You can see the number of partitions in each stage of your computation on the application web UI at http://<driver>:4040 if you want to confirm this. Also, for both programs, you can manually set the number of partitions for the reduce by passing a second argument to reduceByKey (e.g. reduceByKey(myFunc, 100)). In general it's best to choose this so that the input data for each reduce task fits in memory to avoid spilling.

Matei

On Oct 5, 2014, at 1:58 PM, Tom Hubregtsen wrote:

> Hi,
>
> I ran the same version of a program with two different types of input containing equivalent information.
> Program 1: 10,000 files with on average 50 IDs, one per line
> Program 2: 1 file containing 10,000 lines, with on average 50 IDs per line
>
> My program takes the input, creates key/value pairs from it, and performs about 7 more steps. I have compared the sets of key/value pairs after the initialization phase, and they are equivalent. All other steps are equal. The only difference is using wholeTextFile versus textFile and the initialization input map function itself.
>
> Since I am reading in from HDFS, and the minimum part size there is 64 MB, every file in program 1 will take 64 MB, even though they are only KBs originally. Because of this, I expected program 2 to be faster, or equivalent, as the initialization phase may be negligible.
>
> In my first comparison, I provided the program with sufficient memory, and they both take around 2 minutes. No surprises here.
>
> In my second comparison, I limit the memory, in this case to 4 GB. Program 1 executes in a little over 4 minutes, but program 2 takes over 15 minutes (after which I terminated the program, as I could see it was getting there, but spilling massively in every stage). The difference between the two runs is the amount of spilling in phases later on in the program (*not* in the initialization phase). Program 1 spills 2 chunks per stage:
>
> 14/10/05 14:52:30 INFO ExternalAppendOnlyMap: Thread 241 spilling in-memory map of 327 MB to disk (1 time so far)
> 14/10/05 14:52:30 INFO ExternalAppendOnlyMap: Thread 242 spilling in-memory map of 328 MB to disk (1 time so far)
>
> Program 2 also spills these 2 chunks:
>
> 14/10/05 14:44:35 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory map of 320 MB to disk (1 time so far)
> 14/10/05 14:44:35 INFO ExternalAppendOnlyMap: Thread 241 spilling in-memory map of 323 MB to disk (1 time so far)
>
> But then spills 2 * ~15,000 chunks of <1 MB:
>
> 14/10/05 14:44:41 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory map of 0 MB to disk (15561 times so far)
> 14/10/05 14:44:41 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory map of 0 MB to disk (13866 times so far)
>
> I understand that RDDs with narrow dependencies to their parents are pipelined, and form a stage. These all point to the parent RDD. This parent RDD will have different partitioning and different data placement between the two programs. Because of this, I did expect a difference in this stage, but as we change from JavaRDD to JavaPairRDD and do a reduceByKey after the initialization, I expected this difference to be gone in the next stages. Can anyone explain this behavior, or point me in a direction?
>
> Thanks in advance!
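A minimal sketch of Matei's suggestion in context: pass an explicit partition count as the second argument to reduceByKey so each reduce task's input fits in memory. The path, the parsing, and the value 100 are illustrative, not taken from the thread.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("id-count"))

    // One big input file yields only a few map partitions, so explicitly ask
    // for more reduce partitions to keep each reduce task's data in memory.
    val counts = sc.textFile("hdfs:///input/ids.txt")
      .flatMap(_.split("\\s+"))
      .map(id => (id, 1))
      .reduceByKey(_ + _, 100)   // second argument = number of reduce partitions

    counts.saveAsTextFile("hdfs:///output/id-counts")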
Impact of input format on timing
Hi,

I ran the same version of a program with two different types of input containing equivalent information.
Program 1: 10,000 files with on average 50 IDs, one per line
Program 2: 1 file containing 10,000 lines, with on average 50 IDs per line

My program takes the input, creates key/value pairs from it, and performs about 7 more steps. I have compared the sets of key/value pairs after the initialization phase, and they are equivalent. All other steps are equal. The only difference is using wholeTextFile versus textFile and the initialization input map function itself.

Since I am reading in from HDFS, and the minimum part size there is 64 MB, every file in program 1 will take 64 MB, even though they are only KBs originally. Because of this, I expected program 2 to be faster, or equivalent, as the initialization phase may be negligible.

In my first comparison, I provided the program with sufficient memory, and they both take around 2 minutes. No surprises here.

In my second comparison, I limit the memory, in this case to 4 GB. Program 1 executes in a little over 4 minutes, but program 2 takes over 15 minutes (after which I terminated the program, as I could see it was getting there, but spilling massively in every stage). The difference between the two runs is the amount of spilling in phases later on in the program (*not* in the initialization phase). Program 1 spills 2 chunks per stage:

14/10/05 14:52:30 INFO ExternalAppendOnlyMap: Thread 241 spilling in-memory map of 327 MB to disk (1 time so far)
14/10/05 14:52:30 INFO ExternalAppendOnlyMap: Thread 242 spilling in-memory map of 328 MB to disk (1 time so far)

Program 2 also spills these 2 chunks:

14/10/05 14:44:35 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory map of 320 MB to disk (1 time so far)
14/10/05 14:44:35 INFO ExternalAppendOnlyMap: Thread 241 spilling in-memory map of 323 MB to disk (1 time so far)

But then spills 2 * ~15,000 chunks of <1 MB:

14/10/05 14:44:41 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory map of 0 MB to disk (15561 times so far)
14/10/05 14:44:41 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory map of 0 MB to disk (13866 times so far)

I understand that RDDs with narrow dependencies to their parents are pipelined, and form a stage. These all point to the parent RDD. This parent RDD will have different partitioning and different data placement between the two programs. Because of this, I did expect a difference in this stage, but as we change from JavaRDD to JavaPairRDD and do a reduceByKey after the initialization, I expected this difference to be gone in the next stages. Can anyone explain this behavior, or point me in a direction?

Thanks in advance!
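A hedged sketch (not the poster's code) of the two ingestion paths being compared, with an explicit repartition added for the single-file case; the paths, parsing, and partition count are illustrative assumptions.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("input-format-comparison"))

    // Program 1: many small files; wholeTextFiles yields (path, contents) pairs,
    // roughly one partition per file, with one ID per line inside each file.
    val pairs1 = sc.wholeTextFiles("hdfs:///input/many-files")
      .flatMap { case (_, contents) => contents.split("\n") }
      .map(id => (id.trim, 1))

    // Program 2: one large file with ~50 IDs per line; textFile may give only a
    // couple of partitions, so repartition before the shuffle-heavy steps.
    val pairs2 = sc.textFile("hdfs:///input/one-big-file.txt")
      .repartition(100)            // illustrative partition count
      .flatMap(_.split("\\s+"))
      .map(id => (id, 1))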
Jython importing pyspark?
Hi there,

I wanted to ask whether or not anyone has successfully used Jython with the pyspark library. I wasn't sure if the C extension support was needed for pyspark itself or was just a bonus of using Cython.

There was a claim (http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html#a7269) that using Jython would be better - if you didn't need C extension support - because the cost of serialization is lower. However, I have not been able to import pyspark into a Jython session. I'm using version 2.7b3 of Jython and version 1.1.0 of Spark for reference.

Jython 2.7b3 (default:e81256215fb0, Aug 4 2014, 02:39:51)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_51
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyspark import SparkContext, SparkConf
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyspark/__init__.py", line 63, in <module>
  File "pyspark/context.py", line 25, in <module>
  File "pyspark/accumulators.py", line 94, in <module>
  File "pyspark/serializers.py", line 341, in <module>
  File "pyspark/serializers.py", line 328, in _hijack_namedtuple
RuntimeError: maximum recursion depth exceeded (Java StackOverflowError)

Is there something I am missing with this? Did Jython ever work for pyspark? The same error happens regardless of whether I use the Python files or compile them down to Java class files using Jython first.

I know that previous documentation (0.9.1) indicated, "PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. We have not tested PySpark with Python 3 or with alternative Python interpreters, such as PyPy or Jython."

In later versions, it now reflects, "Spark 1.1.0 works with Python 2.6 or higher (but not Python 3). It uses the standard CPython interpreter, so C libraries like NumPy can be used."

I'm assuming this means that attempts to use other interpreters failed. If so, are there any plans to support something like Jython in the future?

Signed,
Chip Senkbeil