[jira] [Commented] (SPARK-3731) RDD caching stops working in pyspark after some time

2014-10-02 Thread Milan Straka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156125#comment-14156125
 ] 

Milan Straka commented on SPARK-3731:
-

I will get to it later today and attach a dataset and program which exhibit 
this behaviour locally. I believe I will be able to reproduce it, because I 
have seen this behaviour in many local runs.

 RDD caching stops working in pyspark after some time
 

 Key: SPARK-3731
 URL: https://issues.apache.org/jira/browse/SPARK-3731
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
 Environment: Linux, 32bit, both in local mode and in standalone 
 cluster mode
Reporter: Milan Straka
 Attachments: worker.log


 Consider a file F which, when loaded with sc.textFile and cached, takes up 
 slightly more than half of the free memory for the RDD cache.
 When the following is executed in PySpark:
   1) a = sc.textFile(F)
   2) a.cache().count()
   3) b = sc.textFile(F)
   4) b.cache().count()
 and then the following is repeated (for example 10 times):
   a) a.unpersist().cache().count()
   b) b.unpersist().cache().count()
 after some time no RDDs are cached in memory anymore.
 Also, from that point on, no other RDD ever gets cached (the worker always 
 reports something like "WARN CacheManager: Not enough space to cache 
 partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even though 
 rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that 
 all executors have 0MB of memory used (which is consistent with the 
 CacheManager warning).
 When doing the same in Scala, everything works perfectly.
 I understand that this is a vague description, but I do not know how to 
 describe the problem better.
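
For context, a minimal Scala sketch of the equivalent sequence described above 
(the description notes that this Scala variant works fine); the file path is 
an illustrative placeholder:

{code}
// Scala analogue of the PySpark steps listed in the description
// ("hdfs:///path/to/F" is a placeholder path).
val a = sc.textFile("hdfs:///path/to/F")
a.cache().count()
val b = sc.textFile("hdfs:///path/to/F")
b.cache().count()
for (_ <- 1 to 10) {
  a.unpersist().cache().count()
  b.unpersist().cache().count()
}
{code}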






[jira] [Commented] (SPARK-3573) Dataset

2014-10-02 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156129#comment-14156129
 ] 

Patrick Wendell commented on SPARK-3573:


I think people are hung up on the term SQL - SchemaRDD is designed to simply 
represent richer types on top of the core RDD API. In fact, we originally 
thought of naming the package "schema" instead of "sql" for exactly this 
reason. SchemaRDD is in the sql/core package right now, but we could pull the 
public interface of a SchemaRDD into another package in the future (and maybe 
we'd drop exposing anything about the logical plan here).

I'd like to see a common representation of typed data be used across both SQL 
and MLlib and longer term other libraries as well. I don't see an 
insurmountable semantic gap between an R-style data frame and a relational 
table. In fact, if you look across other projects today - almost all of them 
are trying to unify these types of data representations.

So I'd support seeing where maybe we can enhance or extend SchemaRDD to better 
support numeric data sets. And if we find there is just too large of a gap 
here, then we could look at implementing a second dataset abstraction. If 
nothing else this is a test of whether SchemaRDD is sufficiently extensible to 
be useful in contexts beyond SQL (which is its original design).

 Dataset
 ---

 Key: SPARK-3573
 URL: https://issues.apache.org/jira/browse/SPARK-3573
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical

 This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra 
 ML-specific metadata embedded in its schema.
 Sample code:
 Suppose we have training events stored on HDFS and user/ad features in Hive, 
 and we want to assemble features for training and then apply a decision tree.
 The proposed pipeline with dataset looks like the following (needs more 
 refinement):
 {code}
 sqlContext.jsonFile("/path/to/training/events", 0.01).registerTempTable("event")
 val training = sqlContext.sql("""
   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label,
          user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
          ad.targetGender AS targetGender
   FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""").cache()
 val indexer = new Indexer()
 val interactor = new Interactor()
 val fvAssembler = new FeatureVectorAssembler()
 val treeClassifier = new DecisionTreeClassifier()
 val paramMap = new ParamMap()
   .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
   .put(indexer.sortByFrequency, true)
   .put(interactor.features, Map("genderMatch" -> Array("userGender", "targetGender")))
   .put(fvAssembler.features, Map("features" -> Array("genderMatch", "userCountryIndex", "userFeatures")))
   .put(fvAssembler.dense, true)
   .put(treeClassifier.maxDepth, 4) // By default, the classifier recognizes the "features" and "label" columns.
 val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier)
 val model = pipeline.fit(training, paramMap)
 sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
 val test = sqlContext.sql("""
   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
          user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
          ad.targetGender AS targetGender
   FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""")
 val prediction = model.transform(test).select('eventId, 'prediction)
 {code}






[jira] [Resolved] (SPARK-3371) Spark SQL: Renaming a function expression with group by gives error

2014-10-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3371.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2511
[https://github.com/apache/spark/pull/2511]

 Spark SQL: Renaming a function expression with group by gives error
 ---

 Key: SPARK-3371
 URL: https://issues.apache.org/jira/browse/SPARK-3371
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Pei-Lun Lee
 Fix For: 1.2.0


 {code}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val rdd = sc.parallelize(List("""{"foo":"bar"}"""))
 sqlContext.jsonRDD(rdd).registerAsTable("t1")
 sqlContext.registerFunction("len", (s: String) => s.length)
 sqlContext.sql("select len(foo) as a, count(1) from t1 group by len(foo)").collect()
 {code}
 Running the above code in spark-shell gives the following error:
 {noformat}
 14/09/03 17:20:13 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 214)
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
 attribute, tree: foo#0
   at 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
   at 
 org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)
   at 
 org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:201)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:199)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 {noformat}
 Removing "as a" from the query causes no error.
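
A minimal sketch of that workaround, under the same setup as the snippet above 
(the only change is dropping the alias on the UDF result):

{code}
// Same aggregation without the "as a" alias; per the note above,
// this variant does not trigger the binding error.
sqlContext.sql("select len(foo), count(1) from t1 group by len(foo)").collect()
{code}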






[jira] [Created] (SPARK-3762) clear all SparkEnv references after stop

2014-10-02 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3762:
-

 Summary: clear all SparkEnv references after stop
 Key: SPARK-3762
 URL: https://issues.apache.org/jira/browse/SPARK-3762
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Davies Liu
Priority: Critical


SparkEnv is cached in a ThreadLocal object, so after stopping a SparkContext 
and creating a new one, the old SparkEnv is still used by some threads, which 
triggers many problems.

We should clear all the references after stopping a SparkEnv.
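
A minimal, illustrative Scala sketch of the failure mode being described (this 
is not Spark's actual SparkEnv code; the names are made up for illustration):

{code}
// Illustrative only: a ThreadLocal-cached environment keeps serving a stale
// instance after stop() unless every reference is cleared.
class Env(val id: Int)

object Env {
  @volatile private var current: Env = _
  private val local = new ThreadLocal[Env]

  def create(id: Int): Env = { current = new Env(id); current }

  def get: Env = {
    if (local.get == null) local.set(current) // each thread caches what it saw first
    local.get
  }

  def stop(): Unit = {
    current = null
    local.remove() // clears only the calling thread; other threads still hold the old Env
  }
}
{code}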






[jira] [Commented] (SPARK-3759) SparkSubmitDriverBootstrapper should return exit code of driver process

2014-10-02 Thread Eric Eijkelenboom (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156157#comment-14156157
 ] 

Eric Eijkelenboom commented on SPARK-3759:
--

Yes, no problem!

 SparkSubmitDriverBootstrapper should return exit code of driver process
 ---

 Key: SPARK-3759
 URL: https://issues.apache.org/jira/browse/SPARK-3759
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.1.0
 Environment: Linux, Windows, Scala/Java
Reporter: Eric Eijkelenboom
Priority: Minor
   Original Estimate: 24h
  Remaining Estimate: 24h

 SparkSubmitDriverBootstrapper.scala currently always returns exit code 0. 
 Instead, it should return the exit code of the driver process.
 Suggested code change in SparkSubmitDriverBootstrapper, line 157: 
 {code}
 val returnCode = process.waitFor()
 sys.exit(returnCode)
 {code}
 Workaround for this issue: 
 Instead of specifying 'driver.extra*' properties in spark-defaults.conf, pass 
 these properties to spark-submit directly. This will launch the driver 
 program without the use of SparkSubmitDriverBootstrapper. 
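
A self-contained sketch of the suggested behaviour in plain Scala (this is not 
the actual bootstrapper code; "java -version" stands in for the real driver 
command):

{code}
// Launch a child process, wait for it, and exit with its status
// instead of always exiting with 0.
object ExitCodePropagation {
  def main(args: Array[String]): Unit = {
    val process = new ProcessBuilder("java", "-version").inheritIO().start()
    val returnCode = process.waitFor()
    sys.exit(returnCode)
  }
}
{code}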






[jira] [Commented] (SPARK-3762) clear all SparkEnv references after stop

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156159#comment-14156159
 ] 

Apache Spark commented on SPARK-3762:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2624

 clear all SparkEnv references after stop
 

 Key: SPARK-3762
 URL: https://issues.apache.org/jira/browse/SPARK-3762
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical

 SparkEnv is cached in a ThreadLocal object, so after stopping a SparkContext 
 and creating a new one, the old SparkEnv is still used by some threads, which 
 triggers many problems.
 We should clear all the references after stopping a SparkEnv.






[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time

2014-10-02 Thread Milan Straka (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milan Straka updated SPARK-3731:

Attachment: spark-3731.py

 RDD caching stops working in pyspark after some time
 

 Key: SPARK-3731
 URL: https://issues.apache.org/jira/browse/SPARK-3731
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
 Environment: Linux, 32bit, both in local mode and in standalone 
 cluster mode
Reporter: Milan Straka
 Attachments: spark-3731.py, worker.log


 Consider a file F which, when loaded with sc.textFile and cached, takes up 
 slightly more than half of the free memory for the RDD cache.
 When the following is executed in PySpark:
   1) a = sc.textFile(F)
   2) a.cache().count()
   3) b = sc.textFile(F)
   4) b.cache().count()
 and then the following is repeated (for example 10 times):
   a) a.unpersist().cache().count()
   b) b.unpersist().cache().count()
 after some time no RDDs are cached in memory anymore.
 Also, from that point on, no other RDD ever gets cached (the worker always 
 reports something like "WARN CacheManager: Not enough space to cache 
 partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even though 
 rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that 
 all executors have 0MB of memory used (which is consistent with the 
 CacheManager warning).
 When doing the same in Scala, everything works perfectly.
 I understand that this is a vague description, but I do not know how to 
 describe the problem better.






[jira] [Commented] (SPARK-2461) Add a toString method to GeneralizedLinearModel

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156175#comment-14156175
 ] 

Apache Spark commented on SPARK-2461:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2625

 Add a toString method to GeneralizedLinearModel
 ---

 Key: SPARK-2461
 URL: https://issues.apache.org/jira/browse/SPARK-2461
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Sandy Ryza








[jira] [Commented] (SPARK-3731) RDD caching stops working in pyspark after some time

2014-10-02 Thread Milan Straka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156174#comment-14156174
 ] 

Milan Straka commented on SPARK-3731:
-

I have attached a reproducing program, the input file and a log from a local 
run. You have to uncompress the input file before executing. I have used an 
unmodified spark-1.1.0-bin-hadoop2.4.tgz.

Before the for loop, 3 out of 4 partitions are cached. After the first 
iteration of the for loop only 2 are cached, after the second iteration only 
1, and after the third iteration 0 partitions are cached.

I believe this behaviour does not depend on the StorageLevel used. I am using 
MEMORY_ONLY so that the cached partitions are large even for an easily 
compressible file, but I have also encountered the issue when using 
MEMORY_ONLY_SER.

I also believe that this behaviour is triggered only when some RDD partition 
does _not_ fit into memory. Before that happens, caching and uncaching work as 
expected.

An equivalent Scala program seems to be working fine.

 RDD caching stops working in pyspark after some time
 

 Key: SPARK-3731
 URL: https://issues.apache.org/jira/browse/SPARK-3731
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
 Environment: Linux, 32bit, both in local mode and in standalone 
 cluster mode
Reporter: Milan Straka
 Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, 
 worker.log


 Consider a file F which, when loaded with sc.textFile and cached, takes up 
 slightly more than half of the free memory for the RDD cache.
 When the following is executed in PySpark:
   1) a = sc.textFile(F)
   2) a.cache().count()
   3) b = sc.textFile(F)
   4) b.cache().count()
 and then the following is repeated (for example 10 times):
   a) a.unpersist().cache().count()
   b) b.unpersist().cache().count()
 after some time no RDDs are cached in memory anymore.
 Also, from that point on, no other RDD ever gets cached (the worker always 
 reports something like "WARN CacheManager: Not enough space to cache 
 partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even though 
 rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that 
 all executors have 0MB of memory used (which is consistent with the 
 CacheManager warning).
 When doing the same in Scala, everything works perfectly.
 I understand that this is a vague description, but I do not know how to 
 describe the problem better.






[jira] [Resolved] (SPARK-1767) Prefer HDFS-cached replicas when scheduling data-local tasks

2014-10-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1767.

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Colin Patrick McCabe

Fixed by: https://github.com/apache/spark/pull/1486

 Prefer HDFS-cached replicas when scheduling data-local tasks
 

 Key: SPARK-1767
 URL: https://issues.apache.org/jira/browse/SPARK-1767
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Colin Patrick McCabe
 Fix For: 1.2.0









[jira] [Commented] (SPARK-3007) Add Dynamic Partition support to Spark Sql hive

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156225#comment-14156225
 ] 

Apache Spark commented on SPARK-3007:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/2626

 Add Dynamic Partition support  to  Spark Sql hive
 ---

 Key: SPARK-3007
 URL: https://issues.apache.org/jira/browse/SPARK-3007
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: baishuo
 Fix For: 1.2.0









[jira] [Commented] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-10-02 Thread Ziv Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156236#comment-14156236
 ] 

Ziv Huang commented on SPARK-3687:
--

The following is the jstack dump of one CoarseGrainedExecutorBackend when the 
job hangs (the spark version is 1.1.0):

Attach Listener daemon prio=10 tid=0x7fded0001000 nid=0x7836 waiting on 
condition [0x]
   java.lang.Thread.State: RUNNABLE

Hashed wheel timer #1 daemon prio=10 tid=0x7fde9c001000 nid=0x7811 
waiting on condition [0x7fdf26a84000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at 
org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:503)
at 
org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:401)
at 
org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at java.lang.Thread.run(Thread.java:745)

New I/O server boss #6 daemon prio=10 tid=0x7fdeb4084000 nid=0x7810 
runnable [0x7fdf26b85000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked 0x0007db53acc0 (a sun.nio.ch.Util$2)
- locked 0x0007db53acb0 (a java.util.Collections$UnmodifiableSet)
- locked 0x0007db53ab98 (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
at 
org.jboss.netty.channel.socket.nio.NioServerBoss.select(NioServerBoss.java:163)
at 
org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:206)
at 
org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
at 
org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at 
org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

New I/O worker #5 daemon prio=10 tid=0x7fdeb4037000 nid=0x780f runnable 
[0x7fdf26c86000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked 0x0007db529f98 (a sun.nio.ch.Util$2)
- locked 0x0007db529f88 (a java.util.Collections$UnmodifiableSet)
- locked 0x0007db529e70 (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at 
org.jboss.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:64)
at 
org.jboss.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:409)
at 
org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:206)
at 
org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at 
org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at 
org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

New I/O worker #4 daemon prio=10 tid=0x7fdeb4032800 nid=0x780e runnable 
[0x7fdf26d87000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked 0x0007db528610 (a sun.nio.ch.Util$2)
- locked 0x0007db528600 (a java.util.Collections$UnmodifiableSet)
- locked 0x0007db5284e8 (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at 
org.jboss.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:64)
at 
org.jboss.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:409)
at 

[jira] [Created] (SPARK-3763) The example of building with sbt should be sbt assembly instead of sbt compile

2014-10-02 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3763:
-

 Summary: The example of building with sbt should be sbt assembly 
instead of sbt compile
 Key: SPARK-3763
 URL: https://issues.apache.org/jira/browse/SPARK-3763
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Priority: Trivial


In building-spark.md there are several examples of building an assembled 
package with Maven, but the example for building with sbt only covers 
compiling.






[jira] [Commented] (SPARK-3759) SparkSubmitDriverBootstrapper should return exit code of driver process

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156239#comment-14156239
 ] 

Apache Spark commented on SPARK-3759:
-

User 'ericeijkelenboom' has created a pull request for this issue:
https://github.com/apache/spark/pull/2628

 SparkSubmitDriverBootstrapper should return exit code of driver process
 ---

 Key: SPARK-3759
 URL: https://issues.apache.org/jira/browse/SPARK-3759
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.1.0
 Environment: Linux, Windows, Scala/Java
Reporter: Eric Eijkelenboom
Priority: Minor
   Original Estimate: 24h
  Remaining Estimate: 24h

 SparkSubmitDriverBootstrapper.scala currently always returns exit code 0. 
 Instead, it should return the exit code of the driver process.
 Suggested code change in SparkSubmitDriverBootstrapper, line 157: 
 {code}
 val returnCode = process.waitFor()
 sys.exit(returnCode)
 {code}
 Workaround for this issue: 
 Instead of specifying 'driver.extra*' properties in spark-defaults.conf, pass 
 these properties to spark-submit directly. This will launch the driver 
 program without the use of SparkSubmitDriverBootstrapper. 






[jira] [Commented] (SPARK-3763) The example of building with sbt should be sbt assembly instead of sbt compile

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156238#comment-14156238
 ] 

Apache Spark commented on SPARK-3763:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2627

 The example of building with sbt should be sbt assembly instead of sbt 
 compile
 --

 Key: SPARK-3763
 URL: https://issues.apache.org/jira/browse/SPARK-3763
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Priority: Trivial

 In building-spark.md there are several examples of building an assembled 
 package with Maven, but the example for building with sbt only covers 
 compiling.






[jira] [Created] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.

2014-10-02 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-3764:


 Summary: Invalid dependencies of artifacts in Maven Central 
Repository.
 Key: SPARK-3764
 URL: https://issues.apache.org/jira/browse/SPARK-3764
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Takuya Ueshin


While testing my spark applications locally using spark artifacts downloaded 
from Maven Central, the following exception was thrown:

{quote}
ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
Thread[Executor task launch worker-2,5,main]
java.lang.IncompatibleClassChangeError: Found class 
org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
at 
org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
at 
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{quote}

This is because the hadoop class {{TaskAttemptContext}} is incompatible between 
hadoop-1 and hadoop-2.

I guess the Spark artifacts in Maven Central were built against hadoop-2 with 
Maven, but the Hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
Hadoop version mismatch happens.

FYI:
sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
resolved correctly.






[jira] [Updated] (SPARK-2809) update chill to version 0.5.0

2014-10-02 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2809:
---
Summary: update chill to version 0.5.0  (was: update chill to version 0.4)

 update chill to version 0.5.0
 -

 Key: SPARK-2809
 URL: https://issues.apache.org/jira/browse/SPARK-2809
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Spark Core
Reporter: Anand Avati

 First twitter chill_2.11 0.4 has to be released






[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.

2014-10-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156311#comment-14156311
 ] 

Sean Owen commented on SPARK-3764:
--

This is correct and as intended. Without any additional flags, yes, the version 
of Hadoop referenced by Spark would be 1.0.4. You should not rely on this 
though. If your app uses Spark but not Hadoop, it's not relevant as you are not 
packaging Spark or Hadoop dependencies in your app. If you use Spark and Hadoop 
APIs, you need to explicitly depend on the version of Hadoop you use on your 
cluster (but still not bundle with your app).
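
A minimal sbt sketch of the setup being described (versions are illustrative): 
depend on Spark, and explicitly on the Hadoop version used on the cluster, 
with both marked "provided" so that neither is bundled into the application:

{code}
// build.sbt sketch; versions are placeholders for whatever the cluster runs.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.1.0" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.4.0" % "provided"
)
{code}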

 Invalid dependencies of artifacts in Maven Central Repository.
 --

 Key: SPARK-3764
 URL: https://issues.apache.org/jira/browse/SPARK-3764
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Takuya Ueshin

 While testing my spark applications locally using spark artifacts downloaded 
 from Maven Central, the following exception was thrown:
 {quote}
 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
 Thread[Executor task launch worker-2,5,main]
 java.lang.IncompatibleClassChangeError: Found class 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
   at 
 org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 {quote}
 This is because the hadoop class {{TaskAttemptContext}} is incompatible 
 between hadoop-1 and hadoop-2.
 I guess the Spark artifacts in Maven Central were built against hadoop-2 with 
 Maven, but the Hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
 Hadoop version mismatch happens.
 FYI:
 sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
 resolved correctly.






[jira] [Commented] (SPARK-2809) update chill to version 0.5.0

2014-10-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156362#comment-14156362
 ] 

Sean Owen commented on SPARK-2809:
--

PS chill 0.5.0 is the first to support Scala 2.11, so now this is actionable.
http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22chill_2.11%22

 update chill to version 0.5.0
 -

 Key: SPARK-2809
 URL: https://issues.apache.org/jira/browse/SPARK-2809
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Spark Core
Reporter: Anand Avati
Assignee: Guoqiang Li

 First twitter chill_2.11 0.4 has to be released






[jira] [Comment Edited] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4

2014-10-02 Thread Igor Tkachenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156365#comment-14156365
 ] 

Igor Tkachenko edited comment on SPARK-3761 at 10/2/14 10:55 AM:
-

I've tried sbt 12.4, but unfortunately with no luck. I'd like to emphasize that 
I am using the spark-assembly lib with version _2.10;1.0.0-cdh5.1.0 from the 
repository http://repository.cloudera.com/artifactory/repo/ as we are using the 
Cloudera image CDH 5.1.0 and can't use any other version due to serialization 
issues.


was (Author: legart):
I've tried sbt 12.4, but unfortunately with no luck. I'd like to emphasize that 
I am using  with version _2.10;1.0.0-cdh5.1.0 from  repository 
http://repository.cloudera.com/artifactory/repo/ as we are using cloudera image 
CDH 5.1.0 and can't use any other version due to serialization issues.

 Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
 -

 Key: SPARK-3761
 URL: https://issues.apache.org/jira/browse/SPARK-3761
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Igor Tkachenko
Priority: Blocker

 I have Scala code:
 val master = "spark://server address:7077"
 val sc = new SparkContext(new SparkConf()
   .setMaster(master)
   .setAppName("SparkQueryDemo 01")
   .set("spark.executor.memory", "512m"))
 val count2 = sc.textFile("hdfs://server address:8020/tmp/data/risk/account.txt")
   .filter(line => line.contains("Word"))
   .count()
 I've got such an error:
 [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to 
 stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception 
 failure in TID 6 on host server address: 
 java.lang.ClassNotFoundException: SimpleApp$$anonfun$1
 My dependencies:
 object Version {
   val spark = "1.0.0-cdh5.1.0"
 }
 object Library {
   val sparkCore = "org.apache.spark" % "spark-assembly_2.10" % Version.spark
 }
 My OS is Win 7, sbt 13.5, Scala 2.10.4






[jira] [Commented] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4

2014-10-02 Thread Igor Tkachenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156365#comment-14156365
 ] 

Igor Tkachenko commented on SPARK-3761:
---

I've tried sbt 12.4, but unfortunately with no luck. I'd like to emphasize that 
I am using  with version _2.10;1.0.0-cdh5.1.0 from  repository 
http://repository.cloudera.com/artifactory/repo/ as we are using cloudera image 
CDH 5.1.0 and can't use any other version due to serialization issues.

 Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
 -

 Key: SPARK-3761
 URL: https://issues.apache.org/jira/browse/SPARK-3761
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Igor Tkachenko
Priority: Blocker

 I have Scala code:
 val master = "spark://server address:7077"
 val sc = new SparkContext(new SparkConf()
   .setMaster(master)
   .setAppName("SparkQueryDemo 01")
   .set("spark.executor.memory", "512m"))
 val count2 = sc.textFile("hdfs://server address:8020/tmp/data/risk/account.txt")
   .filter(line => line.contains("Word"))
   .count()
 I've got such an error:
 [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to 
 stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception 
 failure in TID 6 on host server address: 
 java.lang.ClassNotFoundException: SimpleApp$$anonfun$1
 My dependencies:
 object Version {
   val spark = "1.0.0-cdh5.1.0"
 }
 object Library {
   val sparkCore = "org.apache.spark" % "spark-assembly_2.10" % Version.spark
 }
 My OS is Win 7, sbt 13.5, Scala 2.10.4






[jira] [Commented] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4

2014-10-02 Thread Igor Tkachenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156373#comment-14156373
 ] 

Igor Tkachenko commented on SPARK-3761:
---

Created the same bug in Cloudera Jira: 
https://issues.cloudera.org/browse/DISTRO-647

 Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
 -

 Key: SPARK-3761
 URL: https://issues.apache.org/jira/browse/SPARK-3761
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Igor Tkachenko
Priority: Blocker

 I have Scala code:
 val master = "spark://server address:7077"
 val sc = new SparkContext(new SparkConf()
   .setMaster(master)
   .setAppName("SparkQueryDemo 01")
   .set("spark.executor.memory", "512m"))
 val count2 = sc.textFile("hdfs://server address:8020/tmp/data/risk/account.txt")
   .filter(line => line.contains("Word"))
   .count()
 I've got such an error:
 [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to 
 stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception 
 failure in TID 6 on host server address: 
 java.lang.ClassNotFoundException: SimpleApp$$anonfun$1
 My dependencies:
 object Version {
   val spark = "1.0.0-cdh5.1.0"
 }
 object Library {
   val sparkCore = "org.apache.spark" % "spark-assembly_2.10" % Version.spark
 }
 My OS is Win 7, sbt 13.5, Scala 2.10.4






[jira] [Commented] (SPARK-2809) update chill to version 0.5.0

2014-10-02 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156440#comment-14156440
 ] 

Guoqiang Li commented on SPARK-2809:


The related work.
https://github.com/apache/spark/pull/2615

 update chill to version 0.5.0
 -

 Key: SPARK-2809
 URL: https://issues.apache.org/jira/browse/SPARK-2809
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Spark Core
Reporter: Anand Avati
Assignee: Guoqiang Li

 First twitter chill_2.11 0.4 has to be released






[jira] [Commented] (SPARK-1834) NoSuchMethodError when invoking JavaPairRDD.reduce() in Java

2014-10-02 Thread Alexis Seigneurin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156464#comment-14156464
 ] 

Alexis Seigneurin commented on SPARK-1834:
--

Same issue here with Spark 1.1.0: reduce() is not implemented on JavaPairRDD.

{code}
Tuple2<String, Long> r = sc.textFile(filename)
        .mapToPair(s -> new Tuple2<String, Long>(s[3], 1L))
        .reduceByKey((x, y) -> x + y)
        .reduce((t1, t2) -> t1._2 > t2._2 ? t1 : t2);
{code}

Produces:

{code}
Exception in thread main java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2;
at fr.ippon.dojo.spark.AnalyseParisTrees.main(AnalyseParisTrees.java:33)
{code}

However, reduce() is implemented on JavaRDD. As a workaround, I had to add an 
intermediate map() operation:

{code}
Tuple2<String, Long> r = sc.textFile(filename)
        .mapToPair(s -> new Tuple2<String, Long>(s[3], 1L))
        .reduceByKey((x, y) -> x + y)
        .map(t -> t)
        .reduce((t1, t2) -> t1._2 > t2._2 ? t1 : t2);
{code}

 NoSuchMethodError when invoking JavaPairRDD.reduce() in Java
 

 Key: SPARK-1834
 URL: https://issues.apache.org/jira/browse/SPARK-1834
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.1
 Environment: Redhat Linux, Java 7, Hadoop 2.2, Scala 2.10.4
Reporter: John Snodgrass

 I get a java.lang.NoSuchMethod error when invoking JavaPairRDD.reduce(). Here 
 is the partial stack trace:
 Exception in thread main java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at 
 org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:39)
 at 
 org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2;
 at JavaPairRDDReduceTest.main(JavaPairRDDReduceTest.java:49)...
 I'm using Spark 0.9.1. I checked to ensure that I'm compiling with the same 
 version of Spark as I am running on the cluster. The reduce() method works 
 fine with JavaRDD, just not with JavaPairRDD. Here is a code snippet that 
 exhibits the problem: 
   ArrayList<Integer> array = new ArrayList<Integer>();
   for (int i = 0; i < 10; ++i) {
     array.add(i);
   }
   JavaRDD<Integer> rdd = javaSparkContext.parallelize(array);
   JavaPairRDD<String, Integer> testRDD = rdd.map(new PairFunction<Integer, String, Integer>() {
     @Override
     public Tuple2<String, Integer> call(Integer t) throws Exception {
       return new Tuple2("" + t, t);
     }
   }).cache();

   testRDD.reduce(new Function2<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
     @Override
     public Tuple2<String, Integer> call(Tuple2<String, Integer> arg0,
                                         Tuple2<String, Integer> arg1) throws Exception {
       return new Tuple2(arg0._1 + arg1._1, arg0._2 * 10 + arg0._2);
     }
   });






[jira] [Commented] (SPARK-1834) NoSuchMethodError when invoking JavaPairRDD.reduce() in Java

2014-10-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156477#comment-14156477
 ] 

Sean Owen commented on SPARK-1834:
--

Weird, I can reproduce this. I have a new test case for {{JavaAPISuite}} and am 
investigating. It compiles fine but fails at runtime. I sense Scala shenanigans.

 NoSuchMethodError when invoking JavaPairRDD.reduce() in Java
 

 Key: SPARK-1834
 URL: https://issues.apache.org/jira/browse/SPARK-1834
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.1
 Environment: Redhat Linux, Java 7, Hadoop 2.2, Scala 2.10.4
Reporter: John Snodgrass

 I get a java.lang.NoSuchMethod error when invoking JavaPairRDD.reduce(). Here 
 is the partial stack trace:
 Exception in thread main java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at 
 org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:39)
 at 
 org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2;
 at JavaPairRDDReduceTest.main(JavaPairRDDReduceTest.java:49)...
 I'm using Spark 0.9.1. I checked to ensure that I'm compiling with the same 
 version of Spark as I am running on the cluster. The reduce() method works 
 fine with JavaRDD, just not with JavaPairRDD. Here is a code snippet that 
 exhibits the problem: 
   ArrayList<Integer> array = new ArrayList<Integer>();
   for (int i = 0; i < 10; ++i) {
     array.add(i);
   }
   JavaRDD<Integer> rdd = javaSparkContext.parallelize(array);
   JavaPairRDD<String, Integer> testRDD = rdd.map(new PairFunction<Integer, String, Integer>() {
     @Override
     public Tuple2<String, Integer> call(Integer t) throws Exception {
       return new Tuple2("" + t, t);
     }
   }).cache();

   testRDD.reduce(new Function2<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
     @Override
     public Tuple2<String, Integer> call(Tuple2<String, Integer> arg0,
                                         Tuple2<String, Integer> arg1) throws Exception {
       return new Tuple2(arg0._1 + arg1._1, arg0._2 * 10 + arg0._2);
     }
   });






[jira] [Comment Edited] (SPARK-1834) NoSuchMethodError when invoking JavaPairRDD.reduce() in Java

2014-10-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156477#comment-14156477
 ] 

Sean Owen edited comment on SPARK-1834 at 10/2/14 12:46 PM:


Weird, I can reproduce this. It compiles fine but fails at runtime. Another 
example, that doesn't even use lambdas:

{code}
  @Test
  public void pairReduce() {
    JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 1, 2, 3, 5, 8, 13));
    JavaPairRDD<Integer, Integer> pairRDD = rdd.mapToPair(
        new PairFunction<Integer, Integer, Integer>() {
          @Override
          public Tuple2<Integer, Integer> call(Integer i) {
            return new Tuple2<Integer, Integer>(i, i + 1);
          }
        });

    // See SPARK-1834
    Tuple2<Integer, Integer> reduced = pairRDD.reduce(
        new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
          @Override
          public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> t1,
                                               Tuple2<Integer, Integer> t2) {
            return new Tuple2<Integer, Integer>(t1._1() + t2._1(), t1._2() + t2._2());
          }
        });

    Assert.assertEquals(33, reduced._1().intValue());
    Assert.assertEquals(40, reduced._2().intValue());
  }
{code}

but...

{code}
java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2;
{code}

I decompiled the class and it really looks like the method is there with the 
expected signature:

{code}
  public scala.Tuple2<K, V> reduce(org.apache.spark.api.java.function.Function2<scala.Tuple2<K, V>, scala.Tuple2<K, V>, scala.Tuple2<K, V>>);
{code}

Color me pretty confused.


was (Author: srowen):
Weird, I can reproduce this. I have a new test case for {{JavaAPISuite}} and am 
investigating. It compiles fine but fails at runtime. I sense Scala shenanigans.

 NoSuchMethodError when invoking JavaPairRDD.reduce() in Java
 

 Key: SPARK-1834
 URL: https://issues.apache.org/jira/browse/SPARK-1834
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.1
 Environment: Redhat Linux, Java 7, Hadoop 2.2, Scala 2.10.4
Reporter: John Snodgrass

 I get a java.lang.NoSuchMethod error when invoking JavaPairRDD.reduce(). Here 
 is the partial stack trace:
 Exception in thread main java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at 
 org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:39)
 at 
 org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
 Caused by: java.lang.NoSuchMethodError: 
 org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2;
 at JavaPairRDDReduceTest.main(JavaPairRDDReduceTest.java:49)...
 I'm using Spark 0.9.1. I checked to ensure that I'm compiling with the same 
 version of Spark as I am running on the cluster. The reduce() method works 
 fine with JavaRDD, just not with JavaPairRDD. Here is a code snippet that 
 exhibits the problem: 
   ArrayList<Integer> array = new ArrayList<Integer>();
   for (int i = 0; i < 10; ++i) {
     array.add(i);
   }
   JavaRDD<Integer> rdd = javaSparkContext.parallelize(array);
   JavaPairRDD<String, Integer> testRDD = rdd.map(new PairFunction<Integer, String, Integer>() {
     @Override
     public Tuple2<String, Integer> call(Integer t) throws Exception {
       return new Tuple2("" + t, t);
     }
   }).cache();

   testRDD.reduce(new Function2<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
     @Override
     public Tuple2<String, Integer> call(Tuple2<String, Integer> arg0,
                                         Tuple2<String, Integer> arg1) throws Exception {
       return new Tuple2(arg0._1 + arg1._1, arg0._2 * 10 + arg0._2);
     }
   });






[jira] [Created] (SPARK-3765) add testing with sbt to doc

2014-10-02 Thread wangfei (JIRA)
wangfei created SPARK-3765:
--

 Summary: add testing with sbt to doc
 Key: SPARK-3765
 URL: https://issues.apache.org/jira/browse/SPARK-3765
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.1.0
Reporter: wangfei









[jira] [Commented] (SPARK-3765) add testing with sbt to doc

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156569#comment-14156569
 ] 

Apache Spark commented on SPARK-3765:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2629

 add testing with sbt to doc
 ---

 Key: SPARK-3765
 URL: https://issues.apache.org/jira/browse/SPARK-3765
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.1.0
Reporter: wangfei








[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.

2014-10-02 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156576#comment-14156576
 ] 

Takuya Ueshin commented on SPARK-3764:
--

Ah, so, these artifacts are only for hadoop-2 and if I want to use Hadoop APIs, 
I need to explicitly add dependencies to hadoop, right?

 Invalid dependencies of artifacts in Maven Central Repository.
 --

 Key: SPARK-3764
 URL: https://issues.apache.org/jira/browse/SPARK-3764
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Takuya Ueshin

 While testing my spark applications locally using spark artifacts downloaded 
 from Maven Central, the following exception was thrown:
 {quote}
 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
 Thread[Executor task launch worker-2,5,main]
 java.lang.IncompatibleClassChangeError: Found class 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
   at 
 org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 {quote}
 This is because the hadoop class {{TaskAttemptContext}} is incompatible 
 between hadoop-1 and hadoop-2.
 I guess the Spark artifacts in Maven Central were built against hadoop-2 with 
 Maven, but the Hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
 Hadoop version mismatch happens.
 FYI:
 sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
 resolved correctly.






[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.

2014-10-02 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156578#comment-14156578
 ] 

Takuya Ueshin commented on SPARK-3764:
--

Now I found the instructions 
[here|http://spark.apache.org/docs/latest/programming-guide.html#linking-with-spark], 
but they would not work for hadoop-1.
I think we need a notice that points hadoop-1 users to [Building Spark with 
Maven|http://spark.apache.org/docs/latest/building-with-maven.html#specifying-the-hadoop-version], 
and we also need it at [Download Spark|http://spark.apache.org/downloads.html].

 Invalid dependencies of artifacts in Maven Central Repository.
 --

 Key: SPARK-3764
 URL: https://issues.apache.org/jira/browse/SPARK-3764
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Takuya Ueshin

 While testing my spark applications locally using spark artifacts downloaded 
 from Maven Central, the following exception was thrown:
 {quote}
 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
 Thread[Executor task launch worker-2,5,main]
 java.lang.IncompatibleClassChangeError: Found class 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
   at 
 org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 {quote}
 This is because the hadoop class {{TaskAttemptContext}} is incompatible 
 between hadoop-1 and hadoop-2.
 I guess the spark artifacts in Maven Central were built against hadoop-2 with 
 Maven, but the hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
 hadoop version mismatch occurs.
 FYI:
 sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
 correctly resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.

2014-10-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156581#comment-14156581
 ] 

Sean Owen commented on SPARK-3764:
--

The artifacts themselves don't contain any Hadoop code. The pom's default would 
link to Hadoop 1, but apps are not meant to depend on that default (not relying 
on it is generally good Maven practice). Yes, you always need to add Hadoop 
dependencies if you use Hadoop APIs. That's not specific to Spark.

In fact, you will want to mark Spark and Hadoop as provided dependencies when 
making an app for use with spark-submit. You can use the Spark artifacts to 
build a Spark app that works with Hadoop 2 or Hadoop 1.

The instructions you see are really about creating a build of Spark itself to 
deploy on a cluster, rather than an app for Spark.
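
For example, a minimal sbt build for an app submitted with spark-submit against a 
Hadoop 2 cluster might look like the sketch below (version numbers are only 
illustrative; use the ones matching your cluster):

{code}
// build.sbt -- illustrative sketch only.
// Spark is marked "provided" because spark-submit supplies it at runtime;
// hadoop-client is declared explicitly (and also "provided") because the app
// uses Hadoop APIs directly.
name := "my-spark-app"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.1.0" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.4.0" % "provided"
)
{code}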

 Invalid dependencies of artifacts in Maven Central Repository.
 --

 Key: SPARK-3764
 URL: https://issues.apache.org/jira/browse/SPARK-3764
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Takuya Ueshin

 While testing my spark applications locally using spark artifacts downloaded 
 from Maven Central, the following exception was thrown:
 {quote}
 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
 Thread[Executor task launch worker-2,5,main]
 java.lang.IncompatibleClassChangeError: Found class 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
   at 
 org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 {quote}
 This is because the hadoop class {{TaskAttemptContext}} is incompatible 
 between hadoop-1 and hadoop-2.
 I guess the spark artifacts in Maven Central were built against hadoop-2 with 
 Maven, but the hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
 hadoop version mismatch occurs.
 FYI:
 sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
 correctly resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.

2014-10-02 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156588#comment-14156588
 ] 

Takuya Ueshin commented on SPARK-3764:
--

But Spark itself contains some code that uses binary-incompatible Hadoop APIs, 
which is what causes my original problem, so hadoop-1 users need to rebuild 
Spark itself.

 Invalid dependencies of artifacts in Maven Central Repository.
 --

 Key: SPARK-3764
 URL: https://issues.apache.org/jira/browse/SPARK-3764
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Takuya Ueshin

 While testing my spark applications locally using spark artifacts downloaded 
 from Maven Central, the following exception was thrown:
 {quote}
 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
 Thread[Executor task launch worker-2,5,main]
 java.lang.IncompatibleClassChangeError: Found class 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
   at 
 org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 {quote}
 This is because the hadoop class {{TaskAttemptContext}} is incompatible 
 between hadoop-1 and hadoop-2.
 I guess the spark artifacts in Maven Central were built against hadoop-2 with 
 Maven, but the hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
 hadoop version mismatch occurs.
 FYI:
 sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
 correctly resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.

2014-10-02 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156599#comment-14156599
 ] 

Takuya Ueshin commented on SPARK-3764:
--

Ah, I see that {{context.getTaskAttemptID}} at 
[ParquetTableOperations.scala:334|https://github.com/apache/spark/blob/6e27cb630de69fa5acb510b4e2f6b980742b1957/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L334]
 is breaking binary-compatibility of Spark itself.

 Invalid dependencies of artifacts in Maven Central Repository.
 --

 Key: SPARK-3764
 URL: https://issues.apache.org/jira/browse/SPARK-3764
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Takuya Ueshin

 While testing my spark applications locally using spark artifacts downloaded 
 from Maven Central, the following exception was thrown:
 {quote}
 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
 Thread[Executor task launch worker-2,5,main]
 java.lang.IncompatibleClassChangeError: Found class 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
   at 
 org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 {quote}
 This is because the hadoop class {{TaskAttemptContext}} is incompatible 
 between hadoop-1 and hadoop-2.
 I guess the spark artifacts in Maven Central were built against hadoop-2 with 
 Maven, but the hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
 hadoop version mismatch occurs.
 FYI:
 sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
 correctly resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-10-02 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156623#comment-14156623
 ] 

Guoqiang Li commented on SPARK-1405:


Hi everyone,
[PR 2388|https://github.com/apache/spark/pull/2388] is ready for review.

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (already solved), a word segmentation step (imported from 
 Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2811) update algebird to 0.8

2014-10-02 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2811:
---
Summary: update algebird to 0.8  (was: update algebird to 0.7)

 update algebird to 0.8
 --

 Key: SPARK-2811
 URL: https://issues.apache.org/jira/browse/SPARK-2811
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Spark Core
Reporter: Anand Avati

 First algebird_2.11 0.7.0 has to be released



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2811) update algebird to 0.8

2014-10-02 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2811:
---
Description: First algebird_2.11 0.8.1 has to be released  (was: First 
algebird_2.11 0.7.0 has to be released)

 update algebird to 0.8
 --

 Key: SPARK-2811
 URL: https://issues.apache.org/jira/browse/SPARK-2811
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Spark Core
Reporter: Anand Avati

 First algebird_2.11 0.8.1 has to be released



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2811) update algebird to 0.8.1

2014-10-02 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2811:
---
Summary: update algebird to 0.8.1  (was: update algebird to 0.8)

 update algebird to 0.8.1
 

 Key: SPARK-2811
 URL: https://issues.apache.org/jira/browse/SPARK-2811
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Spark Core
Reporter: Anand Avati

 First algebird_2.11 0.8.1 has to be released



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3623) Graph should support the checkpoint operation

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156664#comment-14156664
 ] 

Apache Spark commented on SPARK-3623:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/2631

 Graph should support the checkpoint operation
 -

 Key: SPARK-3623
 URL: https://issues.apache.org/jira/browse/SPARK-3623
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.0.2, 1.1.0
Reporter: Guoqiang Li
Priority: Critical

 Consider the following code:
 {code}
 for (i <- 0 until totalIter) {
   val previousCorpus = corpus
   logInfo("Start Gibbs sampling (Iteration %d/%d)".format(i, totalIter))
   val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter, sumTerms,
     numTerms, numTopics, alpha, beta).persist(storageLevel)
   val corpusSampleTopics = sampleTopics(corpusTopicDist, globalTopicCounter, sumTerms, numTerms,
     numTopics, alpha, beta).persist(storageLevel)
   corpus = updateCounter(corpusSampleTopics, numTopics).persist(storageLevel)
   globalTopicCounter = collectGlobalCounter(corpus, numTopics)
   assert(bsum(globalTopicCounter) == sumTerms)
   previousCorpus.unpersistVertices()
   corpusTopicDist.unpersistVertices()
   corpusSampleTopics.unpersistVertices()
 }
 {code}
 If there is no checkpoint operation, the following problems appear (a sketch of 
 periodic checkpointing is given after this list):
 1. The lineage of the corpus RDD grows too deep.
 2. The shuffle files become too large.
 3. Any server crash forces the algorithm to recompute from the beginning.
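 The pattern being requested is essentially what plain RDDs already allow. A 
 minimal sketch for an RDD-valued corpus (run in the spark-shell; the names 
 totalIter, checkpointInterval and step are made up for illustration) would be:
 {code}
 // Truncate the lineage every few iterations so it does not grow without bound.
 sc.setCheckpointDir("/tmp/spark-checkpoints")
 val totalIter = 100
 val checkpointInterval = 10
 def step(x: Long): Long = x + 1        // stands in for one Gibbs-style update

 var corpus = sc.parallelize(1L to 1000L)
 for (i <- 0 until totalIter) {
   val previous = corpus
   corpus = corpus.map(step).persist()
   if (i % checkpointInterval == 0) {
     corpus.checkpoint()                // cuts the dependency chain once materialized
   }
   corpus.count()                       // forces evaluation (and the checkpoint, if requested)
   previous.unpersist()
 }
 {code}
 The point of this issue is that {{Graph}} offers no equivalent hook, so the 
 lineage of the graph's vertex and edge RDDs cannot be truncated this way.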



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3625) In some cases, the RDD.checkpoint does not work

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156663#comment-14156663
 ] 

Apache Spark commented on SPARK-3625:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/2631

 In some cases, the RDD.checkpoint does not work
 ---

 Key: SPARK-3625
 URL: https://issues.apache.org/jira/browse/SPARK-3625
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Guoqiang Li
Assignee: Guoqiang Li

 Code to reproduce:
 {code}
 sc.setCheckpointDir(checkpointDir)
 val c = sc.parallelize((1 to 1000)).map(_ + 1)
 c.count
 val dep = c.dependencies.head.rdd
 c.checkpoint()
 c.count
 assert(dep != c.dependencies.head.rdd)
 {code}
 This limit is too strict; it makes it difficult to implement SPARK-3623.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3766) Snappy is also the default compression codec for broadcast variables

2014-10-02 Thread wangfei (JIRA)
wangfei created SPARK-3766:
--

 Summary: Snappy is also the default compression codec for 
broadcast variables
 Key: SPARK-3766
 URL: https://issues.apache.org/jira/browse/SPARK-3766
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.1.0
Reporter: wangfei






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3766) Snappy is also the default compression codec for broadcast variables

2014-10-02 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei updated SPARK-3766:
---
Component/s: Documentation

 Snappy is also the default compression codec for broadcast variables
 

 Key: SPARK-3766
 URL: https://issues.apache.org/jira/browse/SPARK-3766
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.1.0
Reporter: wangfei





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3765) add testing with sbt to doc

2014-10-02 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei updated SPARK-3765:
---
Component/s: Documentation

 add testing with sbt to doc
 ---

 Key: SPARK-3765
 URL: https://issues.apache.org/jira/browse/SPARK-3765
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.1.0
Reporter: wangfei





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3766) Snappy is also the default compression codec for broadcast variables

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156680#comment-14156680
 ] 

Apache Spark commented on SPARK-3766:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2632

 Snappy is also the default compression codec for broadcast variables
 

 Key: SPARK-3766
 URL: https://issues.apache.org/jira/browse/SPARK-3766
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.1.0
Reporter: wangfei





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-10-02 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156699#comment-14156699
 ] 

Evan Sparks commented on SPARK-1405:


Hi Guoqiang - is it correct that your runtimes are reported in minutes as 
opposed to seconds? In your tests, have you cached the input data? 45 minutes 
for 150 iterations over this small dataset seems slow to me. It would be great 
to get an idea of where the bottleneck is coming from. Is it the Gibbs step or 
something else?

Is it possible to share the dataset you used for these experiments?

Thanks!

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (already solved), a word segmentation step (imported from 
 Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.

2014-10-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156723#comment-14156723
 ] 

Sean Owen commented on SPARK-3764:
--

I'm not sure what you mean. Spark compiles against most versions of Hadoop 1 and 
2. You can see the profiles in the build that help support this. These are, 
however, not relevant to someone who is just building a Spark app.

 Invalid dependencies of artifacts in Maven Central Repository.
 --

 Key: SPARK-3764
 URL: https://issues.apache.org/jira/browse/SPARK-3764
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Takuya Ueshin

 While testing my spark applications locally using spark artifacts downloaded 
 from Maven Central, the following exception was thrown:
 {quote}
 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
 Thread[Executor task launch worker-2,5,main]
 java.lang.IncompatibleClassChangeError: Found class 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
   at 
 org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 {quote}
 This is because the hadoop class {{TaskAttemptContext}} is incompatible 
 between hadoop-1 and hadoop-2.
 I guess the spark artifacts in Maven Central were built against hadoop-2 with 
 Maven, but the hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
 hadoop version mismatch occurs.
 FYI:
 sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
 correctly resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3706) Cannot run IPython REPL with IPYTHON set to 1 and PYSPARK_PYTHON unset

2014-10-02 Thread cocoatomo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156776#comment-14156776
 ] 

cocoatomo commented on SPARK-3706:
--

Thank you for the comment and modification, [~joshrosen].

Taking a quick look, this regression was introduced by commit 
[f38fab97c7970168f1bd81d4dc202e36322c95e3|https://github.com/apache/spark/commit/f38fab97c7970168f1bd81d4dc202e36322c95e3#diff-5dbcb82caf8131d60c73e82cf8d12d8aR107] 
on the master branch.
Pushing ipython aside into a mere default value forces us to set PYSPARK_PYTHON 
to ipython explicitly, since PYSPARK_PYTHON defaults to python at the top of the 
./bin/pyspark script.
This issue is a regression between 1.1.0 and 1.2.0, and therefore affects only 
1.2.0.

 Cannot run IPython REPL with IPYTHON set to 1 and PYSPARK_PYTHON unset
 

 Key: SPARK-3706
 URL: https://issues.apache.org/jira/browse/SPARK-3706
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0
Reporter: cocoatomo
  Labels: pyspark

 h3. Problem
 The section "Using the shell" in the Spark Programming Guide 
 (https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell) 
 says that we can run the pyspark REPL through IPython.
 But the following command does not run IPython but the default Python executable.
 {quote}
 $ IPYTHON=1 ./bin/pyspark
 Python 2.7.8 (default, Jul  2 2014, 10:14:46) 
 ...
 {quote}
 The spark/bin/pyspark script at commit 
 b235e013638685758885842dc3268e9800af3678 decides which executable and options 
 to use in the following way.
 # if PYSPARK_PYTHON unset
 #* → defaulting to python
 # if IPYTHON_OPTS set
 #* → set IPYTHON 1
 # some python script passed to ./bin/pyspark → run it with ./bin/spark-submit
 #* out of this issue's scope
 # if IPYTHON is set to 1
 #* → execute $PYSPARK_PYTHON (default: ipython) with arguments $IPYTHON_OPTS
 #* otherwise execute $PYSPARK_PYTHON
 Therefore, when PYSPARK_PYTHON is unset, python is executed even though IPYTHON 
 is 1.
 In other words, when PYSPARK_PYTHON is unset, IPYTHON_OPTS and IPYTHON have no 
 effect on deciding which command to use.
 ||PYSPARK_PYTHON||IPYTHON_OPTS||IPYTHON||resulting command||expected command||
 |(unset → defaults to python)|(unset)|(unset)|python|(same)|
 |(unset → defaults to python)|(unset)|1|python|ipython|
 |(unset → defaults to python)|an_option|(unset → set to 1)|python an_option|ipython an_option|
 |(unset → defaults to python)|an_option|1|python an_option|ipython an_option|
 |ipython|(unset)|(unset)|ipython|(same)|
 |ipython|(unset)|1|ipython|(same)|
 |ipython|an_option|(unset → set to 1)|ipython an_option|(same)|
 |ipython|an_option|1|ipython an_option|(same)|
 h3. Suggestion
 The pyspark script should first determine whether a user wants to run 
 IPython or another executable.
 # if IPYTHON_OPTS set
 #* set IPYTHON 1
 # if IPYTHON has a value 1
 #* PYSPARK_PYTHON defaults to ipython if not set
 # PYSPARK_PYTHON defaults to python if not set
 See the pull request for more detailed modification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)

2014-10-02 Thread Norman He (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156840#comment-14156840
 ] 

Norman He commented on SPARK-2447:
--

Hi Ted,

I am very glad to see the HBase RDD work. I am probably going to use it in its 
current form.

I like the idea of having the worker node manage the HBaseConnection. Somehow I 
have not seen any code related to HConnectionStaticCache?


 Add common solution for sending upsert actions to HBase (put, deletes, and 
 increment)
 -

 Key: SPARK-2447
 URL: https://issues.apache.org/jira/browse/SPARK-2447
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska

 Going to review the design with Tdas today.  
 But my first thought is to have an extension of VoidFunction that handles the 
 connection to HBase and allows for options such as turning auto-flush off for 
 higher throughput (a rough sketch of this shape is given after the goals list 
 below).
 Need to answer the following questions first.
 - Can it be written in Java or should it be written in Scala?
 - What is the best way to add the HBase dependency? (will review how Flume 
 does this as the first option)
 - What is the best way to do testing? (will review how Flume does this as the 
 first option)
 - How to support python? (python may be a different Jira; it is unknown at 
 this time)
 Goals:
 - Simple to use
 - Stable
 - Supports high load
 - Documented (may be in a separate Jira; need to ask Tdas)
 - Supports Java, Scala, and hopefully Python
 - Supports Streaming and normal Spark
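 As a rough illustration of the shape such a helper could take (this is not the 
 design under review; it assumes the classic pre-1.0 HBase client API, and the 
 table and column names are made up):
 {code}
 import org.apache.hadoop.hbase.HBaseConfiguration
 import org.apache.hadoop.hbase.client.{HTable, Put}
 import org.apache.hadoop.hbase.util.Bytes
 import org.apache.spark.rdd.RDD

 // Sketch only: one connection per partition, auto-flush disabled for throughput,
 // a single flush at the end. Error handling and connection caching are omitted.
 def bulkPut(rdd: RDD[(String, String)], tableName: String): Unit = {
   rdd.foreachPartition { rows =>
     val conf  = HBaseConfiguration.create()
     val table = new HTable(conf, tableName)
     table.setAutoFlush(false, false)
     rows.foreach { case (rowKey, value) =>
       val put = new Put(Bytes.toBytes(rowKey))
       put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
       table.put(put)
     }
     table.flushCommits()
     table.close()
   }
 }
 {code}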



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)

2014-10-02 Thread Ted Malaska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156865#comment-14156865
 ] 

Ted Malaska commented on SPARK-2447:


Hey Norman,

Yes, the GitHub project has been used by a couple of clients now. It should be 
pretty hardened. Let me know if you find any issues.

I will hopefully run into TD at Hadoop World and will work out how to get 
this into Spark.

Thanks for the comment.

 Add common solution for sending upsert actions to HBase (put, deletes, and 
 increment)
 -

 Key: SPARK-2447
 URL: https://issues.apache.org/jira/browse/SPARK-2447
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska

 Going to review the design with Tdas today.  
 But my first thought is to have an extension of VoidFunction that handles the 
 connection to HBase and allows for options such as turning auto-flush off for 
 higher throughput.
 Need to answer the following questions first.
 - Can it be written in Java or should it be written in Scala?
 - What is the best way to add the HBase dependency? (will review how Flume 
 does this as the first option)
 - What is the best way to do testing? (will review how Flume does this as the 
 first option)
 - How to support python? (python may be a different Jira; it is unknown at 
 this time)
 Goals:
 - Simple to use
 - Stable
 - Supports high load
 - Documented (may be in a separate Jira; need to ask Tdas)
 - Supports Java, Scala, and hopefully Python
 - Supports Streaming and normal Spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3767) Support wildcard in Spark properties

2014-10-02 Thread Andrew Or (JIRA)
Andrew Or created SPARK-3767:


 Summary: Support wildcard in Spark properties
 Key: SPARK-3767
 URL: https://issues.apache.org/jira/browse/SPARK-3767
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or


If the user sets spark.executor.extraJavaOptions, he/she may want to express 
the value in terms of the executor ID, for instance. In general it would be a 
feature that many will find useful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2014-10-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156904#comment-14156904
 ] 

Nicholas Chammas commented on SPARK-2870:
-

[~marmbrus] - A related feature that I think would be very important and useful 
is the ability to infer a complete schema as described here, but do so by key. 
i.e. something like {{inferSchemaByKey()}}

Say I have a large, single RDD of data that includes many different event 
types. I want to key the RDD by event type and make a single pass over it to 
get the schema for each event type. This would probably yield something like a 
{{keyedSchemaRDD}} which I would want to register as multiple tables (one table 
per key/schema) in one go.

Do you think this would be a useful feature? If so, should I track it in a 
separate JIRA issue?

 Thorough schema inference directly on RDDs of Python dictionaries
 -

 Key: SPARK-2870
 URL: https://issues.apache.org/jira/browse/SPARK-2870
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Reporter: Nicholas Chammas

 h4. Background
 I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
 They process JSON text directly and infer a schema that covers the entire 
 source data set. 
 This is very important with semi-structured data like JSON since individual 
 elements in the data set are free to have different structures. Matching 
 fields across elements may even have different value types.
 For example:
 {code}
 {"a": 5}
 {"a": "cow"}
 {code}
 To get a queryable schema that covers the whole data set, you need to infer a 
 schema by looking at the whole data set. The aforementioned 
 {{SQLContext.json...()}} methods do this very well. 
 h4. Feature Request
 What we need is for {{SQLContext.inferSchema()}} to do this, too. 
 Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
 Python dictionaries and does something functionally equivalent to this:
 {code}
 SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
 {code}
 As of 1.0.2, 
 [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
  just looks at the first element in the data set. This won't help much when 
 the structure of the elements in the target RDD is variable.
 h4. Example Use Case
 * You have some JSON text data that you want to analyze using Spark SQL. 
 * You would use one of the {{SQLContext.json...()}} methods, but you need to 
 do some filtering on the data first to remove bad elements--basically, some 
 minimal schema validation.
 * You deserialize the JSON objects to Python {{dict}} s and filter out the 
 bad ones. You now have an RDD of dictionaries.
 * From this RDD, you want a SchemaRDD that captures the schema for the whole 
 data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3706) Cannot run IPython REPL with IPYTHON set to 1 and PYSPARK_PYTHON unset

2014-10-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3706.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2554
[https://github.com/apache/spark/pull/2554]

 Cannot run IPython REPL with IPYTHON set to 1 and PYSPARK_PYTHON unset
 

 Key: SPARK-3706
 URL: https://issues.apache.org/jira/browse/SPARK-3706
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0
Reporter: cocoatomo
  Labels: pyspark
 Fix For: 1.2.0


 h3. Problem
 The section "Using the shell" in the Spark Programming Guide 
 (https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell) 
 says that we can run the pyspark REPL through IPython.
 But the following command does not run IPython but the default Python executable.
 {quote}
 $ IPYTHON=1 ./bin/pyspark
 Python 2.7.8 (default, Jul  2 2014, 10:14:46) 
 ...
 {quote}
 The spark/bin/pyspark script at commit 
 b235e013638685758885842dc3268e9800af3678 decides which executable and options 
 to use in the following way.
 # if PYSPARK_PYTHON unset
 #* → defaulting to python
 # if IPYTHON_OPTS set
 #* → set IPYTHON 1
 # some python script passed to ./bin/pyspark → run it with ./bin/spark-submit
 #* out of this issue's scope
 # if IPYTHON is set to 1
 #* → execute $PYSPARK_PYTHON (default: ipython) with arguments $IPYTHON_OPTS
 #* otherwise execute $PYSPARK_PYTHON
 Therefore, when PYSPARK_PYTHON is unset, python is executed even though IPYTHON 
 is 1.
 In other words, when PYSPARK_PYTHON is unset, IPYTHON_OPTS and IPYTHON have no 
 effect on deciding which command to use.
 ||PYSPARK_PYTHON||IPYTHON_OPTS||IPYTHON||resulting command||expected command||
 |(unset → defaults to python)|(unset)|(unset)|python|(same)|
 |(unset → defaults to python)|(unset)|1|python|ipython|
 |(unset → defaults to python)|an_option|(unset → set to 1)|python an_option|ipython an_option|
 |(unset → defaults to python)|an_option|1|python an_option|ipython an_option|
 |ipython|(unset)|(unset)|ipython|(same)|
 |ipython|(unset)|1|ipython|(same)|
 |ipython|an_option|(unset → set to 1)|ipython an_option|(same)|
 |ipython|an_option|1|ipython an_option|(same)|
 h3. Suggestion
 The pyspark script should first determine whether a user wants to run 
 IPython or another executable.
 # if IPYTHON_OPTS set
 #* set IPYTHON 1
 # if IPYTHON has a value 1
 #* PYSPARK_PYTHON defaults to ipython if not set
 # PYSPARK_PYTHON defaults to python if not set
 See the pull request for more detailed modification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3105) Calling cache() after RDDs are pipelined has no effect in PySpark

2014-10-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156936#comment-14156936
 ] 

Nicholas Chammas commented on SPARK-3105:
-

I think it's definitely important for the larger project that the 3 APIs 
(Scala, Java, and Python) have semantics that are as consistent as possible. 
And for what it's worth, the Scala/Java semantics in this case seem nicer. 

People learn about the DAG as a distinguishing feature of Spark, so it might 
seem strange in PySpark that caching an RDD earlier in a lineage confers no 
benefit on descendant RDDs. Whether the descendant RDDs were defined before or 
after the caching seems like something people shouldn't have to think about.

It sounds like this is a non-trivial change to make, and I don't fully appreciate 
the other implications it might have, but it seems like a good thing to me.

 Calling cache() after RDDs are pipelined has no effect in PySpark
 -

 Key: SPARK-3105
 URL: https://issues.apache.org/jira/browse/SPARK-3105
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0, 1.1.0
Reporter: Josh Rosen
Assignee: Josh Rosen

 PySpark's PipelinedRDD decides whether to pipeline transformations by 
 checking whether those transformations are pipelinable _at the time that the 
 PipelinedRDD objects are created_ rather than at the time that we invoke 
 actions.  This might lead to problems if we call {{cache()}} on an RDD after 
 it's already been used in a pipeline:
 {code}
 rdd1 = sc.parallelize(range(100)).map(lambda x: x)
 rdd2 = rdd1.map(lambda x: 2 * x)
 rdd1.cache()
 rdd2.collect()
 {code}
 When I run this code, I'd expect {{cache()}} to break the pipeline and cache 
 intermediate results, but instead the two transformations are pipelined 
 together in Python, effectively ignoring the {{cache()}}.
 Note that {{cache()}} works properly if we call it before performing any 
 other transformations on the RDD:
 {code}
 rdd1 = sc.parallelize(range(100)).map(lambda x: x).cache()
 rdd2 = rdd1.map(lambda x: 2 * x)
 rdd2.collect()
 {code}
 This works as expected and caches {{rdd1}}.
 To fix this, I think we should decide dynamically whether to pipeline when we 
 actually perform actions, rather than deciding statically when we create the 
 RDDs.
 We should also add tests for this.
 (Thanks to [~tdas] for pointing out this issue.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1270) An optimized gradient descent implementation

2014-10-02 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156964#comment-14156964
 ] 

Peng Cheng commented on SPARK-1270:
---

Yo, any follow-up on this one?
I'm curious about the local-update part, as DistBelief has non-local model 
server shards.

 An optimized gradient descent implementation
 

 Key: SPARK-1270
 URL: https://issues.apache.org/jira/browse/SPARK-1270
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Xusen Yin
  Labels: GradientDescent, MLLib,
 Fix For: 1.0.0


 The current implementation of GradientDescent is inefficient in some respects, 
 especially on high-latency networks. I propose a new implementation of 
 GradientDescent, which follows a parallelism model called 
 GradientDescentWithLocalUpdate, inspired by Jeff Dean's DistBelief and Eric 
 Xing's SSP. With a few modifications to runMiniBatchSGD, 
 GradientDescentWithLocalUpdate can outperform the original sequential version 
 by about 4x without sacrificing accuracy, and can be easily adopted by most 
 classification and regression algorithms in MLlib.
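 As a toy sketch of the local-update idea (not the proposed MLlib code; squared 
 loss, dense arrays, and the helper name localUpdateSGD are all made up for 
 illustration), each partition could take several gradient steps on its own shard 
 before the driver averages the per-partition weights once per outer round:
 {code}
 import org.apache.spark.rdd.RDD

 def localUpdateSGD(
     data: RDD[(Double, Array[Double])],   // (label, features)
     numRounds: Int,
     localSteps: Int,
     stepSize: Double,
     dim: Int): Array[Double] = {
   var weights = Array.fill(dim)(0.0)
   for (_ <- 0 until numRounds) {
     val w0 = weights
     val (sum, count) = data.mapPartitions { iter =>
       val w = w0.clone()
       val points = iter.toArray
       for (_ <- 0 until localSteps; (y, x) <- points) {
         val err = (w, x).zipped.map(_ * _).sum - y          // squared-loss residual
         for (j <- 0 until dim) w(j) -= stepSize * err * x(j)
       }
       Iterator((w, 1))
     }.reduce { case ((w1, c1), (w2, c2)) =>
       ((w1, w2).zipped.map(_ + _), c1 + c2)
     }
     weights = sum.map(_ / count)                            // one averaging step per round
   }
   weights
 }
 {code}
 Only one reduce per outer round crosses the network, which is where the claimed 
 speedup on high-latency links would come from.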



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3768) Modify default YARN memory_overhead-- from an additive constant to a multiplier

2014-10-02 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-3768.
--
Resolution: Fixed

 Modify default YARN memory_overhead-- from an additive constant to a 
 multiplier
 ---

 Key: SPARK-3768
 URL: https://issues.apache.org/jira/browse/SPARK-3768
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves
Assignee: Nishkam Ravi
 Fix For: 1.2.0


 Related to #894 and https://issues.apache.org/jira/browse/SPARK-2398 
 Experiments show that memory_overhead grows with container size. The 
 multiplier has been experimentally obtained and can potentially be improved 
 over time.
 https://github.com/apache/spark/pull/2485
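 For illustration only (the real constants live in the YARN allocator code and 
 may differ from these assumed values), the change amounts to replacing a fixed 
 addition with a scaled floor:
 {code}
 // Assumed values for the sketch: a 384 MB floor and a 0.07 multiplier.
 val overheadMinMB  = 384
 val overheadFactor = 0.07

 def additiveOverhead(executorMemoryMB: Int): Int = overheadMinMB
 def multiplierOverhead(executorMemoryMB: Int): Int =
   math.max(overheadMinMB, (overheadFactor * executorMemoryMB).toInt)

 // e.g. a 20 GB (20480 MB) executor: additive = 384 MB, multiplier = 1433 MB.
 {code}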



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3768) Modify default YARN memory_overhead-- from an additive constant to a multiplier

2014-10-02 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3768:


 Summary: Modify default YARN memory_overhead-- from an additive 
constant to a multiplier
 Key: SPARK-3768
 URL: https://issues.apache.org/jira/browse/SPARK-3768
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves
Assignee: Nishkam Ravi
 Fix For: 1.2.0


Related to #894 and https://issues.apache.org/jira/browse/SPARK-2398 
Experiments show that memory_overhead grows with container size. The multiplier 
has been experimentally obtained and can potentially be improved over time.


https://github.com/apache/spark/pull/2485



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets

2014-10-02 Thread David Martinez Rego (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157047#comment-14157047
 ] 

David Martinez Rego commented on SPARK-1473:


Sorry for having my name incomplete when I first posted. I am David Martinez, 
currently at UCL in London. We had this project abandoned for some time, but we 
will restart a pull request shortly. You can see the current version of the code 
at https://github.com/LIDIAgroup/SparkFeatureSelection.

In response to an earlier post: the framework that Dr Gavin Brown presents is a 
single unified framework because it does not make any assumptions when stating 
the basic probabilistic model for the feature-selection problem. The problem of 
probability estimation that he mentions is not a philosophical question, and it 
is not only a shortcoming of feature selection. The problem is that the number 
of parameters you need to estimate grows exponentially with the number of 
variables if you make no independence assumptions (one parameter per possible 
event). So, to get good estimates of these parameters (the probabilities of 
events), you need an exponential number of samples, since observing all possible 
events requires an exponential number of observations. That is why you need to 
make independence assumptions and follow a greedy strategy in order to be able 
to draw some conclusions.
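
To make the counting argument concrete, a back-of-the-envelope sketch for n 
binary features (plain Scala, just for illustration):

{code}
// A full joint distribution over n binary features has 2^n - 1 free parameters,
// while a fully factorized (independence) model has only n.
def jointParams(n: Int): BigInt = BigInt(2).pow(n) - 1
def independentParams(n: Int): Int = n

// e.g. n = 30: jointParams(30) = 1073741823 versus independentParams(30) = 30,
// which is why assumption-based, greedy selection strategies are used in practice.
{code}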

 Feature selection for high dimensional datasets
 ---

 Key: SPARK-1473
 URL: https://issues.apache.org/jira/browse/SPARK-1473
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ignacio Zendejas
Assignee: Alexander Ulanov
Priority: Minor
  Labels: features

 For classification tasks involving large feature spaces on the order of tens 
 of thousands or higher (e.g., text classification with n-grams, where n > 1), 
 it is often useful to rank and filter features that are irrelevant thereby 
 reducing the feature space by at least one or two orders of magnitude without 
 impacting performance on key evaluation metrics (accuracy/precision/recall).
 A feature evaluation interface which is flexible needs to be designed and at 
 least two methods should be implemented with Information Gain being a 
 priority as it has been shown to be amongst the most reliable.
 Special consideration should be taken in the design to account for wrapper 
 methods (see research papers below) which are more practical for lower 
 dimensional data.
 Relevant research:
 * Brown, G., Pocock, A., Zhao, M. J., and Luján, M. (2012). Conditional
 likelihood maximisation: a unifying framework for information theoretic
 feature selection.*The Journal of Machine Learning Research*, *13*, 27-66.
 * Forman, George. An extensive empirical study of feature selection metrics 
 for text classification. The Journal of machine learning research 3 (2003): 
 1289-1305.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157115#comment-14157115
 ] 

Apache Spark commented on SPARK-3261:
-

User 'derrickburns' has created a pull request for this issue:
https://github.com/apache/spark/pull/2634

 KMeans clusterer can return duplicate cluster centers
 -

 Key: SPARK-3261
 URL: https://issues.apache.org/jira/browse/SPARK-3261
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Derrick Burns

 Returning duplicate cluster centers is a bad design choice. I think it is 
 preferable to produce no duplicate cluster centers: instead of forcing the 
 number of clusters to be K, return at most K clusters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3218) K-Means clusterer can fail on degenerate data

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157113#comment-14157113
 ] 

Apache Spark commented on SPARK-3218:
-

User 'derrickburns' has created a pull request for this issue:
https://github.com/apache/spark/pull/2634

 K-Means clusterer can fail on degenerate data
 -

 Key: SPARK-3218
 URL: https://issues.apache.org/jira/browse/SPARK-3218
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Derrick Burns
Assignee: Derrick Burns

 The KMeans parallel implementation selects points to be cluster centers with 
 probability weighted by their distance to cluster centers.  However, if there 
 are fewer than k DISTINCT points in the data set, this approach will fail.  
 Further, the recent check-in that works around this problem results in the 
 same point being selected repeatedly as a cluster center. 
 The fix is to allow fewer than k cluster centers to be selected.  This 
 requires several changes to the code, as the number of cluster centers is 
 woven into the implementation.
 I have a version of the code that addresses this problem, AND generalizes the 
 distance metric.  However, I see that there are literally hundreds of 
 outstanding pull requests.  If someone will commit to working with me to 
 sponsor the pull request, I will create it.
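 As a toy, driver-side sketch of the behaviour described above (not the MLlib 
 code; all names are made up), distance-weighted seeding can simply stop once the 
 distinct points are exhausted, returning at most min(k, number of distinct 
 points) centers:
 {code}
 import scala.util.Random

 def seedCenters(points: Array[Array[Double]], k: Int, rng: Random): Array[Array[Double]] = {
   def sqDist(a: Array[Double], b: Array[Double]): Double =
     (a, b).zipped.map((x, y) => (x - y) * (x - y)).sum

   val distinct = points.map(_.toVector).distinct.map(_.toArray)
   val centers  = scala.collection.mutable.ArrayBuffer(distinct(rng.nextInt(distinct.length)))
   while (centers.length < math.min(k, distinct.length)) {
     // Weight each distinct point by its squared distance to the nearest chosen center.
     val weights = distinct.map(p => centers.map(c => sqDist(p, c)).min)
     val total   = weights.sum
     if (total == 0.0) return centers.toArray   // every distinct point is already a center
     var r = rng.nextDouble() * total
     var i = 0
     while (i < distinct.length - 1 && r >= weights(i)) { r -= weights(i); i += 1 }
     centers += distinct(i)
   }
   centers.toArray
 }
 {code}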



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157114#comment-14157114
 ] 

Apache Spark commented on SPARK-3219:
-

User 'derrickburns' has created a pull request for this issue:
https://github.com/apache/spark/pull/2634

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

 The K-Means clusterer supports the Euclidean distance metric. However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions, which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.
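 For illustration, a pluggable distance function could look roughly like the 
 trait below (the names are made up, not the API from the fork or pull request); 
 a Bregman divergence is defined by a strictly convex function F and its 
 gradient, D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>:
 {code}
 trait BregmanDivergence {
   def F(x: Array[Double]): Double
   def gradF(x: Array[Double]): Array[Double]
   def divergence(x: Array[Double], y: Array[Double]): Double = {
     val g = gradF(y)
     F(x) - F(y) - (0 until x.length).map(i => g(i) * (x(i) - y(i))).sum
   }
 }

 // F(x) = ||x||^2 recovers the squared Euclidean distance used today.
 object SquaredEuclidean extends BregmanDivergence {
   def F(x: Array[Double]): Double = x.map(v => v * v).sum
   def gradF(x: Array[Double]): Array[Double] = x.map(2 * _)
 }
 {code}
 The clusterer would then take a {{BregmanDivergence}}-style parameter instead of 
 hard-coding the Euclidean case.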



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3424) KMeans Plus Plus is too slow

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157116#comment-14157116
 ] 

Apache Spark commented on SPARK-3424:
-

User 'derrickburns' has created a pull request for this issue:
https://github.com/apache/spark/pull/2634

 KMeans Plus Plus is too slow
 

 Key: SPARK-3424
 URL: https://issues.apache.org/jira/browse/SPARK-3424
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Derrick Burns

 The KMeansPlusPlus algorithm is implemented in O(m k^2) time, where m is the 
 number of rounds of the KMeansParallel algorithm and k is the number of 
 clusters. This can be dramatically improved by maintaining the distance to the 
 closest cluster center from round to round and incrementally updating that 
 value for each point. The incremental update is O(1) per point, which reduces 
 the running time of KMeansPlusPlus to O(m k). For large k, this is 
 significant.
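 The incremental bookkeeping amounts to something like the toy sketch below (not 
 the MLlib code; names are made up): keep each point's current minimum squared 
 distance and refresh it against only the newest center after each selection, 
 which is O(number of points) per round instead of O(points × centers):
 {code}
 def sqDist(a: Array[Double], b: Array[Double]): Double =
   (a, b).zipped.map((x, y) => (x - y) * (x - y)).sum

 // Called once per newly selected center.
 def updateMinDistances(points: Array[Array[Double]],
                        minDist: Array[Double],       // running minimum per point
                        newCenter: Array[Double]): Unit = {
   var i = 0
   while (i < points.length) {
     val d = sqDist(points(i), newCenter)
     if (d < minDist(i)) minDist(i) = d
     i += 1
   }
 }
 {code}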



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path

2014-10-02 Thread Tom Weber (JIRA)
Tom Weber created SPARK-3769:


 Summary: SparkFiles.get gives me the wrong fully qualified path
 Key: SPARK-3769
 URL: https://issues.apache.org/jira/browse/SPARK-3769
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.1.0, 1.0.2
 Environment: linux host, and linux grid.
Reporter: Tom Weber
Priority: Minor



My Spark program runs on my host, submitting work to my grid.

JavaSparkContext sc = new JavaSparkContext(conf);
final String path = args[1];
sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */

The log shows:
14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to 
/tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas
14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at 
http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986

Those are paths on my host machine. The location this file ends up at on the 
grid nodes is:
/opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas

While the call to get the path in my code that runs in my mapPartitions 
function on the grid nodes is:

String pgm = SparkFiles.get(path);

And this returns the following string:
/opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas


So, am I expected to take the qualified path that was given to me and parse it 
to get only the file name at the end, and then concatenate that to what I get 
from the SparkFiles.getRootDirectory() call in order to get this to work?
Or pass only the parsed file name to the SparkFiles.get method? Seems as though 
I should be able to pass the same file specification to both sc.addFile() and 
SparkFiles.get() and get the correct location of the file.

Thanks,
Tom






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3766) Snappy is also the default compression codec for broadcast variables

2014-10-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3766.

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: wangfei

 Snappy is also the default compression codec for broadcast variables
 

 Key: SPARK-3766
 URL: https://issues.apache.org/jira/browse/SPARK-3766
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.1.0
Reporter: wangfei
Assignee: wangfei
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path

2014-10-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157156#comment-14157156
 ] 

Sean Owen commented on SPARK-3769:
--

My understanding is that you execute:

{code}
sc.addFile("/opt/tom/SparkFiles.sas");
...
SparkFiles.get("SparkFiles.sas");
{code}

I would not expect the key used by remote workers to depend on the location on 
the driver that the file came from. The path may not be absolute in all cases 
anyway. I can see the argument that both should use the same key, but really 
the key being set is the file name, not the path.

You don't have to parse it by hand though. Usually you might do something like 
this anyway:

{code}
File myFile = new File(args[1]);
sc.addFile(myFile.getAbsolutePath());
String fileName = myFile.getName();
...
SparkFiles.get(fileName);
{code}

AFAIK this is as intended.

 SparkFiles.get gives me the wrong fully qualified path
 --

 Key: SPARK-3769
 URL: https://issues.apache.org/jira/browse/SPARK-3769
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.2, 1.1.0
 Environment: linux host, and linux grid.
Reporter: Tom Weber
Priority: Minor

 My spark pgm running on my host, (submitting work to my grid).
 JavaSparkContext sc =new JavaSparkContext(conf);
 final String path = args[1];
 sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */
 The log shows:
 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to 
 /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas
 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at 
 http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986
 those are paths on my host machine. The location that this file gets on grid 
 nodes is:
 /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas
 While the call to get the path in my code that runs in my mapPartitions 
 function on the grid nodes is:
 String pgm = SparkFiles.get(path);
 And this returns the following string:
 /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas
 So, am I expected to take the qualified path that was given to me and parse 
 it to get only the file name at the end, and then concatenate that to what I 
 get from the SparkFiles.getRootDirectory() call in order to get this to work?
 Or pass only the parsed file name to the SparkFiles.get method? Seems as 
 though I should be able to pass the same file specification to both 
 sc.addFile() and SparkFiles.get() and get the correct location of the file.
 Thanks,
 Tom



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3495) Block replication fails continuously when the replication target node is dead

2014-10-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3495:
---
Target Version/s: 1.2.0  (was: 1.1.1, 1.2.0)

 Block replication fails continuously when the replication target node is dead
 -

 Key: SPARK-3495
 URL: https://issues.apache.org/jira/browse/SPARK-3495
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core, Streaming
Affects Versions: 1.0.2, 1.1.0
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical
 Fix For: 1.2.0


 If a block manager (say, A) wants to replicate a block and the node chosen 
 for replication (say, B) is dead, then the attempt to send the block to B 
 fails. However, this continues to fail indefinitely. Even if the driver 
 learns about the demise of B, A continues to try replicating to B and 
 failing miserably. 
 The reason behind this bug is that A initially fetches a list of peers from 
 the driver (when B was active), but never updates it after B is dead. This 
 affects Spark Streaming as its receiver uses block replication.
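 A rough sketch of the intended fix (names here are illustrative, not Spark's 
 actual BlockManager API): drop the cached peer list after a failed attempt and 
 re-fetch it from the driver before retrying, so a dead target is eventually 
 abandoned.
 {code}
 import scala.util.Try

 case class PeerId(host: String, port: Int)

 class ReplicationSketch(
     fetchPeersFromDriver: () => Seq[PeerId],  // e.g. ask the BlockManager master
     replicateTo: PeerId => Unit) {            // throws if the peer is unreachable

   private var cachedPeers: Seq[PeerId] = Seq.empty

   def replicateOnce(maxAttempts: Int = 3): Boolean = {
     var attempts = 0
     while (attempts < maxAttempts) {
       if (cachedPeers.isEmpty) cachedPeers = fetchPeersFromDriver()
       cachedPeers.headOption match {
         case Some(peer) if Try(replicateTo(peer)).isSuccess =>
           return true
         case Some(_) =>
           cachedPeers = Seq.empty  // forget the stale list, re-fetch next time
         case None =>
           return false             // the driver knows no live peers
       }
       attempts += 1
     }
     false
   }
 }
 {code}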



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3496) Block replication can by mistake choose driver BlockManager as a peer for replication

2014-10-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3496.

   Resolution: Fixed
Fix Version/s: 1.2.0

 Block replication can by mistake choose driver BlockManager as a peer for 
 replication
 -

 Key: SPARK-3496
 URL: https://issues.apache.org/jira/browse/SPARK-3496
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core, Streaming
Affects Versions: 1.0.2, 1.1.0
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical
 Fix For: 1.2.0


 When selecting peer block managers for replicating a block, the driver block 
 manager can also get chosen accidentally. This is because 
 BlockManagerMasterActor did not filter out the driver block manager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3495) Block replication fails continuously when the replication target node is dead

2014-10-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3495.

   Resolution: Fixed
Fix Version/s: 1.2.0

 Block replication fails continuously when the replication target node is dead
 -

 Key: SPARK-3495
 URL: https://issues.apache.org/jira/browse/SPARK-3495
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core, Streaming
Affects Versions: 1.0.2, 1.1.0
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical
 Fix For: 1.2.0


 If a block manager (say, A) wants to replicate a block and the node chosen 
 for replication (say, B) is dead, then the attempt to send the block to B 
 fails. However, this continues to fail indefinitely. Even if the driver 
 learns about the demise of B, A continues to try replicating to B and 
 failing miserably. 
 The reason behind this bug is that A initially fetches a list of peers from 
 the driver (when B was active), but never updates it after B is dead. This 
 affects Spark Streaming as its receiver uses block replication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3496) Block replication can by mistake choose driver BlockManager as a peer for replication

2014-10-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3496:
---
Target Version/s: 1.2.0  (was: 1.1.1, 1.2.0)

 Block replication can by mistake choose driver BlockManager as a peer for 
 replication
 -

 Key: SPARK-3496
 URL: https://issues.apache.org/jira/browse/SPARK-3496
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core, Streaming
Affects Versions: 1.0.2, 1.1.0
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical
 Fix For: 1.2.0


 When selecting peer block managers for replicating a block, the driver block 
 manager can also get chosen accidentally. This is because 
 BlockManagerMasterActor did not filter out the driver block manager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3632) ConnectionManager can run out of receive threads with authentication on

2014-10-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3632.

   Resolution: Fixed
Fix Version/s: 1.2.0

 ConnectionManager can run out of receive threads with authentication on
 ---

 Key: SPARK-3632
 URL: https://issues.apache.org/jira/browse/SPARK-3632
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Critical
 Fix For: 1.2.0


 If you turn authentication on and you are using a lot of executors, there is 
 a chance that all of the threads in the handleMessageExecutor could be 
 waiting to send a message because they are blocked waiting on authentication 
 to happen. This can cause a temporary deadlock until the connection times out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3770) The userFeatures RDD from MatrixFactorizationModel isn't accessible from the python bindings

2014-10-02 Thread Michelangelo D'Agostino (JIRA)
Michelangelo D'Agostino created SPARK-3770:
--

 Summary: The userFeatures RDD from MatrixFactorizationModel isn't 
accessible from the python bindings
 Key: SPARK-3770
 URL: https://issues.apache.org/jira/browse/SPARK-3770
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Michelangelo D'Agostino


We need access to the underlying latent user features from Python.  However, 
the userFeatures RDD from the MatrixFactorizationModel isn't accessible from 
the Python bindings.  I've fixed this with a PR that I'll submit shortly; it 
adds a method to the underlying Scala class to turn the RDD[(Int, 
Array[Double])] into an RDD[String], which is then consumed from the Python 
side in recommendation.py.
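A rough sketch of the kind of helper described above (the method name and the 
string format used by the actual PR may differ):

{code}
import org.apache.spark.rdd.RDD

// Serialize each (userId, features) pair into a flat string so the Python side
// can parse it back. Illustrative only; the real PR may use a different format.
def userFeaturesToStrings(userFeatures: RDD[(Int, Array[Double])]): RDD[String] =
  userFeatures.map { case (id, features) =>
    (id.toString +: features.map(_.toString)).mkString(",")
  }
{code}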



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1284) pyspark hangs after IOError on Executor

2014-10-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-1284.
---
   Resolution: Fixed
Fix Version/s: 1.1.0

I think this is a logging issue; it should be fixed by 
https://github.com/apache/spark/pull/1625, so I am closing it.

If anyone meets this again, we can reopen it.

 pyspark hangs after IOError on Executor
 ---

 Key: SPARK-1284
 URL: https://issues.apache.org/jira/browse/SPARK-1284
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Jim Blomo
Assignee: Davies Liu
 Fix For: 1.1.0


 When running a reduceByKey over a cached RDD, Python fails with an exception, 
 but the failure is not detected by the task runner.  Spark and the pyspark 
 shell hang waiting for the task to finish.
 The error is:
 {code}
 PySpark worker failed with exception:
 Traceback (most recent call last):
   File /home/hadoop/spark/python/pyspark/worker.py, line 77, in main
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /home/hadoop/spark/python/pyspark/serializers.py, line 182, in 
 dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /home/hadoop/spark/python/pyspark/serializers.py, line 118, in 
 dump_stream
 self._write_with_length(obj, stream)
   File /home/hadoop/spark/python/pyspark/serializers.py, line 130, in 
 _write_with_length
 stream.write(serialized)
 IOError: [Errno 104] Connection reset by peer
 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as 
 4257 bytes in 47 ms
 Traceback (most recent call last):
   File /home/hadoop/spark/python/pyspark/daemon.py, line 117, in 
 launch_worker
 worker(listen_sock)
   File /home/hadoop/spark/python/pyspark/daemon.py, line 107, in worker
 outfile.flush()
 IOError: [Errno 32] Broken pipe
 {code}
 I can reproduce the error by running take(10) on the cached RDD before 
 running reduceByKey (which looks at the whole input file).
 Affects Version 1.0.0-SNAPSHOT (4d88030486)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3770) The userFeatures RDD from MatrixFactorizationModel isn't accessible from the python bindings

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157242#comment-14157242
 ] 

Apache Spark commented on SPARK-3770:
-

User 'mdagost' has created a pull request for this issue:
https://github.com/apache/spark/pull/2636

 The userFeatures RDD from MatrixFactorizationModel isn't accessible from the 
 python bindings
 

 Key: SPARK-3770
 URL: https://issues.apache.org/jira/browse/SPARK-3770
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Michelangelo D'Agostino

 We need access to the underlying latent user features from python.  However, 
 the userFeatures RDD from the MatrixFactorizationModel isn't accessible from 
 the python bindings.  I've fixed this with a PR that I'll submit shortly that 
 adds a method to the underlying scala class to turn the RDD[(Int, 
 Array[Double])] to an RDD[String].  This is then accessed from the python 
 recommendation.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path

2014-10-02 Thread Tom Weber (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157263#comment-14157263
 ] 

Tom Weber commented on SPARK-3769:
--

Thanks for the quick turnaround!

I can see that it wouldn't necessarily make sense for a fully qualified path 
(relative to the driver program's filesystem) to be what the .get method takes 
on the worker node systems. But, at the same time, .get seems to take whatever 
you give it and blindly concatenate it to the .getRootDirectory result, without 
even validating it or failing if that file doesn't exist.

I appreciate the File object methods for pulling the path name apart; I'll use 
that and it will work just fine. First time playing around with all of this, so 
sometimes what you expect it to do is just a matter of thinking about it a 
particular way :)

You can close this ticket out, as I'm sure I'll be able to work fine by using 
the full path on the driver side and only the file name on the worker side. 
Seems like it might be convenient, though, if this matched set of routines did 
this itself, since the driver side needs a qualified path to find the file, and 
the worker side, by definition, strips that off and only puts the file in the 
designated work directory (which makes sense of course). No big deal though.

Thanks again,
Tom




 SparkFiles.get gives me the wrong fully qualified path
 --

 Key: SPARK-3769
 URL: https://issues.apache.org/jira/browse/SPARK-3769
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.2, 1.1.0
 Environment: linux host, and linux grid.
Reporter: Tom Weber
Priority: Minor

 My spark pgm running on my host, (submitting work to my grid).
 JavaSparkContext sc =new JavaSparkContext(conf);
 final String path = args[1];
 sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */
 The log shows:
 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to 
 /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas
 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at 
 http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986
 those are paths on my host machine. The location that this file gets on grid 
 nodes is:
 /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas
 While the call to get the path in my code that runs in my mapPartitions 
 function on the grid nodes is:
 String pgm = SparkFiles.get(path);
 And this returns the following string:
 /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas
 So, am I expected to take the qualified path that was given to me and parse 
 it to get only the file name at the end, and then concatenate that to what I 
 get from the SparkFiles.getRootDirectory() call in order to get this to work?
 Or pass only the parsed file name to the SparkFiles.get method? Seems as 
 though I should be able to pass the same file specification to both 
 sc.addFile() and SparkFiles.get() and get the correct location of the file.
 Thanks,
 Tom



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711

2014-10-02 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157351#comment-14157351
 ] 

Marcelo Vanzin commented on SPARK-3633:
---

Hey [~pwendell] [~matei], is anyone actively looking at this issue?

 Fetches failure observed after SPARK-2711
 -

 Key: SPARK-3633
 URL: https://issues.apache.org/jira/browse/SPARK-3633
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 1.1.0
Reporter: Nishkam Ravi

 Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. 
 Recently upgraded to Spark 1.1. The workload fails with the following error 
 message(s):
 {code}
 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, 
 c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, 
 c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
 {code}
 In order to identify the problem, I carried out change set analysis. As I go 
 back in time, the error message changes to:
 {code}
 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, 
 c1706.halxg.cloudera.com): java.io.FileNotFoundException: 
 /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034
  (Too many open files)
 java.io.FileOutputStream.open(Native Method)
 java.io.FileOutputStream.<init>(FileOutputStream.java:221)
 
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
 
 org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
 org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
 
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}
 All the way until Aug 4th. Turns out the problem changeset is 4fde28c. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path

2014-10-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157389#comment-14157389
 ] 

Josh Rosen commented on SPARK-3769:
---

I think that {{SparkFiles.get()}} can be called from driver code, too, so 
that's one option if you'd like to achieve consistency between driver and 
executor code.
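For example (a Scala sketch of the same pattern; the Java version is analogous, 
and the class and app names here are made up):

{code}
import java.io.File
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object SparkFilesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkFilesSketch"))
    val fileName = new File(args(0)).getName  // e.g. "SparkFiles.sas"
    sc.addFile(args(0))                       // full path on the driver
    println(SparkFiles.get(fileName))         // resolved path on the driver
    sc.parallelize(1 to 2).foreach { _ =>
      println(SparkFiles.get(fileName))       // resolved path on each executor
    }
    sc.stop()
  }
}
{code}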

 SparkFiles.get gives me the wrong fully qualified path
 --

 Key: SPARK-3769
 URL: https://issues.apache.org/jira/browse/SPARK-3769
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.2, 1.1.0
 Environment: linux host, and linux grid.
Reporter: Tom Weber
Priority: Minor

 My spark pgm running on my host, (submitting work to my grid).
 JavaSparkContext sc =new JavaSparkContext(conf);
 final String path = args[1];
 sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */
 The log shows:
 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to 
 /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas
 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at 
 http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986
 those are paths on my host machine. The location that this file gets on grid 
 nodes is:
 /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas
 While the call to get the path in my code that runs in my mapPartitions 
 function on the grid nodes is:
 String pgm = SparkFiles.get(path);
 And this returns the following string:
 /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas
 So, am I expected to take the qualified path that was given to me and parse 
 it to get only the file name at the end, and then concatenate that to what I 
 get from the SparkFiles.getRootDirectory() call in order to get this to work?
 Or pass only the parsed file name to the SparkFiles.get method? Seems as 
 though I should be able to pass the same file specification to both 
 sc.addFile() and SparkFiles.get() and get the correct location of the file.
 Thanks,
 Tom



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path

2014-10-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3769.
---
Resolution: Not a Problem

 SparkFiles.get gives me the wrong fully qualified path
 --

 Key: SPARK-3769
 URL: https://issues.apache.org/jira/browse/SPARK-3769
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.2, 1.1.0
 Environment: linux host, and linux grid.
Reporter: Tom Weber
Priority: Minor

 My spark pgm running on my host, (submitting work to my grid).
 JavaSparkContext sc =new JavaSparkContext(conf);
 final String path = args[1];
 sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */
 The log shows:
 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to 
 /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas
 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at 
 http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986
 those are paths on my host machine. The location that this file gets on grid 
 nodes is:
 /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas
 While the call to get the path in my code that runs in my mapPartitions 
 function on the grid nodes is:
 String pgm = SparkFiles.get(path);
 And this returns the following string:
 /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas
 So, am I expected to take the qualified path that was given to me and parse 
 it to get only the file name at the end, and then concatenate that to what I 
 get from the SparkFiles.getRootDirectory() call in order to get this to work?
 Or pass only the parsed file name to the SparkFiles.get method? Seems as 
 though I should be able to pass the same file specification to both 
 sc.addFile() and SparkFiles.get() and get the correct location of the file.
 Thanks,
 Tom



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-1671) Cached tables should follow write-through policy

2014-10-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-1671:
---

Assignee: Michael Armbrust

 Cached tables should follow write-through policy
 

 Key: SPARK-1671
 URL: https://issues.apache.org/jira/browse/SPARK-1671
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Cheng Lian
Assignee: Michael Armbrust
  Labels: cache, column

 Writing (insert / load) to a cached table causes cache inconsistency, and the 
 user has to unpersist and cache the whole table again.
 The write-through policy may be implemented with {{RDD.union}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.

2014-10-02 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157418#comment-14157418
 ] 

Takuya Ueshin commented on SPARK-3764:
--

{{AppendingParquetOutputFormat}} is using {{TaskAttemptContext}}, which is a 
class in 
[hadoop-1|https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/TaskAttemptContext.html]
 but is an interface in 
[hadoop-2|http://hadoop.apache.org/docs/r2.5.1/api/org/apache/hadoop/mapreduce/TaskAttemptContext.html],
 so the {{context.getTaskAttemptID}} is source-compatible but not 
binary-compatible. If Spark itself is built against hadoop-1, the artifact is 
for only hadoop-1, and vice versa.

 Invalid dependencies of artifacts in Maven Central Repository.
 --

 Key: SPARK-3764
 URL: https://issues.apache.org/jira/browse/SPARK-3764
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Takuya Ueshin

 While testing my spark applications locally using spark artifacts downloaded 
 from Maven Central, the following exception was thrown:
 {quote}
 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
 Thread[Executor task launch worker-2,5,main]
 java.lang.IncompatibleClassChangeError: Found class 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
   at 
 org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 {quote}
 This is because the hadoop class {{TaskAttemptContext}} is incompatible 
 between hadoop-1 and hadoop-2.
 I guess the spark artifacts in Maven Central were built against hadoop-2 with 
 Maven, but the hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
 hadoop version mismatch happens.
 FYI:
 sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
 correctly resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3771) AppendingParquetOutputFormat should use reflection to prevent breaking binary-compatibility.

2014-10-02 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-3771:


 Summary: AppendingParquetOutputFormat should use reflection to 
prevent breaking binary-compatibility.
 Key: SPARK-3771
 URL: https://issues.apache.org/jira/browse/SPARK-3771
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Takuya Ueshin


Original problem is 
[SPARK-3764|https://issues.apache.org/jira/browse/SPARK-3764].

{{AppendingParquetOutputFormat}} uses a binary-incompatible method 
{{context.getTaskAttemptID}}.
This makes the Spark artifact itself binary-incompatible, i.e. if Spark is built 
against hadoop-1, the artifact works only with hadoop-1, and vice versa.
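A minimal sketch of the reflection approach (illustrative; the actual patch may 
differ):

{code}
import org.apache.hadoop.mapreduce.{TaskAttemptContext, TaskAttemptID}

// Looking the method up reflectively avoids linking against a signature that
// differs between hadoop-1 (class) and hadoop-2 (interface).
def getTaskAttemptID(context: TaskAttemptContext): TaskAttemptID = {
  val method = context.getClass.getMethod("getTaskAttemptID")
  method.invoke(context).asInstanceOf[TaskAttemptID]
}
{code}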



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3771) AppendingParquetOutputFormat should use reflection to prevent breaking binary-compatibility.

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157441#comment-14157441
 ] 

Apache Spark commented on SPARK-3771:
-

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2638

 AppendingParquetOutputFormat should use reflection to prevent breaking 
 binary-compatibility.
 

 Key: SPARK-3771
 URL: https://issues.apache.org/jira/browse/SPARK-3771
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Takuya Ueshin

 Original problem is 
 [SPARK-3764|https://issues.apache.org/jira/browse/SPARK-3764].
 {{AppendingParquetOutputFormat}} uses a binary-incompatible method 
 {{context.getTaskAttemptID}}.
 This makes the Spark artifact itself binary-incompatible, i.e. if Spark is 
 built against hadoop-1, the artifact works only with hadoop-1, and vice versa.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.

2014-10-02 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157447#comment-14157447
 ] 

Takuya Ueshin commented on SPARK-3764:
--

I filed a new issue, SPARK-3771, and am closing this one. Thanks, [~srowen]!

 Invalid dependencies of artifacts in Maven Central Repository.
 --

 Key: SPARK-3764
 URL: https://issues.apache.org/jira/browse/SPARK-3764
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Takuya Ueshin

 While testing my spark applications locally using spark artifacts downloaded 
 from Maven Central, the following exception was thrown:
 {quote}
 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
 Thread[Executor task launch worker-2,5,main]
 java.lang.IncompatibleClassChangeError: Found class 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
   at 
 org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 {quote}
 This is because the hadoop class {{TaskAttemptContext}} is incompatible 
 between hadoop-1 and hadoop-2.
 I guess the spark artifacts in Maven Central were built against hadoop-2 with 
 Maven, but the hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
 hadoop version mismatch happens.
 FYI:
 sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
 correctly resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.

2014-10-02 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin closed SPARK-3764.

Resolution: Fixed

 Invalid dependencies of artifacts in Maven Central Repository.
 --

 Key: SPARK-3764
 URL: https://issues.apache.org/jira/browse/SPARK-3764
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Takuya Ueshin

 While testing my spark applications locally using spark artifacts downloaded 
 from Maven Central, the following exception was thrown:
 {quote}
 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
 Thread[Executor task launch worker-2,5,main]
 java.lang.IncompatibleClassChangeError: Found class 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
   at 
 org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 {quote}
 This is because the hadoop class {{TaskAttemptContext}} is incompatible 
 between hadoop-1 and hadoop-2.
 I guess the spark artifacts in Maven Central were built against hadoop-2 with 
 Maven, but the hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
 hadoop version mismatch happens.
 FYI:
 sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
 correctly resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number

2014-10-02 Thread cocoatomo (JIRA)
cocoatomo created SPARK-3772:


 Summary: RDD operation on IPython REPL failed with an illegal port 
number
 Key: SPARK-3772
 URL: https://issues.apache.org/jira/browse/SPARK-3772
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0
Reporter: cocoatomo


To reproduce this issue, execute the following commands.

{quote}
$ PYSPARK_PYTHON=ipython ./bin/pyspark
...
In [1]: file = sc.textFile('README.md')
In [2]: file.first()
...
14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded
14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1
14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at PythonRDD.scala:334
14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at PythonRDD.scala:334) 
with 1 output partitions (allowLocal=true)
14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at 
PythonRDD.scala:334)
14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List()
14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List()
14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD at 
PythonRDD.scala:44), which has no missing parents
14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with 
curMem=57388, maxMem=278019440
14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 4.4 KB, free 265.1 MB)
14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
(PythonRDD[2] at RDD at PythonRDD.scala:44)
14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, PROCESS_LOCAL, 1207 bytes)
14/10/03 08:50:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/10/03 08:50:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: port out of range:1027423549
at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
at java.net.InetSocketAddress.<init>(InetSocketAddress.java:188)
at java.net.Socket.<init>(Socket.java:244)
at 
org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
at 
org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:100)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:71)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:744)
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3572) Support register UserType in SQL

2014-10-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3572:
-
Assignee: Joseph K. Bradley

 Support register UserType in SQL
 

 Key: SPARK-3572
 URL: https://issues.apache.org/jira/browse/SPARK-3572
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Xiangrui Meng
Assignee: Joseph K. Bradley

 If a user knows how to map a class to a struct type in Spark SQL, he should 
 be able to register this mapping through sqlContext and hence SQL can figure 
 out the schema automatically.
 {code}
 trait RowSerializer[T] {
   def dataType: StructType
   def serialize(obj: T): Row
   def deserialize(row: Row): T
 }
 sqlContext.registerUserType[T](clazz: classOf[T], serializer: 
 classOf[RowSerializer[T]])
 {code}
 In sqlContext, we can maintain a class-to-serializer map and use it for 
 conversion. The serializer class can be embedded into the metadata, so when 
 `select` is called, we know we want to deserialize the result.
 {code}
 sqlContext.registerUserType(classOf[Vector], classOf[VectorRowSerializer])
 val points: RDD[LabeledPoint] = ...
 val features: RDD[Vector] = points.select('features).map { case Row(v: 
 Vector) = v }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2066) Better error message for non-aggregated attributes with aggregates

2014-10-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2066:

Priority: Critical  (was: Major)

 Better error message for non-aggregated attributes with aggregates
 --

 Key: SPARK-2066
 URL: https://issues.apache.org/jira/browse/SPARK-2066
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Cheng Lian
Priority: Critical

 [~marmbrus]
 Run the following query
 {code}
 scala> c.hql("select key, count(*) from src").collect()
 {code}
 Got the following exception at runtime
 {code}
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function 
 to evaluate expression. type: AttributeReference, tree: key#61
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:157)
   at 
 org.apache.spark.sql.catalyst.expressions.Projection.apply(Projection.scala:35)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:154)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:134)
   at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:558)
   at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:558)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
   at org.apache.spark.scheduler.Task.run(Task.scala:51)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 This should either fail at analysis time or pass at runtime; it definitely 
 shouldn't fail at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3755) Do not bind port 1 - 1024 to server in spark

2014-10-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3755.

   Resolution: Fixed
Fix Version/s: 1.1.1

 Do not bind port 1 - 1024 to server in spark
 

 Key: SPARK-3755
 URL: https://issues.apache.org/jira/browse/SPARK-3755
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: wangfei
Assignee: wangfei
 Fix For: 1.1.1, 1.2.0


 A non-root user who uses a port in the range 1-1024 to start the jetty server 
 will get the exception java.net.SocketException: Permission denied



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3773) Sphinx build warnings

2014-10-02 Thread cocoatomo (JIRA)
cocoatomo created SPARK-3773:


 Summary: Sphinx build warnings
 Key: SPARK-3773
 URL: https://issues.apache.org/jira/browse/SPARK-3773
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0, 
Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, docutils==0.12, 
numpy==1.9.0
Reporter: cocoatomo
Priority: Minor


When building the Sphinx documents for PySpark, we get 12 warnings.
Almost all of them are caused by docstrings in broken reST format.

To reproduce this issue, run the following commands.

{quote}
$ cd ./python/docs
$ make clean html
...
/Users/user/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of 
pyspark.SparkContext.sequenceFile:4: ERROR: Unexpected indentation.
/Users/user/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of 
pyspark.RDD.saveAsSequenceFile:4: ERROR: Unexpected indentation.
/Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring
 of pyspark.mllib.classification.LogisticRegressionWithSGD.train:14: ERROR: 
Unexpected indentation.
/Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring
 of pyspark.mllib.classification.LogisticRegressionWithSGD.train:16: WARNING: 
Definition list ends without a blank line; unexpected unindent.
/Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring
 of pyspark.mllib.classification.LogisticRegressionWithSGD.train:17: WARNING: 
Block quote ends without a blank line; unexpected unindent.
/Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring
 of pyspark.mllib.classification.SVMWithSGD.train:14: ERROR: Unexpected 
indentation.
/Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring
 of pyspark.mllib.classification.SVMWithSGD.train:16: WARNING: Definition list 
ends without a blank line; unexpected unindent.
/Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring
 of pyspark.mllib.classification.SVMWithSGD.train:17: WARNING: Block quote ends 
without a blank line; unexpected unindent.
/Users/user/MyRepos/Scala/spark/python/docs/pyspark.mllib.rst:50: WARNING: 
missing attribute mentioned in :members: or __all__: module 
pyspark.mllib.regression, attribute RidgeRegressionModelLinearRegressionWithSGD
/Users/user/MyRepos/Scala/spark/python/pyspark/mllib/tree.py:docstring of 
pyspark.mllib.tree.DecisionTreeModel.predict:3: ERROR: Unexpected indentation.
...
checking consistency... 
/Users/user/MyRepos/Scala/spark/python/docs/modules.rst:: WARNING: document 
isn't included in any toctree
...
copying static files... WARNING: html_static_path entry 
u'/Users/user/MyRepos/Scala/spark/python/docs/_static' does not exist
...
build succeeded, 12 warnings.
{quote}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3759) SparkSubmitDriverBootstrapper should return exit code of driver process

2014-10-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3759.

  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Assignee: Eric Eijkelenboom
Target Version/s: 1.1.1, 1.2.0

 SparkSubmitDriverBootstrapper should return exit code of driver process
 ---

 Key: SPARK-3759
 URL: https://issues.apache.org/jira/browse/SPARK-3759
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.1.0
 Environment: Linux, Windows, Scala/Java
Reporter: Eric Eijkelenboom
Assignee: Eric Eijkelenboom
Priority: Minor
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 SparkSubmitDriverBootstrapper.scala currently always returns exit code 0. 
 Instead, it should return the exit code of the driver process.
 Suggested code change in SparkSubmitDriverBootstrapper, line 157: 
 {code}
 val returnCode = process.waitFor()
 sys.exit(returnCode)
 {code}
 Workaround for this issue: 
 Instead of specifying 'driver.extra*' properties in spark-defaults.conf, pass 
 these properties to spark-submit directly. This will launch the driver 
 program without the use of SparkSubmitDriverBootstrapper. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number

2014-10-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157548#comment-14157548
 ] 

Josh Rosen commented on SPARK-3772:
---

Can you post the SHA of the commit that you were using when you saw this?

 RDD operation on IPython REPL failed with an illegal port number
 

 Key: SPARK-3772
 URL: https://issues.apache.org/jira/browse/SPARK-3772
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0
Reporter: cocoatomo
  Labels: pyspark

 To reproduce this issue, we should execute following commands.
 {quote}
 $ PYSPARK_PYTHON=ipython ./bin/pyspark
 ...
 In [1]: file = sc.textFile('README.md')
 In [2]: file.first()
 ...
 14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded
 14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1
 14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at 
 PythonRDD.scala:334
 14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at 
 PythonRDD.scala:334) with 1 output partitions (allowLocal=true)
 14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at 
 PythonRDD.scala:334)
 14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List()
 14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List()
 14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD 
 at PythonRDD.scala:44), which has no missing parents
 14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with 
 curMem=57388, maxMem=278019440
 14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in 
 memory (estimated size 4.4 KB, free 265.1 MB)
 14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
 (PythonRDD[2] at RDD at PythonRDD.scala:44)
 14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
 localhost, PROCESS_LOCAL, 1207 bytes)
 14/10/03 08:50:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
 14/10/03 08:50:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
 java.lang.IllegalArgumentException: port out of range:1027423549
   at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
   at java.net.InetSocketAddress.init(InetSocketAddress.java:188)
   at java.net.Socket.init(Socket.java:244)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:100)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:71)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:744)
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number

2014-10-02 Thread cocoatomo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cocoatomo updated SPARK-3772:
-
Description: 
To reproduce this issue, we should execute following commands on the commit: 
6e27cb630de69fa5acb510b4e2f6b980742b1957.

{quote}
$ PYSPARK_PYTHON=ipython ./bin/pyspark
...
In [1]: file = sc.textFile('README.md')
In [2]: file.first()
...
14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded
14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1
14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at PythonRDD.scala:334
14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at PythonRDD.scala:334) 
with 1 output partitions (allowLocal=true)
14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at 
PythonRDD.scala:334)
14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List()
14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List()
14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD at 
PythonRDD.scala:44), which has no missing parents
14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with 
curMem=57388, maxMem=278019440
14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 4.4 KB, free 265.1 MB)
14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
(PythonRDD[2] at RDD at PythonRDD.scala:44)
14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, PROCESS_LOCAL, 1207 bytes)
14/10/03 08:50:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/10/03 08:50:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: port out of range:1027423549
at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
at java.net.InetSocketAddress.<init>(InetSocketAddress.java:188)
at java.net.Socket.<init>(Socket.java:244)
at 
org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
at 
org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:100)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:71)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:744)
{quote}

  was:
To reproduce this issue, execute the following commands.

{quote}
$ PYSPARK_PYTHON=ipython ./bin/pyspark
...
In [1]: file = sc.textFile('README.md')
In [2]: file.first()
...
14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded
14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1
14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at PythonRDD.scala:334
14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at PythonRDD.scala:334) 
with 1 output partitions (allowLocal=true)
14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at 
PythonRDD.scala:334)
14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List()
14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List()
14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD at 
PythonRDD.scala:44), which has no missing parents
14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with 
curMem=57388, maxMem=278019440
14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 4.4 KB, free 265.1 MB)
14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
(PythonRDD[2] at RDD at PythonRDD.scala:44)
14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, PROCESS_LOCAL, 1207 bytes)
14/10/03 08:50:13 INFO Executor: Running 

[jira] [Commented] (SPARK-3773) Sphinx build warnings

2014-10-02 Thread cocoatomo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157590#comment-14157590
 ] 

cocoatomo commented on SPARK-3773:
--

Using Sphinx to generate API docs for PySpark

 Sphinx build warnings
 -

 Key: SPARK-3773
 URL: https://issues.apache.org/jira/browse/SPARK-3773
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0, 
 Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, 
 docutils==0.12, numpy==1.9.0
Reporter: cocoatomo
Priority: Minor
  Labels: docs, docstrings, pyspark

 When building the Sphinx documentation for PySpark, we get 12 warnings.
 Most of them are caused by docstrings written in broken ReST format.
 To reproduce this issue, run the following commands at commit 
 6e27cb630de69fa5acb510b4e2f6b980742b1957.
 {quote}
 $ cd ./python/docs
 $ make clean html
 ...
 /Users/user/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of 
 pyspark.SparkContext.sequenceFile:4: ERROR: Unexpected indentation.
 /Users/user/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of 
 pyspark.RDD.saveAsSequenceFile:4: ERROR: Unexpected indentation.
 /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring
  of pyspark.mllib.classification.LogisticRegressionWithSGD.train:14: ERROR: 
 Unexpected indentation.
 /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring
  of pyspark.mllib.classification.LogisticRegressionWithSGD.train:16: WARNING: 
 Definition list ends without a blank line; unexpected unindent.
 /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring
  of pyspark.mllib.classification.LogisticRegressionWithSGD.train:17: WARNING: 
 Block quote ends without a blank line; unexpected unindent.
 /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring
  of pyspark.mllib.classification.SVMWithSGD.train:14: ERROR: Unexpected 
 indentation.
 /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring
  of pyspark.mllib.classification.SVMWithSGD.train:16: WARNING: Definition 
 list ends without a blank line; unexpected unindent.
 /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring
  of pyspark.mllib.classification.SVMWithSGD.train:17: WARNING: Block quote 
 ends without a blank line; unexpected unindent.
 /Users/user/MyRepos/Scala/spark/python/docs/pyspark.mllib.rst:50: WARNING: 
 missing attribute mentioned in :members: or __all__: module 
 pyspark.mllib.regression, attribute 
 RidgeRegressionModelLinearRegressionWithSGD
 /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/tree.py:docstring of 
 pyspark.mllib.tree.DecisionTreeModel.predict:3: ERROR: Unexpected indentation.
 ...
 checking consistency... 
 /Users/user/MyRepos/Scala/spark/python/docs/modules.rst:: WARNING: document 
 isn't included in any toctree
 ...
 copying static files... WARNING: html_static_path entry 
 u'/Users/user/MyRepos/Scala/spark/python/docs/_static' does not exist
 ...
 build succeeded, 12 warnings.
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3771) AppendingParquetOutputFormat should use reflection to prevent from breaking binary-compatibility.

2014-10-02 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-3771:
-
Summary: AppendingParquetOutputFormat should use reflection to prevent from 
breaking binary-compatibility.  (was: AppendingParquetOutputFormat should use 
reflection to prevent breaking binary-compatibility.)

 AppendingParquetOutputFormat should use reflection to prevent from breaking 
 binary-compatibility.
 -

 Key: SPARK-3771
 URL: https://issues.apache.org/jira/browse/SPARK-3771
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Takuya Ueshin

 Original problem is 
 [SPARK-3764|https://issues.apache.org/jira/browse/SPARK-3764].
 {{AppendingParquetOutputFormat}} uses a binary-incompatible method 
 {{context.getTaskAttemptID}}.
 This makes Spark itself binary-incompatible, i.e. if Spark is built against 
 hadoop-1, the resulting artifact works only with hadoop-1, and vice versa.
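
A minimal sketch of the reflection-based workaround described above (the wrapper object and method name here are made up; only getTaskAttemptID comes from the issue): resolve the method on the runtime class, so the compiled Spark bytecode carries no direct reference to a signature that differs between the hadoop-1 and hadoop-2 APIs.

{code}
import org.apache.hadoop.mapreduce.{TaskAttemptContext, TaskAttemptID}

// Sketch: call getTaskAttemptID reflectively rather than directly, so the same
// bytecode links cleanly regardless of which Hadoop line Spark was compiled
// against.
object TaskContextCompat {
  def getTaskAttemptId(context: TaskAttemptContext): TaskAttemptID =
    context.getClass
      .getMethod("getTaskAttemptID")
      .invoke(context)
      .asInstanceOf[TaskAttemptID]
}
{code}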



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3774) typo comment in bin/utils.sh

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157593#comment-14157593
 ] 

Apache Spark commented on SPARK-3774:
-

User 'tsudukim' has created a pull request for this issue:
https://github.com/apache/spark/pull/2639

 typo comment in bin/utils.sh
 

 Key: SPARK-3774
 URL: https://issues.apache.org/jira/browse/SPARK-3774
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Shell
Affects Versions: 1.1.0
Reporter: Masayoshi TSUZUKI
Priority: Trivial

 typo comment in bin/utils.sh
 {code}
 # Gather all all spark-submit options into SUBMISSION_OPTS
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number

2014-10-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157602#comment-14157602
 ] 

Josh Rosen commented on SPARK-3772:
---

Ah, I see the problem:

PythonWorkerFactory also passes the -u flag when creating Python workers and 
daemons, which doesn't work in IPython.  I noticed one of the uses of -u when 
reviewing your PR, but missed these uses:

https://github.com/apache/spark/blob/42d5077fd3f2c37d1cd23f4c81aa89286a74cb40/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala#L111

https://github.com/apache/spark/blob/42d5077fd3f2c37d1cd23f4c81aa89286a74cb40/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala#L152

I guess we need to apply the same PYTHONUNBUFFERED fix here, too.  Sorry for 
overlooking this.
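
For reference, a minimal sketch of that kind of change, assuming the daemon is launched through a ProcessBuilder roughly as in the lines linked above (the object, method and module argument here are illustrative, not the actual PythonWorkerFactory code): drop the interpreter's -u flag and request unbuffered output through the child's environment instead.

{code}
import java.util.Arrays

// Sketch: start a Python child process without the "-u" flag and rely on
// PYTHONUNBUFFERED for unbuffered output, so the command line no longer uses
// an interpreter flag that IPython does not accept.
object PythonProcessLauncher {
  def startUnbuffered(pythonExec: String, module: String): Process = {
    val pb = new ProcessBuilder(Arrays.asList(pythonExec, "-m", module))
    pb.environment().put("PYTHONUNBUFFERED", "YES")
    pb.start()
  }
}
{code}

Whatever form the fix takes, both of the call sites linked above would need the same treatment.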

It's strange that PythonWorkerFactory didn't report this error in a more 
graceful way, though.  I think that IPython printed an error message before our 
code had a chance to run, and in startDaemon() we expected to read an integer 
from stdout but instead received text.  There are probably less brittle 
mechanisms for communicating the daemon process's port to its parent 
(SPARK-2313 is an issue that partially addresses this).
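
To make that failure mode concrete, here is a rough sketch of the handshake as described; the exact wire format is an assumption here, but it is consistent with the reported value: four '=' characters (0x3D3D3D3D) decode to exactly 1027423549.

{code}
import java.io.DataInputStream

// Sketch: read the port the daemon is expected to print on stdout as a raw
// 4-byte integer. If the interpreter prints a banner or error message first,
// those bytes are interpreted as the "port" instead, yielding nonsense values
// such as 1027423549 ("====").
object DaemonPortReader {
  def readPort(daemon: Process): Int =
    new DataInputStream(daemon.getInputStream).readInt()
}
{code}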

I can fix this and open a PR.  If you'd like to do it yourself, just let me 
know and I'd be glad to review it.

 RDD operation on IPython REPL failed with an illegal port number
 

 Key: SPARK-3772
 URL: https://issues.apache.org/jira/browse/SPARK-3772
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0
Reporter: cocoatomo
  Labels: pyspark

 To reproduce this issue, execute the following commands at commit 
 6e27cb630de69fa5acb510b4e2f6b980742b1957.
 {quote}
 $ PYSPARK_PYTHON=ipython ./bin/pyspark
 ...
 In [1]: file = sc.textFile('README.md')
 In [2]: file.first()
 ...
 14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded
 14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1
 14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at 
 PythonRDD.scala:334
 14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at 
 PythonRDD.scala:334) with 1 output partitions (allowLocal=true)
 14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at 
 PythonRDD.scala:334)
 14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List()
 14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List()
 14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD 
 at PythonRDD.scala:44), which has no missing parents
 14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with 
 curMem=57388, maxMem=278019440
 14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in 
 memory (estimated size 4.4 KB, free 265.1 MB)
 14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
 (PythonRDD[2] at RDD at PythonRDD.scala:44)
 14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
 localhost, PROCESS_LOCAL, 1207 bytes)
 14/10/03 08:50:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
 14/10/03 08:50:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
 java.lang.IllegalArgumentException: port out of range:1027423549
   at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
   at java.net.InetSocketAddress.<init>(InetSocketAddress.java:188)
   at java.net.Socket.<init>(Socket.java:244)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:100)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:71)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 

[jira] [Commented] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number

2014-10-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157604#comment-14157604
 ] 

Josh Rosen commented on SPARK-3772:
---

The reason that we never hit this before is that setting IPYTHON=1 caused Spark 
to only use IPython on the master; the workers and daemons still launched 
through regular `python`.  The old behavior might actually be preferable from a 
performance standpoint, since `ipython` might take longer to start up (this is 
less of an issue nowadays thanks to the worker re-use patch).

 RDD operation on IPython REPL failed with an illegal port number
 

 Key: SPARK-3772
 URL: https://issues.apache.org/jira/browse/SPARK-3772
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0
Reporter: cocoatomo
  Labels: pyspark

 To reproduce this issue, execute the following commands at commit 
 6e27cb630de69fa5acb510b4e2f6b980742b1957.
 {quote}
 $ PYSPARK_PYTHON=ipython ./bin/pyspark
 ...
 In [1]: file = sc.textFile('README.md')
 In [2]: file.first()
 ...
 14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded
 14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1
 14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at 
 PythonRDD.scala:334
 14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at 
 PythonRDD.scala:334) with 1 output partitions (allowLocal=true)
 14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at 
 PythonRDD.scala:334)
 14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List()
 14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List()
 14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD 
 at PythonRDD.scala:44), which has no missing parents
 14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with 
 curMem=57388, maxMem=278019440
 14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in 
 memory (estimated size 4.4 KB, free 265.1 MB)
 14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
 (PythonRDD[2] at RDD at PythonRDD.scala:44)
 14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
 localhost, PROCESS_LOCAL, 1207 bytes)
 14/10/03 08:50:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
 14/10/03 08:50:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
 java.lang.IllegalArgumentException: port out of range:1027423549
   at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
   at java.net.InetSocketAddress.<init>(InetSocketAddress.java:188)
   at java.net.Socket.<init>(Socket.java:244)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:100)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:71)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:744)
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org